lukev 8 days ago

Longform creative writing is possibly the least valid or interesting use case for LLMs.

Why should I be bothered to read what nobody could be bothered to write?

The point of writing is communication, and creative writing is specifically about the human experience… something an LLM can mimic but never speak to authoritatively.

  • HappyPanacea 8 days ago

    > Why should I be bothered to read what nobody could be bothered to write?

    There is some long-abandoned fanfiction that nobody would bother to continue, and which I would have liked to see continued at sufficient quality.

  • VMG 8 days ago

    just because it's a non-human artifact doesn't necessarily mean it's boring

    for instance, people like Minecraft-generated worlds even though no human has crafted them

    • card_zero 8 days ago

      I never played much Minecraft but I assume that's some procedural generation where a human crafted the procedure.

      (Am I being downvoted for not playing enough Minecraft? I apologise.)

      • dodslaser 8 days ago

        Well, LLMs aren't just randomly spitting out garbled text either. Humans create and curate the training data, humans design the models, and humans train them.

      • VMG 8 days ago

        sure, but humans have also crafted LLMs

        • card_zero 8 days ago

          Yes, but then again, completely no. I mean, a procedural generation algorithm for a game world is an artwork tuned to produce results the artist likes, usually with care to allow some cases and exclude others and meet a particular vision. It's not very different from a pre-made world, except using the RNG as one of its building blocks.

  • bambax 8 days ago

    > Why should I be bothered to read what nobody could be bothered to write?

    I tend to agree but, for the sake of argument: what if you don't know?

    It seems most people who say they can tell, actually can't.

    • abenga 8 days ago

      You (the ostensible writer or publisher) would have to lie forever for me to continue reading your content in this case. The moment it comes out that any of the writing from an author/outlet is AI-generated, I will blacklist them from my reading list forever. The point of reading (to me) is to commune with fellow humans and learn from their lived experience; it's not just to consume prose.

      • card_zero 8 days ago

        Seems like you'd still be doing that in a distant way, since it all began as human prose at the training stage.

        • HappyPanacea 8 days ago

          Mm, regurgitated human prose-slop sounds delicious to me

          • card_zero 8 days ago

            It's not the most enticing thing. If the prompt is nicely crafted, you get more intentional results, and maybe that's interesting. Seems a tough sell. I suppose there are authors like Simenon and Pratchett who would churn out books at the rate of at least one a year, and fans might wish for increased output, so maybe there's a niche for a slop-assisted author.

  • dwringer 8 days ago

    It's just one pretty trivial example, but many months ago I had to develop some backstory for my character in a D&D campaign. I had some character background and some paragraphs I wrote to start off with, but I wasn't sure what direction to go exactly and was trying to come up with different ideas. I dropped what I had into a prompt built to set up a pseudo-text-adventure game using a local WizardLM model running through KoboldCpp, then ran through it interactively, up to the point where it got incoherent, about 100 times in a row. This yielded some new ideas, some old ones, and overall an interesting distribution of outcomes: 2 or 3 came up most commonly, and all 100 were basically variations on about 10.

    I won't argue that any of what the LLM came up with would stand on its own as particularly interesting - often quite the opposite - but it showed me examples of all the "obvious directions to go" as well as some other RPG cliches that I could either adopt or choose to avoid, and ultimately served as an excellent brainstorming assistant with interesting ideas and the ability to carry through and embellish them with enough detail to get a sense for either what works or what doesn't.

    • kristianp 8 days ago

      What is the process you went through interactively? Regenerate until something interesting comes out and then partially edit it? I'm curious about what you found works.

      • dwringer 7 days ago

        The KoboldCpp UI I was using had a pretty straightforward way of setting up a chatbot-style dialogue interface that would send the dialogue context so far with each message input, along with some supporting prompt infrastructure reminding the model of how it's supposed to reply based on that context. It also allowed editing the context directly; but I didn't often do that except to correct simple one-off inaccuracies, instead usually opting to restart from the beginning when things would go off-track.

        When the model would reply as multiple characters at once, or interleave narration with the dialogue, I'd usually just play along - most commonly it would decide for 2 or 3 steps to carry on the dialogue just hallucinating the things I would say.

        I did a lot of experimenting with different ways to frame the dialogue, from a Zork-like adventure game, to an instant messenger chat, to just paragraphs of prose with dialogue mixed in as if from a novel. I found all of these methods to have different strengths or weaknesses, but the model also tended to blur between them after a few messages so I'd just play along as far as I could each time.

  • Majromax 8 days ago

    > Why should I be bothered to read what nobody could be bothered to write?

    In an art-criticism sense I broadly agree with you, but I think you go too far from a reader-experience point of view.

    > The point of writing is communication and creative writing is specifically about the human experience

    That's a point of writing, but the point of reading is only sometimes about communication. It's also about entertainment, enrichment, or expressing half-formed thoughts and feelings.

    Take a step back from the autonomously-written novel and imagine something a bit more collaborative. Many players of open-world-ish games develop an emergent story; what if that story could be semi-automatically novelized to document the unique narrative of the playthrough?

    SimCity 2000 had "mad-lib" newspaper articles that commented on the city's status; consider ones written by an LLM with full knowledge of the game's state and context.

    > something a LLM can mimic but never speak to authoritatively.

    Supposing a reader can't tell the difference between an average human-written novel and an average LLM-written novel, where does this authority lie?

  • Gracana 8 days ago

    Why limit yourself only to experiences curated by humans? You wouldn't do that outside of reading; it'd be an impossible limitation to impose. The only reason reading has been that way is because only humans could write. Now machines can write, and I can tinker and play with reading the same way I would with anything else.

    • debesyla 8 days ago

      Wait, I wouldn't want to get recipe or furniture suggestions from a human? It would be impossible? I can see how it could be a burden, but can't see how it would be impossible... (Or am I misunderstanding?)

      • Gracana 8 days ago

        I don't know what your question means.

        OP said "why should I be bothered to read what nobody could be bothered to write," which I think is silly, because I enjoy all kinds of things that nobody bothered to create. Things that exist naturally, randomly, or arise through rules or physical laws. Waves on the beach, a fractal, a game of sudoku, a campfire at night.

        Until recently, humans were the only source of writing. If you wanted to read something, it had to be something written by a human. Now you can read things that weren't written by a human, just like you can watch or smell or feel things not created by humans. I think that's neat. I don't think it devalues writing, at least not inherently.

  • jtbayly 8 days ago

    Why should you bother to look at an image nobody could be bothered to create?

    Why should you watch a movie nobody could be bothered to film?

    Why should you bother to run a program nobody could be bothered to write?

    My gut response was to agree with you, but it seems like the most obvious answer is entertainment, with plenty of other answers along for the ride.

    • diggan 8 days ago

      I guess I agree with your overall point, but I see things like music/tv/movies/images differently from programs.

      The purpose of a program (for me) is to solve a particular problem, or help me in some other way. They're usually categorized as "Works well enough" and "Not for me", and it's easy to see, without using them, which category I'll file them into.

      But media is different. They're not supposed to "solve" anything, just supposed to make me feel something for a duration of time, and after that they're "consumed" until I forget about them, or until I engage with them again. I usually don't know how I feel about the thing until after I've consumed it.

      Personally, I've found LLMs to greatly help with creating small utility programs for myself, which I've been doing since I started programming. Now I spend maybe 20-30 minutes (mostly refactoring stuff by hand, which takes time) before the utility is helpful, vs the many hours it used to take to put together the same thing.

      Media that is 100% created with ML tooling tends to be very different in quality from human-made media. I'm not 100% convinced it's because of the tooling itself, so much as that the people using the tooling don't have enough prior experience creating media to know what to create, or what it should be.

  • sam-paech 8 days ago

    Personally what I find interesting is getting insight into the trajectory of model abilities over time. Over the time I've been running these benchmarks, the writing has gone from pure slop, to broadly competent (at short form at least) and occasionally compelling.

    I don't think it will be much longer until they're generating content you will actually want to read.

    Meanwhile, a lot of people are finding LLMs useful for partner-writing or lower-stakes prose or roleplay.

  • hawk_ 8 days ago

    Yes, I will use a CliffsNotes LLM for my LLM literature class.

  • lukev 8 days ago

    As a follow-up, this is true not only at the level of individual experience but at the societal level.

    "AI" is supposed to make our lives better. Proponents want it to replace soul-crushing, boring clerical work, to free us up for things more meaningful to us, which for many people is to create and consume art.

    Why do tech people keep having the impulse to replace the best parts of being human?

    Are you really telling me that the vast corpus of human-generated creative writing isn't enough for you, and you have an itch to read something that can only be written by an AI? That seems crazy to me.

    ...or are you just a publisher, and wish you could make more money without needing to pay those pesky human authors? And in that case, I won't try to argue with you -- just give you a heartfelt middle finger.

    • blargey 8 days ago

      I think some (a lot?) of people perceive “draw the rest of the fucking owl” as “soul-crushing, boring clerical work”, and actively seek out AI vibe-drawing as an alternative (either ignorant or uncaring of the limitations thereof). It’s not specific to tech people, really.

  • baq 8 days ago

    That's a narrow subset you've got there. There are books about experiences of dwarves, alien insectoids and superintelligent AIs, too.

    • cdblades 8 days ago

      That were written by people. Non-human characters in stories still exist to illustrate something about people.

  • bglazer 8 days ago

    > “Why should I be bothered to read what nobody could be bothered to write?”

    Well said

arthurofbabylon 8 days ago

All of these benchmarks have gotten out of hand, which is highly suggestive. Benchmarks exist as an indicator of quality and proliferate when other indicators of quality fail. Their very prominence implies that observers are having a difficult time assessing LLM performance in context, which hints at limited utility or, more precisely, a non-closed feedback loop at the level of utility. (You know a burger tastes really good when you eat it, no benchmarks required.)

Perhaps LLM development really does exist at this rarefied abstract level whereby the development team cannot be immersed in application context, but I doubt that notion. More likely, the performance observed in context is so dispiriting, difficult, or nonexistent that teams return again and again to the more generously validating benchmarks.

  • Majromax 8 days ago

    > You know a burger tastes really good when you eat it, no benchmarks required.

    I'd say this is a good example of the opposite, where the problem is finding the quantification of an ultimately subjective experience. Take three restaurant reviewers to a burger joint and you might end up with four different opinions.

    Benchmarks proliferate because many LLM domains defy easy, quantitative measurement, yet LLM development and deployment are so expensive that they need to be guided by independent and quantitative (even if not fully objective) measures.

successful23 8 days ago

I’ve noticed the same contrast - technical writing from LLMs often needs trimming for clarity, but creative writing can lean too far into either bland or overly flowery language.

Most LLM benchmarks lean heavily on fluency, but things like internal logic, tone consistency, and narrative pacing are harder to quantify. I think using a second model to extract logical or structural assertions could be a smart direction. It’s not perfect, but it shifts focus from just “how it sounds” to “does it actually make sense over time.” Creative writing benchmarks still feel very early-stage.

  • miki-makh 8 days ago

    I’ve also noticed that with longer-form text, the amount of meaningful information seems to plateau — it doesn’t scale proportionally with the character count.

    • roenxi 8 days ago

      It probably depends a lot on how the system is prompted. One of the interesting things about generative images is how easy it is to know what something looks like without being able to describe it.

      Longform text is likely similar: there are a bunch of interactions and scenes that humans pick up on, if they are there, without being able to describe them. The early Game of Thrones series was a fascinating example of good writing because most of the terrible things that happened to people were a neat result of their own choices (it had a consistent flawed character -> bad choice -> terrible consequence style that repeated over and over) - but I don't think most people would pick up on that without it being explicitly pointed out. And when that started to go away, people could tell the writing was falling off but couldn't easily pick out why.

      A hypothetical LLM could be prompted with something like that ("your writing is boring, please make consequences follow from choices") but it is less clear that the average prompter would be able to figure out that was what was missing. Like how image generators often needed to be prompted with "avoid making mistakes" to get a much higher quality of image; it took a bit to realise that was an option.

    • Gracana 8 days ago

      That's my experience as well. If you feed it your summary + outline + guidance and prompt for a one-shot output, it'll rush through it. If you prompt it for longer length, it'll extend it for little benefit. To get good output, you have to work in chunks, like a paragraph or a scene at a time, adjusting your prompt as you work through the outline.

      That said, the resulting quality usually isn't so great that I want to put in the effort to do that, so I tend to interact with it in more of a choose-your-own-adventure way.

  • weird-eye-issue 8 days ago

    "don't use flowery or over exaggerated language" works well in my experience

  • thaumasiotes 8 days ago

    > but creative writing can lean too far into either bland or overly flowery language

    It's a style that can work. Patricia A. McKillip's fantasy novels are so flowery that I have difficulty telling what's going on.

    I've never read one of her science fiction novels, but I find it hard to imagine they're written similarly.

  • Paracompact 8 days ago

    It's nice that for coding and math problems, Claude's internal monologue involves constantly analyzing and critiquing its own implementations. It doesn't seem as keen on critical self-analysis of its creative outputs.

torginus 8 days ago

When trying to use LLMs for creative writing, I've found they really suck at sequencing and theory of mind; that is, they often make errors where they reference events that have yet to occur (according to the prompt), or have characters know things that are true but that they have no way of knowing. They are also terrible at writing scenes with mind games or deception going on.

From my experience, this occurs on all LLMs and with a high enough frequency that editing their outputs is much more tedious than writing the damn thing myself.

baq 8 days ago

Is there a score for internal consistency? I dunno, maybe have another LLM extract structure into some kind of a logic language and score how many assertions are incompatible? Or to a probabilistic language and do some Bayesian analysis?

  • 112233 8 days ago

    And where will that other LLM get the capability to judge consistency? Had fun recently prompting different models with "a story about everyday life of <some stereotypical character>".

    Absolutely no awareness about anything spatial - what is in which place, who is where, body positions.

    Like, my favorite, person coming home and removing socks to reveal fluffy slippers. Will the consistency checker catch that.

    (edit) Just asked GPT-4o to make a photo of hands tying two ropes (my favorite way to see how bad gen-AI actually is) - yup, it has an odd number of ropes coming out of the knot. That is supposedly state of the art.

    • thaumasiotes 8 days ago

      > Like, my favorite, person coming home and removing socks to reveal fluffy slippers. Will the consistency checker catch that.

      I could see this working as comedy, though it'd be better on film than in print. Either way you'd have to establish a very specific tone in the surrounding material.

      • 112233 8 days ago

        This comedy is a good comedy while I am goofing around with "make a story". However, this wordpuker tech is positioned as solving strategic problems, writing software and (something equally serious). It has no mechanism to build a world model "in its head" while it generates. chess pieces teleport on the board. people hug through doors. any exercise is body horror.

        do you think this tech deals with invariants in software algorithms any better?

        I mean, haha but not funny.

    • baq 8 days ago

      > And where will that other LLM get the capability to judge consistency? Had fun recently prompting different models with "a story about everyday life of <some stereotypical character>".

      > Absolutely no awareness about anything spatial - what is in which place, who is where, body positions.

      These are some things I'd like to at least attempt to quantify - LLMs are better at translation than thinking[0], so the idea was to translate the slop into something that can then be analyzed by a non-LLM tool. E.g. translate the story into a Prolog fact database and generate a couple of queries for each paragraph/chapter/combination of these. Rough idea, just something I haven't seen done.

      [0] don't @ me
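
      For the non-LLM half of that, a toy sketch of what the checker could look like, assuming the extraction step (whichever model does it) has already emitted simple (chapter, subject, predicate, value) facts; the fact names and the "exclusive" predicates below are made up for illustration:

        # consistency_check.py - toy checker over extracted story "facts"
        from collections import defaultdict

        # Hypothetical output of an extraction pass over each chapter:
        # (chapter, subject, predicate, value)
        facts = [
            (1, "alice", "location", "kitchen"),
            (1, "alice", "footwear", "socks"),
            (1, "alice", "footwear", "fluffy slippers"),  # the socks-to-slippers reveal
            (2, "alice", "location", "garden"),
        ]

        # Predicates that should hold only one value per subject within a chapter.
        EXCLUSIVE = {"location", "footwear"}

        def find_contradictions(facts):
            seen = defaultdict(set)
            issues = []
            for chapter, subject, predicate, value in facts:
                if predicate not in EXCLUSIVE:
                    continue
                key = (chapter, subject, predicate)
                if seen[key] and value not in seen[key]:
                    issues.append((key, seen[key] | {value}))
                seen[key].add(value)
            return issues

        for (chapter, subject, predicate), values in find_contradictions(facts):
            print(f"ch{chapter}: {subject} has conflicting {predicate}: {sorted(values)}")

      A Prolog fact base plus generated queries would be the fancier version of the same idea; the hard part is still getting the extraction model to emit facts that are themselves faithful to the text.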

  • mentalgear 8 days ago

    I was thinking the same about consistency, which is useful in many contexts, but if you let another LLM do the extraction, you are again at the mercy of that LLM's quality/hallucinations.

    Of course, mitigation strategies would be using many different LLMs and comparing results (voting), or using a highly trained/specialised model for entity/context extraction only. An interesting benchmark would be when those extraction techniques/models exceed what a human professional is able to do.

    • baq 8 days ago

      > you are again at the mercy of that LLMs quality/hallucinations.

      I guess I'm fine with a tall error bar - better than nothing.

      > An interesting benchmark would be when those extraction techniques/models would exceed what a human professional is able to do.

      Slashdot-style +1 Funny, -1 Inconsistent for RLHF maybe...?

    • Tostino 8 days ago

      Another strategy I don't see enough is just resampling and showing the human more 'final answers', when they are easy and quick to check for correctness.

  • sam-paech 8 days ago

    Not internal consistency exactly, but there are criteria checking how well the chapter plan was followed (which is all the way up at the top of the context window).

    This is done per chapter, and the score trendline is what you see in the "degradation" column.

    I would say this could be a reasonable proxy for internal consistency since it is measuring more or less the same ability, i.e. how well it's keeping track of details as context window increases.

  • kristianp 8 days ago

    There was a recent post on here about LLMs making mermaid diagrams for character relationships. That could be analysed by another model for evals.

andy99 8 days ago

We did a similar (less rigorous) evaluation internally last year, and one of the big things we identified was "purple prose":

  In literary criticism, purple prose is overly ornate prose text that may disrupt a narrative flow by drawing undesirable attention to its own extravagant style of writing, thereby diminishing the appreciation of the prose overall.[1] Purple prose is characterized by the excessive use of adjectives, adverbs, and metaphors. When it is limited to certain passages, they may be termed purple patches or purple passages, standing out from the rest of the work. (Wikipedia)

The Slop Score they have gets at this, which is good, but I wonder how completely it captures it.

Also curious about how realistic this benchmark is against real "creative writing" tools. The more the writing is left up to the LLM, the better the benchmark likely reflects real performance. For tools that have a human in the loop or a more structured approach, it's hard to know how well the benchmarks match real output, beyond just knowing that better models will do better, e.g. Claude 3.7 would beat Llama 2.

  • beezlebroxxxxxx 8 days ago

    One person's purple prose is often another's style. It's not really a measure of something so much as an assertion of taste. You see a similar thing when someone says a writer uses "too much description and not enough action!" without realizing that, sometimes, the action is in the description. It's a style of writing with a different reader in mind, a different understanding of how word choice and prose function.

isaacfrond 8 days ago

Some of the Gemini stuff is almost at airport-novel level. I'm surprised. Everything is going so fast.

The odd thing is that with technical stuff, I'm continually rewriting the LLM's output to be clearer and less verbose, while the fiction is almost the opposite--not literary enough.

jjani 8 days ago

Would be interesting if they'd add another one on non-fiction creative writing. For example, turning a set of investigative notes and findings into a Pulitzer-prize winning article that wouldn't be out of place in a renowned, high-quality newspaper.

IME, for LLMs (just like humans) this skill doesn't necessarily correlate with fiction writing prowess.

This is probably harder to judge automatically (i.e. using LLMs) though, maybe that's why they haven't done it.

  • anshumankmr 8 days ago

    >This is probably harder to judge automatically (i.e. using LLMs) though, maybe that's why they haven't done it.

    Absolutely. Firstly, it is entirely possible that, if there are investigative notes and findings available about a real-life event, the model might very well have been trained on the actual article. If the model is trained on it, it might just replicate it.

    Plus, this might expose how some LLMs can still cook up stuff even when given facts to rely on. Some of these are more notorious than others.

    Like Perplexity a year plus ago did that quite a bit for me, anecdotally speaking. It has become a lot better though.

    Then again, even writing what some might consider Pulitzer Prize-winning material is a subjective task.

    • jjani 8 days ago

      In my mind, I was thinking of giving a set of fictional findings, not realizing that would technically make it... fiction!

      I think it's fine to have fictional notes. It's still a very different task than e.g. writing a fantasy novel, which these benchmarks roughly are about. Instead, the task would be to turn a given set of facts on a real-world topic into a high-quality, serious article.

      > Then even writing what some might consider Pulitzer prize winning is a subjective task.

      This applies to these benchmarks for "short fantasy story" tasks all the same.

impossiblefork 8 days ago

I'm not sure how I feel about the details of the benchmark, but I think this is an important direction in which it would be nice if LLMs were improved.

At present they don't really understand either stories or even short character interactions. Microsoft Copilot can generate dialogue where characters who have never met before are suddenly addressing each other by name, so there's great room for improvement.

miltonlost 8 days ago

These benchmarks are asinine for determining the quality of "creative writing", all chosen by engineers without a single artistic bone.

Length

Slop Score

Repetition Metric

Degradation

Moby-Dick has chapters as short as a page and as long as 20. According to this benchmark, the book would score lower because of the average length of its chapters.

These aren't benchmarks of "quality". A chapter's length is not indicative of a work's quality. That measurement, on its own, is enough to discredit the rest of the benchmark. So so so so so so so misguided.

  • card_zero 8 days ago

    Moby Dick was kind of an aggressive assault on the reader. Which is to say I agree, this is a stereotyped idea of what writing should be. I'm inclined to think that it shouldn't be Moby Dick ever again, but it's kind of important that that book happened.

  • sam-paech 8 days ago

    None of those factors go into the scoring fwiw. They are just informational.

    The scoring is done to a rubric, like a teacher would grade an essay, on various criteria for good & bad writing.

    • miltonlost 8 days ago

      None of those factors go into the scoring? What's the point of showing them then? What information are these factors providing if they aren't being used in the rubric? Without knowing the rubric (which isn't provided), these scores are baseless to me, and I don't know how to use them. Blackbox numbers are not useful.

      • sam-paech 8 days ago

        All the judge outputs (including rubric) and model outputs are in the samples reports.

        Sorry you don't like the displayed metrics. I find them very useful / revealing of the things I'm trying to measure with this benchmark.

      • Gracana 8 days ago

        These are important things to look at because they're where LLMs typically have trouble. People complain that their stories are too short, that they use stock phrases and purple prose, that they start to repeat themselves at a certain point, and that quality tends to fall off. Doing well in these problem areas doesn't guarantee good writing, but it does allow for it.

informal007 8 days ago

> Outputs are evaluated with a scoring rubric by Claude Sonnet 3.7.

Will the Claude Sonnet 3.7 judge favor itself?

  • moffkalast 8 days ago

    There was a study a while back comparing how much each model likes every other model's writing, and Sonnet overwhelmingly hated itself for some reason.

    • dimitri-vs 8 days ago

      Is this what you are talking about? https://www.reddit.com/r/LocalLLaMA/comments/1j1nen4/llms_li...

      The prompt response they are judging is "Write one concise paragraph about the company that created you" which is kind of an odd choice.

      Sonnet 3.7 hates its own meta-analysis, but loves GPT-4o. The reason behind that is that Claude 3.7 Sonnet consistently replies (to the prompt) that it was created by OpenAI, but then catches itself (when judging) as being wrong about that.

      My takeaway was that GPT-4o gave very safe/lukewarm scores that were strongly correlated with Sonnet's scores. So if you are judging anything using LLMs, then taking the average of Anthropic + Gemini or OpenAI + Gemini might be the best approach.

      • moffkalast 8 days ago

        Ah yeah that's the one, missed that bit entirely. I guess they should've honestly asked them to write an article about cats or something unrelated that wouldn't hit so close to home.

boredhedgehog 8 days ago

I've noticed people will routinely prompt for a specific style when generating visual art, but not when generating text. Wouldn't it be better for consistency to add something like "in the style of Hemingway" to an experiment like this?

kristianp 8 days ago

For the cosy sci-fi (1) example, I found it introduced plot points too quickly, making the short passage really dense. The model eval said:

> The chapter also introduces several elements quickly (the stranger, the Syndicate, the experiments) which, while creating intrigue, risks feeling slightly rushed.

But there is no score for pacing.

https://eqbench.com/results/creative-writing-v3/deepseek-ai_...

prawn 8 days ago

Slightly related, though I'm yet to try anything to this level:

Turn Cursor Into a Novel-Writing Beast - https://www.reddit.com/r/cursor/comments/1jl0rqu/turn_cursor...

I have tried to have ChatGPT brainstorm story elements for something else, and its suggestions so far have been very lame. Even its responses to direction/criticism are off-putting and fawning.

  • Paracompact 8 days ago

    > Even its responses to direction/criticism are off-putting and fawning.

    A good author must have an ego! A writer who takes criticism easily is no good writer in my book!

anilgulecha 8 days ago

I read Claude's "Sci-Fi First Contact — First Contact" entry. It's pretty good (and with some editing could be great - some of the ending seems slightly unearned). It has a Ted Chiang/Arrival vibe to it, and is a very good first contact story.

Most folks here are commenting without engaging with the content. We need a Turing test for creative writing. I'd definitely not have guessed this was LLM-written - it seems like an experienced hand wrote it.

spacebanana7 8 days ago

I get much better results with long form writing by including an existing base story in the prompt along with requests for extensive modifications.

matt-dev 8 days ago

haha, Mr. Thorne shows up again in the Gemini 2.5 samples.

I have played around with creating long-form fictional content with Gemini 2.5 over the last week, and I started adding "no one named 'Thorne'" to my prompts, otherwise it always creates a character named Mr. Thorne. I thought it was something in my prompts triggering this, but it seems to be a general problem.

However, despite the cliches and slop, Gemini 2.5 can actually write and edit long-form fiction pretty well; you can get almost-coherent 10-20 chapter books by first letting it create an outline and then iteratively writing and editing the chapters.

I also used Gemini 2.5 to help me code a tool to interactively and iteratively create longform content: https://github.com/pulpgen-dev/pulpgen
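
The core loop is roughly this (just a sketch of the workflow described above; generate() is a hypothetical stand-in for whatever model call you use, not pulpgen's actual API):

  # Sketch of the outline -> write -> edit -> summarize loop.
  def generate(prompt: str) -> str:
      # Placeholder: swap in a real model call (Gemini, a local model, whatever).
      return f"[model output for: {prompt[:60]}...]"

  premise = "A cozy mystery in a lighthouse town. No one named 'Thorne'."
  n_chapters = 12

  outline = generate(
      f"Write a {n_chapters}-chapter outline for this premise, "
      f"one short paragraph per chapter:\n{premise}"
  )

  book, summaries = [], []
  for i in range(1, n_chapters + 1):
      context = "\n".join(summaries)  # keep the context small: summaries, not full chapters
      draft = generate(
          f"Premise:\n{premise}\n\nOutline:\n{outline}\n\n"
          f"Story so far (chapter summaries):\n{context}\n\n"
          f"Write chapter {i} in full, following the outline."
      )
      # A separate editing pass per chapter helps with continuity and trims slop.
      edited = generate(f"Edit chapter {i} for continuity and style:\n{draft}")
      book.append(edited)
      summaries.append(f"Ch {i}: " + generate(f"Summarize in three sentences:\n{edited}"))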

  • becquerel 8 days ago

    You beat me to that idea, haha. I was making an aider for fiction, but your project looks way more useful than what I had.

jaggs 8 days ago

DeepSeek is pretty darn good. Much less flowery and more on point. At least in Sci-Fi.

anshumankmr 8 days ago

> Outputs are evaluated with a scoring rubric by Claude Sonnet 3.7

I feel it might be beneficial to evaluate with an ensemble of models, picking SOTA models because of the subjectivity of the task at hand.

leonewton253 8 days ago

"Crypto’s dead," Jay muttered. "Sneakers are forever."

Dang, DeepSeek is actually pretty good, compared to Gemini's version, which sounded like a schizophrenic on LSD.

sam-paech 8 days ago

Hey, I made this! Cool to see it show up on hackernews.

lukebuehler 8 days ago

Great that they added a Slop analysis. For example, Gemini uses "heart hammered ribs" a lot.

Der_Einzige 8 days ago

The same author created the anti-slop sampler, which is proof that LLMs can be trivially made extremely creative.

Samplers are being slept on by the community. Sam Paech is secretly one of the biggest geniuses in all of LLMs. It's time for the community to recognize this.