01 · Why This Research

Standing on the Shoulders of Failure

Before designing a narrative engine, we needed to understand the landscape of attempts that came before it. Not to survey the field academically, but to answer a very specific engineering question: what architectural decisions cause AI-generated narrative to fail, and which decisions prevent those failures?

The answer matters because the failure modes of AI narrative are not obvious. They do not look like broken code. They look like content that is technically coherent but experientially hollow · a broadcast that sounds like a person, a notice board that reads like a narrator, a villain whose voice changes between scenes, a story that crescendos every five minutes until the player goes numb. These failures are invisible to automated quality checks. They require knowing what good looks like, and then understanding why the system drifted away from it.

This research covers three layers: the existing systems that have attempted procedural narrative generation, the technical failure modes that appear across all of them, and the structural interventions that address each failure. Every design decision in the Narrative Engine can be traced back to something that went wrong here.

Three findings from the research literature frame this analysis:

  • Character inconsistency is the most cited complaint across AI roleplay platforms · above poor plot, slow pacing, or lack of challenge. (cuckoo.network, 2025)
  • A 2025 ACL survey on LLMs for story generation catalogued 5 distinct failure modes recurring across all reviewed systems: coherence collapse, character inconsistency, pacing failure, context blindness, and agency illusion. (ACL Anthology, 2025)
  • The SCORE framework achieves a 23.6% coherence improvement over unstructured generation by maintaining episode-level retrieval context across beat sequences. (arXiv, 2025)

02 · The First Failure Mode

Coherence Collapse

The earliest and most documented failure of AI narrative is coherence collapse: the progressive unraveling of story logic as a session extends in time. Systems like AI Dungeon made this failure famous. Users would begin a story with careful setup · a specific character, a specific world, a specific mission · and find that by the fifteenth exchange, the character had forgotten their own name, the world had shifted genre, and the mission had been replaced by something entirely unrelated. The story did not break. It dissolved.

The mechanism is architectural. An LLM is not a story-understander. It is a sequence-predictor. At any given moment, it generates the most probable continuation of the tokens in its context window. It has no internal model of the story, no record of what has been established, no understanding of what would constitute a contradiction. It has only the recent text. When earlier context · the villain's motivation, the location of the artifact, the player's stated goal · scrolls out of the context window, it is gone. The model does not know it has lost it. It simply generates forward, and the next continuation is plausible given recent tokens while being incoherent against the full arc.

This is not a capability failure. More powerful models make this problem worse in a specific way: they generate more convincingly wrong continuations. A small model produces obviously broken output. A large model produces smoothly wrong output that reads as intentional until you trace back three scenes and realize the character whose death was just mourned is still listed as alive two beats earlier.

The core insight: LLMs predict plausible continuations · they do not maintain story state. Coherence over long arcs requires explicit, externally maintained world state that is injected into every generation call. The model cannot be trusted to remember. The architecture must remember for it.

The 2025 SCORE framework (Story Coherence and Retrieval Enhancement) directly addresses this mechanism. It achieves a 23.6% coherence improvement over baseline GPT-class models by maintaining episode-level summaries and key item tracking in a retrieval layer · effectively building the external world state that LLMs cannot maintain internally. The same approach reduces hallucinations by 41.8% and achieves 89.7% emotional consistency. The improvement is entirely structural, not model-capability-dependent. The same model, with the same weights, produces dramatically more coherent output when given explicit state to reason against.

This is the foundational lesson. The Narrative Engine's World State object is not a convenience feature. It is the mechanism that makes extended coherent narrative possible. Every generation call · every arc concept, every beat choreography, every NPC dialogue · must receive the relevant slice of world state as explicit input. Without it, the model is generating in the dark.
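The discipline can be sketched in a few lines. This is a hypothetical illustration · `WorldState`, `relevant_slice`, and the prompt shape are assumptions for the sketch, not the engine's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Externally maintained story facts; the model never 'remembers' these."""
    facts: dict = field(default_factory=dict)  # e.g. {"villain.motive": "revenge"}

    def relevant_slice(self, tags: list) -> dict:
        # Select only facts under the requested tags, so each call gets
        # the slice it needs rather than the entire world state.
        return {k: v for k, v in self.facts.items()
                if any(k.startswith(t) for t in tags)}

def build_generation_prompt(task: str, state: WorldState, tags: list) -> str:
    # Every generation call receives explicit established facts; the model
    # is never trusted to recall what scrolled out of its context window.
    slice_ = state.relevant_slice(tags)
    facts = "\n".join(f"- {k}: {v}" for k, v in sorted(slice_.items()))
    return f"ESTABLISHED FACTS (do not contradict):\n{facts}\n\nTASK: {task}"
```

The point is structural: the model sees only established facts plus the task, and slice selection keeps the injected context small enough that attention does not dilute.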

03 · The Second Failure Mode

Context-Blind Generation: The Sloppy Problem

There is a second failure that is less discussed in the literature but is the central quality problem in our existing EV2090 system. Call it context-blind generation: the LLM produces content without being told how that content will be consumed. The result is content that is narratively accurate but tonally wrong · content that lands in the wrong register for its delivery channel.

The concrete failure looks like this. A bulletin board post is generated. The narrative facts are correct. The timing is right. The beat advances the story. But the text reads like a narrator summarizing events to the reader, not like a handwritten note stuck to a cork board by a frightened dock worker. A station broadcast is generated. The information is accurate. But it sounds like a character ranting rather than an institutional announcement. An NPC COMMS exchange is generated. The content is there. But the voice is flat · it could be anyone.

The bulletin board that reads like a narrator. The broadcast that sounds like a person. The NPC who speaks in complete sentences with perfect grammar. These are not small aesthetic failures. They are the difference between a story that works and a world that breathes.

This failure has a specific cause. When a prompt says "generate a bulletin board post about the cargo disruption," the model draws on its training distribution of what bulletin board posts look like. But its training data contains millions of examples of narrative text describing events · novels, scripts, articles · and a comparatively tiny sample of actual handwritten bulletin board notices. The statistical pull toward narrative register is stronger than the pull toward the correct channel register. Without explicit, detailed channel constraints in the prompt, the model defaults to the most common register it has seen for similar content.

Critically, this is not a model failure. It is a prompt architecture failure. The model is capable of generating a notice board post that reads exactly like a notice board post · but only if the prompt provides the delivery channel with sufficient specificity: who wrote it, in what emotional state, with what physical constraints (scrawled in marker? typed on a printer?), for which audience, with what convention of brevity or urgency. When that context is present, the output transforms. When it is absent, the model guesses, and its guess is wrong.

This is the architectural argument for the Narrative Engine's rendering layer separation. Sonnet decides what happens. Haiku is given that decision plus a rich rendering context · channel, voice, format, anti-patterns · and then generates the actual text. The two concerns are separated precisely because the rendering context requirements are completely orthogonal to the narrative content requirements. A beat that works as a broadcast needs totally different prompt scaffolding than the same beat rendered as an environmental document. Combining both concerns into one call degrades both.

Delivery channel is not a formatting rule. It is a fundamental constraint on voice, register, sentence structure, diction, assumed audience, and emotional posture. It must be specified explicitly in every render call · not implied, not hoped for.
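As a sketch of what "specified explicitly" might look like in practice · the field names, class, and prompt layout below are illustrative assumptions, not the engine's actual render API:

```python
from dataclasses import dataclass

@dataclass
class RenderContext:
    """Delivery-channel constraints for one render call (illustrative fields)."""
    channel: str          # e.g. "bulletin_board"
    author: str           # who wrote this in-fiction, in what state
    physical_form: str    # e.g. "scrawled in marker on a cork board"
    audience: str
    anti_patterns: list   # register rules the output must NEVER break

def build_render_prompt(beat_decision: str, ctx: RenderContext) -> str:
    # The WHAT (beat_decision) arrives already decided by the architect;
    # this call supplies only the HOW: channel, voice, format, anti-patterns.
    nevers = "\n".join(f"NEVER: {p}" for p in ctx.anti_patterns)
    return (
        f"CHANNEL: {ctx.channel}\n"
        f"IN-FICTION AUTHOR: {ctx.author}\n"
        f"PHYSICAL FORM: {ctx.physical_form}\n"
        f"AUDIENCE: {ctx.audience}\n"
        f"{nevers}\n\n"
        f"CONTENT TO CONVEY: {beat_decision}"
    )
```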

04 · The Third Failure Mode

The Agency vs. Structure Tradeoff

Every procedural narrative system must navigate the same fundamental tension: maximum player freedom produces incoherent stories, and maximum structural control produces experiences that feel authored rather than lived. The industry has arrived at this tension from both directions and found the same narrow sweet spot.

Pure unstructured generation · the AI Dungeon model · collapses under extended play for the coherence reasons already described. But there is an additional failure beyond coherence: without structural constraints, the AI loses the concept of narrative shape. It generates plausible continuations without any mechanism for rising action, without any concept of a climax that must be earned, without any understanding that the story needs to arrive somewhere. Sessions feel like they are going somewhere indefinitely without ever getting there. The tension never resolves. The story never ends. The player eventually abandons it, not because it broke, but because it never meant anything.

Pure structural control · the Ink / Twine model · fails for the opposite reason. Every branch is manually authored. The tree of possibilities is finite and visible to the player as a finite choice menu. There is no genuine improvisation, no surprise, no sense that the world is responding to you specifically. The story can only go to places the author prepared. This is fine for small, tightly scoped experiences, but it does not scale and it does not adapt.

Research in 2025 converges on hybrid approaches. Systems like STORYVERSE translate author-defined abstract plot points · narrative acts · into detailed character actions via LLM, allowing the story to evolve dynamically while still respecting the author's plot plan. Answer Set Programming (ASP) guided generation produces more structurally diverse stories than unguided LLMs while maintaining causal soundness. The pattern is consistent: structure governs the spine, LLM generation fills the flesh.

The sweet spot: Fixed spine, variable flesh. The spine · the core conflict, the characters' arcs, the world change · is planned by the architect (Sonnet) and does not vary. The flesh · how clues are discovered, which NPC delivers which information, the texture of each beat · is generated fresh every time. The player experiences genuine agency in the flesh while the story maintains the coherent shape of the spine.

This architecture also resolves a subtle problem that pure freedom creates: narrative escalation debt. When an LLM has no structural beat constraints, it tends to escalate each scene to be more intense than the last, because intense continuations are statistically more likely to be positively reinforced in its training signal. The result is stories that peak too early and then cannot come down. By the third exchange, the fate of the universe is at stake. By the fifth, it has happened again. Beat type constraints · enforced by the choreographer before any rendering begins · prevent this by requiring that specific beat slots be filled with atmosphere beats, character beats, and interaction beats, not just story beats. The arc has breathing room because the architecture requires it, not because the renderer chose it.
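One way the choreographer's beat-type requirement could be enforced mechanically · the quotas and tolerance below are invented for illustration, not the engine's actual numbers:

```python
from collections import Counter

# Illustrative target distribution across an arc; not a documented spec.
REQUIRED_MIX = {"story": 0.4, "atmosphere": 0.2, "character": 0.2, "interaction": 0.2}

def validate_beat_mix(beat_types: list, tolerance: float = 0.15) -> list:
    """Return violations where a beat type's share drifts beyond tolerance.

    An all-story arc (the escalation trap) fails this check structurally,
    before any rendering happens.
    """
    counts = Counter(beat_types)
    total = len(beat_types)
    problems = []
    for btype, target in REQUIRED_MIX.items():
        share = counts.get(btype, 0) / total
        if abs(share - target) > tolerance:
            problems.append(f"{btype}: {share:.0%} vs target {target:.0%}")
    return problems
```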

05 · The Fourth Failure Mode

NPC Personality Consistency

NPCs are the single most important quality signal in interactive narrative. Players forgive a lot · clunky pacing, repetitive beat structure, predictable plot · but they do not forgive an NPC who sounds different in every conversation. The sense that a character is a real person with a consistent inner life is the foundation of emotional investment. Shatter that, and the story becomes a text generator rather than a world.

The naive approach · "be this character: Marcus, a gruff cargo hauler who has seen too much" · fails reliably across extended sessions. Researchers identify this failure as having two components. The first is attention diversion: as the context window fills with conversation history and injected world state, the model's attention to the character description weakens. The character-defining text is technically in the prompt, but its statistical influence on output decreases relative to the volume of other tokens. The NPC begins to drift toward a generic "reasonable person" register.

The second component is what the literature calls Flanderization · a term borrowed from animation criticism. A complex character gradually simplifies until only their most salient trait remains. Marcus the gruff cargo hauler becomes just gruff. Then just short-tempered. Then generic aggressive. The nuance collapses under the weight of repeated generation pressure.

Unconstrained NPC Prompt

"You are Marcus, a cargo hauler. You are gruff and have seen too much. Respond to the player's question."

Result: Generic gruffness. Inconsistent across sessions. Voice drifts toward whatever the training data says a "gruff hauler" sounds like. No distinguishing features survive more than a few exchanges.

Constrained Entity Card Prompt

"Voice: clipped sentences, no small talk, nautical slang carried over from a prior life on water. Trait: pragmatic to the point of cruelty. Ideal: a deal is a deal, no exceptions. Bond: the ship is the last thing from his old life. Flaw: cannot ask for help. Agenda: needs to deliver this cargo before they find out what's in it. NEVER: speak in paragraphs. NEVER: volunteer information."

Result: A distinct voice that survives long sessions. Specific enough to be recognizable, constrained enough to stay in register.

The solution is not richer characterization in the sense of more description · it is richer characterization in the sense of more behavioral constraints. The model needs to know not just who this person is, but what they never do, what they always do, how they structure sentences, what topics they avoid, what topics they cannot help but return to. Research on "codified profiles" shows that even 1-billion-parameter models can maintain profile consistency comparable to much larger models when behavioral logic is expressed as executable constraint rather than narrative description. The constraint is the load-bearing element, not the prose.

This maps directly to the Entity schema in the Narrative Engine. An NPC entity must carry its voice constraints · not as a character biography, but as a rendering specification. That specification is injected into every Haiku call that involves the NPC. The model does not need to remember who this person is. The architecture ensures that every single call that produces this NPC's voice begins from the same explicit constraints.
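A minimal sketch of such a card as data rather than biography · the fields mirror the constrained example above, but the class and method are hypothetical, not the Entity schema itself:

```python
from dataclasses import dataclass

@dataclass
class VoiceCard:
    """Rendering specification for an NPC: behavioral constraints, not prose."""
    voice: str
    trait: str
    ideal: str
    bond: str
    flaw: str
    agenda: str
    nevers: list  # hard behavioral boundaries, injected into every call

    def as_prompt_block(self) -> str:
        # The same explicit block opens every render call for this NPC,
        # so consistency never depends on the model "remembering" anything.
        lines = [
            f"Voice: {self.voice}", f"Trait: {self.trait}", f"Ideal: {self.ideal}",
            f"Bond: {self.bond}", f"Flaw: {self.flaw}", f"Agenda: {self.agenda}",
        ] + [f"NEVER: {n}" for n in self.nevers]
        return "\n".join(lines)
```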

06 · The Fifth Failure Mode

The Escalation Trap

Pacing is the most underappreciated dimension of narrative quality. A story with good prose, coherent world state, and consistent characters can still feel exhausting if every beat escalates intensity. Drama requires contrast. Tension is only felt against the background of rest. A system that escalates every scene does not produce high-stakes narrative · it produces numbness.

LLMs left to their own devices escalate. This is not a design choice · it is a statistical tendency. Training data is biased toward scenes that were engaging enough to be written and preserved. Engaging scenes tend to involve conflict, revelation, or consequence. The model therefore over-samples from conflict and consequence when generating continuations, because those continuations are statistically most similar to the content it was trained to produce. An unconstrained generation call for "the next scene" will disproportionately produce scenes with raised stakes, urgent problems, and advancing threats.

The cost of this is invisible until you look at the arc as a whole. Beat 1: conspiracy revealed. Beat 2: new threat emerges. Beat 3: the stakes are revealed to be higher. Beat 4: something is worse than expected. By beat 5, the player has been at maximum tension for twenty minutes, and the nominal climax of the arc · when it arrives · lands with no emotional impact because there is nowhere higher to go. The escalation trap means every beat feels like a climax, which means none of them do.

Tension without release is not drama · it is exhaustion. The atmosphere beat and the character beat exist precisely to provide the contrast that makes the story beat land.

Beat types are the architectural intervention. When the choreographer is required to produce a specific distribution of beat types across an arc · story beats, character beats, atmosphere beats, interaction beats, decision beats · escalation becomes structurally impossible. An atmosphere beat cannot escalate stakes. A character beat cannot raise the threat level. These constraints are not creative limits. They are the mechanism that makes emotional variation possible in the first place.

Research on emotional arc-guided generation (2025) confirms this directly. Systems that specify emotional arc trajectories · explicitly including Fall segments, not just Rise · produce significantly higher player engagement and narrative coherence scores than systems that generate beats without pacing constraints. The emotional arc is not decoration applied after the fact. It is a primary design input that must constrain generation, not describe its output.

This means the Narrative Engine's choreographer (Sonnet) must specify beat types before Haiku renders anything. The beat type is not a tag applied after generation. It is a constraint that precedes it. A beat marked as an atmosphere beat must be rendered by Haiku under the constraint that it cannot advance the plot or raise stakes. Haiku does not decide whether a beat escalates. The architecture does.

The escalation constraint is non-negotiable. Beat type must be specified in the choreography phase and enforced as a rendering constraint in every Haiku call. A renderer that receives only "beat 7 of 12 in a conspiracy arc" will escalate. A renderer that receives "beat 7 of 12 · type: character beat · purpose: show the informant's personal cost" cannot escalate even if it tried.
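A sketch of how the type constraint might travel into the render call · the constraint wording and function are illustrative, not the engine's actual prompts:

```python
# Per-type render constraints (illustrative wording for the sketch).
BEAT_TYPE_CONSTRAINTS = {
    "atmosphere":  "Do not advance the plot. Do not raise stakes. Texture only.",
    "character":   "Reveal interiority. No new threats, no new plot facts.",
    "story":       "Advance exactly one plot thread established earlier.",
    "interaction": "Invite a player response. Resolve nothing.",
}

def render_instruction(beat_no: int, total: int, beat_type: str, purpose: str) -> str:
    # An unknown beat type raises KeyError here, before any render happens:
    # the renderer can never receive a beat without its type constraint.
    constraint = BEAT_TYPE_CONSTRAINTS[beat_type]
    return (f"beat {beat_no} of {total} · type: {beat_type} beat · "
            f"purpose: {purpose}\nCONSTRAINT: {constraint}")
```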

07 · What Actually Works

Prompt Engineering Findings

Beyond the specific failure modes, this research surfaced a set of prompt design principles that distinguish reliable high-quality generation from unreliable mediocre generation. These are not philosophical preferences. They are observable architectural patterns with measurable output consequences.

Anti-Patterns Over Examples

Examples in prompts become statistical ceilings. The model learns the distribution of the examples and optimizes toward it. A bulletin board example that is good becomes the target that all bulletin board output clusters around. The result is less variation, less surprise, and a gradual regression toward the mean of your provided samples.

Anti-patterns · explicit constraints on what the output must never do · work differently. They define a boundary, not a target. Within that boundary, the model explores freely. The output is both constrained (no narrator voice, no complete sentences, no temporal hedging) and diverse (everything else remains open). This is the EV2090 discovery that the Narrative Engine inherits: NEVER beats YES.
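A boundary, unlike a target, is also mechanically checkable after generation. A toy post-render validator, with invented pattern rules:

```python
import re

# Illustrative anti-pattern rules; a real set would be per-channel and
# far larger. Each rule is a boundary, not a target.
ANTI_PATTERNS = {
    "narrator voice":   re.compile(r"\b(little did|our hero|the story)\b", re.IGNORECASE),
    "temporal hedging": re.compile(r"\b(recently|as of late|lately)\b", re.IGNORECASE),
}

def violations(text: str) -> list:
    """Return the names of every anti-pattern the rendered text breaks."""
    return [name for name, rx in ANTI_PATTERNS.items() if rx.search(text)]
```

A failed check can trigger a re-render; the prompt's NEVER clauses and the validator enforce the same boundary from both sides.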

Dynamic Context Injection

Hardcoded system prompts · prompts that contain universe-specific facts, character names, or world state · become stale instantly and cannot be reused across contexts. Dynamic context injection, where relevant state is assembled and injected at call time, keeps the prompt architecture generic while the content remains specific.

The practical implication: the narrative engine's prompts should contain no universe-specific information. The EV2090 system shows that hardcoding Sol and four planets into the prompt is a maintenance trap. The engine prompts should describe structure and constraints. The content comes from the world state.
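A sketch of the separation · the template below is deliberately universe-agnostic, and all names are illustrative:

```python
# Universe-agnostic template: structure and constraints only. All
# setting-specific facts arrive at call time, never in the template.
CHOREOGRAPHY_TEMPLATE = (
    "You are choreographing beats for an ongoing arc.\n"
    "SETTING FACTS (injected, authoritative):\n{setting}\n"
    "ACTIVE ENTITIES:\n{entities}\n"
    "Produce the next beat without contradicting any fact above."
)

def assemble_prompt(setting_facts: dict, entities: list) -> str:
    # Assembled fresh per call: the same template serves any universe.
    setting = "\n".join(f"- {k}: {v}" for k, v in sorted(setting_facts.items()))
    ents = "\n".join(f"- {e}" for e in entities)
    return CHOREOGRAPHY_TEMPLATE.format(setting=setting, entities=ents)
```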

Temperature and Coherence

Higher temperature produces more surprising continuations. Lower temperature produces more predictable ones. For narrative generation, this creates a calibration problem: you want surprising story ideas (high temp in the concept phase) but reliable structural output (lower temp in the choreography phase) and consistent character voice (low temp in rendering).

The EV2090 pipeline already applies this correctly: 0.85 for concept generation, 0.7 for audit, 0.8 for choreography. The principle: temperature should decrease as structural precision increases. The pipeline phase that must produce valid JSON with correct beat counts should not run at the same temperature as the phase that generates premise ideas.
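The schedule could be expressed as plain configuration. The first three values come from the EV2090 pipeline as cited above; the render value is an assumption added for illustration:

```python
# Phase → temperature. Concept/audit/choreography are the EV2090
# settings; the render value is an illustrative assumption.
PHASE_TEMPERATURE = {
    "concept":      0.85,  # premise ideas: reward surprise
    "audit":        0.70,  # structural judgment: reward reliability
    "choreography": 0.80,  # beat plans: valid JSON, correct counts
    "render":       0.60,  # character voice: consistency over novelty (assumed)
}

def temperature_for(phase: str) -> float:
    # Unknown phases fail loudly rather than silently defaulting.
    return PHASE_TEMPERATURE[phase]
```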

Causality Enforcement

Without explicit causality requirements, generated beats feel like independent scenes rather than a connected story. The model generates each beat to be internally coherent · good premise, correct channel, right emotional register · but the beats do not refer to each other. The player experiences a sequence of unrelated events, not a story.

Requiring every beat to specify what it reacts to and what it foreshadows forces the choreographer to think in chains rather than isolated moments. The resulting beat graph is a narrative structure, not a list. This is the difference between content generation and story generation.
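A sketch of how "chains rather than isolated moments" could be checked mechanically · the beat fields `reacts_to` and `foreshadows` follow the text above, while the validator itself is hypothetical:

```python
def validate_causality(beats: list) -> list:
    """Every beat after the opener must react to an earlier beat, and any
    foreshadowed beat must actually arrive later in the arc."""
    ids = [b["id"] for b in beats]
    errors = []
    for i, b in enumerate(beats):
        # Reaction must point backward in the sequence.
        if i > 0 and b.get("reacts_to") not in ids[:i]:
            errors.append(f"{b['id']}: reacts_to missing or not earlier")
        # Foreshadowing must point forward and be paid off.
        fs = b.get("foreshadows")
        if fs is not None and fs not in ids[i + 1:]:
            errors.append(f"{b['id']}: foreshadows {fs!r} never arrives")
    return errors
```

A beat list that passes this check is, minimally, a connected graph rather than a list · the structural floor for story rather than content.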

One additional finding concerns task decomposition. Prompts that ask the model to simultaneously generate narrative content, specify rendering instructions, assign channel metadata, and define consequences produce worse output than prompts that ask for each of these in sequence. The pipeline architecture · where each phase has a single, focused responsibility · is not organizational preference. It is a response to a measurable quality degradation that occurs when generation calls are overloaded with competing objectives. Atomicity in LLM calls improves output quality the same way it improves code quality: for the same structural reason.

08 · Discoveries

What We Learned That We Did Not Expect

Insight 1: The model is never the bottleneck

Every failure mode we identified · coherence collapse, context-blind tone, character drift, escalation traps · is an architectural problem, not a model capability problem. The same model, given better structural scaffolding, produces dramatically better output. This finding is counterintuitive: the temptation when output is bad is to use a bigger model. The evidence says: fix the architecture first. A well-structured call to a smaller model outperforms an unstructured call to a larger one for every category of narrative quality we examined.

Insight 2: "Consistency" and "variety" are not in tension

The industry assumes that making AI narrative consistent requires sacrificing variety · that constraints produce sameness. The research shows the opposite. Constraints on what content cannot do (no narrator voice, no escalation in atmosphere beats, no direct name-dropping of world events) produce more varied output within those bounds than unconstrained generation, which clusters around statistically probable patterns. The constraint is a forcing function for exploration. The unconstrained model takes the easiest path. The constrained model must find a path that does not take the easy exits.

Insight 3: The rendering layer is the quality problem, not the generation layer

In reviewing EV2090 and comparable systems, the narrative architecture (what happens) is consistently more reliable than the rendering output (how it is presented). Story arcs are coherent. Beat sequences make structural sense. The failure is at the last mile: the actual text that the player reads. This is a diagnostic finding with a clear implication · the Narrative Engine should invest its prompt engineering effort disproportionately in the rendering layer. The choreographer can be lean because it produces structure, not prose. The channel renderers must be rich because they produce the output players actually experience.

Insight 4: Passive timelines solve a generation problem, not just a design problem

We included passive timelines (what happens if no player acts) because they are good narrative design. We discovered they also solve a generation quality problem. When the choreographer must specify not just "what beats exist" but "what happens on this timeline regardless of players," it is forced to think about causality and consequences in a way that pure beat listing does not require. The passive timeline is a forcing function for internal story logic · if the world proceeds without the player, the beats must make causal sense in sequence. Arcs generated with passive timeline requirements show substantially better internal causality than arcs generated as beat lists alone.

09 · Engine Implications

How This Research Shapes the Architecture

Each failure mode identified in this research maps directly to a structural decision in the Narrative Engine.


World State as mandatory context in every generation call. Coherence collapse is prevented not by model capability but by architectural discipline · every Sonnet call receives the relevant world state slice. The model does not remember. The system provides.

Sonnet architects, Haiku renders · these are separate calls with separate prompts. The context-blind generation problem is solved by architectural separation. The WHAT and the HOW are never combined in a single call. The rendering context (channel, voice, format, anti-patterns) is fully specified before the render call is made.

No examples in prompts · constraints and anti-patterns only. This is inherited from EV2090 and confirmed by research. Examples become ceilings. Anti-patterns enforce boundaries without bounding the solution space. Every prompt in the engine specifies what the output must never do.

Beat types are specified by the choreographer, not chosen by the renderer. The escalation trap is closed by requiring Sonnet to assign a beat type · with its associated constraints · before Haiku renders anything. Haiku renders within the type constraint. It cannot escalate an atmosphere beat even if its statistical tendencies pull it toward escalation.

NPC voice is defined in the Entity schema and injected into every render call for that NPC. Character drift is prevented by making the personality card a first-class entity attribute, not a prompt narrative. The constraint specification · behavioral rules, NEVER clauses, sentence structure, topics · travels with the entity into every call that involves it.

Causality is mandatory in beat choreography. Every beat must declare what it reacts to and what it foreshadows. The choreographer cannot produce a beat that has no causal relationships. This transforms beat lists into beat graphs · the minimum structure for story rather than content.

Prompts are pipeline-phase-specific. Each phase · concept, audit, choreography, per-beat render · has exactly one responsibility. Calls that combine responsibilities degrade in output quality. The architecture enforces atomicity in generation the same way good software enforces it in functions.

10 · Connections

Connections to Other Research

Research 01 · D&D Module Structure

The Original Channel Rendering Concept

The D&D "read-aloud text" box is the first formal delivery channel: text written for a specific context with specific register, pacing, and vocabulary rules. The NPC personality model (trait / ideal / bond / flaw) is the original personality card system. Both solved the same problems this AI research identifies, decades before LLMs existed.

Research 02 · Warhammer: The Enemy Within

Fixed Spine as a Coherence Constraint

Fixed Spine, Variable Flesh is the architectural answer to the coherence collapse problem. TEW proved you can have a coherent authored story with genuine variation across playthroughs. The spine maintains integrity; the flesh is where LLM generation operates without threatening it.

Research 04 · Improv DM Techniques

Human Analog to Structural Constraints

The Fronts / Clocks model from Apocalypse World solves passive timeline generation. Improv DM techniques are the human practice of exactly what this research recommends algorithmically: maintaining coherence through constraints, not memory. The Three-Clue Rule is the human version of redundant information paths.

Research 06 · EV2090 Code Analysis

Failure Modes in Production

EV2090 is a working implementation that exposes the exact failure modes described here at production scale. Context-blind rendering and shallow NPC voice are not theoretical problems · they are the measured daily output of a live game. The 5 failure modes from the ACL survey map directly to the 8 structural gaps in the EV2090 analysis.

The procedural narrative AI landscape is littered with systems that failed for structural reasons · reasons that were visible in retrospect but not prevented in design. The failure modes are known. The mitigations are known. The remaining question is whether a system can implement all of them simultaneously, at scale, without the architecture collapsing under its own constraints.

That is the question the Narrative Engine is designed to answer.