World models, memory and JEPA as partial responses to dynamics, persistence and the prediction space
The dominant paradigm in 2026 remains the autoregressive model trained by maximum likelihood on a corpus of symbols. Its empirical success is considerable. But what in a large language model passes for understanding, memory or simulation of the world is not directly optimized as such: it is an emergent property of sequential learning on symbolic traces.
The standard LLM presents three distinct blind spots. The first is dynamics: the model can describe a consequence, but it has not learned the causal dynamics that produce it. The second is persistence: the context window is a computational device, not memory in the strong sense, and a larger scratchpad than in 2022 is not enough to say the model possesses memory. The third is the prediction space: predicting the next token imposes tokenization as the target space, while much of what we seek to predict (physical states, biological trajectories, clinical evolution under intervention) does not reduce cleanly to a symbolic sequence.
Three families of architectures respond, each partially, to these gaps: world models attack dynamics, memory models attack persistence, JEPA-type architectures attack the prediction space. This term-by-term correspondence is a starting point rather than a settled state, since the boundaries shift as soon as one looks at the most recent architectures. Above all, these lineages are not substitutes for one another. Their probable trajectory is not mutual elimination but composition. And composition does not solve the problem: it relocates it to module coordination, training stability, action governance, output validation. A hybrid architecture is not a magical synthesis; it is a stack of better-localized problems.
A world model, in the strict sense, is a model that learns the dynamics of an environment from observations coupled with actions, and that allows the prediction of a future state conditionally on a sequence of actions. This strict definition emphasizes dynamic projection. There is, however, a broader use: generative video models like Sora can be described as implicit world models, since they learn dynamics without the conditionality on action being explicit. Conflating the two makes for fine announcements and poor architectural choices, which is by now a well-established industrial tradition.
A memory model is an architecture that distinguishes current computation from persistent, compressed or re-addressable storage. The criterion is not the length of the context but the differentiation between immediate processing and conservation. JEPA, for Joint Embedding Predictive Architecture, designates a family that predicts in the space of representations rather than in the raw space of observations.
Three cuts organize the analysis: predicting observations versus predicting representations, which separates generative world models from JEPA architectures; context versus memory, which separates long-context LLMs from memory models in the architectural sense; passive prediction versus prediction conditioned on action, which separates linguistic continuation from a dynamic model. These cuts are not absolute boundaries but instruments of analysis: their function is to prevent the confusion of levels.
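The three cuts can also be read as differences in interface. A minimal sketch, in deliberately simplified Python signatures (every name below is hypothetical, invented for illustration), of what each side of each cut commits to predicting or storing:

```python
# Illustrative signatures only; names are hypothetical, not an existing API.
from typing import Sequence

# Cut 1: predict observations vs. predict representations.
def predict_next_observation(obs: "Observation") -> "Observation": ...        # generative world model
def predict_next_representation(z: "Latent") -> "Latent": ...                 # JEPA-style

# Cut 2: context vs. memory.
def answer_from_context(prompt_tokens: Sequence[int]) -> Sequence[int]: ...   # long-context LLM
class MemoryModule:                                                            # memory model
    def write(self, event: "Latent") -> None: ...
    def recall(self, cue: "Latent") -> "Latent": ...

# Cut 3: passive prediction vs. prediction conditioned on action.
def continue_sequence(history: Sequence["Observation"]) -> "Observation": ...            # passive
def simulate(state: "Latent", actions: Sequence["Action"]) -> Sequence["Latent"]: ...    # dynamic model
```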
The family of generative world models is the oldest. Ha and Schmidhuber published the eponymous paper in 2018: a variational autoencoder compresses the observation into a latent vector, a recurrent dynamic model predicts future latent states, a controller chooses actions. The important idea is not only compression but the fact that the controller can be trained inside the model’s dream. The Dreamer lineage (V1, V2, V3) generalizes this intuition with a Recurrent State-Space Model (RSSM) combining a deterministic state and a stochastic state. DayDreamer transposes the logic to physical robots, where learning by imagination reduces the cost and risk of real-world trial and error.
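A minimal sketch of the shape of that recipe, with placeholder dimensions and modules (the original papers use a VAE, a mixture-density RNN and an evolution-strategy controller; everything here is reduced to the bare loop):

```python
import torch
import torch.nn as nn

# Schematic shapes only; all sizes and modules are placeholders.
OBS_DIM, LATENT_DIM, ACTION_DIM, HIDDEN = 64, 16, 4, 32

encoder = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, LATENT_DIM))  # V: observation -> latent
decoder = nn.Sequential(nn.Linear(LATENT_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, OBS_DIM))  # reconstruction target
dynamics = nn.GRUCell(LATENT_DIM + ACTION_DIM, LATENT_DIM)                                     # M: predicts next latent
controller = nn.Linear(LATENT_DIM, ACTION_DIM)                                                 # C: latent -> action

def imagine(z0: torch.Tensor, horizon: int) -> list[torch.Tensor]:
    """Roll the dynamics forward inside the model's own latent space ("the dream"),
    without touching the real environment."""
    z, trajectory = z0, []
    for _ in range(horizon):
        a = torch.tanh(controller(z))
        z = dynamics(torch.cat([z, a], dim=-1), z)
        trajectory.append(z)
    return trajectory

# The controller can then be trained on rewards predicted along `trajectory`,
# which is the "training inside the dream" referred to above.
obs = torch.randn(1, OBS_DIM)
dream = imagine(encoder(obs), horizon=15)
```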
The limits are structural. The first is computational: pixel reconstruction can become a poor teacher, spending the model’s capacity on details with no decisional bearing. The second is temporal: prediction errors compound, and every useful horizon is a bounded horizon. The third is distributional: a world model learns dynamics local to the training distribution. This is the classic sim-to-real transfer problem, but in a more general form. Any dynamic model is reliable within a validity domain, not in the world as such.
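The temporal point admits a standard worst-case illustration, not specific to any one architecture: if the true dynamics $f$ is $L$-Lipschitz and the learned model makes a one-step error of at most $\varepsilon$, then the error on the imagined state after $t$ steps satisfies

$$
\|\hat{s}_t - s_t\| \;\le\; \varepsilon \sum_{k=0}^{t-1} L^{k} \;=\;
\begin{cases}
\varepsilon\,\dfrac{L^{t}-1}{L-1}, & L \neq 1,\\[6pt]
\varepsilon\, t, & L = 1,
\end{cases}
$$

so the bound grows geometrically whenever $L > 1$: a quantitative reading of “every useful horizon is a bounded horizon”.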
JEPA responds to a precise weakness of generative models: the obligation to predict in the raw space of observations. Its proposition is not to reconstruct better but not to reconstruct what does not deserve to be reconstructed. Part of the input serves as context, the other as target; two encoders produce representations; a predictor learns to transform the context representation to approach that of the target. The loss is computed in representation space. No pixel reconstruction is required.
The objective is therefore not sensory fidelity but abstract predictability. JEPA can be understood (in an approximate sense) as an attempt to learn the constraints of the world rather than its appearances. The mechanism avoids trivial collapse via an exponential moving average (EMA) on the target encoder weights. I-JEPA applies the principle to images, V-JEPA to video, V-JEPA-2 adds an agentic dimension: the model predicts future representations conditionally on actions. As soon as it predicts future states under action conditioning, JEPA becomes a world model in the strict sense, but a non-generative one. The relevant distinction becomes: a generative world model in observation space, or a predictive world model in representation space.
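A minimal sketch of the mechanism just described, with toy dimensions and hypothetical module names; it captures the shape of the objective (prediction and loss entirely in representation space, EMA target encoder), not the actual I-JEPA or V-JEPA architecture, masking strategy or training schedule:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 32  # toy representation size

context_encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
target_encoder = copy.deepcopy(context_encoder)   # same architecture, EMA-updated weights
predictor = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))

for p in target_encoder.parameters():
    p.requires_grad_(False)  # no gradient flows into the target branch

def jepa_loss(context_view: torch.Tensor, target_view: torch.Tensor) -> torch.Tensor:
    """Predict the target's representation from the context's representation.
    The loss lives entirely in representation space: nothing is reconstructed."""
    z_ctx = context_encoder(context_view)
    with torch.no_grad():
        z_tgt = target_encoder(target_view)
    return F.mse_loss(predictor(z_ctx), z_tgt)

@torch.no_grad()
def ema_update(momentum: float = 0.996) -> None:
    """Move the target encoder slowly toward the context encoder,
    the usual device for avoiding trivial collapse."""
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

loss = jepa_loss(torch.randn(8, DIM), torch.randn(8, DIM))
loss.backward()
ema_update()
```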
The limits must be stated without indulgence. Scale: no scaling law comparable to that of large LLMs has been demonstrated to date; next-token prediction provides text with a universal task, JEPA does not yet have one. In-context learning: no equivalent mechanism. JEPA learns representations, not a reconfigurable local program. Modal scope: JEPA is natural for visual perception and robotics but does not replace a general language model. Presenting it as a global alternative to LLMs is a rhetorical convenience.
Memory is probably the most mistreated term in the contemporary debate. Three sub-families deserve to be distinguished. Indexed external memory (RAG, MemGPT): governable, inspectable, traceable, but limited to indexable objects. A continuous physiological state or a probabilistic clinical trajectory fits poorly. Compressed recurrent state memory (Mamba, RWKV, state space models): linear cost with length, but to compress is to choose and to choose is to forget. The memory is continuous but lossy. Learned long-term memory (Titans): a neural module learns what to write, when to write, how to forget, and how to reuse. Memory becomes a trained component.
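To make the third sub-family concrete, here is a toy sketch of a learned write-and-forget gate over a single memory state; it illustrates what “a neural module learns what to write, when to write and how to forget” means structurally, and makes no claim to reproduce Titans:

```python
import torch
import torch.nn as nn

class GatedMemory(nn.Module):
    """Toy long-term memory: one trained state vector with learned write and
    forget gates. Illustrative only; real systems are considerably richer."""
    def __init__(self, dim: int):
        super().__init__()
        self.register_buffer("state", torch.zeros(dim))
        self.forget_gate = nn.Linear(2 * dim, dim)
        self.write_gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(dim, dim)

    def step(self, event: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.state, event], dim=-1)
        forget = torch.sigmoid(self.forget_gate(joint))   # how much of the past to erase
        write = torch.sigmoid(self.write_gate(joint))     # how much of the event to commit
        new_state = forget * self.state + write * torch.tanh(self.candidate(event))
        self.state = new_state.detach()                   # persisted across steps, not through autograd
        return new_state

memory = GatedMemory(dim=16)
for _ in range(3):                      # three incoming "events"
    out = memory.step(torch.randn(16))
```

The same structure makes the cost of the second sub-family visible: whatever the forget gate attenuates is not stored anywhere else.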
An episodic memory in the minimal architectural sense presupposes three conditions: a singular event indexed temporally, a recall oriented by the present situation, and a contextual update that is neither overwriting nor erasure. None of the three current industrial sub-families fully satisfies these three conditions. Memory models do not solve the world problem. They solve part of the persistence problem. They allow the conservation or compression of traces, not necessarily their understanding, hierarchization or causal use.
Across six architectural axes (predicted object, target space, action, memory, evaluation, cost), the four families differ structurally. The useful reading of the matrix is not positive (what can each family do?) but inverse: which capacities are structurally absent, not for lack of engineering but by architectural construction? The autoregressive LLM cannot learn the dynamics of an environment from its training task alone. This absence is consubstantial with the objective. The generative world model cannot exploit its internal simulation for arbitrarily long sequences. Error composition is a mathematical property of predictive chaining. JEPA cannot, in its current state, predict in the symbolic space with the flexibility of next-token prediction. A memory model does not guarantee, by its architecture alone, that what it conserves will be relevant or legitimate.
Three axes escape the matrix and must be reintroduced before any industrial use: ecosystem maturity (LLMs industrialized, world models in academic R&D, JEPA at the stage of representational proof of concept, memory at heterogeneous stages of industrialization depending on the sub-family), multi-module integration cost (which grows faster than the number of modules, each interface being itself an object of validation), and the constraints of governance, auditability and compliance. An elegant matrix can lead to an unmanageable system. The question “which one will win?” is badly posed: it assumes a global competition where there are different functions, and turns an architectural decision into a tribal bet, which is admittedly a popular method of governance, though rarely a productive one.
The three lineages converge. First movement: JEPA becomes agentic with V-JEPA-2. Representational prediction enters the territory of strict world models by a non-generative path. Second movement: transformers acquire memory (Titans, MemGPT). Context is not enough; one must distinguish what is being manipulated now from what must be conserved, recalled or forgotten. Memory becomes an architectural component. Third movement: world models integrate memories and more abstract representations. Dreamer already has a recurrent latent state readable as working memory, and the open question is that of the coupling between a latent dynamic model, a learned long-term memory and a predictive representation encoder.
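What such a coupling could look like at the level of a single step, as a purely illustrative sketch (every module below is a trivial stand-in; no published system is being described):

```python
import torch
import torch.nn as nn

# Purely hypothetical composition, for illustration only.
DIM = 16
encoder = nn.Linear(DIM, DIM)                      # predictive representation encoder (JEPA-like role)
dynamics = nn.Linear(2 * DIM, DIM)                 # latent dynamic model (world-model role)
memory_bank: list[torch.Tensor] = []               # stand-in for a learned long-term memory
policy = nn.Linear(DIM, 4)                         # action head; governance must sit before it acts

def agent_step(observation: torch.Tensor) -> torch.Tensor:
    z = encoder(observation)                                             # abstract state, not raw observation
    recalled = memory_bank[-1] if memory_bank else torch.zeros_like(z)   # history relevant to the present cue
    z_next = dynamics(torch.cat([z, recalled], dim=-1))                  # projection conditioned on memory
    memory_bank.append(z.detach())                                       # persistence as an explicit operation
    return policy(z_next)                                                # a recommendation, not yet an action

action_logits = agent_step(torch.randn(1, DIM))
```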
But this convergence relocates the difficulty. Composing an LLM, a world model, a memory and a predictive encoder does not automatically produce a superior system. It produces a system more difficult to train, to interpret, to validate, and to govern. Four problems appear immediately: error propagation (in a composite architecture, error does not stay local; it circulates), objective coordination (rarely spontaneously aligned), validation (the test surface grows faster than the number of modules, which is one of the many places where architectural enthusiasm dies, stabbed by quality assurance), and governance (who decides that a representation is reliable enough to feed an action?).
On terrains such as clinical digital twins, this conclusion is not theoretical. A system that projects patient trajectories must combine dynamics, memory of clinical history, abstract representation of unobserved states, and a governance layer. None of the three families is sufficient alone, and their composition is acceptable only if interfaces, hypotheses and validity limits are explicit. Failing this, the system inherits the blind spots of each module without inheriting their guarantees.
This cartography itself has limits. No inter-family benchmark: LLMs, world models, JEPA and memory systems are evaluated by incommensurable protocols. Direct comparisons are rarely scientific; they are often editorial, including, on a smaller scale, the one in this note. No demonstrated scaling law for JEPA at the level of large LLMs: the difference between a promising direction and a dominant paradigm is called quantitative proof. It is tedious, but reality sometimes has that bad taste. Fragility of sim-to-real transfer for world models. Polysemy of “world model”: useful for selling a vision, dangerous for designing an architecture. None of these families natively integrates complete governance: an LLM does not naturally distinguish recommendation from action, a world model does not bound its validity domain, JEPA does not provide causal traceability of its representations, a memory module does not guarantee the legitimacy of what it recalls. Governance must be architected around the model, sometimes within it, but it does not emerge automatically from performance.
The right diagnosis is functional, not competitive. The autoregressive paradigm has revealed remarkable power in manipulating symbolic sequences, but it leaves three problems open: world dynamics, information persistence, choice of prediction space. World models, memory models and JEPA respond to them, each partially, each with its limits. An LLM speaks about the world. A world model projects possible states of the world. JEPA learns predictive representations of the world. A memory model conserves or recalls traces of the world. None constitutes, on its own, a complete architecture.
The strategic question is therefore not: which paradigm will win? It is: what minimal combination of capacities is necessary for the use case under consideration, under what hypotheses, with what validity domain, what cost, what governance and what proof? This reformulation forbids conflating laboratory announcement, product promise and operable architecture. That is less spectacular than a prophecy. It is, above all, the only level at which an architectural decision ceases to be a belief and becomes defensible.
Full text freely available in the PDF below (9 pages, with figures and footnotes).