Encoding, transduction and world models — Part 4. Why latent predictive architectures shift the problem of learning without crossing the threshold of biographical memory
Part 4 of the series “Encoding, transduction and world models”. Follows part 1/3, part 2/3 and part 3/3. Original article in French.
The trilogy argued that every cognitive architecture operates through representational mediation, and that the deepest difference between human cognition and contemporary architectures lies in the nature of the relations between representations: statistical co-occurrence on one side, biographical edge on the other. For what follows, we need to fix a functional definition of this latter notion, independent of any phenomenological commitment. A biographical edge designates a relation between mnesic representations that simultaneously satisfies four operational conditions: indexation on the history of a continuous agent, co-activation within the same episode, preservation of an occurrence context, and the possibility of situated recall from several modal perspectives. The thesis defended here: self-supervised predictive architectures shift the objective of learning in an epistemically decisive way, moving from the prediction of observations to the prediction of constraints. But this shift operates at a level where the conditions of the biographical edge are not defined. Not that it fails to satisfy them; it does not address them.
The analysis operates on three levels that must be distinguished explicitly. The computational level concerns the mechanics of architectures: encoders, latent spaces, prediction functions, regularization mechanisms. The epistemic level concerns what is learned in the strong sense: structure of coherence, invariants, constraints captured in the representation. The phenomenological level concerns memory as lived by a situated subject. These three levels are not interchangeable. A computational property does not mechanically imply an epistemic property; an epistemic property does not mechanically imply a phenomenological property. Conflating the levels produces two symmetric errors: over-attribution (“JEPA understands the world”) and under-attribution (“JEPA has no memory”). JEPA does not fail to produce a biographical memory. It operates at a level where this notion is not defined. This distinction conditions what we are entitled to expect of these architectures in regulated environments, where out-of-distribution robustness is not a secondary objective but a compliance requirement.
JEPA is part of a lineage of self-supervised architectures: SimCLR, MoCo, then BYOL (which shows that asymmetric prediction between views can avoid contrastive learning without representational collapse), DINO, MAE, and finally I-JEPA and V-JEPA, which formalize latent prediction over context-target pairs. The specificity of JEPA with respect to BYOL is that the target is spatially situated via a position signal, and the context is explicitly masked rather than defined by augmentation. A crucial technical caveat must be stated, otherwise the analysis turns into ungrounded praise: a latent predictability objective alone converges toward collapse, with all representations collapsing into a single point. The properties of invariance and compression are guaranteed only by explicit anti-collapse mechanisms: variance and covariance regularization (VICReg), a target encoder updated by exponential moving average (EMA), architectural asymmetry between context encoder and target encoder, or combinations of these mechanisms. What JEPA learns is defined by the objective combined with these structural constraints, not by the objective alone. The predictive virtues are inseparable from inductive engineering choices that must be documented as such.
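These anti-collapse mechanisms can be made concrete. Below is a minimal numpy sketch of the two mechanisms named above, an EMA update for the target encoder and VICReg-style variance and covariance penalties; the coefficients, shapes and the `gamma` margin are illustrative choices, not the published hyperparameters:

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """Target encoder trails the context (online) encoder by exponential
    moving average: one of the anti-collapse mechanisms named above."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

def vicreg_penalties(z, gamma=1.0, eps=1e-4):
    """VICReg-style regularizers on a batch of embeddings z of shape (n, d):
    the variance term pushes each dimension's std above gamma; the covariance
    term pushes off-diagonal covariances toward zero (decorrelation)."""
    n, d = z.shape
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = float(np.mean(np.maximum(0.0, gamma - std)))  # hinge per dimension
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = float((off_diag ** 2).sum() / d)
    return var_loss, cov_loss

# A fully collapsed batch (all embeddings identical) is maximally penalized
# by the variance term: the predictability objective alone would accept it,
# the regularizer does not.
v_collapsed, _ = vicreg_penalties(np.zeros((8, 4)))
v_spread, _ = vicreg_penalties(np.random.default_rng(0).normal(size=(256, 4)))
```

The point of the sketch is the asymmetry: latent predictability is indifferent to a collapsed batch, and it is the variance hinge, not the objective, that rules it out.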
The latent space learned by a JEPA is not a feature space in the classical sense, a dictionary of visual or semantic patterns useful for downstream tasks. It is a coherence space: a geometry in which certain configurations of representations are compatible with each other and others are not. A technical point conditions everything that follows: standard JEPA, in its I-JEPA and V-JEPA formulations, does not explicitly encode the transformations of the world. It is not a dynamical model in the sense of a system simulating state trajectories. What learning by latent prediction over context-target pairs produces is a mapping function between latent representations, constrained such that pairs corresponding to natural co-occurrences in the data have mutually predictable representations. The dynamics of the world is not represented as such; it is implicitly constrained by the predictability structure of the latent space. JEPA does not encode transformations themselves. It encodes the constraints that make certain transformations predictable. This characterization applies to standard JEPA; the explicitly dynamical extensions, hierarchical (H-JEPA) or action-conditioned (A-JEPA), remain to date largely at the program stage. JEPA is neither a memory, nor a simulator, nor an agent: it is an architecture that learns a geometry of predictability.
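The mapping described here, latent prediction of a masked target from a context conditioned on the target's position, reduces to a short skeleton. Everything below (linear encoders, a two-number position code, the `jepa_step` name) is an illustrative stand-in for the real transformer-based components, not I-JEPA itself:

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_LAT, D_POS = 16, 8, 2
W_ctx = rng.normal(size=(D_LAT, D_IN)) * 0.1        # context encoder (toy, linear)
W_tgt = W_ctx.copy()                                # target encoder: EMA copy in practice
W_pred = rng.normal(size=(D_LAT + D_POS, D_LAT)) * 0.1  # position-conditioned predictor

def jepa_step(context_patch, target_patch, target_pos):
    """One latent-prediction step. The loss is computed between
    representations, never between pixels: the model predicts the
    representation of the masked target, given where the target sits."""
    s_ctx = W_ctx @ context_patch                   # encode the visible context
    s_tgt = W_tgt @ target_patch                    # encode the target (stop-gradient in practice)
    s_hat = np.concatenate([s_ctx, target_pos]) @ W_pred  # predict the target latent
    return float(np.mean((s_hat - s_tgt) ** 2))     # distance between latents

loss = jepa_step(rng.normal(size=D_IN), rng.normal(size=D_IN),
                 np.array([0.25, 0.75]))            # normalized target position
```

Nothing in this objective generates a pixel or simulates a trajectory; it only scores the compatibility of two latents, which is exactly the sense in which JEPA constrains rather than simulates.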
An analogy circulates regularly: the latent space of a JEPA would function as a form of long-term memory, even as an analogue of proprioception. The analogy captures a partial intuition (representational stability under transformations that preserve the predictable structure) but misses the essential point. Biological proprioception is the continuous emission, by the body, of an internal signal that informs the central nervous system of the effective state of the motor system. No current JEPA possesses anything of the kind; its coherence is purely representational. What remains is to say what JEPA does positively, without the biological crutch. Three properties deserve to be named in their own right: structural invariance under partial transformation (emergent, modulo the anti-collapse constraints); compression oriented toward predictability (informational filtering by predictive utility, distinct from filtering by information loss in classical autoencoders); selection of information constrained by predictability. The classical paradigm articulated observe → abstract → classify; the JEPA paradigm articulates observe → constrain → anticipate. The model constrains a space of plausible latent continuations, without making them explicit. The distinction with an explicit simulator is crucial: a dynamical world model generates trajectories; JEPA delimits the space within which these trajectories should be generable.
Labels do not describe the world. They constrain its use. The label “malignant tumor” affixed to a radiological image does not describe an intrinsic property of the image — it indicates the expected clinical use of this image within a given decision framework. The disease is not in the pixel; it is in the articulation between the pixel, the patient’s history, the diagnostic protocol and the therapeutic decision. The label compresses this articulation into a binary signal useful for supervised learning, but this compression is a domain-specific projection, not an ontological description. A JEPA learns a geometry of coherence independently of any domain-specific partition. When a supervised head is added to a pre-trained SSL encoder, learning is not being completed: an arbitrary projection dictated by the needs of a downstream task is being reinjected into a space optimized on other criteria. This reinjection is legitimate and even indispensable, but it must not be confused with a revelation of the structures the encoder would have discovered. Hence a strategic partition: self-supervision is a mechanism of structure discovery; supervision is a mechanism of use projection.
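The partition between structure discovery and use projection has a standard operational form: freeze the pre-trained encoder and train only a linear head on top of it (a linear probe). A minimal sketch with a stand-in identity encoder and plain logistic regression; the function name and hyperparameters are illustrative choices, not a reference implementation:

```python
import numpy as np

def linear_probe(encoder, X, y, lr=0.1, steps=200):
    """Train a logistic-regression head on frozen features. The encoder's
    geometry is untouched: this is use projection, not structure discovery."""
    Z = encoder(X)                               # frozen representations (n, d)
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))   # predicted probabilities
        g = p - y                                # gradient of the log-loss
        w -= lr * Z.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Toy check: a linearly separable labeling is recovered by the head alone.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float)                  # arbitrary downstream partition
w, b = linear_probe(lambda x: x, X, y)           # identity encoder as a stand-in
acc = float((((X @ w + b) > 0).astype(float) == y).mean())
```

The arbitrariness of the projection is visible in the code: swapping the labeling `y` changes `w` and `b` but leaves `Z` untouched, and the downstream task only ever sees the head.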
Three properties take on particular importance, to be formulated as engineering hypotheses supported by a substantial body of empirical evidence, not as intrinsic properties of the architecture. Out-of-distribution robustness: an SSL encoder pre-trained on a broad distribution, then projected by supervised fine-tuning, generally degrades less than a supervised classifier optimized for correlations specific to the training site (Azizi et al.; CheXzero; RETFound). To be established case by case; there is no theoretical guarantee. Dependence on annotated datasets: the HDLSS (high-dimension, low-sample-size) situation severely penalizes pure supervision; SSL pretraining on unannotated corpora can shift part of the learning burden outside the annotation-intensive phase, on the condition that a structurally comparable corpus exists (Krishnan, Rajpurkar, Topol). Trajectory modeling: medicine is essentially temporal; an architecture that learns predictability constraints over temporal pairs can produce representations useful for modeling disease or treatment trajectories.
Illustrative box. In the TweenMe / OCTOPUS program on mNSCLC patients carrying the BRAF V600E mutation (n=184, 5 European countries), the work on trajectories led to mobilizing a combination of representation learning and modeling by SurvTRACE (Wang & Sun, 2022), a transformer architecture for survival analysis in the presence of competing events, with a TSTR (train-on-synthetic, test-on-real) fidelity measured at 95.2 percent on the validation cohort. This metric is not a proof of general statistical indistinguishability nor a demonstration of the intrinsic superiority of the learned representations; it is an indicator, within the considered evaluation framework, that the generated trajectory preserves the operational properties useful for downstream tasks. A field implementation, not a universal demonstration.
Six limits must be firmly held. (1) No proof to date that a JEPA learns a complete physics of the world: the existing demonstrations (I-JEPA, V-JEPA) show invariances learned on natural distributions, but these invariances cover only a fraction of real physical constraints. (2) The learned latent space is, in the great majority of configurations, non-interpretable, a limit shared by the whole of SSL, which weighs particularly under Regulation (EU) 2017/745 (MDR) and the EU AI Act, where non-interpretability must be compensated by other guarantees (post-hoc explainability, drift monitoring, independent validation by external cohort). (3) Evaluation is technically delicate: classical metrics do not directly measure what JEPA is supposed to learn. (4) Performance strongly depends on the design of masks and on context-target pair generation strategies: what sometimes resembles automatic discovery is partially encoded in inductive engineering choices. (5) SSL pretraining requires substantial compute: the reduction of dependence on annotations is paid for in GPU cycles on massive pretraining corpora; the argument transfers part of the cost rather than eliminating it. (6) Temporal drift: an encoder pretrained in 2025 has no guarantee of remaining valid in 2030, once imaging protocols, cohort demographics or acquisition modalities have evolved. JEPA shifts the problem of learning. It does not entirely resolve it.
Three operational conditions distinguish a biographical memory from a latent coherence. Contextual reindexing: access to a mnesic content along several entry paths (modal, temporal, affective) and reactivation, from each of them, of a coherent configuration of the entire episode. Multi-episode integration: articulation of distinct episodes through relations structured by an agent’s history, neither purely statistical nor purely temporal. Agent persistence: continuity in time of a unique referent to which mnesic edges are indexed, with the additional property that this agent can treat past episodes as episodes lived by itself, and not as external data available for consultation. The distinction between indexation and ownership is crucial: a journal indexes events to an identifier; it does not make them belong to a subject. A serious objection deserves to be examined: the generative agents of Park et al. (memory stream, reflection, multi-criteria retrieval), Voyager (persistent skill library) and ReAct seem to offer what the three conditions describe. Examined without complacency: on contextual reindexing, the memory stream offers multi-axis indexation, but operates on homogeneous textual entries; it re-articulates the episode rather than reactivating it. On multi-episode integration, reflection produces summaries (compressive, lossy), not relations preserving the individuality of episodes. On agent persistence, these architectures all have a persistent identifier and a journal indexed to it, but do not satisfy the ownership condition: retrieval is an indexing operation, not a situated reactivation. None of these architectures satisfies the three conditions simultaneously in the strict sense; they approach them by juxtaposition. The phenomenological → functional translation isolates an operational minimum below which the threshold is not crossed, independently of any commitment on consciousness.
If an architecture does not satisfy this minimum, the threshold is not crossed, whatever the metaphysical commitments. If it does, the phenomenological question remains open as a structural surplus, a zone that current architectures do not reach.
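The claim that retrieval is an indexing operation rather than a situated reactivation can be read directly off the scoring function of a Park-et-al-style memory stream. A hedged sketch: field names, weights and the exponential decay are illustrative choices, not the paper's exact formulation:

```python
import math

def retrieval_score(entry, query_embedding, now,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0,
                    decay=0.995):
    """Weighted sum of recency, importance and relevance for one journal
    entry. Every term is computed *about* the episode, from outside it:
    nothing here reactivates the episode from within."""
    recency = decay ** (now - entry["last_access"])           # exponential decay
    dot = sum(a * b for a, b in zip(entry["embedding"], query_embedding))
    na = math.sqrt(sum(a * a for a in entry["embedding"]))
    nb = math.sqrt(sum(b * b for b in query_embedding))
    relevance = dot / (na * nb) if na and nb else 0.0         # cosine similarity
    return (w_recency * recency
            + w_importance * entry["importance"]
            + w_relevance * relevance)

# A recent, relevant entry outranks an old, irrelevant one -- ranking, not recall.
recent_match = {"last_access": 99, "embedding": [1.0, 0.0], "importance": 0.5}
old_mismatch = {"last_access": 0, "embedding": [0.0, 1.0], "importance": 0.5}
s_hi = retrieval_score(recent_match, [1.0, 0.0], now=100)
s_lo = retrieval_score(old_mismatch, [1.0, 0.0], now=100)
```

The three conditions above are absent by construction: the score indexes entries to an identifier and a clock, and ownership appears nowhere in the computation.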
What latent predictive architectures change is not the nature of artificial intelligence. It is the target of learning. Before them: learn answers, classify, reconstruct. After them: learn the constraints that make certain transformations predictable, encode the coherence of a space rather than fidelity to a signal. This shift is neither a revolution nor a detail. It is a precise epistemic movement, whose scope must be assessed at the level of what it does — reduce dependence on annotations in certain regimes, improve out-of-distribution robustness under certain conditions, structure trajectory modeling — and of what it does not do: satisfy the functional conditions of a biographical memory, neither by itself nor by simple addition of an episode journal. The strategic question for industrial architects deploying these systems in regulated environments is therefore not should we adopt JEPA? — the question is poorly posed. It is: which property are we trying to instantiate, at what level, and does the chosen architecture instantiate it, or does it merely simulate its surface? Both answers are valid depending on context, but they are not equivalent, and their confusion produces systems that appear intelligent up until the exact moment one moves them out of their training distribution. Intelligence does not reside in what is observed, but in what cannot vary. It remains to be seen whether what cannot vary is enough to constitute a subject who remembers.