I believe a more apt title for this post would have been "The Residue and the Rule: Why Predicting Meaning Beats Predicting Words"; it gives you a hint of what follows in detail.
I recommend first reading two earlier posts: Compression is not Cognition and Beyond the Compression Ceiling: Discovery over Imitation.
The Curious Case of the Brilliant Parrot
There is a peculiar asymmetry at the heart of modern artificial intelligence. Systems trained on humanity's collected writings can translate languages and generate code. They can hold conversations that would have seemed miraculous a decade ago. Yet these same systems confidently assert falsehoods and collapse under novel constraints. They degrade systematically when trained on their own outputs, a phenomenon documented with increasing rigor in studies of recursive self-training. The standard response has been to scale further: more parameters, more data, more compute. But a growing body of evidence suggests the bottleneck is not capacity. It is the nature of the learning signal itself.
Consider what happens when a scientist writes a paper or a mathematician publishes a proof. The final text represents a polished artifact, optimized for efficient communication between agents who already know how to reason. The false starts have been removed. The contradictions that forced conceptual restructuring have been edited away. What remains is a linear narrative from premises to conclusions—the residue of reasoning, not the reasoning itself. Training systems on such artifacts teaches them how intelligence sounds, not how it operates. This distinction—between imitating the outputs of thought and instantiating the mechanisms of thought—may be the most consequential in artificial intelligence research. And a recent paper from Meta FAIR, introducing VL-JEPA, provides striking empirical evidence that taking this distinction seriously yields immediate practical benefits. The argument presented here synthesizes VL-JEPA's engineering results with a broader theoretical program centered on constraint discovery—a synthesis that, while building on several active research threads, offers a novel integrative framing for understanding these results.
The Compression Hypothesis and Its Discontents
The dominant paradigm in language modeling rests on what might be called the compression hypothesis: if we predict human-generated sequences efficiently enough, intelligence will emerge as a natural consequence. This hypothesis has been validated to a remarkable degree. Models trained purely on next-token prediction exhibit behaviors resembling translation, summarization, and even rudimentary planning.
Yet the very nature of this success reveals its limitations. A model that correctly predicts "falls" after "the apple" has succeeded at the training objective, whether it represents gravity as an invariant physical constraint or simply as a statistical regularity in how humans describe fruit behavior. The training signal provides no mechanism to distinguish these cases. The model has learned a correlation without necessarily learning a cause. This ambiguity becomes critical when we examine the failure modes of modern language models. They hallucinate with confidence, generating text that matches the statistical texture of valid reasoning while violating basic constraints of logic or fact. They struggle with tasks requiring novel combinations of familiar concepts. Most tellingly, when trained recursively on their own outputs, they degrade systematically, entering a feedback loop that amplifies biases and reduces diversity. Recent work has termed this "model collapse"—a phenomenon where iterative self-training leads to progressive loss of the tails of the original distribution, ultimately degrading output quality even in common cases. This is not the behavior of a system that understands constraints. It is the behavior of a compression algorithm converging on its own approximations.
The philosophical point is precise: compression is orthogonal to understanding. A system can achieve low perplexity by learning correlations without representing the causal structure that generated them. Motion blur in a photograph proves something moved, but studying blur patterns alone does not teach physics. Similarly, linguistic patterns prove that reasoning occurred, but optimizing likelihood over those patterns does not instantiate the machinery of reasoning itself.
Constraint Discovery as an Alternative Objective
What would it mean to train systems differently? One proposal, articulated with increasing clarity in recent theoretical work and drawing on foundations in causal inference and inverse constrained reinforcement learning, is to reframe the central objective of artificial intelligence from prediction or reward maximization to constraint discovery under uncertainty. In this view, intelligence is not fundamentally about compressing the past or maximizing future reward. It is about inferring the invariant structure that governs environments through active interaction.
Physics discovers constraints on motion. Mathematics discovers constraints on symbol manipulation. Planning discovers constraints on action sequences. Even language understanding, properly conceived, involves discovering constraints on meaning and pragmatics—the rules that make certain interpretations valid and others incoherent. These are not instrumental byproducts of achieving goals but the primary accomplishment from which goal-directed behavior emerges.
This reframing carries significant practical implications. It implies that learning systems should be evaluated based on epistemic accuracy, specifically how well the inferred model of the world aligns with the true generative process. A system may achieve high benchmark performance while retaining an incorrect or incomplete understanding of underlying constraints. In contrast, a system with an accurate constraint model can adapt rapidly to changing goals, while systems optimized solely for performance require extensive retraining.
The difference is clearest in transfer scenarios. When the objective shifts, for example, from reaching an exit to collecting scattered tokens, a policy learned for the original task may fail. The system has learned what to do for one goal, not what must be true about the environment. But an accurate constraint model transfers directly. If the system has inferred the domain's structure, it can plan for new goals using its existing understanding of how the world works. This is why human intelligence shows such transfer: we build models of domain structure that remain valid across different objectives within that domain.
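To make the contrast concrete, here is a minimal gridworld sketch in Python. Everything in it is my own illustration rather than anything from the cited work: a memorized policy is just a lookup table for one goal, while a learned constraint model (here, the wall layout) supports planning for any goal in the same domain.

```python
from collections import deque

# A 4x4 gridworld; the walls are the invariant "constraints" of the domain.
# (Illustrative toy, not taken from VL-JEPA or the posts this builds on.)
WALLS = {(1, 1), (2, 1), (1, 3)}
SIZE = 4
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def neighbors(cell):
    """States reachable in one step, given the learned wall constraints."""
    x, y = cell
    for dx, dy in MOVES.values():
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE and nxt not in WALLS:
            yield nxt

def plan(start, goal):
    """Breadth-first search over the constraint model: works for ANY goal."""
    frontier, parents = deque([start]), {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return list(reversed(path))
        for nxt in neighbors(cell):
            if nxt not in parents:
                parents[nxt] = cell
                frontier.append(nxt)
    return None  # goal unreachable under the constraints

# A policy memorized for the original goal (reach the exit at (3, 3)).
# It encodes "what to do", not "what must be true about the environment".
exit_policy = {(0, 0): "right", (1, 0): "right", (2, 0): "right",
               (3, 0): "down", (3, 1): "down", (3, 2): "down"}

print(plan((0, 0), (3, 3)))     # original goal: both approaches succeed
print(plan((0, 0), (0, 2)))     # new goal: the planner reuses the same constraints
print(exit_policy.get((0, 2)))  # the memorized policy has no answer here -> None
```

The point is not the toy search algorithm but the asymmetry: the policy encodes what to do for one objective, the constraint model encodes what must be true about the environment, and only the latter survives a change of objective.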
VL-JEPA: An Empirical Proof of Concept
Against this theoretical backdrop, VL-JEPA represents something remarkable: a concrete engineering artifact that embodies key elements of this paradigm, though it is framed in the practical language of efficiency and performance rather than the philosophical language of constraint discovery.
VL-JEPA extends the Joint Embedding Predictive Architecture (JEPA) framework introduced by LeCun and colleagues. The original JEPA work showed that training vision models to predict latent representations of masked image or video regions, rather than reconstructing pixels, yields more robust and transferable features. I-JEPA for images and V-JEPA for video established that predicting in embedding space outperforms pixel-level reconstruction for self-supervised visual learning. VL-JEPA applies this principle to the vision-language domain.
The core innovation is straightforward but profound in its implications. Standard vision-language models generate text token by token, optimizing cross-entropy loss over a discrete vocabulary. VL-JEPA instead predicts continuous semantic embeddings, high-dimensional vectors in a learned latent space. The model is trained not to produce the words "the lamp turns off" but to produce an embedding that captures the meaning.
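To make the contrast between the two objectives concrete, here is a minimal PyTorch sketch. The shapes, random tensors, and the cosine-based loss are my own simplifying stand-ins rather than the actual VL-JEPA architecture or loss; the point is only the shape of the objective, discrete cross-entropy over a vocabulary versus regression onto a target embedding.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab, dim = 8, 16, 32000, 1024

# --- Objective A: standard token prediction (cross-entropy over a vocabulary).
# The model is graded on reproducing one particular string, token by token.
token_logits = torch.randn(batch, seq_len, vocab)           # stand-in for decoder outputs
target_tokens = torch.randint(0, vocab, (batch, seq_len))   # stand-in for a tokenized caption
ce_loss = F.cross_entropy(token_logits.reshape(-1, vocab), target_tokens.reshape(-1))

# --- Objective B: embedding prediction (illustrative stand-in, not the exact VL-JEPA loss).
# The model is graded on landing near the meaning of the caption in latent space,
# so paraphrases that encode to nearby vectors are all "almost right".
pred_embedding = torch.randn(batch, dim)    # stand-in for the predictor's output
target_embedding = torch.randn(batch, dim)  # stand-in for a frozen text encoder's caption embedding
emb_loss = 1.0 - F.cosine_similarity(pred_embedding, target_embedding, dim=-1).mean()

print(f"token-level CE loss:  {ce_loss.item():.3f}")
print(f"embedding-space loss: {emb_loss.item():.3f}")
```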
Why does this matter? In token space, different valid descriptions of the same event are nearly orthogonal. "The lamp turns off" and "the room goes dark" share a few tokens despite describing the same physical change. A model optimizing token prediction must learn separate probability masses for each phrasing, wasting capacity on what the VL-JEPA authors call "surface-level linguistic variability." But in embedding space, these phrases map to nearby points. The model learns to target a region of semantic space rather than a particular sequence of symbols.
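You can observe this asymmetry directly with an off-the-shelf sentence encoder. The snippet below assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, which are my choices for illustration, not components of VL-JEPA; the exact similarity value depends on the encoder, but it should sit well above the near-zero token overlap.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

a = "The lamp turns off."
b = "The room goes dark."

# Token view: as strings, the two captions barely overlap.
tok_a, tok_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(tok_a & tok_b) / len(tok_a | tok_b)
print(f"token overlap (Jaccard): {jaccard:.2f}")   # small: only 'the' is shared

# Embedding view: a generic sentence encoder (my choice of model, not
# VL-JEPA's text encoder) maps the two captions to nearby points.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([a, b], convert_to_tensor=True)
print(f"cosine similarity: {util.cos_sim(emb[0], emb[1]).item():.2f}")  # expected well above the token overlap
```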
This region is, in effect, an equivalence class of valid meanings. An equivalence class functions as a constraint, specifying what must remain invariant across acceptable variations. The model is no longer rewarded for reproducing a particular string. It is rewarded for landing inside the boundary that separates valid interpretations from invalid ones. Without explicitly framing its objective as constraint discovery, VL-JEPA has implemented something that behaves like it. This approach shares conceptual territory with other recent work pushing reasoning into latent space, notably COCONUT (Chain of Continuous Thought) and the Large Concept Models program, which explore reasoning in continuous representation spaces rather than discrete token sequences. VL-JEPA's contribution demonstrates that this principle yields concrete efficiency and transfer benefits in the multimodal vision-language setting.
The Evidence: Efficiency, Transfer, and World Modeling
The empirical results support this interpretation, though their scope should be noted. In controlled experiments using the same vision encoder, training data, and compute budget, VL-JEPA achieved stronger performance than a standard token-prediction model while using fifty percent fewer trainable parameters. This is not a marginal improvement. It suggests that much of the capacity in conventional models is devoted to learning something other than the semantic content of scenes, likely the stylistic and syntactic degrees of freedom that embedding-space training abstracts away.
More striking are the results on world modeling benchmarks. The WorldPrediction suite tests whether models understand physical causality: given two images showing initial and final states, which action video explains the transition? This task requires not pattern matching but a genuine understanding of how actions cause state changes. VL-JEPA, with 1.6 billion parameters, achieved 65.7% accuracy. GPT-4o achieved 53.3%. Gemini 2.0 achieved 55.6%. Claude 3.5 Sonnet achieved 52.0%.
These results demand explanation. Models trained on vastly more data with vastly more parameters, optimized to predict the next token in human-generated text, are outperformed by a smaller model trained to predict semantic embeddings. One interpretation is that compression-trained models have learned extensive correlations in how humans describe the physical world without learning the invariant structure of the physical world itself. They know what causal explanations sound like without knowing what makes causal explanations valid.

VL-JEPA's selective decoding mechanism provides additional evidence. Because the model produces continuous embeddings rather than discrete tokens, it can monitor a video stream in latent space and invoke its text decoder only when the embedding changes significantly—that is, only when something semantically meaningful happens. This reduces decoding operations by nearly threefold with no loss in caption quality. The model treats language as a report on its understanding rather than the understanding itself, generating text only when there is something new to report.
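The mechanism can be sketched in a few lines. The threshold rule and the numbers below are invented for illustration; the paper's actual trigger criterion may differ, but the structure is the same: decode only when the latent state has moved far enough from the last reported one.

```python
import numpy as np

def selective_decode(embeddings, decode_fn, threshold=0.15):
    """Decode a caption only when the latent state has drifted far enough
    from the last reported one (cosine distance > threshold).

    Illustrative sketch; VL-JEPA's real trigger criterion may differ.
    """
    captions, last_reported = [], None
    for t, emb in enumerate(embeddings):
        emb = emb / np.linalg.norm(emb)
        if last_reported is None or 1.0 - float(emb @ last_reported) > threshold:
            captions.append((t, decode_fn(emb)))   # something new happened: report it
            last_reported = emb
        # otherwise: stay silent, the scene has not meaningfully changed
    return captions

# Toy stream: 10 near-identical frames, then an abrupt semantic change.
rng = np.random.default_rng(0)
base, shifted = rng.normal(size=256), rng.normal(size=256)
stream = [base + 0.01 * rng.normal(size=256) for _ in range(10)] + [shifted]
print(selective_decode(stream, decode_fn=lambda e: "<caption>"))
# -> decodes at t=0 and again at t=10, skipping the 9 redundant frames
```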
A crucial caveat: these results, while striking, are concentrated in the vision-language and world-modeling domains where VL-JEPA was designed and tested. Whether embedding-prediction objectives match or surpass token objectives on all reasoning tasks, especially those requiring explicit chain-of-thought reasoning, tool use, or long-horizon planning with language as an essential intermediate representation, remains an open empirical question. The evidence supports the value of embedding prediction for certain task families; generalization beyond those families awaits further study.
The Gap That Remains
It would be an overstatement to claim that VL-JEPA fully instantiates the constraint discovery paradigm. Important elements are missing, and the gap matters for interpreting the results.
VL-JEPA achieves what might be called constraint abstraction, learning to represent equivalence classes of meaning rather than individual token sequences. But constraint discovery in the strong sense requires something more: active intervention in the environment to distinguish correlation from causation. Multiple distinct causal structures can generate identical observational data. A system learning passively from a dataset, even with a sophisticated embedding objective, cannot resolve all such ambiguities. Discovering that "pushing objects causes them to move" rather than merely "pushed objects and moving objects co-occur" requires the capacity to experiment, to push and observe, to withhold pushing and observe the difference.
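A toy simulation makes the distinction tangible. The setup below is my own construction, not drawn from the cited work: a hidden common cause (a slope, say) makes "pushed" and "moves" strongly correlated in passive data even though pushing has no causal effect at all, and only the interventional experiment exposes this.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Toy world: a hidden factor (a slope) causes BOTH pushes and movement;
# pushing itself has no causal effect on movement in this construction.
slope = rng.random(n) < 0.5
push_obs = rng.random(n) < np.where(slope, 0.9, 0.1)   # agents push more on slopes
move_obs = rng.random(n) < np.where(slope, 0.9, 0.1)   # objects move more on slopes

# Passive data: pushed objects move far more often than unpushed ones...
print(f"observed P(move | push)    = {move_obs[push_obs].mean():.2f}")
print(f"observed P(move | no push) = {move_obs[~push_obs].mean():.2f}")

# ...but under the intervention do(push), where the experimenter sets pushing
# independently of the slope, the apparent effect vanishes.
push_do = rng.random(n) < 0.5
move_do = rng.random(n) < np.where(slope, 0.9, 0.1)    # unchanged: push is not a cause
print(f"interventional P(move | do(push))    = {move_do[push_do].mean():.2f}")
print(f"interventional P(move | do(no push)) = {move_do[~push_do].mean():.2f}")
```

No amount of passive conditioning on this dataset distinguishes the two causal stories; only the ability to set the push independently of the hidden factor does.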
VL-JEPA learns from passive datasets rather than through active intervention. Its constraint representations are implicit in embedding geometry rather than explicit in symbolic or causal form. It faces no irreversible consequences that would force epistemic commitment before action. These are not criticisms of VL-JEPA, which accomplishes what it sets out to accomplish, but clarifications of what remains for future work. The partial implementation is nevertheless instructive. VL-JEPA demonstrates that changing the learning objective alone, moving from token prediction to embedding prediction, produces substantial gains in efficiency and transfer within its domain. If merely abstracting away surface variability yields these benefits, what might be possible with training objectives explicitly designed to reward structural inference through active experimentation?
Toward a Science of Reasoning-Inducing Environments
The theoretical framework suggests that environments forcing genuine constraint discovery share identifiable properties. They feature irreversibility, where actions have consequences that cannot be undone. They exhibit causal depth, where outcomes depend on extended chains of interdependent decisions. They systematically penalize naive heuristics, creating pressure to infer deeper structure when shortcuts fail.
These properties explain why certain benchmarks remain difficult for even the most sophisticated models. The Abstraction and Reasoning Corpus presents puzzles explicitly designed to resist memorization; solutions require genuine abstraction over transformation rules. Current systems struggle not because the computational requirements are high but because the task demands structural inference rather than statistical association.
Formalizing these intuitions remains an open problem. We lack a mature theory that predicts which environments induce reasoning and which permit shallow solutions. Such a theory would need to characterize constraint complexity in a way that connects to generalization: which constraint structures, once learned, transfer most broadly to new domains? This is analogous to computational complexity theory, which characterizes problems by the resources needed to solve them. We need a theory of epistemic complexity that characterizes environments by the structural inference they demand.
From Residue to Rule
The trajectory of artificial intelligence over the past decade has been defined by a single insight: compress human knowledge efficiently, and intelligence will emerge. This insight has been enormously productive. But its limits are becoming apparent in the systematic brittleness of models trained on the polished artifacts of human thought.
VL-JEPA, building on the JEPA framework and parallel work in latent reasoning, offers evidence that an alternative is possible within specific domains. By predicting semantic embeddings rather than token sequences, it implicitly targets the invariant structure underlying valid descriptions rather than the surface form of any particular description. The result is a model that is more efficient to train, more robust in transfer, and better at understanding physical causality. Though much smaller than compression-trained systems, it outperforms them on world modeling benchmarks.
The broader implication is that we may be approaching an inflection point in AI research. The constraint discovery paradigm, training systems to infer invariant structure rather than compress observed sequences, offers a coherent alternative with testable predictions. Systems trained under this objective should transfer better across goal changes, resist distribution shifts more effectively, avoid degradation during self-training, and demonstrate compositional generalization. These predictions can be systematically tested, and the research program linking VL-JEPA's empirical results to this theoretical framework suggests concrete experimental paths forward.
Whether this direction fulfills its promise remains to be determined empirically. But VL-JEPA suggests the question is no longer merely theoretical. We have our first empirical glimpse of what happens when we optimize for the rule rather than the residue. The glimpse is promising.
The motion blur in a photograph proves that something moved. But studying blur patterns alone does not teach physics. The question for artificial intelligence is whether we want systems that recognize blur—or systems that understand motion.
References
Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. CVPR.
Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471.
Barrault, L., et al. (2024). Large concept models: Language modeling in a sentence representation space. arXiv:2412.08821.
Chen, D., Shukor, M., Moutakanni, T., Chung, W., Yu, J., Kasarla, T., Bolourchi, A., LeCun, Y., & Fung, P. (2024). VL-JEPA: Joint embedding predictive architecture for vision-language. arXiv:2512.10942.
Chollet, F. (2019). On the measure of intelligence. arXiv:1911.01547.
Hao, S., et al. (2024). Training large language models to reason in a continuous latent space. arXiv:2412.06769.
Malik, S., et al. (2021). Inverse constrained reinforcement learning. ICML.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Shumailov, I., et al. (2023). The curse of recursion: Training on generated data makes models forget. arXiv:2305.17493.