Beyond Next-Token Prediction: How Yann LeCun’s LLM-JEPA Redefines AI Reasoning

This paper introduces LLM-JEPA, a novel training objective that integrates Joint Embedding Predictive Architectures (JEPA) into Large Language Models. By supplementing traditional next-token prediction with an embedding-space loss, the authors demonstrate significant gains in reasoning and generalization while successfully mitigating the common problem of overfitting in small-scale datasets.

Slideify Arxiv Research • Vol. 2025 • No. 1 • New York / Providence

Hai Huang (Atlassian) • Yann LeCun (New York University) • Randall Balestriero (Brown University)

In Brief

I. LLM-JEPA bridges the gap between vision-style embedding learning and language-style token reconstruction.
II. The approach resists overfitting, continuing to improve when standard fine-tuning begins to plateau or degrade.
III. Empirical tests across the Llama, Gemma, OpenELM, and OLMo families indicate that embedding-space regularizers improve accuracy on reasoning-heavy tasks such as GSM8K.

Introduction: The Limits of Autoregression

The dominant paradigm in Large Language Models (LLMs) has long been remarkably simple: predict the next token. From GPT-2 to the latest Llama 3 systems, the objective function has remained rooted in input-space reconstruction. While this approach has yielded ChatGPT and modern generative AI, researchers such as Yann LeCun have criticized it for producing systems, sometimes dismissed as "stochastic parrots", that lack true world models.

In computer vision, a different approach has taken hold. Joint Embedding Predictive Architectures (JEPAs) have been shown to outperform reconstruction-based methods by focusing on the relationship between different "views" of the same data in a latent, abstract space. The new research paper, LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures, brings these two worlds together. It shifts the training signal from predicting words alone to also predicting meaning.

§ The Historical Context of JEPA

Before diving into the language implementation, it helps to understand why JEPAs exist. Traditional autoencoders (like Masked Autoencoders in vision) try to fill in missing pixels.
This forces the model to learn low-level details (like the texture of a leaf) that might be irrelevant to understanding the high-level concept (a tree). JEPAs, introduced primarily by the Meta AI team, ignore the pixels and focus on embeddings. By training the embedding of one view of an object (like a photo from one angle) to be predictable from the embedding of another view, the model learns a more robust, semantic representation of the world.

[Fig 1. The Joint Embedding Predictive Architecture applied to Language Models. The input Text (View 1) and the target Code (View 2) are each passed through the encoder (ENC); a predictor maps the Text embedding toward the Code embedding, and training minimizes their distance in embedding space.]

The LLM-JEPA Objective

LLM-JEPA does not abandon the generative power of transformers. Instead, it augments the standard loss with a JEPA-style predictive term. The total loss is defined as:

L_total = γ × L_LLM + λ × distance(Predictor(Enc(Text)), Enc(Code))

In this formula, L_LLM is the standard next-token cross-entropy loss, while the distance term (often based on cosine similarity) penalizes mismatch between the predicted embedding of the first view and the actual embedding of the second. The hyperparameters γ and λ weight the generative term and the embedding-space term, respectively.

Implementing this objective requires an architectural trick. Because transformers are typically causal (tokens only look back), the authors developed a block-causal attention mask. This allows the model to process two distinct views (e.g., an English description and its corresponding SQL code) in a single context window without one view's representation leaking into the other's.

* * *

Empirical Results

The researchers tested LLM-JEPA across various model families including Llama 3, Gemma 2, and Apple’s OpenELM. LLM-JEPA outperformed the baseline on the reasoning-heavy datasets tested.

[Figure 2. Performance Gains on Reasoning Tasks (Accuracy %). Comparative performance analysis showing LLM-JEPA (red) consistently outperforming the baseline (navy).]

In tasks like Natural Language to Regular Expressions (NL-RX), the gains were large: LLM-JEPA raised the accuracy of Llama-3.2-1B-Instruct on NL-RX-SYNTH from 57.29% to 71.46%.
More importantly, the model showed a high resistance to overfitting. While standard fine-tuning often sees a dip in performance after a few epochs on small datasets, LLM-JEPA continued to improve, converging on higher final scores.

Accuracy of baseline next-token prediction (NTP) fine-tuning versus LLM-JEPA:

Model                    Dataset             Baseline (NTP)   LLM-JEPA
Llama-3.2-1B-Instruct    NL-RX-SYNTH         57.29%           71.46%
Gemma-2-2B-it            NL-RX-SYNTH         33.65%           43.12%
OpenELM-1_1B             NL-RX-SYNTH         12.07%           25.40%
Llama-3.2-1B-Instruct    GSM8K (Reasoning)   32.36%           36.36%
Llama-3.2-1B-Instruct    NQ-Open             20.12%           21.59%

Conclusion: A Step Toward World Models

LLM-JEPA is more than just a new loss function; it is a proof of concept for the next generation of AI. By forcing models to predict the structure of information in a latent space, we move closer to the "World Models" envisioned by AI researchers.

The primary bottleneck of LLM-JEPA is the 2-fold increase in compute cost during training. To offset this, the authors introduced random JEPA-loss dropout (LD): by dropping the JEPA term for some batches, they found they could achieve better accuracy than the baseline even with significantly fewer FLOPs.

"The future of AI isn't just about bigger datasets; it's about smarter objectives that capture the deep, geometric structure of knowledge."

If this technique scales from fine-tuning to massive pre-training runs, the gap between human-like reasoning and machine output may finally begin to close.
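
A Sketch of the Objective

To make the pieces described under "The LLM-JEPA Objective" and the loss-dropout idea more concrete, here is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. The function names (block_causal_mask, cosine_distance, llm_jepa_loss), the parameter jepa_drop_prob, and the choice of "one minus cosine similarity" as the distance are all assumptions made for illustration. The mask layout (each view causal within itself, with no attention between views) is likewise one plausible reading of "block-causal", not a confirmed detail of the paper.

```python
import math
import random


def block_causal_mask(len_a, len_b):
    """Build a boolean attention mask for two views packed into one sequence.

    mask[i][j] is True when token i may attend to token j.
    Assumption: each view is causal within its own block and
    neither view attends to the other.
    """
    n = len_a + len_b
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Tokens of the second view start attending at the second block.
        block_start = 0 if i < len_a else len_a
        for j in range(block_start, i + 1):
            mask[i][j] = True
    return mask


def cosine_distance(u, v):
    """One minus cosine similarity (an assumed choice of distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def llm_jepa_loss(l_llm, pred_text_emb, code_emb,
                  gamma=1.0, lam=1.0, jepa_drop_prob=0.0, rng=random):
    """Combine the next-token loss with a JEPA-style embedding term.

    l_llm          -- precomputed next-token cross-entropy loss (a float)
    pred_text_emb  -- Predictor(Enc(Text)), as a list of floats
    code_emb       -- Enc(Code), as a list of floats
    jepa_drop_prob -- hypothetical probability of skipping the JEPA term
                      for a batch (the "loss dropout" idea)
    """
    if rng.random() < jepa_drop_prob:
        # JEPA term dropped for this batch: only the generative loss remains.
        return gamma * l_llm
    return gamma * l_llm + lam * cosine_distance(pred_text_emb, code_emb)


if __name__ == "__main__":
    # Two tokens of text followed by two tokens of code.
    for row in block_causal_mask(2, 2):
        print(["x" if allowed else "." for allowed in row])
    # Orthogonal embeddings give the maximum penalty for this pair of views.
    print(llm_jepa_loss(2.0, [1.0, 0.0], [0.0, 1.0], gamma=1.0, lam=0.5))
```

In this sketch, skipping the JEPA term means the batch needs no embedding-space computation at all, which is where compute savings from loss dropout would come from. How the dropped batches are chosen in the paper is not specified here.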