LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

AI-generated keywords: Machine Learning

AI-generated Key Points

**Machine Learning:**
The broader field in which new techniques and frameworks are developed to improve performance and efficiency.
**Joint Embedding Predictive Architectures (JEPAs):**
Focus on learning world models in compact latent spaces.
**LeWorldModel (LeWM):**
Trains stably end-to-end from raw pixels using only two loss terms, reducing the tunable loss hyperparameters from six to one.
**Efficiency in Planning:**
LeWM can plan up to 48 times faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks.
**Physical Structure Encoding:**
Probing for physical quantities shows that LeWM's latent space encodes meaningful physical structure.

Authors: Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

arXiv: 2603.19312v1 (cs.LG)

License: CC BY 4.0

Abstract: Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

Submitted to arXiv on 13 Mar. 2026

Results of the summarizing process for the arXiv paper: 2603.19312v1

Joint Embedding Predictive Architectures (JEPAs) are a promising framework for learning world models in compact latent spaces. Existing methods are fragile, however: they rely on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to prevent representation collapse. LeWorldModel (LeWM) is introduced to address these problems. It is presented as the first JEPA that trains stably end-to-end from raw pixels, using only two loss terms: a next-embedding prediction loss and a regularizer that enforces Gaussian-distributed latent embeddings. This reduces the number of tunable loss hyperparameters from six to one compared with the only existing end-to-end alternative. With about 15M parameters, trainable on a single GPU in a few hours, LeWM plans up to 48 times faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Probing for physical quantities shows that its latent space encodes meaningful physical structure, and surprise evaluation confirms that the model reliably detects physically implausible events.

Summary

1. Machine learning is about finding new ways to make computers perform tasks better and faster.
2. JEPAs learn a compact internal picture of how the world behaves and changes.
3. LeWM learns directly from images, with only one loss setting to tune instead of six.
4. LeWM can plan ahead up to 48 times faster than world models built on large foundation models.
5. LeWM's internal representation captures real physical properties of the scenes it sees, and it can flag events that are physically implausible.

Definitions

- **Machine Learning:** Building systems that learn from data to perform tasks more effectively.
- **Joint Embedding Predictive Architectures (JEPAs):** Models that predict what happens next in a compact internal representation rather than in raw pixels.
- **LeWorldModel (LeWM):** A JEPA that learns from raw images with very little tuning.
- **Efficiency in Planning:** Being able to choose good actions quickly.
- **Physical Structure Encoding:** Capturing measurable physical properties of a scene in the model's internal representation.

Introduction

Joint Embedding Predictive Architectures (JEPAs) have emerged as a promising framework for learning world models in compact latent spaces. These models aim to capture the underlying structure and dynamics of an environment, which allows for efficient planning and decision-making. Existing methods are fragile, however, and often rely on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to prevent representation collapse. LeWorldModel (LeWM) was introduced to address these challenges. It is described as the first JEPA that trains stably end-to-end from raw pixels, and it needs only a single tunable loss hyperparameter. This article covers how LeWM works and what it could mean for machine learning.

The Need for Efficient Planning

Planning speed matters for any world model, because a planner has to query the model many times to compare candidate action sequences. Fast, accurate planning improves performance in tasks such as robotics control or game playing. World models built on large foundation models make each of those queries expensive. LeWM instead plans in a compact latent space with a model of about 15M parameters, and it is reported to plan up to 48 times faster while remaining competitive on control tasks.

The Features of LeWorldModel

LeWM plans up to 48 times faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Its main features are described below.

Stable End-to-End Training

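Unlike earlier JEPAs, which need complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse, LeWM trains stably end-to-end from raw pixels with only two loss terms: a next-embedding prediction loss and a regularizer that enforces Gaussian-distributed latent embeddings. This reduces the tunable loss hyperparameters from six to one compared with the only existing end-to-end alternative, which makes the model simpler to implement and tune.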
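As a rough illustration of the two-term objective described in the abstract (a next-embedding prediction loss plus a regularizer that pushes embeddings toward a Gaussian), here is a minimal NumPy sketch. The regularizer shown is a simple moment-matching penalty chosen for illustration; it is an assumption and not the regularizer used in the paper. The function names and the `reg_weight` parameter are likewise hypothetical.

```python
import numpy as np


def prediction_loss(pred_next, target_next):
    """Mean squared error between predicted and actual next embeddings."""
    return float(np.mean((pred_next - target_next) ** 2))


def gaussian_moment_penalty(z):
    """Stand-in regularizer (an assumption, not the paper's): push the batch
    mean toward 0 and the batch covariance toward the identity."""
    mean = z.mean(axis=0)
    centered = z - mean
    cov = centered.T @ centered / (z.shape[0] - 1)
    identity = np.eye(z.shape[1])
    return float(np.sum(mean ** 2) + np.sum((cov - identity) ** 2))


def total_loss(pred_next, target_next, z, reg_weight=1.0):
    """Two-term objective with a single tunable weight (hypothetical name)."""
    return prediction_loss(pred_next, target_next) + reg_weight * gaussian_moment_penalty(z)


# A collapsed encoder maps every frame to the same vector: prediction is
# trivially perfect, but the regularizer term is large.
collapsed = np.zeros((256, 8))
spread = np.random.default_rng(0).standard_normal((4096, 8))
collapsed_loss = total_loss(collapsed, collapsed, collapsed)
spread_penalty = gaussian_moment_penalty(spread)
```

In this toy setup the collapsed batch gets zero prediction loss but a penalty equal to the embedding dimension (8), while the Gaussian-looking batch gets a penalty close to zero. That is the failure mode a Gaussian regularizer is meant to rule out.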

Efficient Planning

LeWM plans in its compact latent space, predicting future embeddings instead of future pixels. Combined with a small model of about 15M parameters, this is reported to give planning up to 48 times faster than foundation-model-based world models.
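To make the idea of planning in a latent space concrete, the sketch below uses random shooting: sample many action sequences, roll each one forward with a latent predictor, and keep the sequence whose final embedding lands closest to a goal embedding. Random shooting is a generic technique chosen here for brevity, not necessarily the planner used in the paper. The `toy_predict` dynamics and all parameter names are assumptions made for illustration.

```python
import numpy as np


def rollout(z0, actions, predict):
    """Roll a latent state forward through a sequence of actions."""
    z = z0
    for a in actions:
        z = predict(z, a)
    return z


def plan_random_shooting(z0, z_goal, predict, horizon=5,
                         n_candidates=256, action_dim=2, seed=0):
    """Sample action sequences and return the one whose final latent state
    is closest (squared distance) to the goal, together with its cost."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    costs = [np.sum((rollout(z0, seq, predict) - z_goal) ** 2) for seq in candidates]
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])


def toy_predict(z, a):
    """Hypothetical latent dynamics: a 2-D latent nudged by the action."""
    return z + 0.1 * a


z_start = np.zeros(2)
z_goal = np.array([0.3, -0.2])
plan, cost = plan_random_shooting(z_start, z_goal, toy_predict)
start_cost = float(np.sum((z_start - z_goal) ** 2))
```

Every candidate is scored with cheap vector operations on small embeddings, which is why a compact latent space and a small predictor help planning speed: no images are generated during the search.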

Physical Structure Encoding

Probing LeWM's latent space for physical quantities shows that it encodes meaningful physical structure. In addition, surprise evaluation confirms that the model reliably detects physically implausible events.
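Probing usually means fitting a small model that tries to read a known quantity out of frozen embeddings; if the read-out works on held-out data, the embeddings contain that information. The sketch below fits a linear probe with least squares and reports R² on a held-out split. The choice of a linear probe, the synthetic data, and all names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def add_bias(latents):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([latents, np.ones((latents.shape[0], 1))])


def fit_linear_probe(latents, targets):
    """Least-squares fit from latent vectors to a scalar quantity."""
    weights, *_ = np.linalg.lstsq(add_bias(latents), targets, rcond=None)
    return weights


def probe_r2(latents, targets, weights):
    """Coefficient of determination of the probe's predictions."""
    pred = add_bias(latents) @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic stand-in data: one quantity is linearly encoded in the latents,
# the other is unrelated noise.
rng = np.random.default_rng(0)
latents = rng.standard_normal((600, 16))
quantity = latents @ rng.standard_normal(16) + 0.05 * rng.standard_normal(600)
noise_target = rng.standard_normal(600)

train, test = slice(0, 400), slice(400, 600)
r2_encoded = probe_r2(latents[test], quantity[test],
                      fit_linear_probe(latents[train], quantity[train]))
r2_absent = probe_r2(latents[test], noise_target[test],
                     fit_linear_probe(latents[train], noise_target[train]))
```

A high held-out R² for the encoded quantity and a near-zero (or negative) R² for the unrelated one is the kind of contrast a probing study looks for.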

Applications of LeWorldModel

LeWM is evaluated on diverse 2D and 3D control tasks. Its fast latent-space planning could also make it a fit for robotics control, game playing, or real-time decision-making in complex environments. A latent space that encodes physical structure may likewise be useful in areas such as physics simulation or predictive maintenance, although these uses remain speculative.

Conclusion

In conclusion, LeWorldModel (LeWM) offers a new approach to Joint Embedding Predictive Architectures that addresses the fragility of existing methods. Stable end-to-end training from raw pixels with a two-term loss, fast planning in a compact latent space, and a latent space that encodes physical structure make it a promising framework for learning world models. Further research will show how well these results carry over to other machine learning tasks.

Created on 24 May. 2026
