Beyond Simulation: Benchmarking World Models for Planning and Causality in Autonomous Driving

AI-generated keywords: World models, Traffic simulators, Policy training, Metrics, Causal agents

AI-generated Key Points

  • World models are increasingly used as learned traffic simulators, and recent research has explored using them in place of traditional traffic simulators for policy training.
  • This study assesses the robustness of existing metrics for evaluating world models as traffic simulators and pseudo-environments for policy training.
  • The researchers analyze the metametric used in the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where agents are fully or only partially controlled by the world model (partial replay).
  • The study extends the evaluation domain of WOSAC to include agents with a causal relationship with the ego vehicle, aiming to evaluate ego action-conditioned world models.
  • New metrics are proposed to highlight the sensitivity of world models to uncontrollable objects and gauge their performance as pseudo-environments for policy training.
  • Realistic simulation of causal agents influencing ego vehicle behavior is crucial for effective autonomous driving planning agent training.
  • Penalizing planning agents for mistakes made by traffic simulation can lead to less well-defined behavior, potentially resulting in overly cautious driving strategies.
  • This work introduces new metrics for assessing world models as data-driven traffic simulators, offering deeper insights into separate ego policy and traffic simulator performance compared to traditional evaluation methods like WOSAC's metametric.

Authors: Hunter Schofield, Mohammed Elmahgiubi, Kasra Rezaee, Jinjun Shan

Accepted ICRA 2025
License: CC BY 4.0

Abstract: World models have become increasingly popular in acting as learned traffic simulators. Recent work has explored replacing traditional traffic simulators with world models for policy training. In this work, we explore the robustness of existing metrics to evaluate world models as traffic simulators to see if the same metrics are suitable for evaluating a world model as a pseudo-environment for policy training. Specifically, we analyze the metametric employed by the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where the agents are fully or partially controlled by the world model (partial replay). Furthermore, since we are interested in evaluating the ego action-conditioned world model, we extend the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle. Our evaluations reveal a significant number of scenarios where top-ranking models perform well under no perturbation but fail when the ego agent is forced to replay the original trajectory. To address these cases, we propose new metrics to highlight the sensitivity of world models to uncontrollable objects and evaluate the performance of world models as pseudo-environments for policy training and analyze some state-of-the-art world models under these new metrics.
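The partial-replay setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `rollout`, the array layout `[T, N, D]`, and the toy `constant_velocity` stand-in for a world model are all assumptions made for illustration. The only idea taken from the text is that some agents (for example the ego) are forced to follow their logged trajectory while the world model controls the rest.

```python
import numpy as np

def rollout(step_fn, logged, replay_ids, history_len):
    """Roll out a scene while replaying logged states for selected agents.

    logged:      array [T, N, D] of logged agent states (hypothetical layout).
    step_fn:     hypothetical world-model step; maps the simulated history
                 [t, N, D] to the next states [N, D] for all agents.
    replay_ids:  indices of agents forced to follow the log (partial replay).
                 An empty list means the model controls every agent.
    history_len: number of initial logged steps given to the model as context.
    """
    total_steps = logged.shape[0]
    replay_ids = np.asarray(replay_ids, dtype=int)
    sim = logged[:history_len].copy()
    for t in range(history_len, total_steps):
        nxt = np.asarray(step_fn(sim), dtype=float).copy()
        # Overwrite the model's prediction with the logged state for replayed agents.
        nxt[replay_ids] = logged[t, replay_ids]
        sim = np.concatenate([sim, nxt[None]], axis=0)
    return sim

def constant_velocity(history):
    """Toy stand-in for a world model: extrapolate the last displacement."""
    return history[-1] + (history[-1] - history[-2])

# Two agents, 1-D positions stored in the first state dimension.
logged = np.zeros((5, 2, 2))
logged[:, 0, 0] = [0, 1, 4, 9, 16]   # agent 0 (treated as the ego) accelerates
logged[:, 1, 0] = [0, 2, 4, 7, 7]    # agent 1 brakes near the end

full = rollout(constant_velocity, logged, replay_ids=[], history_len=2)
part = rollout(constant_velocity, logged, replay_ids=[0], history_len=2)
```

In `full` both agents are extrapolated by the toy model, while in `part` agent 0 reproduces its log exactly. With a real interaction-aware model, the two rollouts of agent 1 could differ, and that difference is what a comparison between the two settings would probe.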

Submitted to arXiv on 03 Aug. 2025

Results of the summarizing process for the arXiv paper: 2508.01922v1

World models are increasingly used as learned traffic simulators, and recent research has explored using them in place of traditional traffic simulators for policy training. This study examines whether the metrics currently used to evaluate world models as traffic simulators are robust, and whether the same metrics are suitable for evaluating a world model as a pseudo-environment for policy training. The researchers analyze the metametric used in the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios in which agents are either fully controlled by the world model or only partially controlled (partial replay). Because the target of the evaluation is an ego action-conditioned world model, the study also extends the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle.

The evaluations uncover a significant number of scenarios in which top-ranking models perform well under no perturbation but fail when the ego agent is forced to replay its original trajectory. To address these cases, the authors propose new metrics that highlight the sensitivity of world models to uncontrollable objects and gauge their performance as pseudo-environments for policy training. The study emphasizes the importance of evaluating causal agents, since traffic rollout performance differs depending on which domain of agents is simulated. Realistic simulation of the agents that influence the ego vehicle's behavior is crucial for training an autonomous driving planning agent: if the planner is penalized for mistakes made by the traffic simulation, its learned behavior becomes less well defined and may drift toward overly cautious or avoidant driving.

In conclusion, this work introduces new metrics for assessing world models as data-driven traffic simulators. Compared with traditional evaluation methods such as WOSAC's metametric, these metrics offer deeper insight into ego policy performance and traffic simulator performance as separate quantities. By exploring different evaluation domains and emphasizing realistic simulation of causal agents, the study contributes to improving autonomous driving planning within world model simulations.
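The finding that some scenarios score well under no perturbation but poorly under ego replay can be made concrete with a small sketch. This is not one of the metrics proposed in the paper, which this summary does not define. The function `replay_sensitivity`, the thresholds `good` and `bad`, and the assumption that per-scenario scores lie in [0, 1] with higher being better are all hypothetical.

```python
import numpy as np

def replay_sensitivity(score_full, score_replay, good=0.7, bad=0.5):
    """Compare per-scenario scores with and without ego replay.

    score_full:   scores when the world model controls all agents.
    score_replay: scores for the same scenarios when the ego replays its log.
    good, bad:    hypothetical thresholds for "performs well" and "fails".

    Returns the mean score drop, the fraction of flagged scenarios, and the
    indices of scenarios that score well unperturbed but fail under replay.
    """
    score_full = np.asarray(score_full, dtype=float)
    score_replay = np.asarray(score_replay, dtype=float)
    drop = score_full - score_replay
    flagged = (score_full >= good) & (score_replay < bad)
    return drop.mean(), flagged.mean(), np.flatnonzero(flagged)

# Made-up scores for four scenarios, for illustration only.
mean_drop, frac_flagged, idx = replay_sensitivity(
    [0.80, 0.75, 0.60, 0.90],
    [0.78, 0.40, 0.55, 0.45],
)
```

Reporting the flagged scenarios alongside an aggregate score is one way to keep simulator failures from being hidden by a high average.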
Created on 07 Feb. 2026

