GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

AI-generated keywords: Artificial Intelligence

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO).
  • Such RL methods often require thousands of rollouts to learn new tasks.
  • GEPA, a novel prompt optimizer, leverages natural language reflection to extract high-level rules from trial and error experiences.
  • GEPA diagnoses problems, proposes and tests prompt updates, and combines complementary lessons from the Pareto frontier of its own attempts.
  • Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35 times fewer rollouts.
  • GEPA outperforms MIPROv2, the leading prompt optimizer, by over 10%, including a 12% accuracy gain on AIME-2025.
  • The research has been accepted at ICLR 2026 for an oral presentation.
  • The code for GEPA is openly available on GitHub. The authors are Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab.

Authors: Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

Accepted to ICLR 2026 (Oral). Code: https://github.com/gepa-ai/gepa

Abstract: Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +12% accuracy on AIME-2025), and demonstrates promising results as an inference-time search strategy for code optimization. We release our code at https://github.com/gepa-ai/gepa .
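
The abstract mentions combining lessons from "the Pareto frontier of its own attempts" without spelling out how that frontier is computed. As a minimal sketch of the general idea only (not the authors' implementation), the Python below assumes each candidate prompt has one score per evaluation example and keeps every candidate that no other candidate dominates. The names `dominates`, `pareto_frontier`, and the sample scores are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(
            dominates(other, s)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-example scores for three candidate prompts (illustrative only).
scores = {
    "prompt_a": [1.0, 0.0, 1.0],
    "prompt_b": [0.0, 1.0, 1.0],
    "prompt_c": [0.0, 0.0, 1.0],
}

frontier = pareto_frontier(scores)
print(frontier)  # ['prompt_a', 'prompt_b']
```

Here `prompt_a` and `prompt_b` each win on an example the other loses, so both stay on the frontier, while `prompt_c` is dropped because `prompt_a` matches or beats it on every example. Keeping several complementary candidates, rather than only the one with the best average, is what makes it possible to combine their lessons later.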

Submitted to arXiv on 25 Jul. 2025


Results of the summarizing process for the arXiv paper: 2507.19457v2

Large language models (LLMs) are increasingly adapted to downstream tasks through reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO). However, these methods often require thousands of rollouts to learn new tasks. Arguing that the interpretable nature of language offers LLMs a richer learning medium than policy gradients derived from sparse, scalar rewards, the authors introduce GEPA (Genetic-Pareto), a prompt optimizer. GEPA uses natural language reflection to learn high-level rules from trial and error. Given an AI system containing one or more LLM prompts, it samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. This design often lets GEPA turn even a few rollouts into a large quality gain.

Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35 times fewer rollouts. GEPA also outperforms MIPROv2, the leading prompt optimizer, by over 10%, including a 12% accuracy gain on AIME-2025, and shows promising results as an inference-time search strategy for code optimization.

The paper has been accepted at ICLR 2026 as an oral presentation. The code for GEPA is openly available on GitHub at https://github.com/gepa-ai/gepa. The authors are Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab.
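
The diagnose, propose, and test cycle in this summary can be pictured with a toy loop. The Python sketch below is an assumption-laden illustration, not GEPA itself: the prompt is modeled as a list of rules, `evaluate` is a hypothetical stand-in for one rollout that returns a score plus textual feedback, and `reflect` is a stub where a real system would call an LLM to turn that feedback into a prompt edit. All names and rules are invented for the example.

```python
from typing import List, Tuple

# Hypothetical rules the toy task rewards (illustrative only).
REQUIRED_RULES = ["cite sources", "show steps", "state units"]
FEEDBACK_PREFIX = "missing rule: "


def evaluate(prompt: List[str]) -> Tuple[float, str]:
    """Toy rollout: score the prompt and return natural-language feedback."""
    missing = [rule for rule in REQUIRED_RULES if rule not in prompt]
    score = 1 - len(missing) / len(REQUIRED_RULES)
    feedback = FEEDBACK_PREFIX + missing[0] if missing else "ok"
    return score, feedback


def reflect(prompt: List[str], feedback: str) -> List[str]:
    """Stub for LLM reflection: read the feedback and propose an updated prompt."""
    if feedback.startswith(FEEDBACK_PREFIX):
        return prompt + [feedback[len(FEEDBACK_PREFIX):]]
    return prompt


def optimize(prompt: List[str], budget: int) -> Tuple[List[str], float, int]:
    """Propose edits from feedback and keep a candidate only if its score improves."""
    best = prompt
    best_score, feedback = evaluate(best)
    rollouts = 1
    while rollouts < budget and best_score < 1.0:
        candidate = reflect(best, feedback)
        score, candidate_feedback = evaluate(candidate)
        rollouts += 1
        if score > best_score:  # test the proposal before accepting it
            best, best_score, feedback = candidate, score, candidate_feedback
    return best, best_score, rollouts


best_prompt, best_score, rollouts = optimize(["be concise"], budget=10)
print(best_prompt, best_score, rollouts)
```

Because each rollout returns a textual diagnosis rather than only a scalar reward, every evaluation in this toy setting points directly at the next edit, which is the intuition behind learning from few rollouts. A real optimizer would also need the candidate pool and frontier selection that this single-candidate sketch leaves out.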
Created on 13 Apr. 2026

