GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

AI-generated keywords: Artificial Intelligence

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO).
These RL methods often require thousands of rollouts to learn a new task.
GEPA (Genetic-Pareto), a prompt optimizer, uses natural language reflection to learn high-level rules from trial and error.
GEPA diagnoses problems, proposes and tests prompt updates, and combines complementary lessons from the Pareto frontier of its own attempts.
GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35 times fewer rollouts.
GEPA outperforms MIPROv2, the leading prompt optimizer, by over 10% (for example, +12% accuracy on AIME-2025).
The research has been accepted at ICLR 2026 for an oral presentation.
The code for GEPA is openly available on GitHub at https://github.com/gepa-ai/gepa. The authors are Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab.


Authors: Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

arXiv: 2507.19457v2 (cs.CL)

Accepted to ICLR 2026 (Oral). Code: https://github.com/gepa-ai/gepa

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +12% accuracy on AIME-2025), and demonstrates promising results as an inference-time search strategy for code optimization. We release our code at https://github.com/gepa-ai/gepa .

Submitted to arXiv on 25 Jul. 2025


Results of the summarizing process for the arXiv paper: 2507.19457v2


Comprehensive Summary

Large language models (LLMs) are increasingly adapted to downstream tasks through reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO). These methods often require thousands of rollouts to learn a new task. Arguing that the interpretable nature of language gives LLMs a much richer learning medium than policy gradients derived from sparse, scalar rewards, the authors introduce GEPA (Genetic-Pareto), a prompt optimizer built around natural language reflection. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (for example reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. This design often lets GEPA turn even a few rollouts into a large quality gain. Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35 times fewer rollouts. GEPA also outperforms MIPROv2, the leading prompt optimizer, by over 10% (for example, +12% accuracy on AIME-2025), and shows promise as an inference-time search strategy for code optimization. The paper has been accepted at ICLR 2026 as an oral presentation, and the code is openly available on GitHub at https://github.com/gepa-ai/gepa. The authors are Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab.
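The reflect, propose, and test cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GEPA's actual implementation or API: the names `evolve_prompt`, `toy_run`, and `toy_reflect` are hypothetical, the "LLM" is replaced by deterministic toy functions, and the Pareto-based combination step is left out entirely.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: a rollout returns (score, textual trace);
# a reflector reads the current prompt plus traces and returns a new prompt.
Rollout = Callable[[str, str], Tuple[float, str]]
Reflector = Callable[[str, List[str]], str]


def evolve_prompt(seed_prompt: str, tasks: List[str], run: Rollout,
                  reflect: Reflector, budget: int, minibatch_size: int = 2,
                  rng: Optional[random.Random] = None) -> str:
    """Greedy reflective prompt search under a fixed rollout budget."""
    rng = rng or random.Random(0)
    best = seed_prompt
    rollouts = 0
    # Each iteration costs two passes over the minibatch (old and new prompt).
    while rollouts + 2 * minibatch_size <= budget:
        batch = rng.sample(tasks, minibatch_size)
        results = [run(best, task) for task in batch]
        rollouts += len(batch)
        old_score = sum(score for score, _ in results)
        traces = [trace for _, trace in results]
        # Reflection: turn the textual traces into a proposed prompt update.
        candidate = reflect(best, traces)
        new_score = sum(run(candidate, task)[0] for task in batch)
        rollouts += len(batch)
        # Test: keep the update only if it scores strictly higher on the batch.
        if new_score > old_score:
            best = candidate
    return best


def toy_run(prompt: str, task: str) -> Tuple[float, str]:
    """Toy stand-in for a rollout: succeed if the prompt mentions the rule."""
    if task in prompt:
        return 1.0, f"ok: handled '{task}'"
    return 0.0, f"failed: missing rule '{task}'"


def toy_reflect(prompt: str, traces: List[str]) -> str:
    """Toy stand-in for LLM reflection: add every rule a trace reports missing."""
    missing = [t.split("'")[1] for t in traces if t.startswith("failed")]
    return prompt + "".join(f" Always {rule}." for rule in missing)


tasks = ["show units", "cite sources", "check arithmetic"]
print(evolve_prompt("Answer the question.", tasks, toy_run, toy_reflect, budget=40))
```

In a real system both `run` and `reflect` would call a language model, and the acceptance test would use a task-specific metric rather than substring matching.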


Layman's Summary

Summary
- Big computer programs called large language models can be taught to do new jobs. One new tool called GEPA is really good at learning from mistakes and figuring out how to do things better.
- GEPA needs much less practice than other methods, like GRPO. It can use up to 35 times fewer practice runs and still do 6% better on average, and up to 20% better.
- GEPA is even better than another tool called MIPROv2, doing over 10% better.
- The people who made GEPA will present it at a research conference called ICLR in 2026.
- You can see the code for GEPA on the internet for free.

Definitions
- Large language models (LLMs): Big computer programs that read and write language.
- Reinforcement learning: A way for computers to learn by trying things out and getting rewards for doing well.
- Group Relative Policy Optimization (GRPO): A specific method of reinforcement learning used with large language models.
- Novel: Something new or different that hasn't been seen before.
- Prompt optimizer: A tool that improves the written instructions (prompts) given to a language model so that it does a task better.
- Pareto frontier: The set of best options when there are trade-offs between goals. An option is on the frontier if no other option is at least as good on every goal and better on at least one.
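The Pareto frontier definition above can be made concrete with a short Python example. This is the generic textbook definition applied to made-up scores; the prompt names and numbers are hypothetical, and it is not necessarily how GEPA itself builds or uses its frontier.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b on every goal and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Return the names of all candidates that no other candidate dominates."""
    return [
        name
        for name, own in scores.items()
        if not any(dominates(other, own)
                   for other_name, other in scores.items()
                   if other_name != name)
    ]


# Hypothetical per-task scores for four candidate prompts (higher is better).
scores = {
    "prompt_a": [0.9, 0.2, 0.5],
    "prompt_b": [0.4, 0.8, 0.5],
    "prompt_c": [0.3, 0.1, 0.4],  # worse than prompt_a on every task
    "prompt_d": [0.6, 0.6, 0.6],
}
print(pareto_frontier(scores))  # ['prompt_a', 'prompt_b', 'prompt_d']
```

Here `prompt_c` drops out because `prompt_a` beats it everywhere, while the other three each win some trade-off and so stay on the frontier.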

Blog article

In recent years, large language models (LLMs) have become increasingly popular in the field of artificial intelligence. These models are often adapted to new tasks using reinforcement learning (RL) methods such as Group Relative Policy Optimization (GRPO). However, one major challenge with these methods is that they often require thousands of rollouts to learn a new task. Recognizing the potential of natural language as a richer learning medium for LLMs than sparse, scalar rewards, a team of researchers has introduced GEPA (Genetic-Pareto), a prompt optimizer. Their paper, titled "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning", has been accepted at ICLR 2026 as an oral presentation. The team behind this research includes Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab.

So what exactly is GEPA, and how does it compare with existing reinforcement learning methods? Let's dive into the details.

GEPA uses natural language reflection to learn high-level rules from trial and error. In simpler terms, it reads back over its own attempts at solving a task, described in plain language, and learns from them. GEPA samples trajectories (such as reasoning, tool calls, and tool outputs), reflects on them in natural language to diagnose problems, and proposes prompt updates that may improve performance. It then tests these updates and combines complementary lessons from the Pareto frontier of its own attempts, that is, the attempts that no other attempt beats across the board. This design often lets GEPA turn just a few rollouts into a large quality gain.

Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35 times fewer rollouts. Needing fewer rollouts can save time and computational resources for researchers and practitioners.

But that's not all: GEPA also outperforms MIPROv2, the leading prompt optimizer, by over 10%, including a 12% accuracy improvement on AIME-2025.

The research team also explored GEPA as an inference-time search strategy for code optimization. In simpler terms, this means running GEPA's search while the model is being used, rather than while it is being trained, to look for better code. The authors describe the results as promising.

The code for GEPA is openly available on GitHub at https://github.com/gepa-ai/gepa. This allows other researchers and practitioners to use and build upon the work.

In conclusion, GEPA is a prompt optimizer based on natural language reflection that is presented as an alternative to reinforcement learning for adapting LLMs to new tasks. It learns from trial and error by reflecting in natural language rather than relying only on scalar rewards, and it reports better results than GRPO and MIPROv2 across several tasks, along with early promise for code optimization.
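To make "up to 35 times fewer rollouts" concrete, the arithmetic can be worked through in Python. The baseline of 10,000 rollouts is a hypothetical number chosen only for illustration, and `rollouts_needed` is a made-up helper; only the factor of 35 comes from the abstract.

```python
def rollouts_needed(baseline_rollouts: int, reduction_factor: float) -> int:
    """Rollouts used by a method that needs `reduction_factor` times fewer."""
    return round(baseline_rollouts / reduction_factor)


baseline = 10_000  # hypothetical RL rollout budget, not a figure from the paper
reduced = rollouts_needed(baseline, 35)

print(reduced)                              # 286
print(f"{reduced / baseline:.1%} of the baseline budget")  # 2.9% of the baseline budget
```

Under this assumption, a 35x reduction means using roughly 3% of the baseline rollout budget.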

Created on 13 Apr. 2026


