SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

AI-generated keywords: Large Language Model Agents

AI-generated Key Points

  • Large language model (LLM) agents require effective multi-turn interactions in real-world tasks
  • Existing multi-turn reinforcement learning (RL) algorithms struggle to assign credit across multiple turns while leveraging the generalization capabilities of LLMs
  • Introduction of ColBench benchmark for LLM agent collaboration with human partners in backend programming and frontend design tasks
  • Proposal of SWEET-RL algorithm utilizing step-level rewards to enhance policy model performance
  • SWEET-RL outperforms other state-of-the-art multi-turn RL algorithms by achieving a 6% absolute improvement in success and win rates on ColBench
  • This improvement enables Llama-3.1-8B to match or exceed GPT4-o performance in realistic collaborative content creation

Authors: Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li

29 pages, 16 figures
License: CC BY 4.0

Abstract: Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.

Submitted to arXiv on 19 Mar. 2025

Results of the summarizing process for the arXiv paper: 2503.15478v1

Large language model (LLM) agents must handle multi-turn interactions in real-world tasks. However, existing multi-turn reinforcement learning (RL) algorithms for optimizing LLM agents struggle to assign credit across turns while leveraging the generalization capabilities of LLMs, and it remains unclear how to develop algorithms that do both. To study this problem, the authors introduce a new benchmark, ColBench, in which an LLM agent collaborates with a human partner over multiple turns on practical tasks in backend programming and frontend design.
Building on this benchmark, the authors propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information). It uses a carefully designed optimization objective to train a critic model that has access to additional training-time information. The critic then provides step-level rewards for improving the policy model. In the reported experiments, SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench over other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.
The difficulty of complex sequential decision-making has prompted several approaches for LLM agents, including applying successful single-turn RLHF algorithms such as RAFT and PPO, as well as value function learning methods such as TD-learning. However, these methods often lack explicit credit assignment across turns or suffer from high sample complexity because of task complexity and long horizons.
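To make the idea of step-level credit assignment concrete, here is a minimal Python sketch. It is not the paper's optimization objective, which this summary does not spell out. It only illustrates the general pattern described above: a critic that can see extra training-time information scores candidate replies at each turn, so every turn gets its own learning signal instead of sharing one end-of-episode reward. The names `Turn`, `Critic`, `step_level_pairs`, and `toy_critic` are hypothetical, and turning the scores into chosen/rejected pairs is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    history: str            # dialogue so far (hypothetical string representation)
    candidates: List[str]   # alternative agent replies sampled at this turn


# Hypothetical critic signature: (history, candidate, hidden_info) -> score.
# hidden_info stands for training-time information the policy never sees.
Critic = Callable[[str, str, str], float]


def step_level_pairs(
    turns: List[Turn], critic: Critic, hidden_info: str
) -> List[Tuple[str, str, str]]:
    """Return one (history, chosen, rejected) triple per turn.

    Each turn is ranked on its own, so the signal is attached to a
    specific turn rather than to the whole trajectory.
    """
    pairs = []
    for turn in turns:
        if len(turn.candidates) < 2:
            continue  # nothing to compare at this turn
        ranked = sorted(
            turn.candidates,
            key=lambda c: critic(turn.history, c, hidden_info),
        )
        pairs.append((turn.history, ranked[-1], ranked[0]))
    return pairs


def toy_critic(history: str, candidate: str, hidden_info: str) -> float:
    """Toy stand-in for a learned critic: word overlap with the hidden info."""
    reference = set(hidden_info.lower().split())
    return float(len(reference & set(candidate.lower().split())))
```

In a real system the critic would be a trained model and the resulting per-turn pairs or scores would feed a policy update; both of those parts are left out of this sketch.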
ColBench is designed to address these challenges and to support future research on multi-turn RL algorithms for realistic LLM agent scenarios. The benchmark focuses on artifact creation tasks in which agents collaborate with humans to produce final artifacts, such as code or web pages, that meet human expectations. Because it uses LLMs as human simulators and functional evaluators for reliable assessment, researchers can explore more effective strategies for training general, capable, and goal-directed agents in collaborative settings. Together, SWEET-RL and ColBench represent progress towards improving the performance of LLM agents in multi-turn interactions across diverse real-world tasks.
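The collaboration setup described above, with an agent, a simulated human, and a functional evaluator that scores the final artifact, can be sketched as a simple episode loop. This is an illustrative sketch only and does not reproduce ColBench's actual interface. The type aliases, the `run_episode` function, the `max_turns` limit, and the `FINAL:` submission marker are all hypothetical choices made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: each callable receives the dialogue so far.
Agent = Callable[[List[str]], str]
Simulator = Callable[[List[str]], str]
Evaluator = Callable[[str], float]  # scores the final artifact


def run_episode(
    agent: Agent,
    simulator: Simulator,
    evaluator: Evaluator,
    task: str,
    max_turns: int = 3,
    final_prefix: str = "FINAL:",
) -> Tuple[List[str], float]:
    """Alternate agent and simulated-human turns until an artifact is submitted.

    The reward arrives only at the end of the episode, which is why
    assigning credit to individual turns is hard.
    """
    dialogue = [f"task: {task}"]
    for _ in range(max_turns):
        reply = agent(dialogue)
        dialogue.append(f"agent: {reply}")
        if reply.startswith(final_prefix):
            artifact = reply[len(final_prefix):].strip()
            return dialogue, evaluator(artifact)
        dialogue.append(f"human: {simulator(dialogue)}")
    return dialogue, 0.0  # no artifact submitted within the turn budget
```

Here the agent may spend early turns asking clarifying questions and is scored only on what it finally submits, which mirrors the artifact creation setting described in the summary.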
Created on 28 Mar. 2025
