Demystifying GPT Self-Repair for Code Generation

AI-generated keywords: Large Language Models

AI-generated Key Points

  • Large Language Models (LLMs) are impressive in code generation but struggle with challenging programming tasks.
  • Self-repair has emerged as a popular technique to boost performance, but limited studies exist on its effectiveness.
  • GPT-3.5 and GPT-4 were analyzed for their ability to perform self-repair on the APPS dataset, which consists of diverse coding challenges.
  • A new evaluation strategy called pass@t was established to measure the pass rate of tasks against the total number of tokens sampled from the model.
  • Only GPT-4 is capable of carrying out self-repair on challenging coding tasks effectively.
  • The effectiveness of self-repair is bottlenecked by the feedback stage; using expert human programmers to give feedback unlocks significant performance gains.
  • Research on LLMs comes at a high environmental cost due to their computational requirements.
  • This work contributes to program synthesis with large language models by evaluating models from the perspective of minimizing the number of samples needed, rather than by the raw accuracy or pass@k metrics commonly used in prior literature.
  • The authors demonstrate that replacing GPT-4's self-generated feedback with feedback provided by an experienced programmer increases the number of repaired programs passing all unit tests significantly.
  • This paper presents valuable insights into the effectiveness of self-repair in LLMs for challenging coding tasks and highlights broader impacts related to productivity improvements and environmental costs.

Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama

Submitted to NeurIPS 2023
License: CC BY 4.0

Abstract: Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair -- in which the model debugs and fixes mistakes in its own code -- has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4's ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.

Submitted to arXiv on 16 Jun. 2023


Results of the summarizing process for the arXiv paper: 2306.09896v1

Large Language Models (LLMs) have shown impressive capabilities in code generation, but they still struggle with challenging programming tasks. Self-repair, in which the model debugs and fixes mistakes in its own code, has emerged as a popular technique to boost performance, yet few studies examine how and when it works effectively. This paper analyzes the ability of GPT-3.5 and GPT-4 to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges.

The authors establish a new evaluation strategy called pass@t, which measures the pass rate of tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. They find that only GPT-4 is capable of carrying out self-repair effectively on challenging coding tasks. The effectiveness of self-repair is bottlenecked by the feedback stage: using GPT-4 to give feedback on programs generated by GPT-3.5, and using expert human programmers to give feedback on programs generated by GPT-4, unlocks significant performance gains.

The paper also discusses broader impacts: productivity improvements can benefit both legitimate software development and malicious uses. Additionally, research on LLMs comes at a high environmental cost due to its computational requirements.

This work contributes to program synthesis with large language models by evaluating models from the perspective of minimizing the number of samples needed, rather than by the raw accuracy or pass@k metrics commonly used in prior literature. It assumes access to full input-output examples, unlike some prior work that distinguishes between public and private tests for filtering purposes. Finally, this work differs from most prior research in code repair by using textual feedback provided by the model itself rather than relying solely on statistical or learning-based techniques for repairing human-written code.
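
The pass@t idea described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' estimator: each task is assumed to come with an ordered list of samples, each sample records how many tokens it cost and whether its program passed all unit tests, and a task counts as solved at budget t if the cumulative token cost up to its first passing sample is at most t. The names `Sample`, `tokens_to_first_pass`, and `pass_at_t` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    tokens: int   # tokens drawn from the model to produce this sample (assumed known)
    passed: bool  # whether the resulting program passed all unit tests


def tokens_to_first_pass(samples: List[Sample]) -> Optional[int]:
    """Cumulative tokens spent up to and including the first passing sample."""
    spent = 0
    for sample in samples:
        spent += sample.tokens
        if sample.passed:
            return spent
    return None  # no sample ever passed


def pass_at_t(tasks: List[List[Sample]], budget: int) -> float:
    """Fraction of tasks solved within `budget` sampled tokens (simplified)."""
    if not tasks:
        return 0.0
    solved = 0
    for samples in tasks:
        cost = tokens_to_first_pass(samples)
        if cost is not None and cost <= budget:
            solved += 1
    return solved / len(tasks)


# Toy data, invented purely for illustration.
toy_tasks = [
    [Sample(300, False), Sample(250, True)],   # solved after 550 tokens
    [Sample(400, True)],                       # solved after 400 tokens
    [Sample(500, False), Sample(500, False)],  # never solved
]

for budget in (100, 400, 550):
    print(budget, pass_at_t(toy_tasks, budget))
```

Because the x-axis is tokens rather than number of samples, a repair strategy that spends extra tokens on feedback and repair is charged for them, which is what makes the comparison against plain resampling fair.
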
The authors demonstrate that replacing GPT-4's self-generated feedback with feedback provided by an experienced programmer significantly increases the number of repaired programs that pass all unit tests. In conclusion, this paper presents valuable insights into the effectiveness of self-repair in LLMs for challenging coding tasks and highlights broader impacts related to productivity improvements and environmental costs.
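
To make the role of the feedback stage concrete, the following Python sketch shows one way a self-repair loop with a swappable feedback source could be structured. It is an illustration under assumptions, not the authors' implementation: `self_repair` and every `toy_*` or `*_feedback` function are hypothetical stand-ins, and no real model API is called. The feedback provider is passed in as a plain callable, so the same loop can be driven by the generating model, a stronger model, or a human.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces; each callable stands in for a model or a human.
Generate = Callable[[str], str]                # spec -> program
Feedback = Callable[[str, str, str], str]      # spec, program, error -> feedback
Repair = Callable[[str, str, str, str], str]   # spec, program, error, feedback -> program
RunTests = Callable[[str], Optional[str]]      # program -> None if all tests pass, else error


def self_repair(
    spec: str,
    generate: Generate,
    give_feedback: Feedback,
    repair: Repair,
    run_tests: RunTests,
    max_repairs: int = 1,
) -> Tuple[str, bool]:
    """Generate a program, then alternate feedback and repair until tests pass."""
    program = generate(spec)
    error = run_tests(program)
    for _ in range(max_repairs):
        if error is None:
            break
        feedback = give_feedback(spec, program, error)
        program = repair(spec, program, error, feedback)
        error = run_tests(program)
    return program, error is None


# Toy stand-ins, invented for illustration only.
def toy_generate(spec: str) -> str:
    return "def add(a, b):\n    return a - b\n"  # deliberately buggy


def toy_run_tests(program: str) -> Optional[str]:
    namespace: dict = {}
    exec(program, namespace)  # acceptable here only because the toy program is trusted
    result = namespace["add"](2, 3)
    return None if result == 5 else f"add(2, 3) returned {result}, expected 5"


def good_feedback(spec: str, program: str, error: str) -> str:
    return "The function subtracts; use + instead of -."


def weak_feedback(spec: str, program: str, error: str) -> str:
    return "The code looks fine."


def toy_repair(spec: str, program: str, error: str, feedback: str) -> str:
    # The repair step only succeeds when the feedback points at the real bug.
    return program.replace("a - b", "a + b") if "+" in feedback else program


print(self_repair("add two numbers", toy_generate, good_feedback, toy_repair, toy_run_tests)[1])
print(self_repair("add two numbers", toy_generate, weak_feedback, toy_repair, toy_run_tests)[1])
```

In this toy setup the repair step is identical in both runs; only the quality of the feedback changes the outcome, which mirrors the bottleneck the summary describes.
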
Created on 20 Jun. 2023

