Program Synthesis with Large Language Models

AI-generated keywords: Program Synthesis, Language Models, MBPP, MathQA-Python, Dialog

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Large language models for program synthesis in general-purpose programming languages
  • Evaluation of models on two benchmarks: MBPP and MathQA-Python
  • MBPP dataset: 974 programming tasks designed to be solvable by entry-level programmers
  • MathQA-Python dataset: 23,914 problems that test code synthesis from more complex text
  • Performance scales log-linearly with model size on both datasets
  • Largest models can synthesize solutions for 59.6% of MBPP problems using few-shot learning
  • Fine-tuning improves performance by approximately 10 percentage points across most model sizes
  • Largest fine-tuned model achieves an accuracy of 83.8% on MathQA-Python dataset
  • Incorporating human feedback reduces error rates by half compared to initial model predictions
  • Error analysis reveals areas where models struggle and challenging types of programs to generate
  • Models struggle to predict program outputs given specific inputs
  • Insights into limitations and potential applications of large language models for program synthesis
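
The log-linear scaling mentioned in the key points means that accuracy grows roughly linearly in the logarithm of the parameter count, i.e. accuracy ≈ a + b · log10(parameters). The sketch below fits such a line by ordinary least squares. The function name `fit_log_linear` is hypothetical, and the sizes and accuracies are made-up placeholder values chosen only to illustrate the relationship; they are not results from the paper.

```python
import math


def fit_log_linear(sizes, accuracies):
    """Least-squares fit of accuracy = a + b * log10(size).

    Returns the intercept a and the slope b (accuracy gained per
    tenfold increase in model size).
    """
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Placeholder data (NOT from the paper): parameter counts and accuracies in %.
placeholder_sizes = [1e8, 1e9, 1e10, 1e11]
placeholder_accuracies = [10.0, 25.0, 40.0, 55.0]

a, b = fit_log_linear(placeholder_sizes, placeholder_accuracies)
print(f"accuracy ~ {a:.1f} + {b:.1f} * log10(parameters)")
```

Under a trend like this, every tenfold increase in parameters adds a roughly constant number of percentage points, which is why results of this kind are usually plotted with model size on a logarithmic axis.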

Authors: Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton

Jacob Austin and Augustus Odena contributed equally.

Abstract: This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
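
The abstract reports the fraction of problems for which a model can "synthesize solutions" but does not spell out how a solution is judged. One common way to score benchmarks of this kind is to run each generated program against a few test cases and count a task as solved if any sampled program passes all of them. The sketch below makes exactly those assumptions; the helper names `passes_tests` and `fraction_solved` and the example task are hypothetical, and the code omits the sandboxing and timeouts that would be needed before executing untrusted model output.

```python
def passes_tests(candidate_source, test_asserts):
    """Return True if the candidate program runs and every assert passes.

    Assumption: tests are Python `assert` statements given as strings.
    Any exception (syntax error, runtime error, failed assert) counts
    as a failure.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        for test in test_asserts:
            exec(test, namespace)
    except Exception:
        return False
    return True


def fraction_solved(samples_per_task, tests_per_task):
    """Fraction of tasks for which at least one sampled program passes."""
    solved = sum(
        any(passes_tests(src, tests) for src in samples)
        for samples, tests in zip(samples_per_task, tests_per_task)
    )
    return solved / len(tests_per_task)


# Hypothetical task: "Write a function that adds two numbers."
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"

# Task 1 has one correct sample, task 2 has none.
print(fraction_solved([[bad, good], [bad]], [tests, tests]))
```

Scoring by execution rather than by text match accepts any program that behaves correctly, regardless of how it is written.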

Submitted to arXiv on 16 Aug. 2021

Results of the summarizing process for the arXiv paper: 2108.07732v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

This paper investigates the capabilities of large language models for program synthesis in general-purpose programming languages. The authors evaluate a collection of models, ranging from 244M to 137B parameters, on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. The benchmarks are designed to assess the models' ability to generate short Python programs from natural language descriptions. The MBPP dataset consists of 974 programming tasks designed to be solvable by entry-level programmers, while the MathQA-Python dataset contains 23,914 problems that involve synthesizing code from more complex text.

The results show that synthesis performance scales log-linearly with model size on both datasets. Even without fine-tuning on a code dataset, the largest models can synthesize solutions for 59.6% of the MBPP problems using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by approximately 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves an accuracy of 83.8%.

The authors also explore how these models can engage in dialog about code, incorporating human feedback to improve their solutions. They find that natural language feedback from a human halves the error rate compared to the model's initial prediction. An error analysis shows where these models fall short and which types of programs are most difficult to generate. Finally, the authors investigate the semantic grounding of these models by fine-tuning them to predict the results of program execution; however, even their best models are generally unable to predict the output of a program given a specific input. Overall, the study provides insights into the limitations and potential applications of large language models for program synthesis in general-purpose programming languages.
Created on 13 Nov. 2023
