Data Augmentation Approaches for Source Code Models: A Survey

AI-generated keywords: Data Augmentation, Source Code Models, Taxonomy, Strategies, Techniques

AI-generated Key Points

  • The paper presents a survey of data augmentation (DA) techniques for source code models.
  • DA methods are used to enhance training data and improve the robustness and generalizability of these models.
  • A taxonomy of DA approaches for source code models is constructed, followed by a discussion of prominent, methodologically illustrative approaches.
  • The authors highlight general strategies and techniques for optimizing DA quality, underscore techniques that are useful in widely accepted source code scenarios and downstream tasks, and outline the prevailing challenges and potential opportunities for future research.
  • Specific areas such as Code Summarization, Code Search, and Code Completion are discussed with examples of how DA can be applied in each area.
  • The paper provides a valuable collection of general-purpose DA techniques for source code models and serves as an inspiration for further research in this area.
  • The authors also provide a continually updated GitHub repository that hosts a list of up-to-date papers on DA for source code models.

Authors: Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo

Technical Report
License: CC BY 4.0

Abstract: The increasingly popular adoption of source code models in many critical tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and generalizability) of these models. Although a series of DA methods have been proposed and tailored for source code models, the field lacks a comprehensive survey and examination of their effectiveness and implications. This paper fills this gap by conducting a comprehensive and integrative survey of data augmentation for source code, wherein we systematically compile and encapsulate existing literature to provide an overview of the field. We start by constructing a taxonomy of DA approaches for source code models, followed by a discussion of prominent, methodologically illustrative approaches. Next, we highlight the general strategies and techniques to optimize DA quality. Subsequently, we underscore techniques that find utility in widely accepted source code scenarios and downstream tasks. Finally, we outline the prevailing challenges and potential opportunities for future research. In essence, this paper endeavors to demystify the corpus of existing literature on DA for source code models and to foster further exploration in this sphere. Complementing this, we present a continually updated GitHub repository that hosts a list of up-to-date papers on DA for source code models, accessible at https://github.com/terryyz/DataAug4Code.

Submitted to arXiv on 31 May 2023

Results of the summarizing process for the arXiv paper: 2305.19915v1

This paper presents a comprehensive and integrative survey of data augmentation (DA) techniques for source code models. The growing adoption of source code models in critical tasks has motivated the development of DA methods that enhance training data and improve capabilities such as robustness and generalizability. Although several DA methods have been proposed and tailored for source code models, a comprehensive examination of their effectiveness and implications has been missing.

The paper starts by constructing a taxonomy of DA approaches for source code models, followed by a discussion of prominent, methodologically illustrative approaches. The authors then highlight general strategies and techniques for optimizing DA quality, underscore techniques that are useful in widely accepted source code scenarios and downstream tasks, and outline the prevailing challenges and potential opportunities for future research.

The survey also covers specific downstream tasks. For Code Summarization, it discusses how MHM is applied to perturb training examples for adversarial training, and how retrieval-augmentation frameworks generate new summaries from similar code-summary pairs. For Code Search, it discusses how soft data augmentation (SoDa) can replace rule-based techniques: instead of rewriting the code with fixed rules, SoDa dynamically masks or replaces input tokens and still requires the model to predict them, with CodeSearchNet as the dataset. For Code Completion, the survey reports that generative source code models are vulnerable to adversarial examples produced by transformation rules, and it covers customized transformations for docstrings, function and variable names, and syntax and formatting on datasets such as PY150 and GitHub Java Corpus.

In essence, this paper provides a valuable collection of general-purpose DA techniques for source code models and serves as an inspiration for further research in this area.
The authors also provide a continually updated GitHub repository that hosts a list of up-to-date papers on DA for source code models.
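
To make the idea of a rule-based transformation on function and variable names more concrete, the following is a minimal sketch of a semantics-preserving identifier renaming in Python. It is an illustration only and is not taken from the paper: the class `RenameIdentifiers`, the helper `rename_identifiers`, and the sample mapping are all hypothetical names chosen for this example. It assumes Python 3.9 or later (for `ast.unparse`) and a mapping that does not collide with builtins, attributes, or other names in scope.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping.

    Hypothetical helper for illustration; a real augmentation pipeline
    would also need scope analysis to avoid name collisions.
    """

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes (e.g. `result = total + value`).
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Function parameters (e.g. `def add(total, value)`).
        self.generic_visit(node)  # also visit any type annotation
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


def rename_identifiers(source, mapping):
    """Return `source` with identifiers renamed; behavior is unchanged
    as long as the mapping introduces no name collisions."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(tree)


original = "def add(total, value):\n    result = total + value\n    return result\n"
augmented = rename_identifiers(
    original, {"total": "a", "value": "b", "result": "c"}
)
print(augmented)
```

The augmented snippet computes the same function as the original but presents different surface tokens, which is the property such transformations rely on when they are used to generate extra training examples or to probe robustness.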
Created on 08 Jun. 2023
