AstBERT: Enabling Language Model for Code Understanding with Abstract Syntax Tree

AI-generated keywords: AstBERT, Abstract Syntax Trees, BERT, Code Parsing, GitHub

AI-generated Key Points

  • AstBERT is a pre-trained language model that uses abstract syntax trees (AST) to enhance understanding of programming languages (PL)
  • BERT has been used for source code analysis but lacks domain knowledge, affecting its performance
  • AstBERT addresses this issue by collecting large amounts of source code from GitHub and utilizing code parsers to interpret and integrate AST information
  • The proposed model achieves state-of-the-art performance on code information extraction and code search tasks
  • It reports 96.4% on the code information extraction task and 57.12% on the code search task

Authors: Rong Liang, Yujie Lu, Zhen Huang, Tiehua Zhang, Yuze Liu

License: CC BY 4.0

Abstract: Using a pre-trained language model such as BERT to understand source code has attracted increasing attention in the natural language processing community. However, applying these language models directly to programming language (PL) problems poses several challenges, the most significant of which is a lack of domain knowledge that substantially degrades model performance. To this end, we propose AstBERT, a pre-trained language model that aims to better understand PL using the abstract syntax tree (AST). Specifically, we collect a large amount of source code (both Java and Python) from GitHub and incorporate contextual code knowledge into our model with the help of code parsers, through which the AST information of the source code can be interpreted and integrated. We verify the performance of the proposed model on code information extraction and code search tasks. Experimental results show that AstBERT achieves state-of-the-art performance on both downstream tasks (96.4% on the code information extraction task and 57.12% on the code search task).

Submitted to arXiv on 20 Jan. 2022


Results of the summarizing process for the arXiv paper: 2201.07984v1

AstBERT is a pre-trained language model that uses abstract syntax trees (AST) to improve the understanding of programming languages (PL). BERT has been used to analyze source code but lacks domain knowledge, which hampers its performance. To address this, the authors collect a large amount of Java and Python source code from GitHub and use code parsers to interpret the AST information and integrate it into the model. The model achieves state-of-the-art performance on both downstream tasks, reporting 96.4% on code information extraction and 57.12% on code search.
Created on 28 Aug. 2023
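
The summary above says that code parsers are used to interpret AST information from source code. As a rough illustration of what such information can look like, the sketch below uses Python's standard `ast` module to pair each identifier in a snippet with the type of AST node it belongs to. The function name `identifier_roles` and the choice of node types are assumptions made for this example; the page does not describe the parser or the integration scheme AstBERT actually uses.

```python
import ast
from typing import List, Tuple


def identifier_roles(source: str) -> List[Tuple[str, str]]:
    """Pair identifiers in `source` with the name of their AST node type.

    Hypothetical helper for illustration only; not AstBERT's pipeline.
    """
    tree = ast.parse(source)
    roles = []
    # ast.walk visits every node; its order is not guaranteed by the docs.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            roles.append((node.name, "FunctionDef"))
        elif isinstance(node, ast.arg):
            roles.append((node.arg, "arg"))
        elif isinstance(node, ast.Name):
            roles.append((node.id, "Name"))
    return roles


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    for name, role in sorted(identifier_roles(snippet)):
        print(name, role)
```

Labels of this kind are one way syntactic structure could be attached to code tokens before they reach a language model; how the paper encodes them is not stated here.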

