Evaluating and Explaining Large Language Models for Code Using Syntactic Structures
AI-generated Key Points
- Large Language Models (LLMs) are evaluated and explained for their effectiveness in code
- LLMs are transformer-based neural networks pre-trained on large datasets of natural and programming languages
- ASTxplainer is introduced as an explainability method specific to LLMs for code, aligning token predictions with Abstract Syntax Tree (AST) nodes
- ASTxplainer extracts and aggregates normalized model logits within AST structures to provide insights into LLM effectiveness
- Empirical evaluation conducted on 12 popular LLMs using a curated dataset of popular GitHub projects
- Demonstrates the practical benefit of ASTxplainer in improving LLM evaluation
- User study conducted to examine the usefulness of ASTxplainer-derived visualizations in aiding end-users' understanding of model predictions
- Results highlight the potential for ASTxplainer to improve LLM evaluation and aid in interpreting model predictions
- Contributes to the field of automated Software Engineering tools and provides valuable insights into the performance and interpretability of LLMs for code.
Authors: David N Palacio, Alejandro Velasco, Daniel Rodriguez-Cardenas, Kevin Moran, Denys Poshyvanyk
Abstract: Large Language Models (LLMs) for code are a family of high-parameter, transformer-based neural networks pre-trained on massive datasets of both natural and programming languages. These models are rapidly being employed in commercial AI-based developer tools, such as GitHub CoPilot. However, measuring and explaining their effectiveness on programming tasks is a challenging proposition, given their size and complexity. The methods for evaluating and explaining LLMs for code are inextricably linked. That is, in order to explain a model's predictions, they must be reliably mapped to fine-grained, understandable concepts. Once this mapping is achieved, new methods for detailed model evaluations are possible. However, most current explainability techniques and evaluation benchmarks focus on model robustness or individual task performance, as opposed to interpreting model predictions. To this end, this paper introduces ASTxplainer, an explainability method specific to LLMs for code that enables both new methods for LLM evaluation and visualizations of LLM predictions that aid end-users in understanding model predictions. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes, by extracting and aggregating normalized model logits within AST structures. To demonstrate the practical benefit of ASTxplainer, we illustrate the insights that our framework can provide by performing an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects. Additionally, we perform a user study examining the usefulness of an ASTxplainer-derived visualization of model predictions aimed at enabling model users to explain predictions. The results of these studies illustrate the potential for ASTxplainer to provide insights into LLM effectiveness, and aid end-users in understanding predictions.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.