Large Language Models in Fault Localisation
AI-generated Key Points
- ChatGPT-3.5 and ChatGPT-4 are investigated for fault localisation in large-scale open-source programs
- Performance of these models is compared to existing fault localisation techniques using the Defects4J dataset
- Stability and explanation of LLMs in fault localisation are examined
- Impact of prompt engineering and code context length on effectiveness is analyzed
- ChatGPT-4 outperforms existing methods within limited code context, achieving 46.9% higher accuracy than SmartFL baseline
- However, when code context expands to class level, ChatGPT models become less effective overall
- ChatGPT's explainability is found to be unsatisfactory with only approximately 30% accuracy
- Further research is needed to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.
Authors: Yonghao Wu, Zheng Li, Jie M. Zhang, Mike Papadakis, Mark Harman, Yong Liu
Abstract: Large Language Models (LLMs) have shown promise in multiple software engineering tasks including code generation, code summarisation, test generation and code repair. Fault localisation is essential for facilitating automatic program debugging and repair, and is demonstrated as a highlight at ChatGPT-4's launch event. Nevertheless, there has been little work understanding LLMs' capabilities for fault localisation in large-scale open-source programs. To fill this gap, this paper presents an in-depth investigation into the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Using the widely-adopted Defects4J dataset, we compare the two LLMs with the existing fault localisation techniques. We also investigate the stability and explanation of LLMs in fault localisation, as well as how prompt engineering and the length of code context affect the fault localisation effectiveness. Our findings demonstrate that within a limited code context, ChatGPT-4 outperforms all the existing fault localisation methods. Additional error logs can further improve ChatGPT models' localisation accuracy and stability, with an average 46.9% higher accuracy over the state-of-the-art baseline SmartFL in terms of TOP-1 metric. However, performance declines dramatically when the code context expands to the class-level, with ChatGPT models' effectiveness becoming inferior to the existing methods overall. Additionally, we observe that ChatGPT's explainability is unsatisfactory, with an accuracy rate of only approximately 30%. These observations demonstrate that while ChatGPT can achieve effective fault localisation performance under certain conditions, evident limitations exist. Further research is imperative to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
The previous summary was created more than a year ago and can be re-run (if necessary) by clicking on the Run button below.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.