Large Language Models in Fault Localisation

AI-generated keywords: Fault Localisation, ChatGPT-3.5, ChatGPT-4, LLMs, Defects4J

AI-generated Key Points

  • ChatGPT-3.5 and ChatGPT-4 are investigated for fault localisation in large-scale open-source programs
  • Performance of these models is compared to existing fault localisation techniques using the Defects4J dataset
  • Stability and explanation of LLMs in fault localisation are examined
  • Impact of prompt engineering and code context length on effectiveness is analyzed
  • ChatGPT-4 outperforms all existing methods within a limited code context; with additional error logs, ChatGPT models achieve on average 46.9% higher TOP-1 accuracy than the state-of-the-art baseline SmartFL
  • However, when code context expands to class level, ChatGPT models become less effective overall
  • ChatGPT's explainability is found to be unsatisfactory with only approximately 30% accuracy
  • Further research is needed to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications

Authors: Yonghao Wu, Zheng Li, Jie M. Zhang, Mike Papadakis, Mark Harman, Yong Liu

License: CC BY 4.0

Abstract: Large Language Models (LLMs) have shown promise in multiple software engineering tasks including code generation, code summarisation, test generation and code repair. Fault localisation is essential for facilitating automatic program debugging and repair, and is demonstrated as a highlight at ChatGPT-4's launch event. Nevertheless, there has been little work understanding LLMs' capabilities for fault localisation in large-scale open-source programs. To fill this gap, this paper presents an in-depth investigation into the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Using the widely-adopted Defects4J dataset, we compare the two LLMs with the existing fault localisation techniques. We also investigate the stability and explanation of LLMs in fault localisation, as well as how prompt engineering and the length of code context affect the fault localisation effectiveness. Our findings demonstrate that within a limited code context, ChatGPT-4 outperforms all the existing fault localisation methods. Additional error logs can further improve ChatGPT models' localisation accuracy and stability, with an average 46.9% higher accuracy over the state-of-the-art baseline SmartFL in terms of TOP-1 metric. However, performance declines dramatically when the code context expands to the class-level, with ChatGPT models' effectiveness becoming inferior to the existing methods overall. Additionally, we observe that ChatGPT's explainability is unsatisfactory, with an accuracy rate of only approximately 30%. These observations demonstrate that while ChatGPT can achieve effective fault localisation performance under certain conditions, evident limitations exist. Further research is imperative to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.

Submitted to arXiv on 29 Aug. 2023

Results of the summarizing process for the arXiv paper: 2308.15276v1

This paper presents an in-depth investigation into the capability of ChatGPT-3.5 and ChatGPT-4, two state-of-the-art Large Language Models (LLMs), for fault localisation in large-scale open-source programs. The study compares these LLMs with existing fault localisation techniques on the Defects4J dataset. The researchers also examine the stability and explanation of LLMs in fault localisation, and how prompt engineering and code context length affect effectiveness. The findings show that, within a limited code context, ChatGPT-4 outperforms all existing fault localisation methods. Supplying additional error logs further improves the ChatGPT models' accuracy and stability, yielding on average 46.9% higher TOP-1 accuracy than the state-of-the-art baseline SmartFL. However, when the code context expands to the class level, performance declines dramatically and the ChatGPT models become less effective than existing methods overall. Additionally, ChatGPT's explanations are found to be unsatisfactory, with an accuracy of only approximately 30%. These limitations highlight the need for further research to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.
Created on 11 Sep. 2023
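
The abstract reports results in terms of the TOP-1 metric. As a rough illustration only, the sketch below shows how a Top-N count is commonly computed in fault localisation work: a bug counts as localised if any of its truly faulty lines appears among the first N ranked candidates. The function names, the data layout, and the toy data are hypothetical and are not taken from the paper, whose exact evaluation protocol is not described on this page. The relative-improvement helper assumes that "higher accuracy" is meant as a relative percentage, which this page does not confirm.

```python
from typing import Dict, List, Set


def top_n_count(rankings: Dict[str, List[int]],
                faulty_lines: Dict[str, Set[int]],
                n: int = 1) -> int:
    """Count bugs whose first n ranked candidates include a faulty line."""
    hits = 0
    for bug_id, ranked in rankings.items():
        truth = faulty_lines.get(bug_id, set())
        # A bug is a hit if any of the top-n suggestions is truly faulty.
        if any(line in truth for line in ranked[:n]):
            hits += 1
    return hits


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage by which `new` exceeds `baseline` (assumed relative)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical toy data: ranked suspicious line numbers per bug,
# and the ground-truth faulty lines for each bug.
rankings = {
    "bug-1": [42, 17, 8],
    "bug-2": [5, 99],
    "bug-3": [7, 3, 11],
}
faulty_lines = {
    "bug-1": {42},
    "bug-2": {99},
    "bug-3": {20},
}

top1 = top_n_count(rankings, faulty_lines, n=1)
top3 = top_n_count(rankings, faulty_lines, n=3)
print(top1, top3)
```

With this toy data, only bug-1 is located at rank 1, while bug-2 is found once the top three candidates are considered, so a larger N can only keep or raise the count.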
