Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

AI-generated keywords: Image Generation Technologies; Interpretable Detection Methods; Multi-modal Large Language Models (MLLMs); Fine-tuning; FakeXplained Dataset

AI-generated Key Points

  • Growing demand for interpretable and robust detection methods due to rapid advancement of image generation technologies
  • Multi-modal Large Language Models (MLLMs) have strong analytical and reasoning capabilities for forgery detection when fine-tuned
  • Challenges faced by existing MLLMs include hallucination and difficulty aligning visual interpretations with actual image content and human reasoning
  • Dataset of AI-generated images annotated with bounding boxes and descriptive captions created to highlight synthesis artifacts, enabling human-aligned visual-textual grounded reasoning
  • Multi-stage optimization strategy improves performance of MLLMs in detecting AI-generated images and localizing visual flaws
  • Introduction of a dataset containing 8,772 AI-generated images annotated with bounding boxes highlighting visual anomalies represents a significant contribution to the field
  • Qwen-2.5-VL (32B) fine-tuned on this dataset yields an end-to-end system capable of detecting and explaining AI-generated images through binary authenticity decisions, predicted bounding boxes, and natural language justifications
  • Fine-tuning on FakeXplained allows MLLMs to perform fine-grained visual reasoning and articulate observations clearly, marking a new paradigm in human-interpretable AI-generated image detection

Authors: Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang

License: CC BY-NC-SA 4.0

Abstract: The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then finetune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.

Submitted to arXiv on 08 Jun. 2025


Results of the summarizing process for the arXiv paper: 2506.07045v1

In response to the rapid advancement of image generation technologies, there is a growing demand for interpretable and robust detection methods. Existing approaches often achieve high accuracy but operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs) have emerged as powerful tools with strong analytical and reasoning capabilities, despite not originally being intended for forgery detection. When properly fine-tuned, MLLMs can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still face challenges such as hallucination and difficulty aligning their visual interpretations with actual image content and human reasoning.

To address this gap, a dataset of AI-generated images annotated with bounding boxes and descriptive captions has been constructed to highlight synthesis artifacts. This dataset, FakeXplained, serves as a foundation for human-aligned visual-textual grounded reasoning. It contains 8,772 AI-generated images annotated with bounding boxes that highlight visual anomalies and illogical details, along with concise captions explaining each flaw, and it represents a significant contribution to the field.

Through a multi-stage optimization strategy that balances the objectives of accurate detection, visual localization, and coherent textual explanation, MLLMs have been fine-tuned to improve performance in detecting AI-generated images and localizing visual flaws. The resulting model significantly outperforms baseline methods on both tasks. By fine-tuning Qwen-2.5-VL (32B) on this dataset using a two-stage training approach, an end-to-end system capable of detecting and explaining AI-generated images has been developed. This system not only produces binary authenticity decisions but also provides predicted bounding boxes paired with natural language justifications for identified regions in fake images.

Through grounded reasoning enabled by fine-tuning on FakeXplained, MLLMs can now perform fine-grained visual reasoning and articulate their observations clearly. This advancement marks a new paradigm in human-interpretable AI-generated image detection where models are able to provide comprehensive rationales comparable to human annotators.
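
To make the output format described above more concrete, the following is a minimal Python sketch of how a detection result (a binary decision plus bounding boxes paired with captions) might be represented and how predicted boxes could be compared against annotated ones using intersection over union (IoU). The class names, the pixel-coordinate box format, the example captions, and the 0.5 threshold are all assumptions made for illustration; none of them are taken from the paper or from the FakeXplained dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Finding:
    """One flagged region and its natural-language explanation (hypothetical schema)."""
    box: Box
    caption: str


@dataclass
class DetectionResult:
    """Binary authenticity decision plus grounded findings (hypothetical schema)."""
    is_fake: bool
    findings: List[Finding]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_fraction(predicted: List[Finding],
                     annotated: List[Finding],
                     threshold: float = 0.5) -> float:
    """Fraction of annotated boxes covered by some predicted box at IoU >= threshold."""
    if not annotated:
        return 0.0
    hits = sum(
        1 for gt in annotated
        if any(iou(p.box, gt.box) >= threshold for p in predicted)
    )
    return hits / len(annotated)


# Made-up example values, not data from the paper.
prediction = DetectionResult(
    is_fake=True,
    findings=[Finding((10, 10, 50, 50), "Hand has six fingers.")],
)
annotation = [
    Finding((12, 12, 50, 50), "Extra finger on the left hand."),
    Finding((100, 100, 120, 120), "Text on the sign is illegible."),
]
score = matched_fraction(prediction.findings, annotation)
print(score)  # 0.5: one of the two annotated regions is matched
```

This sketch only scores localization overlap. Judging whether a generated caption agrees with the human-written one would need a separate text comparison, which is not shown here.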
Created on 17 Jul. 2025

