Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

AI-generated keywords: Image Generation Technologies, Interpretable Detection Methods, Multi-modal Large Language Models (MLLMs), Fine-tuning, FakeXplained Dataset

AI-generated Key Points

Growing demand for interpretable and robust detection methods due to rapid advancement of image generation technologies
MLLMs (Multi-modal Large Language Models) have strong analytical and reasoning capabilities that can be applied to forgery detection when fine-tuned
Challenges faced by existing MLLMs include hallucination and difficulty aligning visual interpretations with actual image content and human reasoning
Dataset of AI-generated images annotated with bounding boxes and descriptive captions created to highlight synthesis artifacts, enabling human-aligned visual-textual grounded reasoning
Multi-stage optimization strategy improves performance of MLLMs in detecting AI-generated images and localizing visual flaws
Introduction of dataset containing 8,772 AI-generated images annotated with bounding boxes highlighting visual anomalies represents a significant contribution to the field
Qwen-2.5-VL (32B) fine-tuned on dataset enables end-to-end system capable of detecting and explaining AI-generated images through binary authenticity decisions, predicted bounding boxes, and natural language justifications
Fine-tuning on FakeXplained allows MLLMs to perform fine-grained visual reasoning and articulate observations clearly, marking a new paradigm in human-interpretable AI-generated image detection


Authors: Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang

arXiv: 2506.07045v1 (cs.CV)

License: CC BY-NC-SA 4.0

Abstract: The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then finetune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.

Submitted to arXiv on 08 Jun. 2025


Results of the summarizing process for the arXiv paper: 2506.07045v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

In response to the rapid advancement of image generation technologies, there is a growing demand for interpretable and robust detection methods. Existing approaches often achieve high accuracy but operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs) have emerged as powerful tools with strong analytical and reasoning capabilities, despite not originally being intended for forgery detection. When properly fine-tuned, MLLMs can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still face challenges such as hallucination and difficulty aligning their visual interpretations with actual image content and human reasoning. To address this gap, a dataset of AI-generated images annotated with bounding boxes and descriptive captions has been constructed to highlight synthesis artifacts. This dataset serves as a foundation for human-aligned visual-textual grounded reasoning. Through a multi-stage optimization strategy that balances the objectives of accurate detection, visual localization, and coherent textual explanation, MLLMs have been fine-tuned to detect AI-generated images and localize visual flaws. The resulting model significantly outperforms baseline methods on both tasks. The dataset contains 8,772 AI-generated images annotated with bounding boxes that highlight visual anomalies and illogical details, along with concise captions explaining each flaw, and it represents a significant contribution to the field. By fine-tuning Qwen-2.5-VL (32B) on this dataset using a two-stage training approach, an end-to-end system capable of detecting and explaining AI-generated images has been developed. This system not only produces binary authenticity decisions but also provides predicted bounding boxes paired with natural language justifications for identified regions in fake images.
Through grounded reasoning enabled by fine-tuning on FakeXplained, MLLMs can now perform fine-grained visual reasoning and articulate their observations clearly. This advancement marks a new paradigm in human-interpretable AI-generated image detection where models are able to provide comprehensive rationales comparable to human annotators.
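
The system output described above (a binary authenticity decision plus predicted boxes paired with justifications) could be represented roughly as follows. This is only a minimal sketch: the class names, field names, and the (x_min, y_min, x_max, y_max) box convention are hypothetical and are not taken from the paper or from the FakeXplained release.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FlawRegion:
    # Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[float, float, float, float]
    # Short natural-language justification for the flagged region.
    caption: str


@dataclass
class DetectionResult:
    # Binary authenticity decision: True means "AI-generated".
    is_fake: bool
    # Grounded explanations; expected to be empty for images judged real.
    regions: List[FlawRegion] = field(default_factory=list)

    def summary(self) -> str:
        """Render the decision and its grounded explanations as plain text."""
        verdict = "AI-generated" if self.is_fake else "real"
        lines = [f"Verdict: {verdict}"]
        for i, region in enumerate(self.regions, start=1):
            lines.append(f"  {i}. box={region.box}: {region.caption}")
        return "\n".join(lines)
```

A caller might build `DetectionResult(True, [FlawRegion((10, 20, 110, 220), "hand has six fingers")])` and print `summary()` to show each flagged region next to its explanation.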


Summary
1. People want better ways to find fake pictures because technology is getting better at making them.
2. Smart computer models can help find fake pictures by learning and practicing.
3. These computer models have some problems, like seeing things that aren't there and not understanding images the way people do.
4. A special set of computer-made pictures, with the mistakes marked and described, helps show where mistakes happen so we can understand them better.
5. Making these computer models better at finding mistakes in pictures helps us know if a picture is real or fake.

Definitions
- Interpretable: Able to be understood easily
- Robust: Strong and able to handle challenges well
- Forgery detection: Finding fake or altered images
- Multi-modal Large Language Models (MLLMs): Advanced computer programs that can understand both text and images
- Hallucination: Seeing things that are not really there
- Synthesis artifacts: Mistakes or errors in created images
- Grounded reasoning: Using both visual and textual information to make decisions
- Fine-tuned: Adjusted or optimized for specific tasks

In recent years, the rapid advancement of image generation technologies has led to a growing demand for interpretable and robust detection methods. While existing approaches often achieve high accuracy, they operate as black boxes without providing human-understandable justifications. This lack of interpretability has become a major concern in applications such as fake news detection, deepfake identification, and content moderation on social media platforms. To address this gap, researchers have turned to multi-modal large language models (MLLMs) as powerful tools with strong analytical and reasoning capabilities. Despite not originally being intended for forgery detection, MLLMs have shown promising results when properly fine-tuned. They can effectively identify AI-generated images and offer meaningful explanations for their decisions. However, existing MLLMs still face challenges such as hallucination and difficulty aligning their visual interpretations with actual image content and human reasoning. To overcome these limitations, the authors have developed a dataset called FakeXplained. The FakeXplained dataset contains 8,772 AI-generated images annotated with bounding boxes that highlight visual anomalies and illogical details, along with concise captions explaining each flaw. This dataset serves as a foundation for human-aligned visual-textual grounded reasoning. Through a multi-stage optimization strategy that balances the objectives of accurate detection, visual localization, and coherent textual explanation, MLLMs have been fine-tuned on the FakeXplained dataset to detect AI-generated images and localize visual flaws. The resulting model significantly outperforms baseline methods on both tasks.
By fine-tuning Qwen-2.5-VL (32B) on this dataset using a two-stage training approach, an end-to-end system capable of detecting and explaining AI-generated images has been developed. This system not only produces binary authenticity decisions but also provides predicted bounding boxes paired with natural language justifications for identified regions in fake images. Through grounded reasoning enabled by fine-tuning on FakeXplained, MLLMs can now perform fine-grained visual reasoning and articulate their observations clearly. This advancement marks a new paradigm in human-interpretable AI-generated image detection where models are able to provide comprehensive rationales comparable to those of human annotators. This is a significant contribution to the field, as it not only improves detection accuracy but also provides meaningful explanations for the decisions made by the model. The development of the FakeXplained dataset and its use in fine-tuning MLLMs highlight the importance of interpretability in AI systems. With this dataset, researchers can train models that not only achieve high accuracy but also offer transparent justifications for their decisions. In conclusion, with the rapid growth of AI-generated images, there is an urgent need for interpretable and robust detection methods. The introduction of the FakeXplained dataset and its use in fine-tuning MLLMs bring us one step closer to this goal. This research opens up new possibilities for future work in human-aligned visual-textual grounded reasoning and paves the way towards more trustworthy and transparent AI systems.
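
Localization quality for predicted bounding boxes is commonly scored with intersection over union (IoU) against the annotated boxes. The text here does not state which localization metric the paper uses, so the following is only an illustrative sketch; the function name and the (x_min, y_min, x_max, y_max) box format are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap; clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A predicted box could then be counted as a correct localization when its IoU with an annotated box exceeds a chosen threshold; the threshold itself would be another assumption.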

Created on 17 Jul. 2025


