In response to the rapid advancement of image generation technologies, there is a growing demand for interpretable and robust detection methods. Existing approaches often achieve high accuracy but operate as black boxes, providing no human-understandable justifications. Multimodal large language models (MLLMs) have emerged as powerful tools with strong analytical and reasoning capabilities, despite not originally being intended for forgery detection. When properly fine-tuned, MLLMs can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still face challenges such as hallucination and difficulty aligning their visual interpretations with actual image content and human reasoning. To address this gap, a dataset of AI-generated images annotated with bounding boxes and descriptive captions has been constructed to highlight synthesis artifacts. This dataset serves as a foundation for human-aligned visual-textual grounded reasoning. Through a multi-stage optimization strategy that balances the objectives of accurate detection, visual localization, and coherent textual explanation, MLLMs have been fine-tuned to improve performance in detecting AI-generated images and localizing visual flaws. The resulting model significantly outperforms baseline methods on both tasks.
A central contribution is FakeXplained, a dataset of 8,772 AI-generated images annotated with bounding boxes that highlight visual anomalies and illogical details, along with concise captions explaining each flaw. By fine-tuning Qwen-2.5-VL (32B) on this dataset using a two-stage training approach, an end-to-end system capable of detecting and explaining AI-generated images has been developed. This system not only produces binary authenticity decisions but also provides predicted bounding boxes paired with natural language justifications for identified regions in fake images.
Through grounded reasoning enabled by fine-tuning on FakeXplained, MLLMs can perform fine-grained visual reasoning and articulate their observations clearly. This advancement marks a new paradigm in human-interpretable AI-generated image detection, in which models provide comprehensive rationales comparable to those of human annotators.
- Growing demand for interpretable and robust detection methods due to rapid advancement of image generation technologies
- Multimodal large language models (MLLMs) have strong analytical and reasoning capabilities and can be applied to forgery detection when fine-tuned
- Challenges faced by existing MLLMs include hallucination and difficulty aligning visual interpretations with actual image content and human reasoning
- Dataset of AI-generated images annotated with bounding boxes and descriptive captions created to highlight synthesis artifacts, enabling human-aligned visual-textual grounded reasoning
- Multi-stage optimization strategy improves performance of MLLMs in detecting AI-generated images and localizing visual flaws
- The dataset contains 8,772 AI-generated images annotated with bounding boxes highlighting visual anomalies and represents a significant contribution to the field
- Fine-tuning Qwen-2.5-VL (32B) on the dataset yields an end-to-end system that detects and explains AI-generated images through binary authenticity decisions, predicted bounding boxes, and natural language justifications
- Fine-tuning on FakeXplained allows MLLMs to perform fine-grained visual reasoning and articulate observations clearly, marking a new paradigm in human-interpretable AI-generated image detection
Summary
1. People want better ways to find fake pictures because technology is getting better at making them.
2. Smart computer models can help find fake pictures by learning and practicing.
3. These computer models have some problems, like describing things that aren't really there and having trouble understanding images the way people do.
4. A special set of pictures made by computers helps us see where mistakes happen, so we can understand them better.
5. Making these computer models better at finding mistakes in pictures helps us know if a picture is real or fake.
Definitions
- Interpretable: Able to be understood easily
- Robust: Strong and able to handle challenges well
- Forgery detection: Finding fake or altered images
- Multimodal Large Language Models (MLLMs): Advanced computer programs that can understand both text and images
- Hallucination: When a model describes things that are not actually in the image
- Synthesis artifacts: Visible flaws or errors left behind in computer-generated images
- Grounded reasoning: Tying each written explanation to a specific region of the image
- Fine-tuned: Adjusted or optimized for specific tasks
In recent years, the rapid advancement of image generation technologies has led to a growing demand for interpretable and robust detection methods. While existing approaches often achieve high accuracy, they operate as black boxes without providing human-understandable justifications. This lack of interpretability has become a major concern in various applications such as fake news detection, deepfake identification, and content moderation on social media platforms.
To address this gap, researchers have turned to multimodal large language models (MLLMs) as powerful tools with strong analytical and reasoning capabilities. Despite not originally being intended for forgery detection, MLLMs have shown promising results when properly fine-tuned. They can effectively identify AI-generated images and offer meaningful explanations for their decisions.
However, existing MLLMs still face challenges such as hallucination and difficulty aligning their visual interpretations with actual image content and human reasoning. To overcome these limitations, a team of researchers from the University of California San Diego has developed a dataset called FakeXplained.
The FakeXplained dataset contains 8,772 AI-generated images annotated with bounding boxes highlighting visual anomalies and illogical details along with concise captions explaining each flaw. This dataset serves as a foundation for human-aligned visual-textual grounded reasoning.
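To make the annotation format concrete, the following is a minimal sketch of how one such record could be represented in Python. The class names, field names, pixel-space coordinate convention, and the example file name and caption are all assumptions made for illustration; the text above does not specify the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FlawAnnotation:
    """One annotated flaw: a pixel-space box plus a short caption (hypothetical schema)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    caption: str

    def __post_init__(self) -> None:
        # Reject degenerate or inverted boxes and empty captions.
        if self.x_min >= self.x_max or self.y_min >= self.y_max:
            raise ValueError("box must have positive width and height")
        if not self.caption.strip():
            raise ValueError("caption must not be empty")


@dataclass
class AnnotatedImage:
    """An AI-generated image with zero or more annotated flaws (hypothetical schema)."""
    image_path: str
    width: int
    height: int
    flaws: List[FlawAnnotation] = field(default_factory=list)

    def add_flaw(self, flaw: FlawAnnotation) -> None:
        # Keep every box inside the image bounds.
        if (flaw.x_min < 0 or flaw.y_min < 0
                or flaw.x_max > self.width or flaw.y_max > self.height):
            raise ValueError("box lies outside the image")
        self.flaws.append(flaw)


# Illustrative record; the path and caption are made up for this example.
record = AnnotatedImage(image_path="example_0001.png", width=512, height=512)
record.add_flaw(FlawAnnotation(120, 200, 220, 310, "The hand has six fingers."))
print(len(record.flaws), record.flaws[0].caption)
```

Pairing each box with its own caption, rather than storing one caption per image, mirrors the idea that every explanation should point at a specific region.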
Through a multi-stage optimization strategy that balances the objectives of accurate detection, visual localization, and coherent textual explanation, MLLMs have been fine-tuned on the FakeXplained dataset to improve performance in detecting AI-generated images and localizing visual flaws.
By fine-tuning Qwen-2.5-VL (32B) on this dataset using a two-stage training approach, an end-to-end system capable of detecting and explaining AI-generated images has been developed. The resulting model significantly outperforms baseline methods on both detection and localization.
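The text does not say which metric is used to score localization. A common choice for comparing a predicted bounding box with an annotated one is intersection over union (IoU), sketched below as an assumption rather than as the method actually used in this work. The box format `(x_min, y_min, x_max, y_max)` and the example coordinates are likewise illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (100.0, 100.0, 200.0, 200.0)
annotated = (150.0, 150.0, 250.0, 250.0)
print(round(iou(predicted, annotated), 4))  # overlap 2500 / union 17500 -> 0.1429
```

An IoU of 1.0 means the two boxes coincide and 0.0 means they do not overlap; evaluations often count a prediction as correct when its IoU with an annotated box exceeds a chosen threshold.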
This system not only produces binary authenticity decisions but also provides predicted bounding boxes paired with natural language justifications for identified regions in fake images. Through grounded reasoning enabled by fine-tuning on FakeXplained, MLLMs can now perform fine-grained visual reasoning and articulate their observations clearly.
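The three outputs described above (an authenticity decision, bounding boxes, and a justification per box) could be consumed by downstream code along the following lines. This is a sketch only: the JSON layout, the key names `verdict`, `regions`, `box`, and `reason`, and the function `parse_response` are hypothetical, since the text does not describe the model's actual output format.

```python
import json
from typing import Dict, List, Tuple


def parse_response(raw: str) -> Tuple[bool, List[Dict]]:
    """Return (is_fake, regions) from a JSON reply in a hypothetical format."""
    data = json.loads(raw)
    is_fake = data["verdict"] == "fake"
    regions = []
    for item in data.get("regions", []):
        box = item["box"]
        if len(box) != 4:
            raise ValueError("each box needs four coordinates")
        regions.append({
            "box": tuple(float(v) for v in box),
            "reason": item["reason"],
        })
    # An image judged real should carry no flagged regions.
    if not is_fake and regions:
        raise ValueError("regions given for an image judged real")
    return is_fake, regions


# Illustrative reply; the content is made up for this example.
reply = (
    '{"verdict": "fake", "regions": '
    '[{"box": [120, 200, 220, 310], "reason": "The hand has six fingers."}]}'
)
is_fake, regions = parse_response(reply)
print(is_fake, regions[0]["box"], regions[0]["reason"])
```

Checking that a "real" verdict comes with no flagged regions is one simple consistency test between the decision and its explanation.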
This advancement marks a new paradigm in human-interpretable AI-generated image detection where models are able to provide comprehensive rationales comparable to human annotators. This is a significant contribution to the field as it not only improves the accuracy of detection but also provides meaningful explanations for the decisions made by the model.
The development of the FakeXplained dataset and its use in fine-tuning MLLMs highlights the importance of interpretability in AI systems. With this dataset, researchers can now train models that not only achieve high accuracy but also offer transparent justifications for their decisions.
In conclusion, with the rapid growth of AI-generated images, there is an urgent need for interpretable and robust detection methods. The introduction of the FakeXplained dataset and its use in fine-tuning MLLMs has brought us one step closer to achieving this goal. This research opens up new possibilities for future work in human-aligned visual-textual grounded reasoning and paves the way towards more trustworthy and transparent AI systems.