PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction

AI-generated keywords: Visual selective attention

AI-generated Key Points

  • Individual preferences shape how humans prioritize visual stimuli in visual selective attention.
  • Existing models often overlook the impact of subjective cognitive diversity on fixation behavior.
  • Conventional saliency prediction models have limitations in capturing personalized attention patterns due to reliance on low-resolution imagery and subsequent upscaling.
  • Two contributions are introduced: SPA-ADV, a large-scale multimodal dataset of gaze behavior on advertisement videos, and PRE-MAP, a novel eye-tracking saliency model.
  • The PRE-MAP model aims to characterize personalized visual disparities through Reinforcement learning-optimized Eye-tracking and predict format-correct and spatially accurate points guided by Multi-Attribute user profiles.
  • C-GRPO has been introduced to enhance MLLMs' performance in producing precise prediction points while considering variability in eye movement points and Multi-Attribute profiles.
  • Extensive experiments have demonstrated the effectiveness of these approaches in addressing challenges related to personalized gaze prediction within eye-tracking models.

Authors: Hanbing Wu, Ping Jiang, Anyang Su, Chenxu Zhao, Tianyu Fu, Minghui Wu, Beiping Tan, Huiying Li

License: CC BY-NC-SA 4.0

Abstract: Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolutions, which limits their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucinations, which make it very costly to adhere strictly to the expected format in tasks involving multiple point predictions and make precise point positioning challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender across 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at https://github.com/mininglamp-MLLM/PRE-MAP.
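
The abstract's point about low-resolution heatmaps can be illustrated with a small sketch. It snaps a fixation to a coarse grid cell and maps the cell center back to full resolution, which is a simplification of heatmap prediction followed by upscaling. The frame size, grid size, and the function name `heatmap_roundtrip` are illustrative assumptions, not values or code from the paper.

```python
import math

def heatmap_roundtrip(point, frame_size, grid_size):
    """Snap a fixation to a coarse grid cell, then map the cell center back to pixels.

    Simplified stand-in for 'predict a low-resolution heatmap, then upscale':
    only the cell index survives, so sub-cell position is lost.
    """
    (x, y), (w, h), (gw, gh) = point, frame_size, grid_size
    cell_w, cell_h = w / gw, h / gh
    col = min(int(x / cell_w), gw - 1)   # clamp points on the right/bottom edge
    row = min(int(y / cell_h), gh - 1)
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)

# Assumed sizes: a 1920x1080 frame and a 64x36 heatmap (30x30 pixel cells).
fixation = (1000, 500)
recovered = heatmap_roundtrip(fixation, (1920, 1080), (64, 36))
error = math.dist(fixation, recovered)   # localization error in pixels
```

Under these assumed sizes the recovered point can sit up to half a cell diagonal (about 21 pixels) from the true fixation, which is the kind of loss that direct point prediction at native resolution avoids.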

Submitted to arXiv on 25 Jul. 2025


Results of the summarizing process for the arXiv paper: 2507.19213v1

In visual selective attention, individual preferences play a crucial role in determining how humans prioritize visual stimuli. By bridging subjective cognitive mechanisms with objective visual elements, these preferences regulate how individuals prioritize elements of dynamic visual scenes, thereby influencing semantic interpretation and hierarchical processing. However, existing models and datasets often overlook the impact of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models typically rely on segmentation approaches with low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolutions, limiting their ability to capture personalized attention patterns. Moreover, Multimodal Large Language Models (MLLMs) face constraints such as hallucinations, making it costly to adhere strictly to expected formats in tasks involving multiple point predictions. Achieving precise point positioning is also a significant challenge for these models. To address these limitations, the authors present Subjective Personalized Attention for Advertisement Videos (SPA-ADV), a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants of varying age and gender across 486 videos. They also propose PRE-MAP, a novel eye-tracking saliency model that characterizes personalized visual disparities through Reinforcement learning-optimized Eye-tracking. The model is built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure that the MLLM produces prediction points that are both format-correct and spatially accurate, the authors introduce Consistency Group Relative Policy Optimization (C-GRPO), which is inspired by the variability in eye movement points and Multi-Attribute profiles.
Extensive experiments conducted on SPA-ADV and other benchmarks are reported to demonstrate the effectiveness of this approach in addressing the aforementioned challenges. The code and dataset associated with this research are available at https://github.com/mininglamp-MLLM/PRE-MAP. The study was authored by Hanbing Wu, Ping Jiang, Anyang Su, Chenxu Zhao, Tianyu Fu, Minghui Wu, Beiping Tan, and Huiying Li. The work targets personalized, attribute-conditioned gaze prediction, which existing saliency models and datasets largely leave unaddressed.
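
To make the idea of "format-correct and spatially accurate" rewards more concrete, here is a minimal sketch of a GRPO-style reward and group-relative advantage computation. It is not the paper's C-GRPO: the consistency component is not described in this summary and is omitted. The output format `(x, y), (x, y)`, the reward shape, and the `scale` parameter are all assumptions made for illustration.

```python
import math
import re
from statistics import mean, pstdev

# Assumed output format: "(x, y), (x, y), ..." with integer pixel coordinates.
POINT_RE = re.compile(r"\((\d+),\s*(\d+)\)")
FORMAT_RE = re.compile(r"^\s*\(\d+,\s*\d+\)(\s*,\s*\(\d+,\s*\d+\))*\s*$")

def parse_points(text, expected_count):
    """Return a list of (x, y) tuples, or None if the text breaks the assumed format."""
    if not FORMAT_RE.match(text):
        return None
    points = [(int(x), int(y)) for x, y in POINT_RE.findall(text)]
    return points if len(points) == expected_count else None

def reward(text, targets, scale=100.0):
    """Hypothetical reward: 0 for malformed output, else 1 plus an accuracy bonus in (0, 1]."""
    points = parse_points(text, len(targets))
    if points is None:
        return 0.0
    # Mean Euclidean distance between predictions and targets, paired in order.
    dist = mean(math.dist(p, t) for p, t in zip(points, targets))
    return 1.0 + math.exp(-dist / scale)

def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization: center and scale rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

targets = [(100, 200), (300, 400)]
samples = ["(100, 200), (300, 400)", "(120, 210), (310, 380)", "points: 100 200"]
advantages = group_relative_advantages([reward(s, targets) for s in samples])
```

In this sketch, malformed completions receive the lowest reward in the group and therefore a negative advantage, while well-formed completions are ranked by how close their points fall to the recorded fixations.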
Created on 28 Jul. 2025
