In visual selective attention, individual preferences play a crucial role in determining how humans prioritize visual stimuli. By bridging subjective cognitive mechanisms with objective visual elements, individuals regulate how they prioritize dynamic visual scenes, which in turn influences semantic interpretation and hierarchical processing. However, existing models and datasets often overlook the impact of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models typically rely on segmentation approaches that generate saliency heatmaps from low-resolution imagery and then upscale them to native resolution, which limits their ability to capture personalized attention patterns. Moreover, Multimodal Large Language Models (MLLMs) are prone to hallucinations, which makes it hard for them to adhere strictly to the expected output format in tasks involving multiple point predictions, and precise point positioning remains a significant challenge for them. To address these limitations, Subjective Personalized Attention for Advertisement Videos (SPA-ADV) has been introduced: a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants of varying age groups and genders across 486 videos. Additionally, a novel eye-tracking saliency model known as PRE-MAP has been proposed. It characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, is built upon MLLMs, and is guided by Multi-Attribute user profiles to predict Points. To ensure that MLLMs produce prediction points that are both format-correct and spatially accurate, Consistency Group Relative Policy Optimization (C-GRPO) has also been introduced, which takes into account the variability in eye movement points and Multi-Attribute profiles.
Extensive experiments conducted on SPA-ADV and other benchmarks have demonstrated the effectiveness of this approach in addressing the aforementioned challenges. The code and dataset associated with this research are available at the provided URL. The study was authored by Hanbing Wu, Ping Jiang, Anyang Su, Chenxu Zhao, Tianyu Fu, Minghui Wu, Beiping Tan, and Huiying Li. This work represents a significant advancement in personalized gaze prediction within eye-tracking models.
- Individual preferences are crucial in determining how humans prioritize visual stimuli in the realm of visual selective attention.
- Existing models often overlook the impact of subjective cognitive diversity on fixation behavior.
- Conventional saliency prediction models have limitations in capturing personalized attention patterns due to reliance on low-resolution imagery and subsequent upscaling.
- A new approach called SPA-ADV has been introduced, involving a large-scale multimodal dataset and a novel eye-tracking saliency model known as PRE-MAP.
- The PRE-MAP model aims to characterize personalized visual disparities through Reinforcement learning-optimized Eye-tracking and predict format-correct and spatially accurate points guided by Multi-Attribute user profiles.
- C-GRPO has been introduced to enhance MLLMs' performance in producing precise prediction points while considering variability in eye movement points and Multi-Attribute profiles.
- Extensive experiments have demonstrated the effectiveness of these approaches in addressing challenges related to personalized gaze prediction within eye-tracking models.
Summary
- People like different things and that helps them decide what to look at.
- Some ways of understanding how people look at things don't think about how different people think.
- Some computer programs that predict what people will look at have trouble because they use blurry pictures.
- A new way called SPA-ADV uses a big set of data and a special model to help understand how people look at things.
- This new model, PRE-MAP, tries to learn from watching where people look and make better predictions.
Definitions
1. Preferences: Things that someone likes or wants more than others.
2. Models: Ways of representing or understanding something in a simplified way.
3. Saliency: How noticeable or important something is in a visual scene.
4. Multimodal: Involving multiple ways of sensing or perceiving information (like seeing and hearing).
5. Reinforcement learning: A type of learning where a system gets better by receiving feedback on its actions.
6. Eye-tracking: Monitoring and recording where someone looks with their eyes.
7. Prediction: Guessing or estimating what will happen in the future based on current information.
8. Gaze prediction: Trying to figure out where someone will look next based on their past behavior.
9. Variability: The degree to which something can change or be different from one instance to another.
10. Experiments: Tests or trials conducted to gather information and draw conclusions about a specific topic.
Introduction
Visual selective attention is a fundamental cognitive process that allows humans to prioritize relevant information in their environment. It involves the ability to filter out distractions and focus on specific visual stimuli, which is essential for efficient perception and decision-making. However, individual preferences play a crucial role in determining how we allocate our attention to different visual elements. This subjective aspect of selective attention has been largely overlooked in existing models and datasets.
In recent years, there has been a growing interest in understanding the impact of subjective cognitive diversity on fixation behavior. Researchers have attempted to bridge the gap between objective visual elements and subjective cognitive mechanisms by developing models that can capture personalized attention patterns. One such effort is Subjective Personalized Attention for Advertisement Videos (SPA-ADV), introduced by Hanbing Wu et al.
The SPA-ADV Approach
The SPA-ADV approach involves a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants across varying age groups and genders while analyzing 486 videos. This dataset provides valuable insights into how individuals with different backgrounds prioritize visual stimuli when watching advertisement videos.
Consistency Group Relative Policy Optimization (C-GRPO) has been introduced as part of the SPA-ADV approach to further improve how MLLMs (Multimodal Large Language Models) produce precise prediction points while taking into account the variability in eye movement points and Multi-Attribute profiles. The method targets two MLLM weaknesses noted by the authors: hallucinations that break the expected output format in multi-point prediction tasks, and imprecise point positioning.
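The group-relative idea behind GRPO-style training can be sketched as follows. This is a minimal illustration of the generic group-relative advantage used by Group Relative Policy Optimization, not the authors' C-GRPO: the source does not describe how the consistency component is defined, so it is omitted here. The function name, the reward values, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to the other samples in its group.

    Generic GRPO-style normalization: subtract the group mean and divide by
    the group standard deviation. The consistency term of C-GRPO is not
    specified in the source and is NOT modeled here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt
# (for example, the same video frame paired with the same user profile).
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and are reinforced, while below-average responses are discouraged, so no separate value network is needed to provide a baseline.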
The PRE-MAP Model
One key component of the SPA-ADV approach is the novel eye-tracking saliency model known as PRE-MAP, which characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, is built upon MLLMs, and is guided by Multi-Attribute user profiles to predict Points. Traditional saliency prediction models generate heatmaps from low-resolution imagery and upscale them to native resolution; PRE-MAP instead aims to output prediction points that are format-correct and spatially accurate, which is intended to capture personalized attention patterns more precisely.
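The two requirements the abstract places on predicted points, correct format and spatial accuracy, can be illustrated with a small reward function of the kind a reinforcement learning loop might use. This is only a sketch under stated assumptions: the `(x, y)` text format, the function names, and the distance-based score are hypothetical and are not taken from the paper.

```python
import math
import re

# Assumed output format: points written as "(x, y)" with integer pixel values.
POINT_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_points(text, width, height, expected):
    """Return the predicted points, or None if the format is violated."""
    points = [(int(x), int(y)) for x, y in POINT_RE.findall(text)]
    if len(points) != expected:
        return None  # wrong number of points (or none at all)
    if any(not (0 <= x < width and 0 <= y < height) for x, y in points):
        return None  # a point falls outside the frame
    return points


def point_reward(text, targets, width, height):
    """0.0 for malformed output; otherwise 1 minus a normalized distance error."""
    points = parse_points(text, width, height, expected=len(targets))
    if points is None:
        return 0.0
    diagonal = math.hypot(width, height)
    # Mean distance from each ground-truth fixation to its nearest prediction.
    error = sum(min(math.dist(t, p) for p in points) for t in targets) / len(targets)
    return 1.0 - error / diagonal
```

Under this scheme a free-text answer with no parseable coordinates scores zero, which penalizes format violations, while well-formed answers are ranked by how close their points lie to the recorded fixations.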
Multi-Attribute User Profiles
Another important aspect of the SPA-ADV approach is the use of Multi-Attribute user profiles. These profiles describe attributes of the viewer; the dataset itself covers participants of varying age groups and genders. This information is then used to guide the MLLM in predicting fixation points for a given viewer.
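One plausible way to pass such a profile to an MLLM is to render it into the text prompt. The sketch below assumes this design; the `UserProfile` fields, the prompt wording, and the `(x, y)` output format are hypothetical, since the source only states that the dataset spans age groups and genders and that profiles guide the prediction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical profile limited to the two attributes named in the source."""
    age_group: str
    gender: str


def build_prompt(profile, num_points):
    """Render the profile and the task into a single instruction string."""
    return (
        f"Viewer profile: age group {profile.age_group}, gender {profile.gender}. "
        f"Predict {num_points} fixation points for this frame "
        "as (x, y) pixel coordinates."
    )


prompt = build_prompt(UserProfile(age_group="18-25", gender="female"), num_points=3)
```

Keeping the profile as a structured object makes it straightforward to pair the same video frame with different viewers and compare the resulting predictions.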
Evaluation and Results
To evaluate the effectiveness of the SPA-ADV approach, extensive experiments were conducted on both the SPA-ADV dataset and other benchmarks. The reported results indicate that the approach is effective at producing accurate prediction points while accounting for individual preferences and variability in eye movement points.
The researchers also made their code and dataset publicly available, allowing for further research in personalized gaze prediction within eye-tracking models.
Conclusion
In conclusion, the paper by Hanbing Wu et al. introduces an approach that addresses the limitations of traditional saliency prediction models in capturing personalized attention patterns. By combining a large-scale multimodal dataset with a novel eye-tracking saliency model and Multi-Attribute user profiles, the authors report that fixation points can be predicted effectively for individual viewers. The work advances the understanding of how subjective cognitive diversity influences selective attention and may have implications for fields such as marketing, advertising, and human-computer interaction.