MixFormer: End-to-End Tracking with Iterative Mixed Attention

AI-generated keywords: Visual object tracking

AI-generated Key Points

  • MixFormer framework introduces a new approach to visual object tracking combining feature extraction with target information integration
  • Proposes a Mixed Attention Module (MAM) that uses attention operations for simultaneous feature extraction and target information integration
  • Two types of MixFormer trackers: hierarchical tracker called MixCvT and non-hierarchical tracker known as MixViT
  • Various pre-training methods explored including supervised, self-supervised, and masked techniques
  • Asymmetric attention scheme in MAM reduces computational costs when handling multiple target templates during online tracking
  • Effective score prediction module proposed to select high-quality templates
  • MixFormer trackers outperform existing trackers on seven tracking benchmarks

Authors: Yutao Cui, Cheng Jiang, Gangshan Wu, Limin Wang

Extended version of the paper arXiv:2203.11082 presented at CVPR 2022. In particular, the extended MixViT-L achieves an AUC score of 73.3% on LaSOT. Besides, we design a new TrackMAE pre-training method for tracking. Code has been released.
License: CC BY 4.0

Abstract: Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, in this paper, we present a compact tracking framework, termed as MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows to extract target-specific discriminative features and perform extensive communication between target and search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers, a hierarchical tracker MixCvT, and a non-hierarchical tracker MixViT. For these two trackers, we investigate a series of pre-training methods and uncover the different behaviors between supervised pre-training and self-supervised pre-training in our MixFormer trackers. We also extend the masked pre-training to our MixFormer trackers and design the competitive TrackMAE pre-training technique. Finally, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer trackers set a new state-of-the-art performance on seven tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, OTB100 and UAV123. In particular, our MixViT-L achieves AUC score of 73.3% on LaSOT, 86.1% on TrackingNet, EAO of 0.584 on VOT2020, and AO of 75.7% on GOT-10k. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
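The mixed attention idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists, a single head, and identity query/key/value projections for brevity, and the function names `attend` and `mixed_attention` are hypothetical. It also assumes that the asymmetric scheme works by letting target tokens attend only to target tokens while search tokens still attend to both; the abstract only states that the scheme reduces computational cost.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim)])
    return out


def mixed_attention(target, search, asymmetric=False):
    """Attend over the concatenation of target and search tokens.

    Assumption: with asymmetric=True, target queries see only target
    keys/values, so the target output does not depend on the search region.
    """
    mixed = target + search  # concatenate the two token sequences
    search_out = attend(search, mixed, mixed)
    if asymmetric:
        target_out = attend(target, target, target)
    else:
        target_out = attend(target, mixed, mixed)
    return target_out, search_out


if __name__ == "__main__":
    target = [[1.0, 0.0], [0.0, 1.0]]
    search = [[1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
    t_out, s_out = mixed_attention(target, search, asymmetric=True)
    print(len(t_out), len(s_out))
```

Under this assumption, the target-side output is independent of the search region, so it could be computed once per template and reused across frames rather than recomputed for every new search area.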

Submitted to arXiv on 06 Feb. 2023

Results of the summarizing process for the arXiv paper: 2302.02814v2

The MixFormer framework introduces a new approach to visual object tracking that combines feature extraction with target information integration. It uses attention operations in a Mixed Attention Module (MAM) to perform feature extraction and target information integration simultaneously, which allows the extraction of target-specific discriminative features and extensive communication between the target and search area. Two types of MixFormer trackers are instantiated: a hierarchical tracker called MixCvT and a non-hierarchical tracker known as MixViT. The paper explores various pre-training methods for these trackers, including supervised, self-supervised, and masked pre-training techniques. Additionally, an asymmetric attention scheme is devised in MAM to reduce computational cost when handling multiple target templates during online tracking. An effective score prediction module is also proposed to select high-quality templates. MixFormer trackers outperform existing trackers on seven tracking benchmarks, demonstrating state-of-the-art performance in visual object tracking.

In the related work analysis, prevailing tracking methods typically employ a three-stage architecture consisting of a backbone for feature extraction, an integration module for combining target and search region information, and classification/localization heads for determining target states. Siamese-based trackers have gained popularity due to their simplicity and efficiency in modeling appearance similarity between targets and search areas. Another family of trackers, such as CFNet and DiMP, focuses on learning online target-dependent discriminative models through end-to-end training.
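The role of the score prediction module in template selection can be sketched as follows. This is an illustrative sketch only: the `Candidate` type, the `select_template` function, and the 0.5 threshold are hypothetical, and the scores are assumed to be confidence values in [0, 1] produced by some score head. The summary does not specify how the scores are computed or thresholded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    frame_index: int  # frame the candidate template was cropped from
    score: float      # assumed confidence in [0, 1] from a score head


def select_template(candidates: List[Candidate],
                    threshold: float = 0.5) -> Optional[Candidate]:
    """Return the highest-scoring candidate if it clears the threshold.

    Returning None means the tracker keeps its current online template.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    return best if best.score > threshold else None


if __name__ == "__main__":
    window = [Candidate(10, 0.42), Candidate(11, 0.91), Candidate(12, 0.67)]
    chosen = select_template(window)
    print(chosen.frame_index if chosen else "keep current template")
```

Rejecting low-scoring candidates is one simple way to avoid replacing a good template with an occluded or drifted crop.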
Created on 24 Apr. 2024
