BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization
AI-generated Key Points
- The authors propose BEST, a method that improves sign language recognition (SLR) through BERT-style pre-training
- BEST builds on the success of BERT pre-training and models the domain-specific statistics of sign language
- Hand and body movements are organized as pose triplet units and fed into the Transformer backbone in a frame-wise manner
- Pre-training involves reconstructing masked triplet units from corrupted input sequences to learn hierarchical correlation context cues
- Coupling tokenization bridges the semantic gap between low-level, continuous pose units and the discrete, highly semantic tokens that the BERT cross-entropy objective expects, by extracting a discrete pseudo label from each triplet unit
- After pre-training, fine-tuning is performed on downstream SLR tasks with a newly added task-specific layer
- Extensive experiments validate the proposed method, achieving new state-of-the-art performance on four benchmarks
- RGB-based methods and pose-based methods have been studied extensively in sign language recognition
- BEST addresses limitations of existing approaches by leveraging BERT pre-training and incorporating domain-specific statistics through pose triplet units
- The method achieves improved performance compared to previous state-of-the-art methods
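The triplet organization and masking steps in the key points above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 21-joint hands, the 7-joint body, the 2D coordinates, the per-part masking rule, and the zero-valued mask placeholder are all assumptions made for the example.

```python
import random

PARTS = ("left_hand", "right_hand", "body")

def make_triplet_units(left_hand, right_hand, body):
    """Group per-frame poses into one (left hand, right hand, body) unit per frame."""
    if not (len(left_hand) == len(right_hand) == len(body)):
        raise ValueError("all three pose streams must have the same number of frames")
    return [dict(zip(PARTS, frame)) for frame in zip(left_hand, right_hand, body)]

def mask_units(units, mask_ratio, rng):
    """Return a corrupted copy of `units` and the (frame, part) positions that were masked."""
    corrupted, masked = [], []
    for t, unit in enumerate(units):
        new_unit = {}
        for part in PARTS:
            joints = unit[part]
            if rng.random() < mask_ratio:
                # Assumed placeholder: zeros stand in for a mask token.
                new_unit[part] = [(0.0, 0.0)] * len(joints)
                masked.append((t, part))
            else:
                new_unit[part] = list(joints)
        corrupted.append(new_unit)
    return corrupted, masked

# Hypothetical input: 4 frames, 21 joints per hand, 7 body joints, 2D coordinates.
frames = 4
left = [[(0.1, 0.2)] * 21 for _ in range(frames)]
right = [[(0.3, 0.4)] * 21 for _ in range(frames)]
body = [[(0.5, 0.6)] * 7 for _ in range(frames)]

units = make_triplet_units(left, right, body)
corrupted, masked = mask_units(units, mask_ratio=0.5, rng=random.Random(0))
```

The corrupted sequence would be the encoder input during pre-training, and `masked` records which positions the model is asked to reconstruct.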
Authors: Weichao Zhao, Hezhen Hu, Wengang Zhou, Jiaxin Shi, Houqiang Li
Abstract: In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition (SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally located in continuous space, which prevents the direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture/body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.
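To illustrate why a discrete pseudo label makes the BERT cross-entropy objective usable on continuous poses, the sketch below maps a pose vector to the index of its nearest codebook entry and scores predictions only at masked positions. The nearest-codeword lookup is a deliberately simple stand-in, not the paper's coupling tokenizer, and the codebook, function names, and toy values are assumptions made for the example.

```python
import math

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def cross_entropy(logits, target):
    """Negative log-softmax probability of class `target`, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def masked_token_loss(logits_per_unit, pseudo_labels, masked_positions):
    """Average cross-entropy over the masked positions only."""
    if not masked_positions:
        return 0.0
    total = sum(
        cross_entropy(logits_per_unit[i], pseudo_labels[i]) for i in masked_positions
    )
    return total / len(masked_positions)

# Hypothetical 3-entry codebook over 2D pose features.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
pose_features = [[0.9, 1.1], [4.8, 5.3], [0.1, -0.2]]

# Each continuous feature becomes a discrete pseudo label (a class index).
pseudo_labels = [nearest_code(v, codebook) for v in pose_features]

# Toy model outputs: one logit per codebook entry for every unit.
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
loss = masked_token_loss(logits, pseudo_labels, masked_positions=[0, 1])
```

Once every unit has a class index, reconstruction of a masked unit becomes a classification problem over the codebook, which is the form the cross-entropy objective requires.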