HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

AI-generated keywords: Audio Classification, Transformer Model, Self-Attention Mechanisms, Hierarchical Structure, Token-Semantic Module

AI-generated Key Points

Audio classification involves mapping audio samples to their corresponding labels.
Transformer models with self-attention mechanisms have been adopted in this field, but they require large GPU memories and long training time.
HTS-AT is an audio transformer with a hierarchical structure that reduces model size and training time. It is combined with a token-semantic module for audio event detection.
Compared with different benchmark models, a single HTS-AT model achieved a new state-of-the-art mAP of 0.471, up from 0.459 for AST.
Ensembling six HTS-ATs achieved an mAP of 0.487, further outperforming previous models.
HTS-AT achieved new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2.
HTS-AT performed better than previous CNN-based models in event localization while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer.

Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

arXiv: 2202.00874v1 (cs.SD)

Preprint version for ICASSP 2022, Singapore

License: CC BY 4.0

Abstract: Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memories and long training time, meanwhile relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class featuremaps, thus enabling the model for the audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% model parameters and 15% training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.

Submitted to arXiv on 02 Feb. 2022

Results of the summarizing process for the arXiv paper: 2202.00874v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

Audio classification is the task of mapping audio samples to their corresponding labels. Transformer models with self-attention have recently been adopted in this field, but existing audio transformers require large GPU memory and long training times, which limits their scalability in audio tasks. To address these issues, the researchers introduce HTS-AT, an audio transformer with a hierarchical structure that reduces model size and training time. It is combined with a token-semantic module that maps the final outputs into class featuremaps, enabling audio event detection (i.e., localization in time). In Table 1 of their study, the researchers compare HTS-AT with different benchmark models and three ablated variations of their own model: (1) H: hierarchical structure only; (2) HC: hierarchical structure plus token-semantic module; and (3) HCP: HC plus a pretrained vision model (the full setting). The best setting achieves a new state-of-the-art mAP of 0.471 with a single model, up from 0.459 for AST. An ensemble of six HTS-ATs, trained with different random seeds under the same settings, reaches an mAP of 0.487, outperforming AST's 0.475 and 0.485. The researchers evaluated HTS-AT on three audio classification datasets, where it achieved new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2. Additionally, HTS-AT performed better than previous CNN-based models in event localization while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer. Overall, these results demonstrate the high performance and efficiency of HTS-AT as a promising solution for scalable audio classification.
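
The summary does not say how the outputs of the six models are combined. A common approach, assumed here, is to average the per-class probabilities of the models before scoring. The following is a minimal sketch under that assumption; the helper name `ensemble_average` and all numbers are made up for illustration and are not taken from the paper.

```python
from typing import List


def ensemble_average(model_probs: List[List[List[float]]]) -> List[List[float]]:
    """Average per-class probabilities over several models.

    model_probs[m][i][c] is model m's probability for clip i and class c.
    Assumption: simple unweighted averaging; the paper's exact ensembling
    scheme is not stated in this summary.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    num_models = len(model_probs)
    num_clips = len(model_probs[0])
    num_classes = len(model_probs[0][0])
    averaged = []
    for i in range(num_clips):
        row = []
        for c in range(num_classes):
            row.append(sum(m[i][c] for m in model_probs) / num_models)
        averaged.append(row)
    return averaged


# Toy example: 3 models, 2 clips, 2 classes (made-up numbers).
probs = [
    [[0.9, 0.2], [0.1, 0.7]],
    [[0.7, 0.4], [0.3, 0.9]],
    [[0.8, 0.3], [0.2, 0.8]],
]
avg = ensemble_average(probs)
```

Averaging tends to smooth out the seed-dependent errors of individual models, which is one plausible reason an ensemble can score higher than any single member.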

Audio classification is when we listen to sounds and give them names. Some people use special computer programs called transformer models to help with this, but they can take a long time to learn. HTS-AT is a new program that helps us name sounds faster and with less work for the computer. Scientists tested HTS-AT and found it was better than other programs at naming sounds correctly. They even combined six versions of HTS-AT to make it even better! Finally, HTS-AT is really good at finding when a sound starts and stops in a recording, and it doesn't need as much memory or time as other programs.

Definitions
- Audio classification: giving names to different sounds
- Transformer models: special computer programs that help with audio classification
- GPU memories: the amount of space a computer needs to run certain programs
- Training time: the amount of time a program needs to learn how to do something
- mAP: a way of measuring how well a program can name sounds

Introducing HTS-AT: A Scalable Audio Transformer for Classification

Audio classification is the task of mapping audio samples to their corresponding labels. It is used in applications such as speech recognition, music genre identification, and sound event detection. Recently, transformer models with self-attention have been adopted in this field because of their strong performance. However, existing audio transformers require large GPU memory and long training times, which limits their scalability in audio tasks. To address these issues, researchers have introduced HTS-AT (Hierarchical Token-Semantic Audio Transformer): an audio transformer with a hierarchical structure that reduces model size and training time while maintaining high accuracy. The paper evaluates HTS-AT on three audio classification datasets: AudioSet, ESC-50 and Speech Command V2. The results demonstrate the high performance and efficiency of HTS-AT as a promising solution for scalable audio classification.

Background

The transformer architecture was first proposed by Vaswani et al. (2017) for machine translation, using encoder-decoder networks built on self-attention instead of recurrent neural networks (RNNs). It has since been widely applied to natural language processing (NLP) tasks such as text summarization and question answering, where it outperforms RNNs. In recent years it has also been adapted to computer vision tasks such as image captioning and object detection, where it has achieved state-of-the-art results on several benchmark datasets. More recently, transformers have been applied to audio classification, where they show great potential but are limited by a large memory footprint and long training times, which make them difficult to scale to larger datasets or more complex models. To address this, researchers have proposed hierarchical transformers (HTs), which reduce the number of parameters while maintaining high accuracy on image benchmarks such as ImageNet or CIFAR-10.
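
To illustrate how a hierarchical design can shrink the workload, the sketch below tracks the token grid through successive stages. It assumes that each stage boundary merges 2x2 neighboring tokens (quartering the token count) and doubles the channel width. The function name `stage_shapes`, the starting grid of 64x64 tokens, the width of 96 and the four stages are illustrative assumptions, not values taken from this summary.

```python
def stage_shapes(height, width, dim, num_stages):
    """Return (height, width, dim, tokens) for each stage of a hierarchical transformer.

    Assumption: between stages, 2x2 neighboring tokens are merged into one
    and the channel dimension is doubled.
    """
    shapes = []
    for stage in range(num_stages):
        shapes.append((height, width, dim, height * width))
        if stage < num_stages - 1:
            if height % 2 or width % 2:
                raise ValueError("grid must be divisible by 2 to merge")
            height, width, dim = height // 2, width // 2, dim * 2
    return shapes


shapes = stage_shapes(64, 64, 96, 4)
for h, w, d, n in shapes:
    print(f"{h}x{w} grid, tokens={n}, dim={d}")
```

Because the cost of global self-attention grows with the square of the number of tokens, the later stages in this sketch operate on far fewer tokens than a transformer that keeps the full sequence throughout.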

HTS-AT Model Overview

In this study, the researchers introduce HTS-AT, a hierarchical transformer designed for efficient yet accurate audio classification. It combines two components: a hierarchical transformer module (HTM) and a token-semantic module (TSM). The HTM embeds the input spectrogram into patch tokens and passes them through four groups of transformer blocks, with patch-merge layers between the groups that reduce the number of tokens as the network deepens. Each transformer block consists of attention and feed-forward sublayers with residual connections and layer normalization. The TSM is a convolutional layer applied to the final transformer outputs that maps them into class featuremaps, which enables event localization in the time domain. Pretraining on visual data from ImageNet is a separate ingredient, used only in the full setting (HCP).
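
Once a model produces a class featuremap, that is, one score per time frame and class, events can be localized in time by thresholding the scores. The sketch below shows one simple way to do this; it is not the paper's procedure. The function name `find_events`, the threshold of 0.5, the frame duration and the scores are all assumptions made for illustration.

```python
def find_events(frame_scores, threshold=0.5, frame_seconds=1.0):
    """Turn per-frame scores for one class into (onset, offset) times in seconds.

    A frame is active when its score is >= threshold; consecutive active
    frames are merged into a single event.
    """
    events = []
    start = None
    for i, score in enumerate(frame_scores):
        active = score >= threshold
        if active and start is None:
            start = i  # event begins at this frame
        elif not active and start is not None:
            events.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # event runs until the end of the clip
        events.append((start * frame_seconds, len(frame_scores) * frame_seconds))
    return events


# Made-up scores for one class over seven frames of 0.5 s each.
scores = [0.1, 0.7, 0.9, 0.2, 0.1, 0.6, 0.8]
events = find_events(scores, threshold=0.5, frame_seconds=0.5)
```

A clip-level prediction can still be obtained from the same featuremap, for example by averaging the frame scores of each class, so classification and localization share one output.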

Experimental Results

The researchers evaluated HTS-AT on three datasets (AudioSet, ESC-50 and Speech Command V2), achieving new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2. They compared it with baseline models, including AST and CNN-based models, and with three ablated variations of their own model: hierarchical structure only; hierarchical structure plus token-semantic module; and the full setting with a pretrained vision model. They also ensembled six models trained under the same settings. The best single model achieved an mAP of 0.471, better than AST's 0.459, while the ensemble of six achieved an mAP of 0.487, outperforming AST's 0.475 and 0.485. Additionally, HTS-AT performed better than previous CNN-based models in event localization while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer, making it an efficient and effective solution for scalable audio classification.
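
For readers unfamiliar with the mAP metric quoted above, the following is a minimal sketch of mean average precision for multi-label classification: average precision is computed per class from the ranked scores and then averaged over classes. The function names and toy data are illustrative; published evaluations typically rely on a library implementation, and tie handling may differ from this sketch.

```python
def average_precision(labels, scores):
    """AP for one class: mean of the precision at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("class has no positive examples")
    return sum(precisions) / len(precisions)


def mean_average_precision(label_matrix, score_matrix):
    """label_matrix[i][c] and score_matrix[i][c] refer to clip i and class c."""
    num_classes = len(label_matrix[0])
    aps = []
    for c in range(num_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        aps.append(average_precision(labels, scores))
    return sum(aps) / num_classes


# Toy data: 4 clips, 2 classes (made-up numbers).
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.1], [0.8, 0.7], [0.6, 0.9], [0.2, 0.3]]
map_value = mean_average_precision(labels, scores)
```

In this toy example the first class has one negative clip ranked above a positive one, so its average precision falls below 1, while the second class is ranked perfectly.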

Conclusion

The paper presented an evaluation of HTS-AT (Hierarchical Token-Semantic Audio Transformer), a model designed for efficient yet accurate audio classification, on three publicly available datasets: AudioSet, ESC-50 and Speech Command V2. It achieved new state-of-the-art results on AudioSet and ESC-50 and equaled the state of the art on Speech Command V2, while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer. This makes it an effective solution for scalable audio classification.

Created on 03 May. 2023
