Audio classification is a crucial task that involves mapping audio samples to their corresponding labels. The transformer model with self-attention mechanisms has recently been adopted in this field, but existing audio transformers require large amounts of GPU memory and long training times, limiting their scalability in audio tasks. To address these issues, researchers have introduced HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is combined with a token-semantic module that maps final outputs into class feature maps, enabling the model to perform audio event detection (i.e., localization in time). In Table 1 of their study, the researchers compared HTS-AT with different benchmark models and three self-ablated variations: (1) H: hierarchical structure only; (2) HC: hierarchical structure and token-semantic module; and (3) HCP: (2) with a pretrained vision model (the full setting). Their best setting achieved a new state-of-the-art mAP of 0.471 for a single model, improving on AST's 0.459. They also ensembled six HTS-ATs trained with different random seeds in the same settings to achieve an mAP of 0.487, outperforming AST's 0.475 and 0.485. The researchers evaluated HTS-AT on three audio classification datasets, where it achieved new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2. Additionally, HTS-AT performed better than previous CNN-based models in event localization while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer. Overall, these results demonstrate the high performance and efficiency of HTS-AT as a promising solution for scalable audio classification tasks.
- Audio classification involves mapping audio samples to their corresponding labels.
- Transformer models with self-attention mechanisms have been adopted in this field, but existing audio transformers require large amounts of GPU memory and long training times.
- HTS-AT is an audio transformer with a hierarchical structure that reduces model size and training time. It is combined with a token-semantic module for audio event detection.
- Compared with different benchmark models, HTS-AT achieved a new state-of-the-art mAP of 0.471 for a single model, improving on AST's 0.459.
- Ensembling six HTS-ATs achieved an mAP of 0.487, outperforming AST's 0.475 and 0.485.
- HTS-AT achieved new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2.
- HTS-AT performed better than previous CNN-based models in event localization while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer.
Audio classification is when we listen to sounds and give them names. Some people use special computer programs called transformer models to help with this, but they can take a long time to learn. HTS-AT is a new program that helps us name sounds faster and with less work for the computer. Scientists tested HTS-AT and found it was better than other programs at naming sounds correctly. They even combined six versions of HTS-AT to make it even better! Finally, HTS-AT is really good at finding when a sound happens in a recording, and it doesn't need as much memory or time as other programs.
Definitions
- Audio classification: giving names to different sounds
- Transformer models: special computer programs that help with audio classification
- GPU memories: the amount of space a computer needs to run certain programs
- Training time: the amount of time a program needs to learn how to do something
- mAP: a way of measuring how well a program can name sounds
Introducing HTS-AT: A Scalable Audio Transformer for Classification
Audio classification is a crucial task that involves mapping audio samples to their corresponding labels. It has been used in various applications such as speech recognition, music genre identification, and sound event detection. Recently, the transformer model with self-attention mechanisms has been adopted in this field due to its strong performance. However, existing audio transformers require large GPU memories and long training time, limiting their scalability in audio tasks.
To address these issues, researchers have introduced HTS-AT (Hierarchical Token-Semantic Audio Transformer): an audio transformer with a hierarchical structure that reduces the model size and training time while maintaining high accuracy. The study evaluates HTS-AT on three audio classification datasets: AudioSet, ESC-50, and Speech Command V2. The results demonstrate the high performance and efficiency of HTS-AT as a promising solution for scalable audio classification tasks.
Background
The transformer architecture was first proposed by Vaswani et al. (2017) for machine translation, using encoder-decoder networks built on self-attention mechanisms instead of recurrent neural networks (RNNs). Since then, it has been widely applied to natural language processing (NLP) tasks such as text summarization and question answering because of its superior performance over RNNs. In recent years, it has also been adapted to computer vision tasks such as image captioning and object detection, where it has achieved state-of-the-art results on several benchmark datasets.
Recently, there have been attempts to apply transformers to audio classification. They have shown great potential, but their large memory footprint and long training times make them difficult to scale to larger datasets or more complex models. To address this issue, researchers have proposed hierarchical transformers (HTs), which reduce the number of parameters while still maintaining high accuracy on benchmark datasets such as ImageNet or CIFAR-10.
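One way to see where the memory footprint comes from: full self-attention compares every token with every other token, so the score matrix grows with the square of the sequence length. The sketch below is only an illustration under stated assumptions. The token count (1024), the number of stages (4), and the merge factor (4) are hypothetical values, not figures from the study, and the helper names are invented for this example.

```python
from typing import List


def attention_scores(num_tokens: int) -> int:
    """Number of entries in one full self-attention score matrix (tokens x tokens)."""
    return num_tokens * num_tokens


def hierarchical_token_counts(initial_tokens: int, stages: int, merge_factor: int) -> List[int]:
    """Token count at each stage when `merge_factor` neighboring tokens are merged between stages."""
    counts = [initial_tokens]
    for _ in range(stages - 1):
        counts.append(counts[-1] // merge_factor)
    return counts


# Hypothetical numbers, chosen only for illustration.
counts = hierarchical_token_counts(initial_tokens=1024, stages=4, merge_factor=4)
per_stage = [attention_scores(n) for n in counts]

print(counts)     # tokens per stage
print(per_stage)  # attention score entries per stage
```

With these assumed values, only the first stage pays the full quadratic cost; each later stage works on a quarter of the tokens and therefore a sixteenth of the score entries.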
HTS-AT Model Overview
In this study, the researchers introduce HTS-AT, a hierarchical transformer designed for efficient yet accurate audio classification. It combines two components: a hierarchical transformer module (HTM) and a token-semantic module (TSM), which maps the final outputs into class feature maps and thereby enables event localization in the time domain. The HTM embeds the input spectrogram as a sequence of patch tokens and passes them through four groups of transformer blocks. Tokens are merged between groups, so later groups attend over shorter sequences, which is a main source of the savings in memory and training time. Each block combines multi-head self-attention with residual connections and layer normalization. The TSM is a convolutional layer applied to the HTM's output tokens. In the full setting, the model is additionally initialized from a vision model pretrained on ImageNet.
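The idea of mapping final outputs into class feature maps can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes one output token per time step and replaces the module with a plain per-frame linear projection followed by a sigmoid. All function names, weights, and the 0.5 threshold are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def class_feature_map(tokens: List[List[float]],
                      weights: List[List[float]],
                      bias: List[float]) -> List[List[float]]:
    """Project each time-step token (length D) to C class scores in [0, 1].

    Returns T rows (one per time step), each with C values.
    """
    fmap = []
    for tok in tokens:
        row = []
        for w_c, b_c in zip(weights, bias):
            logit = sum(w * x for w, x in zip(w_c, tok)) + b_c
            row.append(sigmoid(logit))
        fmap.append(row)
    return fmap


def clip_prediction(fmap: List[List[float]]) -> List[float]:
    """Average the class feature map over time to get clip-level scores."""
    num_steps = len(fmap)
    num_classes = len(fmap[0])
    return [sum(fmap[t][c] for t in range(num_steps)) / num_steps
            for c in range(num_classes)]


def active_frames(fmap: List[List[float]], class_index: int,
                  threshold: float = 0.5) -> List[int]:
    """Time steps where one class exceeds the threshold (event localization)."""
    return [t for t, row in enumerate(fmap) if row[class_index] > threshold]


# Toy input: 4 time steps, token dimension 2, 2 classes (all values hypothetical).
tokens = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights = [[3.0, 0.0], [0.0, 3.0]]
bias = [-3.0, -3.0]

fmap = class_feature_map(tokens, weights, bias)
clip = clip_prediction(fmap)
print(active_frames(fmap, 0))  # time steps where class 0 is active
print(active_frames(fmap, 1))  # time steps where class 1 is active
```

The same map serves both tasks: averaging over time gives a clip-level classification, while reading it frame by frame shows when each event occurs.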
Experimental Results
The researchers evaluated HTS-AT on three datasets (AudioSet, ESC-50, and Speech Command V2), achieving new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2. They compared HTS-AT against benchmark models, including AST and CNN-based models, and against three self-ablated variations: hierarchical structure only; hierarchical structure plus token-semantic module; and the full setting with a pretrained vision model. Their best single model achieved an mAP of 0.471, compared with 0.459 for AST, while an ensemble of six HTS-AT models trained with the same settings but different random seeds achieved an mAP of 0.487, outperforming AST's 0.475 and 0.485. Additionally, HTS-AT performed better than previous CNN-based models in event localization, and it requires only 35% of the model parameters and 15% of the training time of the previous audio transformer, making it an efficient and effective solution for scalable audio classification.
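To make the metric and the ensembling step concrete, here is a minimal sketch of mean average precision (mAP) and of averaging the predictions of several models. It assumes the common non-interpolated definition of average precision and a simple element-wise mean for the ensemble; the study's exact evaluation code is not shown in this article. The function names and toy scores are hypothetical.

```python
from typing import List


def average_precision(scores: List[float], labels: List[int]) -> float:
    """AP for one class: mean of the precision at each rank where a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix: List[List[float]],
                           label_matrix: List[List[int]]) -> float:
    """Average the per-class AP over all classes; matrices are indexed [clip][class]."""
    num_classes = len(score_matrix[0])
    aps = [
        average_precision([row[c] for row in score_matrix],
                          [row[c] for row in label_matrix])
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


def ensemble(predictions: List[List[List[float]]]) -> List[List[float]]:
    """Element-wise mean of the score matrices produced by several models."""
    n = len(predictions)
    num_clips = len(predictions[0])
    num_classes = len(predictions[0][0])
    return [[sum(model[i][c] for model in predictions) / n
             for c in range(num_classes)]
            for i in range(num_clips)]


# Toy data: 3 clips, 2 classes, 2 models (all values hypothetical).
labels = [[1, 0], [0, 1], [1, 0]]
model_a = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.1]]
model_b = [[0.6, 0.3], [0.1, 0.4], [0.7, 0.5]]

map_a = mean_average_precision(model_a, labels)
map_b = mean_average_precision(model_b, labels)
map_ens = mean_average_precision(ensemble([model_a, model_b]), labels)
print(map_a, map_b, map_ens)
```

In this toy case the two models make different ranking mistakes, so averaging their scores ranks every positive first. That is the intuition behind ensembling models trained with different random seeds, although an ensemble is not guaranteed to improve on its members.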
Conclusion
The study evaluated HTS-AT (Hierarchical Token-Semantic Audio Transformer), a hierarchical transformer designed for efficient yet accurate audio classification, on three publicly available datasets: AudioSet, ESC-50, and Speech Command V2. It achieved new state-of-the-art results on AudioSet and ESC-50 and equaled the state of the art on Speech Command V2, while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results make it a promising solution for scalable audio classification.