HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

AI-generated keywords: Audio Classification, Transformer Model, Self-Attention Mechanisms, Hierarchical Structure, Token-Semantic Module

AI-generated Key Points

  • Audio classification involves mapping audio samples to their corresponding labels.
  • Transformer models with self-attention mechanisms have been adopted in this field, but existing audio transformers require large amounts of GPU memory and long training times.
  • HTS-AT is an audio transformer with a hierarchical structure that reduces model size and training time. It is combined with a token-semantic module for audio event detection.
  • The researchers compared HTS-AT with several benchmark models. Their best single model achieved a new state-of-the-art mAP of 0.471, up from 0.459 for AST.
  • Ensembling six HTS-ATs achieved an mAP of 0.487, further outperforming previous models.
  • HTS-AT achieved new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2.
  • HTS-AT performed better than previous CNN-based models in event localization while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer.

Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Preprint version for ICASSP 2022, Singapore
License: CC BY 4.0

Abstract: Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memories and long training time, meanwhile relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class featuremaps, thus enabling the model for the audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% model parameters and 15% training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
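
The abstract describes mapping final outputs into class featuremaps so that events can be localized in time. The following is a minimal sketch of how such a featuremap could be consumed, not the authors' implementation. It assumes the featuremap is a list of frames, each holding one score per class, that clip-level scores are obtained by averaging over time, and that an event is any run of frames at or above a fixed threshold. The function names and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple


def clip_level_scores(featuremap: List[List[float]]) -> List[float]:
    """Average each class's score over all time frames (assumed pooling)."""
    num_frames = len(featuremap)
    num_classes = len(featuremap[0])
    return [
        sum(frame[c] for frame in featuremap) / num_frames
        for c in range(num_classes)
    ]


def localize_events(
    featuremap: List[List[float]],
    class_index: int,
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges where the class is active."""
    segments = []
    start = None
    for t, frame in enumerate(featuremap):
        active = frame[class_index] >= threshold
        if active and start is None:
            start = t  # an event begins at this frame
        elif not active and start is not None:
            segments.append((start, t))  # the event ended at the previous frame
            start = None
    if start is not None:
        # The event is still active at the end of the clip.
        segments.append((start, len(featuremap)))
    return segments


# Toy featuremap: 5 time frames, 2 classes.
toy_featuremap = [
    [0.1, 0.9],
    [0.8, 0.9],
    [0.7, 0.2],
    [0.1, 0.1],
    [0.9, 0.1],
]
print(clip_level_scores(toy_featuremap))
print(localize_events(toy_featuremap, class_index=0))
```

A per-frame, per-class output of this kind is what makes localization possible: a model that emits only one score per clip can say whether an event occurs but not when.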

Submitted to arXiv on 02 Feb. 2022

Results of the summarizing process for the arXiv paper: 2202.00874v1

Audio classification is the task of mapping audio samples to their corresponding labels. The transformer model with self-attention mechanisms has recently been adopted in this field, but existing audio transformers require large amounts of GPU memory and long training times, limiting their scalability in audio tasks. To address these issues, the researchers introduce HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is combined with a token-semantic module that maps final outputs into class featuremaps, enabling the model to perform audio event detection (i.e., localization in time).

In Table 1 of their study, the researchers compared HTS-AT with different benchmark models and three self-ablated variations: (1) H: only the hierarchical structure; (2) HC: the hierarchical structure plus the token-semantic module; and (3) HCP: setting (2) plus a pretrained vision model (the full setting). Their best setting achieved a new state-of-the-art mAP of 0.471 with a single model, which they report as a significant improvement over the 0.459 achieved by AST. They also ensembled six HTS-ATs trained with different random seeds under the same settings to reach an mAP of 0.487, surpassing AST's reported ensemble scores of 0.475 and 0.485.

The researchers evaluated HTS-AT on three audio classification datasets, where it achieved new state-of-the-art results on AudioSet and ESC-50 while equaling the state of the art on Speech Command V2. Additionally, HTS-AT performed better than previous CNN-based models in event localization while requiring only 35% of the model parameters and 15% of the training time of the previous audio transformer. Overall, these results demonstrate the high performance and efficiency of HTS-AT as a promising solution for scalable audio classification tasks.
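
For readers unfamiliar with the metric, the sketch below shows one common way to compute mean average precision (mAP) for multi-label predictions and to ensemble several models. It is an illustration under stated assumptions, not the evaluation code used in the paper: it assumes the ensemble averages the per-class scores of its members, it macro-averages AP over classes, and it applies no special handling to tied scores. All function names are hypothetical.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one class: mean of the precision measured at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0.0 here (a simplifying assumption).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    scores: List[List[float]], labels: List[List[int]]
) -> float:
    """Macro-average AP over classes; inputs are indexed [clip][class]."""
    num_classes = len(scores[0])
    aps = [
        average_precision([row[c] for row in scores], [row[c] for row in labels])
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


def ensemble_average(model_scores: List[List[List[float]]]) -> List[List[float]]:
    """Average scores over models; input [model][clip][class] -> [clip][class]."""
    num_models = len(model_scores)
    num_clips = len(model_scores[0])
    num_classes = len(model_scores[0][0])
    return [
        [
            sum(model[i][c] for model in model_scores) / num_models
            for c in range(num_classes)
        ]
        for i in range(num_clips)
    ]


# Toy data: 3 clips, 2 classes, 2 models (stand-ins for differently seeded runs).
labels = [[1, 0], [0, 1], [1, 1]]
model_a = [[0.9, 0.2], [0.3, 0.8], [0.4, 0.6]]
model_b = [[0.6, 0.4], [0.7, 0.9], [0.8, 0.3]]
combined = ensemble_average([model_a, model_b])
print(mean_average_precision(model_a, labels))
print(mean_average_precision(combined, labels))
```

Averaging over models trained with different seeds tends to smooth out run-to-run variation, which is one plausible reason an ensemble can score higher than any single member.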
Created on 03 May. 2023
