This paper explores the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN). The CRNN model, previously proposed for localizing and detecting stationary sources, is adapted to enable the spatial tracking of moving sources when trained with dynamic scenes. The performance of the CRNN is compared with a stand-alone tracking method that combines a multi-source estimator and a particle filter. The study evaluates both methods in various acoustic conditions, including anechoic and reverberant scenarios, stationary and moving sources at different angular velocities, and varying numbers of overlapping sources. The results show that the CRNN tracks multiple sources more consistently than the parametric method across the different acoustic scenarios, but at the cost of higher localization error. The parametric method exhibits improved direction-of-arrival (DOA) estimation when combined with a temporal particle filter tracker but suffers from lower frame recall. Further analysis reveals that using maximum likelihood estimation instead of reference information for source number estimation reduces the overall performance of the parametric approach. This reduction is particularly evident in the reverberant and moving-source datasets, highlighting the need for more robust source detection and counting schemes. Overall, the CRNN outperforms the parametric method in tracking consistency but localizes less accurately, whereas the parametric method achieves better DOA estimation with the particle filter tracker at the price of a significant drop in frame recall in certain scenarios. These findings emphasize the importance of considering both tracking consistency and localization accuracy when choosing between the two methods.
In conclusion, this study demonstrates that, by leveraging recurrent layers within a CRNN architecture, it is possible to achieve effective tracking of multiple sound sources. However, further improvements are needed to enhance localization accuracy and robustness in challenging acoustic conditions.
- This paper explores joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN).
- The CRNN model is adapted to enable spatial tracking of moving sources when trained with dynamic scenes.
- Performance of the CRNN is compared with a stand-alone tracking method that combines a multi-source estimator and a particle filter.
- Experiments evaluate performance in various acoustic conditions including anechoic and reverberant scenarios, stationary and moving sources at different angular velocities, and varying numbers of overlapping sources.
- The CRNN consistently tracks multiple sources more effectively than the parametric method across different acoustic scenarios but has higher localization error.
- The parametric method achieves improved direction-of-arrival (DOA) estimation when combined with a temporal particle filter tracker but suffers from lower frame recall.
- Using maximum likelihood estimation instead of reference information for source number estimation reduces the overall performance of the parametric approach, particularly in reverberant and moving source scenarios.
- Both methods have trade-offs: CRNN outperforms in consistent tracking but struggles with accurate localization, while the parametric method achieves better DOA estimation but has decreased frame recall in certain scenarios.
- Consideration should be given to both tracking consistency and localization accuracy when choosing between these two methods.
- Recurrent layers within a CRNN architecture can achieve effective tracking of multiple sound sources, but further improvements are needed for localization accuracy and robustness in challenging acoustic conditions.
This paper is about using a special computer program to listen to and track different sounds. The program is called a convolutional recurrent neural network (CRNN). The researchers tested the CRNN by comparing it to another method that also tracks sounds. They did experiments in different sound conditions, like when there was lots of echo or when the sounds were moving. The CRNN was better at tracking multiple sounds, but it wasn't as good at knowing exactly where the sounds were coming from. The other method was better at knowing where the sounds were coming from, but sometimes it missed some of the sounds. It's important to think about both tracking and knowing where the sounds are coming from when choosing which method to use. The researchers think that more improvements can be made to make these methods even better.
Definitions
- Joint localization: Figuring out where a sound is coming from at the same time as detecting and tracking it.
- Detection: Finding something or noticing something.
- Tracking: Following something or keeping an eye on something.
- Convolutional recurrent neural network (CRNN): A type of computer program that can learn and understand patterns in sound.
- Spatial tracking: Tracking things that are moving around in space.
- Stand-alone: By itself, without any help.
- Estimator: A tool or method for making guesses or predictions.
- Particle filter: A way of estimating or guessing where things are based on small pieces of information.
- Acoustic conditions: Different situations involving sound, like how loud it is or if there's an echo.
Joint Localization, Detection and Tracking of Sound Events Using a Convolutional Recurrent Neural Network
Sound event localization, detection, and tracking are important tasks in audio signal processing. In this paper, we explore how a convolutional recurrent neural network (CRNN) can be used to jointly localize, detect, and track sound events. We compare the performance of the CRNN model with a stand-alone tracking method that combines a multi-source estimator and a particle filter. The experiments cover various acoustic conditions, including anechoic and reverberant scenarios, stationary and moving sources at different angular velocities, and varying numbers of overlapping sources.
Background
The CRNN model was previously proposed for localizing and detecting stationary sources. It consists of convolutional layers, which extract features from the multichannel audio input, followed by recurrent layers, which model temporal context to improve source localization. For this study, we adapted the CRNN model to enable spatial tracking of moving sources when trained with dynamic scenes. This adaptation allows us to evaluate its performance across multiple acoustic conditions in which traditional methods may struggle, either because they do not model temporal information or because they have limited robustness to environmental noise.
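The data flow described above (convolutional feature extraction per frame, followed by a recurrent layer that carries context across frames) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the architecture used in the study: the layer sizes, the random untrained weights, the single-source output heads (an activity probability and a Cartesian DOA vector), and the names `conv1d_relu`, `rnn_step`, `crnn_forward`, and `random_params` are all hypothetical.

```python
import math
import random

def conv1d_relu(frame, kernels):
    """Valid 1-D convolution along the feature axis, ReLU, then max-pool
    so that each kernel yields one value per frame."""
    pooled = []
    for kernel in kernels:
        k = len(kernel)
        responses = [
            max(0.0, sum(kernel[j] * frame[i + j] for j in range(k)))
            for i in range(len(frame) - k + 1)
        ]
        pooled.append(max(responses))
    return pooled

def rnn_step(x, h, w_in, w_rec):
    """One Elman recurrence: h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(w_in[n], x))
            + sum(w * v for w, v in zip(w_rec[n], h))
        )
        for n in range(len(h))
    ]

def crnn_forward(frames, params):
    """Map T feature frames to T (activity, doa_xyz) pairs."""
    h = [0.0] * len(params["w_rec"])  # hidden state carried across frames
    outputs = []
    for frame in frames:
        x = conv1d_relu(frame, params["kernels"])
        h = rnn_step(x, h, params["w_in"], params["w_rec"])
        # Assumed detection head: sigmoid activity probability.
        logit = sum(w * v for w, v in zip(params["w_det"], h))
        activity = 1.0 / (1.0 + math.exp(-logit))
        # Assumed localization head: DOA as (x, y, z), each in [-1, 1].
        doa = [math.tanh(sum(w * v for w, v in zip(row, h)))
               for row in params["w_doa"]]
        outputs.append((activity, doa))
    return outputs

def random_params(n_kernels=4, kernel_size=3, n_hidden=8, seed=0):
    """Random (untrained) weights, for shape illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "kernels": mat(n_kernels, kernel_size),
        "w_in": mat(n_hidden, n_kernels),
        "w_rec": mat(n_hidden, n_hidden),
        "w_det": mat(1, n_hidden)[0],
        "w_doa": mat(3, n_hidden),
    }
```

Because the hidden state `h` is passed from one frame to the next, the output at each frame depends on earlier frames as well as the current one. That temporal dependence is the property the adaptation to moving sources relies on.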
Experimental Setup
We evaluated both methods on two datasets. The first contains simulated anechoic recordings generated with image source method (ISM) simulations. The second contains real-world recordings captured in reverberant environments using binaural microphones mounted on a robotic platform equipped with loudspeakers, which generate dynamic sound scenes with up to four simultaneous moving sound sources at angular velocities ranging from 0°/s to 90°/s. The results were evaluated in terms of direction-of-arrival (DOA) estimation accuracy and frame recall, which measures the fraction of frames in each recording that contain correctly localized events.
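To make the two evaluation measures concrete, the sketch below computes a mean DOA error and a frame recall for a single-source track. It rests on assumptions that the text does not specify: each frame holds at most one source, a frame counts as correct when the angular error is within a hypothetical 20° threshold (or when both reference and estimate are silent), and the function names are illustrative only.

```python
import math

def angular_error_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions, each given
    as azimuth and elevation in degrees."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def evaluate(reference, estimated, threshold_deg=20.0):
    """reference / estimated: one entry per frame, either (az, el) in
    degrees or None when no source is active.

    Returns (mean DOA error over frames where both are active,
             fraction of frames counted as correct)."""
    errors = []
    correct_frames = 0
    for ref, est in zip(reference, estimated):
        if ref is None and est is None:
            correct_frames += 1          # both silent: counted as correct
        elif ref is not None and est is not None:
            err = angular_error_deg(*ref, *est)
            errors.append(err)
            if err <= threshold_deg:     # assumed correctness threshold
                correct_frames += 1
        # Missed or spurious detections count as incorrect frames.
    mean_error = sum(errors) / len(errors) if errors else float("nan")
    return mean_error, correct_frames / len(reference)

reference = [(0, 0), (10, 0), None, (30, 0)]
estimated = [(5, 0), (40, 0), None, None]
mean_error, frame_recall = evaluate(reference, estimated)
```

In this toy example the two frames where both tracks are active have errors of 5° and 30°, so the mean DOA error is 17.5°, and two of the four frames are counted as correct, giving a frame recall of 0.5. The example also shows why the two measures can disagree: a method can have low DOA error on the frames it detects while still missing frames.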
Results
The results show that the CRNN consistently tracks multiple sources more effectively than the parametric method across different acoustic scenarios, while exhibiting higher localization error than its counterpart. The parametric method achieves better DOA estimation when combined with a temporal particle filter tracker but suffers from lower frame recall in certain scenarios, such as those involving reverberation or multiple moving sound sources at high angular velocities (>45°/s). Further analysis reveals that using maximum likelihood estimation instead of reference information for source number estimation reduces overall performance. The reduction is particularly evident in the reverberant and moving-source datasets, highlighting the need for more robust source detection and counting schemes.
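As a rough illustration of how a temporal particle filter can refine frame-wise DOA estimates, the sketch below tracks the azimuth of a single moving source with a bootstrap particle filter. It is not the tracker used in the study: the constant-velocity motion model, the Gaussian measurement likelihood, the noise parameters, the frame period `dt`, and the function names are all assumptions chosen for the example, and multi-source data association is left out entirely.

```python
import math
import random

def wrap_deg(angle):
    """Wrap an angle in degrees to [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def circular_mean_deg(angles, weights):
    """Weighted circular mean of angles given in degrees."""
    s = sum(w * math.sin(math.radians(a)) for a, w in zip(angles, weights))
    c = sum(w * math.cos(math.radians(a)) for a, w in zip(angles, weights))
    return math.degrees(math.atan2(s, c))

def track_azimuth(measurements, dt=0.1, n_particles=500, meas_std=10.0,
                  az_std=1.0, vel_std=5.0, seed=0):
    """Bootstrap particle filter over (azimuth, angular velocity).

    measurements: noisy per-frame azimuth estimates in degrees.
    Returns one filtered azimuth estimate per frame."""
    rng = random.Random(seed)
    # Initialise around the first measurement; velocity is unknown.
    az = [wrap_deg(measurements[0] + rng.gauss(0.0, meas_std))
          for _ in range(n_particles)]
    vel = [rng.uniform(-90.0, 90.0) for _ in range(n_particles)]
    estimates = []
    for z in measurements:
        # Predict: constant-velocity motion plus process noise.
        vel = [v + rng.gauss(0.0, vel_std) for v in vel]
        az = [wrap_deg(a + v * dt + rng.gauss(0.0, az_std))
              for a, v in zip(az, vel)]
        # Update: Gaussian likelihood of the wrapped angular difference.
        weights = [math.exp(-0.5 * (wrap_deg(z - a) / meas_std) ** 2) + 1e-300
                   for a in az]
        total = sum(weights)
        weights = [w / total for w in weights]
        estimates.append(circular_mean_deg(az, weights))
        # Resample particles in proportion to their weights.
        idx = rng.choices(range(n_particles), weights=weights, k=n_particles)
        az = [az[i] for i in idx]
        vel = [vel[i] for i in idx]
    return estimates
```

The filter combines each new measurement with a prediction carried over from earlier frames, which is the mechanism by which temporal tracking is meant to smooth frame-wise DOA estimates. A tracker of this kind only follows sources that the upstream estimator actually reports, so it does not by itself recover frames in which a source was missed.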
Conclusion
In conclusion, this study demonstrates that, by leveraging recurrent layers within a CRNN architecture, it is possible to achieve effective tracking of multiple sound sources while maintaining consistent performance across different acoustic conditions. However, further improvements are needed to enhance localization accuracy and robustness, especially under challenging conditions such as reverberation or sources moving at high angular velocity.