Personalized Automatic Speech Recognition Trained on Small Disordered Speech Datasets

AI-generated keywords: Personalized ASR, Disordered Speech, Adaptation Data, Word Error Rate, Practical Approach

AI-generated Key Points

Study focuses on personalized automatic speech recognition (ASR) for recognizing disordered speech using small amounts of per-speaker adaptation data
Lack of available speech data has been a major challenge in adapting speaker-independent ASR systems for dysarthric speech
Researchers trained personalized models for 195 individuals with different types and severities of speech impairment
Training sets varied in size from less than one minute to 18-20 minutes of speech data per speaker
Word error rate (WER) thresholds were used to determine the Success Percentage, representing the percentage of personalized models that achieved the target WER in different application scenarios
In the home automation scenario, 79% of speakers reached the target WER when trained with 18-20 minutes of speech data, and even with only 3-4 minutes of data, 63% still reached the target WER
Performance on test sets containing conversational and out-of-domain unprompted phrases showed similar improvements
Personalized ASR can benefit individuals with disordered speech even with just a few minutes of recordings, which is significant as recording large amounts of samples per speaker is often impractical and challenging for people with speech impairments
Previous studies required hours of recorded speech data per speaker for substantial WER improvements, whereas this study shows promising results using significantly smaller amounts of adaptation data
Research highlights the potential and feasibility of personalized ASR for individuals with disordered speech, offering valuable insights into optimizing ASR systems for such impairments and providing a more practical approach that can be implemented with limited recording times per speaker.

Authors: Jimmy Tobin, Katrin Tomanek

arXiv: 2110.04612v1 (eess.AS)

Submitted to ICASSP 2022

License: CC BY 4.0

Abstract: This study investigates the performance of personalized automatic speech recognition (ASR) for recognizing disordered speech using small amounts of per-speaker adaptation data. We trained personalized models for 195 individuals with different types and severities of speech impairment with training sets ranging in size from <1 minute to 18-20 minutes of speech data. Word error rate (WER) thresholds were selected to determine Success Percentage (the percentage of personalized models reaching the target WER) in different application scenarios. For the home automation scenario, 79% of speakers reached the target WER with 18-20 minutes of speech; but even with only 3-4 minutes of speech, 63% of speakers reached the target WER. Further evaluation found similar improvement on test sets with conversational and out-of-domain, unprompted phrases. Our results demonstrate that with only a few minutes of recordings, individuals with disordered speech could benefit from personalized ASR.

Submitted to arXiv on 09 Oct. 2021

Results of the summarizing process for the arXiv paper: 2110.04612v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

This study focuses on personalized automatic speech recognition (ASR) for recognizing disordered speech using small amounts of per-speaker adaptation data. Previous research has demonstrated promising results in adapting speaker-independent ASR systems for dysarthric speech; however, the lack of available speech data has been a major challenge. Recent work has explored the potential of personalizing ASR models for individuals with speech impairments. The researchers trained personalized models for 195 individuals with different types and severities of speech impairment. The training sets varied in size from less than one minute to 18-20 minutes of speech data per speaker. They used word error rate (WER) thresholds to determine the Success Percentage, which represents the percentage of personalized models that achieved the target WER in different application scenarios. In the home automation scenario, 79% of speakers reached the target WER when trained with 18-20 minutes of speech data, and even with only 3-4 minutes of speech data, 63% still reached the target WER. The researchers also evaluated performance on test sets containing conversational and out-of-domain unprompted phrases and found similar improvements. These results demonstrate that individuals with disordered speech can benefit from personalized ASR even with just a few minutes of recordings, which is significant because recording large amounts of samples per speaker is often impractical and challenging for people with speech impairments. While previous studies have reported substantial WER improvements through model personalization, they typically required hours of recorded speech data per speaker; one study achieved an average WER improvement of 75% on a large corpus by recording about two hours per speaker, whereas this study shows promising results using significantly smaller amounts of adaptation data.
Overall, this research highlights the potential and feasibility of personalized ASR for individuals with disordered speech, providing valuable insights into optimizing ASR systems for such impairments and offering a more practical approach that can be implemented with limited recording times per speaker.
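
The Success Percentage described above can be illustrated with a minimal Python sketch. It assumes that "reaching" the target means a per-speaker WER at or below the threshold; the summary does not state the exact comparison or the threshold values, so the function name `success_percentage`, the speaker IDs, the WER values, and the 0.15 target below are all hypothetical.

```python
from typing import Mapping


def success_percentage(wer_by_speaker: Mapping[str, float], target_wer: float) -> float:
    """Percentage of personalized models whose WER is at or below target_wer.

    Assumption: a model "reaches" the target when its WER <= target_wer.
    """
    if not wer_by_speaker:
        raise ValueError("need at least one speaker")
    reached = sum(1 for wer in wer_by_speaker.values() if wer <= target_wer)
    return 100.0 * reached / len(wer_by_speaker)


# Made-up per-speaker WERs (as fractions), for illustration only.
example_wers = {"spk_a": 0.08, "spk_b": 0.14, "spk_c": 0.31, "spk_d": 0.12}

# Hypothetical target WER of 15%; three of the four speakers are at or below it.
print(success_percentage(example_wers, target_wer=0.15))  # 75.0
```

Computing this value once per training-set size (for example, 3-4 minutes versus 18-20 minutes) would give the kind of comparison the summary reports.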

Researchers conducted a study to improve speech recognition for people with speech disorders using a small amount of personalized data. They trained models for 195 individuals with different types and severities of speech impairments. The size of the training sets varied from less than one minute to 18-20 minutes per person. They used word error rate (WER) thresholds to measure success, and found that even with just a few minutes of data, many speakers reached the target WER. This research shows that personalized speech recognition can help people with speech disorders, even with limited recording time per person.

Definitions
- Personalized: Tailored or customized specifically for an individual.
- Automatic Speech Recognition (ASR): Technology that converts spoken language into written text.
- Disordered speech: Speech that is difficult to understand due to a medical condition or impairment.
- Adaptation data: Information used to modify or adjust a system based on individual characteristics or needs.
- Severities: Different levels or degrees of seriousness or intensity.
- Impairment: A condition that limits or affects someone's ability in some way.
- Word Error Rate (WER): A measure of how accurately a speech recognition system transcribes spoken words into text.
- Feasibility: The possibility or likelihood of something being successful or achievable.
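
To make the WER definition concrete, here is a minimal Python sketch of the standard formulation: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The function name `word_error_rate` and the example phrases are illustrative assumptions; the summary does not describe the scoring tool or text normalization used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word ("the") and one substituted word
# ("lights" -> "light") against a five-word reference.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.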

Personalized Automatic Speech Recognition for Disordered Speech

Speech recognition technology has come a long way in recent years, with applications ranging from home automation to medical diagnostics. However, speech impairments such as dysarthria can present challenges for existing automatic speech recognition (ASR) systems. Previous research has demonstrated promising results in adapting speaker-independent ASR systems for dysarthric speech; however, the lack of available speech data has been a major challenge. In this study, researchers explored the potential of personalizing ASR models for individuals with disordered speech. They trained personalized models for 195 individuals with different types and severities of impairment using varying amounts of adaptation data per speaker, from less than one minute to 18-20 minutes of recordings. The researchers evaluated performance on test sets containing conversational and out-of-domain unprompted phrases and used word error rate (WER) thresholds to determine the Success Percentage, which represents the percentage of personalized models that achieved the target WER in different application scenarios.

Home Automation Scenario

The researchers found that 79% of speakers reached the target WER when trained with 18-20 minutes of speech data, and even with only 3-4 minutes of speech data, 63% still reached the target WER. This is significant because recording large amounts of samples per speaker is often impractical and challenging for people with disordered speech due to physical limitations or fatigue caused by their condition.

Conversational & Out-of-Domain Phrases

The researchers also evaluated performance on test sets containing conversational and out-of-domain unprompted phrases and found improvements similar to those in the home automation scenario. Overall, this research highlights the potential and feasibility of personalized ASR for individuals with disordered speech, providing valuable insights into optimizing ASR systems for such impairments and offering a more practical approach that can be implemented with limited recording times per speaker.

Comparison With Previous Studies

While previous studies have reported substantial WER improvements through model personalization, they typically required hours of recorded speech data per speaker; one study achieved an average WER improvement of 75% on a large corpus by recording about two hours per speaker. This study, by contrast, shows promising results using significantly smaller amounts of adaptation data, demonstrating that individuals with disordered speech can benefit from personalized ASR even with just a few minutes of recordings.

Conclusion

This research paper provides important insights into how much training data is needed to personalize ASR successfully for individuals with disordered or impaired speech. It shows that even small amounts of data can yield significant accuracy improvements over non-personalized models, while remaining practical to implement without excessive recording time per speaker.

Created on 03 Aug. 2023
