Unsupervised Topic Segmentation of Meetings with BERT Embeddings

AI-generated keywords: Topic Segmentation, Meeting Transcripts, Unsupervised Learning, Pre-trained Models, Neural Networks

AI-generated Key Points

  • Topic segmentation of meetings is challenging due to noisy meeting transcripts and lack of ground truth data.
  • Meetings involve multiple participants with personalized language use, leading to transcript errors that make accurate topic segmentation difficult.
  • Collecting labeled data for segmented meetings is complex and expensive as organizations are sensitive about their private meeting data.
  • The proposed unsupervised approach uses pre-trained transformer models for topic segmentation, so it does not depend on ground truth data.
  • A mechanism based on BERT embeddings and a new similarity score results in a 15.5% reduction in error rate compared to existing unsupervised methods.
  • The study demonstrates a 26.6% reduction in error rate compared to current state-of-the-art supervised topic segmentation models trained on text datasets like Wikipedia.
  • Using pre-trained models like BERT and Sentence-BERT to extract sentence embeddings helps filter out noisy speech data such as ASR mistranscriptions and speaker disfluencies.
  • The proposed approach also employs a modified TextTiling method, which performs topic segmentation without labeled training data.
  • The unsupervised approach using pre-trained neural architectures shows significant improvements in topic segmentation accuracy for meeting transcripts compared to existing methods, effectively addressing challenges posed by noisy meeting data and lack of ground truth annotations.

Authors: Alessandro Solbiati, Kevin Heffernan, Georgios Damaskinos, Shivani Poddar, Shubham Modi, Jacques Cali

License: CC BY 4.0

Abstract: Topic segmentation of meetings is the task of dividing multi-person meeting transcripts into topic blocks. Supervised approaches to the problem have proven intractable due to the difficulties in collecting and accurately annotating large datasets. In this paper we show how previous unsupervised topic segmentation methods can be improved using pre-trained neural architectures. We introduce an unsupervised approach based on BERT embeddings that achieves a 15.5% reduction in error rate over existing unsupervised approaches applied to two popular datasets for meeting transcripts.

Submitted to arXiv on 24 Jun. 2021

Results of the summarizing process for the arXiv paper: 2106.12978v1

Topic segmentation of meetings is a challenging task due to the noisy nature of meeting transcripts and the lack of ground truth data. Meetings involve multiple participants with personalized language use, leading to transcript errors that make it difficult for even human annotators to segment topics accurately. Collecting labeled data for segmented meetings is complex and expensive, as organizations are sensitive about their private meeting data. Because of this lack of ground truth data, meeting segmentation has benefited less from advanced neural networks than other domains, such as written text.

In this paper, we propose an unsupervised approach that uses pre-trained transformer models for topic segmentation of meetings. We introduce a mechanism based on BERT embeddings and a new similarity score that results in a 15.5% reduction in error rate compared to existing unsupervised methods. Our study also demonstrates a 26.6% reduction in error rate compared to current state-of-the-art supervised topic segmentation models trained on text datasets like Wikipedia. These supervised models perform poorly because written text datasets differ from standard meeting transcript datasets such as the ICSI Meeting Corpus and the AMI Meeting Corpus.

The proposed approach uses pre-trained models like BERT and Sentence-BERT to extract sentence embeddings, which helps filter out noisy speech data such as ASR mistranscriptions and speaker disfluencies. We also employ a modified TextTiling method for topic segmentation that requires no labeled training data. Overall, our unsupervised approach using pre-trained neural architectures shows significant improvements in topic segmentation accuracy for meeting transcripts compared to existing methods, effectively addressing the challenges posed by noisy meeting data and the lack of ground truth annotations.
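
To make the idea of TextTiling over sentence embeddings concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the element-wise max pooling over a window of utterances, the window size, the fixed depth threshold, and all function names are illustrative assumptions, and the toy vectors stand in for embeddings that would in practice come from a pre-trained model such as BERT or Sentence-BERT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def max_pool(block):
    """Element-wise max over a block of vectors (assumed aggregation)."""
    return [max(col) for col in zip(*block)]

def gap_similarities(embeddings, window=2):
    """Similarity between the blocks before and after each utterance gap."""
    sims = []
    for i in range(1, len(embeddings)):
        left = embeddings[max(0, i - window):i]
        right = embeddings[i:i + window]
        sims.append(cosine(max_pool(left), max_pool(right)))
    return sims

def depth_scores(sims):
    """TextTiling-style depth: how far each gap sits below its nearest peaks."""
    depths = []
    for i, s in enumerate(sims):
        l = i
        while l > 0 and sims[l - 1] >= sims[l]:
            l -= 1  # climb to the peak on the left
        r = i
        while r < len(sims) - 1 and sims[r + 1] >= sims[r]:
            r += 1  # climb to the peak on the right
        depths.append((sims[l] - s) + (sims[r] - s))
    return depths

def segment(embeddings, window=2, threshold=0.5):
    """Return indices of utterances that start a new topic.

    The fixed threshold is a hypothetical choice for this sketch.
    """
    depths = depth_scores(gap_similarities(embeddings, window))
    return [i + 1 for i, d in enumerate(depths) if d > threshold]

# Toy 2-d "embeddings": three utterances on one topic, then three on another.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
       [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
print(segment(toy))  # prints [3]
```

On the toy input, the similarity between neighbouring blocks dips sharply at the gap between the third and fourth utterances, so the sketch places a single boundary there. No labeled data is involved at any step, which is the property the summary emphasizes.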
Created on 08 Mar. 2024

