BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

AI-generated keywords: M3-Embedding

AI-generated Key Points

  • M3-Embedding is a versatile model that supports over 100 languages and excels in multi-lingual and cross-lingual retrieval tasks.
  • It can handle dense, multi-vector, and sparse retrieval simultaneously, making it valuable for real-world information retrieval applications.
  • The model can process inputs from short sentences to long documents with up to 8192 tokens.
  • A novel self-knowledge distillation approach using relevance scores from different retrieval functionalities enhances the training quality of M3-Embedding.
  • An optimized batching strategy enables large batch sizes and high training throughput, which helps produce more discriminative embeddings.
  • In experiments evaluating multilingual retrieval, cross-lingual retrieval, and long-document retrieval tasks, M3-Embedding consistently demonstrates superior performance across various languages and input lengths.
  • Even without fine-tuning on long document data, M3-Embedding outperforms most baselines due to its robust pre-training stage.
  • A simple strategy called MCLS addresses situations where fine-tuning for long-document retrieval is not feasible or is too resource-intensive.

Authors: Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu

Work in progress
License: CC BY 4.0

Abstract: In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
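The three retrieval functionalities named in the abstract can be illustrated with toy scoring functions. This is a minimal sketch, not the authors' implementation: the abstract does not give formulas, so the definitions below (inner product of normalized vectors for dense retrieval, summed weight products over shared terms for sparse retrieval, and averaged best-match similarity for multi-vector retrieval) are assumptions based on how these techniques are commonly defined. All function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_score(q_vec, p_vec):
    """Dense retrieval: inner product of L2-normalized single-vector embeddings."""
    def normalize(v):
        norm = math.sqrt(_dot(v, v)) or 1.0  # guard against an all-zero vector
        return [x / norm for x in v]
    return _dot(normalize(q_vec), normalize(p_vec))


def sparse_score(q_weights, p_weights):
    """Sparse (lexical) retrieval: sum of weight products over terms
    that occur in both the query and the passage."""
    shared = set(q_weights) & set(p_weights)
    return sum(q_weights[t] * p_weights[t] for t in shared)


def multi_vector_score(q_vecs, p_vecs):
    """Multi-vector retrieval (late interaction, assumed form): each query
    token vector is matched to its most similar passage token vector,
    and the best-match similarities are averaged."""
    best = [max(_dot(q, p) for p in p_vecs) for q in q_vecs]
    return sum(best) / len(best)


# Toy inputs, invented purely for illustration.
print(dense_score([1.0, 0.0], [1.0, 1.0]))
print(sparse_score({"bge": 0.5, "m3": 1.0}, {"m3": 2.0, "model": 0.3}))
print(multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]]))
```

In a unified model, all three scores would be derived from one encoder pass. Here they take hand-written inputs so that the scoring logic can be read in isolation.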

Submitted to arXiv on 05 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.03216v1

Introducing M3-Embedding: A Versatile Model for Multilingual, Multi-functional, and Multi-granular Embeddings

M3-Embedding is an embedding model that supports over 100 languages and achieves state-of-the-art performance on multi-lingual and cross-lingual retrieval tasks. It can perform dense, multi-vector, and sparse retrieval simultaneously, making it a valuable tool for real-world information retrieval applications. Notably, M3-Embedding can process inputs ranging from short sentences to long documents of up to 8192 tokens.

To enhance the training quality of M3-Embedding, we propose a novel self-knowledge distillation approach in which relevance scores from the different retrieval functionalities are integrated to serve as the teacher signal. We also optimize the batching strategy to enable large batch sizes and high training throughput for more discriminative embeddings. To the best of our knowledge, M3-Embedding is the first embedding model to realize such strong versatility.

In our experiments on multilingual retrieval, cross-lingual retrieval, and long-document retrieval tasks, M3-Embedding consistently demonstrates superior performance across various languages and input lengths. To assess its handling of long inputs, we evaluate it on MLDR (Multilingual Long-Doc Retrieval), which is curated from multilingual articles from Wikipedia, Wudao, and mC4, and on NarrativeQA. Ablation studies show that even without fine-tuning on long-document data (Dense-w.o.long), M3-Embedding outperforms most baselines, which we attribute to its robust pre-training stage. Additionally, we introduce a simple strategy called MCLS for situations where fine-tuning for long-document retrieval is not feasible or is too resource-intensive. Our analysis on NarrativeQA shows that M3-Embedding consistently outperforms the baselines in long-document retrieval as sequence length increases.
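The summary names MCLS without describing how it works. The sketch below assumes that MCLS means inserting an extra CLS token every fixed number of tokens and averaging the hidden states at those CLS positions to form the text embedding. That reading, the interval of 256, and both function names are assumptions made for illustration and are not confirmed by the summary itself.

```python
def insert_cls_tokens(token_ids, cls_id, interval=256):
    """Place `cls_id` in front of every chunk of `interval` tokens.
    (The default of 256 is an assumed value, not taken from the summary.)"""
    out = []
    # max(..., 1) ensures an empty input still yields a single CLS token.
    for start in range(0, max(len(token_ids), 1), interval):
        out.append(cls_id)
        out.extend(token_ids[start:start + interval])
    return out


def mcls_embedding(hidden_states, input_ids, cls_id):
    """Average the hidden states found at every CLS position."""
    cls_vecs = [h for h, t in zip(hidden_states, input_ids) if t == cls_id]
    dim = len(cls_vecs[0])
    return [sum(v[i] for v in cls_vecs) / len(cls_vecs) for i in range(dim)]


# Toy example with a tiny interval so the effect is visible.
ids = insert_cls_tokens([5, 6, 7, 8, 9], cls_id=0, interval=2)
print(ids)  # [0, 5, 6, 0, 7, 8, 0, 9]

# Stand-in "hidden states": one 2-dimensional vector per position.
hidden = [[float(i), 1.0] for i in range(len(ids))]
print(mcls_embedding(hidden, ids, cls_id=0))  # [3.0, 1.0]
```

Because this approach only changes how the input is laid out and how the output is pooled, it would need no extra training, which matches the summary's description of MCLS as an option when fine-tuning is impractical.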
Furthermore, we conduct ablation experiments on self-knowledge distillation and multi-stage training, evaluated on MIRACL (nDCG@10), to assess how each contributes to retrieval quality. The MCLS strategy, in turn, is intended to enhance the model's long-text capabilities without requiring additional training resources.
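The self-knowledge distillation idea, in which scores from the different retrieval functionalities are integrated into a teacher signal, can be sketched as follows. This is an illustrative reading rather than the authors' training code: the unweighted sum used as the teacher score, the softmax cross-entropy form of the loss, and all function names are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores):
    """Cross-entropy of the student's distribution over candidate passages
    against the teacher's soft labels."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def self_distillation_losses(dense, sparse, multi_vec):
    """Each argument holds one query's scores against the same candidates.
    The teacher is the (assumed unweighted) sum of the three score lists;
    each individual scoring function is then trained toward it."""
    teacher = [d + s + m for d, s, m in zip(dense, sparse, multi_vec)]
    return {
        "dense": distillation_loss(teacher, dense),
        "sparse": distillation_loss(teacher, sparse),
        "multi_vector": distillation_loss(teacher, multi_vec),
    }


# Toy scores for one query against three candidate passages.
losses = self_distillation_losses(
    dense=[2.0, 0.5, 0.1],
    sparse=[1.5, 0.2, 0.3],
    multi_vec=[1.8, 0.4, 0.2],
)
print(losses)
```

In an actual training loop the teacher scores would typically be treated as constants (no gradient), so that only the individual scoring functions are pulled toward the combined signal.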
Created on 27 Feb. 2024

