Improving Text Embeddings with Large Language Models

AI-generated keywords: Text embeddings, Synthetic data, Large Language Models (LLMs), Multilingual performance, Inference efficiency

AI-generated Key Points

  • Introduces a novel method for obtaining high-quality text embeddings using synthetic data and fewer than 1k training steps
  • Does not require complex training pipelines or manually collected datasets
  • Leverages proprietary large language models (LLMs) to generate diverse synthetic data for text embedding tasks in multiple languages
  • Fine-tunes open-source decoder-only LLMs on synthetic data using standard contrastive loss
  • Achieves strong performance on competitive text embedding benchmarks without using labeled data
  • Sets new state-of-the-art results on BEIR and MTEB benchmarks when fine-tuned with a mixture of synthetic and labeled data
  • Discusses future work, including improving multilingual performance and exploring the use of open-source LLMs for synthetic data generation
  • Aims to improve inference efficiency and reduce storage costs for LLM-based text embeddings
  • Provides statistics on generated synthetic data: 500k examples with instructions in 93 languages, mostly generated by GPT-4 with some from GPT-35-Turbo; overall quality is acceptable
  • Trained model evaluated on MTEB benchmark, achieving highest average score and outperforming previous state-of-the-art models by 2.4 points
  • Builds upon previous work by leveraging LLMs and synthetic data to enhance text embeddings
  • Demonstrates that high-quality text embeddings can be obtained using synthetic data and a streamlined training process with LLMs

Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

15 pages, 8 tables
License: CC BY 4.0

Abstract: In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.

Submitted to arXiv on 31 Dec. 2023


Results of the summarizing process for the arXiv paper: 2401.00368v1

This paper introduces a novel and simple method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps. Unlike existing methods that rely on multi-stage pre-training with weakly supervised text pairs followed by fine-tuning on labeled datasets, this method requires neither complex training pipelines nor manually collected datasets. Instead, proprietary large language models (LLMs) generate diverse synthetic data for text embedding tasks in multiple languages, and open-source decoder-only LLMs are then fine-tuned on that data with a standard contrastive loss. Experimental results show that the method achieves strong performance on competitive text embedding benchmarks without using any labeled data. When fine-tuned on a mixture of synthetic and labeled data, the model sets new state-of-the-art results on the BEIR and MTEB benchmarks. The paper also discusses future work, including improving multilingual performance, exploring open-source LLMs for synthetic data generation, and improving inference efficiency and reducing storage costs for LLM-based text embeddings.
The paper provides statistics on the generated synthetic data, which comprises 500k examples with instructions in 93 languages. Most examples are generated by GPT-4, with some generated by GPT-35-Turbo. Although some GPT-35-Turbo outputs do not strictly follow the prompt guidelines, the overall quality is acceptable. The model is fine-tuned on a combination of the generated synthetic data and public datasets. Evaluated on the MTEB benchmark, the trained model achieves the highest average score and outperforms previous state-of-the-art models by 2.4 points.
In related work, text embeddings have been extensively studied as continuous low-dimensional representations of text. This paper builds on that work by using LLMs and synthetic data to improve text embeddings.
Overall, this paper demonstrates that high-quality text embeddings can be obtained using synthetic data and a streamlined training process with LLMs. The method achieves strong performance on text embedding benchmarks and sets new state-of-the-art results when combined with labeled data. Future work will focus on further improving multilingual performance, exploring open-source LLMs for synthetic data generation, and optimizing inference efficiency and storage costs.
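The summary refers to a "standard contrastive loss" without spelling it out. A common reading is the InfoNCE objective with in-batch negatives: each query is scored against every document in the batch, and the loss rewards a high score for its own paired document. The sketch below illustrates that idea in plain Python. It is not the authors' implementation. The function names, the use of cosine similarity, and the temperature value of 0.05 are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    documents[i] is treated as the positive for queries[i]; every other
    document in the batch acts as a negative. The temperature default is
    an illustrative assumption, not a value taken from the paper.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the positive document.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings (hypothetical): two queries and two candidate documents.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its query
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is far from its query

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

In this toy batch the loss is close to zero when each query sits next to its paired document and large when the pairs are swapped, which is the signal fine-tuning uses to pull matching texts together in embedding space. In practice the embeddings would come from the fine-tuned LLM and the loss would be computed with a tensor library so that gradients can flow back into the model.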
Created on 17 Jan. 2024

