Nomic Embed: Training a Reproducible Long Context Text Embedder

AI-generated keywords: Text embedding, Nomic-embed-text-v1, OpenAI Ada-002, Long-context evaluation, Reproducibility

AI-generated Key Points

  • The nomic-embed-text-v1 model outperforms leading models like OpenAI Ada-002 and OpenAI text-embedding-3-small on tasks involving short and long contexts.
  • The model is fully reproducible, open-source, and comes with open weights and data under an Apache 2 license.
  • Despite a slight decrease in overall score, the decision was made to train on FEVER, HotpotQA, and MEDI datasets for comparability with other top open-source models.
  • Full training of nomic-embed-text-v1 can be completed in one week on an 8xH100 node using various stages such as masked language modeling and contrastive pretraining.
  • It is recommended to initialize from nomic-bert-2048 or Unsupervised Contrastive checkpoints for training the model.
  • Contributions from key team members Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar are highlighted for their roles in project leadership, implementation decisions, dataset curation efforts, and design contributions at all levels of the stack.
  • Challenges in evaluating text embedding models within benchmarking initiatives like BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023) are discussed, emphasizing the need for comprehensive evaluations over long context lengths.

Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar

License: CC BY 4.0

Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors

Submitted to arXiv on 02 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.01613v1

This technical report presents the training process and performance of nomic-embed-text-v1, an English text embedding model with a context length of 8192. The model outperforms leading models such as OpenAI Ada-002 and OpenAI text-embedding-3-small on tasks involving both short and long contexts. It is fully reproducible, open-source, and comes with open weights and data under an Apache 2 license.

The report also discusses the Jina Long Context Evaluation Benchmark, showing the performance of the nomic-embed-text-v1-ablated model, which is trained without the FEVER, HotpotQA, and MEDI datasets. Despite a slight decrease in overall score, the authors decided to train on these datasets to maintain comparability with other top open-source models.

The training resources for nomic-embed-text-v1 are outlined in detail. Full training can be completed in one week on an 8xH100 node and proceeds through stages such as masked language modeling and contrastive pretraining. It is recommended to initialize from the nomic-bert-2048 or Unsupervised Contrastive checkpoints.

In conclusion, this report introduces the first fully open-source long-context text embedding model that surpasses OpenAI Ada-002's performance across various benchmarks. The model weights, training code, and replication recipe are released under a permissive license for broader accessibility. Contributions from key team members Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar are highlighted for their roles in project leadership, implementation decisions, dataset curation efforts, and design contributions at all levels of the stack.

Lastly, the challenges in evaluating text embedding models are discussed within the context of benchmarking initiatives like BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023), emphasizing the need for comprehensive evaluations over long context lengths. Recent benchmarks specialized for long-context evaluation, such as Jina (Günther et al., 2024) and LoCo (Saad-Falcon et al., 2024), provide valuable insights into effectively assessing model performance.
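
To make the contrastive pretraining stage mentioned in the summary more concrete, here is a minimal sketch of an InfoNCE-style loss over a batch of text pairs with in-batch negatives. It is an illustration only and not the authors' implementation: the function name `info_nce_loss`, the temperature value of 0.05, and the toy two-dimensional vectors are all assumptions, and the summary does not state which exact loss or hyperparameters the paper uses. Input vectors are assumed to be nonzero.

```python
import math


def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Average InfoNCE loss over a batch of (query, document) pairs.

    Row i of query_vecs is paired with row i of doc_vecs; every other
    document in the batch serves as an in-batch negative.
    The name and the default temperature are hypothetical choices.
    """
    def normalize(vec):
        # Scale to unit length so dot products equal cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]

    total = 0.0
    for i, query in enumerate(queries):
        # Similarity of this query to every document, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(query, doc)) / temperature
            for doc in docs
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits)
        )
        # Cross-entropy with the paired document (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings (made up for illustration).
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.6f}")
print(f"swapped pairs: {swapped:.6f}")
```

In this toy case the loss is close to zero when each query points the same way as its paired document, and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward and penalize.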
Created on 22 Mar. 2024
