Nomic Embed: Training a Reproducible Long Context Text Embedder

AI-generated keywords: Text embedding, nomic-embed-text-v1, OpenAI Ada-002, Long-context evaluation, Reproducibility

AI-generated Key Points

The nomic-embed-text-v1 model outperforms leading models like OpenAI Ada-002 and OpenAI text-embedding-3-small on tasks involving short and long contexts.
The model is fully reproducible and open-source, with open weights and open data; the training code and model weights are released under an Apache 2 license.
Despite a slight decrease in overall score, the decision was made to train on FEVER, HotpotQA, and MEDI datasets for comparability with other top open-source models.
Full training of nomic-embed-text-v1 can be completed in one week on an 8xH100 node using various stages such as masked language modeling and contrastive pretraining.
It is recommended to initialize from nomic-bert-2048 or Unsupervised Contrastive checkpoints for training the model.
Contributions from key team members Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar are highlighted for their roles in project leadership, implementation decisions, dataset curation efforts, and design contributions at all levels of the stack.
Challenges in evaluating text embedding models within benchmarking initiatives like BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023) are discussed, emphasizing the need for comprehensive evaluations over long context lengths.
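
The contrastive pretraining stage named in the key points can be illustrated with a small sketch. The summary does not state the exact objective, so the InfoNCE loss with in-batch negatives shown here is an assumption (it is a common choice for training on text pairs), and the temperature value and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, document) pairs.

    doc_embs[i] is the positive for query_embs[i]; every other document
    in the batch acts as a negative. The temperature is a hypothetical value.
    """
    total = 0.0
    for i, query in enumerate(query_embs):
        logits = [cosine(query, doc) / temperature for doc in doc_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(query_embs)


# Toy 2-dimensional "embeddings" (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own pair
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong pair

aligned_loss = info_nce_loss(queries, aligned_docs)
swapped_loss = info_nce_loss(queries, swapped_docs)
```

The loss is near zero when each query is most similar to its paired document and grows when the pairs are swapped, which is the signal a contrastive stage uses to pull paired texts together in embedding space.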

Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar

arXiv: 2402.01613v1 (cs.CL)

License: CC BY 4.0

Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors

Submitted to arXiv on 02 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.01613v1

Comprehensive Summary

This technical report presents the training process and performance of nomic-embed-text-v1, an English text embedding model with a context length of 8192 tokens. The model outperforms OpenAI Ada-002 and OpenAI text-embedding-3-small on both short- and long-context tasks. It is fully reproducible and open-source, with open weights and open data; the training code and model weights are released under an Apache 2 license.

The report also discusses the Jina Long Context Evaluation Benchmark and reports the performance of nomic-embed-text-v1-ablated, a variant trained without the FEVER, HotpotQA, and MEDI datasets. Despite a slight decrease in overall score, the authors chose to train on these datasets to maintain comparability with other top open-source models.

The training resources for nomic-embed-text-v1 are outlined in detail. Full training can be completed in one week on an 8xH100 node and proceeds in stages, including masked language modeling and contrastive pretraining. The report recommends initializing from the nomic-bert-2048 or Unsupervised Contrastive checkpoints.

In conclusion, the report introduces the first fully open-source long-context text embedding model that surpasses OpenAI Ada-002 across various benchmarks. The model weights, training code, and replication recipe are released under a permissive license for broader accessibility.

The report highlights the contributions of Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar, covering project leadership, implementation decisions, dataset curation, and design contributions at all levels of the stack.

Lastly, the report discusses the challenges of evaluating text embedding models in the context of benchmarking initiatives like BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023), emphasizing the need for comprehensive evaluations over long context lengths. Recent benchmarks specialized for long-context evaluation, such as Jina (Günther et al., 2024) and LoCo (Saad-Falcon et al., 2024), provide valuable insights into effectively assessing model performance.
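
To make the evaluation discussion concrete, the sketch below computes nDCG@k, a ranking metric that retrieval benchmarks such as BEIR and MTEB commonly report at k = 10. It uses the linear-gain formulation, which is one common variant; the document IDs and relevance labels are hypothetical, and the summary itself does not say which metric the report uses for each benchmark.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query with two relevant documents, retrieved at ranks 1 and 3.
ranking = ["d1", "d7", "d3", "d9"]
qrels = {"d1": 1, "d3": 1}

score = ndcg_at_k(ranking, qrels, k=10)
perfect = ndcg_at_k(["d1", "d3", "d7", "d9"], qrels, k=10)
```

In an embedding benchmark, the ranking would come from sorting documents by similarity between the query embedding and each document embedding; a score of 1.0 means every relevant document was ranked ahead of every irrelevant one.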

Layman's Summary

- The nomic-embed-text-v1 model is better than other models like OpenAI Ada-002 and OpenAI text-embedding-3-small at tasks with short and long contexts.
- This model can be reproduced, is open-source, and has its weights and data available under an Apache 2 license.
- Even though there was a small drop in scores, the decision was made to train the model on FEVER, HotpotQA, and MEDI datasets for comparison with other top models.
- It takes one week to fully train nomic-embed-text-v1 on specific hardware using different training stages like masked language modeling and contrastive pretraining.
- To train the model, it's suggested to start from nomic-bert-2048 or Unsupervised Contrastive checkpoints.

Definitions

1. Model: A representation of something that helps us understand or predict how it works.
2. Reproducible: Something that can be done again to get the same results.
3. Open-source: Software that allows anyone to view, modify, and distribute its code freely.
4. License: A legal permission to use or distribute something owned by someone else under certain conditions.
5. Dataset: A collection of organized data used for analysis or research.

Blog article

The world of natural language processing (NLP) is constantly evolving, with new models and techniques being developed to improve performance on tasks such as text classification, sentiment analysis, and machine translation. In recent years, there has been growing interest in long-context text embedding models, which can represent entire long documents rather than short passages. One such model is nomic-embed-text-v1, the focus of this technical report.

Nomic-embed-text-v1 is an English text embedding model with a context length of 8192, meaning it can process texts up to 8192 tokens long. The report describes it as the first fully reproducible, open-source, open-weights, open-data English embedding model at this context length. The model was trained on an 8xH100 node in several stages, including masked language modeling and contrastive pretraining. Training can be initialized from the nomic-bert-2048 or Unsupervised Contrastive checkpoints.

The main highlight of the report is the performance comparison between nomic-embed-text-v1 and leading models like OpenAI Ada-002 and OpenAI text-embedding-3-small on tasks involving both short and long contexts. The results show that nomic-embed-text-v1 outperforms these models across various benchmarks, making it the first fully open-source long-context text embedding model to surpass OpenAI Ada-002.

One key aspect of the report is its reproducibility. All resources used for training nomic-embed-text-v1 are outlined in detail, making it easier for others to replicate the results. The training code and model weights are released under an Apache 2 license, and the training data is released as well.

The report also discusses the Jina Long Context Evaluation Benchmark, which includes results for an ablated version of nomic-embed-text-v1 trained without the FEVER, HotpotQA, and MEDI datasets. While there was a slight decrease in overall score, the authors decided to include these datasets in training to maintain comparability with other top open-source models.

The contributions of key team members Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar are also highlighted. They played crucial roles in project leadership, implementation decisions, dataset curation efforts, and design contributions at all levels of the stack.

Furthermore, the challenges of evaluating text embedding models are discussed in the context of benchmarking initiatives like BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023). The discussion emphasizes the need for comprehensive evaluations over long context lengths. Recent specialized benchmarks such as Jina (Günther et al., 2024) and LoCo (Saad-Falcon et al., 2024) provide valuable insights into effectively assessing model performance.

In conclusion, nomic-embed-text-v1 is a long-context text embedding model that surpasses OpenAI Ada-002 and OpenAI text-embedding-3-small across various benchmarks. Its fully reproducible nature and open-source availability make it a valuable addition to the NLP community. The authors hope that their work will inspire further research and advancements in long-context text embedding models.
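
The masked language modeling stage mentioned in the article can be sketched as follows. This is a simplified illustration, not the report's implementation: the mask rate and the whitespace tokenization are hypothetical choices, and refinements used in practice (such as occasionally keeping or randomly replacing a selected token) are omitted.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_rate=0.3, seed=0):
    """Hide a random subset of tokens and record what the model must predict.

    Returns (masked_tokens, labels), where labels[i] is the original token at
    each masked position and None elsewhere. mask_rate is a hypothetical value.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK_TOKEN)
            labels.append(token)      # training target for this position
        else:
            masked.append(token)
            labels.append(None)       # position is not scored
    return masked, labels


# Whitespace splitting stands in for a real subword tokenizer here.
tokens = "open models make long context embeddings reproducible".split()
masked, labels = mask_tokens(tokens, mask_rate=0.3, seed=0)
```

During this stage the model sees `masked` and is trained to recover the entries of `labels` that are not None, which teaches it general language structure before the contrastive stages shape the embedding space.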

Created on 22 Mar. 2024
