This technical report presents the training process and performance of nomic-embed-text-v1, an English text embedding model with a context length of 8192 tokens. The model outperforms OpenAI Ada-002 and OpenAI text-embedding-3-small on both short-context and long-context tasks. It is fully reproducible and open-source, with open weights and data released under an Apache 2 license.
The report also discusses the Jina Long Context Evaluation Benchmark, which includes results for nomic-embed-text-v1-ablated, a variant trained without the FEVER, HotpotQA, and MEDI datasets. Despite a slight decrease in overall score, the authors decided to train on these datasets to maintain comparability with other top open-source models.
The training resources for nomic-embed-text-v1 are outlined in detail. Full training can be completed in one week on an 8xH100 node and proceeds in stages that include masked language modeling and contrastive pretraining. The report recommends initializing from the nomic-bert-2048 or Unsupervised Contrastive checkpoints.
In conclusion, the report introduces the first fully open-source long-context text embedding model that surpasses OpenAI Ada-002 across various benchmarks. The model weights, training code, and replication recipe are released under a permissive license for broader accessibility. Contributions from key team members Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar are highlighted for their roles in project leadership, implementation decisions, dataset curation, and design at all levels of the stack.
Lastly, the challenges of evaluating text embedding models are discussed in the context of benchmarking initiatives such as BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023), emphasizing the need for comprehensive evaluation over long context lengths. Recent benchmarks specialized for long-context evaluation, such as Jina (Günther et al., 2024) and LoCo (Saad-Falcon et al., 2024), provide valuable insights into effectively assessing model performance.
- The nomic-embed-text-v1 model outperforms OpenAI Ada-002 and OpenAI text-embedding-3-small on tasks involving both short and long contexts.
- The model is fully reproducible and open-source, with open weights and data under an Apache 2 license.
- Despite a slight decrease in overall score, the authors decided to train on the FEVER, HotpotQA, and MEDI datasets for comparability with other top open-source models.
- Full training of nomic-embed-text-v1 can be completed in one week on an 8xH100 node, in stages that include masked language modeling and contrastive pretraining.
- Initializing from the nomic-bert-2048 or Unsupervised Contrastive checkpoints is recommended when training the model.
- Contributions from key team members Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar are highlighted for their roles in project leadership, implementation decisions, dataset curation, and design at all levels of the stack.
- Challenges in evaluating text embedding models with benchmarks such as BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023) are discussed, emphasizing the need for comprehensive evaluation over long context lengths.
Summary
- The nomic-embed-text-v1 model performs better than OpenAI Ada-002 and OpenAI text-embedding-3-small on tasks with short and long contexts.
- The model can be reproduced, is open-source, and has its weights and data available under an Apache 2 license.
- Even though there was a small drop in scores, the authors decided to train the model on the FEVER, HotpotQA, and MEDI datasets so that it can be compared with other top models.
- It takes one week to fully train nomic-embed-text-v1 on an 8xH100 node, using training stages such as masked language modeling and contrastive pretraining.
- To train the model, it is suggested to start from the nomic-bert-2048 or Unsupervised Contrastive checkpoints.
Definitions
1. Model: A representation of something that helps us understand or predict how it works.
2. Reproducible: Something that can be done again to get the same results.
3. Open-source: Software that allows anyone to view, modify, and distribute its code freely.
4. License: A legal permission to use or distribute something owned by someone else under certain conditions.
5. Dataset: A collection of organized data used for analysis or research.
The world of natural language processing (NLP) is constantly evolving, with new models and techniques being developed to improve the performance of tasks such as text classification, sentiment analysis, and machine translation. In recent years, there has been a growing interest in long-context text embedding models that can capture more complex relationships between words and phrases. One such model that has gained attention is nomic-embed-text-v1, which is the focus of this technical report.
The nomic-embed-text-v1 model is an English text embedding model with a context length of 8192. This means that it can process texts up to 8192 tokens long. Training proceeds in stages that include masked language modeling and contrastive pretraining, and the full run can be completed in one week on an 8xH100 node. Rather than starting from scratch, the report recommends initializing training from the nomic-bert-2048 or Unsupervised Contrastive checkpoints.
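To make the idea of contrastive pretraining more concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, written in plain Python. It is an illustration under assumptions, not the report's implementation: the function names `l2_normalize` and `info_nce_loss`, the cosine-similarity scoring, and the `temperature` value of 0.05 are all hypothetical choices, since this summary does not state the exact loss or hyperparameters used.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch of paired embeddings.

    documents[i] is treated as the positive for queries[i]; every other
    document in the batch acts as a negative. The temperature is an
    assumed value chosen only for illustration.
    """
    q = [l2_normalize(v) for v in queries]
    d = [l2_normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        # Similarity of this query to every document in the batch.
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
        # Cross-entropy with the matching document as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    documents = [[1.0, 0.1], [0.1, 1.0]]
    print("aligned pairs:", info_nce_loss(queries, documents))
    print("swapped pairs:", info_nce_loss(queries, documents[::-1]))
```

The loss is low when each query is closest to its own paired document and high when the pairs are mismatched, which is the signal that pulls matching texts together in embedding space.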
The main highlight of this report is the performance comparison between nomic-embed-text-v1 and leading models like OpenAI Ada-002 and OpenAI text-embedding-3-small on tasks involving both short and long contexts. The results show that nomic-embed-text-v1 outperforms these models across various benchmarks. This makes it the first fully open-source long-context text embedding model to surpass OpenAI Ada-002's performance.
One key aspect of this report is its reproducibility. All resources used for training nomic-embed-text-v1 are outlined in detail, making it easier for others to replicate the results. Additionally, all weights and data are released under an Apache 2 license for broader accessibility.
The report also discusses the Jina Long Context Evaluation Benchmark, which showcases the performance of an ablated version of nomic-embed-text-v1 without FEVER, HotpotQA, and MEDI datasets. While there was a slight decrease in overall score, it was decided to include these datasets in the training process to maintain comparability with other top open-source models.
The contributions of key team members Zach Nussbaum, Jack Morris, Brandon Duderstadt, and Andriy Mulyar are also highlighted in this report. They played crucial roles in project leadership, implementation decisions, dataset curation efforts, and design contributions at all levels of the stack.
Furthermore, the challenges in evaluating text embedding models are discussed within the context of benchmarking initiatives like BEIR (Thakur et al., 2021) and MTEB (Muennighoff et al., 2023). These benchmarks highlight the need for comprehensive evaluations over long context lengths. Recent specialized benchmarks such as Jina (Günther et al., 2024) and LoCo (Saad-Falcon et al., 2024) provide valuable insights into effectively assessing model performance.
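As an illustration of how a retrieval task in such a benchmark can be scored, the following sketch computes nDCG@k, a ranking metric widely used for retrieval evaluation. It assumes the linear-gain form of DCG, and the helper names `dcg_at_k` and `ndcg_at_k` are hypothetical. This summary does not say which metric or variant the report relies on, so treat this only as an example of the kind of computation involved.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the model ranked them.
    relevance_by_doc: mapping from document id to graded relevance;
        documents missing from the mapping count as not relevant.
    """
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking lists the judged documents from most to least relevant.
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    judgments = {"doc_a": 1}
    print(ndcg_at_k(["doc_a", "doc_b"], judgments))
    print(ndcg_at_k(["doc_b", "doc_a"], judgments))
```

A benchmark score would then typically be an average of such per-query values over all queries in a dataset.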
In conclusion, nomic-embed-text-v1 is a long-context text embedding model that surpasses OpenAI Ada-002 and OpenAI text-embedding-3-small across various benchmarks. Its fully reproducible nature and open-source availability make it a valuable addition to the NLP community. The authors hope that their work will inspire further research and advancements in long-context text embedding models.