Towards Robust Text Retrieval with Progressive Learning

AI-generated keywords: Text Retrieval, Progressive Learning, Large Language Models, External Knowledge Sources, PEG Model

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors: Tong Wu, Yulei Qin, Enwei Zhang, Zihan Xu, Yuting Gao, Ke Li, Xing Sun
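  • Introduces PEG, a progressively learned embedding model for text retrieval, to support retrieval augmentation of large language models (LLMs) with external knowledge sources
  • Existing embedding models for text retrieval are limited by the small number and low diversity of samples per batch and by a high proportion of noisy data, which harms the semantic correctness and consistency of embeddings
  • Existing models also converge sub-optimally because easy and difficult samples are treated equally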
  • PEG model increases in-batch negative samples to 80,000 and extracts five hard negatives per query
  • Incorporates progressive learning mechanism for dynamic attention modulation during training
  • Trained on over 100 million data points across domains like finance, medicine, and tourism
  • Covers tasks including question-answering and similarity matching
  • Experimental results show PEG outperforms state-of-the-art embeddings in retrieving true positives on C-MTEB and DuReader datasets
  • Publicly available at https://huggingface.co/TownsWu/PEG for further exploration and implementation in text retrieval tasks
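
The key points mention in-batch negatives and hard negatives per query. As a rough illustration of how such negatives typically enter a contrastive (InfoNCE-style) objective, here is a minimal pure-Python sketch. It is not the authors' implementation: the function name `info_nce_loss`, the temperature value, and the choice to use each query's hard negatives only for that query are all assumptions made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, positives, hard_negatives, temperature=0.05):
    """Mean contrastive loss over a batch (illustrative, hypothetical helper).

    queries[i] is paired with positives[i]. For query i, the negatives are
    every other positive in the batch (in-batch negatives) plus the vectors
    in hard_negatives[i] (assumed to be mined beforehand).
    """
    q = [normalize(x) for x in queries]
    p = [normalize(x) for x in positives]
    total = 0.0
    for i, qi in enumerate(q):
        candidates = p + [normalize(h) for h in hard_negatives[i]]
        logits = [dot(qi, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the paired positive (index i) as the target.
        total += log_denom - logits[i]
    return total / len(q)


# Toy batch: two queries, two positives, one hard negative per query.
batch_queries = [[1.0, 0.0], [0.0, 1.0]]
batch_positives = [[1.0, 0.0], [0.0, 1.0]]
batch_hard_negatives = [[[0.9, 0.1]], [[0.1, 0.9]]]
loss = info_nce_loss(batch_queries, batch_positives, batch_hard_negatives)
```

In this formulation a larger batch adds more in-batch negatives to each query's candidate list, which is the effect the 80,000 figure refers to; how the paper achieves that scale is not described in the metadata.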

Authors: Tong Wu, Yulei Qin, Enwei Zhang, Zihan Xu, Yuting Gao, Ke Li, Xing Sun

Abstract: Retrieval augmentation has become an effective solution to empower large language models (LLMs) with external and verified knowledge sources from a database, which overcomes the limitations and hallucinations of LLMs in handling up-to-date and domain-specific information. However, existing embedding models for text retrieval usually have three non-negligible limitations. First, the number and diversity of samples in a batch are too restricted to supervise the modeling of textual nuances at scale. Second, the high proportion of noise is detrimental to the semantic correctness and consistency of embeddings. Third, the equal treatment of easy and difficult samples causes sub-optimal convergence of embeddings with poorer generalization. In this paper, we propose PEG, progressively learned embeddings for robust text retrieval. Specifically, we increase the number of in-batch negative samples during training to 80,000, and for each query we extract five hard negatives. Concurrently, we incorporate a progressive learning mechanism that enables the model to dynamically modulate its attention to the samples throughout the entire training process. Additionally, PEG is trained on more than 100 million data samples, encompassing a wide range of domains (e.g., finance, medicine, and tourism) and covering various tasks (e.g., question-answering, machine reading comprehension, and similarity matching). Extensive experiments conducted on C-MTEB and DuReader demonstrate that PEG surpasses state-of-the-art embeddings in retrieving true positives, highlighting its significant potential for applications in LLMs. Our model is publicly available at https://huggingface.co/TownsWu/PEG.
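
To make the retrieval-augmentation setting in the abstract concrete, the following is a minimal sketch of embedding-based retrieval followed by prompt construction. The vectors are toy values standing in for the output of an embedding model; the helper names `retrieve` and `build_prompt` and the prompt layout are assumptions for illustration and do not come from the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, corpus, k=2):
    """Return the k passages whose embeddings are most similar to the query.

    corpus is a list of (text, vector) pairs; in practice the vectors would
    be produced by an embedding model rather than written by hand.
    """
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in corpus),
        reverse=True,
    )
    return [text for _, text in scored[:k]]


def build_prompt(question, passages):
    """Prepend retrieved passages to the question (hypothetical prompt format)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy corpus with hand-written 2-d "embeddings".
toy_corpus = [
    ("passage about finance", [1.0, 0.0]),
    ("passage about tourism", [0.0, 1.0]),
    ("passage about both", [0.7, 0.7]),
]
top_passages = retrieve([1.0, 0.1], toy_corpus, k=2)
prompt = build_prompt("What affects bond prices?", top_passages)
```

The quality of the passages placed in the prompt depends entirely on the embedding model, which is why retrieving true positives is the metric the abstract emphasizes.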

Submitted to arXiv on 20 Nov. 2023

Results of the summarizing process for the arXiv paper: 2311.11691v1

In their paper titled "Towards Robust Text Retrieval with Progressive Learning," authors Tong Wu, Yulei Qin, Enwei Zhang, Zihan Xu, Yuting Gao, Ke Li, and Xing Sun introduce PEG, a progressively learned embedding model for text retrieval, intended to support retrieval augmentation of large language models (LLMs) with external knowledge sources. According to the authors, existing embedding models for text retrieval have three limitations: batches contain too few and too homogeneous samples, a high proportion of noisy data harms the semantic correctness and consistency of embeddings, and treating easy and difficult samples equally leads to sub-optimal convergence. PEG addresses these limitations by increasing the number of in-batch negative samples to 80,000 and extracting five hard negatives per query. It also incorporates a progressive learning mechanism that lets the model dynamically modulate its attention to samples throughout training. PEG is trained on more than 100 million data samples spanning domains such as finance, medicine, and tourism, and covers tasks including question-answering, machine reading comprehension, and similarity matching. Experiments on C-MTEB and DuReader show that PEG outperforms state-of-the-art embeddings in retrieving true positives, which the authors take as evidence of its potential for LLM applications. The model is publicly available at https://huggingface.co/TownsWu/PEG.
Created on 30 Mar. 2024
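
The metadata does not say how the progressive learning mechanism modulates attention to samples, so the sketch below is purely hypothetical and should not be read as the authors' method. It shows one curriculum-style possibility: per-sample weights that favor easy (low-loss) samples early in training and hard (high-loss) samples later. The function names, the `sharpness` parameter, and the linear schedule are all assumptions.

```python
import math


def progressive_weights(losses, progress, sharpness=2.0):
    """Per-sample weights that sum to 1 (hypothetical scheme).

    progress is the fraction of training completed, in [0, 1].
    At progress 0 low-loss samples get more weight; at 0.5 all samples are
    weighted equally; at 1 high-loss samples get more weight.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    # alpha moves linearly from -sharpness to +sharpness over training.
    alpha = sharpness * (2.0 * progress - 1.0)
    scores = [alpha * loss for loss in losses]
    # Softmax over the scores, computed stably.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_loss(losses, progress, sharpness=2.0):
    """Combine per-sample losses using the progress-dependent weights."""
    weights = progressive_weights(losses, progress, sharpness)
    return sum(w * loss for w, loss in zip(weights, losses))


# One easy sample (loss 0.1) and one hard sample (loss 2.0).
sample_losses = [0.1, 2.0]
early = progressive_weights(sample_losses, progress=0.0)
late = progressive_weights(sample_losses, progress=1.0)
```

Any such weighting replaces the uniform average over samples, which is the "equal treatment" the summary identifies as a cause of sub-optimal convergence.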
