Retrieving Texts based on Abstract Descriptions

AI-generated keywords: Semantic Retrieval, Large Language Models, Description-based Retrieval, Text Embeddings, Nearest Neighbor Search

AI-generated Key Points

  • The preprint explores the task of retrieving sentences from large text collections based on abstract descriptions or specifications.
  • Existing retrieval-based models and instruction-tuned Large Language Models (LLMs) have limitations in performing semantic retrieval.
  • Instruction-tuned LLMs are not suitable for semantic retrieval, similarity search over current text embeddings reflects a notion of similarity that is sub-optimal for this task, and keyword-based retrieval relies on exact lexical matches, which makes it weak for retrieval based on abstract descriptions.
  • The authors propose a description-based retrieval approach that leverages the strengths of LLMs to generate descriptions of sentences sampled from Wikipedia.
  • They produce five different descriptions for each text, in addition to incorrect descriptions, to be used as negative examples.
  • The resulting dataset pairs each sentence with positive descriptions that align with it and misleading descriptions that do not.
  • The authors train a descriptions encoder and a text encoder using positive and negative pairs sourced through prompting an LLM.
  • The vector encodings learned by these encoders can be used in a standard similarity-based retrieval setting to retrieve sentences that align with a user's description or specification.
  • The proposed model significantly improves upon existing text embeddings when used in standard nearest neighbor search.
  • This approach can serve as a useful component to enhance discovery ability in many data-intensive domains, especially in professional domains such as legal, medical or scientific search.
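
The retrieval step described in the key points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vectors are hand-written stand-ins for the outputs of the descriptions encoder and the text encoder, cosine similarity is assumed as the similarity measure, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_vec, index, k=2):
    """Return the k sentences whose vectors are most similar to query_vec.

    `index` is a list of (sentence, vector) pairs, assumed to be computed
    offline by the text encoder; `query_vec` is assumed to come from the
    descriptions encoder.
    """
    scored = [(cosine(query_vec, vec), sent) for sent, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:k]]


# Invented 3-d vectors standing in for real encoder outputs.
index = [
    ("The treaty was signed in 1648, ending the war.", [0.9, 0.1, 0.0]),
    ("The enzyme catalyzes the breakdown of lactose.", [0.0, 0.2, 0.9]),
    ("The ceasefire agreement took effect at midnight.", [0.8, 0.3, 0.1]),
]
# Stand-in for the encoding of a description such as "the end of a conflict".
query_vec = [1.0, 0.2, 0.0]

top = nearest_neighbors(query_vec, index, k=2)
```

A brute-force scan like this is only practical for small collections; at scale the same query would typically go to an approximate nearest neighbor index, which this sketch does not cover.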

Authors: Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg

A preprint; demo available at https://github.com/shauli-ravfogel/AbstractSim
License: CC BY 4.0

Abstract: In this work, we aim to connect two research areas: instruction models and retrieval-based models. While instruction-tuned Large Language Models (LLMs) excel at extracting information from text, they are not suitable for semantic retrieval. Similarity search over embedding vectors allows one to index and query vectors, but the similarity reflected in the embedding is sub-optimal for many use cases. We identify the task of retrieving sentences based on abstract descriptions of their content. We demonstrate the inadequacy of current text embeddings and propose an alternative model that significantly improves when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced through prompting a large language model (LLM). While it is easy to source the training material from an LLM, the retrieval task cannot be performed by the LLM directly. This demonstrates that data from LLMs can be used not only for distilling more efficient specialized models than the original LLM, but also for creating new capabilities not immediately possible using the original model.
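
Training on positive and negative pairs, as the abstract describes, could use a loss of the following shape. The abstract does not name the training objective, so this is only a sketch under an assumption: a triplet margin loss over dot-product scores, one common choice for this kind of data. The function names and the margin value are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def triplet_margin_loss(sentence_vec, pos_desc_vec, neg_desc_vec, margin=1.0):
    """Hinge loss over one (sentence, positive, negative) triplet.

    The loss is zero once the positive description scores at least
    `margin` higher against the sentence than the negative description.
    The margin of 1.0 is an arbitrary illustrative value.
    """
    pos_score = dot(sentence_vec, pos_desc_vec)
    neg_score = dot(sentence_vec, neg_desc_vec)
    return max(0.0, margin - pos_score + neg_score)


# Invented 2-d vectors standing in for encoder outputs.
sentence_vec = [1.0, 0.0]
loss_violated = triplet_margin_loss(sentence_vec, [0.9, 0.1], [0.2, 0.8])
loss_satisfied = triplet_margin_loss(sentence_vec, [2.0, 0.0], [0.0, 1.0])
```

In the first call the positive description outscores the negative one by less than the margin, so the loss is positive; in the second the gap exceeds the margin and the loss is zero.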

Submitted to arXiv on 21 May 2023

Results of the summarizing process for the arXiv paper: 2305.12517v1

This preprint explores the task of retrieving sentences from large text collections based on abstract descriptions or specifications of their content. The authors argue that neither instruction-tuned Large Language Models (LLMs) nor existing retrieval methods handle this task well. LLMs excel at extracting information from text, but they are not suitable for semantic retrieval. Similarity search over embedding vectors supports indexing and querying, but the similarity that current embeddings reflect is sub-optimal for this use case. Keyword-based retrieval relies on exact lexical matches, which makes it inherently weak for retrieval based on abstract descriptions.

To address this, the authors propose a description-based retrieval approach that uses an LLM to generate descriptions of sentences sampled from Wikipedia. For each sentence they produce five different descriptions, plus incorrect descriptions to be used as negative examples. The resulting dataset pairs each sentence with positive descriptions that align with it and misleading descriptions that do not. The authors then train a descriptions encoder and a text encoder on these positive and negative pairs. The vector encodings learned by the two encoders can be used in a standard similarity-based retrieval setting to retrieve sentences that match a user's description or specification.

The proposed model significantly improves upon existing text embeddings when used in standard nearest neighbor search. The results demonstrate that data from LLMs can be used not only for distilling more efficient specialized models than the original LLM, but also for creating new capabilities not immediately possible using the original model. The authors suggest that description-based retrieval can serve as a useful component for discovery in many data-intensive domains, especially professional domains such as legal, medical or scientific search.
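
The dataset described in the summary above can be pictured as one record per sentence, holding its valid and misleading descriptions. The sketch below assumes a hypothetical schema: the class name, the field names and the pairing strategy are illustrative choices, and the sample strings are invented rather than taken from the paper's data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DescribedSentence:
    """One sentence with its LLM-generated descriptions (hypothetical schema)."""
    sentence: str
    positives: List[str] = field(default_factory=list)  # valid descriptions
    negatives: List[str] = field(default_factory=list)  # misleading descriptions


def to_triplets(record: DescribedSentence) -> List[Tuple[str, str, str]]:
    """Pair every positive description with every negative one."""
    return [
        (record.sentence, pos, neg)
        for pos in record.positives
        for neg in record.negatives
    ]


# Invented example. The summary reports five positive descriptions per
# sentence; only two are shown here for brevity.
record = DescribedSentence(
    sentence="The bridge collapsed in 1940 after only four months of use.",
    positives=[
        "A structure failing shortly after it opened",
        "An engineering disaster",
    ],
    negatives=["A bridge that stood for centuries"],
)
triplets = to_triplets(record)
```

Each resulting triplet holds a sentence, a description that fits it and one that does not, which is the form a contrastive training step would consume.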
Created on 10 Jun. 2023
