Large Language Models are Built-in Autoregressive Search Engines
AI-generated Key Points
- Large language models (LLMs) can generate Web URLs for document retrieval based on query-URL pairs as demonstrations.
- LLMs can act as built-in search engines without explicit training for mapping questions to document identifiers.
- Given a few Query-URL pairs as in-context demonstrations, LLMs generate URLs for which nearly 90% of the corresponding documents contain correct answers to open-domain questions.
- The method consistently outperforms existing retrieval approaches on three open-domain question answering benchmarks, in both zero-shot and few-shot settings.
- Future research directions include fine-tuning prompts for individual questions and using clustering to select diverse demonstrations for improved retrieval performance.
- Limitations include the need for retraining when updating knowledge, potential hallucination errors, and slow web requests and document processing.
- A case study comparing the LLM-URL approach with the Contriever and BM25 retrievers shows superior performance in retrieving answer-containing documents.
- LLMs offer effective URL generation for document retrieval but face challenges related to knowledge updating, hallucination errors, and practical usability.
Authors: Noah Ziems, Wenhao Yu, Zhihan Zhang, Meng Jiang
Abstract: Document retrieval is a key stage of standard Web search engines. Existing dual-encoder dense retrievers obtain representations for questions and documents independently, allowing for only shallow interactions between them. To overcome this limitation, recent autoregressive search engines replace the dual-encoder architecture by directly generating identifiers for relevant documents in the candidate pool. However, the training cost of such autoregressive search engines rises sharply as the number of candidate documents increases. In this paper, we find that large language models (LLMs) can follow human instructions to directly generate URLs for document retrieval. Surprisingly, when providing a few Query-URL pairs as in-context demonstrations, LLMs can generate Web URLs where nearly 90% of the corresponding documents contain correct answers to open-domain questions. In this way, LLMs can be thought of as built-in search engines, since they have not been explicitly trained to map questions to document identifiers. Experiments demonstrate that our method can consistently achieve better retrieval performance than existing retrieval approaches by a significant margin on three open-domain question answering benchmarks, under both zero and few-shot settings. The code for this work can be found at https://github.com/Ziems/llm-url.
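The few-shot setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the two demonstration pairs, the helper names (`build_prompt`, `extract_urls`, `contains_answer`, `retrieve_urls`), and the substring-based answer check are all assumptions made for this example. The `generate` argument stands in for any LLM completion function; no specific model API is assumed.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical Query-URL demonstrations, invented for this sketch.
DEMOS: List[Tuple[str, str]] = [
    ("Who wrote the novel Dracula?", "https://en.wikipedia.org/wiki/Dracula"),
    ("What is the capital of Australia?", "https://en.wikipedia.org/wiki/Canberra"),
]

# Matches http(s) URLs up to whitespace or a closing quote/bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def build_prompt(question: str, demos: Sequence[Tuple[str, str]] = DEMOS) -> str:
    """Format the demonstrations followed by the new question (assumed layout)."""
    lines = ["Generate a web address whose page answers the question.", ""]
    for demo_question, demo_url in demos:
        lines += [f"Question: {demo_question}", f"URL: {demo_url}", ""]
    lines += [f"Question: {question}", "URL:"]
    return "\n".join(lines)


def extract_urls(completion: str) -> List[str]:
    """Pull unique URLs out of the model's text, preserving order."""
    urls: List[str] = []
    for match in URL_PATTERN.findall(completion):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def contains_answer(document_text: str, answers: Sequence[str]) -> bool:
    """Assumed success criterion: any gold answer appears in the document."""
    text = document_text.lower()
    return any(answer.lower() in text for answer in answers)


def retrieve_urls(question: str, generate: Callable[[str], str]) -> List[str]:
    """Prompt the (caller-supplied) LLM and return the URLs it proposes."""
    return extract_urls(generate(build_prompt(question)))
```

In use, each returned URL would be fetched and its text passed to `contains_answer` together with the gold answers for the question; the share of questions with at least one answer-containing document then gives a retrieval score comparable in spirit to the figure quoted in the abstract. Passing an empty `demos` sequence to `build_prompt` gives a zero-shot variant of the same prompt.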