Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting

AI-generated keywords: Large Language Models, Text Ranking, Pairwise Ranking Prompting, TREC-DL2020, TREC-DL2019

AI-generated Key Points

  • Large Language Models (LLMs) have limited success in text ranking compared to baseline rankers
  • Existing methods struggle to outperform baseline rankers, except for a recent approach using a blackbox commercial system
  • The authors propose Pairwise Ranking Prompting (PRP) as a new technique to address this issue
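  • The authors argue that off-the-shelf LLMs do not fully understand pointwise and listwise ranking prompts, possibly because of how they are trained
  • PRP reduces the burden on LLMs by asking for a relative judgment between two candidate documents at a time, rather than a calibrated score or a full ranked list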
  • PRP achieves state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs
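  • On TREC-DL2020, PRP based on the 20B-parameter Flan-UL2 model outperforms the previous best approach, which is based on GPT-4, by over 5% at NDCG@1
  • On TREC-DL2019, PRP is inferior only to the GPT-4 solution on NDCG@5 and NDCG@10, and outperforms other existing solutions, such as the 175B-parameter InstructGPT, by over 10% for nearly all ranking metrics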
  • Several variants of PRP are proposed to improve efficiency with competitive results even with linear complexity
  • PRP has additional benefits such as supporting both generation and scoring LLM APIs and being insensitive to input ordering
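
The pairwise idea in the key points above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt wording, the `compare` callback (which stands in for an LLM call and returns "A" or "B"), the half-point split for inconsistent answers, and the word-overlap `toy_compare` are all assumptions made for the example. Each pair is queried in both orders, which is one way to reduce sensitivity to input ordering.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical comparator type: returns "A" if the first passage is judged
# more relevant to the query, "B" otherwise. In practice this would wrap an LLM call.
Compare = Callable[[str, str, str], str]


def build_prompt(query: str, passage_a: str, passage_b: str) -> str:
    """Illustrative pairwise prompt; the wording is an assumption."""
    return (
        f'Given a query "{query}", which of the following two passages '
        f"is more relevant to the query?\n\n"
        f"Passage A: {passage_a}\n\nPassage B: {passage_b}\n\n"
        f"Output Passage A or Passage B:"
    )


def rank_all_pairs(query: str, docs: List[str], compare: Compare) -> List[str]:
    """Score every document by pairwise wins, querying each pair in both orders."""
    scores = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        i_wins_first = compare(query, docs[i], docs[j]) == "A"   # docs[i] shown as A
        i_wins_second = compare(query, docs[j], docs[i]) == "B"  # docs[i] shown as B
        if i_wins_first and i_wins_second:
            scores[i] += 1.0
        elif not i_wins_first and not i_wins_second:
            scores[j] += 1.0
        else:
            # The two orders disagree: split the point (an assumption of this sketch).
            scores[i] += 0.5
            scores[j] += 0.5
    order = sorted(range(len(docs)), key=lambda k: -scores[k])  # stable for ties
    return [docs[k] for k in order]


def toy_compare(query: str, a: str, b: str) -> str:
    """Stand-in for an LLM: prefers the passage sharing more words with the query."""
    q = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(q & set(text.lower().split()))

    return "A" if overlap(a) >= overlap(b) else "B"


if __name__ == "__main__":
    docs = ["cats sleep a lot", "how to rank text with language models", "language models"]
    print(rank_all_pairs("rank text with language models", docs, toy_compare))
```

Comparing all pairs needs a number of calls that grows quadratically with the number of candidates, which is why cheaper variants are of interest.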

Authors: Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, Michael Bendersky

12 pages, 3 figures
License: CC BY 4.0

Abstract: Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, there has been limited success so far, as researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these ranking formulations, possibly due to the nature of how LLMs are trained. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL2020, PRP based on the Flan-UL2 model with 20B parameters outperforms the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, by over 5% at NDCG@1. On TREC-DL2019, PRP is only inferior to the GPT-4 solution on the NDCG@5 and NDCG@10 metrics, while outperforming other existing solutions, such as InstructGPT which has 175B parameters, by over 10% for nearly all ranking metrics. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity. We also discuss other benefits of PRP, such as supporting both generation and scoring LLM APIs, as well as being insensitive to input ordering.
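
For readers unfamiliar with the NDCG@k metric quoted in the abstract, the following is a minimal sketch of one common formulation. The exponential gain `2**rel - 1` and the log2 position discount are assumptions of this example; evaluation tools differ in their gain function, and the abstract does not state which variant the authors used. The ideal ordering here is computed only from the labels passed in.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets log2(2) = 1
        for pos, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[int], k: int) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Graded relevance labels of the documents, listed in the order the ranker returned them.
    print(ndcg_at_k([3, 2, 1, 0], k=1))  # best document ranked first
    print(ndcg_at_k([0, 3, 2, 1], k=1))  # an irrelevant document ranked first
```

NDCG@1 depends only on the top-ranked document, so it rewards a ranker that reliably places a highly relevant document first.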

Submitted to arXiv on 30 Jun. 2023


Results of the summarizing process for the arXiv paper: 2306.17563v1

Large Language Models (LLMs) have shown impressive performance on various natural language tasks. However, they have had limited success on the text ranking problem compared to fine-tuned baseline rankers. Existing methods struggle to outperform these baselines, and the only exception is a recent approach that relies on a blackbox commercial system. In this paper, the authors propose a new technique called Pairwise Ranking Prompting (PRP) to address this issue.

The authors analyze the pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these ranking formulations, possibly because of how they are trained. Pointwise approaches require LLMs to output calibrated prediction probabilities before sorting, which is challenging for them. Listwise approaches often produce conflicting or useless outputs on moderate-sized LLMs. To overcome these limitations, the authors introduce PRP as a way to reduce the burden on LLMs.

With PRP, the authors achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL2020, PRP based on the 20B-parameter Flan-UL2 model outperforms the previous best approach, which is based on GPT-4, by over 5% at NDCG@1. On TREC-DL2019, PRP is inferior only to the GPT-4 solution on NDCG@5 and NDCG@10, and outperforms other existing solutions, such as the 175B-parameter InstructGPT, by over 10% for nearly all ranking metrics. The authors also propose several variants of PRP to improve efficiency and show that competitive results can be achieved even with linear complexity. Additionally, they highlight other benefits of PRP, such as supporting both generation and scoring LLM APIs and being insensitive to input ordering. Overall, the paper presents PRP as a technique that significantly improves text ranking performance with moderate-sized LLMs.
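
The summary does not describe how the linear-complexity variants work. The sketch below shows one way a linear number of pairwise comparisons could produce a top-k ranking: a bubble-sort-style pass that moves from the bottom of an initial ranking to the top, repeated k times. The function names, the `prefer_second` callback (a stand-in for an LLM comparison), and the word-overlap toy comparator are assumptions for illustration, not the authors' code.

```python
from typing import Callable, List

# Hypothetical comparator: True if the second passage should rank above the first.
PreferSecond = Callable[[str, str, str], bool]


def sliding_pass(query: str, docs: List[str], prefer_second: PreferSecond) -> List[str]:
    """One bottom-to-top pass over adjacent pairs: len(docs) - 1 comparisons."""
    ranked = list(docs)
    for i in range(len(ranked) - 1, 0, -1):
        if prefer_second(query, ranked[i - 1], ranked[i]):
            ranked[i - 1], ranked[i] = ranked[i], ranked[i - 1]
    return ranked


def sliding_top_k(query: str, docs: List[str], prefer_second: PreferSecond, k: int) -> List[str]:
    """Repeat the pass k times; the first k positions then hold the top-k documents
    (assuming a consistent comparator). Cost is k * (len(docs) - 1) comparisons."""
    ranked = list(docs)
    for _ in range(k):
        ranked = sliding_pass(query, ranked, prefer_second)
    return ranked


def toy_prefer_second(query: str, first: str, second: str) -> bool:
    """Stand-in for an LLM: prefers the passage sharing more words with the query."""
    q = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(q & set(text.lower().split()))

    return overlap(second) > overlap(first)


if __name__ == "__main__":
    docs = ["a b", "query words", "more query words here", "x"]
    print(sliding_top_k("more query words", docs, toy_prefer_second, k=2))
```

For a fixed k, the number of comparisons grows linearly with the number of candidates, at the cost of only guaranteeing the order of the first k positions.
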
Created on 05 Jul. 2023
