LLMs may Dominate Information Access: Neural Retrievers are Biased Towards LLM-Generated Texts

AI-generated keywords: Source Bias, Information Retrieval, Language Models, Text Compression, Multimodal Content

AI-generated Key Points

  • Authors address the impact of generated content on information retrieval (IR) systems in the era of large language models (LLMs)
  • Introduce two new benchmarks, SciFact+AIGC and NQ320K+AIGC, to evaluate IR models with human-written and LLM-generated texts
  • Uncover "source bias" in neural retrieval models favoring LLM-generated text
  • Bias attributed to semantic concentration and less noise in LLM-generated texts
  • Discuss concerns and risks of source bias to the web ecosystem
  • Suggest exploring bias manifestation in other information systems beyond IR
  • Highlight biases towards other data modalities like images and multimodal content
  • Propose investigating source bias on tasks beyond texts, such as image-text retrieval
  • Research offers directions for future exploration, raises questions about mitigating biases caused by LLMs' influence, and highlights the risks associated with these biases
  • Two new benchmarks to be made available on GitHub for further research in IR during the LLM era

Authors: Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Jun Xu

License: CC BY-NC-SA 4.0

Abstract: Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search. With their remarkable capabilities in generating human-like texts, LLMs have produced an enormous amount of text on the Internet. As a result, IR systems in the LLM era are facing a new challenge: the indexed documents are now not only written by human beings but also automatically generated by LLMs. How these LLM-generated documents influence IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of different IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of biases in neural retrieval models towards LLM-generated text as the source bias. Moreover, we discover that this bias is not confined to the first-stage neural retrievers, but extends to the second-stage neural re-rankers. Then, we provide an in-depth analysis from the perspective of text compression and observe that neural models can better understand the semantic information of LLM-generated text, which is further substantiated by our theoretical analysis. We also discuss the potential severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the two newly constructed benchmarks and the code will later be available at https://github.com/KID-22/LLM4IR-Bias.
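To make the idea of "ranking LLM-generated documents higher" concrete, here is a minimal sketch of one way such a gap could be quantified. It is an illustration under stated assumptions, not the paper's evaluation protocol: the function names, the use of mean reciprocal rank, and the relative-difference formula are all hypothetical choices made for this example. It assumes each document exists in a human-written and an LLM-generated copy that share one document id.

```python
from typing import List, Set, Tuple

# One ranked result list: (doc_id, source) pairs, best first.
# `source` is "human" or "llm"; both copies of a document share a doc_id.
Ranking = List[Tuple[str, str]]


def reciprocal_rank(ranking: Ranking, relevant: Set[str], source: str) -> float:
    """Return 1/rank of the first relevant document from `source`, or 0.0 if absent."""
    for position, (doc_id, doc_source) in enumerate(ranking, start=1):
        if doc_source == source and doc_id in relevant:
            return 1.0 / position
    return 0.0


def source_bias_gap(runs: List[Tuple[Ranking, Set[str]]]) -> float:
    """Relative difference (in percent) between human and LLM mean reciprocal rank.

    Hypothetical measure for illustration only. Negative values mean the
    LLM-generated copies were ranked higher on average.
    """
    if not runs:
        raise ValueError("at least one (ranking, relevant set) pair is required")
    mrr_human = sum(reciprocal_rank(r, rel, "human") for r, rel in runs) / len(runs)
    mrr_llm = sum(reciprocal_rank(r, rel, "llm") for r, rel in runs) / len(runs)
    mean = (mrr_human + mrr_llm) / 2
    if mean == 0:
        return 0.0  # neither source retrieved anything relevant
    return 100.0 * (mrr_human - mrr_llm) / mean


# Usage: the LLM copy of d1 is ranked first, the human copy third,
# so the gap is negative (the ranking favors the LLM-generated copy).
example_runs = [([("d1", "llm"), ("d2", "human"), ("d1", "human")], {"d1"})]
print(source_bias_gap(example_runs))
```

Normalizing by the mean of the two scores keeps the gap comparable across retrievers whose absolute scores differ; any other ranking metric could be substituted for reciprocal rank.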

Submitted to arXiv on 31 Oct. 2023

Results of the summarizing process for the arXiv paper: 2310.20501v1

In this paper, the authors address an emerging problem: the impact of generated content on information retrieval (IR) systems in the era of large language models (LLMs). They introduce two new benchmarks, SciFact+AIGC and NQ320K+AIGC, to evaluate IR models in scenarios where both human-written and LLM-generated texts are involved. Through extensive experiments, they uncover an unexpected bias of neural retrieval models in favor of LLM-generated text, which they refer to as "source bias".

The authors analyze this bias from the perspective of text compression. They observe that LLM-generated texts tend to have more focused semantics and less noise than human-written texts, making them more suitable for precise semantic similarity calculations. This difference in semantic concentration contributes to the observed bias in neural IR models.

The authors also discuss the risks this bias poses to the entire web ecosystem. They emphasize the need to mitigate the source bias and ensure sustainable development in the face of LLMs' influence. Additionally, they suggest exploring whether this bias manifests in information systems beyond IR, such as recommender and advertising systems. They note that neural models may exhibit biases not only towards generated text but also towards other data modalities, such as images and multimodal content, and they propose investigating source bias on tasks beyond text, such as image-text retrieval, especially given emerging generative models capable of producing high-quality images.

Overall, this research points to several promising directions for future exploration. It raises important questions about mitigating biases caused by LLMs' influence on information systems and highlights the risks associated with these biases. The findings serve as a critical wake-up call to the IR community and beyond.
To facilitate further research in IR during the LLM era, the authors have constructed two new benchmarks and made them available along with their code on GitHub.
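The point about less noisy text being better suited to semantic similarity calculations can be illustrated with a toy cosine-similarity example. This is only a sketch: the three-dimensional vectors are made up for illustration and do not come from the paper or from any real embedding model, and the names `focused_doc` and `noisy_doc` are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Made-up vectors: the first dimension stands for the query topic,
# the other two for unrelated ("noise") content.
query = [1.0, 0.0, 0.0]
focused_doc = [0.9, 0.1, 0.0]  # almost all mass on the query topic
noisy_doc = [0.9, 0.6, 0.5]    # same on-topic signal, more off-topic mass

# The focused vector scores higher even though both carry the same
# amount of on-topic signal, because the noise inflates the norm.
print(cosine(query, focused_doc) > cosine(query, noisy_doc))
```

Both toy documents carry the same on-topic component; only the off-topic components differ, which is enough to lower the similarity score of the noisier one.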
Created on 01 Nov. 2023
