GPT-4 is judged more human than humans in displaced and inverted Turing tests

AI-generated keywords: Everyday AI detection, Turing test, GPT models, Human-AI interactions, Transcript length

AI-generated Key Points

  • Study focuses on the challenge of everyday AI detection: telling people from AI in informal online conversations
  • Two modified versions of the Turing test, inverted and displaced, measured how well people and LLMs discriminate between human and AI agents
  • Adjudicators were GPT-3.5, GPT-4, and displaced humans; all three showed below-chance accuracy overall
  • All three judged the best-performing GPT-4 witness to be human more often than actual human witnesses
  • In online conversations, an AI system may be perceived as human more often than an actual person
  • Transcript length had a counter-intuitive effect on accuracy: shorter transcripts may have been more helpful, possibly because of biases in how transcript length was determined
  • Human adjudicators completed transcripts in series, whereas LLM adjudicators assessed each transcript separately, a difference that may have influenced judgment accuracy

Authors: Ishika Rathi, Sydney Taylor, Benjamin K. Bergen, Cameron R. Jones

License: CC BY 4.0

Abstract: Everyday AI detection requires differentiating between people and AI in informal, online conversations. In many cases, people will not interact directly with AI systems but instead read conversations between AI systems and other people. We measured how well people and large language models can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below chance accuracy overall. Moreover, all three judged the best-performing GPT-4 witness to be human more often than human witnesses. This suggests that both humans and current LLMs struggle to distinguish between the two when they are not actively interrogating the person, underscoring an urgent need for more accurate tools to detect AI in conversations.

Submitted to arXiv on 11 Jul. 2024

Results of the summarizing process for the arXiv paper: 2407.08853v1

The study examines the challenge of everyday AI detection: distinguishing people from AI in informal online conversations. To measure how well people and large language models (LLMs) discriminate between human and AI agents, the authors ran two modified versions of the Turing test, inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators each judged whether a witness was human or AI on the basis of a Turing test transcript. All three groups were less accurate than interactive interrogators, with below-chance accuracy overall.

All three also judged the best-performing GPT-4 witness to be human more often than they judged actual human witnesses to be human: that witness had a higher pass rate than human witnesses in both the inverted and displaced tests. This suggests that, in online conversations between humans and AI models, the AI system may be more likely to be perceived as human than an actual person is.

The analysis also pointed to two factors that may affect accuracy. First, transcript length had a counter-intuitive effect: shorter transcripts may contain information more helpful to adjudicators, possibly because of biases in how transcript length was determined. Second, human adjudicators completed transcripts in series, whereas LLM adjudicators assessed each transcript separately, a difference that may have influenced judgment accuracy.

Overall, the study emphasizes the need for better tools for detecting AI in conversations, since both humans and current LLMs struggle to tell human from AI interactions without active interrogation.
Created on 25 Aug. 2024
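
The two quantities the summary relies on, adjudicator accuracy and witness pass rate, can be made concrete with a short sketch. This is an illustration under stated assumptions, not the authors' analysis code: the `Verdict` record, the function names, and the toy verdicts are all hypothetical, and the toy numbers are invented for the example rather than taken from the paper. Here "accuracy" is the share of verdicts that match the witness's true identity, and "pass rate" is the share of a witness group judged to be human.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Verdict:
    """One adjudicator decision about one transcript (hypothetical record)."""
    witness_is_human: bool  # ground truth about the witness
    judged_human: bool      # what the adjudicator decided


def accuracy(verdicts: Sequence[Verdict]) -> float:
    """Fraction of verdicts that match the witness's true identity."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    correct = sum(v.witness_is_human == v.judged_human for v in verdicts)
    return correct / len(verdicts)


def pass_rate(verdicts: Sequence[Verdict], witness_is_human: bool) -> float:
    """Fraction of one witness group (human or AI) that was judged human."""
    group = [v for v in verdicts if v.witness_is_human == witness_is_human]
    if not group:
        raise ValueError("no verdicts for this witness group")
    return sum(v.judged_human for v in group) / len(group)


# Invented toy data: 4 human witnesses (2 judged human) and
# 4 AI witnesses (3 judged human). Not results from the paper.
toy_verdicts = (
    [Verdict(True, True)] * 2
    + [Verdict(True, False)] * 2
    + [Verdict(False, True)] * 3
    + [Verdict(False, False)] * 1
)

print("accuracy:", accuracy(toy_verdicts))
print("human pass rate:", pass_rate(toy_verdicts, witness_is_human=True))
print("AI pass rate:", pass_rate(toy_verdicts, witness_is_human=False))
```

In this toy set the AI pass rate exceeds the human pass rate, and accuracy falls below the 0.5 expected from guessing. That is the same qualitative pattern the summary describes, shown here only to clarify how the two measures relate.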
