ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

AI-generated keywords: ToxiGen

AI-generated Key Points

  • Toxic language detection systems often falsely flag text that mentions minority groups as toxic
  • Over-reliance on spurious correlations hinders the detection of implicitly toxic language
  • ToxiGen is a large-scale and machine-generated dataset consisting of 274k toxic and benign statements about 13 minority groups
  • ToxiGen uses a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method
  • ToxiGen covers implicitly toxic text on a larger scale and across more demographic groups compared to previous datasets composed of human-written text
  • Human evaluation found that annotators struggled to distinguish between machine-generated text and human-written language
  • 94.5% of the toxic examples in ToxiGen were labeled as hate speech by human annotators
  • Finetuning a toxicity classifier on ToxiGen significantly improved its performance on human-written data from three publicly available datasets
  • Finetuning the classifier with ToxiGen led to significant improvements in combating machine-generated toxicity
  • Demonstration-based prompting reliably generated toxic and benign statements about minority groups
  • Machine-generated examples had 30.2% harmful content and only 4% ambiguous content, adequately representing both categories
  • All identity groups covered by ToxiGen were represented in the human study, although the group referenced by the generated text sometimes deviated from the group in the prompt, possibly because the language model conflated groups or mentioned several at once
  • No significant difference in perceived toxicity was found between machine-generated text and human-written text
  • "Moral judgement," a framing tactic previously linked to toxicity, was the most common framing tactic in generated statements
  • ALICE-generated statements were more adversarial than top-k-generated ones, validating the generation methods

Authors: Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar

Published as a long paper at ACL 2022. Code: https://github.com/microsoft/TOXIGEN
License: CC BY 4.0

Abstract: Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. We conduct a human evaluation on a challenging subset of ToxiGen and find that annotators struggle to distinguish machine-generated text from human-written language. We also find that 94.5% of toxic examples are labeled as hate speech by human annotators. Using three publicly-available datasets, we show that finetuning a toxicity classifier on our data improves its performance on human-written data substantially. We also demonstrate that ToxiGen can be used to fight machine-generated toxicity as finetuning improves the classifier significantly on our evaluation subset. Our code and data can be found at https://github.com/microsoft/ToxiGen.

Submitted to arXiv on 17 Mar. 2022

Results of the summarizing process for the arXiv paper: 2203.09509v4

Toxic language detection systems often falsely flag text that mentions minority groups as toxic, because these groups are frequently the targets of online hate. This over-reliance on spurious correlations also makes it hard for such systems to detect implicitly toxic language. To address these issues, the researchers developed ToxiGen, a large-scale, machine-generated dataset of 274k toxic and benign statements about 13 minority groups. They used a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. Controlling generation in this way lets ToxiGen cover implicitly toxic text at a larger scale, and across more demographic groups, than previous datasets of human-written text.

In a human evaluation on a challenging subset of ToxiGen, annotators struggled to distinguish machine-generated text from human-written language, and 94.5% of the toxic examples were labeled as hate speech. Finetuning a toxicity classifier on ToxiGen substantially improved its performance on human-written data from three publicly available datasets. ToxiGen can also be used to combat machine-generated toxicity: finetuning significantly improved the classifier's performance on the ToxiGen evaluation subset.

In further analysis, the researchers found that demonstration-based prompting reliably generated toxic and benign statements about minority groups. Of the machine-generated examples, 30.2% were deemed harmful and only 4% were considered ambiguous, indicating that the generated data adequately represents both the toxic and benign categories.
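The demonstration-based prompting mentioned above can be illustrated with a minimal sketch. It assumes a prompt is simply a list of example statements followed by an open list item that the language model is asked to continue. The function name `build_prompt`, the dash-list layout, and the placeholder demonstrations are illustrative assumptions, not details taken from the paper or the ToxiGen repository.

```python
import random


def build_prompt(demonstrations, k=3, seed=0):
    """Sample k demonstrations and format them as a dash list.

    The final bare "-" is an open list item: a language model that
    continues the text would be expected to write one more statement
    in the same style as the demonstrations. (Hypothetical layout.)
    """
    rng = random.Random(seed)
    chosen = rng.sample(demonstrations, k)
    lines = [f"- {statement}" for statement in chosen]
    lines.append("-")  # open item for the model to complete
    return "\n".join(lines)


# Placeholder benign demonstrations about a hypothetical group X.
benign_demos = [
    "placeholder benign statement 1 about group X",
    "placeholder benign statement 2 about group X",
    "placeholder benign statement 3 about group X",
    "placeholder benign statement 4 about group X",
]

prompt = build_prompt(benign_demos, k=3)
print(prompt)
```

Swapping the demonstration pool (for example, benign versus toxic examples, or a different target group) is what would steer the style and subject of the continuation in this kind of setup.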
Moreover, all identity groups covered by ToxiGen were represented in the human study, although the group referenced by a prompt did not always match the group referenced by the corresponding ToxiGen text, possibly because the language model conflated groups or mentioned several at once. There was no significant difference in perceived toxicity between machine-generated and human-written text. The most common framing tactic in the generated statements was "moral judgement," which involves questioning the morality of an identity group; other studies have previously linked this tactic to toxicity. To validate their generation methods, the researchers compared statements generated with ALICE (the adversarial classifier-in-the-loop decoding method) against statements generated with top-k decoding and found the ALICE-generated statements to be more adversarial.
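The idea behind classifier-in-the-loop decoding can be sketched with toy components. The example below is not ALICE itself: it uses greedy decoding, a four-token vocabulary, and hard-coded stand-ins for the language model and the classifier (all hypothetical, including the weights). It only shows how a classifier score can be mixed into the next-token choice so that the output drifts away from what the language model alone would produce.

```python
import math

VOCAB = ["a", "b", "c", "<eos>"]


def toy_lm_logprobs(prefix):
    """Stand-in language model: a fixed next-token distribution."""
    probs = {"a": 0.5, "b": 0.3, "c": 0.15, "<eos>": 0.05}
    return {tok: math.log(p) for tok, p in probs.items()}


def toy_classifier_prob(tokens):
    """Stand-in classifier: probability of the label the decoder is
    steering toward, here a smoothed fraction of "b" tokens."""
    return (tokens.count("b") + 1) / (len(tokens) + 2)


def decode(lm_weight, clf_weight, max_len=5):
    """Greedy decoding with a weighted mix of LM and classifier scores."""
    prefix = []
    for _ in range(max_len):
        lm = toy_lm_logprobs(prefix)
        scores = {
            tok: lm_weight * lm[tok]
            + clf_weight * math.log(toy_classifier_prob(prefix + [tok]))
            for tok in VOCAB
        }
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix


lm_only = decode(lm_weight=1.0, clf_weight=0.0)
steered = decode(lm_weight=0.5, clf_weight=0.5)
print("LM only:", lm_only)
print("Steered:", steered)
```

With the classifier weight set to zero the toy decoder always picks the language model's most likely token; with a non-zero weight, tokens that raise the classifier score start to win some steps. A real system would replace both stand-ins with trained models and would typically use a search procedure rather than greedy selection.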
Created on 08 Jul. 2023
