Better Synthetic Data by Retrieving and Transforming Existing Datasets

AI-generated keywords: AI research, high-quality training data, NLP models, DataTune, dataset generation

AI-generated Key Points

  • Need for abundant and dependable data remains a significant bottleneck in AI research
  • Manual curation of task-specific annotated data is labor-intensive
  • Previous studies have explored prompt-driven synthetic data generation using large language models
  • DataTune method introduced to enhance automatic dataset generation by leveraging existing datasets more effectively
  • DataTune facilitates dataset transformation to align with specific task requirements
  • Experiments show that finetuning language models with DataTune outperforms a few-shot prompting baseline by 49%
  • Dataset transformation increased diversity and difficulty of generated data across tasks
  • DataTune integrated into an open-source repository for broader research community access

Authors: Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

License: CC BY 4.0

Abstract: Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.

Submitted to arXiv on 22 Apr. 2024

Results of the summarizing process for the arXiv paper: 2404.14361v1

In AI research, the need for high-quality training data remains a significant bottleneck despite recent advances in large language models. Abundant and dependable data is crucial for building and deploying NLP models effectively. However, many specialized or emerging tasks lack task-specific annotated data, and curating such data manually is labor-intensive. To address this challenge, previous studies have explored prompt-driven synthetic data generation using large language models, yet the resulting datasets often lack complexity and diversity. In response to these limitations, the authors introduce a method called DataTune, which improves automatic dataset generation by making better use of existing, publicly available datasets. DataTune performs dataset transformation, repurposing existing datasets into formats that align directly with the requirements of a target task. In experiments on a diverse set of language-based tasks from the BIG-Bench benchmark, finetuning language models using DataTune outperformed a few-shot prompting baseline by 49% and surpassed existing methods that use synthetic or retrieved training data by 34%. Notably, dataset transformation significantly increased the diversity and difficulty of the generated data on many tasks. DataTune has also been integrated into an open-source repository, making the method available to the broader research community at https://github.com/neulab/prompt2model. This advance holds promise for improving the quality and efficiency of dataset generation for specialized tasks, offering a useful tool for researchers working in low-resource settings where task-specific annotated data is limited.
Created on 24 Apr. 2024
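
The retrieve-then-transform idea summarized above can be illustrated with a toy sketch. This is not the DataTune implementation or the prompt2model API: the `Dataset` class, the word-overlap retriever, and the rule-based `to_yes_no_example` transform are all hypothetical stand-ins, chosen only to show how an existing dataset might be selected and then rewritten into a target task's input/output format.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


@dataclass
class Dataset:
    """Hypothetical container for a public dataset and its description."""
    name: str
    description: str
    rows: List[Row]


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(task_description: str, candidates: List[Dataset]) -> Dataset:
    """Toy retriever (assumption): pick the dataset whose description
    shares the most words with the task description."""
    task_words = tokenize(task_description)
    return max(candidates, key=lambda d: len(task_words & tokenize(d.description)))


def transform(dataset: Dataset,
              transform_row: Callable[[Row], Optional[Row]]) -> List[Row]:
    """Rewrite every row into the target format, dropping rows the
    transform cannot handle (signalled by returning None)."""
    converted = (transform_row(row) for row in dataset.rows)
    return [row for row in converted if row is not None]


def to_yes_no_example(row: Row) -> Optional[Row]:
    """Hypothetical rule-based transform into a yes/no question format.
    A real system would use a far more capable transformation step."""
    if "claim" not in row or "label" not in row:
        return None
    answer = "yes" if row["label"] == "1" else "no"
    return {"input": f"Is the following claim true? {row['claim']}",
            "output": answer}


# Made-up candidate datasets, for illustration only.
candidates = [
    Dataset("movie_reviews", "sentiment labels for movie reviews",
            [{"text": "Great film", "label": "1"}]),
    Dataset("fact_check", "short factual claims with true or false labels",
            [{"claim": "Water boils at 100 C at sea level.", "label": "1"},
             {"claim": "The sun orbits the earth.", "label": "0"},
             {"note": "malformed row"}]),
]

task = "Answer yes or no: is the factual claim true"
chosen = retrieve(task, candidates)
examples = transform(chosen, to_yes_no_example)

for example in examples:
    print(example)
```

In this sketch the retrieval and transformation steps are deliberately simple and interchangeable; the point is only the pipeline shape, in which existing rows supply the content and the transform supplies the task-specific format.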
