Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

AI-generated keywords: Data Management

AI-generated Key Points

  • Long-standing goal: Develop automated systems for processing semi-structured documents without human effort or domain-specific customization
  • Current systems rely on simplifying assumptions and domain-specific training
  • Researchers explore the use of large language models (LLMs) for generality in this task
  • Prototype system called EVAPORATE powered by LLMs
  • Two strategies: direct extraction using LLMs or code synthesis
  • Tradeoff between cost and quality: code synthesis is cheap but far less accurate than processing each document directly with the LLM
  • EVAPORATE-CODE+ extends code synthesis by generating many candidate functions and ensembling their extractions using weak supervision, achieving better quality than direct extraction
  • EVAPORATE-CODE+ outperforms state-of-the-art systems and reduces the number of tokens the LLM must process by 110x, averaged across 16 real-world evaluation settings of 10k documents each
  • Focus on converting heterogeneous data sources into structured tables for analytical queries (webpages, PDFs, text documents)
  • Example application: structuring attributes related to medical devices in FDA 510(k) reviews for premarket notification submissions
  • LLM-based systems effectively generate structured views of heterogeneous data

Authors: Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Ré

License: CC ZERO 1.0

Abstract: A long-standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain-specific customization. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks simply conditioned on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms the state-of-the-art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.
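
The idea of generating many candidate functions and ensembling their extractions can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the extractor names, the regular expressions, and the sample document are hypothetical, the candidate functions are hand-written rather than synthesized by an LLM, and a plain majority vote stands in for the weak supervision step, which the abstract does not specify in detail.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

# A candidate extractor maps a document to a value, or None to abstain.
Extractor = Callable[[str], Optional[str]]


def by_device_label(doc: str) -> Optional[str]:
    """Hypothetical candidate: look for a 'Device Name:' field."""
    match = re.search(r"Device Name:\s*(.+)", doc)
    return match.group(1).strip() if match else None


def by_trade_label(doc: str) -> Optional[str]:
    """Hypothetical candidate: look for a 'Trade Name:' field."""
    match = re.search(r"Trade Name:\s*(.+)", doc)
    return match.group(1).strip() if match else None


def by_first_line(doc: str) -> Optional[str]:
    """Hypothetical, deliberately noisy candidate: return the first line."""
    lines = doc.strip().splitlines()
    return lines[0].strip() if lines else None


CANDIDATES: List[Extractor] = [by_device_label, by_trade_label, by_first_line]


def ensemble(doc: str, candidates: List[Extractor]) -> Optional[str]:
    """Combine candidate outputs by majority vote over non-abstaining functions.

    Majority vote is a simplified stand-in for weak supervision, which would
    instead weight each function by an estimate of its accuracy.
    """
    votes = [value for value in (f(doc) for f in candidates) if value is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


sample_doc = (
    "510(k) Summary\n"
    "Device Name: Acme Pulse Oximeter\n"
    "Trade Name: Acme Pulse Oximeter\n"
)
print(ensemble(sample_doc, CANDIDATES))  # prints "Acme Pulse Oximeter"
```

Two of the three candidates agree, so the noisy first-line extractor is outvoted. Because the functions run locally, applying them to every document requires no further LLM calls.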

Submitted to arXiv on 19 Apr. 2023


Results of the summarizing process for the arXiv paper: 2304.09433v2

A long-standing goal in the data management community is to develop automated systems that process semi-structured documents and output queryable tables without human effort or domain-specific customization. Because potential documents vary so widely, current systems make simplifying assumptions and rely on domain-specific training. In this study, the researchers ask whether large language models (LLMs) can preserve generality in this task.

The researchers propose a prototype system called EVAPORATE, which is powered by LLMs. They investigate two strategies for implementing it: prompting the LLM to extract values directly from each document, or prompting the LLM to synthesize code that performs the extraction. Their evaluations show a tradeoff between cost and quality: code synthesis is cheap, but far less accurate than processing each document directly with the LLM.

To improve quality while keeping cost low, the researchers introduce an extended code synthesis implementation called EVAPORATE-CODE+. It generates many candidate functions and ensembles their extractions using weak supervision, and it achieves better quality than direct extraction. EVAPORATE-CODE+ not only outperforms state-of-the-art systems but also does so with a sublinear pass over the documents, reducing the number of tokens the LLM needs to process by 110x, averaged across 16 real-world evaluation settings of 10k documents each.

The study focuses on systems that convert heterogeneous data sources, including webpages, PDFs, and text documents, into structured tables for analytical queries. The proposed systems aim to identify the schema of these documents and perform extraction to populate the tables automatically. As an example application, medical researchers often analyze data from electronic health records (EHR), clinical trials, PubMed knowledge sources, and FDA reports. The researchers use FDA 510(k) reviews for premarket notification submissions as a motivating setting, where the objective is to automatically structure attributes related to medical devices and output them as a table.

Overall, the study demonstrates that LLM-based systems can effectively generate structured views of heterogeneous data. The proposed EVAPORATE-CODE+ system achieves high-quality results while minimizing the processing load on the LLM.
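
To make the cost side of the tradeoff concrete, the following Python sketch compares LLM token counts for the two strategies. It is a back-of-the-envelope model under stated assumptions, not the paper's accounting: the function names are hypothetical, prompt overhead and generated tokens are ignored, each candidate function is assumed to be synthesized from the same small document sample, and the corpus size, sample size, and candidate count are illustrative numbers rather than reported measurements.

```python
from typing import Sequence


def direct_extraction_tokens(doc_token_counts: Sequence[int]) -> int:
    """Direct extraction: the LLM reads every document once."""
    return sum(doc_token_counts)


def code_synthesis_tokens(
    doc_token_counts: Sequence[int], sample_size: int, num_candidates: int
) -> int:
    """Code synthesis: the LLM reads only a small sample of documents,
    once per candidate function (an assumption of this sketch)."""
    sample = doc_token_counts[:sample_size]
    return num_candidates * sum(sample)


# Illustrative corpus: 10,000 documents of 1,000 tokens each.
corpus = [1_000] * 10_000

direct = direct_extraction_tokens(corpus)
synthesis = code_synthesis_tokens(corpus, sample_size=10, num_candidates=10)

print(f"direct extraction: {direct:,} tokens")
print(f"code synthesis:    {synthesis:,} tokens")
print(f"reduction factor:  {direct / synthesis:.0f}x")
```

In this model the synthesis cost depends only on the sample size and the number of candidates, not on the corpus size, which is what makes the LLM pass sublinear in the number of documents.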
Created on 25 Jul. 2023

