Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
AI-generated Key Points
- Long-standing goal: Develop automated systems for processing semi-structured documents without human effort or domain-specific customization
- Current systems rely on simplifying assumptions and domain-specific training
- Researchers explore the use of large language models (LLMs) for generality in this task
- Prototype system called EVAPORATE powered by LLMs
- Two strategies: direct extraction using LLMs or code synthesis
- Tradeoff between cost and quality: code synthesis is cheaper but less accurate than direct extraction with LLMs
- Introduction of EVAPORATE-CODE+ improves quality by generating multiple candidate functions and ensembling their extractions using weak supervision
- EVAPORATE-CODE+ outperforms state-of-the-art systems and reduces the number of tokens the LLM must process by 110x, averaged across 16 real-world evaluation settings of 10k documents each
- Focus on converting heterogeneous data sources into structured tables for analytical queries (webpages, PDFs, text documents)
- Example application: structuring attributes related to medical devices in FDA 510(k) reviews for premarket notification submissions
- LLM-based systems effectively generate structured views of heterogeneous data
Authors: Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Ré
Abstract: A long-standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain-specific customization. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks simply conditioned on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms the state-of-the-art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.
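Illustrative sketch (not the authors' implementation): the idea of running many candidate extraction functions and ensembling their outputs can be pictured roughly as below. The three candidate functions, the sample document, and the attribute labels are hypothetical stand-ins for LLM-synthesized code. An unweighted majority vote is used here as a deliberately simplified substitute for the weak-supervision aggregation the abstract mentions, which this sketch does not reproduce.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

Extractor = Callable[[str], Optional[str]]


# Hypothetical candidate functions. In the system described above these would
# be synthesized by an LLM; here they are hand-written for illustration.
def by_device_label(doc: str) -> Optional[str]:
    match = re.search(r"Device Name:\s*(.+)", doc)
    return match.group(1).strip() if match else None


def by_trade_label(doc: str) -> Optional[str]:
    match = re.search(r"Trade Name:\s*(.+)", doc)
    return match.group(1).strip() if match else None


def by_first_line(doc: str) -> Optional[str]:
    # A deliberately noisy heuristic: assume the first line is the value.
    lines = doc.strip().splitlines()
    return lines[0].strip() if lines else None


CANDIDATES: List[Extractor] = [by_device_label, by_trade_label, by_first_line]


def ensemble_extract(doc: str, candidates: List[Extractor]) -> Optional[str]:
    """Run every candidate and return the most frequent non-empty value.

    Assumption: a plain majority vote replaces the weak-supervision step.
    """
    votes: Counter = Counter()
    for extract in candidates:
        try:
            value = extract(doc)
        except Exception:
            continue  # a faulty candidate simply abstains
        if value:
            votes[value] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical document text, loosely modeled on a 510(k)-style summary.
SAMPLE_DOC = (
    "Device Name: CardioScan X\n"
    "Trade Name: CardioScan X\n"
    "Regulation Number: 870.2300\n"
)

if __name__ == "__main__":
    print(ensemble_extract(SAMPLE_DOC, CANDIDATES))
```

In this toy setup the two label-based functions agree and outvote the noisy first-line heuristic. Because the functions are plain code, they can be applied to every document without further LLM calls, which is the intuition behind the sublinear LLM pass described in the abstract.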