PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models

AI-generated keywords: PluRel, Relational Foundation Models, privacy-preserving, synthetic data scaling, AI systems

AI-generated Key Points

PluRel is a framework for synthesizing diverse relational databases from scratch, which avoids exposing private data.
RFMs support data-driven decision-making, but the scarcity of publicly available complex multi-table databases hinders their development.
PluRel enables the synthesis of arbitrarily many relational databases with customizable schemas, connectivity patterns, and data distributions.
The framework proceeds step by step, modeling schemas with directed graphs, inter-table primary-foreign key connectivity with bipartite graphs, and feature distributions with conditional causal mechanisms, while remaining computationally lightweight.
Scaling the number of synthetic databases improves generalization to real databases, and synthetic pretraining yields strong base models for continued pretraining on real databases.
These results position synthetic data scaling as a promising paradigm for advancing RFMs and for developing AI systems on enterprise data without compromising privacy.
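The first two stages named in the key points (a directed graph for the schema, a bipartite graph for primary-foreign key links) can be illustrated with a minimal sketch. This is not PluRel's implementation: the function names, the `edge_prob` parameter, and the uniform sampling of parent rows are all assumptions made for illustration.

```python
import random


def sample_schema(num_tables, edge_prob, rng):
    """Sample a schema as a directed acyclic graph (hypothetical helper).

    An edge (child, parent) means the child table holds a foreign key
    that references the parent table's primary key. Edges only run from
    a higher table index to a lower one, so no cycle can form.
    """
    edges = []
    for child in range(num_tables):
        for parent in range(child):
            if rng.random() < edge_prob:
                edges.append((child, parent))
    return edges


def sample_fk_links(num_child_rows, num_parent_rows, rng):
    """Sample bipartite row-level connectivity (hypothetical helper).

    Entry i is the parent row referenced by child row i. Uniform sampling
    is an assumption; other connectivity patterns could be swapped in.
    """
    return [rng.randrange(num_parent_rows) for _ in range(num_child_rows)]


def build_database(num_tables=4, rows_per_table=5, edge_prob=0.6, seed=0):
    """Combine both stages: a schema graph plus one FK mapping per edge."""
    rng = random.Random(seed)
    edges = sample_schema(num_tables, edge_prob, rng)
    fks = {
        (child, parent): sample_fk_links(rows_per_table, rows_per_table, rng)
        for child, parent in edges
    }
    return edges, fks


edges, fks = build_database()
```

Changing `seed`, `num_tables`, or `edge_prob` yields a different schema and connectivity pattern, which is the sense in which such a design space can produce many distinct databases cheaply.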


Authors: Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec

arXiv: 2602.04029v1 (cs.DB)

Code: https://github.com/snap-stanford/plurel

License: CC BY 4.0

Abstract: Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary-foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases, while being computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
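To make stage (3) of the abstract more concrete, here is a minimal sketch of what a conditional causal mechanism across a foreign-key link might look like. The abstract does not give the functional form, so the linear-plus-Gaussian-noise mechanism, the function name, and the `weight` and `noise_std` parameters are assumptions chosen only to show a child feature that depends on its referenced parent row.

```python
import random


def generate_child_feature(parent_values, fk_links, weight=0.8,
                           noise_std=0.1, seed=0):
    """Generate one numeric feature per child row (hypothetical helper).

    parent_values[j] is a feature of parent row j, and fk_links[i] is the
    parent row referenced by child row i. Each child value is a linear
    function of its parent's value plus Gaussian noise, an assumed
    stand-in for a conditional causal mechanism.
    """
    rng = random.Random(seed)
    return [
        weight * parent_values[parent] + rng.gauss(0.0, noise_std)
        for parent in fk_links
    ]


# Three parent rows, four child rows that reference them.
parent_values = [1.0, 2.0, 3.0]
fk_links = [0, 2, 2, 1]
child_values = generate_child_feature(parent_values, fk_links)
```

Because each child value is conditioned on the parent row it points to, features stay correlated along the foreign-key structure rather than being sampled independently per table.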

Submitted to arXiv on 03 Feb. 2026


Results of the summarizing process for the arXiv paper: 2602.04029v1


PluRel is a framework for generating diverse relational databases to train Relational Foundation Models (RFMs) in a privacy-preserving manner. RFMs support data-driven decision-making, but the scarcity of publicly available complex multi-table databases hinders their development. PluRel enables the synthesis of arbitrarily many relational databases with customizable schemas, connectivity patterns, and data distributions. By modeling schemas with directed graphs and inter-table primary-foreign key connectivity with bipartite graphs, PluRel offers a step-by-step approach to creating a wide range of diverse databases while maintaining computational efficiency.

The framework's design space supports the generation of synthetic data at scale, which allowed the authors to observe power-law scaling of RFM pretraining loss with the number of synthetic databases and with total pretraining tokens. Furthermore, experiments conducted using PluRel show that scaling the number of synthetic databases improves generalization to real databases and that synthetic pretraining yields strong base models for continued pretraining on real databases.

These results position synthetic data scaling as a promising paradigm for advancing RFMs and lay the groundwork for developing AI systems on enterprise data without compromising privacy. In conclusion, PluRel unlocks new possibilities for RFMs by addressing the scarcity of diverse relational datasets without requiring private data to be made public. This work contributes to the field of RFMs and relational deep learning, paving the way for more robust and scalable approaches to data-driven decision-making.
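The power-law relationship mentioned above can be checked on measured losses with a log-log least-squares fit. The sketch below assumes a pure power law of the form loss = a * N^(-b) with no irreducible-loss offset, which the summary does not specify. The numbers are generated from a known law purely to exercise the code and are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y).

    Returns (a, b). Assumes all xs and ys are strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (
        sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
        / sum((u - mean_x) ** 2 for u in log_x)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Illustrative inputs only: losses computed from an assumed law
# loss = 2.0 * N**(-0.25), NOT measurements reported in the paper.
num_databases = [8, 16, 32, 64, 128]
losses = [2.0 * n ** -0.25 for n in num_databases]
a, b = fit_power_law(num_databases, losses)
```

On real measurements the points will not lie exactly on a line in log-log space, so the residuals of this fit give a rough indication of how well a power law describes the data.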


Summary

- PluRel is a special way to make different databases while keeping things private.
- RFMs help us make good decisions using data, but we need more complex databases to help them grow.
- PluRel lets us create many different databases with our own designs and ways of connecting information.
- The framework uses graphs to show how things are connected and helps us build diverse databases step by step efficiently.
- Making more synthetic databases with PluRel helps us learn better from real data and train AI systems safely.

Definitions

- Framework: A basic structure that helps organize and build something.
- Databases: Places where information is stored in an organized way for easy access.
- Schemas: Plans or designs that show how data is organized in a database.
- Connectivity: How different pieces of information are linked or connected together.
- Synthetic: Something made artificially or not naturally occurring.

Relational databases are a crucial component of data-driven decision-making, enabling organizations to store and analyze large amounts of structured data. However, the development and training of Relational Foundation Models (RFMs), which learn from such multi-table databases, have been hindered by the lack of publicly available complex multi-table databases, which are rarely public due to privacy constraints. In response to this challenge, Vignesh Kothapalli and co-authors have developed PluRel, a framework for synthesizing diverse relational databases from scratch. Because the databases are synthetic, the approach addresses the need for more training data without exposing sensitive information.

The research paper, titled "PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models", presents this framework and its potential impact on advancing RFMs. The paper was submitted to arXiv on 03 Feb. 2026 under the cs.DB category.

So, what exactly is PluRel? And how does it work? PluRel is a framework, with code available on GitHub, that enables the synthesis of arbitrarily many relational databases with customizable schemas, connectivity patterns, and data distributions. It uses directed graphs to model schemas, bipartite graphs to represent inter-table primary-foreign key connectivity, and conditional causal mechanisms to model feature distributions within tables. This allows for a step-by-step approach to creating diverse databases while remaining computationally lightweight.

One of the key features of PluRel is its design space, which supports synthetic data generation at scale. Researchers can generate large quantities of synthetic data cheaply, which allowed the authors to observe power-law scaling of RFM pretraining loss with the number of synthetic databases and with total pretraining tokens.

To validate the framework, the authors pretrained RFMs on PluRel-generated databases and evaluated them on real ones. They report that scaling up the number of synthetic databases improves generalization to real databases. They also report that synthetic pretraining yields strong base models for continued pretraining on real databases.

This research has implications for the field of RFMs and relational deep learning. By addressing the scarcity of diverse relational datasets, PluRel opens up new possibilities for developing AI systems on enterprise data without compromising privacy. The paper's findings highlight the potential of synthetic data scaling as a promising paradigm for advancing RFMs.

The approach could also be relevant in industries such as healthcare, finance, and retail, where sensitive data must be protected while still enabling effective decision-making. For example, a healthcare organization or a financial institution could start from a model pretrained on synthetic databases and then continue training on its own private data, rather than having to share that data publicly.

In conclusion, "PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models" contributes to the field of RFMs and relational deep learning. It addresses a critical challenge faced by researchers, the lack of diverse datasets, with a pretraining approach that does not depend on access to private databases. This work sets the stage for further advancements in RFM technology and for data-driven decision-making across industries.

Created on 06 Feb. 2026


