TabICLv2: A better, faster, scalable, and open tabular foundation model

AI-generated keywords: Predictive modeling

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

- Tabular foundation models such as TabPFNv2 and TabICL have recently overtaken gradient-boosted trees at the top of predictive benchmarks.
- This shift demonstrates the value of in-context learning for tabular data.
- TabICLv2 is a new foundation model for regression and classification built on three pillars:
  - A novel synthetic data generation engine designed for high pretraining diversity
  - Architectural innovations, including a new scalable softmax in attention that improves generalization to larger datasets without prohibitive long-sequence pretraining
  - Optimized pretraining protocols, notably replacing AdamW with the Muon optimizer
- On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses RealTabPFN-2.5, which is hyperparameter-tuned, ensembled, and fine-tuned on real data.
- With only moderate pretraining compute, TabICLv2 generalizes to million-scale datasets under 50GB of GPU memory while being markedly faster than RealTabPFN-2.5.
- Extensive ablation studies quantify these contributions.
- The authors have released inference code and model weights at https://github.com/soda-inria/tabicl, with the synthetic data engine and pretraining code to follow.

Authors: Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan

arXiv: 2602.11139v1 (cs.LG)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradient-boosted trees at the top of predictive benchmarks, demonstrating the value of in-context learning for tabular data. We introduce TabICLv2, a new state-of-the-art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long-sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the performance of the current state of the art, RealTabPFN-2.5 (hyperparameter-tuned, ensembled, and fine-tuned on real data). With only moderate pretraining compute, TabICLv2 generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by first releasing inference code and model weights at https://github.com/soda-inria/tabicl, with synthetic data engine and pretraining code to follow.

Submitted to arXiv on 11 Feb. 2026

Results of the summarizing process for the arXiv paper: 2602.11139v1

Tabular foundation models such as TabPFNv2 and TabICL have recently overtaken gradient-boosted trees at the top of predictive benchmarks for tabular data. This shift demonstrates the value of in-context learning for tabular datasets. TabICLv2 is a new foundation model for regression and classification that builds on three pillars.

First, TabICLv2 uses a novel synthetic data generation engine designed for high pretraining diversity. Second, the model incorporates several architectural innovations, including a new scalable softmax in attention that improves generalization to larger datasets without requiring prohibitively long-sequence pretraining. Third, TabICLv2 adopts optimized pretraining protocols, notably replacing AdamW with the Muon optimizer.

On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the current state of the art, RealTabPFN-2.5, even though RealTabPFN-2.5 is hyperparameter-tuned, ensembled, and fine-tuned on real data. With only moderate pretraining compute, TabICLv2 generalizes effectively to million-scale datasets under 50GB of GPU memory while being markedly faster than RealTabPFN-2.5. Extensive ablation studies quantify the contribution of each of these changes.

In a commitment to open research, the authors have released inference code and model weights on GitHub (https://github.com/soda-inria/tabicl), with the synthetic data engine and pretraining code to follow. The paper, "TabICLv2: A better, faster, scalable, and open tabular foundation model", is authored by Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan.
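The summary above mentions a scalable softmax in attention but, being based on metadata only, does not give its formula. As a rough illustration of the general idea, the sketch below multiplies attention logits by `s * log(n)`, where `n` is the number of keys. This log-length scaling is an assumption drawn from one published form of scalable softmax; the function names, the parameter `s`, and the toy logits are hypothetical and are not taken from TabICLv2.

```python
import numpy as np


def softmax(logits):
    """Standard softmax over the last axis, numerically stabilized."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def scalable_softmax(logits, s=1.0):
    """Softmax with logits multiplied by s * log(n), n = number of keys.

    Assumption: illustrative log-length scaling only; the exact variant
    used in TabICLv2 is not stated in the abstract.
    """
    n = logits.shape[-1]
    return softmax(s * np.log(n) * logits)


def peak_weight(n, use_scaling):
    """Attention weight on one relevant key (logit 1) among n - 1 others (logit 0)."""
    logits = np.zeros(n)
    logits[0] = 1.0
    probs = scalable_softmax(logits) if use_scaling else softmax(logits)
    return probs[0]


# Compare how much attention the relevant key keeps as the context grows.
for n in (100, 10_000, 1_000_000):
    print(n, peak_weight(n, use_scaling=False), peak_weight(n, use_scaling=True))
```

In this toy setting the standard softmax weight on the relevant key shrinks toward zero as `n` grows, while the scaled version stays near 0.5. That is the kind of length robustness that would let a model pretrained on short contexts be applied to much larger tables.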

Summary
1. New types of models, such as TabPFNv2 and TabICL, now score higher than traditional gradient-boosted trees on benchmark tests that involve tables of data.
2. This change shows the value of in-context learning, where a model makes predictions from example rows it is shown at prediction time instead of being retrained for each new table.
3. TabICLv2 is a new model for sorting data into categories or predicting numbers, with three main new features:
   - Generating varied artificial data for the model to learn from before it is used
   - Improving how it pays attention to different parts of the data, so that it copes better with larger tables
   - Changing the method used to train it, by replacing the AdamW optimizer with the Muon optimizer
4. In tests like TabArena and TALENT, TabICLv2 does better than RealTabPFN-2.5 without needing adjustments, and it handles tables with millions of rows faster than RealTabPFN-2.5.
5. Studies have been done to see exactly how much each improvement in TabICLv2 helps.

Definitions
- Models: Programs that learn patterns from data in order to make predictions.
- Gradient-boosted trees: A type of model that combines many decision trees, each one correcting the errors of the previous ones, to make predictions.
- Context: The surrounding details or information that help understand something better. Here, the labeled example rows given to the model when it makes a prediction.
- Regression: Predicting a number, such as a price or a temperature.
- Classification: Predicting which group something belongs to based on its features.
- Synthetic: Made artificially instead of being naturally occurring.
- Pretraining: Learning before starting the main task to be better prepared.
- Generalization capabilities: Ability to apply what has been learned to new data that the model has not seen before.
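To make in-context learning concrete, here is a minimal toy sketch. It is not TabICLv2 and not its API: `ToyInContextClassifier`, its `temperature` parameter, and the example data are hypothetical stand-ins. Calling `fit` only stores the labeled rows, and `predict` compares each new row with the stored rows using an attention-like weighting, so nothing is retrained for a new table.

```python
import numpy as np


class ToyInContextClassifier:
    """Toy stand-in for an in-context learner (hypothetical, NOT TabICLv2).

    fit() only stores the labeled rows; predict() attends from each query
    row to the stored rows and averages their labels. No weights are updated.
    """

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def fit(self, X, y):
        # Store the context; this is the whole "training" step.
        self.X_ctx = np.asarray(X, dtype=float)
        self.classes_, self.y_idx = np.unique(y, return_inverse=True)
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Negative squared distance plays the role of an attention score.
        d2 = ((X[:, None, :] - self.X_ctx[None, :, :]) ** 2).sum(axis=-1)
        scores = -d2 / self.temperature
        scores -= scores.max(axis=1, keepdims=True)
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        # Attention-weighted average of the one-hot context labels.
        one_hot = np.eye(len(self.classes_))[self.y_idx]
        return attn @ one_hot

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


X_train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y_train = ["a", "a", "b", "b"]
model = ToyInContextClassifier().fit(X_train, y_train)
predictions = model.predict([[0.05, 0.1], [1.0, 0.9]])
print(predictions)
```

A real tabular foundation model replaces the fixed distance rule with a pretrained transformer, but the workflow is the same: supply labeled rows as context, then predict on new rows in a single forward pass.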

Predictive modeling uses historical data to make predictions about future outcomes. In recent years, tabular foundation models such as TabPFNv2 and TabICL have overtaken traditional gradient-boosted trees at the top of predictive benchmarks. This trend highlights the value of in-context learning for tabular datasets. Building on it, a team of researchers from Inria (the French National Institute for Research in Digital Science and Technology) has developed a new model called TabICLv2, which rests on three pillars.

The first pillar is a novel synthetic data generation engine designed for high pretraining diversity. Because the model is pretrained on synthetic datasets, the variety of that data shapes how well it can generalize when applied to new datasets.

The second pillar is a set of architectural innovations. One notable change is a new scalable softmax in attention, which improves generalization to larger datasets without requiring prohibitively long-sequence pretraining.

The third pillar is an optimized pretraining protocol, most notably the replacement of AdamW with the Muon optimizer.

TabICLv2 was evaluated on the TabArena and TALENT benchmarks. Without any tuning, it outperforms the current state of the art, RealTabPFN-2.5, which is itself hyperparameter-tuned, ensembled, and fine-tuned on real data. With only moderate pretraining compute, TabICLv2 also generalizes to million-scale datasets under 50GB of GPU memory while being markedly faster than RealTabPFN-2.5. To substantiate these claims, the authors conducted extensive ablation studies that quantify the contribution of each change.

In line with their commitment to open research, the authors have released inference code and model weights on GitHub (https://github.com/soda-inria/tabicl). They also plan to release the synthetic data engine and pretraining code, making it easier for other researchers to replicate their results and build upon their work.

The paper "TabICLv2: A better, faster, scalable, and open tabular foundation model" is authored by Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan from Inria. By combining a new synthetic data engine, architectural changes, and an improved pretraining protocol with open releases, TabICLv2 reports state-of-the-art results for predictive modeling on tabular data.
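For readers unfamiliar with the Muon optimizer mentioned above, the sketch below shows its core idea as commonly described: accumulate momentum for a weight matrix, then orthogonalize that momentum with a Newton-Schulz iteration before applying it. This is a simplified illustration under stated assumptions, not the pretraining code of TabICLv2 (which has not yet been released). It uses the classic cubic Newton-Schulz iteration for clarity, whereas public Muon implementations use a tuned higher-order polynomial with only a few steps. The function names and hyperparameter values are hypothetical.

```python
import numpy as np


def orthogonalize(M, steps=30):
    """Approximate the nearest semi-orthogonal matrix to M.

    Uses the classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Assumption: chosen for clarity; not the polynomial used by real Muon code.
    """
    # Dividing by the Frobenius norm keeps all singular values at or below 1,
    # which is inside the convergence region of the iteration.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def muon_like_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified update: momentum, then orthogonalization, then descent."""
    momentum_buf = beta * momentum_buf + grad
    update = orthogonalize(momentum_buf)
    return W - lr * update, momentum_buf


W = np.zeros((2, 3))
grad = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
W_new, buf = muon_like_step(W, grad, np.zeros_like(W))
print(W_new)
```

In this example the two gradient directions have different magnitudes (2 and 1), yet after orthogonalization both receive an update of the same size. Equalizing the update across directions in this way is the main difference from an element-wise optimizer such as AdamW.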

Created on 12 Feb. 2026
