d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

AI-generated keywords: Large Language Models; Reinforcement Learning; Diffusion-based Models; Reasoning; Pre-trained Models

AI-generated Key Points

- Recent large language models (LLMs) have demonstrated strong reasoning capabilities.
- These capabilities benefit from online reinforcement learning (RL) and have mostly been shown in the left-to-right autoregressive (AR) generation paradigm.
- Diffusion-based large language models (dLLMs) instead generate text in a coarse-to-fine manner.
- dLLMs achieve language modeling performance competitive with AR models, but it is unclear whether they can leverage recent advances in LLM reasoning.
- The d1 framework adapts pre-trained masked dLLMs into reasoning models through supervised fine-tuning (SFT) and RL.
- It uses two techniques: masked SFT and a critic-free, policy-gradient-based RL algorithm called diffu-GRPO.
- Across mathematical and logical reasoning benchmarks, d1 performs best among the post-training recipes compared and significantly improves a state-of-the-art dLLM.
- Masked diffusion is a specific instance of discrete diffusion and underlies recent work on scaling diffusion language models.
- Ongoing work addresses scalability and discretization challenges in order to further improve reasoning in dLLMs.
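
To make "coarse-to-fine" generation concrete, the following is a toy sketch only. It assumes a masked-diffusion-style decoder that starts from a fully masked sequence and commits a few positions per step. The helper names (`toy_denoiser`, `coarse_to_fine_decode`), the random confidence scores, and the "commit the most confident positions first" rule are hypothetical illustrations, not details taken from the paper; a real dLLM would predict tokens with a neural network rather than read them from a known target.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol used only in this toy example


def toy_denoiser(tokens, target, rng):
    """Stand-in for a dLLM: propose a (token, confidence) pair for each masked slot.

    This toy version cheats by reading the known target and drawing a random
    confidence, purely to illustrate the decoding loop.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def coarse_to_fine_decode(target, steps=4, seed=0):
    """Start fully masked and unmask a few positions per step, in any order."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    per_step = -(-len(target) // steps)  # ceiling division
    while MASK in tokens:
        proposals = toy_denoiser(tokens, target, rng)
        # Commit the most confident proposals; the rest stay masked for later steps.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return history


for state in coarse_to_fine_decode("the answer is forty two".split()):
    print(" ".join(state))
```

Unlike left-to-right AR decoding, the positions filled at each step here are not constrained to be the leftmost ones, which is the property the key points describe.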

Authors: Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover

arXiv: 2504.12216v1 (cs.CL)

25 pages, project page at https://dllm-reasoning.github.io/

License: CC BY 4.0

Abstract: Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.

Submitted to arXiv on 16 Apr. 2025

Results of the summarizing process for the arXiv paper: 2504.12216v1

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL), but these capabilities have been shown primarily within the left-to-right autoregressive (AR) generation paradigm. Non-autoregressive paradigms based on diffusion offer an alternative: they generate text in a coarse-to-fine manner. Recent diffusion-based large language models (dLLMs) achieve language modeling performance competitive with their AR counterparts, but it remains unclear whether they can leverage recent advances in LLM reasoning. To address this gap, the d1 framework adapts pre-trained masked dLLMs into reasoning models through a combination of supervised fine-tuning (SFT) and RL. It uses two techniques: a masked SFT technique that distills knowledge and instills self-improvement behavior directly from existing datasets, and a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO. Empirical studies compare different post-training recipes on multiple mathematical and logical reasoning benchmarks. The results indicate that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM. The summary also notes related work on scaling diffusion language models, in which masked diffusion is treated as a specific instance of discrete diffusion, and ongoing efforts to address scalability and discretization challenges in order to further improve the reasoning abilities of dLLMs.
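
As a rough illustration of what a masked SFT step could look like, here is a minimal sketch. It assumes the prompt is kept intact, response tokens are masked independently at a randomly drawn ratio, and the loss is the mean negative log-likelihood of the true tokens at masked positions. These choices, the sentinel `MASK_ID`, and the function names are assumptions made for this example; the summary above does not specify the masking schedule or loss weighting used in d1.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel for the mask token


def mask_response(prompt_ids, response_ids, rng):
    """Mask response tokens at a random ratio; the prompt is left untouched."""
    mask_ratio = rng.uniform(0.1, 1.0)  # assumed range, chosen for illustration
    noisy = [MASK_ID if rng.random() < mask_ratio else tok for tok in response_ids]
    offset = len(prompt_ids)
    masked_positions = [offset + i for i, tok in enumerate(noisy) if tok == MASK_ID]
    return list(prompt_ids) + noisy, masked_positions


def masked_sft_loss(target_log_probs, masked_positions):
    """Mean negative log-likelihood over masked positions only.

    target_log_probs[p] is the model's log-probability of the true token at
    position p (supplied by the caller in this sketch).
    """
    if not masked_positions:
        return 0.0
    total = sum(target_log_probs[p] for p in masked_positions)
    return -total / len(masked_positions)


rng = random.Random(0)
sequence, positions = mask_response([11, 12, 13], [21, 22, 23, 24], rng)
uniform_guess = [math.log(0.5)] * len(sequence)  # pretend the model is 50% sure everywhere
print(sequence, positions, masked_sft_loss(uniform_guess, positions))
```

Positions that were not masked contribute nothing to the loss, so the model is trained only to reconstruct the hidden part of the response given the prompt and the visible tokens.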

Summary
1. Big computer programs can think well.
2. Learning by trial and error helps these programs, especially when they write from left to right.
3. Some programs write differently, starting with a rough draft and refining it.
4. These programs compete well with the others, but it is not yet known whether they can learn to think as well.
5. A new method called d1 makes these programs better at thinking through guided practice and rewards.

Definitions
- Language models: Computer programs that understand and generate human language.
- Reinforcement learning: A type of machine learning where a program learns by trial and error, receiving rewards for good actions.
- Autoregressive generation: Writing or generating text one word at a time, in order.
- Diffusion-based models: Programs that generate text by gradually refining a rough draft.
- Supervised fine-tuning: Adjusting a pre-trained model to perform better on specific tasks using example data.
- Empirical studies: Experiments based on observations and real-world data to draw conclusions.

Recent large language models (LLMs) have shown impressive reasoning capabilities, particularly when trained with online reinforcement learning (RL). So far, these results come almost entirely from the traditional left-to-right autoregressive (AR) generation paradigm. Non-autoregressive paradigms based on diffusion take a different route and generate text in a coarse-to-fine manner. Diffusion-based large language models (dLLMs) are now competitive with AR models at language modeling, but it has been unclear whether they can benefit from the same advances in reasoning.

To address this gap, a new framework called d1 has been proposed to adapt pre-trained masked dLLMs into reasoning models through a combination of supervised fine-tuning (SFT) and RL. The framework introduces two techniques: a masked SFT technique and a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO.

The masked SFT technique distills knowledge and instills self-improvement behavior in pretrained dLLMs directly from existing datasets. diffu-GRPO then applies reinforcement learning to these models. "Critic-free" means that the algorithm does not train a separate critic (value) model alongside the policy; it still relies on a reward signal to judge the model's outputs.

The authors evaluate different post-training recipes on multiple mathematical and logical reasoning benchmarks. In these empirical studies, d1 yields the best performance among the recipes compared and significantly improves the performance of a state-of-the-art dLLM.

Another area covered is the scaling of diffusion language models. Masked diffusion has been established as one specific instance of discrete diffusion, and it is the setting in which d1 operates. There are still challenges related to scalability and discretization that need to be addressed in order to fully leverage diffusion-based LLMs for reasoning tasks.

In conclusion, this work indicates that dLLMs can benefit from reasoning-oriented post-training. The d1 framework, with its combination of SFT and RL, improves substantially on the dLLM it starts from. Further progress on scaling diffusion language models and on the scalability and discretization challenges may strengthen the reasoning abilities of these models further and broaden the use of diffusion-based LLMs for complex reasoning tasks.
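
The "critic-free" idea can be illustrated with the group-relative advantage used by GRPO-style methods: several completions are sampled for the same prompt, and each completion's reward is compared against the group's mean instead of against a learned value estimate. The sketch below shows only that step. The function name and the `normalize` flag are assumptions for illustration, and this text does not specify whether diffu-GRPO divides by the group standard deviation, so both variants are shown. How diffu-GRPO estimates log-probabilities for a diffusion model is not covered here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-6):
    """Score each sampled completion relative to its own group.

    No critic network is involved: the group mean acts as the baseline.
    With normalize=True the centered rewards are also divided by the
    group's (population) standard deviation.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero when all rewards match
    return [c / scale for c in centered]


# Four sampled answers to one math problem, rewarded 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
print(group_relative_advantages(rewards, normalize=False))
```

Completions that beat the group average receive a positive advantage and are reinforced; those below it receive a negative one. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.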

Created on 23 Mar. 2026
