d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

AI-generated keywords: Large Language Models, Reinforcement Learning, Diffusion-based Models, Reasoning, Pre-trained Models

AI-generated Key Points

  • Recent large language models (LLMs) have strong reasoning capabilities
  • Online reinforcement learning (RL) benefits LLMs, especially in left-to-right autoregressive (AR) generation
  • Diffusion-based large language models (dLLMs) generate text in a coarse-to-fine manner
  • dLLMs show competitive performance compared to AR models but it's unclear if they can leverage recent advances in reasoning
  • The d1 framework adapts pre-trained masked dLLMs into reasoning models through supervised fine-tuning and RL
  • Masked SFT and the diffu-GRPO RL algorithm are used to enhance reasoning in pretrained dLLMs
  • Empirical studies on mathematical and logical reasoning benchmarks show that d1 performs best among the post-training recipes compared and significantly improves a state-of-the-art dLLM
  • Advancements in scaling diffusion language models include masked diffusion as a specific instance of discrete diffusion
  • Efforts are being made to address challenges related to scalability and discretization for further improvement in reasoning abilities

Authors: Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover

25 pages, project page at https://dllm-reasoning.github.io/
License: CC BY 4.0

Abstract: Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.
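
The abstract describes diffu-GRPO only as "critic-free" and "policy-gradient based" and gives no further detail. To illustrate what critic-free can mean in GRPO-style methods generally, the sketch below scores each sampled completion against the mean and standard deviation of its own sampling group, so no learned value network is needed as a baseline. This is an assumption about the general family, not the paper's algorithm: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each reward relative to its own sampling group.

    Illustrative only: a generic group-relative baseline, assumed here
    as a stand-in for what "critic-free" means. It is not taken from
    the d1 paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled for one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average receive a positive advantage and the rest a negative one. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.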

Submitted to arXiv on 16 Apr. 2025

Results of the summarizing process for the arXiv paper: 2504.12216v1

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL), but these results come mainly from the left-to-right autoregressive (AR) generation paradigm. Diffusion-based, non-autoregressive models offer an alternative: they generate text in a coarse-to-fine manner. Recent diffusion-based large language models (dLLMs) match their AR counterparts on language modeling, yet it is unclear whether they can also leverage recent advances in LLM reasoning. The d1 framework addresses this gap by adapting pre-trained masked dLLMs into reasoning models through a combination of supervised fine-tuning (SFT) and RL. It relies on two techniques: a masked SFT technique that distills knowledge and instills self-improvement behavior directly from existing datasets, and a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO. The authors compare different post-training recipes on multiple mathematical and logical reasoning benchmarks and find that d1 yields the best performance and significantly improves a state-of-the-art dLLM. The paper also places this work within recent efforts to scale diffusion language models, in which masked diffusion is a specific instance of discrete diffusion, and notes that challenges related to scalability and discretization remain.
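
The summary does not say how masked SFT builds its training examples. A minimal sketch of one plausible corruption step, assuming the scheme common to masked diffusion language models: the prompt stays intact, each response token is independently replaced by a mask token with probability `t`, and the model is trained to recover the masked positions. The `MASK` string, the function `mask_response`, and the example tokens are hypothetical and not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, for illustration only


def mask_response(prompt, response, t, rng):
    """Mask each response token independently with probability t.

    Assumed corruption step for masked SFT; the prompt is never masked.
    Returns the corrupted sequence and the indices the model would be
    trained to reconstruct.
    """
    flags = [rng.random() < t for _ in response]
    corrupted = prompt + [MASK if f else tok for tok, f in zip(response, flags)]
    # Target positions are indexed within the full (prompt + response) sequence
    targets = [len(prompt) + i for i, f in enumerate(flags) if f]
    return corrupted, targets


prompt = ["What", "is", "2", "+", "2", "?"]
response = ["The", "answer", "is", "4", "."]
rng = random.Random(0)

clean, no_targets = mask_response(prompt, response, 0.0, rng)
fully_masked, all_targets = mask_response(prompt, response, 1.0, rng)
partial, some_targets = mask_response(prompt, response, 0.5, rng)
```

With `t = 0.0` nothing is masked and with `t = 1.0` the whole response is masked; intermediate values give the partially masked sequences that a coarse-to-fine generator has to fill in.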
Created on 23 Mar. 2026

