d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
AI-generated Key Points
- Recent large language models (LLMs) have strong reasoning capabilities
- These reasoning capabilities benefit from online reinforcement learning (RL) and have mainly been demonstrated in left-to-right autoregressive (AR) generation
- Diffusion-based large language models (dLLMs) generate text in a coarse-to-fine manner
- dLLMs show competitive performance compared to AR models but it's unclear if they can leverage recent advances in reasoning
- The d1 framework adapts pre-trained masked dLLMs into reasoning models through supervised fine-tuning and RL
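- Masked SFT and a critic-free, policy-gradient-based RL algorithm called diffu-GRPO are used to enhance reasoning in pretrained dLLMs
- Empirical studies on mathematical and logical reasoning benchmarks show that d1 performs best among the post-training recipes tested and significantly improves a state-of-the-art dLLM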
- Advancements in scaling diffusion language models include masked diffusion as a specific instance of discrete diffusion
- Efforts are being made to address challenges related to scalability and discretization for further improvement in reasoning abilities
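The masked SFT idea mentioned in the key points can be illustrated with a minimal sketch. It assumes the common masked-diffusion setup in which the prompt stays visible, response tokens are replaced by a mask token at random, and the model is trained to recover only the masked positions. The function name `corrupt_response`, the `<mask>` token string, and the fixed `mask_ratio` argument are hypothetical choices for illustration, not details taken from the paper.

```python
import random

MASK = "<mask>"  # hypothetical mask token, chosen for illustration


def corrupt_response(prompt, response, mask_ratio, rng):
    """Mask response tokens independently with probability `mask_ratio`.

    Returns the noisy sequence (prompt + partially masked response) and the
    positions that were masked; a training loss would be computed only there.
    Keeping the prompt unmasked is an assumption of this sketch.
    """
    noisy = list(prompt)
    targets = []
    for offset, token in enumerate(response):
        if rng.random() < mask_ratio:
            noisy.append(MASK)
            targets.append(len(prompt) + offset)
        else:
            noisy.append(token)
    return noisy, targets


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = ["What", "is", "2+2", "?"]
    response = ["The", "answer", "is", "4"]
    noisy, targets = corrupt_response(prompt, response, 0.5, rng)
    print(noisy, targets)
```

In an actual training loop the mask ratio would typically be resampled for each example, so the model sees responses at many corruption levels.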
Authors: Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover
Abstract: Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM.
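To make "critic-free" concrete, the following is a minimal sketch of the group-relative baseline used in GRPO-style methods, on which the name diffu-GRPO suggests the algorithm builds. Several completions are sampled for one prompt, and each completion's advantage is its reward minus the group mean, so no learned value network is needed. The function name, the optional standard-deviation normalization, and the `eps` constant are assumptions for illustration; the abstract does not specify the exact formulation used in d1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Compute advantages for one group of completions of the same prompt.

    The group mean acts as the baseline in place of a learned critic.
    Dividing by the group standard deviation is optional here, since
    variants of this idea differ on that point (an assumption of the sketch).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [c / scale for c in centered]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers to one math problem:
    # 1.0 for a correct final answer, 0.0 otherwise.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
```

Completions that score above the group average receive positive advantages and are reinforced; those below average are discouraged. When every completion in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.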