Fine-Tuning GPT-5 for GPU Kernel Generation

AI-generated keywords: AI systems

AI-generated Key Points

  • Efficient GPU kernels are crucial for achieving scalability in AI systems.
  • Large Language Models (LLMs) struggle to generate GPU code because of a scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations.
  • Reinforcement learning (RL) offers an adaptive and data-efficient alternative for fine-tuning models, but requires relevant tools, careful problem selection, and a robust evaluation environment.
  • Makora's environment and tools are designed for reinforcement learning fine-tuning of cutting-edge models like GPT-5 for Triton code generation.
  • In the single-attempt setting, the fine-tuned GPT-5 model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points), exceeding prior state-of-the-art models on KernelBench.
  • Integrated into a full coding agent, the fine-tuned model solves up to 97.4% of problems in an expanded KernelBench suite and outperforms the PyTorch TorchInductor compiler on 72.9% of problems, with a geometric mean speedup of 2.12x.

Authors: Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah

License: CC BY 4.0

Abstract: Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
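
The abstract reports three kinds of numbers: a correctness rate, the fraction of problems on which the generated kernel beats TorchInductor, and a geometric mean speedup. The sketch below shows one plausible way to compute such metrics from per-problem timings. The function name `summarize`, the record fields, and the choice to average speedups over correct kernels only are assumptions made for illustration; the paper's exact evaluation protocol is not described here.

```python
import math


def summarize(results):
    """Aggregate per-problem results (hypothetical record layout).

    Each record is a dict with:
      'correct'     -- bool, kernel output matched the reference
      'kernel_ms'   -- float, runtime of the generated kernel
      'baseline_ms' -- float, runtime of the compiler baseline
    """
    n = len(results)
    correct = [r for r in results if r["correct"]]
    # A problem counts as "outperforming" only if the kernel is correct AND faster.
    faster = [r for r in correct if r["kernel_ms"] < r["baseline_ms"]]
    # Assumption: the geometric mean is taken over correct kernels only.
    speedups = [r["baseline_ms"] / r["kernel_ms"] for r in correct]
    if speedups:
        geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    else:
        geomean = float("nan")
    return {
        "correct_rate": len(correct) / n,
        "faster_rate": len(faster) / n,
        "geomean_speedup": geomean,
    }


# Made-up timings for four problems, purely to exercise the function.
demo = [
    {"correct": True, "kernel_ms": 1.0, "baseline_ms": 2.0},   # 2x faster
    {"correct": True, "kernel_ms": 4.0, "baseline_ms": 2.0},   # 2x slower
    {"correct": False, "kernel_ms": 1.0, "baseline_ms": 1.0},  # wrong output
    {"correct": True, "kernel_ms": 1.0, "baseline_ms": 8.0},   # 8x faster
]
stats = summarize(demo)
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a 2x slowdown and a 2x speedup cancel out exactly, which an arithmetic mean would not reflect.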

Submitted to arXiv on 11 Feb. 2026

Results of the summarizing process for the arXiv paper: 2602.11000v1

Efficient GPU kernels are crucial for scaling AI systems, yet writing them remains difficult because of intricate hardware architectures and the specialized optimization expertise required. Large Language Models (LLMs) have shown strong capabilities in general sequential code generation, but they struggle to generate GPU code. The difficulty stems from a scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. These limitations rule out supervised fine-tuning (SFT) as a scalable way to improve current LLMs. Reinforcement learning (RL) offers a data-efficient and adaptive alternative, but using it effectively requires access to relevant tools, careful selection of training problems, and a robust evaluation environment.

This study introduces Makora's environment and tools for reinforcement learning fine-tuning of frontier models and reports results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, the fine-tuned model improves kernel correctness from 43.7% to 77.0%, a gain of 33.3 percentage points over baseline GPT-5. It also raises the fraction of problems on which it outperforms TorchInductor from 14.8% to 21.8% (+7 percentage points) and exceeds prior state-of-the-art models on KernelBench. When integrated into a full coding agent, the fine-tuned model solves up to 97.4% of problems in an expanded KernelBench suite and outperforms the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x.

Overall, the work shows that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability. This opens new pathways for AI-assisted accelerator programming and for automatic generation of performant kernel code.
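
To make the idea of an RL evaluation environment more concrete, here is a minimal sketch of a reward signal that scores a generated kernel on correctness first and speed second. This is not Makora's reward function, which the summary does not describe. The names `outputs_close` and `kernel_reward`, the tolerances, the 0.5/0.5 weighting, and the speedup cap are all hypothetical choices for illustration.

```python
import math


def outputs_close(candidate, reference, rel_tol=1e-4, abs_tol=1e-6):
    """Element-wise numerical comparison of two flat lists of floats."""
    return len(candidate) == len(reference) and all(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate, reference)
    )


def kernel_reward(candidate, reference, kernel_ms, baseline_ms, speedup_cap=4.0):
    """Hypothetical reward: 0 for wrong kernels, 0.5 to 1.0 for correct ones.

    candidate   -- kernel output, or None if compilation or execution failed
    reference   -- output of the reference implementation
    kernel_ms   -- measured runtime of the generated kernel
    baseline_ms -- measured runtime of the baseline (e.g. a compiler)
    """
    if candidate is None or not outputs_close(candidate, reference):
        return 0.0  # incorrect or failed kernels earn nothing
    speedup = baseline_ms / kernel_ms
    # Cap the speed bonus so a single outlier cannot dominate training.
    return 0.5 + 0.5 * min(speedup, speedup_cap) / speedup_cap


# A correct kernel that is 2x faster than the baseline.
r = kernel_reward([1.0, 2.0], [1.0, 2.0], kernel_ms=1.0, baseline_ms=2.0)
```

Gating the speed bonus on correctness reflects the ordering of the reported metrics: a kernel can only count as outperforming the baseline if it first produces the right output.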
Created on 12 Mar. 2026

