Fine-Tuning GPT-5 for GPU Kernel Generation

AI-generated keywords: AI systems

AI-generated Key Points

Efficient GPU kernels are crucial for achieving scalability in AI systems.
Large Language Models (LLMs) face challenges in generating GPU code due to lack of labeled training data, biases in compilers, and limited generalization across hardware generations.
Reinforcement learning (RL) offers an adaptive and data-efficient alternative for fine-tuning models, but requires relevant tools, careful problem selection, and a robust evaluation environment.
Makora's environment and tools are designed for reinforcement learning fine-tuning of cutting-edge models like GPT-5 for Triton code generation.
In the single-attempt setting, the fine-tuned GPT-5 model improves kernel correctness from 43.7% to 77.0%, surpassing previous state-of-the-art models on KernelBench.
Integrated into a coding agent framework, the fine-tuned model solves up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x.


Authors: Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah

arXiv: 2602.11000v1 (cs.DC)

License: CC BY 4.0

Abstract: Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.

Submitted to arXiv on 11 Feb. 2026


Results of the summarizing process for the arXiv paper: 2602.11000v1

Comprehensive Summary

Efficient GPU kernels are crucial for scaling AI systems, but writing them remains difficult because of intricate hardware architectures and the specialized optimization expertise required. Large Language Models (LLMs) have shown strong capabilities in general sequential code generation, yet they struggle to generate GPU code. The difficulty stems from a lack of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. These same limitations prevent supervised fine-tuning (SFT) from scaling as a way to improve current LLMs. Reinforcement learning (RL) offers an adaptive and data-efficient alternative, but using it effectively requires access to relevant tools, careful selection of training problems, and a robust evaluation environment.

This study introduces Makora's environment and tools for reinforcement learning fine-tuning of frontier models, and reports results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, the fine-tuned model improves kernel correctness from 43.7% to 77.0%, a gain of 33.3 percentage points over baseline GPT-5. It also raises the fraction of problems on which it outperforms TorchInductor from 14.8% to 21.8%, a gain of 7 percentage points, while surpassing previous state-of-the-art models on KernelBench.

When integrated into a full coding agent, the fine-tuned model solves up to 97.4% of problems in an expanded KernelBench suite. It outperforms the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x.

Overall, this work shows that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where supervised learning is limited by data availability. This opens new pathways for AI-assisted accelerator programming and for automatic generation of performant kernel code with language models like GPT-5.


Layman's Summary

1. Using powerful computer parts called GPUs is very important for making AI systems work better.
2. Big language models have trouble writing code for GPUs because they don't have enough training data and face issues with how the computer programs are made.
3. Reinforcement learning is a smart way to make models better by learning from mistakes, but it needs special tools and careful planning.
4. Makora's tools help make the GPT-5 model even better at writing GPU code.
5. The improved GPT-5 model can solve coding problems faster and more accurately than before.

Definitions

- Efficient: Doing things well without wasting time or energy.
- GPU: A type of computer part that helps with graphics and calculations in AI systems.
- Reinforcement learning: A method where machines learn by trying out different things and getting rewards for good actions.
- Fine-tuned: Making small adjustments to improve something further.
- Compiler: A program that changes human-written code into instructions the computer can understand.

Introduction

In recent years, Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. However, when it comes to generating GPU code, these models face significant challenges due to the complex nature of hardware architectures and the specialized optimization expertise required. In this research paper, titled "Fine-Tuning GPT-5 for GPU Kernel Generation", a team of researchers introduces Makora's environment and tools, designed specifically for fine-tuning LLMs using reinforcement learning techniques.

The Challenges of Generating Efficient GPU Kernels

Efficient GPU kernels are crucial for achieving scalability in AI systems. However, traditional supervised fine-tuning methods are limited by a lack of high-quality labeled training data, biases in compilers, and limited generalization across different hardware generations. This makes it challenging to improve upon existing LLMs for generating efficient GPU code.

The Role of Reinforcement Learning

Reinforcement learning (RL) offers an adaptive and data-efficient alternative for fine-tuning LLMs. It allows the model to learn from its own experiences rather than relying on pre-labeled data. However, effectively leveraging RL requires access to relevant tools, careful selection of training problems, and a robust evaluation environment.
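To make the idea of learning from an evaluation signal concrete, here is a minimal sketch of how a reward for a generated kernel might be shaped. It is an illustration only: the function name `kernel_reward`, the three reward tiers, and the speedup cap are assumptions made for this example and are not taken from the paper.

```python
def kernel_reward(
    compiled: bool,
    correct: bool,
    baseline_ms: float,
    kernel_ms: float,
    bonus_cap: float = 2.0,
) -> float:
    """Hypothetical reward for one generated kernel (not the paper's reward).

    -1.0            the kernel failed to compile
     0.0            it compiled but produced wrong outputs
     1.0 + bonus    it is correct; bonus grows with speedup over the baseline
    """
    if not compiled:
        return -1.0
    if not correct:
        return 0.0
    if baseline_ms <= 0.0 or kernel_ms <= 0.0:
        raise ValueError("timings must be positive")
    speedup = baseline_ms / kernel_ms
    # Only reward speedups above 1x, and cap the bonus so that a single
    # outlier measurement cannot dominate the training signal.
    bonus = min(max(speedup - 1.0, 0.0), bonus_cap)
    return 1.0 + bonus
```

Gating the performance bonus on correctness reflects the ordering the results are reported in: a kernel first has to be correct before its runtime against a baseline such as TorchInductor matters.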

Makora: An Environment Designed for Reinforcement Learning Fine-Tuning

To address the limitations of supervised fine-tuning and make RL practical, the research team introduces Makora, an environment designed specifically for reinforcement learning fine-tuning of frontier models. The environment works with tools such as TorchInductor, the PyTorch compiler backend that serves as the performance baseline (on GPUs it generates Triton kernels), and KernelBench, the benchmark suite used to evaluate generated kernels. The environment also includes features such as problem randomization and automatic validation checks to keep evaluations fair during training, which allows a more accurate assessment of the model's performance.
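The summary does not describe how the validation checks work. As a rough sketch of the general idea, the following compares a candidate implementation against a trusted reference on randomized inputs within a numerical tolerance. Plain Python lists stand in for GPU tensors, and every name here (`outputs_match`, `relu_reference`, `relu_good`, `relu_buggy`) is hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def outputs_match(
    reference: Callable[[Vector], Vector],
    candidate: Callable[[Vector], Vector],
    num_trials: int = 5,
    size: int = 64,
    rel_tol: float = 1e-5,
    abs_tol: float = 1e-8,
    seed: int = 0,
) -> bool:
    """Return True if candidate agrees with reference on random inputs."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        expected = reference(x)
        try:
            # Pass a copy so an in-place candidate cannot corrupt the input.
            actual = candidate(list(x))
        except Exception:
            return False  # a crashing candidate counts as incorrect
        if len(actual) != len(expected):
            return False
        for a, e in zip(actual, expected):
            if not math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


def relu_reference(x: Vector) -> Vector:
    return [max(v, 0.0) for v in x]


def relu_good(x: Vector) -> Vector:
    return [v if v > 0.0 else 0.0 for v in x]


def relu_buggy(x: Vector) -> Vector:
    # Wrong on negative inputs: returns |v| instead of 0.
    return [abs(v) for v in x]
```

Drawing fresh inputs for every trial is one way to stop a candidate from passing by memorizing a fixed test case, which is the kind of shortcut a validation check needs to guard against.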

Fine-Tuning GPT-5 for Triton Code Generation

To demonstrate the effectiveness of Makora, the research team fine-tunes GPT-5 – a state-of-the-art LLM – specifically for Triton code generation. In a single-attempt setting, the fine-tuned model significantly improves kernel correctness from 43.7% to 77.0%, marking an impressive increase of 33.3 percentage points compared to the baseline GPT-5 model. Moreover, it also enhances the fraction of problems outperforming TorchInductor from 14.8% to 21.8%, showcasing a gain of 7 percentage points while surpassing previous state-of-the-art models on KernelBench.
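The gains above are stated in percentage points, which are absolute differences between two rates and not relative improvements. The short sketch below separates the two using the figures quoted in this section; the helper names are illustrative.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(after_pct - before_pct, 1)


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Improvement as a fraction of the starting value."""
    return (after_pct - before_pct) / before_pct


# Kernel correctness: 43.7% -> 77.0%
correctness_pp = percentage_point_gain(43.7, 77.0)   # about 33.3 points
correctness_rel = relative_gain(43.7, 77.0)          # about 0.76, i.e. ~76% relative

# Fraction of problems faster than TorchInductor: 14.8% -> 21.8%
faster_pp = percentage_point_gain(14.8, 21.8)        # about 7.0 points
faster_rel = relative_gain(14.8, 21.8)               # about 0.47, i.e. ~47% relative
```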

Integrating Fine-Tuned Models into a Comprehensive Coding Agent Framework

The research team also integrates their fine-tuned model into a comprehensive coding agent framework that combines both LLM-based code generation and traditional compiler techniques. This framework demonstrates remarkable performance by solving up to 97.4% of problems in an expanded KernelBench suite. It outperforms TorchInductor on 72.9% of problems with a geometric mean speedup of 2.12x, highlighting the potential for automatic performant kernel code generation using advanced language models like GPT-5.
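The two agent-level metrics quoted above, the fraction of problems faster than the baseline and the geometric mean speedup, can be computed from per-problem speedups as sketched below. The sample speedups are made up for illustration, and how the paper treats unsolved problems in its average is not stated here, so this sketch assumes every listed problem was solved.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean, computed in log space for numerical stability."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0.0 for v in values):
        raise ValueError("all values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def win_rate(speedups: Sequence[float]) -> float:
    """Fraction of problems where the kernel beats the baseline (speedup > 1)."""
    if not speedups:
        raise ValueError("need at least one value")
    return sum(1 for s in speedups if s > 1.0) / len(speedups)


# Illustrative per-problem speedups (baseline time / kernel time); not real data.
sample_speedups = [0.5, 1.0, 2.0, 4.0]

sample_geomean = geometric_mean(sample_speedups)   # sqrt(2), roughly 1.41
sample_win_rate = win_rate(sample_speedups)        # 0.5
```

The geometric mean is the usual choice for averaging ratios: a 2x speedup and a 0.5x slowdown cancel to 1x, whereas an arithmetic mean of the same sample would report 1.875x.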

Conclusion

In conclusion, this research paper showcases how targeted post-training using reinforcement learning can unlock the full potential of LLMs in highly specialized technical domains such as GPU kernel code generation. The introduction of Makora provides researchers and developers with access to relevant tools and environments necessary for effective RL-based fine-tuning methods. This breakthrough opens up new avenues for AI-assisted accelerator programming and highlights promising future prospects for automatic performant kernel code generation leveraging advanced language models like GPT-5. With further advancements and improvements in this area, we can expect to see even more efficient and scalable AI systems in the future.

Created on 12 Mar. 2026

