Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

AI-generated keywords: GPUs, CUDA Graph, performance optimization, heterogeneous systems, iterative applications

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors Jonah Ekelund, Stefano Markidis, and Ivy Peng focus on accelerating scientific applications on GPUs.
  • Use of CUDA Graph, a graph-based execution model that consolidates several kernel launches into one graph launch, to reduce kernel launch overhead.
  • Proposal of a performance optimization strategy for iteratively launched kernels through grouping launches into iteration batches and unrolling them into CUDA Graphs.
  • Demonstration of an optimal size for iteration batches that yields more than 1.4 times speed-up in the skeleton application.
  • Extension of findings to show similar speed-up in Hotspot and Hotspot3D from the Rodinia benchmark suite and in a Finite-Difference Time-Domain (FDTD) Maxwell solver.
  • Research highlights significant performance improvements in GPU-accelerated iterative applications through strategic kernel batching with CUDA Graphs.

Authors: Jonah Ekelund, Stefano Markidis, Ivy Peng

Accepted to PDP2025

Abstract: Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the overhead from launching several fine-grained kernels. CUDA Graph addresses these performance challenges by enabling a graph-based execution model that captures operations as nodes and dependence as edges in a static graph, thereby consolidating several kernel launches into one graph launch. We propose a performance optimization strategy for iteratively launched kernels. By grouping kernel launches into iteration batches and then unrolling these batches into a CUDA Graph, iterative applications can benefit from CUDA Graph for performance boosting. We analyze the performance gain and overhead from this approach by designing a skeleton application. The skeleton application also serves as a generalized example of converting an iterative solver to CUDA Graph, and for deriving a performance model. Using the skeleton application, we show that when unrolling iteration batches for a given platform, there is an optimal size of the iteration batch, which is independent of workload, balancing the extra overhead from graph creation with the performance gain of the graph execution. Depending on workload, we show that the optimal iteration batch size gives more than 1.4x speed-up in the skeleton application. Furthermore, we show that similar speed-up can be gained in Hotspot and Hotspot3D from the Rodinia benchmark suite and a Finite-Difference Time-Domain (FDTD) Maxwell solver.
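
The trade-off described in the abstract can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not the performance model derived in the paper: it assumes that graph creation cost grows linearly with the number of unrolled iterations, that each graph launch has a fixed overhead, and that the graph is built once and relaunched for every batch. All function names, parameter names, and cost values below are hypothetical and chosen only for illustration.

```python
import math


def baseline_time(n_iters, t_kernel, launch_cost):
    """Plain iterative launching: every iteration pays one kernel-launch overhead."""
    return n_iters * (launch_cost + t_kernel)


def batched_graph_time(n_iters, batch, t_kernel,
                       create_cost_per_iter, graph_launch_cost, node_cost):
    """Toy model: one graph holding `batch` unrolled iterations, built once
    and launched ceil(n_iters / batch) times.

    Assumptions (hypothetical, not taken from the paper):
      - creation cost is linear in the number of unrolled iterations
      - each graph launch has a fixed overhead
      - each iteration inside the graph pays a small per-node overhead
      - the remainder of a final partial batch is ignored
    """
    creation = create_cost_per_iter * batch
    launches = math.ceil(n_iters / batch) * graph_launch_cost
    execution = n_iters * (node_cost + t_kernel)
    return creation + launches + execution


def best_batch(n_iters, t_kernel, **costs):
    """Brute-force search for the batch size with the lowest modeled time."""
    return min(range(1, n_iters + 1),
               key=lambda b: batched_graph_time(n_iters, b, t_kernel, **costs))


# Hypothetical overheads in arbitrary time units, for illustration only.
COSTS = dict(create_cost_per_iter=10.0, graph_launch_cost=8.0, node_cost=1.0)

if __name__ == "__main__":
    for t_kernel in (2.0, 50.0):  # two different workloads per iteration
        b = best_batch(1000, t_kernel, **COSTS)
        t_graph = batched_graph_time(1000, b, t_kernel, **COSTS)
        t_base = baseline_time(1000, t_kernel, launch_cost=5.0)
        print(f"t_kernel={t_kernel}: best batch={b}, "
              f"modeled speed-up={t_base / t_graph:.2f}")
```

In this toy model the kernel time only adds a constant to the total, so the batch size that minimizes the modeled time does not depend on the workload, while the resulting speed-up does. That mirrors the qualitative claim in the abstract, but the actual optimum and speed-up on real hardware depend on measured overheads that this sketch does not capture.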

Submitted to arXiv on 16 Jan. 2025

Results of the summarizing process for the arXiv paper: 2501.09398v1

In "Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs," Jonah Ekelund, Stefano Markidis, and Ivy Peng examine the use of Graphics Processing Units (GPUs) to accelerate scientific applications on heterogeneous systems. As GPUs get faster, the overhead of launching many fine-grained kernels can become a performance bottleneck. CUDA Graph addresses this with a graph-based execution model that captures operations as nodes and dependencies as edges in a static graph, consolidating several kernel launches into a single graph launch. The paper proposes a performance optimization strategy for iteratively launched kernels: kernel launches are grouped into iteration batches, and each batch is unrolled into a CUDA Graph. The authors analyze the performance gain and overhead of this approach by designing a skeleton application, which also serves as a generalized example of converting an iterative solver to CUDA Graph and as a basis for deriving a performance model. Using the skeleton application, they show that, for a given platform, there is an optimal iteration batch size. This optimum is independent of workload and balances the extra overhead of graph creation against the performance gain of graph execution. Depending on the workload, the optimal iteration batch size gives more than 1.4x speed-up in the skeleton application.
Furthermore, the authors show that similar speed-up can be gained in Hotspot and Hotspot3D from the Rodinia benchmark suite and in a Finite-Difference Time-Domain (FDTD) Maxwell solver. Ekelund et al.'s results indicate that kernel batching with CUDA Graphs can improve the performance of GPU-accelerated iterative applications.
Created on 04 Jul. 2025
