In-Datacenter Performance Analysis of a Tensor Processing Unit

AI-generated keywords: TPU, ASIC, Neural Networks, Performance, Energy Efficiency

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Evaluates a custom ASIC called the Tensor Processing Unit (TPU), deployed in datacenters since 2015
  • TPU designed to accelerate the inference phase of neural networks (NN)
  • TPU core is a matrix multiply unit of 65,536 8-bit MACs with a peak throughput of 92 TeraOps/second (TOPS), plus 28 MiB of software-managed on-chip memory
  • Many architects believe major improvements in cost-energy-performance must now come from domain-specific hardware such as the TPU
  • TPU's deterministic execution model is a better match for the 99th-percentile response-time requirement than the time-varying optimizations of CPUs and GPUs
  • Despite its many MACs and large memory, the TPU is relatively small and low power, partly because it omits those time-varying optimizations (caches, out-of-order execution, multithreading, multiprocessing, prefetching)
  • TPU is on average about 15X–30X faster than its contemporary GPU (Nvidia K80) or CPU (Intel Haswell)
  • TOPS/Watt, the energy-efficiency measure, is about 30X–80X higher for the TPU than for the GPU or CPU
  • Using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X that of the GPU and 200X that of the CPU
  • Workload consists of production NN applications (MLPs, CNNs, and LSTMs) representing 95% of the authors' datacenter NN inference demand
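
The peak-throughput figure in the key points can be sanity-checked with simple arithmetic. The sketch below assumes that each MAC counts as two operations per cycle (one multiply and one add), a common convention that the metadata here does not state, and derives the clock rate that 92 TOPS would imply. The function name is illustrative only.

```python
def implied_clock_hz(peak_ops_per_s: float, num_macs: int, ops_per_mac: int = 2) -> float:
    """Clock rate at which num_macs MAC units would reach peak_ops_per_s.

    Assumption (not stated in the summary): every MAC unit completes one
    multiply and one add per cycle, i.e. ops_per_mac = 2.
    """
    return peak_ops_per_s / (num_macs * ops_per_mac)


NUM_MACS = 65_536        # from the key points; equals 256 * 256
PEAK_OPS_PER_S = 92e12   # 92 TeraOps/second, from the key points

clock = implied_clock_hz(PEAK_OPS_PER_S, NUM_MACS)
print(f"Implied clock: {clock / 1e6:.0f} MHz")
```

Under that assumption the implied clock comes out at roughly 700 MHz; a different operation-counting convention would change the result proportionally.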

Authors: Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon

17 pages, 11 figures, 8 tables. To appear at the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 2017

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
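
To illustrate why the abstract separates average throughput from 99th-percentile latency, the sketch below compares two synthetic latency distributions: one constant, and one that is usually faster but occasionally takes a slow path (standing in for effects such as cache misses). All latency values are invented for illustration and are not measurements from the paper; the `percentile` helper is a simple nearest-rank implementation written for this example.

```python
import math
import random
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical latencies in milliseconds (illustrative values only).
deterministic = [5.0] * 10_000
# Usually 3 ms, but about 5% of requests hit a 20 ms slow path.
time_varying = [20.0 if rng.random() < 0.05 else 3.0 for _ in range(10_000)]

for name, samples in [("deterministic", deterministic), ("time-varying", time_varying)]:
    mean = statistics.mean(samples)
    p99 = percentile(samples, 99)
    print(f"{name:>13}: mean={mean:.2f} ms  p99={p99:.1f} ms")
```

In this toy setup the time-varying system has the lower mean latency but the higher 99th-percentile latency, which is the kind of trade-off the abstract describes.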

Submitted to arXiv on 16 Apr. 2017

Results of the summarizing process for the arXiv paper: 1704.04760v1

The paper's license does not allow us to build upon its content, so this summary was generated from the paper's metadata rather than the full article.

This paper evaluates a custom ASIC called a Tensor Processing Unit (TPU) that has been deployed in datacenters since 2015. The TPU accelerates the inference phase of neural networks (NN). Its core is a matrix multiply unit of 65,536 8-bit MACs with a peak throughput of 92 TeraOps/second (TOPS), backed by a large (28 MiB) software-managed on-chip memory. The paper starts from the view, held by many architects, that major improvements in cost-energy-performance must now come from domain-specific hardware.

The authors argue that the TPU's deterministic execution model fits the 99th-percentile response-time requirement of their NN applications better than CPUs and GPUs do. Those processors rely on time-varying optimizations such as caches, out-of-order execution, multithreading, multiprocessing, and prefetching, which help average throughput more than guaranteed latency. The absence of these features also helps explain why the TPU is relatively small and low power despite its many MACs and large memory.

The TPU is compared with a server-class Intel Haswell CPU and an Nvidia K80 GPU, both contemporaries deployed in the same datacenters. The workload consists of production NN applications (MLPs, CNNs, and LSTMs) written in the high-level TensorFlow framework, representing 95% of the authors' datacenter NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X–30X faster than the GPU or CPU, with TOPS/Watt (the energy-efficiency measure) about 30X–80X higher. The authors also estimate that using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X that of the GPU and 200X that of the CPU.

Overall, the study reports substantial performance and energy-efficiency advantages for the TPU over contemporary CPUs and GPUs on datacenter NN inference, supporting the idea that domain-specific hardware can substantially improve cost-energy-performance for such workloads.
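
The GDDR5 estimate suggests that some workloads are limited by memory bandwidth rather than by the MAC array. One common way to reason about this is a roofline-style bound, where attainable throughput is the smaller of peak compute and memory bandwidth times operational intensity. The sketch below is a minimal illustration under assumed numbers: the bandwidth and operations-per-byte values are placeholders, not figures from this summary, and the function name is hypothetical.

```python
def attainable_ops_per_s(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline-style bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


PEAK = 92e12             # peak ops/second, from the summary
OPS_PER_BYTE = 100.0     # hypothetical operational intensity of a workload
BASE_BW = 30e9           # hypothetical baseline memory bandwidth in bytes/second
FASTER_BW = 3 * BASE_BW  # hypothetical memory with three times the bandwidth

base = attainable_ops_per_s(PEAK, BASE_BW, OPS_PER_BYTE)
faster = attainable_ops_per_s(PEAK, FASTER_BW, OPS_PER_BYTE)
print(f"baseline: {base / 1e12:.1f} TOPS, faster memory: {faster / 1e12:.1f} TOPS, "
      f"speedup: {faster / base:.1f}x")
```

As long as a workload stays below the compute limit, the bound scales linearly with bandwidth, so tripling bandwidth triples attainable throughput in this toy model. Whether the paper's estimate rests on exactly this reasoning cannot be confirmed from the metadata alone.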
Created on 03 Jan. 2024
