In-Datacenter Performance Analysis of a Tensor Processing Unit
AI-generated keywords:
TPU
ASIC
Neural Networks
Performance
Energy Efficiency
- Custom ASIC called Tensor Processing Unit (TPU) evaluated
- TPU designed to accelerate neural network inference phase
- TPU features 65,536 8-bit MAC matrix multiply unit with peak throughput of 92 TeraOps/second (TOPS) and on-chip memory of 28 MiB
- Domain-specific hardware like TPU necessary for major improvements in cost-energy-performance
- Deterministic execution model of TPU better suited for meeting 99th-percentile response-time requirement compared to CPUs and GPUs
- Despite numerous MACs and large memory capacity, TPU remains small and low power due to lack of time-varying optimizations
- TPU outperforms contemporary GPU or CPU by approximately 15X–30X in terms of speed on average
- TOPS/Watt measure of energy efficiency is about 30X–80X higher for TPU compared to GPU and CPU
- Incorporating GPU's GDDR5 memory into TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X that of GPU and 200X that of CPU
- Significant performance advantages demonstrated using TPU over traditional CPUs and GPUs for NN inference tasks in datacenter applications
Authors:
Norman P. Jouppi,
Cliff Young,
Nishant Patil,
David Patterson,
Gaurav Agrawal,
Raminder Bajwa,
Sarah Bates,
Suresh Bhatia,
Nan Boden,
Al Borchers,
Rick Boyle,
Pierre-luc Cantin,
Clifford Chao,
Chris Clark,
Jeremy Coriell,
Mike Daley,
Matt Dau,
Jeffrey Dean,
Ben Gelb,
Tara Vazir Ghaemmaghami,
Rajendra Gottipati,
William Gulland,
Robert Hagmann,
C. Richard Ho,
Doug Hogberg,
John Hu,
Robert Hundt,
Dan Hurt,
Julian Ibarz,
Aaron Jaffey,
Alek Jaworski,
Alexander Kaplan,
Harshit Khaitan,
Andy Koch,
Naveen Kumar,
Steve Lacy,
James Laudon,
James Law,
Diemthu Le,
Chris Leary,
Zhuyuan Liu,
Kyle Lucke,
Alan Lundin,
Gordon MacKean,
Adriana Maggiore,
Maire Mahony,
Kieran Miller,
Rahul Nagarajan,
Ravi Narayanaswami,
Ray Ni,
Kathy Nix,
Thomas Norrie,
Mark Omernick,
Narayana Penukonda,
Andy Phelps,
Jonathan Ross,
Matt Ross,
Amir Salek,
Emad Samadiani,
Chris Severn,
Gregory Sizikov,
Matthew Snelham,
Jed Souter,
Dan Steinberg,
Andy Swing,
Mercedes Tan,
Gregory Thorson,
Bo Tian,
Horia Toma,
Erick Tuttle,
Vijay Vasudevan,
Richard Walter,
Walter Wang,
Eric Wilcox,
Doe Hyun Yoon
17 pages, 11 figures, 8 tables. To appear at the 44th International
Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 2017
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
Submitted to arXiv on 16 Apr. 2017
This paper evaluates the performance of a custom ASIC called a Tensor Processing Unit (TPU) that has been deployed in datacenters since 2015. The TPU is designed to accelerate the inference phase of neural networks (NN) and features a 65,536 8-bit MAC matrix multiply unit with a peak throughput of 92 TeraOps/second (TOPS) and a large, software-managed on-chip memory of 28 MiB. The authors argue that domain-specific hardware, like the TPU, is necessary for achieving major improvements in cost-energy-performance. They highlight that the deterministic execution model of the TPU is better suited to meeting the 99th-percentile response-time requirement of NN applications than CPUs and GPUs, which rely on time-varying optimizations such as caches, out-of-order execution, multithreading, multiprocessing, and prefetching. These optimizations help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having numerous MACs and a large memory, the TPU remains relatively small and low power. To assess its performance, the authors compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, both contemporaries deployed in the same datacenters. The workload used for evaluation consists of production NN applications (MLPs, CNNs, and LSTMs) written in the high-level TensorFlow framework, which represent 95% of the datacenters' NN inference demand. The results show that, despite low utilization for some applications, the TPU is on average about 15X–30X faster than its contemporary GPU or CPU. Additionally, TOPS/Watt (the measure of energy efficiency) is about 30X–80X higher for the TPU. The authors also report that using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X that of the GPU and 200X that of the CPU.
Overall, this study demonstrates significant performance advantages of using a TPU over traditional CPUs and GPUs when accelerating NN inference tasks in datacenter applications. The findings support the idea that domain-specific hardware can lead to substantial improvements in cost-energy-performance for such tasks.
Summary:
1. A special computer chip called the Tensor Processing Unit (TPU) was tested and found to be very good at running neural networks, a kind of program that finds patterns in data.
2. The TPU is built to make already-trained neural networks give answers faster, and it has tens of thousands of small parts that can do math quickly.
3. The TPU also has a large memory built into the chip, so it can keep the data it is working on close at hand.
4. It's important to have chips like the TPU because they can make computers work better, faster, and use less energy.
5. When compared to the CPU and GPU chips of its time, the TPU was about 15 to 30 times faster at these tasks and used much less energy for the same work.
Definitions:
- Custom ASIC: A special type of computer chip that is designed for a specific purpose.
- Neural network inference phase: The stage where an already-trained neural network is used to make predictions or decisions about new data.
- MAC matrix multiply unit: The central part of the TPU chip, made of 65,536 small multiply-accumulate (MAC) units that each multiply two numbers and add the result to a running total.
- Peak throughput: How fast the TPU chip can do calculations at its best performance level.
- On-chip memory: The place inside the TPU chip where it can store information for later use.
- Domain-specific hardware: Computer chips that are made specifically for certain types of tasks or jobs.
- Deterministic execution model: How the TPU chip works in a predictable way, taking about the same amount of time for the same kind of work instead of being fast on some requests and slow on others.
- CPUs and GPUs: Different types of computer chips that are used
The Performance of Tensor Processing Units (TPUs) for Neural Network Inference
Neural networks (NNs) have become increasingly popular in recent years, with applications ranging from computer vision to natural language processing. To accelerate the inference phase of these networks, a custom ASIC called a Tensor Processing Unit (TPU) has been deployed in datacenters since 2015. This paper evaluates the performance of this domain-specific hardware and compares it to contemporary CPUs and GPUs.
Background on TPUs
The TPU is designed specifically for accelerating NN inference tasks and features a 65,536 8-bit MAC matrix multiply unit with a peak throughput of 92 TeraOps/second (TOPS). It also has a large on-chip memory capacity of 28 MiB. The authors argue that domain-specific hardware like the TPU is necessary for achieving major improvements in cost–energy–performance metrics compared to traditional CPUs and GPUs.
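The 92 TOPS peak figure can be sanity-checked with simple arithmetic. The sketch below assumes each MAC counts as two operations per cycle (one multiply, one add) and assumes a 700 MHz clock; the clock rate is not stated in the text above, so treat it as an assumption of this example rather than a figure from this summary.

```python
import math

MACS = 65_536        # from the text: 8-bit MAC units in the matrix multiply unit
OPS_PER_MAC = 2      # ASSUMPTION: one multiply plus one add per MAC per cycle
CLOCK_HZ = 700e6     # ASSUMPTION: 700 MHz clock, not stated in the text above


def peak_tops(macs: int, ops_per_mac: int, clock_hz: float) -> float:
    """Peak throughput in tera-operations per second."""
    return macs * ops_per_mac * clock_hz / 1e12


if __name__ == "__main__":
    side = math.isqrt(MACS)  # 65,536 is a perfect square, so the MACs fit a square grid
    print(f"MAC array: {side} x {side}")
    print(f"peak throughput: {peak_tops(MACS, OPS_PER_MAC, CLOCK_HZ):.1f} TOPS")
```

Under these assumptions the result rounds to the 92 TOPS quoted above. It is a peak number: it is only reached when every MAC is busy on every cycle.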
CPUs and GPUs rely on time-varying optimizations such as caches, out-of-order execution, multithreading, multiprocessing, and prefetching, which help average throughput more than guaranteed latency. The TPU's deterministic execution model is therefore better suited to meeting the 99th-percentile response-time requirement of NN applications. The absence of these optimizations also helps explain why, despite its many MACs and large memory, the TPU is relatively small and low power.
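The difference between average latency and 99th-percentile latency can be made concrete with a small sketch. The latency values below are hypothetical and chosen only for illustration; they are not measurements from the paper. The percentile uses the simple nearest-rank definition.

```python
import math
from statistics import mean


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (illustrative only).
# "steady": every request takes the same time, as with predictable execution.
steady = [7.0] * 100
# "bursty": most requests are fast, but a few hit a slow path (for example, a cache miss).
bursty = [5.0] * 97 + [40.0] * 3

if __name__ == "__main__":
    for name, latencies in (("steady", steady), ("bursty", bursty)):
        print(f"{name}: mean={mean(latencies):.2f} ms, "
              f"p99={percentile(latencies, 99):.1f} ms")
```

In this toy data the bursty system has the better mean but a far worse 99th percentile, which is the situation a response-time limit on the slowest 1% of requests penalizes.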
Evaluation Methodology
To assess its performance, the authors compared the TPU with two contemporaries deployed in the same datacenters: a server-class Intel Haswell CPU and an Nvidia K80 GPU. The workload consisted of production NN applications written in the high-level TensorFlow framework: MLPs (multilayer perceptrons), CNNs (convolutional neural networks), and LSTMs (long short-term memory networks). Together these represent 95% of the datacenters' NN inference demand.
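To give a feel for the kind of computation these workloads hand to an 8-bit MAC array, here is a minimal sketch of one fully connected (MLP-style) layer computed with 8-bit integers. The symmetric quantization scheme and the function names `quantize` and `int8_dense` are assumptions made for illustration; the text above does not describe how the TPU or TensorFlow quantizes values.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers with a simple symmetric scale (illustrative scheme)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dense(x_q, w_q):
    """Fully connected layer on integer inputs, one multiply-accumulate per weight.

    x_q: list of n int8 activations
    w_q: n x m matrix of int8 weights, given as a list of rows
    Returns m integer accumulators (Python ints, so they cannot overflow).
    """
    n, m = len(w_q), len(w_q[0])
    if len(x_q) != n:
        raise ValueError("activation length must match the number of weight rows")
    acc = [0] * m
    for i in range(n):
        for j in range(m):
            acc[j] += x_q[i] * w_q[i][j]  # the MAC step: multiply, then accumulate
    return acc


if __name__ == "__main__":
    x_q = quantize([0.5, -1.0], scale=0.01)
    w_q = [[3, 4], [5, 6]]
    print(x_q, int8_dense(x_q, w_q))
```

The inner statement is the multiply-accumulate that a hardware MAC unit performs; the sketch runs the MACs one after another, whereas a matrix unit performs many of them in parallel.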
Results
The results showed that, despite low utilization for some applications, the TPU was on average about 15X–30X faster than its contemporary GPU or CPU, and its TOPS/Watt (the measure of energy efficiency) was about 30X–80X higher. Furthermore, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X that of the GPU and 200X that of the CPU.
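One way to see why faster memory could raise achieved TOPS is a roofline-style estimate, where attainable performance is the smaller of the compute peak and the rate at which memory can feed the chip. The sketch below is a generic illustration of that idea, not the authors' analysis; the two bandwidth values and the operations-per-byte figures are assumptions chosen for the example.

```python
def attainable_tops(peak_tops, bandwidth_gb_s, ops_per_byte):
    """Roofline estimate: performance is capped by compute or by memory traffic."""
    memory_bound_tops = bandwidth_gb_s * ops_per_byte / 1000.0  # GB/s * ops/byte = Gops/s
    return min(peak_tops, memory_bound_tops)


PEAK_TOPS = 92.0   # from the text
SLOW_BW = 34.0     # ASSUMPTION: baseline memory bandwidth in GB/s (illustrative)
FAST_BW = 180.0    # ASSUMPTION: GDDR5-class memory bandwidth in GB/s (illustrative)

if __name__ == "__main__":
    # Operations performed per byte fetched from memory (hypothetical workloads).
    for ops_per_byte in (100, 1000, 5000):
        slow = attainable_tops(PEAK_TOPS, SLOW_BW, ops_per_byte)
        fast = attainable_tops(PEAK_TOPS, FAST_BW, ops_per_byte)
        print(f"{ops_per_byte:>5} ops/byte: {slow:6.1f} -> {fast:6.1f} TOPS")
```

In this model, workloads that do little computation per byte fetched are limited by memory and speed up with more bandwidth, while workloads already at the compute peak do not.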
Conclusion
Overall, this study demonstrates significant performance advantages of a custom ASIC like the Tensor Processing Unit over traditional CPUs and GPUs when accelerating neural network inference tasks in datacenter applications. The findings support the idea that domain-specific hardware can lead to substantial improvements in cost-energy-performance for such tasks, potentially making them accessible to a wider range of users across industries such as healthcare, finance, and retail.