Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework

AI-generated keywords: FPGA, DNN, Quantization, MSQ, Inference

AI-generated Key Points

  • The paper proposes a novel FPGA-centric DNN quantization framework for efficient DNN inference on FPGA devices.
  • Different quantization schemes are applied to different rows of the weight matrix to better utilize heterogeneous FPGA hardware resources.
  • A hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) is proposed for Gaussian-like weight distributions, while fixed-point quantization suits Uniform-like weight distributions.
  • An FPGA-centric mixed scheme quantization (MSQ) that combines SP2 and fixed-point schemes within a layer is proposed to fully exploit the FPGA resources and to maintain or even increase accuracy through better matching with the weight distributions.
  • The authors evaluate their framework across multiple application domains with various DNNs such as CNNs and RNNs, achieving a performance improvement of 2.1× to 4.1× compared to solely exploiting DSPs for all multiplication operations.
  • This research addresses model compression, a critical step for deploying DNN models on edge devices, while maintaining or even improving accuracy.
  • The proposed MSQ approach offers a hardware-friendly way to implement DNN inference efficiently on FPGA-based edge platforms.
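
To make the SP2 idea in the key points concrete, here is a minimal Python sketch of what a sum-of-power-of-2 level set might look like. The summary does not give the exact level definition, so the layout below (each magnitude is the sum of two terms, each term zero or a negative power of two, with `m1` and `m2` bits selecting the terms) is an assumption, as are the names `sp2_levels`, `quantize`, and the scale `alpha`.

```python
from itertools import product


def sp2_levels(m1: int, m2: int) -> list[float]:
    """Return sorted SP2 levels in [-1, 1] under the assumed layout.

    Assumption (not taken from the text): each magnitude is q1 + q2, where
    q1 is 0 or one of 2^-1 ... 2^-(2^m1 - 1), and q2 likewise with m2.
    """
    def terms(bits: int) -> list[float]:
        return [0.0] + [2.0 ** -k for k in range(1, 2 ** bits)]

    magnitudes = {a + b for a, b in product(terms(m1), terms(m2))}
    negatives = {-m for m in magnitudes if m > 0}
    return sorted(magnitudes | negatives)


def quantize(w: float, levels: list[float], alpha: float = 1.0) -> float:
    """Clip w to [-alpha, alpha] and snap it to the nearest scaled level."""
    clipped = max(-alpha, min(alpha, w))
    return alpha * min(levels, key=lambda q: abs(clipped / alpha - q))


levels = sp2_levels(2, 1)
print(levels)                 # levels are denser near zero than near +/-1
print(quantize(0.3, levels))  # snaps to the nearest level
```

Under this assumed layout the levels cluster near zero, which is one way to see why such a scheme would suit bell-shaped weight distributions, whereas evenly spaced fixed-point levels suit flatter ones.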

Authors: Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K.-H. So, Xuehai Qian, Yanzhi Wang, Xue Lin

13 pages, 2 figures
License: CC BY 4.0

Abstract: Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step to deploy DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes for different rows of the weight matrix. It is motivated by (1) the distribution of the weights in the different rows are not the same; and (2) the potential of achieving better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) suitable for Gaussian-like weight distribution, in which the multiplication arithmetic can be replaced with logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for Uniform-like weight distribution and can be implemented efficiently by DSP. Then to fully explore the resources, we propose an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed SP2 and the fixed-point schemes. Combining the two schemes can maintain, or even increase accuracy due to better matching with weight distributions.
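
The abstract's claim that SP2 multiplication "can be replaced with logic shifter and adder" can be illustrated with a small integer sketch. This is not the authors' hardware design. It only shows the arithmetic identity, assuming a weight of the form sign × (2^-k1 + 2^-k2) and a hypothetical fixed-point output scale of `FRAC_BITS` fractional bits. The function name and weight encoding are illustrative.

```python
from typing import Optional

FRAC_BITS = 3  # hypothetical output scale: results are in units of 2**-FRAC_BITS


def sp2_multiply(x: int, sign: int, k1: Optional[int], k2: Optional[int]) -> int:
    """Compute x * w * 2**FRAC_BITS using only shifts and one addition.

    The weight is w = sign * (2**-k1 + 2**-k2); a term with k = None is zero.
    Each k must satisfy 1 <= k <= FRAC_BITS so the shift amount is non-negative.
    """
    acc = 0
    for k in (k1, k2):
        if k is not None:
            acc += x << (FRAC_BITS - k)  # x * 2**-k, rescaled by 2**FRAC_BITS
    return sign * acc


# Example: w = -(2**-1 + 2**-3) = -0.625 applied to the activation x = 5.
scaled = sp2_multiply(5, -1, 1, 3)
print(scaled / 2 ** FRAC_BITS)  # equals 5 * -0.625
```

Because no general multiplier appears in the loop, an operation of this shape can plausibly be mapped to LUT logic rather than a DSP block, which is the resource trade-off the abstract describes.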

Submitted to arXiv on 08 Dec. 2020


Results of the summarizing process for the arXiv paper: 2012.04240v1

This paper develops a novel FPGA-centric deep neural network (DNN) quantization framework for efficient DNN inference on FPGA devices. Unlike existing methods that use the same quantization scheme for all weights, the proposed solution applies different quantization schemes to different rows of the weight matrix. Two observations motivate it: the weight distributions differ from row to row, and mixing schemes offers a way to better utilize heterogeneous FPGA hardware resources.

The authors first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2), suited to Gaussian-like weight distributions. With SP2, multiplication can be replaced by logic shifters and adders, enabling highly efficient implementations with FPGA LUT resources. In contrast, fixed-point quantization suits Uniform-like weight distributions and can be implemented efficiently with DSPs. To fully exploit the FPGA resources, they then propose an FPGA-centric mixed scheme quantization (MSQ) that combines SP2 and fixed-point schemes within a layer. MSQ can maintain or even increase accuracy because each scheme better matches the weight distribution it is applied to.

The authors evaluate the framework across multiple application domains with various DNNs such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). With optimal SP2/fixed-point ratios on two FPGA devices, Zynq XC7Z020 and XC7Z045, they achieve a performance improvement of 2.1× to 4.1× compared to solely exploiting DSPs for all multiplication operations.

This research addresses model compression, a critical step for deploying DNN models on edge devices given their large model size and computation amount. This work is partly supported by National Science Foundation grants CCF-1901378, CCF-1919117, CCF-1919289, and CNS-1909172, and by DARPA grant HR00112090055.
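
As a rough illustration of row-wise scheme assignment, the sketch below ranks the rows of a weight matrix by how peaked their distribution is and gives the most Gaussian-like rows to SP2. The summary does not say how the authors decide which rows get which scheme, so the kurtosis criterion is a stand-in chosen for illustration (about 3 for a Gaussian, about 1.8 for a uniform distribution). The names `kurtosis`, `assign_schemes`, and `sp2_fraction` are hypothetical; `sp2_fraction` stands in for the device-dependent SP2/fixed-point ratio mentioned above.

```python
import random
from statistics import fmean


def kurtosis(row: list[float]) -> float:
    """Fourth standardized moment: ~3 for Gaussian, ~1.8 for uniform data."""
    mu = fmean(row)
    m2 = fmean((v - mu) ** 2 for v in row)
    m4 = fmean((v - mu) ** 4 for v in row)
    return m4 / (m2 * m2)


def assign_schemes(weights: list[list[float]], sp2_fraction: float) -> list[str]:
    """Label each row "sp2" or "fixed" (illustrative heuristic, not the paper's).

    The rows with the highest kurtosis are treated as Gaussian-like and
    assigned to SP2 until the requested fraction of rows is reached.
    """
    n_sp2 = round(sp2_fraction * len(weights))
    ranked = sorted(range(len(weights)),
                    key=lambda i: kurtosis(weights[i]), reverse=True)
    sp2_rows = set(ranked[:n_sp2])
    return ["sp2" if i in sp2_rows else "fixed" for i in range(len(weights))]


# Synthetic demo: two Gaussian-like rows followed by two uniform-like rows.
rng = random.Random(0)
rows = [[rng.gauss(0.0, 0.1) for _ in range(1000)] for _ in range(2)]
rows += [[rng.uniform(-0.2, 0.2) for _ in range(1000)] for _ in range(2)]
schemes = assign_schemes(rows, sp2_fraction=0.5)
print(schemes)
```

In a real deployment the fraction would presumably be driven by the LUT and DSP budget of the target device rather than set by hand.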
Created on 08 Apr. 2023

