Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators
AI-generated Key Points
⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.
- Authors Trevor E. Pogue and Nicola Nicolici introduce the Free-pipeline Fast Inner Product (FFIP) algorithm and its corresponding hardware architecture
- FFIP improves on the fast inner-product algorithm (FIP) proposed by Winograd in 1968, raising clock frequency and, as a consequence, throughput at a similar hardware cost
- Unlike the unrelated Winograd minimal filtering algorithms for convolutional layers, FIP applies to all ML model layers that mainly decompose to matrix multiplication, including fully-connected, convolutional, recurrent, and attention/transformer layers
- The authors implement FIP in an ML accelerator for the first time, then present the FFIP algorithm and its generalized architecture, along with ML-specific optimizations for both
- FFIP can be incorporated into traditional fixed-point systolic array ML accelerators to achieve the same throughput with half the number of multiply-accumulate (MAC) units, or to double the maximum systolic array size that fits on a device with a fixed hardware budget
- The FFIP implementation for non-sparse ML models with 8 to 16-bit fixed-point inputs achieves higher throughput and compute efficiency than the best-in-class prior solutions on the same type of compute platform
Authors: Trevor E. Pogue, Nicola Nicolici
Abstract: We introduce a new algorithm called the Free-pipeline Fast Inner Product (FFIP) and its hardware architecture that improve an under-explored fast inner-product algorithm (FIP) proposed by Winograd in 1968. Unlike the unrelated Winograd minimal filtering algorithms for convolutional layers, FIP is applicable to all machine learning (ML) model layers that can mainly decompose to matrix multiplication, including fully-connected, convolutional, recurrent, and attention/transformer layers. We implement FIP for the first time in an ML accelerator then present our FFIP algorithm and generalized architecture which inherently improve FIP's clock frequency and, as a consequence, throughput for a similar hardware cost. Finally, we contribute ML-specific optimizations for the FIP and FFIP algorithms and architectures. We show that FFIP can be seamlessly incorporated into traditional fixed-point systolic array ML accelerators to achieve the same throughput with half the number of multiply-accumulate (MAC) units, or it can double the maximum systolic array size that can fit onto devices with a fixed hardware budget. Our FFIP implementation for non-sparse ML models with 8 to 16-bit fixed-point inputs achieves higher throughput and compute efficiency than the best-in-class prior solutions on the same type of compute platform.
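To make the "half the number of MAC units" claim concrete, here is a minimal software sketch of the underlying 1968 Winograd inner-product identity only. It is not the authors' FFIP algorithm or hardware, whose details are not given in the abstract. For vectors x and y of even length, each pair of terms satisfies x1*y1 + x2*y2 = (x1 + y2)*(x2 + y1) - x1*x2 - y1*y2. In a matrix product, the correction term x1*x2 depends only on a row of A and y1*y2 only on a column of B, so each can be computed once and reused. The names `fip_matmul`, `naive_matmul`, and `mults` are hypothetical, chosen for this illustration.

```python
def naive_matmul(A, B):
    """Reference product of A (M x K) and B (K x N); uses M*N*K multiplications."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][n] for k in range(K)) for n in range(N)]
            for i in range(M)]


def fip_matmul(A, B):
    """Matrix product via Winograd's 1968 inner-product identity.

    Illustrative sketch only (assumes integer inputs and even K).
    Returns (C, mults), where mults counts scalar multiplications.
    """
    M, K, N = len(A), len(B), len(B[0])
    if K % 2 != 0:
        raise ValueError("inner dimension K must be even for this sketch")
    mults = 0

    # Correction term for each row of A: sum of products of adjacent pairs.
    row_corr = []
    for i in range(M):
        s = 0
        for j in range(0, K, 2):
            s += A[i][j] * A[i][j + 1]
            mults += 1
        row_corr.append(s)

    # Correction term for each column of B, computed the same way.
    col_corr = []
    for n in range(N):
        s = 0
        for j in range(0, K, 2):
            s += B[j][n] * B[j + 1][n]
            mults += 1
        col_corr.append(s)

    # Main loop: one multiplication per PAIR of inner-product terms.
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for n in range(N):
            acc = 0
            for j in range(0, K, 2):
                acc += (A[i][j] + B[j + 1][n]) * (A[i][j + 1] + B[j][n])
                mults += 1
            C[i][n] = acc - row_corr[i] - col_corr[n]
    return C, mults


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8]]
    B = [[1, 0, 2], [0, 1, 3], [4, 5, 6], [7, 8, 9]]
    C, mults = fip_matmul(A, B)
    print(C == naive_matmul(A, B), mults)
```

The multiplication count in this sketch is M*N*K/2 + (M + N)*K/2, compared with M*N*K for the direct method, so the saving approaches one half as M and N grow. The cost is extra additions before each multiplication, which is part of why the trade-off suits fixed-point hardware.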