GPTVQ: The Blessing of Dimensionality for LLM Quantization

AI-generated keywords: GPTVQ

AI-generated Key Points

  • **GPTVQ:** A new post-training vector quantization (VQ) technique designed for large language models.
  • **Dimensionality:** Increasing the quantization dimensionality improves the trade-off between model size and accuracy in neural network quantization.
  • **Quantization:** GPTVQ interleaves quantization of one or more columns with updates to the remaining unquantized weights, guided by the Hessian of the per-layer output reconstruction MSE.
  • **Large Language Models:** The method scales to LLMs such as Llama-v2 and Mistral; processing a Llama-v2-70B model takes between 3 and 11 hours on a single H100, depending on the quantization setting.
  • **Trade-Off:** GPTVQ improves the size versus accuracy trade-off, and on-device timings on a mobile CPU show that VQ decompression gives lower latency than a 4-bit integer format.
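
To make the idea of quantization dimensionality concrete, here is a minimal sketch of plain vector quantization: weights are grouped into d-dimensional vectors and each vector is replaced by the index of its nearest codebook centroid. The function names, the toy codebook, and the grouping of consecutive weights are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vq_quantize(weights, codebook):
    """Map each group of d consecutive weights to its nearest centroid.

    weights:  array whose total size is divisible by d
    codebook: (k, d) array of centroids (hypothetical toy codebook)
    Returns the centroid indices and the dequantized weights.
    """
    d = codebook.shape[1]
    vectors = weights.reshape(-1, d)                       # (n_vectors, d)
    # Squared Euclidean distance from every vector to every centroid.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)                         # (n_vectors,)
    dequantized = codebook[indices].reshape(weights.shape)
    return indices, dequantized

def index_bits_per_weight(k, d):
    """Index storage cost per weight, ignoring codebook overhead."""
    return np.log2(k) / d

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])              # k = 2 centroids, d = 2
weights = np.array([[0.1, -0.1, 0.9, 1.2]])
indices, dequantized = vq_quantize(weights, codebook)
```

Under these assumptions, a codebook with k = 256 centroids of dimension d = 2 costs `index_bits_per_weight(256, 2)` = 4 index bits per weight, before accounting for the storage of the codebook itself.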

Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

License: CC BY 4.0

Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size vs. accuracy trade-offs on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.
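
The interleaving described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the layer computes `X @ W.T`, the Hessian of the output reconstruction MSE is taken as `2 * X.T @ X` with a small damping term, each d-dimensional vector is formed from d adjacent columns within a row, centroids are chosen by plain Euclidean distance, and the remaining weights are corrected with the standard block form of the Optimal Brain Surgeon update. The names `interleaved_vq`, `nearest_centroids`, and `damp` are hypothetical.

```python
import numpy as np

def nearest_centroids(vectors, codebook):
    """Return, for each row of `vectors`, the closest row of `codebook`."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[dists.argmin(axis=1)]

def interleaved_vq(W, X, codebook, damp=0.01):
    """Quantize d columns at a time, then update the unquantized columns.

    W:        (n_out, n_in) weight matrix
    X:        (n_samples, n_in) calibration inputs
    codebook: (k, d) centroids, with n_in divisible by d
    """
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    d = codebook.shape[1]
    assert n_in % d == 0, "n_in must be divisible by the VQ dimension"

    # Hessian of ||X W^T - X Q^T||^2 with respect to one row of W (damped).
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for start in range(0, n_in, d):
        cols = slice(start, start + d)
        rest = slice(start + d, n_in)
        Q[:, cols] = nearest_centroids(W[:, cols], codebook)
        err = W[:, cols] - Q[:, cols]                      # (n_out, d)
        if start + d < n_in:
            # Block OBS step: A = Hinv_cc^{-1} Hinv_cr.
            A = np.linalg.solve(Hinv[cols, cols], Hinv[cols, rest])
            # Shift the remaining weights to compensate for the error.
            W[:, rest] -= err @ A
            # Inverse Hessian restricted to the remaining columns.
            Hinv[rest, rest] -= Hinv[rest, cols] @ A
    return Q
```

When the calibration inputs are uncorrelated (for example `X = np.eye(n_in)`), the inverse Hessian is diagonal, the correction term vanishes, and the sketch reduces to plain nearest-centroid VQ. The compensation only has an effect when input features are correlated.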

Submitted to arXiv on 23 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.15319v1

In the study "GPTVQ: The Blessing of Dimensionality for LLM Quantization," the authors show that increasing the quantization dimensionality can significantly improve the trade-off between size and accuracy in neural network quantization. They introduce the GPTVQ method, a fast post-training vector quantization (VQ) technique tailored for Large Language Models (LLMs). The method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction Mean Squared Error (MSE). The quantization codebooks are initialized using an efficient data-aware version of the Expectation-Maximization (EM) algorithm. The codebooks are then updated and further compressed through integer quantization and Singular Value Decomposition (SVD)-based compression.

  • **GPTVQ:** The authors propose a new post-training vector quantization technique designed specifically for large language models.
  • **Dimensionality:** Increasing the dimensionality of quantization significantly improves the balance between size and accuracy in neural network quantization.
  • **Quantization:** The GPTVQ method interleaves the quantization of one or more columns with updates to the weights that are not yet quantized.
  • **Large Language Models:** The method is tailored to large language models and scales to models such as Llama-v2 and Mistral.
  • **Trade-Off:** GPTVQ establishes a new state of the art in the size versus accuracy trade-off, and VQ decompression on a mobile CPU shows improved latency compared to a 4-bit integer format.
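
The data-aware EM initialization mentioned in the summary can be illustrated with an importance-weighted k-means loop. This is only a sketch under assumptions: the `importance` array is a hypothetical stand-in for whatever Hessian-derived weighting the method uses, and the function name and random initialization are illustrative choices rather than the paper's procedure.

```python
import numpy as np

def weighted_em_codebook(vectors, importance, k, n_iters=20, seed=0):
    """Fit k centroids with per-element importance weights.

    vectors:    (n, d) weight vectors to be quantized
    importance: (n, d) strictly positive weights (assumed to reflect
                how sensitive the layer output is to each element)
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    importance = np.asarray(importance, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points.
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # E-step: assign each vector to the centroid with the smallest
        # importance-weighted squared distance.
        diff = vectors[:, None, :] - codebook[None, :, :]          # (n, k, d)
        dists = (importance[:, None, :] * diff ** 2).sum(axis=-1)  # (n, k)
        assign = dists.argmin(axis=1)
        # M-step: each centroid becomes the importance-weighted mean of
        # its assigned vectors; empty clusters keep their old centroid.
        for j in range(k):
            mask = assign == j
            if mask.any():
                w = importance[mask]
                codebook[j] = (w * vectors[mask]).sum(axis=0) / w.sum(axis=0)
    return codebook
```

With uniform importance this reduces to ordinary k-means. Non-uniform importance pulls each centroid toward the elements whose quantization error is assumed to matter most.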
Created on 01 May. 2024
