GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

AI-generated keywords: GPTQ Quantization Compression Language Modeling Inference

AI-generated Key Points

  • Generative Pre-trained Transformer models (GPT or OPT) are exceptional in complex language modeling tasks.
  • The large size and high computational and storage costs of GPT models limit their usability.
  • The authors propose a new one-shot weight quantization method called GPTQ based on approximate second-order information that is highly accurate and efficient.
  • GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours while reducing the bitwidth down to 3 or 4 bits per weight with negligible accuracy degradation relative to the uncompressed baseline.
  • This more than doubles the compression gains relative to previously-proposed one-shot quantization methods while preserving accuracy.
  • GPTQ can still provide reasonable accuracy even in extreme quantization regimes where weights are quantized to 2-bit or even ternary levels.
  • The implementation of GPTQ is available at https://github.com/IST-DASLab/gptq. The authors acknowledge funding from the European Research Council under the European Union’s Horizon 2020 program, as well as experimental support from Eldar Kurtic and the IST Austria IT department.
  • Experimentally, these improvements translate into end-to-end inference speedups over FP16 of around 3.25x on high-end GPUs (NVIDIA A100) and 4.5x on more cost-effective ones (NVIDIA A6000).
  • The study focuses on "leading accuracy" metrics such as perplexity, which is standard in the literature; the authors believe a thorough study of the impact of compression on secondary measures such as bias effects is warranted.
  • In conclusion, GPTQ offers a highly accurate and efficient method for compressing large language models via quantization with little-to-no accuracy loss resulting in end-to-end inference speedups over FP16.
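
To make "3 or 4 bits per weight" concrete, here is a minimal sketch of plain round-to-nearest uniform quantization of a single weight row. This is the simple baseline that one-shot methods are usually compared against, not the GPTQ algorithm itself (GPTQ additionally uses approximate second-order information to choose the quantized values). The function name `quantize_rtn` and the sample row are hypothetical, chosen only for illustration.

```python
def quantize_rtn(weights, bits):
    """Round-to-nearest uniform quantization of one weight row (baseline, not GPTQ)."""
    levels = 2 ** bits - 1                      # highest integer code on the grid
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0     # fall back to 1.0 for a constant row
    # Map each weight to the nearest integer code, clamped to [0, levels].
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    # Reconstruct the approximate weights that inference would actually use.
    dequant = [w_min + c * scale for c in codes]
    return codes, dequant


row = [-0.8, -0.1, 0.0, 0.3, 0.6]               # hypothetical weights
for bits in (4, 3, 2):
    codes, approx = quantize_rtn(row, bits)
    worst = max(abs(a - b) for a, b in zip(row, approx))
    print(f"{bits}-bit codes={codes} max_abs_error={worst:.4f}")
```

The worst-case rounding error per weight is half a grid step, and the step doubles (roughly) with every bit removed, which is why naive rounding degrades quickly at 3 bits and below.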

Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

ICLR 2023
License: CC BY 4.0

Abstract: Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
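
The single-GPU claim in the abstract can be sanity-checked with back-of-the-envelope arithmetic on weight storage alone. The sketch below assumes a hypothetical accelerator with 80 GB of memory (the abstract does not state a capacity) and ignores activations, the key-value cache, and the small overhead of quantization scales and zero points, so it gives a lower bound on the real footprint.

```python
PARAMS = 175e9          # parameter count quoted in the abstract
GPU_MEMORY_GB = 80      # assumption: memory of one accelerator, not from the abstract


def weight_gb(params, bits):
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9


for bits in (16, 4, 3):
    gb = weight_gb(PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:6.1f} GB, fits in {GPU_MEMORY_GB} GB: {gb <= GPU_MEMORY_GB}")
```

Under these assumptions the FP16 weights need 350 GB, 4-bit weights need 87.5 GB, and only the 3-bit variant (about 65.6 GB) drops below the assumed 80 GB budget.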

Submitted to arXiv on 31 Oct. 2022

Results of the summarizing process for the arXiv paper: 2210.17323v2

Generative Pre-trained Transformer models (GPT or OPT) have revolutionized natural language processing with their exceptional performance in complex language modeling tasks. However, their large size and high computational and storage costs limit their usability. To address this issue, the authors propose a new one-shot weight quantization method called GPTQ, based on approximate second-order information, that is both highly accurate and efficient. The method can quantize GPT models with 175 billion parameters in approximately four GPU hours while reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. This more than doubles the compression gains relative to previously proposed one-shot quantization methods while preserving accuracy. The authors demonstrate that the method can still provide reasonable accuracy even in extreme quantization regimes where weights are quantized to 2-bit or even ternary levels. They show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation of GPTQ is available at https://github.com/IST-DASLab/gptq. The authors acknowledge funding from the European Research Council under the European Union’s Horizon 2020 program, as well as experimental support from Eldar Kurtic and the IST Austria IT department. The study focuses on "leading accuracy" metrics such as perplexity, which is standard in the literature, and the authors believe a thorough study of the impact of compression on secondary measures such as bias effects is warranted. They also note that the work makes inference on extremely large language models more accessible, for better or for worse; as such tools become easier to use and deploy, understanding their power and limitations becomes increasingly necessary.
In conclusion, GPTQ offers a highly accurate and efficient method for compressing large language models via quantization with little-to-no accuracy loss resulting in end-to-end inference speedups over FP16.
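
Since the summary leans on perplexity as its main accuracy metric, a small sketch of how perplexity is conventionally computed may help: it is the exponential of the average negative log-likelihood per token. The helper name `perplexity` and the toy inputs are hypothetical; a real evaluation would take the per-token log-probabilities from the model under test.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Toy case: a model that assigns probability 1/4 to every observed token.
uniform_log_probs = [math.log(1 / 4)] * 10
print(perplexity(uniform_log_probs))
```

A model that is always choosing uniformly among four options has a perplexity of about 4, so lower values mean the model is less "surprised" by the text; "negligible accuracy degradation" then means the quantized model's perplexity stays close to that of the FP16 baseline.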
Created on 22 Jun. 2023

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.