GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

AI-generated keywords: GPTQ, Weight Quantization, Language Modeling, Compression Techniques, Ethical Considerations

AI-generated Key Points

  • GPT models are recognized for their exceptional performance in complex language modeling tasks
  • The large size of GPT models requires multiple high-performance GPUs for inference, limiting usability
  • Researchers have explored model compression techniques to address this issue
  • Existing compression methods are constrained by the scale and complexity of GPT models
  • The authors propose a novel one-shot weight quantization method called GPTQ
  • GPTQ leverages approximate second-order information for high accuracy and efficiency
  • GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth to 3 or 4 bits per weight with negligible accuracy degradation relative to the uncompressed baseline model
  • GPTQ outperforms previously proposed one-shot quantization methods by more than doubling the compression gains while preserving accuracy
  • Implementation of GPTQ is available on GitHub for further exploration and application
  • Efficient execution of massive language models on a single GPU makes them more accessible for various applications
  • Comprehensive evaluation of compression techniques should consider secondary measures beyond standard accuracy metrics, such as transferability and bias effects, to better understand implications
  • The authors acknowledge funding from the European Research Council (ERC), as well as experimental and compute infrastructure support

Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

License: CC BY 4.0

Abstract: Generative Pre-trained Transformer (GPT) models set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs to execute, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 2x when using high-end GPUs (NVIDIA A100) and 4x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
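
The single-GPU claim in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts only the bytes needed to store the weights at a given bitwidth. It deliberately ignores activations, the attention cache, and the per-group scale and zero-point metadata that a real quantized checkpoint carries, so the figures are lower bounds. The helper name `weight_memory_gb` and the 80 GB capacity used for comparison are illustrative assumptions, not values taken from the paper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 175e9          # parameter count quoted in the abstract
GPU_MEMORY_GB = 80.0      # assumption: one GPU with 80 GB of memory

for bits in (16, 4, 3):
    size = weight_memory_gb(N_PARAMS, bits)
    fits = "fits" if size <= GPU_MEMORY_GB else "does not fit"
    print(f"{bits:>2} bits/weight: {size:6.1f} GB of weights ({fits} in {GPU_MEMORY_GB:.0f} GB)")
```

Under these assumptions the FP16 weights need 350 GB, the 4-bit weights 87.5 GB, and the 3-bit weights roughly 66 GB, which is why going from 4 to 3 bits matters for fitting the model on one device.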

Submitted to arXiv on 31 Oct. 2022

Results of the summarizing process for the arXiv paper: 2210.17323v1

Generative Pre-trained Transformer (GPT) models are known for their strong performance on complex language modeling tasks, but their computational and storage requirements are very high. Because of their size, even inference can require multiple high-performance GPUs, which limits their usability. Model compression is one way to relieve this pressure, yet existing compression methods are constrained by the scale and complexity of GPT models.

In this paper, the authors propose GPTQ, a one-shot weight quantization method that uses approximate second-order information to achieve both high accuracy and high efficiency. They demonstrate that GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth to 3 or 4 bits per weight with negligible accuracy degradation relative to the uncompressed baseline. GPTQ more than doubles the compression gains of previously proposed one-shot quantization methods while preserving accuracy. The implementation is available on GitHub.

By making it possible to run a 175 billion-parameter language model on a single GPU, this work makes such models more accessible for a range of applications. The technical content does not raise significant ethical concerns directly. Still, evaluation of compressed language models should look beyond standard accuracy metrics to secondary measures, such as transferability and bias effects, to better understand the implications of compression techniques like GPTQ.

The authors acknowledge funding from the European Research Council (ERC) under the Horizon 2020 program (grant agreement No. 805223 ScaleML), experimental support from Eldar Kurtic, and compute infrastructure support from the Swiss National Supercomputing Center (CSCS).

In conclusion, the paper introduces GPTQ, an accurate and efficient weight quantization method that substantially compresses large language models, and it highlights the need to evaluate compression techniques on secondary measures and ethical considerations as well as accuracy.
Created on 10 Aug. 2023
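
The summary above describes quantization "based on approximate second-order information" without showing what that could look like. The following NumPy sketch is an illustration only and is not the authors' implementation (that one lives in the GitHub repository linked in the abstract). It compares plain round-to-nearest quantization with a column-by-column scheme that, after rounding each column, pushes the rounding error onto the not-yet-quantized columns using the inverse of a layer Hessian estimated from calibration inputs. The update rule, the damping value, the per-row min-max grid, the synthetic data, and all function names are assumptions made for this example.

```python
import numpy as np


def grid_params(W, bits):
    """Per-row asymmetric min-max grid: returns (scale, zero, maxq)."""
    maxq = 2 ** bits - 1
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0              # guard against all-zero rows
    zero = np.round(-wmin / scale)
    return scale, zero, maxq


def to_grid(w, scale, zero, maxq):
    """Round to the nearest grid level, clip, and map back to floats."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)


def quantize_rtn(W, bits):
    """Baseline: round every weight to the nearest grid point independently."""
    scale, zero, maxq = grid_params(W, bits)
    return to_grid(W, scale[:, None], zero[:, None], maxq)


def quantize_second_order(W, X, bits, damp=0.01):
    """Quantize W (out x in) column by column with error compensation.

    X (in x n_samples) holds calibration inputs; H = 2 X X^T is the Hessian
    of the layer's squared output error with respect to one row of W.
    """
    W = W.astype(np.float64).copy()
    scale, zero, maxq = grid_params(W, bits)
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping (assumed value)
    U = np.linalg.cholesky(np.linalg.inv(H)).T             # inv(H) = U.T @ U, U upper
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = to_grid(W[:, i], scale, zero, maxq)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        # Spread this column's rounding error over the remaining columns.
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q


def layer_error(W, Q, X):
    """Squared error of the layer output on the calibration inputs."""
    return float(np.sum(((W - Q) @ X) ** 2))


# Synthetic, correlated calibration data (purely illustrative).
rng = np.random.default_rng(0)
n_out, n_in, n_samples = 16, 64, 512
B = rng.standard_normal((n_in, 8))
X = B @ rng.standard_normal((8, n_samples)) + rng.standard_normal((n_in, n_samples))
W = rng.standard_normal((n_out, n_in))

for bits in (4, 3):
    e_rtn = layer_error(W, quantize_rtn(W, bits), X)
    e_so = layer_error(W, quantize_second_order(W, X, bits), X)
    print(f"{bits}-bit  round-to-nearest: {e_rtn:.1f}  compensated: {e_so:.1f}")
```

On correlated inputs like these, the compensated variant would be expected to give a lower layer output error than plain rounding, because later columns absorb part of the error made on earlier ones. The exact numbers depend on the seed and the synthetic data, and they say nothing about accuracy on a real language model.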
