Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- LoRA-finetuning quantization of Large Language Models (LLMs), which combines low-bit quantization with LoRA finetuning, is an active area of machine learning research aimed at compact models for resource-constrained hardware.
- Existing methods often leave the quantized LLM severely degraded, to the point that it may fail to benefit from LoRA finetuning at all.
- IR-QLoRA, introduced by Haotong Qin, Xudong Ma, and colleagues, focuses on information retention to improve the accuracy of quantized LLMs finetuned with LoRA.
- IR-QLoRA combines two techniques derived from a unified information perspective: statistics-based Information Calibration Quantization and finetuning-based Information Elastic Connection.
- Extensive experiments show that IR-QLoRA significantly improves accuracy across the LLaMA and LLaMA2 model families at 2- to 4-bit widths.
- For example, a 4-bit LLaMA-7B model achieves a 1.4% improvement on MMLU (Massive Multitask Language Understanding) over state-of-the-art methods, at a cost of only 0.31% additional time consumption.
- IR-QLoRA is versatile: it is compatible with different quantization frameworks, such as NormalFloat and Integer quantization, and brings consistent accuracy gains.
- Researchers can access the code implementation of IR-QLoRA at https://github.com/htqin/ir-qlora.
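The key points describe LoRA finetuning on top of a quantized base model, with Integer quantization named as one supported framework. As a point of reference for that general setup (not IR-QLoRA itself), here is a minimal sketch assuming symmetric per-row absmax integer quantization of a frozen weight matrix plus a standard LoRA update B·A scaled by alpha/rank. All class and function names are hypothetical, NormalFloat quantization is not shown, and A is initialized deterministically only to keep the example reproducible.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_row_absmax(row: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one row to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit codes
    # Fall back to scale 1.0 for an all-zero row to avoid dividing by zero.
    scale = max(abs(x) for x in row) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


class QuantLoRALinear:
    """Hypothetical layer: frozen quantized weight W_q plus a trainable low-rank update B @ A."""

    def __init__(self, weight: Matrix, bits: int, rank: int, alpha: float):
        out_f, in_f = len(weight), len(weight[0])
        # Frozen base weights are stored only as integer codes plus one scale per row.
        self.q = [quantize_row_absmax(r, bits) for r in weight]
        # A (rank x in_f): small deterministic values here; LoRA usually initializes A randomly.
        self.A = [[0.01 * ((i + j) % 3 - 1) for j in range(in_f)] for i in range(rank)]
        # B (out_f x rank) starts at zero, so the layer initially equals the quantized base.
        self.B = [[0.0] * rank for _ in range(out_f)]
        self.scaling = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        w = [dequantize_row(codes, scale) for codes, scale in self.q]
        base = matvec(w, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * u for b, u in zip(base, update)]


W = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.75]]
layer = QuantLoRALinear(W, bits=4, rank=2, alpha=4.0)
print(layer.forward([1.0, 2.0, 3.0]))
```

In this sketch, only A and B would be updated during finetuning; the integer codes and per-row scales stay fixed, which is what keeps the memory footprint of the base model small.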
Authors: Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno
Abstract: The LoRA-finetuning quantization of LLMs has been extensively studied to obtain accurate yet compact LLMs for deployment on resource-constrained hardware. However, existing methods cause the quantized LLM to severely degrade and even fail to benefit from the finetuning of LoRA. This paper proposes a novel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate through information retention. The proposed IR-QLoRA mainly relies on two technologies derived from the perspective of unified information: (1) statistics-based Information Calibration Quantization allows the quantized parameters of LLM to retain original information accurately; (2) finetuning-based Information Elastic Connection makes LoRA utilize elastic representation transformation with diverse information. Comprehensive experiments show that IR-QLoRA can significantly improve accuracy across LLaMA and LLaMA2 families under 2-4 bit-widths, e.g., 4-bit LLaMA-7B achieves 1.4% improvement on MMLU compared with the state-of-the-art methods. The significant performance gain requires only a tiny 0.31% additional time consumption, revealing the satisfactory efficiency of our IR-QLoRA. We highlight that IR-QLoRA enjoys excellent versatility, compatible with various frameworks (e.g., NormalFloat and Integer quantization) and brings general accuracy gains. The code is available at https://github.com/htqin/ir-qlora.
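The abstract does not spell out how Information Calibration Quantization works. Purely to illustrate what "retaining information" in a quantizer could mean, the sketch below assumes that information is measured as the Shannon entropy of the quantized codes and that calibration means grid-searching a hypothetical shift `tau` around the weight median. This is an assumption-driven toy, not the authors' algorithm, and every name in it is hypothetical.

```python
import math
import statistics
from collections import Counter
from typing import List


def quantize(weights: List[float], bits: int, tau: float) -> List[int]:
    """Shift weights by tau, then absmax-quantize to signed `bits`-bit codes."""
    shifted = [w - tau for w in weights]
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in shifted) / qmax or 1.0  # 1.0 guards an all-zero input
    return [max(-qmax, min(qmax, round(v / scale))) for v in shifted]


def code_entropy(codes: List[int]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of quantized codes."""
    n = len(codes)
    return -sum((k / n) * math.log2(k / n) for k in Counter(codes).values())


def candidate_shifts(weights: List[float], steps: int = 21, span: float = 0.5) -> List[float]:
    """Evenly spaced shifts in [median - span*std, median + span*std] (median included when steps is odd)."""
    med = statistics.median(weights)
    std = statistics.pstdev(weights)
    return [med + span * std * (2 * i / (steps - 1) - 1) for i in range(steps)]


def calibrate_tau(weights: List[float], bits: int, candidates: List[float]) -> float:
    """Return the candidate shift whose quantized codes have maximum entropy."""
    return max(candidates, key=lambda t: code_entropy(quantize(weights, bits, t)))


weights = [math.sin(i) ** 3 for i in range(200)]  # deterministic, non-uniform toy weights
tau = calibrate_tau(weights, bits=4, candidates=candidate_shifts(weights))
print(tau, code_entropy(quantize(weights, 4, tau)))
```

Maximizing code entropy favors a shift under which the weights spread evenly across the available integer levels, which is one way to formalize "retaining information". Because dequantization would need `tau` as well as the scale, a practical variant would have to store both per quantization block.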