Knowledge Distillation of Large Language Models
AI-generated Key Points
- Knowledge Distillation (KD) is used to reduce the computational demand of large language models (LLMs)
- Previous KD methods have focused on white-box classification models or training small models to imitate black-box model APIs
- Effective distillation of knowledge from white-box generative LLMs remains under-explored
- Authors propose MiniLLM, a method for distilling smaller language models from larger generative language models
- The forward Kullback-Leibler divergence (KLD) objective of standard KD is replaced with reverse KLD, which keeps the student from overestimating low-probability regions of the teacher distribution
- An effective optimization approach is derived to learn this objective
- MiniLLM is evaluated through extensive experiments in an instruction-following setting
- Results show more precise responses, higher overall quality, lower exposure bias, better calibration, and improved long-text generation performance
- Scalable and applicable to different model families ranging from 120M to 13B parameters
- Code and model checkpoints will be released for further exploration
- White-box KD becomes valuable for leveraging teacher parameters in addition to existing techniques like black-box KD
- MiniLLM offers a promising approach for reducing computational demands while maintaining high-quality responses
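For reference, the two objectives contrasted in the key points can be written out as follows. This is the standard textbook definition of the two divergences, not a formula quoted from the paper; the symbols are assumptions introduced here: $p$ is the teacher distribution, $q_\theta$ the student distribution, $x$ the prompt, and $y$ a candidate response.

```latex
% Forward KLD (standard KD): expectation taken under the teacher p
\mathrm{KL}\left[p \,\|\, q_\theta\right]
  = \sum_{y} p(y \mid x)\, \log \frac{p(y \mid x)}{q_\theta(y \mid x)}

% Reverse KLD (the direction MiniLLM adopts): expectation taken under the student q_\theta
\mathrm{KL}\left[q_\theta \,\|\, p\right]
  = \sum_{y} q_\theta(y \mid x)\, \log \frac{q_\theta(y \mid x)}{p(y \mid x)}
```

In the forward direction the student is penalized wherever the teacher has mass that the student misses, so a small student tends to spread probability widely. In the reverse direction the student is penalized wherever it places mass that the teacher does not, which discourages it from overestimating the teacher's low-probability regions.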
Authors: Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge from white-box generative LLMs is still under-explored, which becomes more and more important with the prosperity of LLMs. In this work, we propose MiniLLM that distills smaller language models from generative larger language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. Extensive experiments in the instruction-following setting show that the MiniLLM models generate more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance. Our method is also scalable for different model families with 120M to 13B parameters. We will release our code and model checkpoints at https://aka.ms/MiniLLM.
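The abstract's claim that reverse KLD keeps the student from overestimating low-probability regions can be illustrated with a toy example. The sketch below is not the MiniLLM training procedure: the four-token distributions and the names `teacher`, `student_spread`, and `student_peaked` are hypothetical values chosen for illustration. It only compares how each divergence ranks two fixed candidate students.

```python
import math


def kl_divergence(a, b):
    """KL(a || b) for two discrete distributions given as equal-length lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def forward_kl(teacher, student):
    """Forward KLD, KL(teacher || student): the standard KD objective."""
    return kl_divergence(teacher, student)


def reverse_kl(teacher, student):
    """Reverse KLD, KL(student || teacher): the direction MiniLLM adopts."""
    return kl_divergence(student, teacher)


# Hypothetical teacher over four tokens: two strong modes, two unlikely tokens.
teacher = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical low-capacity students:
# one spreads mass over every token, one commits to a single teacher mode.
student_spread = [0.25, 0.25, 0.25, 0.25]
student_peaked = [0.97, 0.01, 0.01, 0.01]

for name, student in [("spread", student_spread), ("peaked", student_peaked)]:
    print(
        f"{name:>6}: forward KL = {forward_kl(teacher, student):.3f}, "
        f"reverse KL = {reverse_kl(teacher, student):.3f}"
    )
```

With these numbers, forward KLD scores the spread-out student as the better fit even though it puts half of its mass on tokens the teacher considers unlikely, while reverse KLD prefers the student that commits to one of the teacher's modes.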