Knowledge Distillation of Large Language Models

AI-generated keywords: Knowledge Distillation, MiniLLM, Generative Language Models, Reverse KLD, Optimization

AI-generated Key Points

  • Knowledge Distillation (KD) is used to reduce the computational demand of large language models (LLMs)
  • Previous KD methods have focused on white-box classification models or on training small models to imitate black-box model APIs such as ChatGPT
  • Distilling knowledge from white-box generative LLMs remains under-explored
  • Authors propose MiniLLM, a method for distilling smaller language models from larger generative language models
  • The forward Kullback-Leibler divergence (KLD) objective of standard KD is replaced with reverse KLD, so that the student does not overestimate low-probability regions of the teacher distribution
  • The authors derive an optimization approach for learning this objective
  • MiniLLM demonstrated through extensive experiments in an instruction-following setting
  • Results show more precise responses, higher overall quality, lower exposure bias, better calibration, and improved long-text generation performance
  • The method scales across different model families with 120M to 13B parameters
  • Code and model checkpoints will be released for further exploration
  • White-box KD, which can use the teacher's parameters, complements black-box KD, which relies only on the teacher's predictions
  • MiniLLM offers a promising approach for reducing computational demands while maintaining high-quality responses

Authors: Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

20 pages, 12 figures
License: CC BY 4.0

Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge from white-box generative LLMs is still under-explored, which becomes more and more important with the prosperity of LLMs. In this work, we propose MiniLLM that distills smaller language models from generative larger language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. Extensive experiments in the instruction-following setting show that the MiniLLM models generate more precise responses with the higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance. Our method is also scalable for different model families with 120M to 13B parameters. We will release our code and model checkpoints at https://aka.ms/MiniLLM.
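For readers unfamiliar with the two objectives named in the abstract, the standard definitions can be written as follows. This is a sketch, not taken from the text above: the symbols p (teacher distribution), q_θ (student distribution), x (prompt) and y (response) are assumed here for illustration.

```latex
% Notation is an assumption for illustration: p is the teacher distribution,
% q_\theta is the student distribution, x is the prompt, y is the response.
\begin{align}
\text{forward KLD:}\quad
\mathrm{KL}\left[p \,\|\, q_\theta\right]
  &= \mathbb{E}_{y \sim p(\cdot \mid x)}
     \left[\log \frac{p(y \mid x)}{q_\theta(y \mid x)}\right] \\
\text{reverse KLD:}\quad
\mathrm{KL}\left[q_\theta \,\|\, p\right]
  &= \mathbb{E}_{y \sim q_\theta(\cdot \mid x)}
     \left[\log \frac{q_\theta(y \mid x)}{p(y \mid x)}\right]
\end{align}
```

The forward form takes its expectation under the teacher, so it penalizes the student heavily wherever the teacher has mass and the student has little; a student with limited capacity tends to respond by spreading probability widely, including over regions the teacher considers unlikely. The reverse form takes its expectation under the student, so it penalizes the student for placing mass where the teacher assigns low probability.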

Submitted to arXiv on 14 Jun. 2023


Results of the summarizing process for the arXiv paper: 2306.08543v1

Knowledge Distillation (KD) is a technique used to reduce the computational demand of large language models (LLMs). Previous KD methods have primarily focused on white-box classification models or on training small models to imitate black-box model APIs such as ChatGPT. How to distill knowledge effectively from white-box generative LLMs remains under-explored, and the question becomes more important as LLMs grow more widespread.

In this work, the authors propose MiniLLM, a method for distilling smaller language models from larger generative language models. They replace the forward Kullback-Leibler divergence (KLD) objective used in standard KD approaches with reverse KLD, which they argue is better suited to generative language models because it prevents the student model from overestimating low-probability regions of the teacher distribution. They also derive an optimization approach for learning this objective.

Experiments in an instruction-following setting show that MiniLLM models generate more precise responses with higher overall quality, lower exposure bias, better calibration, and improved long-text generation performance. The method scales across different model families with 120M to 13B parameters. The authors plan to release their code and model checkpoints.

Black-box KD relies only on the teacher's predictions, whereas white-box KD can also use the teacher's parameters, which makes it valuable to both research communities and industry. Overall, this work extends knowledge distillation to generative language models and offers a way to reduce computational demand while maintaining response quality.
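The difference between the two objectives can be seen in a toy example. The sketch below is not the MiniLLM training procedure; it only compares forward and reverse KLD for two hand-picked candidate students against a made-up four-token teacher distribution. All distributions and names (`teacher`, `covering`, `mode_seeking`, `kl`) are hypothetical and chosen for illustration.

```python
import math


def kl(a, b):
    """Return KL(a || b) for two discrete distributions given as probability lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# Hypothetical teacher: two likely tokens and two very unlikely ones.
teacher = [0.48, 0.02, 0.02, 0.48]

# Two hypothetical students that cannot match the teacher exactly.
covering = [0.25, 0.25, 0.25, 0.25]      # spreads mass over every token
mode_seeking = [0.94, 0.02, 0.02, 0.02]  # commits to one of the teacher's modes

for name, student in [("covering", covering), ("mode_seeking", mode_seeking)]:
    forward = kl(teacher, student)   # expectation under the teacher
    reverse = kl(student, teacher)   # expectation under the student
    print(f"{name:>12}: forward KLD = {forward:.3f}, reverse KLD = {reverse:.3f}")
```

In this toy setting the forward KLD is lower for the covering student, even though it puts half of its mass on tokens the teacher rates at 4% in total, while the reverse KLD is lower for the mode-seeking student, which keeps the unlikely tokens unlikely. This is the overestimation of low-probability regions that the summary describes.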
Created on 12 Jul. 2023
