SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

AI-generated keywords: Sparse Pre-training

AI-generated Key Points

  • The pre-training and fine-tuning paradigm has revolutionized Natural Language Processing (NLP)
  • Pre-training on large datasets before fine-tuning on task-specific data is computationally expensive
  • Sparse Pre-training and Dense Fine-tuning (SPDF) approach uses unstructured weight sparsity to train only a subset of weights during pre-training
  • Dense fine-tuning recovers the representational capacity by allowing the zeroed weights to learn
  • SPDF effectiveness evaluated on natural language generation (NLG) tasks such as E2E, WebNLG, and DART, and on text summarization tasks such as Curation Corpus, using GPT-2 Small and GPT-3 XL models
  • High degrees of weight sparsity can be induced during pre-training without significant degradation in accuracy across all NLG tasks
  • Performance of the sparse pre-trained model is correlated with the difficulty of the fine-tuning task
  • As model size increases, the model becomes more amenable to higher sparsity levels
  • GPT models can be pre-trained with 50%-75% sparsity without losing significant accuracy on downstream tasks
  • Fully sparse end-to-end training can prevent models from generalizing well on downstream tasks
  • Transitioning from sparse to dense matrices during fine-tuning mitigates the poor generalizability caused by sparse-only training
  • SPDF presents a promising direction for training large GPT models at a fraction of the training FLOPs while retaining benefits for downstream NLP tasks
  • Study establishes a relationship between sparsity, task complexity, and dataset size, and provides insights into how best to use SPDF for efficient training of large-scale language models

Authors: Vithursan Thangarasa, Abhay Gupta, William Marshall, Tianda Li, Kevin Leong, Dennis DeCoste, Sean Lie, Shreyas Saxena

Presented at the ICLR 2023 Workshop on Sparsity in Neural Networks
License: CC BY 4.0

Abstract: The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning and the model capacity often remains the same between the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity while retaining the benefits of pre-trained textual representations for downstream tasks.
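
A reader may wonder why 75% sparsity is reported as a 2.5x reduction in pre-training FLOPs rather than 4x. One plausible accounting, offered here as an illustration and not taken from the paper, is that only part of the dense FLOPs sits in layers that are sparsified. The function name and the `sparsifiable_fraction` parameter below are hypothetical, and the value 0.8 is back-solved from the reported numbers, not a figure stated by the authors.

```python
def flop_reduction(sparsity, sparsifiable_fraction):
    """Speed-up in FLOPs when only part of the computation is sparsified.

    sparsity: fraction of weights set to zero in the sparsified layers.
    sparsifiable_fraction: assumed share of dense FLOPs spent in those
        layers (hypothetical parameter, not a number from the paper).
    """
    remaining = 1.0 - sparsifiable_fraction * sparsity
    return 1.0 / remaining


# If every FLOP were sparsifiable, 75% sparsity would give 4x.
all_sparse = flop_reduction(0.75, 1.0)

# With an assumed 80% of FLOPs in sparsified layers, the figure is about 2.5x.
partial = flop_reduction(0.75, 0.8)

print(all_sparse, round(partial, 2))
```

Under this simple model, the parts of the network that stay dense cap the achievable saving, which is consistent with the gap between 4x and the reported 2.5x.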

Submitted to arXiv on 18 Mar. 2023

Results of the summarizing process for the arXiv paper: 2303.10464v1

The pre-training and fine-tuning paradigm has revolutionized Natural Language Processing (NLP) by enabling language models to learn cross-domain knowledge through pre-training on large datasets before fine-tuning on task-specific data. However, the computational costs of pre-training are prohibitively high, often requiring orders of magnitude more FLOPs than fine-tuning. To address this challenge, the authors propose Sparse Pre-training and Dense Fine-tuning (SPDF). This approach uses unstructured weight sparsity to train only a subset of weights during pre-training and then recovers the representational capacity by allowing the zeroed weights to learn during dense fine-tuning. The effectiveness of SPDF has been evaluated on natural language generation (NLG) tasks such as E2E, WebNLG, and DART, and on text summarization tasks such as Curation Corpus, using GPT-2 Small and GPT-3 XL models. The study validates three hypotheses: first, high degrees of weight sparsity can be induced during pre-training without significant degradation in accuracy across all NLG tasks; second, the performance of the sparse pre-trained model is correlated with the difficulty of the fine-tuning task; third, as the size of the model increases, it becomes more amenable to higher sparsity levels. The results indicate that these GPT models can be pre-trained with 50%-75% sparsity without losing significant accuracy on downstream tasks. However, fully sparse end-to-end training can prevent models from generalizing well on downstream tasks. Transitioning from sparse to dense matrices during fine-tuning therefore mitigates the poor generalizability caused by sparse-only training. Overall, SPDF presents a promising direction for training large GPT models at a fraction of the training FLOPs while retaining benefits for downstream NLP tasks. The study establishes a relationship between sparsity, task complexity, and dataset size, and provides insights into how best to use SPDF for efficient training of large-scale language models.
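
The two phases described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: it assumes a random static mask chosen at initialization, uses a tiny linear model in place of a GPT, and all names (`make_mask`, `sgd_step`) and hyperparameters are hypothetical.

```python
import random


def make_mask(n, sparsity, rng):
    """Random unstructured mask: 1 = trainable weight, 0 = weight held at zero."""
    n_zero = round(n * sparsity)
    mask = [0] * n_zero + [1] * (n - n_zero)
    rng.shuffle(mask)
    return mask


def sgd_step(w, x, y, lr, mask=None):
    """One SGD step on squared error for the linear model y ~ w . x.

    If a mask is given, gradients of masked weights are zeroed,
    so those weights stay exactly at zero (sparse phase).
    """
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    grad = [2.0 * err * xi for xi in x]
    if mask is not None:
        grad = [g * m for g, m in zip(grad, mask)]
    return [wi - lr * g for wi, g in zip(w, grad)]


def sample(rng, n):
    """Toy data: the target is the sum of the inputs (true weights all 1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return x, sum(x)


rng = random.Random(0)
n = 8
mask = make_mask(n, 0.75, rng)             # 75% of weights are zeroed
w = [rng.gauss(0.0, 0.1) * m for m in mask]

# Phase 1: sparse pre-training, only unmasked weights receive updates.
for _ in range(200):
    x, y = sample(rng, n)
    w = sgd_step(w, x, y, lr=0.01, mask=mask)
w_after_sparse = list(w)

# Phase 2: dense fine-tuning, the mask is dropped and the
# previously zeroed weights start learning from zero.
for _ in range(200):
    x, y = sample(rng, n)
    w = sgd_step(w, x, y, lr=0.01)
w_after_dense = list(w)
```

After the first loop, the masked entries of `w_after_sparse` are still zero. After the second loop they have moved away from zero, which is the capacity recovery the summary describes.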
Created on 07 Jun. 2023

