SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
AI-generated Key Points
- The pre-training and fine-tuning paradigm has revolutionized Natural Language Processing (NLP)
- Pre-training on large datasets before fine-tuning on task-specific data is computationally expensive
- The Sparse Pre-training and Dense Fine-tuning (SPDF) approach uses unstructured weight sparsity to train only a subset of weights during pre-training
- Representational capacity is then recovered by allowing the zeroed weights to learn during dense fine-tuning
- SPDF is evaluated with GPT-2 Small and GPT-3 XL models on natural language generation (NLG) tasks (E2E, WebNLG, and DART) and on text summarization (Curation Corpus)
- High degrees of weight sparsity can be induced during pre-training without significant degradation in accuracy across all NLG tasks
- The performance of a sparse pre-trained model is correlated with the difficulty of the fine-tuning task
- As model size increases, the model becomes more amenable to higher sparsity levels
- GPT models can be pre-trained with 50%-75% sparsity without losing significant accuracy on downstream tasks
- Fully sparse end-to-end training can prevent models from generalizing well on downstream tasks
- Transitioning from sparse to dense weight matrices during fine-tuning mitigates the poor generalization caused by sparse-only training
- SPDF presents a promising direction for training large GPT models at a fraction of the training FLOPs while retaining benefits for downstream NLP tasks.
- The study establishes a relationship between sparsity, task complexity, and dataset size, and provides insights into how best to use SPDF for efficient training of large-scale language models.
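The two phases described in the key points can be illustrated with a toy example. This is only a minimal sketch and not the authors' implementation: it assumes a random, static unstructured mask, uses a small linear model instead of a GPT model, and reuses the same data for both phases, whereas the paper fine-tunes on separate downstream tasks. All names (`make_mask`, `train`, and so on) are hypothetical.

```python
import random

def make_mask(n_weights, sparsity, seed=0):
    """Random unstructured mask: 1 = trainable, 0 = held at zero (assumed scheme)."""
    rng = random.Random(seed)
    n_zero = int(round(sparsity * n_weights))
    mask = [0] * n_zero + [1] * (n_weights - n_zero)
    rng.shuffle(mask)
    return mask

def predict(weights, x):
    """Linear model y = w . x (a stand-in for a real network)."""
    return sum(w * xi for w, xi in zip(weights, x))

def mse(weights, xs, ys):
    """Mean squared error over the dataset."""
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad_mse(weights, xs, ys):
    """Gradient of the mean squared error with respect to each weight."""
    grads = [0.0] * len(weights)
    for x, y in zip(xs, ys):
        err = predict(weights, x) - y
        for j, xi in enumerate(x):
            grads[j] += 2.0 * err * xi / len(xs)
    return grads

def train(weights, xs, ys, steps, lr, mask=None):
    """Gradient descent; with a mask, zeroed weights receive no update."""
    if mask is None:
        mask = [1] * len(weights)  # dense phase: every weight may learn
    for _ in range(steps):
        grads = grad_mse(weights, xs, ys)
        weights = [w - lr * g * m for w, g, m in zip(weights, grads, mask)]
    return weights

# Toy regression data (hypothetical, for illustration only).
rng = random.Random(1)
dim = 8
true_w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
xs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(64)]
ys = [predict(true_w, x) for x in xs]

# Phase 1, sparse pre-training: 75% of the weights start at zero and stay there.
mask = make_mask(dim, sparsity=0.75)
init = [rng.uniform(-0.1, 0.1) * m for m in mask]
sparse_w = train(init, xs, ys, steps=200, lr=0.1, mask=mask)

# Phase 2, dense fine-tuning: drop the mask so the zeroed weights can learn.
dense_w = train(sparse_w, xs, ys, steps=200, lr=0.1)

sparse_loss = mse(sparse_w, xs, ys)
dense_loss = mse(dense_w, xs, ys)
print(f"loss after sparse phase: {sparse_loss:.4f}")
print(f"loss after dense phase:  {dense_loss:.4f}")
```

In this toy setting the dense phase starts from the sparse solution and only has to fit the weights that were previously held at zero, which loosely mirrors the capacity recovery the key points describe.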
Authors: Vithursan Thangarasa, Abhay Gupta, William Marshall, Tianda Li, Kevin Leong, Dennis DeCoste, Sean Lie, Shreyas Saxena
Abstract: The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity often remains the same between the two phases. To achieve training efficiency with respect to training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity while retaining the benefits of pre-trained textual representations for downstream tasks.
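A reader may wonder why 75% sparsity gives a 2.5x rather than a 4x reduction in FLOPs. One plausible reading, stated here as an assumption and not taken from the abstract, is that only part of the pre-training compute sits in the sparsified weight matrices, with the remainder staying dense. The sketch below works through that arithmetic; the function names and the simple cost model are hypothetical.

```python
def flop_reduction(sparsity, sparsifiable_fraction):
    """Reduction factor if only `sparsifiable_fraction` of FLOPs shrinks with sparsity.

    Assumed cost model: FLOPs in sparsified layers scale with (1 - sparsity),
    and all other FLOPs are unchanged.
    """
    remaining = 1.0 - sparsity * sparsifiable_fraction
    return 1.0 / remaining

def implied_sparsifiable_fraction(sparsity, reduction):
    """Invert the model: the share of FLOPs that must be sparsifiable."""
    return (1.0 - 1.0 / reduction) / sparsity

# Numbers from the abstract: 75% sparsity, 2.5x fewer pre-training FLOPs.
fraction = implied_sparsifiable_fraction(0.75, 2.5)
print(f"implied sparsifiable share of FLOPs: {fraction:.2f}")

# If every FLOP were sparsifiable, 75% sparsity would give a 4x reduction.
print(f"reduction with all FLOPs sparsifiable: {flop_reduction(0.75, 1.0):.1f}x")
```

Under this simple model the abstract's figures are consistent with roughly four fifths of the pre-training FLOPs falling in sparsified layers; the paper itself should be consulted for the actual accounting.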