Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
AI-generated Key Points
- The paper proposes a new optimizer called Sophia
- Sophia is a scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner
- The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping
- Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead
- The authors evaluate Sophia on auto-regressive language modeling with GPT-2 models ranging from 125M to 770M parameters
- Results show that Sophia achieves a 2x speed-up over AdamW and Lion and follows a more favorable scaling law than AdamW
- On the theory side, Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks
- The run-time bound does not depend on the condition number of the loss
- The experimental setup involves training autoregressive models on OpenWebText using a decoder-only architecture with a context length of 1024
- The authors use five prompts for evaluation purposes
- Overall, the paper presents an efficient optimizer that can significantly reduce the time and cost of language model pre-training while adapting to heterogeneous curvature across parameters.
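The update described in the key points can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the hyperparameter names (`lr`, `beta1`, `beta2`, `gamma`, `eps`, `rho`), their default values, and the `max(gamma * h, eps)` safeguard in the denominator are all assumptions made for the example.

```python
import numpy as np


def sophia_like_step(theta, m, h, grad, lr=0.1, beta1=0.9,
                     gamma=1.0, eps=1e-12, rho=1.0):
    """One clipped, diagonally preconditioned step (illustrative sketch).

    theta: parameters, m: moving average of gradients,
    h: moving average of the estimated diagonal Hessian, grad: current gradient.
    """
    # Moving average of the gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # Divide by the Hessian average; the floor (an assumption here) keeps the
    # denominator positive when the curvature estimate is tiny or negative.
    ratio = m / np.maximum(gamma * h, eps)
    # Element-wise clipping bounds every coordinate's move by lr * rho.
    update = np.clip(ratio, -rho, rho)
    return theta - lr * update, m


def refresh_hessian_ema(h, h_hat, beta2=0.99):
    """Fold a fresh diagonal-Hessian estimate h_hat into the moving average."""
    return beta2 * h + (1.0 - beta2) * h_hat


if __name__ == "__main__":
    # Toy quadratic 0.5 * sum(d * theta**2) with very different curvatures.
    d = np.array([100.0, 1.0])
    theta, m = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        theta, m = sophia_like_step(theta, m, d, d * theta)
    print("final loss:", 0.5 * np.sum(d * theta ** 2))
```

On the toy quadratic, dividing by the diagonal curvature makes both coordinates shrink at the same rate even though their curvatures differ by a factor of 100, which is the kind of heterogeneity the key points refer to. When the curvature estimate is near zero or negative, the clipped ratio saturates and the step reduces to a fixed-size move in the direction of the gradient average.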
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Abstract: Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time. Theoretically, we show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks. Our run-time bound does not depend on the condition number of the loss.
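The abstract does not say how the diagonal Hessian is estimated. As one possible illustration, the sketch below uses Hutchinson's estimator, a standard unbiased estimator of a matrix diagonal built from Hessian-vector products, and refreshes the moving average only every `k` steps. The choice of estimator, the Rademacher probes, the helper names `hutchinson_diag` and `maybe_refresh`, and the values of `k` and `beta2` are assumptions for the example; an explicit matrix stands in for the Hessian-vector product that an autodiff framework would normally supply.

```python
import numpy as np


def hutchinson_diag(hvp, dim, rng, num_samples=1):
    """Estimate diag(H) as the average of u * (H u) over random sign vectors u."""
    est = np.zeros(dim)
    for _ in range(num_samples):
        u = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe (assumed choice)
        est += u * hvp(u)                      # E[u * (H u)] = diag(H)
    return est / num_samples


def maybe_refresh(h, t, k, beta2, estimate_fn):
    """Update the Hessian moving average only on steps where t % k == 0."""
    if t % k == 0:
        h = beta2 * h + (1.0 - beta2) * estimate_fn()
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = np.array([[2.0, 0.5], [0.5, 3.0]])  # toy symmetric "Hessian"
    h = np.zeros(2)
    for t in range(100):
        # With k = 10 the estimator runs on only 10 of the 100 steps.
        h = maybe_refresh(h, t, 10, 0.99,
                          lambda: hutchinson_diag(lambda v: H @ v, 2, rng))
    print("Hessian moving average:", h)
```

Because the estimate is computed on one step in every `k`, its cost is amortized over the intervening steps, which is consistent with the abstract's claim of negligible average per-step overhead.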