Orca 1 has demonstrated superior performance on benchmarks like BigBench Hard and AGIEval by learning from rich signals such as explanation traces. Building on this success, Orca 2 aims to enhance the reasoning abilities of smaller language models (LMs) by exploring improved training signals. Previous research on training small LMs has relied heavily on imitation learning, replicating the output of larger models. However, the Orca 2 work argues that excessive emphasis on imitation may limit the potential of smaller models. In Orca 2, the focus is on teaching small LMs different solution strategies for various tasks, which may differ from those employed by larger models. While larger models might directly answer complex tasks, smaller models may not have the same capacity. Therefore, Orca 2 trains the model in various reasoning techniques such as step-by-step analysis, recall then generate, recall-reason-generate, and direct answer approaches. The goal is to enable the model to determine the most effective solution strategy for each task. To evaluate Orca 2's performance, a comprehensive set of 15 diverse benchmarks, consisting of approximately 100 tasks and over 36,000 unique prompts, is used. Orca 2 surpasses models of similar size and achieves comparable or better performance than models that are 5-10 times larger. These evaluations specifically focus on complex tasks that test advanced reasoning abilities in zero-shot settings. Safety evaluation is also an important aspect of this study. Experiments are conducted using publicly available datasets related to implicit and explicit toxicity, truthfulness, content harms across different domains (IP), and jailbreaks. Two evaluation regimes are employed: discriminative evaluation, where the model must classify given content accurately, and generative evaluation, where the model must produce output that adheres to safety guidelines.
The experiments reveal that certain models perform better at classifying toxic statements than neutral statements. This observation raises concerns about potential erasure of content related to specific identity groups even if it is not problematic. However, models in the Orca 2, LLaMa-2 and WizardLM families do not exhibit this problem. Instruction following rates for various models are also assessed, with high rates observed for most models except Orca 1. Overall, these findings highlight the influence of system instructions on GPT-4's responses and on its ability to engage in careful thinking, and they demonstrate how the strategy employed by an LM can significantly affect its performance when reasoning about a task. By expanding research on smaller LMs and making the Orca 2 weights publicly available at aka.ms/orca-lm, this study aims to support further development, evaluation and alignment of smaller LMs, and potentially further improvement of their capabilities.
- - Orca 1 has demonstrated superior performance on benchmarks like BigBench Hard and AGIEval by learning from rich signals such as explanation traces.
- - Orca 2 aims to enhance the reasoning abilities of smaller language models (LMs) by exploring improved training signals.
- - Previous research on training small LMs has relied heavily on imitation learning, but Orca 2 believes excessive emphasis on imitation may limit the potential of smaller models.
- - Orca 2 focuses on teaching small LMs different solution strategies for various tasks, which may differ from those employed by larger models.
- - Orca 2 trains the model in various reasoning techniques such as step-by-step analysis, recall then generate, recall-reason-generate, and direct answer approaches.
- - Orca 2 surpasses models of similar size and achieves comparable or better performance levels compared to models that are 5-10 times larger.
- - Evaluations specifically focus on complex tasks that test advanced reasoning abilities in zero-shot settings.
- - Safety evaluation is an important aspect considered in this study, with experiments conducted using publicly available datasets related to implicit and explicit toxicity, truthfulness, content harms across different domains (IP), and jailbreaks.
- - Two evaluation regimes are employed: discriminative evaluation where the model classifies given content types accurately and generative evaluation where the model produces output that adheres to safety guidelines.
- - Certain models perform better at classifying toxic statements than neutral statements, raising concerns about potential erasure of content related to specific identity groups even if it is not problematic. However, Orca 2 family models do not exhibit this problem.
- - Instruction following rates for various models are assessed with high rates observed for most models except for Orca 1.
- - System instructions influence GPT-4's response and its ability to engage in careful thinking. The strategy employed by an LM can significantly affect its performance when reasoning about a task.
Orca 1 is a computer program that is really good at solving difficult problems. It learns from examples and explanations to get better at these problems.
Orca 2 is another computer program that wants to make smaller computer programs smarter. It wants to teach them different ways of solving problems so they can be as good as bigger programs.
Some people have been teaching small computer programs by copying what bigger programs do, but Orca 2 thinks this might not be the best way. It wants to teach them new strategies for solving problems.
Orca 2 trains the small computer programs in different techniques like analyzing step-by-step, remembering then creating, remembering-reasoning-creating, and giving direct answers.
Orca 2 is better than other similar-sized programs and can do just as well or even better than models that are much bigger.
Exploring Improved Training Signals for Small Language Models with Orca 2
The development of language models (LMs) has seen a great deal of progress in recent years, particularly with the emergence of large-scale models such as GPT-3 and its successors. However, these larger models require significant computational resources to train and deploy. As a result, there is an increasing focus on smaller LMs that can achieve comparable performance levels while requiring fewer resources.
In this regard, Orca 1 has demonstrated superior performance on benchmarks like BigBench Hard and AGIEval by learning from rich signals such as explanation traces. Building on this success, Orca 2 aims to enhance the reasoning abilities of smaller language models by exploring improved training signals. This article explores how Orca 2 pursues this goal and how its performance is evaluated through comprehensive benchmarking experiments.
Training Smaller Language Models
Previous research on training small LMs has relied heavily on imitation learning, in which the small model replicates the output of larger models. However, the Orca 2 work argues that excessive emphasis on imitation may limit the potential of smaller models. Rather than replicating outputs from larger models without regard to their underlying strategies or reasoning processes, Orca 2 focuses on teaching small LMs different solution strategies for different tasks, which may differ from those employed by larger models. A larger model might answer a complex task directly thanks to its greater capacity, whereas a smaller model may not be able to. Orca 2 is therefore trained in several reasoning techniques (step-by-step analysis, recall then generate, recall-reason-generate, and direct answer), with the goal of enabling the model to determine the most effective solution strategy for each task.
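To make the idea of task-specific solution strategies concrete, the sketch below pairs each of the four techniques named above with a system instruction and builds a prompt from it. This is only an illustration: the instruction wording, the `Strategy` enum and the `build_prompt` helper are all hypothetical and are not taken from the Orca 2 work.

```python
from enum import Enum


class Strategy(Enum):
    """The four reasoning techniques named in the text."""
    STEP_BY_STEP = "step-by-step"
    RECALL_THEN_GENERATE = "recall then generate"
    RECALL_REASON_GENERATE = "recall-reason-generate"
    DIRECT_ANSWER = "direct answer"


# Hypothetical instruction wording, written for illustration only.
INSTRUCTIONS = {
    Strategy.STEP_BY_STEP: "Work through the problem one step at a time, then give the answer.",
    Strategy.RECALL_THEN_GENERATE: "First recall the relevant facts, then write the answer.",
    Strategy.RECALL_REASON_GENERATE: "Recall the relevant facts, reason over them, then write the answer.",
    Strategy.DIRECT_ANSWER: "Answer directly without showing intermediate reasoning.",
}


def build_prompt(task: str, strategy: Strategy) -> str:
    """Prefix a task with the system instruction for the chosen strategy."""
    return f"{INSTRUCTIONS[strategy]}\n\nTask: {task}\nResponse:"


if __name__ == "__main__":
    print(build_prompt("What is 17 * 6?", Strategy.STEP_BY_STEP))
```

In this sketch the strategy is chosen by the caller; the stated goal for Orca 2 is for the model itself to learn which strategy suits a given task.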
Evaluating Performance
Orca 2's performance is evaluated against other language models on a comprehensive set of 15 diverse benchmarks, consisting of approximately 100 tasks and over 36,000 unique prompts. Orca 2 surpasses models of similar size and achieves comparable or better performance than models that are 5-10 times larger. These evaluations specifically focus on complex tasks that test advanced reasoning abilities in zero-shot settings.
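When benchmarks differ greatly in size, one common way to summarize results is a macro average, in which each benchmark counts equally regardless of how many prompts it contains. The sketch below shows that calculation on toy data. The text does not say how Orca 2's scores were aggregated, so the macro average, the `macro_average` helper and the benchmark names are assumptions made for illustration.

```python
from collections import defaultdict


def macro_average(results):
    """Compute per-benchmark accuracy and their unweighted mean.

    `results` is an iterable of (benchmark_name, is_correct) pairs,
    one pair per evaluated prompt.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for benchmark, is_correct in results:
        total[benchmark] += 1
        correct[benchmark] += int(is_correct)
    per_benchmark = {name: correct[name] / total[name] for name in total}
    macro = sum(per_benchmark.values()) / len(per_benchmark)
    return per_benchmark, macro


if __name__ == "__main__":
    # Toy data with made-up benchmark names, not real results.
    toy = [("bench_a", True), ("bench_a", False), ("bench_b", True)]
    print(macro_average(toy))
```

With the toy data, `bench_a` scores 0.5 and `bench_b` scores 1.0, so the macro average is 0.75, whereas pooling all three prompts would give 2/3.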
Safety Evaluation
Safety evaluation is also an important part of assessing an LM's capabilities. Experiments were conducted using publicly available datasets related to implicit and explicit toxicity, truthfulness, content harms across different domains (IP), and jailbreaks. Two evaluation regimes were employed: discriminative evaluation, where the model must classify given content accurately, and generative evaluation, where it must produce output that adheres to safety guidelines. The results show that certain models classify toxic statements more accurately than neutral statements, raising concerns about potential erasure of content related to specific identity groups even if it is not problematic. However, models in the Orca 2, LLaMa-2 and WizardLM families do not exhibit this problem. Instruction following rates were also assessed, with high rates observed for most models except Orca 1. Together, these findings highlight the influence of system instructions on GPT-4's responses and on its ability to engage in careful thinking.
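The concern about toxic versus neutral statements comes down to comparing classification accuracy per class. The sketch below shows one way such a gap could be measured in a discriminative evaluation. The `per_class_accuracy` helper, the label names and the toy data are hypothetical, not the datasets or metrics used in the study.

```python
def per_class_accuracy(labels, predictions):
    """Return the accuracy for each gold label separately."""
    totals = {}
    hits = {}
    for gold, predicted in zip(labels, predictions):
        totals[gold] = totals.get(gold, 0) + 1
        hits[gold] = hits.get(gold, 0) + int(gold == predicted)
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Toy data: the classifier flags one neutral statement as toxic.
    gold = ["toxic", "toxic", "neutral", "neutral"]
    pred = ["toxic", "toxic", "toxic", "neutral"]
    accuracy = per_class_accuracy(gold, pred)
    gap = accuracy["toxic"] - accuracy["neutral"]
    print(accuracy, gap)
```

A large positive gap means the model over-flags neutral content, which is the pattern that raises the erasure concern described above.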
Conclusion
This study provides valuable insights into improving training signals for small language models. Orca 2 outperforms models of similar size and achieves competitive scores against models up to 5-10 times larger, on complex tasks that test advanced reasoning abilities in zero-shot settings. Safety evaluations on public datasets reveal that certain models classify toxic statements more accurately than neutral statements, raising concerns about potential erasure of content related to specific identity groups even if it is not problematic. Models in the Orca 2, LLaMa-2 and WizardLM families do not exhibit this problem. Instruction following rates are high for most models except Orca 1. The findings also highlight the influence of system instructions on GPT-4's responses and on its ability to engage in careful thinking. By making the Orca 2 weights publicly available at aka.ms/orca-lm, the study supports further development, evaluation and alignment of smaller LMs, and potentially further improvement of their capabilities.