Mixtral of Experts

AI-generated Key Points

  • Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model
  • It outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks
  • Each layer in Mixtral is composed of 8 feedforward blocks or experts
  • For every token, at each layer, a router network selects two experts to process the current state and combines their outputs
  • Each token has access to 47B parameters, but only 13B parameters are active during inference
  • Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks
  • There is a fine-tuned version called Mixtral 8x7B - Instruct which surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks for following instructions
  • Both the base and instruct models are released under the Apache 2.0 license
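The gap between 47B accessible and 13B active parameters can be checked with a rough count. The sketch below is illustrative only: every dimension in `MoEConfig` (model width, layer count, expert hidden size, head counts, vocabulary size) is an assumption that does not appear in this summary, and the count assumes gated feedforward experts with three weight matrices each and no bias terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # ASSUMPTION: none of these values are stated in the summary above.
    dim: int = 4096            # model width
    n_layers: int = 32         # transformer layers
    hidden_dim: int = 14336    # inner width of each expert
    n_heads: int = 32          # query heads
    n_kv_heads: int = 8        # key/value heads (grouped-query attention)
    head_dim: int = 128
    vocab_size: int = 32000
    n_experts: int = 8         # from the summary: 8 experts per layer
    experts_per_token: int = 2  # from the summary: 2 experts selected


def count_params(cfg: MoEConfig, experts_used: int) -> int:
    """Count weights, charging each layer for `experts_used` experts."""
    # Attention: query and output projections, then key and value projections.
    attn = 2 * cfg.dim * cfg.n_heads * cfg.head_dim
    attn += 2 * cfg.dim * cfg.n_kv_heads * cfg.head_dim
    # ASSUMPTION: each expert is a gated feedforward block with 3 matrices.
    expert = 3 * cfg.dim * cfg.hidden_dim
    router = cfg.dim * cfg.n_experts  # one logit per expert
    norms = 2 * cfg.dim               # two normalization layers per block
    per_layer = attn + experts_used * expert + router + norms
    # Input embedding, output projection, and the final normalization.
    shared = 2 * cfg.vocab_size * cfg.dim + cfg.dim
    return cfg.n_layers * per_layer + shared


cfg = MoEConfig()
total = count_params(cfg, cfg.n_experts)
active = count_params(cfg, cfg.experts_per_token)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumed dimensions the count comes to roughly 46.7B parameters in total and 12.9B per token, consistent with the rounded 47B and 13B figures. Only the expert weights differ between the two counts; attention, embeddings, and the router are charged in full either way.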

Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

See more details at https://mistral.ai/news/mixtral-of-experts/
License: CC BY 4.0

Abstract: We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
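The routing step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the function names, the toy experts, and the router weights are all hypothetical, and the choice to weight the two expert outputs by a softmax over their router logits is an assumption, since the abstract only says that the outputs are combined.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(
    x: Vector,
    router_weights: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
) -> Vector:
    """Route one token state `x` through its two highest-scoring experts."""
    # One router logit per expert: dot product of a router row with x.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    # Indices of the two largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # ASSUMPTION: gate weights are a softmax over the two selected logits.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out


# Toy setup (hypothetical): 8 experts, where expert i scales its input by i + 1.
experts = [lambda v, s=i + 1: [s * e for e in v] for i in range(8)]
# Router rows chosen so that experts 2 and 5 tie for the highest logit.
router_weights = [[0.0, 0.0] for _ in range(8)]
router_weights[2] = [3.0, 0.0]
router_weights[5] = [3.0, 0.0]

x = [1.0, 0.0]
y = top2_moe(x, router_weights, experts)
print(y)
```

The other six experts are never called for this token, which is why the active parameter count stays well below the total. A different token, or the same token at a different layer, may select a different pair.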

Submitted to arXiv on 08 Jan. 2024


Results of the summarizing process for the arXiv paper: 2401.04088v1

Keywords: Mixtral 8x7B, Sparse Mixture of Experts (SMoE) language model, feedforward blocks, context size of 32k tokens, fine-tuned version called Mixtral 8x7B - Instruct

We present Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model that outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. Mixtral has the same architecture as Mistral 7B, except that each layer is composed of 8 feedforward blocks, or experts. For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. As a result, each token has access to 47B parameters while only 13B parameters are active during inference. Notably, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. Additionally, we offer a fine-tuned version, Mixtral 8x7B - Instruct, which surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks for following instructions. Both the base and instruct models are released under the Apache 2.0 license. For more details about Mixtral of Experts, please refer to our publication [link].
Created on 19 Jan. 2024

