Attention Is All You Need

AI-generated keywords: Transformer, Attention Mechanism, Machine Translation, BLEU Score, Training Time

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Authors propose a new neural network architecture called the Transformer for sequence transduction tasks such as machine translation
  • The Transformer is based solely on attention mechanisms and, unlike traditional models, uses neither recurrence nor convolutions
  • The model outperforms existing models in terms of quality while being more parallelizable and requiring significantly less training time
  • Achieves 28.4 BLEU on the WMT 2014 English-to-German translation task and establishes a new single-model state-of-the-art BLEU score of 41.8 on the WMT 2014 English-to-French translation task after training for only 3.5 days on eight GPUs
  • Successfully applied to English constituency parsing both with large and limited training data sets

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

15 pages, 5 figures

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
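
The abstract names attention as the model's only building block but does not spell out the operation. As a rough illustration only, the sketch below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. The function names and the list-of-lists representation are assumptions made for this example, and it omits multi-head projections, masking, and positional encodings, so it should not be read as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query.

    Each output is a weighted average of the value vectors, where the
    weights come from a softmax over the query-key dot products scaled
    by 1 / sqrt(d_k). All arguments are lists of equal-length float lists
    (hypothetical representation chosen for readability).
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Because each query's output depends only on the full set of keys and values, not on the output for any other position, all positions can be computed independently, which is consistent with the abstract's claim that the model is more parallelizable than recurrent alternatives.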

Submitted to arXiv on 12 Jun. 2017

Results of the summarizing process for the arXiv paper: 1706.03762v5

This paper's license doesn't allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

In their paper "Attention Is All You Need," authors Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin propose a new neural network architecture called the Transformer for sequence transduction tasks such as machine translation. Unlike the dominant models, which rely on complex recurrent or convolutional neural networks in an encoder-decoder configuration, often with an attention mechanism connecting the encoder and decoder, the Transformer is based solely on attention mechanisms and uses neither recurrence nor convolutions. The authors demonstrate through experiments on two machine translation tasks that the Transformer outperforms existing models in quality while being more parallelizable and requiring significantly less training time. Specifically, the model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, and on the WMT 2014 English-to-French translation task it establishes a new single-model state-of-the-art BLEU score of 41.8 after training for only 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. Furthermore, they show that the Transformer generalizes well to other tasks by successfully applying it to English constituency parsing with both large and limited training data.
Created on 17 Mar. 2023
