ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

AI-generated keywords: ProtTrans

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors Ahmed Elnaggar, Michael Heinzinger, and others apply Language Models (LMs) from Natural Language Processing (NLP) to the protein sequence data that computational biology and bioinformatics provide.
  • They trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids.
  • Training was conducted on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores.
  • Dimensionality reduction showed that the raw protein LM-embeddings, learned from unlabeled data, captured some biophysical features of protein sequences.
  • Using the embeddings as the exclusive input, the authors report a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%) and per-protein predictions of sub-cellular localization (ten-state accuracy Q10=81%) and of membrane-bound vs. water-soluble proteins (2-state accuracy Q2=91%).
  • For the per-residue predictions, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches.
  • The authors released their models on GitHub at https://github.com/agemagician/ProtTrans to support future research in this area.
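
The Q3, Q10, and Q2 figures above are k-state accuracies: the share of items (residues for Q3, whole proteins for Q10 and Q2) whose predicted class equals the observed class. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `q_accuracy` and the toy label strings are hypothetical and chosen only for illustration.

```python
def q_accuracy(predicted, observed):
    """Return the fraction of positions where predicted and observed states agree.

    Works for any number of states: 3-state secondary structure per residue,
    or 10-state / 2-state labels per protein.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(observed) == 0:
        raise ValueError("cannot score an empty set of labels")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 3-state strings (H = helix, E = strand, C = other), one letter per residue.
observed = "HHHEEECCC"
predicted = "HHHEECCCC"
q3 = q_accuracy(predicted, observed)
print(f"Q3 = {q3:.1%}")
```

The same function applies to per-protein tasks by passing one label per protein instead of one letter per residue, for example `q_accuracy(["membrane", "soluble"], ["membrane", "membrane"])`.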

Authors: Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost

17 pages, 9 figures, 4 tables

Abstract: Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.
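
The abstract distinguishes per-residue tasks (one prediction per amino acid) from per-protein tasks (one prediction per sequence). A language model yields one embedding vector per residue, so a per-protein task needs a fixed-size summary of those vectors. The sketch below uses mean pooling, which is one common choice; that the authors used this particular pooling is an assumption, since the metadata available here does not say. The function name `mean_pool` and the toy three-dimensional vectors are hypothetical.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-size per-protein vector.

    residue_embeddings: list of equal-length lists of floats, one per residue.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must have the same dimension")
    n = len(residue_embeddings)
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


# Toy example: a 3-residue protein with 3-dimensional embeddings (assumed values).
toy_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
protein_vector = mean_pool(toy_embeddings)
print(protein_vector)
```

The pooled vector has the same length regardless of how many residues the protein has, which is what lets a single downstream classifier handle sequences of different lengths.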

Submitted to arXiv on 13 Jul. 2020

Results of the summarizing process for the arXiv paper: 2007.06225v3

In their paper titled "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing," authors Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost apply Language Models (LMs) from Natural Language Processing (NLP) to protein sequences, which computational biology and bioinformatics supply in vast quantities. Such LMs reach for new prediction frontiers at low inference costs.

The authors trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. Training was carried out on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings, learned from unlabeled data, captured some biophysical features of protein sequences.

The authors then validated the embeddings as the exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure, with a 3-state accuracy of Q3=81%-87%. The second was a pair of per-protein predictions: sub-cellular localization, with a ten-state accuracy of Q10=81%, and membrane-bound vs. water-soluble proteins, with a 2-state accuracy of Q2=91%. For the per-residue predictions, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches.

Taken together, these results implied that protein LMs had learned some of the grammar of the language of life. To facilitate future work, the authors released their models on GitHub at https://github.com/agemagician/ProtTrans. The paper spans 17 pages with 9 figures and 4 tables.
Created on 20 Mar. 2024
