Deep Recurrent Neural Network for Protein Function Prediction from Sequence

AI-generated keywords: Protein Function, Machine Learning, Artificial Recurrent Neural Networks (RNN), Long-Short-Term-Memory (LSTM), Bioinformatics

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • High-throughput biological sequencing has advanced in terms of speed and cost-effectiveness.
  • Extracting meaningful information from these sequences is a challenge due to low-throughput experimental characterizations.
  • Predicting protein functions directly from primary amino-acid sequences is a particular challenge.
  • Machine learning techniques using artificial recurrent neural networks (RNN) are used to address this challenge.
  • RNN models incorporate long-short-term-memory (LSTM) units and are trained on annotated datasets from UniProt.
  • RNN models achieve high performance for in-class prediction of four important protein functions tested.
  • RNN models outperform other machine learning algorithms that use sequence-derived protein features.
  • RNN models can make out-of-class predictions for phylogenetically distinct protein families with similar functions.
  • Trained RNN models predict candidates validated by existing annotations and identify unannotated sequences in the UniRef100 database.
  • Some predictions made by the RNN models for the ferritin-like iron sequestering function were experimentally validated.
  • This machine learning approach based on RNN holds great potential for discovering and predicting homologues for a wide range of protein functions.

Authors: Xueliang Liu

arXiv: 1701.08318v1 (q-bio.QM)

Abstract: As high-throughput biological sequencing becomes faster and cheaper, the need to extract useful information from sequencing becomes ever more paramount, often limited by low-throughput experimental characterizations. For proteins, accurate prediction of their functions directly from their primary amino-acid sequences has been a long standing challenge. Here, machine learning using artificial recurrent neural networks (RNN) was applied towards classification of protein function directly from primary sequence without sequence alignment, heuristic scoring or feature engineering. The RNN models containing long-short-term-memory (LSTM) units trained on public, annotated datasets from UniProt achieved high performance for in-class prediction of four important protein functions tested, particularly compared to other machine learning algorithms using sequence-derived protein features. RNN models were used also for out-of-class predictions of phylogenetically distinct protein families with similar functions, including proteins of the CRISPR-associated nuclease, ferritin-like iron storage and cytochrome P450 families. Applying the trained RNN models on the partially unannotated UniRef100 database predicted not only candidates validated by existing annotations but also currently unannotated sequences. Some RNN predictions for the ferritin-like iron sequestering function were experimentally validated, even though their sequences differ significantly from known, characterized proteins and from each other and cannot be easily predicted using popular bioinformatics methods. As sequencing and experimental characterization data increases rapidly, the machine-learning approach based on RNN could be useful for discovery and prediction of homologues for a wide range of protein functions.
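
To make "directly from primary sequence" concrete, here is a minimal sketch of how an amino-acid string might be turned into the per-residue input vectors that an RNN consumes. The alphabet ordering, the names `encode_sequence` and `one_hot`, the fixed `max_length`, and the choice to map padding and unknown residues to index 0 are all assumptions made for illustration; the metadata does not state how the paper encodes its inputs.

```python
# Hypothetical input encoding for a sequence-based RNN classifier.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Index 0 is reserved for padding and unknown residues (an assumption).
RESIDUE_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_length):
    """Map residues to integer indices, truncating or right-padding with 0."""
    indices = [RESIDUE_TO_INDEX.get(aa, 0) for aa in sequence.upper()[:max_length]]
    return indices + [0] * (max_length - len(indices))


def one_hot(indices):
    """Expand integer indices into one-hot vectors of width 21."""
    width = len(AMINO_ACIDS) + 1
    return [[1 if j == idx else 0 for j in range(width)] for idx in indices]


encoded = encode_sequence("MKTAY", max_length=8)
vectors = one_hot(encoded)
```

No alignment or hand-built descriptor is involved: each residue becomes one vector, and the model sees the residues in order.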

Submitted to arXiv on 28 Jan. 2017


Results of the summarizing process for the arXiv paper: 1701.08318v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

High-throughput biological sequencing has become markedly faster and cheaper, but extracting meaningful information from the resulting sequences remains difficult because experimental characterization is still low-throughput. Accurately predicting a protein's function directly from its primary amino-acid sequence is a particular challenge. In this study, the author addresses it by applying artificial recurrent neural networks (RNN) to classify protein function from primary sequence alone, without sequence alignment, heuristic scoring, or feature engineering. The RNN models incorporate long-short-term-memory (LSTM) units and are trained on publicly available annotated datasets from UniProt.

The RNN models achieve high performance for in-class prediction of the four protein functions tested, and they outperform other machine learning algorithms that use sequence-derived protein features. They can also make out-of-class predictions for phylogenetically distinct protein families with similar functions, including proteins of the CRISPR-associated nuclease, ferritin-like iron storage, and cytochrome P450 families. When applied to the partially unannotated UniRef100 database, the trained models predict candidates that existing annotations confirm and also identify currently unannotated sequences. Some predictions for the ferritin-like iron sequestering function were experimentally validated, even though these sequences differ significantly from known, characterized proteins and from each other and cannot be easily predicted using popular bioinformatics methods.

Overall, as sequencing and experimental characterization data continue to grow rapidly, this RNN-based machine learning approach holds great potential for discovering and predicting homologues for a wide range of protein functions. Because it works from primary amino-acid sequences without sequence alignment or feature engineering, it offers a promising avenue for advancing our understanding of protein function.
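
The kind of model the summary describes can be sketched as a single LSTM layer that reads one residue at a time and passes its last hidden state to a sigmoid output. This is an illustration only: the hidden size, the single layer, the last-state readout, the random untrained weights, and the names `init_lstm` and `lstm_classify` are assumptions, not the architecture or training setup reported in the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_size, hidden_size, seed=0):
    """Random, untrained weights for the four LSTM gates plus a readout vector."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    gates = {name: {"W": matrix(hidden_size, input_size + hidden_size),
                    "b": [0.0] * hidden_size}
             for name in ("input", "forget", "cell", "output")}
    readout = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
    return gates, readout


def affine(layer, vector):
    """Compute W @ vector + b for one gate."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(layer["W"], layer["b"])]


def lstm_classify(inputs, gates, readout):
    """Run the LSTM over the sequence; return a probability from the last hidden state."""
    hidden_size = len(readout)
    h = [0.0] * hidden_size  # hidden state
    c = [0.0] * hidden_size  # cell state
    for x in inputs:
        z = list(x) + h  # concatenate current input with previous hidden state
        i = [sigmoid(a) for a in affine(gates["input"], z)]
        f = [sigmoid(a) for a in affine(gates["forget"], z)]
        g = [math.tanh(a) for a in affine(gates["cell"], z)]
        o = [sigmoid(a) for a in affine(gates["output"], z)]
        # Forget part of the old cell state, then add the gated candidate.
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return sigmoid(sum(w * hk for w, hk in zip(readout, h)))


# Toy usage: one-hot encode a short made-up sequence and score it.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
toy_inputs = [[1.0 if aa == ref else 0.0 for ref in ALPHABET] for aa in "MKTAY"]
gates, readout = init_lstm(input_size=len(ALPHABET), hidden_size=4)
probability = lstm_classify(toy_inputs, gates, readout)
```

With untrained weights the score carries no biological meaning. In practice the weights would be fitted to labeled sequences, such as the UniProt annotations mentioned above, most likely with a deep learning library rather than hand-written loops.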
Created on 09 Sep. 2023
