Proteins serve a variety of functions within cells. Some are involved in structural support and movement, others in enzymatic activity, and still others in interaction with the outside world.
The gap between the number of known proteins and the number with an annotated biological function keeps growing. Protein function is a multifaceted and complex phenomenon, and no single algorithm can currently determine it. Many algorithms for functional prediction exist today, based on sequence homology, on combining data from multiple sources, and on advanced machine learning techniques. But how can a protein function prediction be validated?
In the first stage, we performed a retrospective analysis of the neXtProt database (Gaudet et al., 2017) to identify the experimental approaches most commonly used to validate protein function (different functions call for different experimental approaches). We then focused on the functional annotation of chromosome 18 uPE1 proteins (uPE1 proteins are proteins without known function), using text mining and meta-analysis. PubMed searches using the names of the protein Q68DL7 returned no results. PRIDE contained 23 datasets with this protein. For further analysis we chose the 16 datasets created after 2016 (when the HPP Data Interpretation Guidelines version 2.0 were published). These datasets were described in 12 articles. Analysis of their MeSH terms allowed us to form a primary hypothesis about the functional role of the Q68DL7 protein.
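The dataset selection and MeSH-term step can be illustrated with a minimal sketch. It assumes the dataset metadata has already been collected into simple records. The accessions, years and terms below are hypothetical placeholders, not real PRIDE or PubMed data, and treating "after 2016" as strictly later than 2016 is also an assumption.

```python
from collections import Counter

# Hypothetical records: accessions, years and MeSH terms are placeholders.
datasets = [
    {"accession": "DS1", "year": 2015, "mesh_terms": ["TermA"]},
    {"accession": "DS2", "year": 2017, "mesh_terms": ["TermA", "TermB"]},
    {"accession": "DS3", "year": 2018, "mesh_terms": ["TermB", "TermC"]},
]


def select_recent(records, cutoff_year=2016):
    """Keep datasets created strictly after the cutoff year (an assumption)."""
    return [r for r in records if r["year"] > cutoff_year]


def tally_mesh_terms(records):
    """Count how many of the given datasets carry each MeSH term."""
    counts = Counter()
    for record in records:
        # A set avoids counting a term twice within one dataset.
        counts.update(set(record["mesh_terms"]))
    return counts


recent = select_recent(datasets)
mesh_counts = tally_mesh_terms(recent)
```

Terms that recur across many of the retained datasets are the natural starting points for a primary hypothesis.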
In the next stage, we analyzed the co-occurrence of this protein with other proteins in the same articles and experimental datasets. We used the COFACTOR (Zhang et al., 2017) and I-TASSER (Yang et al., 2015) algorithms to predict protein function from protein structure. Based on the “guilt-by-association” principle, we formulated a hypothesis about the role of this protein in different metabolic pathways.
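A co-occurrence count of this kind might be sketched as follows. This illustrates the general idea only, not the procedure actually used in the study. The dataset names and partner identifiers are hypothetical placeholders, and each dataset is assumed to be reduced to a set of identified protein accessions.

```python
from collections import Counter

TARGET = "Q68DL7"

# Hypothetical protein lists per dataset; partner IDs are placeholders.
dataset_proteins = {
    "DS1": {"Q68DL7", "PROT_A", "PROT_B"},
    "DS2": {"Q68DL7", "PROT_A"},
    "DS3": {"PROT_B", "PROT_C"},
}


def cooccurrence_counts(datasets, target):
    """Count, for each other protein, the datasets it shares with the target."""
    counts = Counter()
    for proteins in datasets.values():
        if target in proteins:
            counts.update(proteins - {target})
    return counts


partners = cooccurrence_counts(dataset_proteins, TARGET)
# Partners sorted by how often they appear together with the target.
ranked = partners.most_common()
```

Proteins that rank highest and have known pathway annotations would be the candidates from which a guilt-by-association hypothesis could be drawn.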