Computational protein structure prediction is becoming increasingly feasible and accurate, as demonstrated by recent advances from AlphaFold and recurrent geometric networks (AlQuraishi et al. 2019). Protein structure prediction will likely soon form a standard part of genome reannotation pipelines for both model and divergent organisms. Prediction of human protein structures can greatly assist in understanding the roles of different isoforms and nonsynonymous genetic mutations across tissues and in disease.
However, to harness the potential of large-scale protein structure prediction, methods are needed to independently and efficiently assess the quality of thousands of predicted structures without human assistance.
To test a novel quality assessment method, we used I-TASSER software to predict the structures of 5000 proteins encoded in the human parasite Giardia, a simple eukaryote that lacks introns and thus encodes thousands of full-length proteins. We used discrete protein sequence annotations (Pfam codes), assigned both to the peptides with predicted structures and to their closest empirically determined homologues in the PDB, to bin the predicted structures into a high-confidence category (matching IPR code) or a lower-confidence category (non-matching).
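The binning step can be illustrated with a minimal Python sketch. It assumes that "matching" means the query protein and its closest PDB homologue share at least one annotation code; the exact matching rule, the function name `bin_structure`, and the example records are illustrative assumptions rather than details taken from this work.

```python
from typing import Iterable


def bin_structure(query_codes: Iterable[str], homologue_codes: Iterable[str]) -> str:
    """Bin a predicted structure by annotation agreement.

    Assumption: a single shared code counts as a match. A stricter rule
    (for example, identical code sets) would be an equally plausible reading.
    """
    shared = set(query_codes) & set(homologue_codes)
    return "high-confidence" if shared else "lower-confidence"


# Hypothetical records: (protein id, query codes, closest PDB homologue codes)
records = [
    ("prot_A", {"PF00069"}, {"PF00069", "PF07714"}),  # one shared code
    ("prot_B", {"PF00085"}, {"PF00106"}),             # no shared code
    ("prot_C", set(), {"PF00069"}),                   # query has no annotation
]

labels = {pid: bin_structure(query, homologue) for pid, query, homologue in records}
```

Under this rule an unannotated query can never fall into the high-confidence bin, which is one reason a classifier built on continuous metrics is useful as a complement.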
Continuous metrics output by I-TASSER were used to construct a random forest machine learning model that predicted the high-confidence category, yielding structural insight into ~1000 proteins, including enzymes important for drug resistance and redox maintenance. The classifier also produced a second tier of predicted structures that share features of the high-confidence structures but lack Pfam domains matching those of their closest crystal structure homologue.
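One way to picture the resulting stratification is the sketch below. It assumes that each structure carries an annotation-match flag and a classifier probability of belonging to the high-confidence category, and that a cutoff of 0.5 separates the tiers; the function name `assign_tier`, the tier labels, and the cutoff are hypothetical and are not specified in this work.

```python
def assign_tier(annotation_match: bool, high_conf_probability: float,
                threshold: float = 0.5) -> str:
    """Stratify a predicted structure into one of three tiers.

    Assumptions: `high_conf_probability` is the classifier's estimated
    probability for the high-confidence category, and `threshold` is an
    arbitrary illustrative cutoff.
    """
    if not 0.0 <= high_conf_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if annotation_match:
        return "high-confidence"
    if high_conf_probability >= threshold:
        # Resembles high-confidence structures but lacks a matching domain.
        return "second-tier"
    return "lower-confidence"


# Hypothetical structures: (annotation match, classifier probability)
examples = [(True, 0.91), (False, 0.78), (False, 0.12)]
tiers = [assign_tier(match, prob) for match, prob in examples]
```

Because the tiering only consumes a match flag and a probability, the same logic would apply to metrics from prediction software other than I-TASSER.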
Proteins with high-confidence models exhibited greater transcriptional abundance, and the classifier generalized to selected human proteins, indicating the broad utility of this approach for automatically stratifying predicted structures.
This work provides a method for assigning confidence to predicted protein structures en masse in a software-agnostic manner, and it can be used to vastly increase knowledge of the structural proteome in humans and other medically relevant species.