Poster Presentation HUPO 2019 - 18th Human Proteome Organization World Congress

Discovery of new human protein coding genes in GENCODE using evolutionary signatures and mass spectrometry (#921)

James C Wright 1, Lu Yu 1, Jonathan Mudge 2, Irwin Jungreis 3, Jyoti Choudhary 1
  1. Institute of Cancer Research, London, United Kingdom
  2. Vertebrate Genomics, European Bioinformatics Institute, Cambridge, UK
  3. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, US

Despite nearly twenty years of intense study, the exact portion of the human genome that is translated into protein remains to be ascertained. The GENCODE gene annotation is created by a consortium of manual annotation, computational biology and experimental groups. The objective of the consortium is to create, in the GENCODE geneset, a foundational reference genome annotation encompassing all functional elements, including protein-coding genes, for mouse and human. This annotation is released through Ensembl and forms the basis for most protein sequence databases used in proteomics. There is a strong focus on discovering and validating novel CCDS, as well as on removing spurious protein-coding genes from the reference annotations, in order to finalise the complete complement of reference proteins. This work has involved implementing strict criteria, based on high-quality experimental evidence as well as orthogonal sequence-based characteristics, to refine a gene's protein-coding potential and to confirm the structure of novel protein-coding transcripts. One recent approach has been the use of PhyloCSF to identify the evolutionary signatures of protein-coding regions using multi-species genome alignments and machine learning. These regions provide potential novel conserved protein-coding sequences that can be searched using mass spectrometry data, and this approach has led to the largest annotation of new proteins in the human genome in recent years. It will be important for the annotation of personal genomes and for the use of proteomic experiments to provide translational information about personal protein sequences and allele-specific expression.
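
To illustrate how a candidate coding region might be prepared for a mass spectrometry search, the following is a minimal sketch and not the authors' pipeline. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P, with no missed cleavages) and keeps only peptides that are absent from a reference protein set, since such peptides are the ones that could support a novel annotation. The function names, length limits, and example sequences are hypothetical.

```python
import re


def tryptic_peptides(protein, min_len=7, max_len=30):
    """In-silico digest: cleave after K or R unless followed by P.

    Assumption: zero missed cleavages and a simple length filter.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def novel_peptides(candidate, reference_proteins, min_len=7, max_len=30):
    """Return candidate peptides not produced by any reference protein."""
    known = set()
    for ref in reference_proteins:
        known.update(tryptic_peptides(ref, min_len, max_len))
    return [p for p in tryptic_peptides(candidate, min_len, max_len)
            if p not in known]


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    candidate = "MASKPEPTIDERAAAK"
    reference = ["GGGRAAAK"]
    print(novel_peptides(candidate, reference, min_len=4))
```

In practice, the surviving peptides would be appended to the search database so that spectra matching them can be reviewed as evidence for translation of the candidate region.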