Poster Presentation HUPO 2019 - 18th Human Proteome Organization World Congress

MetaNovo: a probabilistic pipeline for peptide and polymorphism discovery in complex mass spectrometry datasets (#787)

Matthys MG Potgieter 1 2 , Andrew AJM Nel 2 , David DL Tabb 2 3 4 5 , Suereta S Fortuin 2 , Shaun S Garnett 2 , Jerome JM Wendoh 6 , Katie K Lennard 1 , Jonathan JM Blackburn 2 , Nicola NJ Mulder 1
  1. Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, South Africa
  2. Division of Chemical and Systems Biology, Department of Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, South Africa
  3. Division of Molecular Biology and Human Genetics, Department of Biomedical Sciences, Stellenbosch University, Cape Town, South Africa
  4. Bioinformatics Unit, South African Tuberculosis Bioinformatics Initiative, Stellenbosch University, Cape Town, South Africa
  5. The African Microbiome Institute, Stellenbosch University, Stellenbosch, South Africa
  6. Division of Immunology, Department of Pathology, IDM, University of Cape Town, Cape Town, South Africa

The characterization of complex mass spectrometry data obtained from metaproteomics or clinical studies presents unique challenges and offers potential insights into the pathogenesis of human disease. Previous approaches essentially rely on prior expectation or knowledge of the likely sample composition to construct focused search libraries, but this is potentially limiting in many cases. Here we present MetaNovo, a novel software pipeline that directly estimates the proteins and species present in complex mass spectrometry samples at the level of expressed proteomes, using de novo sequence tag matching and probabilistic optimization of very large sequence databases prior to target-decoy search. We validated the pipeline against the results obtained with the recently published MetaPro-IQ pipeline (Zhang et al., 2016) on 8 human mucosal-luminal interface samples, and found comparable numbers of peptide and protein identifications. We then showed that, using an unbiased search of the entire release of UniProt (ca. 90 million protein sequences), MetaNovo identified a bacterial taxonomic distribution similar to that found using a small, focused, matched metagenome database, while simultaneously identifying proteins in the samples that are derived from other organisms and are missed by 16S or shotgun sequencing and by previous metaproteomic methods. Using MetaNovo to analyze a set of single-organism human neuroblastoma cell-line samples (SH-SY5Y) against UniProt, we achieved an MS/MS identification rate during target-decoy search comparable to that obtained using the UniProt human Reference proteome, with 22583 (85.99%) of the total set of identified peptides shared between the two searches. Taxonomic analysis of 612 peptides not found in the canonical set of human proteins yielded 158 peptides unique to the Chordata phylum as potential human variant identifications. Of these, 40 had previously been predicted and 9 identified using whole genome sequencing in a proteogenomic study of the same cell line.

  1. Zhang, X., Ning, Z., Mayne, J., Moore, J. I., Li, J., Butcher, J., … Figeys, D. (2016). MetaPro-IQ: a universal metaproteomic approach to studying human and mouse gut microbiota. Microbiome, 4(1), 31. https://doi.org/10.1186/s40168-016-0176-z