Poster Presentation HUPO 2019 - 18th Human Proteome Organization World Congress

Harnessing machine-learning techniques to accurately identify protein complexes and their changes based on SEC-SWATH/DIA data (#644)

Chen Li 1,2, Andrea Fossati 1,3, Moritz Heusel 4, Fabian Frommelt 1, Federico Uliana 1, Isabell Bludau 1, Varun Sharma 1, Matteo Manica 5, Peng Xue 1,6, María Rodríguez Martínez 5, Patrick G.A. Pedrioli 1, Anthony W. Purcell 2, Matthias Gstaiger 1, Rudolf Aebersold 1,7
  1. Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Zürich, Switzerland
  2. Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia
  3. Department of Biology, Institute of Molecular Health Sciences, ETH Zürich, Zürich, Switzerland
  4. Division of Infection Medicine (BMC), Department of Clinical Sciences, Lund University, Lund, Sweden
  5. IBM Research Zürich, Zürich, Switzerland
  6. Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
  7. Faculty of Science, University of Zürich, Zürich, Switzerland

Biological functions are usually performed and regulated by complexes of interacting proteins. Nevertheless, most transcriptomic and proteomic measurements have focused on individual molecules and their functions. Although a range of methods is available for the analysis of specific protein complexes, systematic analysis of the ensemble of protein complexes in a sample has remained challenging, highlighting the urgent need to identify protein complexes and their functions using high-throughput proteomic techniques. Approaches based on biochemical fractionation of intact complexes and correlation of protein profiles have shown promising performance when size exclusion chromatography (SEC) is combined with highly accurate protein quantification by SWATH-MS. However, interpreting co-fractionation datasets in terms of protein complex composition, abundance and dynamic rearrangements remains challenging due to the limited number of experimentally verified protein complexes in public knowledgebases [1]; as a result, novel protein complexes and their changes cannot be identified without prior evidence. Here, we used machine-learning techniques to construct a systematic framework based on Random Forest (RF) to identify novel protein complexes from SEC-SWATH-MS data and to characterize their changes across different experimental/medical conditions. Given raw protein matrices from different conditions, the framework identifies novel protein complexes and their changes across conditions using the hypothesis generation, data rescaling and differential analysis modules of the software package. In five-fold cross-validation on three in-house/published SEC-SWATH-MS datasets (mapped to the CORUM database), the RF model achieved an average AUC of 0.948, an accuracy of 88.3% and an MCC of 0.762. Independent tests across different data acquisition methods (DDA vs. DIA) and species (human vs. mouse) also highlighted the strong generalizability of the RF model and the robustness of its prediction performance. The software is being implemented and will be freely available for prediction and analysis of novel protein complexes and their changes across conditions from co-fractionation-MS datasets.
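
For readers less familiar with the evaluation metrics quoted in the abstract, here is a minimal sketch of how accuracy and the Matthews correlation coefficient (MCC) are computed from binary predictions. It is not the authors' software. The function names and toy labels are hypothetical, chosen only for illustration, and 1 is assumed to mean "protein pair belongs to a known complex".

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, tn, fp, fn))
print("MCC:", mcc(tp, tn, fp, fn))
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, so it is harder to inflate than accuracy when positive and negative classes are imbalanced.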

  1. Moritz Heusel, Isabell Bludau, George Rosenberger, Robin Hafen, Max Frank, Amir Banaei‐Esfahani, Audrey van Drogen, Ben C. Collins, Matthias Gstaiger, and Ruedi Aebersold. Complex‐centric proteome profiling by SEC‐SWATH‐MS. Molecular Systems Biology (2019) 15, e8438