Mass spectrometry-based proteomics plays an important role in identifying peptides. Peptide identification depends strongly on the precursor mass estimated from the preceding precursor scan. However, estimated precursor masses often contain isotopic errors. Determining isotope clusters is the first step toward determining correct precursor masses. Existing tools such as RAPID (1) and MS-Deconv (2) use heuristic functions to recognize correct isotope clusters. These heuristics were developed from the characteristics of theoretical isotope clusters, but isotope clusters in an experimental scan contain noise and may overlap with other isotope clusters. Here, we propose a machine learning approach to identifying correct isotope clusters, which has the benefit of accommodating the characteristics of experimental isotope clusters.
We designed an artificial neural network model to learn the characteristics of isotope clusters. The model takes the monoisotopic mass and the intensities of the first through twelfth peaks of a cluster as input and predicts whether the given cluster is an isotope cluster.
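As a rough illustration of the model input described above, the following sketch assembles a 13-element vector from a monoisotopic mass and up to twelve peak intensities. The function name `build_features`, the zero-padding of short clusters, and the scaling to the most intense peak are assumptions made for this example; the preprocessing actually used is not specified here.

```python
def build_features(monoisotopic_mass, intensities, n_peaks=12):
    """Return [mass, i1, ..., i12] for one candidate isotope cluster."""
    # Keep only the first n_peaks intensities.
    peaks = [float(x) for x in intensities[:n_peaks]]
    # Assumption: clusters with fewer than n_peaks peaks are zero-padded.
    peaks += [0.0] * (n_peaks - len(peaks))
    # Assumption: intensities are scaled relative to the most intense peak.
    top = max(peaks)
    if top > 0.0:
        peaks = [x / top for x in peaks]
    return [float(monoisotopic_mass)] + peaks


features = build_features(1500.75, [50, 100, 25])
print(len(features))
```

A vector of this shape would then be passed to a binary classifier that outputs whether the cluster is an isotope cluster.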
To train the model, we collected ~4.2M peptide-spectrum matches (PSMs) from a previous study (3). Detected isotope clusters (DICs) corresponding to the precursor of each PSM were extracted using both RAPID and MS-Deconv, and we filtered out ~2.95M DICs whose spectral contrast angle (4) against the theoretical isotope cluster (5) was below 0.80. From the ~1.25M selected DICs, we generated ~1.25M negative isotope clusters (NICs), each consisting of a subset of the peaks of a DIC.
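The filtering and negative-example steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: it assumes the 0.80 threshold applies to the cosine of the spectral contrast angle between observed and theoretical intensity vectors, and it assumes a negative cluster is formed by dropping the leading peaks of a DIC. The function names and the `start` and `width` parameters are hypothetical.

```python
import math


def spectral_contrast_cosine(observed, theoretical):
    """Cosine of the spectral contrast angle between two intensity vectors."""
    n = min(len(observed), len(theoretical))
    a, b = observed[:n], theoretical[:n]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    # Clamp to 1.0 to absorb floating-point rounding.
    return min(1.0, dot / (norm_a * norm_b))


def keep_cluster(observed, theoretical, threshold=0.80):
    """Assumption: a DIC is retained when its score reaches the threshold."""
    return spectral_contrast_cosine(observed, theoretical) >= threshold


def make_negative_cluster(intensities, start=1, width=12):
    """Hypothetical NIC: keep only the peaks from index `start` onward."""
    partial = [float(x) for x in intensities[start:start + width]]
    # Zero-pad so the result has a fixed width.
    return partial + [0.0] * (width - len(partial))


observed = [100.0, 62.0, 21.0, 5.0]
theoretical = [100.0, 60.0, 20.0, 4.0]
if keep_cluster(observed, theoretical):
    negative = make_negative_cluster(observed, start=1, width=4)
```

Dropping leading peaks mimics a cluster whose first peak is not the monoisotopic peak, which is one plausible way a partial cluster could arise; other subsets of peaks would serve the same illustrative purpose.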
We applied 5-fold cross-validation to guard against overfitting; the average accuracy was 96.68%. To test the model, we used DICs and NICs derived from different experimental methods (6,7). The sensitivity and specificity were 97.35% and 85.85%, respectively.
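For reference, the evaluation quantities above can be computed as in the following sketch. It assumes labels of 1 for DICs and 0 for NICs; the helper names `k_fold_indices` and `sensitivity_specificity` are illustrative, and in practice the samples would be shuffled before being split into folds.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Distribute any remainder over the first folds.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


folds = list(k_fold_indices(10, k=5))
sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0])
```

Here sensitivity is the fraction of true isotope clusters recognized as such, and specificity is the fraction of negative clusters correctly rejected.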