To maximize the number of novel peptides identified from proteome data, variants derived from genomic and transcriptomic data can be used to generate variant peptides that are absent from the reference sequences. We developed AWVP, an automated workflow for variant peptide production, to generate variant peptides from FASTQ files of whole exome sequencing (WES) and RNA sequencing (RNA-Seq) data.
AWVP uses GATK4 to produce analysis-ready BAM files from WES and RNA-Seq FASTQ files. HaplotypeCaller is used to identify germline variants from the BAM files of normal samples. Mutect2, VarScan2, MuSE, and SomaticSniper are used to call somatic variants from normal-tumor matched WES and RNA-Seq BAM files. Customprodbj is used to create SNV variant peptides. To identify additional variant peptides from RNA-Seq data, AWVP uses StringTie to assemble transcripts from RNA-Seq BAM files in both annotation-guided and de novo modes. Open reading frames (ORFs) are translated from the assembled transcripts by TransDecoder.LongOrfs. AWVP also uses MiXCR to predict T-cell receptors from RNA-Seq data and translate them into CDR3 peptides. In addition, AWVP integrates KNIFE to retrieve the back-splicing junctions of circular RNAs (circRNAs) from RNA-Seq data. A six-reading-frame translation tool, sixpack, is used to identify peptides that span the back-splicing junctions of circRNAs. Finally, proteins or peptides that are identical to the standard reference proteins are discarded.
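The circRNA junction translation and the final reference filter described above can be illustrated with a minimal Python sketch. It is not AWVP's code and does not call sixpack. The function names, the 30-nt flank length, and the toy sequence are hypothetical choices made for illustration. The sketch translates all six frames with the standard genetic code. It does not check that each resulting peptide actually covers the junction position, which a real workflow would need to do.

```python
from itertools import product

# Standard genetic code, built in TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate three forward and three reverse-complement reading frames."""
    seq = seq.upper()
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def junction_sequence(circ_seq, flank=30):
    """Join the 3' end of a circularized sequence to its 5' start.

    The flank length is an assumed parameter, not a value from AWVP.
    If flank exceeds the sequence length, the whole sequence is used twice.
    """
    return circ_seq[-flank:] + circ_seq[:flank]


def discard_reference_matches(peptides, reference_proteins):
    """Keep only sequences that are not identical to a reference protein."""
    reference = set(reference_proteins)
    return [p for p in peptides if p not in reference]


# Toy usage with a made-up circularized sequence and reference set.
toy_circ = "ATGGCCAAGCTTGGATCCGAATTCTGA"
frames = six_frame_translations(junction_sequence(toy_circ, flank=9))
candidates = [pep for frame in frames for pep in frame.split("*") if pep]
print(discard_reference_matches(candidates, {"MAKLGSEF"}))
```

Splitting each frame at stop codons, as in the usage lines, is one simple way to obtain candidate peptides; length thresholds and junction-coverage checks would be layered on top.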
We used AWVP to generate variant peptides from 50 WES and 39 exome-capture RNA-Seq datasets of normal-tumor matched OSCC samples. This preliminary study identified 69,359 germline and 9,389 somatic mutations. We found ~410,000 proteins derived from assembled transcripts. About 2,253 and 2,684 distinct T-cell receptors were identified in normal and tumor samples, respectively. Finally, we found 74,349 and 78,673 peptides spanning circRNA back-splicing junctions in normal and tumor samples, respectively.
With AWVP, we can automatically produce comprehensive sets of variant peptides for proteogenomic studies in a standardized, repeatable, and reproducible manner.