Interdisciplinary consortia such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC) drive scientific discovery by creating coordinated, structured datasets at large scale, amassing a number of samples and a diversity of measurements that only a collaborative effort can achieve. Although each dataset is analyzed in a primary publication, these consortia encourage reanalysis by the scientific community to explore new questions or apply novel methodologies. Expanding access to data is therefore an important goal for publicly funded research.
To enable secondary analyses more seamlessly, data dissemination technologies need to reach their intended audiences in the most convenient and accessible way possible. Although storing data in supplemental tables or cloud-based archives is adequate for the historical record, it is not the optimal dissemination method. To facilitate reanalysis, data needs to be accessible within an analytical environment; that is, it needs to be accessible to software via APIs. No such mechanism currently exists for the quantitative molecular data tables that form the foundation of data analysis tasks.
We present a unified API for accessing all CPTAC proteogenomic data for cancer cohorts. Each cancer type contains data for ~100 tumors, which are comprehensively characterized with genomics, transcriptomics, and proteomics, together with relevant clinical information. This data is packaged within a Python module, cptac, that is freely distributed through the Python Package Index (PyPI). This packaging removes many of the common barriers to reanalysis. Most importantly, the data is accessible via code, with no parsing, formatting, web browsing, or passwords required. The omics and clinical data tables come pre-formatted as dataframes, ready for use with statistical and visualization packages, and the API handles any complex merging between data types. The module includes extensive tutorials and documentation to help users understand the data. Additionally, the API wraps common algorithms used by CPTAC in its primary analyses.
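To illustrate the kind of merging between data types that such an API takes off the user's hands, the following is a minimal sketch using pandas alone. It does not call the cptac module and is not its implementation: the sample IDs, gene columns, values, and the `join_tables` helper are all hypothetical, chosen only to show an index-based join of an omics table with a clinical table.

```python
import pandas as pd

# Hypothetical toy tables (assumed for illustration; not CPTAC data).
# Rows are samples, columns are measurements, as in a typical omics dataframe.
proteomics = pd.DataFrame(
    {"TP53": [0.12, -0.45, 0.88], "PTEN": [-1.10, 0.30, 0.05]},
    index=pd.Index(["S001", "S002", "S003"], name="Patient_ID"),
)
clinical = pd.DataFrame(
    {"Sample_Type": ["Tumor", "Tumor", "Normal"]},
    index=pd.Index(["S001", "S003", "S004"], name="Patient_ID"),
)


def join_tables(omics, metadata, omics_label):
    """Join an omics table to a metadata table on their shared sample index.

    Hypothetical helper: omics columns are suffixed with the data type so
    that the same gene from different data types stays distinguishable.
    Only samples present in both tables are kept (inner join).
    """
    labeled = omics.add_suffix(f"_{omics_label}")
    return metadata.join(labeled, how="inner")


merged = join_tables(proteomics, clinical, "proteomics")
print(merged)
```

Keeping only the samples shared by both tables and labeling each column with its data type are design choices of this sketch; a real analysis might instead keep unmatched samples and fill the gaps with missing values.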