1. The Cancer Genome Atlas (TCGA) Dataset

1.1 About

This dataset is the pretraining corpus of the COMPASS model and comprises large-scale transcriptomic data across multiple cancer types. Data were obtained from The Cancer Genome Atlas (TCGA) via the Genomic Data Commons (GDC) Portal (version 37) using the TCGAbiolinks package (Colaprico et al., NAR, 2016). To ensure cross-cohort compatibility with downstream immune checkpoint inhibitor (ICI) response analyses, all RNA-seq data were processed through a unified pipeline standardized within the COMPASS framework. Please download the preprocessed TCGA dataset from Figshare:

Figshare

1.2 Processing Pipeline

| Step | Description | Tool / Reference |
|---|---|---|
| Data Acquisition | Downloaded harmonized RNA-seq data for all TCGA projects via the GDC Portal (v37) using TCGAbiolinks. | GDC Portal; Colaprico et al., NAR, 2016 |
| Read Alignment | Reads were aligned to the GRCh38/hg38 reference genome using STAR (v2.7.5c). | Dobin et al., Bioinformatics, 2013 |
| Gene Annotation | Gene features were annotated according to GENCODE v36. | GENCODE Consortium |
| Normalization | Raw counts were normalized by gene effective length and converted to TPM: \(\mathrm{TPM}_i = \frac{\mathrm{RPK}_i}{\sum_j \mathrm{RPK}_j} \times 10^6\), where \(\mathrm{RPK}_i = \frac{\mathrm{count}_i}{\text{gene length}_i\ (\text{kb})}\). | - |
| Sample Filtering | Removed normal tissue samples (n = 740), pretreated samples, and FFPE specimens, resulting in 10 305 tumor samples. | - |
| Aggregation | Samples were aggregated to the patient level using bcr_patient_barcode, yielding 10 184 unique patient tumors. | - |
| Feature Selection | Retained 15 672 protein-coding genes, ensuring overlap with downstream ICI datasets for cross-study compatibility. | - |
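
The normalization, filtering, and aggregation steps above can be sketched in plain Python. This is a minimal illustration, not the COMPASS pipeline itself: the function names are hypothetical, the patient and sample-type fields are read from the standard TCGA barcode layout, and averaging the samples of one patient is an assumption, since the table does not say how samples are combined. The pretreatment and FFPE filters need sample metadata and are left out.

```python
from collections import defaultdict


def counts_to_tpm(counts, lengths_kb):
    """Convert raw counts of one sample to TPM.

    counts and lengths_kb are dicts keyed by gene ID; lengths are in kilobases.
    """
    # Reads per kilobase: count divided by gene length in kb.
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rpk.values())
    if total == 0:
        return {gene: 0.0 for gene in rpk}
    # Scale so that the values of one sample sum to one million.
    return {gene: value / total * 1e6 for gene, value in rpk.items()}


def patient_barcode(sample_barcode):
    """Return the patient part of a TCGA barcode (first three fields)."""
    return "-".join(sample_barcode.split("-")[:3])


def is_tumor(sample_barcode):
    """TCGA sample-type codes 01-09 denote tumors, 10-19 normal tissue."""
    sample_type = int(sample_barcode.split("-")[3][:2])
    return 1 <= sample_type <= 9


def aggregate_by_patient(tpm_by_sample):
    """Average the tumor profiles that share a patient barcode.

    Averaging is an assumption of this sketch, not a documented COMPASS step.
    """
    groups = defaultdict(list)
    for barcode, profile in tpm_by_sample.items():
        if is_tumor(barcode):
            groups[patient_barcode(barcode)].append(profile)
    return {
        patient: {
            gene: sum(p[gene] for p in profiles) / len(profiles)
            for gene in profiles[0]
        }
        for patient, profiles in groups.items()
    }
```

Calling `counts_to_tpm` once per sample and passing the resulting dictionary of profiles, keyed by sample barcode, to `aggregate_by_patient` yields one profile per patient. Restricting each profile to a fixed gene list afterwards would correspond to the feature-selection step.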

1.3 Dataset Summary

| Category | Description |
|---|---|
| Source | The Cancer Genome Atlas (TCGA) |
| Total Patients | 10 184 unique tumors |
| Gene Features | 15 672 protein-coding genes |
| Data Type | Bulk RNA-seq (TPM-normalized) |
| Genome Reference | GRCh38 / hg38 |
| Annotation | GENCODE v36 |
| Processing Tools | STAR v2.7.5c, TCGAbiolinks R package |
| Use in COMPASS | Pretraining of the concept-bottleneck encoder and projector modules |

1.4 Notes on Pretraining Utility

The TCGA dataset provides a diverse, multi-cancer transcriptomic landscape that enables COMPASS to learn transferable and biologically meaningful representations before fine-tuning on smaller ICI cohorts. During pretraining, the model captures robust cross-tumor relationships among immune and stromal gene programs, forming well-disentangled concept embeddings that are largely non-overlapping. This separation enhances generalization and interpretability in downstream tasks such as response prediction, survival analysis, and biomarker discovery.

1.5 Acknowledgments

We thank the TCGA Research Network and GDC consortium for providing open access to high-quality multi-omics cancer data. Their efforts have been foundational to advancing computational oncology and systems-level modeling efforts like COMPASS.

1.6 Ethical Compliance

All TCGA data used in this study are publicly available and de-identified. Analyses comply with TCGA data-use policies and applicable ethical standards. No additional patient recruitment or intervention was performed.

2. Immunotherapy Response Prediction (ITRP) Datasets

2.1 About

This collection contains the datasets used to analyze immunotherapy response across cancer types. For each dataset, the table below lists the cohort size, the cancer type, the numbers of responders (R) and non-responders (NR), the sequencing platform, and the source publication.

These datasets support research into the effectiveness of immunotherapies, the molecular and genetic factors that influence patient response, and the development of personalized treatment strategies. Please download all of these datasets via Download Hub or Figshare.

| Group | Cohort | Cancer Type | Patients (R/NR) | Sequencer | Reference |
|---|---|---|---|---|---|
| Small cohort | Choueiri | KIRC | 16 (3/13) | HiSeq2500 | Choueiri et al., Clinical Cancer Research, 2016 |
| Small cohort | Miao | KIRC | 17 (5/12) | HiSeq2000 | Miao et al., Science, 2018 |
| Small cohort | Snyder | BLCA | 21 (7/14) | HiSeq4000 | Snyder et al., PLoS Med., 2017 |
| Small cohort | Zhao | GBM | 25 (11/14) | HiSeq2000 | Zhao et al., Nature Medicine, 2019 |
| Small cohort | Ravi-2(SU2CLC2) | LUSC | 25 (8/17) | HiSeq2500 | Ravi et al., Nature Genetics, 2023 |
| Small cohort | Hugo | SKCM | 26 (14/12) | HiSeq2000 | Hugo et al., Cell, 2016 |
| Medium cohort | Allen | SKCM | 39 (13/26) | HiSeq2500 | Van Allen et al., Science, 2015 |
| Medium cohort | MGH | SKCM | 34 (12/22) | HiSeq2500 | Freeman et al., Cell Rep. Med., 2022 |
| Medium cohort | Kim | STAD | 45 (12/33) | HiSeq2500 | Kim et al., Nature Medicine, 2018 |
| Medium cohort | Riaz | SKCM | 51 (10/41) | HiSeq2000/2500 | Riaz et al., Cell, 2017 |
| Medium cohort | Rose | BLCA | 89 (16/73) | NovaSeq6000 | Rose et al., BJC, 2021 |
| Medium cohort | Gide | SKCM | 73 (40/33) | HiSeq2500 | Gide et al., Cancer Cell, 2019 |
| Large cohort | Ravi-1(SU2CLC1) | LUAD | 102 (38/64) | HiSeq2500 | Ravi et al., Nature Genetics, 2023 |
| Large cohort | Liu | SKCM | 107 (41/66) | HiSeq2500 | Liu et al., Nature Medicine, 2019 |
| Large cohort | IMmotion150 | KIRC | 165 (48/117) | HiSeq2500 | McDermott et al., Nature Medicine, 2018 |
| Large cohort | IMVigor210 | BLCA | 298 (68/230) | HiSeq2500 | IMvigor210 Study Group, The Lancet, 2017 |
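
The responder and non-responder counts in the table above make the class imbalance of each cohort easy to quantify. The following sketch copies the counts from the table and computes per-cohort and pooled response rates; the function names are hypothetical and are not part of any COMPASS API.

```python
# (cohort, cancer type, responders, non-responders), copied from the table above.
COHORTS = [
    ("Choueiri", "KIRC", 3, 13),
    ("Miao", "KIRC", 5, 12),
    ("Snyder", "BLCA", 7, 14),
    ("Zhao", "GBM", 11, 14),
    ("Ravi-2(SU2CLC2)", "LUSC", 8, 17),
    ("Hugo", "SKCM", 14, 12),
    ("Allen", "SKCM", 13, 26),
    ("MGH", "SKCM", 12, 22),
    ("Kim", "STAD", 12, 33),
    ("Riaz", "SKCM", 10, 41),
    ("Rose", "BLCA", 16, 73),
    ("Gide", "SKCM", 40, 33),
    ("Ravi-1(SU2CLC1)", "LUAD", 38, 64),
    ("Liu", "SKCM", 41, 66),
    ("IMmotion150", "KIRC", 48, 117),
    ("IMVigor210", "BLCA", 68, 230),
]


def response_rate(responders, non_responders):
    """Fraction of patients in a cohort who responded."""
    return responders / (responders + non_responders)


def pooled_by_cancer(cohorts):
    """Sum responders and non-responders over cohorts of the same cancer type."""
    pooled = {}
    for _, cancer, r, nr in cohorts:
        prev_r, prev_nr = pooled.get(cancer, (0, 0))
        pooled[cancer] = (prev_r + r, prev_nr + nr)
    return pooled


if __name__ == "__main__":
    for name, cancer, r, nr in COHORTS:
        print(f"{name:16s} {cancer:5s} n={r + nr:3d} rate={response_rate(r, nr):.2f}")
    for cancer, (r, nr) in sorted(pooled_by_cancer(COHORTS).items()):
        print(f"{cancer:5s} pooled n={r + nr:3d} rate={response_rate(r, nr):.2f}")
```

Most cohorts have fewer responders than non-responders, which is worth keeping in mind when choosing evaluation metrics for response prediction.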

2.2 Acknowledgments

We extend our heartfelt gratitude to all the patients who participated in these studies, offering invaluable contributions to the advancement of medical science. Their willingness to be part of this research has played a crucial role in enhancing our understanding of immune therapy responses. We also express our sincere appreciation to the doctors, whose expertise and dedication have been instrumental in the meticulous collection and analysis of the data. Their commitment to excellence ensures the reliability and significance of these datasets. Furthermore, we acknowledge the hard work and perseverance of the scientific workers involved in these projects. Their rigorous efforts in data processing, analysis, and maintenance have been essential in developing these comprehensive datasets. Together, the collaboration of patients, doctors, and scientific workers has paved the way for significant advancements in the field of oncology and personalized medicine.

2.3 Statement of Ethical Compliance and Liability

We hereby declare that in the process of acquiring, handling, and analyzing these datasets, we have strictly adhered to the highest standards of ethics and scientific integrity. No actions were taken that would harm the principles of ethics or the integrity of scientific research. Additionally, all procedures and analyses were conducted in full compliance with the applicable legal and regulatory requirements.

We emphasize that our role was confined to the development and analysis of these datasets, and we bear no responsibility or liability for the outcomes or results produced by third parties using them. Users of these datasets should independently verify the data and ensure that their analyses comply with ethical and legal standards. We encourage the responsible use of these datasets in a manner that respects the dignity, privacy, and rights of all participants involved in the studies.