1. The Cancer Genome Atlas (TCGA) Dataset
1.1 About
This dataset collection forms the pretraining foundation of the COMPASS model and comprises large-scale transcriptomic data across multiple cancer types. Data were obtained from The Cancer Genome Atlas (TCGA) via the Genomic Data Commons (GDC) Portal (version 37) using the TCGAbiolinks package (Colaprico et al., NAR, 2016). To ensure cross-cohort compatibility with downstream immune checkpoint inhibitor (ICI) response analyses, all RNA-seq data were processed through a unified pipeline standardized within the COMPASS framework. The preprocessed TCGA dataset is available for download from Figshare.
1.2 Processing Pipeline
| Step | Description | Tool / Reference |
|---|---|---|
| Data Acquisition | Downloaded harmonized RNA-seq data for all TCGA projects via the GDC portal (v37) using TCGAbiolinks. | GDC Portal; Colaprico et al., NAR, 2016 |
| Read Alignment | Reads were aligned to the GRCh38/hg38 reference genome using STAR (v2.7.5c). | Dobin et al., Bioinformatics, 2013 |
| Gene Annotation | Gene features annotated according to GENCODE v36. | GENCODE Consortium |
| Normalization | Raw counts were normalized by effective gene length and converted to TPM: \(\displaystyle \mathrm{TPM}_i = \left(\frac{\mathrm{RPK}_i}{\sum_j \mathrm{RPK}_j}\right)\times 10^6\), where \(\mathrm{RPK}_i = \frac{\text{count}_i}{\text{gene length}_i\ \text{(kb)}}\). | — |
| Sample Filtering | Removed normal tissue samples (n = 740), pretreated samples, and FFPE specimens, resulting in 10 305 tumor samples. | — |
| Aggregation | Samples aggregated to the patient level using bcr_patient_barcode, yielding 10 184 unique patient tumors. | — |
| Feature Selection | Retained 15 672 protein-coding genes, ensuring overlap with downstream ICI datasets for cross-study compatibility. | — |
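
The normalization step in the table above can be illustrated with a short NumPy sketch. It follows the TPM formula as written (counts divided by gene length in kilobases, then rescaled so each sample sums to one million). The function name `counts_to_tpm`, the genes-by-samples layout, and the toy numbers are assumptions made for illustration; this is not the COMPASS pipeline code.

```python
import numpy as np


def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts to TPM (hypothetical helper, for illustration only).

    counts:     array of shape (n_genes, n_samples) with raw read counts.
    lengths_bp: array of shape (n_genes,) with effective gene lengths in bp.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1_000.0

    # RPK_i = count_i / gene length_i (kb), computed per sample (column).
    rpk = counts / lengths_kb[:, None]

    # TPM_i = RPK_i / sum_j RPK_j * 1e6, normalized within each sample.
    # A sample whose counts are all zero would produce NaN here.
    return rpk / rpk.sum(axis=0, keepdims=True) * 1e6


# Toy input: 3 genes x 2 samples.
counts = np.array([[100, 0],
                   [300, 50],
                   [600, 50]])
lengths_bp = np.array([1000, 3000, 2000])

tpm = counts_to_tpm(counts, lengths_bp)
print(tpm.round(1))
```

Each column of the result sums to 10^6 by construction. If TPM is computed over all annotated genes and the matrix is later restricted to a gene subset, the columns of the subset no longer sum to 10^6.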
1.3 Dataset Summary
| Category | Description |
|---|---|
| Source | The Cancer Genome Atlas (TCGA) |
| Total Patients | 10 184 unique tumors |
| Gene Features | 15 672 protein-coding genes |
| Data Type | Bulk RNA-seq (TPM-normalized) |
| Genome Reference | GRCh38 / hg38 |
| Annotation | GENCODE v36 |
| Processing Tools | STAR v2.7.5c, TCGAbiolinks R package |
| Use in COMPASS | Pretraining of the concept-bottleneck encoder and projector modules |
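
The patient and gene counts in the summary above result from sample filtering, patient-level aggregation, and gene selection. The pandas sketch below shows one way such steps could look. It relies on the standard TCGA barcode layout (the first 12 characters identify the patient; characters 14 and 15 hold the sample-type code, with 10 and above indicating normal tissue). Averaging replicate samples per patient is an assumption, since the source does not state the aggregation function, and the helper name `aggregate_to_patients` and the toy values are hypothetical. Filtering of pretreated and FFPE specimens requires clinical metadata and is not shown.

```python
import pandas as pd


def aggregate_to_patients(tpm, genes_to_keep):
    """Collapse a samples x genes TPM table to one row per patient.

    tpm:           DataFrame indexed by TCGA sample barcode, one column per gene.
    genes_to_keep: gene symbols to retain (e.g. those shared with ICI cohorts).
    """
    # Sample-type code: 01-09 are tumor types, 10-19 are normal types.
    sample_type = tpm.index.str[13:15].astype(int)
    tumor = tpm.loc[sample_type < 10]

    # The patient barcode is the first 12 characters of the sample barcode.
    patient_id = tumor.index.str[:12]

    # ASSUMPTION: replicate tumor samples of a patient are averaged.
    per_patient = tumor.groupby(patient_id).mean()

    # Keep only the requested genes that are actually present.
    keep = [g for g in genes_to_keep if g in per_patient.columns]
    return per_patient[keep]


# Toy data: two tumor samples and one normal sample for the first patient.
tpm = pd.DataFrame(
    [[2.0, 10.0, 0.0],
     [4.0, 30.0, 0.0],
     [100.0, 100.0, 100.0],
     [6.0, 50.0, 1.0]],
    index=["TCGA-AA-0001-01A", "TCGA-AA-0001-01B",
           "TCGA-AA-0001-11A", "TCGA-BB-0002-01A"],
    columns=["CD8A", "ACTB", "XIST"],
)

patients = aggregate_to_patients(tpm, ["CD8A", "ACTB"])
print(patients)
```

Applied to the real data, the resulting table would be expected to have 10 184 rows and 15 672 columns, matching the summary.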
1.4 Notes on Pretraining Utility
The TCGA dataset provides a diverse, multi-cancer transcriptomic landscape that enables COMPASS to learn transferable and biologically meaningful representations before fine-tuning on smaller ICI cohorts. During pretraining, the model captures robust cross-tumor relationships among immune and stromal gene programs, forming well-disentangled concept embeddings that are largely non-overlapping. This separation enhances generalization and interpretability in downstream tasks such as response prediction, survival analysis, and biomarker discovery.
1.5 Acknowledgments
We thank the TCGA Research Network and the GDC consortium for providing open access to high-quality multi-omics cancer data. Their work has been foundational to computational oncology and to systems-level modeling efforts like COMPASS.
1.6 Ethical Compliance
All TCGA data used in this study are publicly available and de-identified. Analyses comply with TCGA data-use policies and applicable ethical standards. No additional patient recruitment or intervention was performed.
2. Immunotherapy Response Prediction (ITRP) Datasets
2.1 About
This collection contains the datasets used to analyze response to immunotherapy across cancer types. Each dataset is described by its cohort name, the cancer type studied, the number of patients split into responders (R) and non-responders (NR), the sequencing platform, and the associated publication.
These datasets support research on the effectiveness of immunotherapies, on the molecular and genetic factors that influence patient response, and on personalized treatment strategies. All of these datasets can be downloaded via the Download Hub or Figshare.
| Group | Cohort | Cancer Type | Patients (R/NR) | Sequencer | Reference |
|---|---|---|---|---|---|
| Small cohort | Choueiri | KIRC | 16(3/13) | HiSeq2500 | Choueiri et al. Clinical Cancer Research, 2016 |
| Small cohort | Miao | KIRC | 17(5/12) | HiSeq2000 | Miao et al. Science, 2018 |
| Small cohort | Snyder | BLCA | 21(7/14) | HiSeq4000 | Snyder et al. PLoS Medicine, 2017 |
| Small cohort | Zhao | GBM | 25(11/14) | HiSeq2000 | Zhao et al. Nature Medicine, 2019 |
| Small cohort | Ravi-2(SU2CLC2) | LUSC | 25(8/17) | HiSeq2500 | Ravi et al. Nature Genetics, 2023 |
| Small cohort | Hugo | SKCM | 26(14/12) | HiSeq2000 | Hugo et al. Cell, 2016 |
| Medium cohort | Allen | SKCM | 39(13/26) | HiSeq2500 | Van Allen et al. Science, 2015 |
| Medium cohort | MGH | SKCM | 34(12/22) | HiSeq2500 | Freeman et al. Cell Reports Medicine, 2022 |
| Medium cohort | Kim | STAD | 45(12/33) | HiSeq2500 | Kim et al. Nature Medicine, 2018 |
| Medium cohort | Riaz | SKCM | 51(10/41) | HiSeq2000/2500 | Riaz et al. Cell, 2017 |
| Medium cohort | Rose | BLCA | 89(16/73) | NovaSeq6000 | Rose et al. British Journal of Cancer, 2021 |
| Medium cohort | Gide | SKCM | 73(40/33) | HiSeq2500 | Gide et al. Cancer Cell, 2019 |
| Large cohort | Ravi-1(SU2CLC1) | LUAD | 102(38/64) | HiSeq2500 | Ravi et al. Nature Genetics, 2023 |
| Large cohort | Liu | SKCM | 107(41/66) | HiSeq2500 | Liu et al. Nature Medicine, 2019 |
| Large cohort | IMmotion150 | KIRC | 165(48/117) | HiSeq2500 | McDermott et al. Nature Medicine, 2018 |
| Large cohort | IMVigor210 | BLCA | 298(68/230) | HiSeq2500 | IMvigor210 Study Group. The Lancet, 2017 |
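
The patient-count entries in the table above pack three numbers into one field: total patients, responders, and non-responders. The sketch below parses that field, checks that responders plus non-responders equal the total, and derives a per-cohort response rate. The cohort values are copied from the table; the helper names `parse_patients` and `response_rate` are hypothetical and not part of any COMPASS API.

```python
import re

# Patient-count field per cohort, copied from the table above.
COHORTS = {
    "Choueiri": "16(3/13)", "Miao": "17(5/12)", "Snyder": "21(7/14)",
    "Zhao": "25(11/14)", "Ravi-2(SU2CLC2)": "25(8/17)", "Hugo": "26(14/12)",
    "Allen": "39(13/26)", "MGH": "34(12/22)", "Kim": "45(12/33)",
    "Riaz": "51(10/41)", "Rose": "89(16/73)", "Gide": "73(40/33)",
    "Ravi-1(SU2CLC1)": "102(38/64)", "Liu": "107(41/66)",
    "IMmotion150": "165(48/117)", "IMVigor210": "298(68/230)",
}

FIELD = re.compile(r"\s*(\d+)\s*\((\d+)/(\d+)\)\s*")


def parse_patients(field):
    """Return (total, responders, non_responders) from a 'N(R/NR)' string."""
    match = FIELD.fullmatch(field)
    if match is None:
        raise ValueError(f"unrecognized patient field: {field!r}")
    total, responders, non_responders = map(int, match.groups())
    # Consistency check: the two groups must add up to the cohort size.
    if responders + non_responders != total:
        raise ValueError(f"R + NR != total in {field!r}")
    return total, responders, non_responders


def response_rate(field):
    """Fraction of responders in a cohort."""
    total, responders, _ = parse_patients(field)
    return responders / total


for name, field in COHORTS.items():
    print(f"{name:18s} n={parse_patients(field)[0]:4d} "
          f"response rate={response_rate(field):.2f}")
```

Response rates vary widely between cohorts, which is worth keeping in mind when comparing classification metrics across them.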
2.2 Acknowledgments
We are deeply grateful to all the patients who participated in these studies; their willingness to take part has been essential to our understanding of immunotherapy response. We also thank the clinicians whose expertise and dedication made the careful collection and analysis of the data possible, and the researchers whose rigorous work on data processing, analysis, and maintenance produced these datasets. Together, their contributions support continued progress in oncology and personalized medicine.
2.3 Statement of Ethical Compliance and Liability
We hereby declare that in the process of acquiring, handling, and analyzing these datasets, we have strictly adhered to the highest standards of ethics and scientific integrity. No actions were taken that would harm the principles of ethics or the integrity of scientific research. Additionally, all procedures and analyses were conducted in full compliance with the applicable legal and regulatory requirements.
We emphasize that our role was confined to the development and analysis of these datasets, and we bear no responsibility or liability for the outcomes or results produced by third parties using them. Users of these datasets should independently verify the data and ensure that their analyses comply with ethical and legal standards. We encourage the responsible use of these datasets in a manner that respects the dignity, privacy, and rights of all participants involved in the studies.