01_TCGA_ITRP_datasets

1. The Cancer Genome Atlas (TCGA) Dataset

1.1 About

This dataset collection forms the pretraining foundation of the COMPASS model and comprises large-scale transcriptomic data across multiple cancer types. Data were obtained from The Cancer Genome Atlas (TCGA) via the Genomic Data Commons (GDC) Portal (version 37) using the TCGAbiolinks package (Colaprico et al., NAR, 2016). To ensure cross-cohort compatibility with downstream immune checkpoint inhibitor (ICI) response analyses, all RNA-seq data were processed through a unified pipeline standardized within the COMPASS framework. The preprocessed TCGA dataset is available for download from Figshare.

1.2 Processing Pipeline

Step	Description	Tool / Reference
Data Acquisition	Downloaded harmonized RNA-seq data for all TCGA projects via the GDC portal (v37) using `TCGAbiolinks`.	GDC Portal; Colaprico et al., NAR, 2016
Read Alignment	Reads were aligned to the GRCh38/hg38 reference genome using STAR (v2.7.5c).	Dobin et al., Bioinformatics, 2013
Gene Annotation	Gene features annotated according to GENCODE v36.	GENCODE Consortium
Normalization	Raw counts normalized by gene effective length and converted to TPM: $\mathrm{TPM}_i = \frac{\mathrm{RPK}_i}{\sum_j \mathrm{RPK}_j} \times 10^6$, where $\mathrm{RPK}_i = \frac{\mathrm{count}_i}{\mathrm{length}_i\ (\mathrm{kb})}$ and $\mathrm{length}_i$ is the effective length of gene $i$ in kilobases.	—
Sample Filtering	Removed normal tissue samples (n = 740), pretreated samples, and FFPE specimens, resulting in 10 305 tumor samples.	—
Aggregation	Samples aggregated to the patient level using `bcr_patient_barcode`, yielding 10 184 unique patient tumors.	—
Feature Selection	Retained 15 672 protein-coding genes, ensuring overlap with downstream ICI datasets for cross-study compatibility.	—
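
The normalization and aggregation steps in the table can be illustrated with a minimal sketch in plain Python. It is not the COMPASS pipeline itself: the function names are hypothetical, the patient barcode is taken as the first three dash-separated fields of a TCGA sample barcode, and averaging is assumed as the rule for combining multiple samples from one patient (the table does not state which rule was used).

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts for one sample to TPM.

    counts     : raw read counts, one per gene
    lengths_bp : gene effective lengths in base pairs, same order as counts
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # RPK_i = count_i / length_i (kb)
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        return [0.0 for _ in rpk]
    # TPM_i = RPK_i / sum_j RPK_j * 10^6
    return [r / total * 1e6 for r in rpk]


def patient_barcode(sample_barcode):
    """Return the patient-level part of a TCGA barcode (e.g. TCGA-02-0001)."""
    return "-".join(sample_barcode.split("-")[:3])


def aggregate_by_patient(sample_tpm):
    """Combine samples that share a patient barcode.

    sample_tpm : dict mapping sample barcode -> list of TPM values
    Assumption: samples are combined by taking the per-gene mean.
    """
    groups = {}
    for barcode, values in sample_tpm.items():
        groups.setdefault(patient_barcode(barcode), []).append(values)
    return {
        patient: [sum(column) / len(column) for column in zip(*vectors)]
        for patient, vectors in groups.items()
    }
```

Because TPM divides by the per-sample RPK total, the values for each sample sum to one million before any gene subsetting. A vectorized version with NumPy or pandas would be the natural choice for the full matrix.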

1.3 Dataset Summary

Category	Description
Source	The Cancer Genome Atlas (TCGA)
Total Patients	10 184 unique tumors
Gene Features	15 672 protein-coding genes
Data Type	Bulk RNA-seq (TPM-normalized)
Genome Reference	GRCh38 / hg38
Annotation	GENCODE v36
Processing Tools	STAR v2.7.5c, TCGAbiolinks R package
Use in COMPASS	Pretraining of the concept-bottleneck encoder and projector modules
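
After downloading the preprocessed matrix, a quick sanity check against the summary above can catch truncated or mismatched files. The sketch below is illustrative only: the function name and the in-memory layout (a dict of patient barcode to TPM values) are assumptions, as is the expectation that stored values are non-negative TPM rather than a log-transformed variant.

```python
EXPECTED_PATIENTS = 10184  # from the dataset summary
EXPECTED_GENES = 15672     # from the dataset summary


def check_expression_matrix(rows, n_patients=EXPECTED_PATIENTS, n_genes=EXPECTED_GENES):
    """Return a list of problems; an empty list means the matrix matches.

    rows : dict mapping patient barcode -> list of TPM values
    """
    problems = []
    if len(rows) != n_patients:
        problems.append(f"expected {n_patients} patients, found {len(rows)}")
    for patient, values in rows.items():
        if len(values) != n_genes:
            problems.append(f"{patient}: expected {n_genes} genes, found {len(values)}")
        elif any(v < 0 for v in values):
            # Assumption: untransformed TPM values are never negative.
            problems.append(f"{patient}: negative TPM value")
    return problems
```

The expected sizes are parameters so the same check can be reused on a subset during testing.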

1.4 Notes on Pretraining Utility

The TCGA dataset provides a diverse, multi-cancer transcriptomic landscape that enables COMPASS to learn transferable and biologically meaningful representations before fine-tuning on smaller ICI cohorts. During pretraining, the model captures robust cross-tumor relationships among immune and stromal gene programs, forming well-disentangled concept embeddings that are largely non-overlapping. This separation enhances generalization and interpretability in downstream tasks such as response prediction, survival analysis, and biomarker discovery.

1.5 Acknowledgments

We thank the TCGA Research Network and the GDC consortium for providing open access to high-quality multi-omics cancer data. Their work has been foundational to computational oncology and to systems-level modeling efforts such as COMPASS.

1.6 Ethical Compliance

All TCGA data used in this study are publicly available and de-identified. Analyses comply with TCGA data-use policies and applicable ethical standards. No additional patient recruitment or intervention was performed.

2. Immunotherapy Response Prediction (ITRP) Datasets

2.1 About

This collection contains the datasets used to analyze response to immunotherapy across multiple cancer types. Each dataset is described by its cohort size, cancer type, number of responders (R) and non-responders (NR), sequencing platform, and source publication.

These datasets support research into the effectiveness of immunotherapies, the molecular and genetic factors that influence patient response, and the development of personalized treatment strategies. All of these datasets can be downloaded via the Download Hub or Figshare.

Group	Cohort	Cancer Type	Patients (R/NR)	Sequencer	Reference
Small cohort	Choueiri	KIRC	16(3/13)	HiSeq2500	Choueiri et al. Clinical Cancer Research, 2016
Small cohort	Miao	KIRC	17(5/12)	HiSeq2000	Miao et al. Science, 2018
Small cohort	Snyder	BLCA	21(7/14)	HiSeq4000	Snyder et al. PLoS Medicine, 2017
Small cohort	Zhao	GBM	25(11/14)	HiSeq2000	Zhao et al. Nature Medicine, 2019
Small cohort	Ravi-2(SU2CLC2)	LUSC	25(8/17)	HiSeq2500	Ravi et al. Nature Genetics, 2023
Small cohort	Hugo	SKCM	26(14/12)	HiSeq2000	Hugo et al. Cell, 2016
Medium cohort	Allen	SKCM	39(13/26)	HiSeq2500	Van Allen et al. Science, 2015
Medium cohort	MGH	SKCM	34(12/22)	HiSeq2500	Freeman et al. Cell Reports Medicine, 2022
Medium cohort	Kim	STAD	45(12/33)	HiSeq2500	Kim et al. Nature Medicine, 2018
Medium cohort	Riaz	SKCM	51(10/41)	HiSeq2000/2500	Riaz et al. Cell, 2017
Medium cohort	Rose	BLCA	89(16/73)	NovaSeq6000	Rose et al. British Journal of Cancer, 2021
Medium cohort	Gide	SKCM	73(40/33)	HiSeq2500	Gide et al. Cancer Cell, 2019
Large cohort	Ravi-1(SU2CLC1)	LUAD	102(38/64)	HiSeq2500	Ravi et al. Nature Genetics, 2023
Large cohort	Liu	SKCM	107(41/66)	HiSeq2500	Liu et al. Nature Medicine, 2019
Large cohort	IMmotion150	KIRC	165(48/117)	HiSeq2500	McDermott et al. Nature Medicine, 2018
Large cohort	IMvigor210	BLCA	298(68/230)	HiSeq2500	IMvigor210 Study Group. The Lancet, 2017
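
Because responder fractions differ widely between cohorts, it can help to compute them directly from the patient-count column. The following is a small sketch using only the counts listed in the table; the names `COHORT_COUNTS`, `parse_counts`, and `response_rates` are hypothetical helpers, not part of any COMPASS API.

```python
import re

# Patient-count cells copied from the table above, keyed by cohort name.
COHORT_COUNTS = {
    "Choueiri": "16(3/13)",
    "Miao": "17(5/12)",
    "Snyder": "21(7/14)",
    "Zhao": "25(11/14)",
    "Ravi-2(SU2CLC2)": "25(8/17)",
    "Hugo": "26(14/12)",
    "Allen": "39(13/26)",
    "MGH": "34(12/22)",
    "Kim": "45(12/33)",
    "Riaz": "51(10/41)",
    "Rose": "89(16/73)",
    "Gide": "73(40/33)",
    "Ravi-1(SU2CLC1)": "102(38/64)",
    "Liu": "107(41/66)",
    "IMmotion150": "165(48/117)",
    "IMvigor210": "298(68/230)",
}

_CELL = re.compile(r"(\d+)\((\d+)/(\d+)\)")


def parse_counts(cell):
    """Parse 'N(R/NR)' into (total, responders, non_responders)."""
    match = _CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    total, responders, non_responders = (int(g) for g in match.groups())
    if responders + non_responders != total:
        raise ValueError(f"R + NR does not equal the total in {cell!r}")
    return total, responders, non_responders


def response_rates(cohorts):
    """Return the responder fraction (R / total) for each cohort."""
    rates = {}
    for name, cell in cohorts.items():
        total, responders, _ = parse_counts(cell)
        rates[name] = responders / total
    return rates


if __name__ == "__main__":
    ranked = sorted(response_rates(COHORT_COUNTS).items(), key=lambda kv: kv[1])
    for name, rate in ranked:
        print(f"{name:18s} {rate:.1%}")
```

Summed over the 16 cohorts listed, the counts give 1133 patients, of whom 346 are responders. The per-cohort imbalance is worth keeping in mind when choosing evaluation metrics.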

2.2 Acknowledgments

We extend our heartfelt gratitude to all the patients who participated in these studies; their willingness to take part has been crucial to advancing our understanding of immunotherapy response. We also sincerely thank the clinicians whose expertise and dedication made the careful collection and analysis of these data possible, and the researchers whose rigorous work in data processing, analysis, and maintenance was essential to building these datasets. Together, patients, clinicians, and researchers have paved the way for significant advances in oncology and personalized medicine.

2.3 Statement of Ethical Compliance and Liability

We hereby declare that in acquiring, handling, and analyzing these datasets, we adhered to the highest standards of ethics and scientific integrity. No actions were taken that would compromise ethical principles or the integrity of scientific research. All procedures and analyses were conducted in full compliance with applicable legal and regulatory requirements.

We emphasize that our role was confined to the development and analysis of these datasets, and we bear no responsibility or liability for outcomes or results produced by third parties using them. Users should independently verify the data and ensure that their analyses comply with ethical and legal standards. We encourage responsible use of these datasets in a manner that respects the dignity, privacy, and rights of all participants involved in the studies.