Introduction
This offline tutorial corresponds to the feature-extraction module available at https://www.immuno-compass.com/extract/.
COMPASS serves two purposes: it is both a response predictor and a biologically grounded feature extractor. As a feature extractor, COMPASS captures multi-scale representations of transcriptomic data at the gene, gene-set, and concept levels. These hierarchical features support a range of downstream analyses, such as predicting patient survival with Cox proportional hazards models or building logistic regression models for immunotherapy response prediction.
COMPASS provides two core functions for generating these representations:
extract(): returns the full multi-level feature matrices (gene, gene-set, and concept).
dfg, dfgs, dfct = finetuner.extract(dfcx, batch_size=128, with_gene_level=True)
This function produces three outputs corresponding to gene-level, gene-set-level, and concept-level features.
project(): returns compact, vectorized embeddings of the gene-set and concept levels.
dfgs_vector, dfct_vector = finetuner.project(dfcx, batch_size=128)
Use this function when you need the extracted features in a vector format (dim = 32) suitable for downstream machine-learning models or statistical analyses.
Together, these functions make COMPASS a flexible interface for bridging biological interpretability and computational modeling, providing a standardized pipeline for obtaining structured feature embeddings across multiple biological scales.
from compass import PreTrainer, FineTuner, loadcompass
from compass.utils import plot_embed_with_label, plot_performance, score2
from compass.tokenizer import CANCER_CODE
import os
from tqdm import tqdm
from itertools import chain
import pandas as pd
import numpy as np
import random, torch
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = 'white', font_scale=1.3)
import warnings
warnings.filterwarnings("ignore")
def onehot(S):
    # One-hot encode a label Series; samples with a missing label become all-NaN rows
    assert type(S) == pd.Series, 'Input type should be pd.Series'
    dfd = pd.get_dummies(S, dummy_na=True)
    nanidx = dfd[dfd[np.nan].astype(bool)].index
    dfd.loc[nanidx, :] = np.nan
    dfd = dfd.drop(columns=[np.nan])*1.
    # Order columns by class frequency, most frequent first
    cols = dfd.sum().sort_values(ascending=False).index.tolist()
    dfd = dfd[cols]
    return dfd
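The helper above can be exercised on a toy Series to see what it returns. The variant below is only a sketch, not the tutorial's function: `onehot_sketch` and the toy labels are hypothetical, and it requests float dummies up front on the assumption that recent pandas versions return boolean dummy columns, into which writing NaN may warn or fail.

```python
import numpy as np
import pandas as pd


def onehot_sketch(labels: pd.Series) -> pd.DataFrame:
    """One-hot encode labels; rows with a missing label become all-NaN."""
    # dtype=float keeps the columns numeric so NaN can be written into them.
    dummies = pd.get_dummies(labels, dtype=float)
    # By default a missing label yields an all-zero row; mark it as unknown instead.
    dummies.loc[labels.isna(), :] = np.nan
    # Order columns from the most to the least frequent class.
    order = dummies.sum().sort_values(ascending=False).index.tolist()
    return dummies[order]


# Toy labels (illustrative only): three "NR", two "R", one unknown.
toy = pd.Series(
    ["R", "NR", "NR", np.nan, "NR", "R"],
    index=[f"p{i}" for i in range(6)],
)
encoded = onehot_sketch(toy)
print(encoded)
```

The all-NaN row lets downstream code drop or mask unlabeled samples explicitly instead of treating them as a negative class.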
## load finetuner; you can load any finetuner
## Here we load a finetuner trained without the Gide cohort (pft_leave_Gide.pt) to test performance on the Gide cohort:
finetuner = loadcompass('./tmpignore/pft_leave_Gide.pt', map_location = 'cpu')
## read data
df_label = pd.read_pickle('./tmpignore/ITRP.PATIENT.TABLE')
df_tpm = pd.read_pickle('./tmpignore/ITRP.TPM.TABLE')
df_label = df_label[df_label.cohort == 'Gide']
df_tpm = df_tpm.loc[df_label.index]
df_tpm.shape, df_label.shape
((73, 15672), (73, 110))
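The cohort filter and row alignment above rely on both tables sharing the same sample index. A minimal sketch of that step with explicit sanity checks is shown here; the toy tables and their values are made up for illustration and only mirror the column names used above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the label and expression tables (values are made up).
df_label_toy = pd.DataFrame(
    {"cohort": ["Gide", "Other", "Gide"], "response_label": ["R", "NR", "NR"]},
    index=["s1", "s2", "s3"],
)
df_tpm_toy = pd.DataFrame(
    np.arange(6, dtype=float).reshape(3, 2),
    index=["s3", "s1", "s2"],  # deliberately in a different order
    columns=["A1BG", "A2M"],
)

# Keep one cohort, then reorder the expression rows to follow the labels.
labels = df_label_toy[df_label_toy.cohort == "Gide"]
tpm = df_tpm_toy.loc[labels.index]

# Sanity checks before handing the tables to a model.
if not tpm.index.equals(labels.index):
    raise ValueError("expression and label tables are misaligned")
if not tpm.index.is_unique:
    raise ValueError("duplicate sample IDs found")

print(tpm.shape, labels.shape)
```

Selecting with `.loc[labels.index]` raises a `KeyError` if a labeled sample has no expression profile, which is usually preferable to silently dropping it.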
Prepare model inputs
dfcx = df_label.cancer_type.map(CANCER_CODE).to_frame('cancer_code').join(df_tpm)
df_task = onehot(df_label.response_label)
dfcx.head()
| Index | cancer_code | A1BG | A1CF | A2M | A2ML1 | A4GALT | A4GNT | AAAS | AACS | AADAC | ... | ZWILCH | ZWINT | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1_ipiPD1_PRE | 25 | 5.23 | 0.02 | 82.96 | 0.10 | 0.75 | 0.03 | 27.57 | 3.23 | 0.04 | ... | 10.48 | 3.47 | 0.70 | 1.63 | 2.43 | 0.05 | 2.98 | 10.81 | 6.30 | 4.01 |
| 2_ipiPD1_PRE | 25 | 7.39 | 0.00 | 1154.40 | 0.00 | 0.95 | 0.03 | 48.91 | 2.10 | 0.01 | ... | 17.20 | 7.46 | 0.44 | 0.79 | 5.81 | 0.00 | 5.02 | 37.27 | 13.47 | 8.14 |
| 6_ipiPD1_PRE | 25 | 3.91 | 0.00 | 168.14 | 0.11 | 0.52 | 0.01 | 18.20 | 2.08 | 0.00 | ... | 4.73 | 1.54 | 0.57 | 1.06 | 1.81 | 0.01 | 2.79 | 4.11 | 6.77 | 3.74 |
| 7_ipiPD1_PRE | 25 | 1.85 | 0.01 | 80.62 | 0.00 | 0.21 | 0.03 | 4.82 | 0.84 | 0.06 | ... | 4.07 | 1.58 | 0.44 | 0.39 | 0.87 | 0.00 | 2.00 | 8.44 | 2.20 | 2.92 |
| 8_ipiPD1_PRE | 25 | 5.39 | 0.00 | 76.01 | 0.02 | 0.81 | 0.09 | 49.43 | 3.93 | 0.00 | ... | 14.25 | 10.21 | 0.89 | 1.91 | 3.05 | 0.03 | 11.61 | 17.74 | 7.96 | 18.08 |
5 rows × 15673 columns
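The input table shown above has the cancer code as its first column, followed by one TPM column per gene (15672 genes plus one code column gives 15673 columns). The sketch below rebuilds that layout on toy data. `TOY_CANCER_CODE` and the sample values are hypothetical placeholders; the real mapping is the `CANCER_CODE` object imported from `compass.tokenizer`.

```python
import numpy as np
import pandas as pd

# Hypothetical code table used only for this illustration.
TOY_CANCER_CODE = {"TYPE_A": 0, "TYPE_B": 1}

sample_ids = ["s1", "s2", "s3"]
cancer_type = pd.Series(["TYPE_B", "TYPE_A", "TYPE_B"], index=sample_ids)
tpm = pd.DataFrame(
    np.random.default_rng(0).random((3, 4)),
    index=sample_ids,
    columns=["A1BG", "A1CF", "A2M", "A2ML1"],
)

# Map cancer types to integer codes and fail loudly on unknown types,
# since Series.map would otherwise leave NaN in the code column.
codes = cancer_type.map(TOY_CANCER_CODE)
if codes.isna().any():
    unknown = sorted(cancer_type[codes.isna()].unique())
    raise ValueError(f"unmapped cancer types: {unknown}")

# Cancer code first, then the gene columns, joined on the sample index.
dfcx_toy = codes.to_frame("cancer_code").join(tpm)
print(dfcx_toy)
```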
Feature Extraction
To obtain multi-scale representations from the COMPASS model, use the extract() function.
This function returns features across three hierarchical levels—gene, gene-set, and concept—providing interpretable biological representations that can be directly used for downstream analyses such as visualization, clustering, or predictive modeling.
# Extract multi-level features (gene, gene-set, and concept)
dfg, dfgs, dfct = finetuner.extract(dfcx, batch_size=128, with_gene_level=True)
Outputs:
dfg: gene-level features (fine-grained transcriptomic representations)
dfgs: gene-set–level features (aggregated biological program representations)
dfct: concept-level features (44 high-level TIME concept embeddings)
These three feature layers capture complementary biological information: the gene-level captures raw expression patterns, the gene-set–level represents modular immune or pathway activities, and the concept-level provides condensed, interpretable abstractions that reflect the tumor immune microenvironment.
dfg, dfgs, dfct = finetuner.extract(dfcx, batch_size=128, with_gene_level=True)
100%|##################################################################################################| 1/1 [00:04<00:00, 4.59s/it]
## Gene score
dfg.head()
| Index | A1BG | A1CF | A2M | A2ML1 | A4GALT | A4GNT | AAAS | AACS | AADAC | AADAT | ... | ZWILCH | ZWINT | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1_ipiPD1_PRE | 1.169039 | 2.449609 | 1.683879 | -0.084830 | 1.836729 | 1.253685 | 1.150858 | 0.734530 | 2.245176 | 0.619049 | ... | 1.499457 | 1.419190 | 0.597211 | 1.421927 | 0.878273 | 0.670976 | 1.473282 | 0.575315 | 1.169071 | 1.124358 |
| 2_ipiPD1_PRE | 1.204497 | 2.445823 | 1.605537 | -0.086021 | 1.859456 | 1.249328 | 1.154228 | 0.801568 | 2.243999 | 0.814586 | ... | 1.624799 | 1.293046 | 0.547714 | 1.515578 | 0.933641 | 0.661009 | 1.387423 | 0.428381 | 1.191961 | 1.042395 |
| 6_ipiPD1_PRE | 1.149980 | 2.454264 | 1.673221 | -0.075696 | 1.809510 | 1.261607 | 1.161912 | 0.818099 | 2.248315 | 0.413970 | ... | 1.287323 | 1.538255 | 0.584047 | 1.491358 | 0.878151 | 0.675346 | 1.490548 | 0.713301 | 1.179227 | 1.139689 |
| 7_ipiPD1_PRE | 1.015749 | 2.459732 | 1.698870 | -0.064840 | 1.760481 | 1.270881 | 1.260959 | 0.946761 | 2.251693 | 0.520068 | ... | 1.252374 | 1.540855 | 0.567637 | 1.598653 | 0.880143 | 0.681430 | 1.545337 | 0.622909 | 1.086994 | 1.170458 |
| 8_ipiPD1_PRE | 1.161852 | 2.439999 | 1.674632 | -0.095933 | 1.837104 | 1.244512 | 1.147320 | 0.684741 | 2.240523 | 0.786413 | ... | 1.570173 | 1.233923 | 0.614341 | 1.385503 | 0.877459 | 0.655807 | 1.229304 | 0.502371 | 1.169204 | 0.969461 |
5 rows × 15672 columns
## Geneset score
dfgs.head()
| Index | CANCER | Bcell_l_Danaher17 | Bcell_sc | MemBcell_sc | NaiBcell_sc | Plasma_sc | CD4Tcell_Combes22 | CD4Tcell_IL2_I_Kaptein2022 | CD4Tcell_sc | Th17CD4Tcell_sc | ... | Cell_cycle | Cell_cycle_reg | Nucleotide_excision_repair | Fanconi_anemia | Homologous_recombination | Base_excision_repair | APOBEC_set | Ubiquitous_immune_sc | Ubiquitous_sc | Reference_NanoString09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1_ipiPD1_PRE | -0.352521 | 0.432707 | 0.420027 | 1.172041 | -0.287325 | 1.043323 | 0.311138 | 1.255695 | 1.072178 | 0.189012 | ... | 0.161582 | 0.926270 | 0.573393 | 1.477544 | 1.117383 | 0.104920 | 1.072586 | 1.210557 | 1.071424 | 1.108158 |
| 2_ipiPD1_PRE | -0.352638 | 0.333078 | 0.021753 | 1.508070 | -0.278033 | 0.638910 | 0.289371 | 1.213502 | 1.044827 | 0.233261 | ... | 0.197946 | 0.525268 | 0.553203 | 1.431661 | 1.417456 | 0.106863 | 1.271247 | 1.218733 | 1.066589 | 1.126129 |
| 6_ipiPD1_PRE | -0.352322 | 0.316112 | -0.060261 | 1.218292 | -0.240613 | 0.696811 | 0.261421 | 1.241264 | 1.179401 | 0.187190 | ... | 0.161809 | 0.849286 | 0.646948 | 1.584741 | -0.297700 | 0.172258 | 1.180708 | 1.210495 | 1.040480 | 1.085420 |
| 7_ipiPD1_PRE | -0.352178 | 0.459110 | 0.505379 | 1.238853 | -0.268465 | 0.883710 | 0.355863 | 1.255342 | 1.038085 | 0.193928 | ... | 0.134883 | 0.508664 | 0.698877 | 1.592361 | -0.438019 | 0.267424 | 1.022070 | 1.184555 | 1.001112 | 1.060885 |
| 8_ipiPD1_PRE | -0.352844 | 0.276204 | -0.098767 | 1.260277 | -0.283390 | 0.649083 | 0.504307 | 1.152476 | 1.049000 | 0.161744 | ... | 0.175086 | 0.563415 | 0.483402 | 1.396649 | 1.585068 | 0.100808 | 0.938920 | 1.226078 | 1.109367 | 1.140830 |
5 rows × 133 columns
## Concept score
dfct.head()
| Index | CANCER | Bcell_general | Memory_Bcell | Naive_Bcell | Plasma_cell | CD4_Tcell | CD8_Tcell | Memory_Tcell | Naive_Tcell | Tcell_general | ... | Pancreatic | Pneumocyte | Apoptosis_pathway | IFNg_pathway | TGFb_pathway | Cytokine | Cell_proliferation | TLS | Genome_integrity | Reference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1_ipiPD1_PRE | -0.352521 | 0.429619 | 1.172041 | -0.287325 | 1.043323 | 0.657120 | 0.831903 | 0.935066 | 0.600297 | 0.694314 | ... | 1.087187 | 0.510859 | 0.934440 | 0.604341 | 0.612377 | 0.757836 | 0.524822 | 0.636157 | 0.844297 | 1.179961 |
| 2_ipiPD1_PRE | -0.352638 | 0.257252 | 1.508070 | -0.278033 | 0.638910 | 0.657569 | 0.841095 | 1.014496 | 1.496784 | 0.804392 | ... | 1.018678 | 0.650432 | 0.905410 | 0.608132 | 0.731752 | 0.730579 | 0.528674 | 0.794810 | 0.884266 | 1.188540 |
| 6_ipiPD1_PRE | -0.352322 | 0.224443 | 1.218292 | -0.240613 | 0.696811 | 0.672161 | 0.904819 | 1.076472 | 1.095704 | 0.797161 | ... | 1.049442 | 0.478319 | 0.962430 | 0.699676 | 0.628745 | 0.764214 | 0.524610 | 0.584705 | 0.779095 | 1.173117 |
| 7_ipiPD1_PRE | -0.352178 | 0.470379 | 1.238853 | -0.268465 | 0.883710 | 0.656042 | 0.919244 | 0.963916 | 0.587241 | 0.805069 | ... | 0.996232 | 0.445596 | 1.070265 | 0.785052 | 0.607137 | 0.797074 | 0.545194 | 0.647408 | 0.764777 | 1.146126 |
| 8_ipiPD1_PRE | -0.352844 | 0.184876 | 1.260277 | -0.283390 | 0.649083 | 0.635713 | 0.932623 | 1.168944 | 1.035743 | 0.771475 | ... | 1.134683 | 0.515614 | 0.831206 | 0.677245 | 0.563624 | 0.742530 | 0.589367 | 0.682504 | 0.836550 | 1.200522 |
5 rows × 44 columns
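A table like the concept scores above (one row per sample, one column per concept) can be passed directly to standard tools for visualization or clustering. As one possible illustration, the sketch below projects such a table onto its first two principal components using only NumPy. The data are random numbers with the same shape as the concept table (73 × 44), and `pca_2d` is a hypothetical helper, not part of COMPASS.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in with the same shape as the concept table (73 x 44).
dfct_toy = pd.DataFrame(
    rng.normal(size=(73, 44)),
    index=[f"sample_{i}" for i in range(73)],
    columns=[f"concept_{j}" for j in range(44)],
)


def pca_2d(features: pd.DataFrame):
    """Return 2-D PCA coordinates and the variance ratio of both components."""
    # Standardize each column; constant columns keep a unit divisor.
    centered = features - features.mean(axis=0)
    std = features.std(axis=0, ddof=0).replace(0, 1.0)
    z = (centered / std).to_numpy()
    # SVD of the standardized matrix: principal-component scores are U * S.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    coords = pd.DataFrame(u[:, :2] * s[:2], index=features.index, columns=["PC1", "PC2"])
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:2]


pcs, ratio = pca_2d(dfct_toy)
print(pcs.head())
print("explained variance ratio:", ratio)
```

With real features, `pcs` could then be plotted and colored by response label to inspect whether responders and non-responders separate.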
Feature Projection
If you want to obtain the extracted features in a compact vector format (dim = 32), you can use the project() function.
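This function computes an aggregated embedding for each sample and each gene set or concept. As the outputs below show, every row is indexed as sample$$gene-set (or sample$$concept) and holds 32 channels (channel_0 to channel_31). These low-dimensional feature vectors are suitable for downstream modeling (e.g., logistic regression, clustering, or visualization).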
# Project features into compact 32-dimensional vectors
dfgs_vector, dfct_vector = finetuner.project(dfcx, batch_size=128)
Outputs:
dfgs_vector: vectorized gene-set–level representations
dfct_vector: vectorized concept–level representations
These embeddings summarize each patient’s transcriptomic profile in a biologically structured manner, allowing direct integration into machine learning or survival analysis pipelines while preserving interpretability.
dfgs_vector, dfct_vector = finetuner.project(dfcx, batch_size=128)
100%|##################################################################################################| 1/1 [00:06<00:00, 6.39s/it]
dfgs_vector.head()
| | channel_0 | channel_1 | channel_2 | channel_3 | channel_4 | channel_5 | channel_6 | channel_7 | channel_8 | channel_9 | ... | channel_22 | channel_23 | channel_24 | channel_25 | channel_26 | channel_27 | channel_28 | channel_29 | channel_30 | channel_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1_ipiPD1_PRE$$Bcell_l_Danaher17 | -0.022404 | -0.329064 | -0.581034 | 0.559876 | 0.048853 | 0.345407 | 0.528212 | 0.339944 | 0.185740 | -0.403372 | ... | -0.051746 | -0.292742 | -0.732242 | 0.379193 | -1.078328 | 0.726058 | 0.249185 | -0.282801 | 0.876339 | 0.382448 |
| 1_ipiPD1_PRE$$Bcell_sc | -0.257568 | -0.178850 | -0.861084 | 0.499427 | -0.070942 | 0.378438 | 0.455010 | 0.491471 | 0.079223 | -0.417954 | ... | 0.114543 | 0.023030 | -0.735174 | 0.085576 | -1.097540 | 0.704446 | 0.095132 | -0.160042 | 0.710576 | 0.542211 |
| 1_ipiPD1_PRE$$MemBcell_sc | -0.024015 | -0.401102 | -0.857993 | 1.001392 | -0.120416 | 0.688277 | 0.762714 | 0.626146 | 0.424701 | -0.524912 | ... | 0.132035 | -0.080913 | -0.799368 | 0.098386 | -0.998861 | 0.419123 | 0.029008 | 0.088613 | 1.047200 | 0.369479 |
| 1_ipiPD1_PRE$$NaiBcell_sc | 0.104277 | -0.500872 | -0.786505 | 0.899744 | -0.089183 | 0.598024 | 0.848580 | 0.829809 | 0.468452 | -0.461011 | ... | 0.151107 | 0.107057 | -0.839693 | 0.293571 | -0.956561 | 0.463334 | -0.044371 | -0.032138 | 0.928795 | 0.298198 |
| 1_ipiPD1_PRE$$Plasma_sc | 0.231356 | -0.391666 | -0.884502 | 0.966592 | -0.074167 | 0.647400 | 0.613974 | 0.679338 | 0.501331 | -0.490012 | ... | 0.143821 | -0.216276 | -0.551561 | 0.106869 | -1.076590 | 0.379770 | -0.028028 | -0.018039 | 1.052244 | 0.317931 |
5 rows × 32 columns
dfct_vector.head()
| | channel_0 | channel_1 | channel_2 | channel_3 | channel_4 | channel_5 | channel_6 | channel_7 | channel_8 | channel_9 | ... | channel_22 | channel_23 | channel_24 | channel_25 | channel_26 | channel_27 | channel_28 | channel_29 | channel_30 | channel_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1_ipiPD1_PRE$$Bcell_general | -0.079680 | -0.292478 | -0.649243 | 0.545153 | 0.019675 | 0.353452 | 0.510383 | 0.376850 | 0.159797 | -0.406923 | ... | -0.011245 | -0.215833 | -0.732956 | 0.307680 | -1.083008 | 0.720794 | 0.211664 | -0.252902 | 0.835966 | 0.421360 |
| 1_ipiPD1_PRE$$Memory_Bcell | -0.024015 | -0.401102 | -0.857993 | 1.001392 | -0.120416 | 0.688277 | 0.762714 | 0.626146 | 0.424701 | -0.524912 | ... | 0.132035 | -0.080913 | -0.799368 | 0.098386 | -0.998861 | 0.419123 | 0.029008 | 0.088613 | 1.047200 | 0.369479 |
| 1_ipiPD1_PRE$$Naive_Bcell | 0.104277 | -0.500872 | -0.786505 | 0.899744 | -0.089183 | 0.598024 | 0.848580 | 0.829809 | 0.468452 | -0.461011 | ... | 0.151107 | 0.107057 | -0.839693 | 0.293571 | -0.956561 | 0.463334 | -0.044371 | -0.032138 | 0.928795 | 0.298198 |
| 1_ipiPD1_PRE$$Plasma_cell | 0.231356 | -0.391666 | -0.884502 | 0.966592 | -0.074167 | 0.647400 | 0.613974 | 0.679338 | 0.501331 | -0.490012 | ... | 0.143821 | -0.216276 | -0.551561 | 0.106869 | -1.076590 | 0.379770 | -0.028028 | -0.018039 | 1.052244 | 0.317931 |
| 1_ipiPD1_PRE$$CD4_Tcell | 0.090176 | -0.373335 | -0.712891 | 0.927668 | 0.035459 | 0.473183 | 0.775224 | 0.655536 | 0.241735 | -0.399517 | ... | 0.096202 | -0.191678 | -0.722979 | 0.246003 | -0.943080 | 0.504722 | 0.154668 | -0.042344 | 0.815687 | 0.335990 |
5 rows × 32 columns
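In the projected tables above, each row label combines a sample ID and a gene-set or concept name, joined by `$$`. Most downstream models expect one row per sample, so the index usually has to be split first. The sketch below shows one way to do this on synthetic data that only mimics the displayed layout. `split_index` is a hypothetical helper, and the choice between flattening all vectors and selecting a single concept is left to the analysis.

```python
import numpy as np
import pandas as pd

# Synthetic table mimicking the displayed layout: "<sample>$$<feature>" rows, 32 channels.
samples = ["s1", "s2", "s3"]
features = ["Bcell_general", "CD4_Tcell"]
channels = [f"channel_{i}" for i in range(32)]
rng = np.random.default_rng(0)
df_vec = pd.DataFrame(
    rng.normal(size=(len(samples) * len(features), 32)),
    index=[f"{s}$${f}" for s in samples for f in features],
    columns=channels,
)


def split_index(df: pd.DataFrame, sep: str = "$$") -> pd.DataFrame:
    """Turn '<sample><sep><feature>' row labels into a two-level index."""
    out = df.copy()
    out.index = pd.MultiIndex.from_tuples(
        [tuple(label.split(sep, 1)) for label in df.index],
        names=["sample", "feature"],
    )
    return out


long_df = split_index(df_vec)

# Option 1: one row per sample, all feature vectors concatenated side by side.
wide = long_df.unstack("feature")
wide.columns = [f"{feat}|{chan}" for chan, feat in wide.columns]

# Option 2: the 32-dimensional vector of a single concept for every sample.
cd4 = long_df.xs("CD4_Tcell", level="feature")

print(wide.shape, cd4.shape)
```

The wide table grows as 32 times the number of gene sets or concepts, so some form of feature selection or regularization is likely needed for a small cohort.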