Introduction

This offline tutorial corresponds to the feature-extraction module available at https://www.immuno-compass.com/extract/.

COMPASS serves a dual purpose: it functions both as a response predictor and as a biologically grounded feature extractor. As a feature extractor, COMPASS captures multi-scale representations of transcriptomic data at the gene, gene-set, and concept levels. These hierarchical features support diverse downstream analyses, such as predicting patient survival with Cox proportional hazards models or building logistic regression models for immunotherapy response prediction.

COMPASS provides two core functions for generating these representations:

  1. extract() — returns the full multi-level feature matrices (gene, gene-set, and concept).

    dfg, dfgs, dfct = finetuner.extract(dfcx, batch_size=128, with_gene_level=True)
    

    This function produces three outputs corresponding to gene-level, gene-set-level, and concept-level features.

  2. project() — returns compact, vectorized embeddings of the gene-set and concept levels.

    dfgs_vector, dfct_vector = finetuner.project(dfcx, batch_size=128)
    

    Use this function when you need each gene-set or concept feature as a 32-dimensional vector (dim = 32), a format suited to downstream machine-learning models or statistical analyses.

Together, these functions make COMPASS a flexible interface for bridging biological interpretability and computational modeling, providing a standardized pipeline for obtaining structured feature embeddings across multiple biological scales.

In [1]:
from compass import PreTrainer, FineTuner, loadcompass
from compass.utils import plot_embed_with_label, plot_performance, score2
from compass.tokenizer import CANCER_CODE
In [2]:
import os
from tqdm import tqdm
from itertools import chain
import pandas as pd
import numpy as np
import random, torch
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = 'white', font_scale=1.3)
import warnings
warnings.filterwarnings("ignore")

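def onehot(S):
    """One-hot encode a Series; rows with a missing label become all-NaN."""
    assert isinstance(S, pd.Series), 'Input type should be pd.Series'
    # Cast to float first so NaN can be assigned without implicit upcasting
    dfd = pd.get_dummies(S).astype(float)
    # Mark samples without a label as missing rather than all-zero
    dfd.loc[S.isna(), :] = np.nan
    # Order columns by class frequency, most common class first
    cols = dfd.sum().sort_values(ascending=False).index.tolist()
    dfd = dfd[cols]
    return dfd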
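
To make the behavior of the onehot helper concrete, here is a small sketch on a toy label Series. The labels and sample IDs are made up for illustration, and the helper is repeated in a compact form so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def onehot(S):
    """One-hot encode a Series; rows with a missing label become all-NaN."""
    dfd = pd.get_dummies(S).astype(float)
    dfd.loc[S.isna(), :] = np.nan
    cols = dfd.sum().sort_values(ascending=False).index.tolist()
    return dfd[cols]

# Toy response labels (hypothetical); "p3" has no recorded label.
labels = pd.Series(["R", "NR", np.nan, "NR"], index=["p1", "p2", "p3", "p4"])
encoded = onehot(labels)
print(encoded)
```

The most frequent class ends up in the first column, and the unlabeled sample is kept as a row of NaN instead of being silently encoded as all zeros.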

Download finetuned model

Download the finetuner models from here

In [3]:
## Load a finetuner; you can load any of the finetuners.
## Here we load pft_leave_Gide.pt, the finetuner trained without the Gide cohort, to test performance on the Gide cohort:

finetuner = loadcompass('./tmpignore/pft_leave_Gide.pt', map_location = 'cpu')

## read data
df_label = pd.read_pickle('./tmpignore/ITRP.PATIENT.TABLE')
df_tpm = pd.read_pickle('./tmpignore/ITRP.TPM.TABLE')

df_label = df_label[df_label.cohort == 'Gide']
df_tpm = df_tpm.loc[df_label.index]

df_tpm.shape, df_label.shape
Out[3]:
((73, 15672), (73, 110))

Prepare model inputs

In [4]:
dfcx = df_label.cancer_type.map(CANCER_CODE).to_frame('cancer_code').join(df_tpm)
df_task = onehot(df_label.response_label)
dfcx.head()
Out[4]:
cancer_code A1BG A1CF A2M A2ML1 A4GALT A4GNT AAAS AACS AADAC ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
Index
1_ipiPD1_PRE 25 5.23 0.02 82.96 0.10 0.75 0.03 27.57 3.23 0.04 ... 10.48 3.47 0.70 1.63 2.43 0.05 2.98 10.81 6.30 4.01
2_ipiPD1_PRE 25 7.39 0.00 1154.40 0.00 0.95 0.03 48.91 2.10 0.01 ... 17.20 7.46 0.44 0.79 5.81 0.00 5.02 37.27 13.47 8.14
6_ipiPD1_PRE 25 3.91 0.00 168.14 0.11 0.52 0.01 18.20 2.08 0.00 ... 4.73 1.54 0.57 1.06 1.81 0.01 2.79 4.11 6.77 3.74
7_ipiPD1_PRE 25 1.85 0.01 80.62 0.00 0.21 0.03 4.82 0.84 0.06 ... 4.07 1.58 0.44 0.39 0.87 0.00 2.00 8.44 2.20 2.92
8_ipiPD1_PRE 25 5.39 0.00 76.01 0.02 0.81 0.09 49.43 3.93 0.00 ... 14.25 10.21 0.89 1.91 3.05 0.03 11.61 17.74 7.96 18.08

5 rows × 15673 columns
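
The input assembly above relies on two pandas behaviors: map() turns unknown cancer types into NaN, and join() aligns rows by index rather than by position. The following sketch illustrates both on toy tables. CANCER_CODE_DEMO and all gene, cancer-type, and sample names are hypothetical stand-ins; the real mapping comes from compass.tokenizer.CANCER_CODE.

```python
import pandas as pd

# Hypothetical mapping for illustration only; the tutorial uses
# compass.tokenizer.CANCER_CODE instead.
CANCER_CODE_DEMO = {"SKCM": 0, "LUAD": 1}

df_label_demo = pd.DataFrame(
    {"cancer_type": ["SKCM", "LUAD", "SKCM"]},
    index=["p1", "p2", "p3"],
)
df_tpm_demo = pd.DataFrame(
    {"GENE_A": [5.2, 0.0, 1.1], "GENE_B": [0.3, 7.9, 2.4]},
    index=["p3", "p1", "p2"],  # deliberately in a different row order
)

# Map cancer types to integer codes and fail early on unmapped types.
codes = df_label_demo.cancer_type.map(CANCER_CODE_DEMO)
assert codes.notna().all(), "unmapped cancer type"

# join() aligns on the sample index, so the row order of df_tpm_demo does not matter.
dfcx_demo = codes.to_frame("cancer_code").join(df_tpm_demo)
print(dfcx_demo)
```

The resulting table has cancer_code as its first column followed by one column per gene, which is the layout shown in Out[4].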

Feature Extraction

To obtain multi-scale representations from the COMPASS model, use the extract() function. This function returns features across three hierarchical levels—gene, gene-set, and concept—providing interpretable biological representations that can be directly used for downstream analyses such as visualization, clustering, or predictive modeling.

# Extract multi-level features (gene, gene-set, and concept)
dfg, dfgs, dfct = finetuner.extract(dfcx, batch_size=128, with_gene_level=True)

Outputs:

  • dfg — gene-level features (fine-grained transcriptomic representations)
  • dfgs — gene-set–level features (aggregated biological program representations)
  • dfct — concept-level features (44 high-level TIME concept embeddings)

These three feature layers capture complementary biological information: the gene-level captures raw expression patterns, the gene-set–level represents modular immune or pathway activities, and the concept-level provides condensed, interpretable abstractions that reflect the tumor immune microenvironment.
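
Before passing the three tables to downstream code, a quick consistency check can catch misaligned inputs. The sketch below uses a hypothetical helper, check_feature_tables, that is not part of COMPASS. It assumes that each output keeps the sample index of the input and that the gene-level columns match the input genes, which is consistent with the outputs shown in this tutorial (15673 input columns including cancer_code, 15672 gene-level columns) but is not a documented guarantee. The demo tables are random stand-ins.

```python
import numpy as np
import pandas as pd

def check_feature_tables(dfcx, dfg, dfgs, dfct):
    """Hypothetical helper: basic consistency checks on extract()-style outputs."""
    for name, df in {"dfg": dfg, "dfgs": dfgs, "dfct": dfct}.items():
        # Assumption: one row per input sample, in the same order as the input.
        assert df.index.equals(dfcx.index), f"{name}: sample index differs from input"
        assert not df.isna().any().any(), f"{name}: contains missing values"
    # Assumption: gene-level columns equal the input columns after cancer_code.
    assert list(dfg.columns) == list(dfcx.columns[1:]), "dfg: gene columns differ from input"
    return {
        "n_samples": len(dfcx),
        "n_genes": dfg.shape[1],
        "n_genesets": dfgs.shape[1],
        "n_concepts": dfct.shape[1],
    }

# Random stand-ins with made-up names, used only to exercise the helper.
rng = np.random.default_rng(0)
samples = ["s1", "s2", "s3"]
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
dfcx_demo = pd.DataFrame(rng.random((3, 4)), index=samples, columns=genes)
dfcx_demo.insert(0, "cancer_code", 0)
dfg_demo = pd.DataFrame(rng.normal(size=(3, 4)), index=samples, columns=genes)
dfgs_demo = pd.DataFrame(rng.normal(size=(3, 3)), index=samples, columns=["set_1", "set_2", "set_3"])
dfct_demo = pd.DataFrame(rng.normal(size=(3, 2)), index=samples, columns=["concept_1", "concept_2"])

summary = check_feature_tables(dfcx_demo, dfg_demo, dfgs_demo, dfct_demo)
print(summary)
```

With the real tables from this tutorial, the call would be check_feature_tables(dfcx, dfg, dfgs, dfct).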

In [6]:
dfg, dfgs, dfct = finetuner.extract(dfcx, batch_size=128, with_gene_level=True)
100%|##################################################################################################| 1/1 [00:04<00:00,  4.59s/it]
In [7]:
## Gene score
dfg.head()
Out[7]:
A1BG A1CF A2M A2ML1 A4GALT A4GNT AAAS AACS AADAC AADAT ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
Index
1_ipiPD1_PRE 1.169039 2.449609 1.683879 -0.084830 1.836729 1.253685 1.150858 0.734530 2.245176 0.619049 ... 1.499457 1.419190 0.597211 1.421927 0.878273 0.670976 1.473282 0.575315 1.169071 1.124358
2_ipiPD1_PRE 1.204497 2.445823 1.605537 -0.086021 1.859456 1.249328 1.154228 0.801568 2.243999 0.814586 ... 1.624799 1.293046 0.547714 1.515578 0.933641 0.661009 1.387423 0.428381 1.191961 1.042395
6_ipiPD1_PRE 1.149980 2.454264 1.673221 -0.075696 1.809510 1.261607 1.161912 0.818099 2.248315 0.413970 ... 1.287323 1.538255 0.584047 1.491358 0.878151 0.675346 1.490548 0.713301 1.179227 1.139689
7_ipiPD1_PRE 1.015749 2.459732 1.698870 -0.064840 1.760481 1.270881 1.260959 0.946761 2.251693 0.520068 ... 1.252374 1.540855 0.567637 1.598653 0.880143 0.681430 1.545337 0.622909 1.086994 1.170458
8_ipiPD1_PRE 1.161852 2.439999 1.674632 -0.095933 1.837104 1.244512 1.147320 0.684741 2.240523 0.786413 ... 1.570173 1.233923 0.614341 1.385503 0.877459 0.655807 1.229304 0.502371 1.169204 0.969461

5 rows × 15672 columns

In [8]:
## Geneset score
dfgs.head()
Out[8]:
CANCER Bcell_l_Danaher17 Bcell_sc MemBcell_sc NaiBcell_sc Plasma_sc CD4Tcell_Combes22 CD4Tcell_IL2_I_Kaptein2022 CD4Tcell_sc Th17CD4Tcell_sc ... Cell_cycle Cell_cycle_reg Nucleotide_excision_repair Fanconi_anemia Homologous_recombination Base_excision_repair APOBEC_set Ubiquitous_immune_sc Ubiquitous_sc Reference_NanoString09
Index
1_ipiPD1_PRE -0.352521 0.432707 0.420027 1.172041 -0.287325 1.043323 0.311138 1.255695 1.072178 0.189012 ... 0.161582 0.926270 0.573393 1.477544 1.117383 0.104920 1.072586 1.210557 1.071424 1.108158
2_ipiPD1_PRE -0.352638 0.333078 0.021753 1.508070 -0.278033 0.638910 0.289371 1.213502 1.044827 0.233261 ... 0.197946 0.525268 0.553203 1.431661 1.417456 0.106863 1.271247 1.218733 1.066589 1.126129
6_ipiPD1_PRE -0.352322 0.316112 -0.060261 1.218292 -0.240613 0.696811 0.261421 1.241264 1.179401 0.187190 ... 0.161809 0.849286 0.646948 1.584741 -0.297700 0.172258 1.180708 1.210495 1.040480 1.085420
7_ipiPD1_PRE -0.352178 0.459110 0.505379 1.238853 -0.268465 0.883710 0.355863 1.255342 1.038085 0.193928 ... 0.134883 0.508664 0.698877 1.592361 -0.438019 0.267424 1.022070 1.184555 1.001112 1.060885
8_ipiPD1_PRE -0.352844 0.276204 -0.098767 1.260277 -0.283390 0.649083 0.504307 1.152476 1.049000 0.161744 ... 0.175086 0.563415 0.483402 1.396649 1.585068 0.100808 0.938920 1.226078 1.109367 1.140830

5 rows × 133 columns

In [9]:
## Concept score
dfct.head()
Out[9]:
CANCER Bcell_general Memory_Bcell Naive_Bcell Plasma_cell CD4_Tcell CD8_Tcell Memory_Tcell Naive_Tcell Tcell_general ... Pancreatic Pneumocyte Apoptosis_pathway IFNg_pathway TGFb_pathway Cytokine Cell_proliferation TLS Genome_integrity Reference
Index
1_ipiPD1_PRE -0.352521 0.429619 1.172041 -0.287325 1.043323 0.657120 0.831903 0.935066 0.600297 0.694314 ... 1.087187 0.510859 0.934440 0.604341 0.612377 0.757836 0.524822 0.636157 0.844297 1.179961
2_ipiPD1_PRE -0.352638 0.257252 1.508070 -0.278033 0.638910 0.657569 0.841095 1.014496 1.496784 0.804392 ... 1.018678 0.650432 0.905410 0.608132 0.731752 0.730579 0.528674 0.794810 0.884266 1.188540
6_ipiPD1_PRE -0.352322 0.224443 1.218292 -0.240613 0.696811 0.672161 0.904819 1.076472 1.095704 0.797161 ... 1.049442 0.478319 0.962430 0.699676 0.628745 0.764214 0.524610 0.584705 0.779095 1.173117
7_ipiPD1_PRE -0.352178 0.470379 1.238853 -0.268465 0.883710 0.656042 0.919244 0.963916 0.587241 0.805069 ... 0.996232 0.445596 1.070265 0.785052 0.607137 0.797074 0.545194 0.647408 0.764777 1.146126
8_ipiPD1_PRE -0.352844 0.184876 1.260277 -0.283390 0.649083 0.635713 0.932623 1.168944 1.035743 0.771475 ... 1.134683 0.515614 0.831206 0.677245 0.563624 0.742530 0.589367 0.682504 0.836550 1.200522

5 rows × 44 columns
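
As an illustration of the downstream modeling mentioned in the introduction, the sketch below fits a logistic regression on a concept-level matrix. It is a minimal example under stated assumptions: the feature table and the response labels are synthetic stand-ins for dfct and the response column of df_task, the concept names are hypothetical, and the model is fitted with plain gradient descent in NumPy rather than with any COMPASS routine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the concept-level matrix (samples x concepts).
n_samples, n_concepts = 80, 6
concepts = [f"concept_{i}" for i in range(n_concepts)]  # hypothetical names
dfct_demo = pd.DataFrame(rng.normal(size=(n_samples, n_concepts)), columns=concepts)

# Synthetic binary response driven by the first two concepts plus noise.
signal = 2.0 * dfct_demo["concept_0"] - 1.5 * dfct_demo["concept_1"]
y = (signal + rng.normal(scale=0.5, size=n_samples) > 0).astype(float).to_numpy()

# Standardize the features, then prepend an intercept column.
X = (dfct_demo - dfct_demo.mean()) / dfct_demo.std()
X = np.column_stack([np.ones(n_samples), X.to_numpy()])

# Fit logistic regression by gradient descent on the mean log-loss.
w = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / n_samples

pred = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(float)
train_accuracy = (pred == y).mean()

# Rank concepts by absolute coefficient size.
coef = pd.Series(w[1:], index=concepts).sort_values(key=np.abs, ascending=False)
print(f"training accuracy: {train_accuracy:.2f}")
print(coef)
```

Training accuracy is an optimistic estimate; with real data, performance should be measured on a held-out cohort, as this tutorial does by loading a finetuner trained without the Gide cohort.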

Feature Projection

To obtain the extracted features as compact vectors (dim = 32), use the project() function. It returns one 32-dimensional embedding per sample and feature at the gene-set and concept levels; in the outputs shown further down, rows are indexed as sample$$feature and columns run from channel_0 to channel_31. These low-dimensional vectors are suitable for downstream modeling (e.g., logistic regression, clustering, or visualization).

# Project features into compact 32-dimensional vectors
dfgs_vector, dfct_vector = finetuner.project(dfcx, batch_size=128)

Outputs:

  • dfgs_vector — vectorized gene-set–level representations
  • dfct_vector — vectorized concept–level representations

These embeddings summarize each patient’s transcriptomic profile in a biologically structured manner, allowing direct integration into machine learning or survival analysis pipelines while preserving interpretability.

In [10]:
dfgs_vector, dfct_vector = finetuner.project(dfcx, batch_size = 128)
100%|##################################################################################################| 1/1 [00:06<00:00,  6.39s/it]
In [11]:
dfgs_vector.head()
Out[11]:
channel_0 channel_1 channel_2 channel_3 channel_4 channel_5 channel_6 channel_7 channel_8 channel_9 ... channel_22 channel_23 channel_24 channel_25 channel_26 channel_27 channel_28 channel_29 channel_30 channel_31
1_ipiPD1_PRE$$Bcell_l_Danaher17 -0.022404 -0.329064 -0.581034 0.559876 0.048853 0.345407 0.528212 0.339944 0.185740 -0.403372 ... -0.051746 -0.292742 -0.732242 0.379193 -1.078328 0.726058 0.249185 -0.282801 0.876339 0.382448
1_ipiPD1_PRE$$Bcell_sc -0.257568 -0.178850 -0.861084 0.499427 -0.070942 0.378438 0.455010 0.491471 0.079223 -0.417954 ... 0.114543 0.023030 -0.735174 0.085576 -1.097540 0.704446 0.095132 -0.160042 0.710576 0.542211
1_ipiPD1_PRE$$MemBcell_sc -0.024015 -0.401102 -0.857993 1.001392 -0.120416 0.688277 0.762714 0.626146 0.424701 -0.524912 ... 0.132035 -0.080913 -0.799368 0.098386 -0.998861 0.419123 0.029008 0.088613 1.047200 0.369479
1_ipiPD1_PRE$$NaiBcell_sc 0.104277 -0.500872 -0.786505 0.899744 -0.089183 0.598024 0.848580 0.829809 0.468452 -0.461011 ... 0.151107 0.107057 -0.839693 0.293571 -0.956561 0.463334 -0.044371 -0.032138 0.928795 0.298198
1_ipiPD1_PRE$$Plasma_sc 0.231356 -0.391666 -0.884502 0.966592 -0.074167 0.647400 0.613974 0.679338 0.501331 -0.490012 ... 0.143821 -0.216276 -0.551561 0.106869 -1.076590 0.379770 -0.028028 -0.018039 1.052244 0.317931

5 rows × 32 columns

In [12]:
dfct_vector.head()
Out[12]:
channel_0 channel_1 channel_2 channel_3 channel_4 channel_5 channel_6 channel_7 channel_8 channel_9 ... channel_22 channel_23 channel_24 channel_25 channel_26 channel_27 channel_28 channel_29 channel_30 channel_31
1_ipiPD1_PRE$$Bcell_general -0.079680 -0.292478 -0.649243 0.545153 0.019675 0.353452 0.510383 0.376850 0.159797 -0.406923 ... -0.011245 -0.215833 -0.732956 0.307680 -1.083008 0.720794 0.211664 -0.252902 0.835966 0.421360
1_ipiPD1_PRE$$Memory_Bcell -0.024015 -0.401102 -0.857993 1.001392 -0.120416 0.688277 0.762714 0.626146 0.424701 -0.524912 ... 0.132035 -0.080913 -0.799368 0.098386 -0.998861 0.419123 0.029008 0.088613 1.047200 0.369479
1_ipiPD1_PRE$$Naive_Bcell 0.104277 -0.500872 -0.786505 0.899744 -0.089183 0.598024 0.848580 0.829809 0.468452 -0.461011 ... 0.151107 0.107057 -0.839693 0.293571 -0.956561 0.463334 -0.044371 -0.032138 0.928795 0.298198
1_ipiPD1_PRE$$Plasma_cell 0.231356 -0.391666 -0.884502 0.966592 -0.074167 0.647400 0.613974 0.679338 0.501331 -0.490012 ... 0.143821 -0.216276 -0.551561 0.106869 -1.076590 0.379770 -0.028028 -0.018039 1.052244 0.317931
1_ipiPD1_PRE$$CD4_Tcell 0.090176 -0.373335 -0.712891 0.927668 0.035459 0.473183 0.775224 0.655536 0.241735 -0.399517 ... 0.096202 -0.191678 -0.722979 0.246003 -0.943080 0.504722 0.154668 -0.042344 0.815687 0.335990

5 rows × 32 columns
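
The projected tables are in long format: each row is one sample$$feature pair with 32 channel columns. Most modeling code expects one row per sample, so the sketch below shows two possible ways to reshape such a table. It uses a synthetic stand-in for dfct_vector with made-up sample IDs, and the mean pooling in the second option is an illustrative choice, not something COMPASS prescribes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for dfct_vector: rows indexed as "<sample>$$<feature>".
samples = ["s1", "s2", "s3"]                  # hypothetical sample IDs
concepts = ["Bcell_general", "CD4_Tcell"]     # two concept names from the output above
index = [f"{s}$${c}" for s in samples for c in concepts]
channels = [f"channel_{i}" for i in range(32)]
dfct_vector_demo = pd.DataFrame(
    rng.normal(size=(len(index), 32)), index=index, columns=channels
)

# Split the combined index into (sample, feature) levels.
pairs = [tuple(label.split("$$", 1)) for label in dfct_vector_demo.index]
long = dfct_vector_demo.copy()
long.index = pd.MultiIndex.from_tuples(pairs, names=["sample", "feature"])

# Option 1: one flat row per sample with n_features * 32 columns.
wide = long.unstack("feature")
wide.columns = [f"{feature}|{channel}" for channel, feature in wide.columns]

# Option 2: average the feature vectors of each sample into one 32-dim vector.
pooled = long.groupby(level="sample").mean()

print(wide.shape, pooled.shape)
```

The same steps apply to dfgs_vector; with the real tables, replace dfct_vector_demo with dfct_vector or dfgs_vector.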