Gene Network Analysis¶

This workbook will cover gene network analysis using large language models as an alternative to correlation analysis. Specifically, we will look at using scgpt, which is a foundation model for single-cell multi-omic data and has a wide range of capabilities.

We will use a model pretrained 10.3 million blood and bone marrow cells, since our dataset consists of B cells obtained from peripheral blood samples.

In [ ]:

Copied!





import copy
import json
import os
from pathlib import Path
import sys
import warnings
import gzip
import shutil

import torch
from anndata import AnnData
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pandas as pd
import tqdm
import gseapy as gp

from torchtext.vocab import Vocab
from torchtext._torchtext import (
    Vocab as VocabPybind,
)

sys.path.insert(0, "../")
import scgpt as scg
from scgpt.tasks import GeneEmbedding
from scgpt.tokenizer.gene_tokenizer import GeneVocab
from scgpt.model import TransformerModel
from scgpt.preprocess import Preprocessor
from scgpt.utils import set_seed 

os.environ["KMP_WARNINGS"] = "off"
warnings.filterwarnings('ignore')
import copy
import json
import os
from pathlib import Path
import sys
import warnings
import gzip
import shutil

import torch
from anndata import AnnData
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pandas as pd
import tqdm
import gseapy as gp

from torchtext.vocab import Vocab
from torchtext._torchtext import (
    Vocab as VocabPybind,
)

sys.path.insert(0, "../")
import scgpt as scg
from scgpt.tasks import GeneEmbedding
from scgpt.tokenizer.gene_tokenizer import GeneVocab
from scgpt.model import TransformerModel
from scgpt.preprocess import Preprocessor
from scgpt.utils import set_seed 

os.environ["KMP_WARNINGS"] = "off"
warnings.filterwarnings('ignore')

We need to set the parameters of the model for when we use it later in this workbook.

In [2]:

Copied!





set_seed(42)
pad_token = "<pad>"
special_tokens = [pad_token, "<cls>", "<eoc>"]
n_hvg = 1000
n_bins = 51
mask_value = -1
pad_value = -2
n_input_bins = n_bins
set_seed(42)
pad_token = ""
special_tokens = [pad_token, "", ""]
n_hvg = 1000
n_bins = 51
mask_value = -1
pad_value = -2
n_input_bins = n_bins

Lets go through the parameters so they are better understood:

set_seed is used to set the random seed for reproducibility.
pad_token is a special token used for padding sequences, to ensure that all sequences are of the same length. Therefore it will be inserted where necessary to make sequences of unequal lengths equal.
special_tokens is a list defining a set of special tokens the model may use, such as pad. Cls is commonly a classification token that represents the beginning of a sequence and is used to aggregate information for classification tasks. Eoc stands for 'end of converstion/content' and is used to denote the end of a sequence.
n_hvg is a variable representing the number of highly variable genes to be considered.
n_bins defines the number of bins to group continuous data into discrete intervals.
mask_value defines a value for masking. Masking is a technique to ignore certain parts of the data, often used in sequence models to indicate positions in a sequence that should be ignored during training or evaluation.
pad_value defines the value for padding sequences. Similar to mask_value, it indicates positions in the data that are not part of the original sequence and should be ignored.
n_input_bins assignment set the number of input bins ensuring consistency throughout the code.

Step 1: Load the Pre-Trained Model and Dataset¶

In [ ]:

Copied!





# Specify model path; here we load the pre-trained scGPT blood model
#You will have to directly download the model from here: https://github.com/bowang-lab/scGPT
model_dir = Path("/data/scgpt_blood_model")
model_config_file = model_dir / "args.json"
model_file = model_dir / "best_model.pt"
vocab_file = model_dir / "vocab.json"

vocab = GeneVocab.from_file(vocab_file)
for s in special_tokens:
    if s not in vocab:
        vocab.append_token(s)

# Retrieve model parameters from config files
with open(model_config_file, "r") as f:
    model_configs = json.load(f)
print(
    f"Resume model from {model_file}, the model args will override the "
    f"config {model_config_file}."
)
embsize = model_configs["embsize"]
nhead = model_configs["nheads"]
d_hid = model_configs["d_hid"]
nlayers = model_configs["nlayers"]
n_layers_cls = model_configs["n_layers_cls"]

gene2idx = vocab.get_stoi()
# Specify model path; here we load the pre-trained scGPT blood model
#You will have to directly download the model from here: https://github.com/bowang-lab/scGPT
model_dir = Path("/data/scgpt_blood_model")
model_config_file = model_dir / "args.json"
model_file = model_dir / "best_model.pt"
vocab_file = model_dir / "vocab.json"

vocab = GeneVocab.from_file(vocab_file)
for s in special_tokens:
    if s not in vocab:
        vocab.append_token(s)

# Retrieve model parameters from config files
with open(model_config_file, "r") as f:
    model_configs = json.load(f)
print(
    f"Resume model from {model_file}, the model args will override the "
    f"config {model_config_file}."
)
embsize = model_configs["embsize"]
nhead = model_configs["nheads"]
d_hid = model_configs["d_hid"]
nlayers = model_configs["nlayers"]
n_layers_cls = model_configs["n_layers_cls"]

gene2idx = vocab.get_stoi()

In [4]:

Copied!





#Set up Pytorch model to run on specified device (CPU or GPU), intialise the model with specific parameters, 
#load pre-trained weight, and then transfer the model to the chosen device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ntokens = len(vocab)  # size of vocabulary
model = TransformerModel(
    ntokens,
    embsize,
    nhead,
    d_hid,
    nlayers,
    vocab=vocab,
    pad_value=pad_value,
    n_input_bins=n_input_bins,
)

try:
    model.load_state_dict(torch.load(model_file))
    print(f"Loading all model params from {model_file}")
except:
    # only load params that are in the model and match the size
    model_dict = model.state_dict()
    pretrained_dict = torch.load(model_file)
    pretrained_dict = {
        k: v
        for k, v in pretrained_dict.items()
        if k in model_dict and v.shape == model_dict[k].shape
    }
    for k, v in pretrained_dict.items():
        print(f"Loading params {k} with shape {v.shape}")
        model_dict.update(pretrained_dict)
        model.load_state_dict(model_dict)

model.to(device)
#Set up Pytorch model to run on specified device (CPU or GPU), intialise the model with specific parameters, 
#load pre-trained weight, and then transfer the model to the chosen device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ntokens = len(vocab)  # size of vocabulary
model = TransformerModel(
    ntokens,
    embsize,
    nhead,
    d_hid,
    nlayers,
    vocab=vocab,
    pad_value=pad_value,
    n_input_bins=n_input_bins,
)

try:
    model.load_state_dict(torch.load(model_file))
    print(f"Loading all model params from {model_file}")
except:
    # only load params that are in the model and match the size
    model_dict = model.state_dict()
    pretrained_dict = torch.load(model_file)
    pretrained_dict = {
        k: v
        for k, v in pretrained_dict.items()
        if k in model_dict and v.shape == model_dict[k].shape
    }
    for k, v in pretrained_dict.items():
        print(f"Loading params {k} with shape {v.shape}")
        model_dict.update(pretrained_dict)
        model.load_state_dict(model_dict)

model.to(device)

Loading params encoder.embedding.weight with shape torch.Size([36574, 512])
Loading params encoder.enc_norm.weight with shape torch.Size([512])
Loading params encoder.enc_norm.bias with shape torch.Size([512])
Loading params value_encoder.linear1.weight with shape torch.Size([512, 1])
Loading params value_encoder.linear1.bias with shape torch.Size([512])
Loading params value_encoder.linear2.weight with shape torch.Size([512, 512])
Loading params value_encoder.linear2.bias with shape torch.Size([512])
Loading params value_encoder.norm.weight with shape torch.Size([512])
Loading params value_encoder.norm.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.0.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.0.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.0.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.1.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.1.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.1.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.2.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.2.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.2.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.3.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.3.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.3.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.4.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.4.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.4.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.5.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.5.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.5.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.6.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.6.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.6.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.7.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.7.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.7.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.8.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.8.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.8.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm2.weight with shape torch.Size([512])

Loading params transformer_encoder.layers.8.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.9.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.9.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.9.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.10.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.10.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.10.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.11.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.11.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.11.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm2.bias with shape torch.Size([512])
Loading params decoder.fc.0.weight with shape torch.Size([512, 512])
Loading params decoder.fc.0.bias with shape torch.Size([512])
Loading params decoder.fc.2.weight with shape torch.Size([512, 512])
Loading params decoder.fc.2.bias with shape torch.Size([512])
Loading params decoder.fc.4.weight with shape torch.Size([1, 512])
Loading params decoder.fc.4.bias with shape torch.Size([1])

Out[4]:

TransformerModel(
  (encoder): GeneEncoder(
    (embedding): Embedding(36574, 512, padding_idx=36571)
    (enc_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (value_encoder): ContinuousValueEncoder(
    (dropout): Dropout(p=0.5, inplace=False)
    (linear1): Linear(in_features=1, out_features=512, bias=True)
    (activation): ReLU()
    (linear2): Linear(in_features=512, out_features=512, bias=True)
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-11): 12 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=512, bias=True)
        (dropout): Dropout(p=0.5, inplace=False)
        (linear2): Linear(in_features=512, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.5, inplace=False)
        (dropout2): Dropout(p=0.5, inplace=False)
      )
    )
  )
  (decoder): ExprDecoder(
    (fc): Sequential(
      (0): Linear(in_features=512, out_features=512, bias=True)
      (1): LeakyReLU(negative_slope=0.01)
      (2): Linear(in_features=512, out_features=512, bias=True)
      (3): LeakyReLU(negative_slope=0.01)
      (4): Linear(in_features=512, out_features=1, bias=True)
    )
  )
  (cls_decoder): ClsDecoder(
    (_decoder): ModuleList(
      (0): Linear(in_features=512, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Linear(in_features=512, out_features=512, bias=True)
      (4): ReLU()
      (5): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (out_layer): Linear(in_features=512, out_features=1, bias=True)
  )
  (sim): Similarity(
    (cos): CosineSimilarity()
  )
  (creterion_cce): CrossEntropyLoss()
)

Load dataset of interest.

In [5]:

Copied!





# Paths to the compressed and uncompressed files
compressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_compressed.h5ad.gz'
uncompressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_uncompressed.h5ad'

# Decompress the file
with gzip.open(compressed_file, 'rb') as f_in:
    with open(uncompressed_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Read the decompressed file using scanpy
adata = sc.read(uncompressed_file)

# Optionally, remove the uncompressed file if no longer needed
os.remove(uncompressed_file)
# Paths to the compressed and uncompressed files
compressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_compressed.h5ad.gz'
uncompressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_uncompressed.h5ad'

# Decompress the file
with gzip.open(compressed_file, 'rb') as f_in:
    with open(uncompressed_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Read the decompressed file using scanpy
adata = sc.read(uncompressed_file)

# Optionally, remove the uncompressed file if no longer needed
os.remove(uncompressed_file)

In [6]:

Copied!

adata
adata

Out[6]:

AnnData object with n_obs × n_vars = 3540 × 25198
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

Check that the gene variable names are in the correct format i.e. gene symbol names and not Ensembl IDs.

In [7]:

Copied!

adata.var
adata.var

Out[7]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length	highly_variable	means	dispersions	dispersions_norm
feature_name
MIR1302-2HG	False	MIR1302-2HG	NCBITaxon:9606	gene	1021	False	1.000000e-12	NaN	NaN
FAM138A	False	FAM138A	NCBITaxon:9606	gene	1219	False	1.000000e-12	NaN	NaN
OR4F5	False	OR4F5	NCBITaxon:9606	gene	2618	False	1.000000e-12	NaN	NaN
OR4F29	False	OR4F29	NCBITaxon:9606	gene	939	False	1.000000e-12	NaN	NaN
OR4F16	False	OR4F16	NCBITaxon:9606	gene	939	False	1.000000e-12	NaN	NaN
...	...	...	...	...	...	...	...	...	...
TTTY17C	False	TTTY17C	NCBITaxon:9606	gene	776	False	1.000000e-12	NaN	NaN
SEPTIN14P23	False	SEPTIN14P23	NCBITaxon:9606	gene	1192	False	2.503351e-04	-0.120703	-1.405214
CDY1	False	CDY1	NCBITaxon:9606	gene	2670	False	1.000000e-12	NaN	NaN
TTTY3	False	TTTY3	NCBITaxon:9606	gene	344	False	1.000000e-12	NaN	NaN
MAFIP	False	MAFIP	NCBITaxon:9606	gene	1599	False	6.082392e-03	0.002191	0.189369

25198 rows × 9 columns

In [8]:

Copied!





# Preprocess the data following the scGPT data pre-processing pipeline
preprocessor = Preprocessor(
    use_key="X",  # the key in adata.layers to use as raw data
    filter_gene_by_counts=3,  # step 1
    filter_cell_by_counts=False,  # step 2
    normalize_total=1e4,  # step 3 whether to normalize the raw data and to what sum
    result_normed_key="X_normed",  # the key in adata.layers to store the normalized data
    log1p=data_is_raw,  # step 4 whether to log1p the normalised data
    result_log1p_key="X_log1p",
    subset_hvg=n_hvg,  # 5. whether to subset the raw data to highly variable genes
    hvg_flavor="seurat_v3" if data_is_raw else "cell_ranger",
    binning=n_bins,  # 6. whether to bin the raw data and to what number of bins
    result_binned_key="X_binned",  # the key in adata.layers to store the binned data
)
preprocessor(adata)
# Preprocess the data following the scGPT data pre-processing pipeline
preprocessor = Preprocessor(
    use_key="X",  # the key in adata.layers to use as raw data
    filter_gene_by_counts=3,  # step 1
    filter_cell_by_counts=False,  # step 2
    normalize_total=1e4,  # step 3 whether to normalize the raw data and to what sum
    result_normed_key="X_normed",  # the key in adata.layers to store the normalized data
    log1p=data_is_raw,  # step 4 whether to log1p the normalised data
    result_log1p_key="X_log1p",
    subset_hvg=n_hvg,  # 5. whether to subset the raw data to highly variable genes
    hvg_flavor="seurat_v3" if data_is_raw else "cell_ranger",
    binning=n_bins,  # 6. whether to bin the raw data and to what number of bins
    result_binned_key="X_binned",  # the key in adata.layers to store the binned data
)
preprocessor(adata)

scGPT - INFO - Filtering genes by counts ...
scGPT - INFO - Normalizing total counts ...
scGPT - INFO - Subsetting highly variable genes ...
scGPT - WARNING - No batch_key is provided, will use all cells for HVG selection.
scGPT - INFO - Binning data ...

Step 2: Retrieve Scgpt's Gene Embeddings¶

Overall, the pre-trained foundation model contains 30+K genes. Here for simplicity, we focus on a subset of HVGs specific to the data at hand.

In [9]:

Copied!





# Retrieve the data-independent gene embeddings from scGPT
gene_ids = np.array([id for id in gene2idx.values()])
gene_embeddings = model.encoder(torch.tensor(gene_ids, dtype=torch.long).to(device))
gene_embeddings = gene_embeddings.detach().cpu().numpy()
# Retrieve the data-independent gene embeddings from scGPT
gene_ids = np.array([id for id in gene2idx.values()])
gene_embeddings = model.encoder(torch.tensor(gene_ids, dtype=torch.long).to(device))
gene_embeddings = gene_embeddings.detach().cpu().numpy()

In [10]:

Copied!

# Filter on the intersection between the Blood HVGs found and scGPT's 30+K foundation model vocab
gene_embeddings = {gene: gene_embeddings[i] for i, gene in enumerate(gene2idx.keys()) if gene in adata.var.index.tolist()}
print('Retrieved gene embeddings for {} genes.'.format(len(gene_embeddings)))
# Filter on the intersection between the Blood HVGs found and scGPT's 30+K foundation model vocab
gene_embeddings = {gene: gene_embeddings[i] for i, gene in enumerate(gene2idx.keys()) if gene in adata.var.index.tolist()}
print('Retrieved gene embeddings for {} genes.'.format(len(gene_embeddings)))

Retrieved gene embeddings for 962 genes.

In [11]:

Copied!

# Construct gene embedding network
embed = GeneEmbedding(gene_embeddings)
# Construct gene embedding network
embed = GeneEmbedding(gene_embeddings)

100%|████████████████████████████████████████████████████████████████████████████| 962/962 [00:00<00:00, 1095850.20it/s]

Step 3: Extract gene programmes from gene embedding network¶

Perform Louvain clustering on the gene embedding network.

In [12]:

Copied!





# Perform Louvain clustering with desired resolution; here we specify resolution=40
gdata = embed.get_adata(resolution=40)
# Retrieve the gene clusters
metagenes = embed.get_metagenes(gdata)
# Perform Louvain clustering with desired resolution; here we specify resolution=40
gdata = embed.get_adata(resolution=40)
# Retrieve the gene clusters
metagenes = embed.get_metagenes(gdata)

In [13]:

Copied!

gdata
gdata

Out[13]:

AnnData object with n_obs × n_vars = 962 × 512
    obs: 'leiden'
    uns: 'neighbors', 'leiden', 'umap'
    obsm: 'X_umap'
    obsp: 'distances', 'connectivities'

In [14]:

Copied!

metagenes
metagenes

Out[14]:

defaultdict(list,
            {'32': ['ZNF880', 'SNHG21', 'APOM', 'SC5D'],
             '214': ['ZNF837', 'SMYD5'],
             '215': ['ZNF713', 'MIR762HG'],
             '254': ['ZNF618', 'ZNF235'],
             '144': ['ZNF608', 'APBB2', 'ZNF43'],
             '57': ['ZNF594-DT', 'TSPYL5', 'ZNF530', 'MTFR2'],
             '120': ['ZNF528-AS1', 'WDR27', 'CORT'],
             '37': ['ZNF426', 'ZFP14', 'TTC28-AS1', 'ZNF76'],
             '182': ['ZNF415', 'KRCC1', 'PTPRN2'],
             '109': ['ZNF391', 'ZCWPW1', 'IFT88'],
             '74': ['ZNF318', 'SIPA1L3', 'KBTBD8', 'GUCD1'],
             '21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
             '91': ['ZNF140', 'SMIM8', 'ZBED5-AS1', 'FCF1'],
             '99': ['ZMYND8', 'ITPR1', 'CDKAL1'],
             '268': ['ZMAT1', 'EFCAB13'],
             '12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
             '217': ['ZEB2-AS1', 'PTCH2'],
             '95': ['ZDHHC19', 'SERHL2', 'NEK8'],
             '218': ['ZC3H12C', 'PACS2'],
             '211': ['ZBTB32', 'IL2RA'],
             '5': ['ZBTB16',
              'TLE1',
              'TBC1D5',
              'CRIM1',
              'PRKCB',
              'VAV3',
              'MCTP2'],
             '175': ['XIST', 'ITGA4', 'LPP'],
             '114': ['XCL2', 'TFPI2', 'CCL4L2'],
             '126': ['XCL1', 'GNG4', 'TNFRSF18'],
             '166': ['WDR86', 'CNKSR2', 'GRAP'],
             '53': ['WBP4', 'SREBF2-AS1', 'COQ7', 'KMT2E-AS1'],
             '98': ['WASF1', 'SGO1', 'CCDC18'],
             '100': ['VPREB3', 'CD24', 'FLI1'],
             '101': ['USP7-AS1', 'CATSPERZ', 'CERNA1'],
             '205': ['USF1', 'RASSF2', 'RIN3'],
             '102': ['UMAD1', 'SCN3A', 'BEND5'],
             '103': ['UACA', 'PDE7B', 'EPS8'],
             '158': ['TYROBP', 'FCER1G', 'HCST'],
             '137': ['TUSC1', 'MYL5', 'LINC01003'],
             '51': ['TUBB6', 'EIF1AY', 'PDLIM7', 'HMGN5'],
             '104': ['TTC9', 'CXCR5', 'DBNDD1'],
             '105': ['TTC23L-AS1', 'LNP1', 'MSS51'],
             '106': ['TSC22D1', 'MEG3', 'H2AC6'],
             '107': ['TRIM66', 'EPM2A', 'MCF2L'],
             '108': ['TRIM52-AS1', 'NUBPL', 'THAP7-AS1'],
             '22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
             '131': ['TRBC1', 'TRDC', 'C12orf75'],
             '81': ['TRAF4', 'METTL8', 'BTLA', 'ADAM19'],
             '1': ['TNFRSF13B',
              'FCRL2',
              'PNOC',
              'S100B',
              'LINC01781',
              'CD1C',
              'LINC01857',
              'OSBPL10'],
             '110': ['TNFAIP6', 'ADM', 'PTGS2'],
             '111': ['TMEM62', 'CLCN4', 'LETM2'],
             '132': ['TMEM44-AS1', 'LINC02551', 'TNRC6B-DT'],
             '149': ['TMEM38A', 'EIF2AK3-DT', 'GCNT2'],
             '112': ['TMEM131L', 'ACSM3', 'MARCHF3'],
             '10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
             '68': ['TUNAR', 'TVP23A', 'ELL3', 'GLI1'],
             '113': ['THOC1-DT', 'LINGO3', 'GORAB-AS1'],
             '65': ['THBS1', 'SERPINB2', 'CCL7', 'EREG'],
             '40': ['TGM1', 'DTX4', 'ECE1-AS1', 'SERPINB9P1'],
             '56': ['TTC30A', 'ATP10D', 'MUL1', 'SPRYD7'],
             '59': ['TEX9', 'STK33', 'TMEM67', 'PPIL6'],
             '216': ['TESC', 'SCNM1'],
             '201': ['TEDC1', 'ARHGEF18', 'PCGF1'],
             '251': ['TCEAL4', 'INO80C'],
             '77': ['TBXA2R', 'SDK2', 'CR2', 'MYEF2'],
             '97': ['ZNF667', 'WDR5B-DT', 'ZNF10'],
             '174': ['TAT-AS1', 'CUBN', 'LINC02175'],
             '177': ['TAPT1', 'BCL7A', 'IL4R'],
             '94': ['TAL1', 'SMIM5', 'PDZK1IP1'],
             '25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
             '178': ['SUV39H2', 'DPPA4', 'IMPACT'],
             '119': ['SULF2', 'S100Z', 'DOCK8-AS1'],
             '302': ['STAG1-DT'],
             '79': ['UQCRHL', 'GIHCG', 'NREP', 'AZU1'],
             '219': ['SPRY1', 'ZNF667-AS1'],
             '66': ['SPINK2', 'ZNF844', 'ERICH6-AS1', 'ABCB1'],
             '235': ['SPATA6', 'MAPK10'],
             '90': ['SPARC', 'CLU', 'MYH11', 'MYL9'],
             '220': ['SPAG16', 'PTPRK'],
             '2': ['SP2',
              'H2AC21',
              'H2AC11',
              'H2BC13',
              'H2BC18',
              'H2BC4',
              'H2BC7',
              'H3C3'],
             '184': ['SOX2-OT', 'STAR', 'MYL3'],
             '194': ['SOS1-IT1', 'CAPN14', 'MTCL1'],
             '33': ['SOCS1', 'CSRNP1', 'GADD45G', 'RGS16'],
             '115': ['SOBP', 'TPK1', 'KRBOX1'],
             '85': ['SNX29', 'CYSLTR1', 'UVRAG', 'LHPP'],
             '55': ['SNX22', 'TCTN1', 'FAM111B', 'CNFN'],
             '116': ['SNAPC1', 'HEIH', 'MXI1'],
             '117': ['SMAGP', 'SCPEP1', 'ANXA4'],
             '148': ['SLC39A10', 'CEP290', 'ANKRD26'],
             '222': ['ST20-AS1', 'ERMN'],
             '63': ['SLC2A6', 'SIGLEC10', 'SIDT2', 'SCIMP'],
             '62': ['SKAP1', 'NCR3', 'KLRB1', 'HOPX'],
             '186': ['SIMC1', 'LYRM1', 'PIGC'],
             '118': ['SH3BP2', 'FAM120A', 'VPS41'],
             '163': ['SFR1', 'RAB30-DT', 'CHML'],
             '206': ['SETBP1', 'OR2T11', 'LAMC1'],
             '64': ['SERPINI1', 'SLC2A1', 'ANKRD37', 'PWWP3A'],
             '41': ['TPST1', 'RAPGEF5', 'FLT1', 'ST6GALNAC3'],
             '188': ['SEMA6B', 'GASAL1', 'INSIG1-DT'],
             '17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
             '139': ['SAV1', 'USP6NL', 'POU2F1'],
             '223': ['TRMT1', 'LINC01678'],
             '123': ['SATB1', 'PDE7A', 'BCL11B'],
             '13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
             '86': ['RUNX1', 'GNA15', 'ANKRD28', 'LRRC75A'],
             '16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
             '224': ['RRAD', 'DNAAF1'],
             '0': ['TXNDC5',
              'IGHA2',
              'IGLC3',
              'IGHA1',
              'IGHG4',
              'JCHAIN',
              'IGHG1',
              'IGHG3',
              'IGLC2'],
             '241': ['RP9', 'AGPAT5'],
             '83': ['SNX10', 'SLC11A1', 'MCEMP1', 'PTPN6'],
             '121': ['RPGR', 'SEMA4B', 'RRM2B'],
             '225': ['USP9Y', 'CCDC141'],
             '34': ['UBE4A', 'OGT', 'TRIB2', 'CCND2'],
             '156': ['TBCK', 'EPG5', 'NOTCH1'],
             '36': ['SBNO1-AS1', 'LAIR1', 'CRELD2', 'CD38'],
             '252': ['WAC-AS1', 'TLR1'],
             '226': ['WWC3', 'ZDHHC21'],
             '35': ['RPAP2', 'DUSP12', 'SLC25A45', 'OSGEPL1'],
             '44': ['ZNF585A', 'EID2B', 'HS3ST3B1', 'AKAP7'],
             '122': ['ZRSR2', 'DIPK1B', 'SCAMP4'],
             '67': ['STEAP1B', 'DTWD2', 'BCAS4', 'CCSER1'],
             '197': ['ZNF83', 'SESN1', 'NUAK2'],
             '269': ['SCAI', 'ARPP21'],
             '296': ['TNFSF13B', 'CLEC7A'],
             '39': ['RNF170', 'BMP3', 'ALKAL2', 'CBR3-AS1'],
             '227': ['WDR19', 'IQCN'],
             '287': ['RNF138', 'NAF1'],
             '69': ['RNF103', 'R3HDM2', 'ZYG11B', 'ZBTB18'],
             '9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
             '80': ['RNASEH2B-AS1', 'SPA17', 'EFHB', 'IQCD'],
             '48': ['RIOK1', 'URB1-AS1', 'LINC00526', 'NUFIP1'],
             '140': ['RGS1', 'DUSP4', 'CCL2'],
             '30': ['REXO1', 'CORO7', 'AKAP8', 'BTBD2'],
             '124': ['RDX', 'TSPAN13', 'IL3RA_ENSG00000185291'],
             '58': ['RBP7', 'NFAM1', 'HCAR2', 'ASGR1'],
             '46': ['RASSF8', 'RBM38-AS1', 'LINC01811', 'PCP2'],
             '280': ['RASSF1-AS1', 'FMO5'],
             '125': ['RAP2C-AS1', 'MAP3K2-DT', 'LINC00494'],
             '19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
             '212': ['TNFAIP2', 'UBXN11'],
             '18': ['RAB33B-AS1',
              'MZF1-AS1',
              'SDR42E2',
              'C16orf74',
              'MIR4458HG'],
             '7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
             '29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
             '60': ['GNMT', 'ERCC2', 'TMEM167B-DT', 'PSMA8'],
             '6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
             '23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
             '180': ['GAPT', 'IGLC5', 'CD200'],
             '4': ['IGHV4-39',
              'IGKV1-9',
              'IGHV4-34',
              'IGHV1-3',
              'IGLV2-14',
              'IGHV1-18',
              'IGHV3-23'],
             '266': ['CLMN', 'TFEB'],
             '15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
             '14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
             '127': ['GALNT6', 'GOLPH3-DT', 'GPR137C'],
             '229': ['CXCL10', 'CXCL2'],
             '27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
             '92': ['BBOF1', 'TWSG1-DT', 'WDR5B', 'CFAP45'],
             '70': ['CLLU1-AS1', 'ARHGAP22', 'DNAJC12', 'AVP'],
             '199': ['CCDC112', 'FADS3', 'ZNF821'],
             '8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
             '246': ['FCER2', 'ABCB4'],
             '190': ['POC1B', 'ASXL2', 'MBNL3'],
             '168': ['FAM241A', 'SDCCAG8', 'CERS6'],
             '230': ['ARL13B', 'HNRNPD-DT'],
             '231': ['NR4A1', 'ATF3'],
             '221': ['FAM111A', 'PLCL2'],
             '232': ['HMGA1P4', 'C1orf220'],
             '233': ['EVI5', 'ATP6V0A1'],
             '128': ['C11orf24', 'LARGE2', 'KCNG1'],
             '234': ['ARFIP1', 'GCNT1'],
             '129': ['DNAL4', 'CD2AP-DT', 'PRXL2B'],
             '147': ['EPB41L2', 'ELL2', 'ABCA1'],
             '3': ['MX1',
              'HERC5',
              'XAF1',
              'IFI44L',
              'IFIT3',
              'MX2',
              'EPSTI1',
              'PLSCR1'],
             '54': ['CYTOR', 'SLAMF7', 'BHLHE41', 'IFNG'],
             '183': ['FAAP24', 'ZNF252P-AS1', 'HCN3'],
             '130': ['PACSIN2', 'DDI2', 'RYBP'],
             '75': ['GBP5', 'GBP1', 'GBP4', 'GBP2'],
             '259': ['DOCK9-DT', 'NUMBL'],
             '150': ['USP11', 'BTN2A2', 'HIP1R'],
             '73': ['DCTN6-DT', 'CSKMT', 'ATF7IP2', 'ILF3-DT'],
             '203': ['NRGN', 'RGS18', 'PRKAR2B'],
             '71': ['DNAAF4',
              'LINC00106_ENSG00000236871',
              'CSGALNACT2-DT',
              'JARID2-AS1'],
             '236': ['DDN-AS1', 'ICOSLG'],
             '204': ['DTX1', 'GPR160', 'HRK'],
             '237': ['ERCC1', 'HDAC9'],
             '45': ['CYB5D2', 'LINC01004', 'ZFP28-DT', 'ALG9'],
             '26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
             '213': ['DYNLT2', 'CAGE1'],
             '289': ['PPP1R14A', 'FAM30A'],
             '52': ['NOTCH2', 'SERTAD2', 'KLF7', 'MAML3'],
             '61': ['TSC22D2', 'ICA1L', 'TTN', 'KCNQ1OT1'],
             '141': ['FAM177B', 'CCL22', 'IL6'],
             '185': ['CYP2R1', 'NHLH1', 'LINC02848'],
             '198': ['IRAK3', 'BAZ2B', 'PECAM1'],
             '238': ['EFHC1', 'METTL7A'],
             '239': ['ZNF165', 'CFAP44'],
             '50': ['IGHV3-15', 'IGLC6', 'IGHV3-49', 'IGLC7'],
             '260': ['NME4', 'MYC'],
             '78': ['SOX4', 'ARMH1', 'MIR181A1HG', 'MTHFD2L'],
             '133': ['VCAN', 'CSF3R', 'CD14'],
             '43': ['DOP1B', 'MIR3936HG', 'DLEU2', 'NRDE2'],
             '249': ['GDAP1', 'CENPP'],
             '291': ['PTGDS', 'IGFBP4'],
             '72': ['CA5B', 'RBM43', 'INTS6-AS1', 'LRRC8C-DT'],
             '240': ['LIX1-AS1', 'CLEC20A'],
             '169': ['NUP58', 'SPPL2B', 'CLASRP'],
             '160': ['CRIP2', 'BDH2', 'PCSK1N'],
             '297': ['COL19A1'],
             '134': ['PIDD1', 'NUP210', 'SPATA33'],
             '292': ['CORO2B', 'AMDHD2'],
             '153': ['CLEC4A', 'GM2A', 'MIR223HG'],
             '208': ['PRCD', 'ZNF528', 'MYBPC2'],
             '243': ['KCNH8', 'LINC02397'],
             '88': ['CNN3', 'CCND1', 'APP', 'PLK2'],
             '162': ['FRY', 'LINC02029', 'GABBR1'],
             '275': ['CNTNAP2', 'MAP3K9'],
             '136': ['GABARAPL1', 'IL1RN', 'PLAUR'],
             '244': ['CLIC4', 'FCHSD2'],
             '159': ['HEXIM1', 'GADD45A', 'IRS2'],
             '38': ['MARCHF1', 'HLA-DRB5', 'CPVL', 'SHTN1'],
             '179': ['ANXA1', 'CD9', 'LGALS3'],
             '193': ['TGFBI', 'RNASE6', 'MPEG1'],
             '245': ['OPN3', 'MAP3K13'],
             '84': ['IGHV3-53', 'IGKV2D-29', 'IGHV3-30', 'IGHV5-51'],
             '242': ['IGHE', 'WNT10A'],
             '191': ['H2BC21', 'H2BC8', 'H2BC5'],
             '87': ['DSP', 'FXYD7', 'SH2D3A', 'LSR'],
             '279': ['ADPGK-AS1', 'LINC00467'],
             '138': ['RHOB', 'CITED2', 'H2AC19'],
             '164': ['KCNIP2', 'RORA-AS1', 'PGBD4'],
             '248': ['CEMIP2', 'KLRD1'],
             '192': ['THRB', 'KLK1', 'MOB3B'],
             '207': ['ADARB1', 'EPN2', 'MICAL3'],
             '247': ['CFL2', 'PAIP2B'],
             '264': ['IFNG-AS1', 'SLAMF1'],
             '24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
             '28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
             '195': ['CCDC168', 'KLF3-AS1', 'NARF-AS2'],
             '228': ['MSMP', 'PLD5'],
             '89': ['CXCL3', 'CCL20', 'MIR3945HG', 'MMP7'],
             '143': ['VAV3-AS1', 'DPEP2', 'IL6-AS1'],
             '285': ['NETO1-DT', 'LRRIQ3'],
             '31': ['ARHGAP9', 'ARHGAP4', 'FAM200B', 'CMTM7'],
             '135': ['GALNT4', 'ARRDC4', 'LPP-AS2'],
             '145': ['FCRL5', 'LINC01480', 'GCM1'],
             '200': ['EGR2', 'C15orf48', 'EGR3'],
             '146': ['FCRL1', 'KLHL14', 'MYO7B'],
             '93': ['GPR35', 'LILRA1', 'LILRA6', 'HYAL3'],
             '11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
             '250': ['AKAP6', 'LINC01685'],
             '271': ['LUCAT1', 'IER5L'],
             '170': ['P2RX5', 'CDCA7L', 'GPR18'],
             '282': ['CLEC17A', 'HLA-DOB'],
             '76': ['DBNDD2', 'MANSC1', 'OXT', 'PPP1R21-DT'],
             '151': ['COLCA1', 'LIX1L-AS1', 'TAGAP-AS1'],
             '152': ['ARHGAP12', 'SLC16A11', 'HAGHL'],
             '154': ['LEKR1', 'OXTR', 'P2RX7'],
             '284': ['CD300E', 'DUSP6'],
             '299': ['E2F5'],
             '265': ['CAMK2D', 'CIITA'],
             '286': ['RAB37', 'C11orf21'],
             '82': ['DNASE1L3', 'KHDRBS2', 'MACROD2', 'MIR4432HG'],
             '288': ['CCDC88A', 'OGFRL1'],
             '47': ['MILR1', 'CBFA2T3', 'PAWR', 'HS2ST1'],
             '278': ['NKILA', 'KLHL22'],
             '283': ['TCL6', 'YPEL2'],
             '294': ['IFRD1', 'NEIL1'],
             '210': ['TMEM176A', 'G0S2', 'MAFB'],
             '253': ['ARHGEF35-AS1', 'FAM24B'],
             '181': ['KLRC1', 'GZMK', 'GPR65'],
             '155': ['IRF4', 'TP53INP1', 'PHF6'],
             '161': ['CSRP2', 'BEX1', 'PEG10'],
             '42': ['ZCCHC9', 'FBXO22', 'KIAA0586', 'PNKD'],
             '157': ['PPP1R14B-AS1', 'TP53RK-DT', 'PTK6'],
             '202': ['OR2A1-AS1', 'PPP1R12A-AS1', 'MYLK-AS1'],
             '209': ['COL9A3', 'N4BP3', 'LCN8'],
             '96': ['CHL1', 'SYT17', 'MAP2'],
             '257': ['ATP2B1-AS1', 'TMEM107'],
             '196': ['ARMCX4', 'MAP3K6', 'NPTX1'],
             '258': ['BCL6', 'ATP11A'],
             '171': ['CTSW', 'GZMM', 'FEZ1'],
             '20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44'],
             '290': ['CBR3', 'LOH12CR2'],
             '165': ['IRAK2', 'LINC00265', 'NRARP'],
             '172': ['H1-10', 'H1-2', 'H1-3'],
             '301': ['KLHDC7B-DT'],
             '274': ['CD180', 'IRAG1-AS1'],
             '167': ['JAG2', 'P2RY14', 'PPM1J'],
             '262': ['CYP2U1-AS1', 'KCTD16'],
             '263': ['CLDN11', 'NTNG1'],
             '293': ['LIX1', 'GNG3'],
             '298': ['IGLC1'],
             '49': ['MED12', 'ERLIN2', 'IFT122', 'P2RX4'],
             '176': ['HES1', 'ID1', 'ID2'],
             '173': ['HES4', 'PPP1R17', 'ODAD3'],
             '267': ['HHIP-AS1', 'LINC01013'],
             '142': ['MZB1', 'DERL3', 'AIM2'],
             '270': ['EAF2', 'PLPP5'],
             '272': ['INSIG2', 'KHDRBS3'],
             '273': ['IGHV3-7', 'STAG3'],
             '189': ['DPYSL3', 'CERCAM', 'NRP2'],
             '300': ['PRMT5-AS1'],
             '276': ['CHST15', 'LINC02728'],
             '277': ['OXCT1-AS1', 'MYZAP'],
             '255': ['PCDH9', 'NIBAN3'],
             '256': ['LINC01703', 'NDUFV2-AS1'],
             '281': ['LINC02132', 'NUDT8'],
             '295': ['LINC02422', 'SLC38A11'],
             '187': ['AMN1', 'NT5E', 'OLMALINC'],
             '261': ['GZMA', 'IL32']})

Filter on clusters with 5 or more genes

In [15]:

Copied!





# Obtain the set of gene programmes from clusters with #genes >= 5
mgs = dict()
for mg, genes in metagenes.items():
    if len(genes) > 4:
        mgs[mg] = genes
# Obtain the set of gene programmes from clusters with #genes >= 5
mgs = dict()
for mg, genes in metagenes.items():
    if len(genes) > 4:
        mgs[mg] = genes

In [16]:

Copied!

# Here are the gene programmes identified
mgs
# Here are the gene programmes identified
mgs

Out[16]:

{'21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
 '12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
 '5': ['ZBTB16', 'TLE1', 'TBC1D5', 'CRIM1', 'PRKCB', 'VAV3', 'MCTP2'],
 '22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
 '1': ['TNFRSF13B',
  'FCRL2',
  'PNOC',
  'S100B',
  'LINC01781',
  'CD1C',
  'LINC01857',
  'OSBPL10'],
 '10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
 '25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
 '2': ['SP2',
  'H2AC21',
  'H2AC11',
  'H2BC13',
  'H2BC18',
  'H2BC4',
  'H2BC7',
  'H3C3'],
 '17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
 '13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
 '16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
 '0': ['TXNDC5',
  'IGHA2',
  'IGLC3',
  'IGHA1',
  'IGHG4',
  'JCHAIN',
  'IGHG1',
  'IGHG3',
  'IGLC2'],
 '9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
 '19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
 '18': ['RAB33B-AS1', 'MZF1-AS1', 'SDR42E2', 'C16orf74', 'MIR4458HG'],
 '7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
 '29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
 '6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
 '23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
 '4': ['IGHV4-39',
  'IGKV1-9',
  'IGHV4-34',
  'IGHV1-3',
  'IGLV2-14',
  'IGHV1-18',
  'IGHV3-23'],
 '15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
 '14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
 '27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
 '8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
 '3': ['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1'],
 '26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
 '24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
 '28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
 '11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
 '20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44']}

Step 4: Visualise Network Connectivity within Desired Gene Programme¶

In [17]:

Copied!





# Retrieve gene programme 3
GP_genes = mgs['3']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
    df = embed.compute_similarities(i, GP_genes)
    df['Gene1'] = i
    df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)
# Retrieve gene programme 3
GP_genes = mgs['3']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
    df = embed.compute_similarities(i, GP_genes)
    df['Gene1'] = i
    df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)

['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1']

100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 236.27it/s]

      Gene  Similarity   Gene1
60  EPSTI1    0.225607  PLSCR1
45  EPSTI1    0.188223     MX2
6   EPSTI1    0.256395     MX1
35  EPSTI1    0.274948   IFIT3
29  EPSTI1    0.241366  IFI44L
19  EPSTI1    0.254587    XAF1
14  EPSTI1    0.119632   HERC5
31   HERC5    0.168532  IFI44L
55   HERC5    0.119632  EPSTI1
47   HERC5    0.163126     MX2
38   HERC5    0.202499   IFIT3
23   HERC5    0.168315    XAF1
63   HERC5    0.091875  PLSCR1
7    HERC5    0.174051     MX1
1   IFI44L    0.372201     MX1
17  IFI44L    0.337581    XAF1
59  IFI44L    0.232156  PLSCR1
11  IFI44L    0.168532   HERC5
34  IFI44L    0.296725   IFIT3
52  IFI44L    0.241366  EPSTI1
41  IFI44L    0.306452     MX2
44   IFIT3    0.198912     MX2
49   IFIT3    0.274948  EPSTI1
9    IFIT3    0.202499   HERC5
28   IFIT3    0.296725  IFI44L
3    IFIT3    0.313495     MX1
58   IFIT3    0.240587  PLSCR1
20   IFIT3    0.240173    XAF1
10     MX1    0.174051   HERC5
50     MX1    0.256395  EPSTI1
33     MX1    0.313495   IFIT3
25     MX1    0.372201  IFI44L
57     MX1    0.266255  PLSCR1
18     MX1    0.323817    XAF1
42     MX1    0.295364     MX2
4      MX2    0.295364     MX1
54     MX2    0.188223  EPSTI1
39     MX2    0.198912   IFIT3
13     MX2    0.163126   HERC5
21     MX2    0.228178    XAF1
27     MX2    0.306452  IFI44L
62     MX2    0.182111  PLSCR1
22  PLSCR1    0.212136    XAF1
30  PLSCR1    0.232156  IFI44L
53  PLSCR1    0.225607  EPSTI1
36  PLSCR1    0.240587   IFIT3
46  PLSCR1    0.182111     MX2
5   PLSCR1    0.266255     MX1
15  PLSCR1    0.091875   HERC5
43    XAF1    0.228178     MX2
37    XAF1    0.240173   IFIT3
26    XAF1    0.337581  IFI44L
12    XAF1    0.168315   HERC5
2     XAF1    0.323817     MX1
61    XAF1    0.212136  PLSCR1
51    XAF1    0.254587  EPSTI1

In [18]:

Copied!





# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)
# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)

In [19]:

Copied!





# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')

elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]

pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)

width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
    if v > thresh:
        width_large[list(widths.keys())[i]] = v*10
    else:
        width_small[list(widths.keys())[i]] = max(v, 0)*10

nx.draw_networkx_edges(G, pos,
                       edgelist = width_small.keys(),
                       width=list(width_small.values()),
                       edge_color='lightblue',
                       alpha=0.8)
nx.draw_networkx_edges(G, pos, 
                       edgelist = width_large.keys(), 
                       width = list(width_large.values()), 
                       alpha = 0.5, 
                       edge_color = "blue", 
                      )
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)

ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()
# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')

elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]

pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)

width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
    if v > thresh:
        width_large[list(widths.keys())[i]] = v*10
    else:
        width_small[list(widths.keys())[i]] = max(v, 0)*10

nx.draw_networkx_edges(G, pos,
                       edgelist = width_small.keys(),
                       width=list(width_small.values()),
                       edge_color='lightblue',
                       alpha=0.8)
nx.draw_networkx_edges(G, pos, 
                       edgelist = width_large.keys(), 
                       width = list(width_large.values()), 
                       alpha = 0.5, 
                       edge_color = "blue", 
                      )
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)

ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()

No description has been provided for this image

Step 6: Reactome Pathway Analysis¶

Reactome pathway is another form of pathway analysis that can be carried out on the data.

In [20]:

Copied!





# Meta info about the number of terms (tests) in the databases
df_database = pd.DataFrame(
data = [['GO_Biological_Process_2021', 6036],
['GO_Molecular_Function_2021', 1274],
['Reactome_2022', 1818]],
columns = ['dataset', 'term'])
# Meta info about the number of terms (tests) in the databases
df_database = pd.DataFrame(
data = [['GO_Biological_Process_2021', 6036],
['GO_Molecular_Function_2021', 1274],
['Reactome_2022', 1818]],
columns = ['dataset', 'term'])

In [21]:

Copied!





# Select desired database for query; here use Reactome as an example
databases = ['Reactome_2022']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m
# Select desired database for query; here use Reactome as an example
databases = ['Reactome_2022']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m

In [22]:

Copied!





# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
                          gene_sets=databases,
                          organism='Human', 
                          outdir='test/enr_Reactome',
                          cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)

# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df
# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
                          gene_sets=databases,
                          organism='Human', 
                          outdir='test/enr_Reactome',
                          cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)

# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df

Out[22]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
0	Reactome_2022	Interferon Signaling R-HSA-913531	5/200	5.197809e-09	8.836276e-08	169.205128	3227.592653	HERC5;MX2;MX1;XAF1;IFIT3
1	Reactome_2022	Interferon Alpha/Beta Signaling R-HSA-909733	4/72	1.068672e-08	9.083711e-08	293.000000	5377.799365	MX2;MX1;XAF1;IFIT3
2	Reactome_2022	Cytokine Signaling In Immune System R-HSA-1280215	5/702	2.693582e-06	1.189224e-05	46.138211	591.705896	HERC5;MX2;MX1;XAF1;IFIT3
3	Reactome_2022	ISG15 Antiviral Mechanism R-HSA-1169408	3/75	2.798175e-06	1.189224e-05	166.000000	2122.566166	HERC5;MX2;MX1
4	Reactome_2022	Antiviral Mechanism By IFN-stimulated Genes R-...	3/83	3.801759e-06	1.292598e-05	149.340000	1863.770163	HERC5;MX2;MX1

In [ ]:

External Reading:

SCGPT paper: https://www.nature.com/articles/s41592-024-02201-0
Feature selection: https://bioconductor.org/books/3.15/OSCA.basic/feature-selection.html
P-values: https://www.nature.com/articles/nmeth.2900
Community detection algorithms: https://www.nature.com/articles/s41598-019-41695-z
Reactome: https://reactome.org/

Exercises:

Describe the process of loading the pre-trained scGPT model. What are the key parameters used?
Explain the significance of the n_hvg and n_bins parameters used in the preprocessing pipeline.
Why is it important to filter the model gene embeddings to match the HVGs found in the input dataset?
Describe the process of constructing the gene embedding network and performing Louvain clustering. What is the role of the resolution parameter?
Explain how the cosine similarity among genes in a specific gene programme is computed.
What is the purpose of performing pathway enrichment analysis in this context?
Explain the significance of the p-value correction applied during the pathway enrichment analysis.
Change the resolution parameter for the Louvain clustering step and analyse how the number of identified gene programmes changes. Plot the distribution of gene programmes for different resolution values.
Choose a different gene programme and create a similar gene similarity network to visualise.
Perform pathway enrichment analysis using a different database, such as GO_Biological_Process_2021 as well as report the top enriched pathways/processes and their significance levels.

Answers:

Answer 8:

In [ ]:

Copied!

# Define a range of resolution values to test
# Define a range of resolution values to test

Answer 9:

In [ ]:

Copied!

# Retrieve gene programme
# Retrieve gene programme

Answer 10:

In [ ]:

Copied!

# Set pandas options to display the entire term
pd.set_option('display.max_colwidth', None)

# Select desired database for query
# Set pandas options to display the entire term
pd.set_option('display.max_colwidth', None)

# Select desired database for query