Gene Network Analysis¶

This workbook will cover gene network analysis using large language models as an alternative to correlation analysis. Specifically, we will look at using scgpt, which is a foundation model for single-cell multi-omic data and has a wide range of capabilities.

We will use a model pretrained 10.3 million blood and bone marrow cells, since our dataset consists of B cells obtained from peripheral blood samples.

In [ ]:

Copied!





import copy
import json
import os
from pathlib import Path
import sys
import warnings
import gzip
import shutil

import torch
from anndata import AnnData
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pandas as pd
import tqdm
import gseapy as gp

from torchtext.vocab import Vocab
from torchtext._torchtext import (
    Vocab as VocabPybind,
)

sys.path.insert(0, "../")
import scgpt as scg
from scgpt.tasks import GeneEmbedding
from scgpt.tokenizer.gene_tokenizer import GeneVocab
from scgpt.model import TransformerModel
from scgpt.preprocess import Preprocessor
from scgpt.utils import set_seed 

os.environ["KMP_WARNINGS"] = "off"
warnings.filterwarnings('ignore')
import copy
import json
import os
from pathlib import Path
import sys
import warnings
import gzip
import shutil

import torch
from anndata import AnnData
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pandas as pd
import tqdm
import gseapy as gp

from torchtext.vocab import Vocab
from torchtext._torchtext import (
    Vocab as VocabPybind,
)

sys.path.insert(0, "../")
import scgpt as scg
from scgpt.tasks import GeneEmbedding
from scgpt.tokenizer.gene_tokenizer import GeneVocab
from scgpt.model import TransformerModel
from scgpt.preprocess import Preprocessor
from scgpt.utils import set_seed 

os.environ["KMP_WARNINGS"] = "off"
warnings.filterwarnings('ignore')

We need to set the parameters of the model for when we use it later in this workbook.

In [2]:

Copied!





set_seed(42)
pad_token = "<pad>"
special_tokens = [pad_token, "<cls>", "<eoc>"]
n_hvg = 1000
n_bins = 51
mask_value = -1
pad_value = -2
n_input_bins = n_bins
set_seed(42)
pad_token = ""
special_tokens = [pad_token, "", ""]
n_hvg = 1000
n_bins = 51
mask_value = -1
pad_value = -2
n_input_bins = n_bins

Lets go through the parameters so they are better understood:

set_seed is used to set the random seed for reproducibility.
pad_token is a special token used for padding sequences, to ensure that all sequences are of the same length. Therefore it will be inserted where necessary to make sequences of unequal lengths equal.
special_tokens is a list defining a set of special tokens the model may use, such as pad. Cls is commonly a classification token that represents the beginning of a sequence and is used to aggregate information for classification tasks. Eoc stands for 'end of converstion/content' and is used to denote the end of a sequence.
n_hvg is a variable representing the number of highly variable genes to be considered.
n_bins defines the number of bins to group continuous data into discrete intervals.
mask_value defines a value for masking. Masking is a technique to ignore certain parts of the data, often used in sequence models to indicate positions in a sequence that should be ignored during training or evaluation.
pad_value defines the value for padding sequences. Similar to mask_value, it indicates positions in the data that are not part of the original sequence and should be ignored.
n_input_bins assignment set the number of input bins ensuring consistency throughout the code.

Step 1: Load the Pre-Trained Model and Dataset¶

In [ ]:

Copied!





# Specify model path; here we load the pre-trained scGPT blood model
#You will have to directly download the model from here: https://github.com/bowang-lab/scGPT
model_dir = Path("/data/scgpt_blood_model")
model_config_file = model_dir / "args.json"
model_file = model_dir / "best_model.pt"
vocab_file = model_dir / "vocab.json"

vocab = GeneVocab.from_file(vocab_file)
for s in special_tokens:
    if s not in vocab:
        vocab.append_token(s)

# Retrieve model parameters from config files
with open(model_config_file, "r") as f:
    model_configs = json.load(f)
print(
    f"Resume model from {model_file}, the model args will override the "
    f"config {model_config_file}."
)
embsize = model_configs["embsize"]
nhead = model_configs["nheads"]
d_hid = model_configs["d_hid"]
nlayers = model_configs["nlayers"]
n_layers_cls = model_configs["n_layers_cls"]

gene2idx = vocab.get_stoi()
# Specify model path; here we load the pre-trained scGPT blood model
#You will have to directly download the model from here: https://github.com/bowang-lab/scGPT
model_dir = Path("/data/scgpt_blood_model")
model_config_file = model_dir / "args.json"
model_file = model_dir / "best_model.pt"
vocab_file = model_dir / "vocab.json"

vocab = GeneVocab.from_file(vocab_file)
for s in special_tokens:
    if s not in vocab:
        vocab.append_token(s)

# Retrieve model parameters from config files
with open(model_config_file, "r") as f:
    model_configs = json.load(f)
print(
    f"Resume model from {model_file}, the model args will override the "
    f"config {model_config_file}."
)
embsize = model_configs["embsize"]
nhead = model_configs["nheads"]
d_hid = model_configs["d_hid"]
nlayers = model_configs["nlayers"]
n_layers_cls = model_configs["n_layers_cls"]

gene2idx = vocab.get_stoi()

In [4]:

Copied!





#Set up Pytorch model to run on specified device (CPU or GPU), intialise the model with specific parameters, 
#load pre-trained weight, and then transfer the model to the chosen device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ntokens = len(vocab)  # size of vocabulary
model = TransformerModel(
    ntokens,
    embsize,
    nhead,
    d_hid,
    nlayers,
    vocab=vocab,
    pad_value=pad_value,
    n_input_bins=n_input_bins,
)

try:
    model.load_state_dict(torch.load(model_file))
    print(f"Loading all model params from {model_file}")
except:
    # only load params that are in the model and match the size
    model_dict = model.state_dict()
    pretrained_dict = torch.load(model_file)
    pretrained_dict = {
        k: v
        for k, v in pretrained_dict.items()
        if k in model_dict and v.shape == model_dict[k].shape
    }
    for k, v in pretrained_dict.items():
        print(f"Loading params {k} with shape {v.shape}")
        model_dict.update(pretrained_dict)
        model.load_state_dict(model_dict)

model.to(device)
#Set up Pytorch model to run on specified device (CPU or GPU), intialise the model with specific parameters, 
#load pre-trained weight, and then transfer the model to the chosen device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ntokens = len(vocab)  # size of vocabulary
model = TransformerModel(
    ntokens,
    embsize,
    nhead,
    d_hid,
    nlayers,
    vocab=vocab,
    pad_value=pad_value,
    n_input_bins=n_input_bins,
)

try:
    model.load_state_dict(torch.load(model_file))
    print(f"Loading all model params from {model_file}")
except:
    # only load params that are in the model and match the size
    model_dict = model.state_dict()
    pretrained_dict = torch.load(model_file)
    pretrained_dict = {
        k: v
        for k, v in pretrained_dict.items()
        if k in model_dict and v.shape == model_dict[k].shape
    }
    for k, v in pretrained_dict.items():
        print(f"Loading params {k} with shape {v.shape}")
        model_dict.update(pretrained_dict)
        model.load_state_dict(model_dict)

model.to(device)

Loading params encoder.embedding.weight with shape torch.Size([36574, 512])
Loading params encoder.enc_norm.weight with shape torch.Size([512])
Loading params encoder.enc_norm.bias with shape torch.Size([512])
Loading params value_encoder.linear1.weight with shape torch.Size([512, 1])
Loading params value_encoder.linear1.bias with shape torch.Size([512])
Loading params value_encoder.linear2.weight with shape torch.Size([512, 512])
Loading params value_encoder.linear2.bias with shape torch.Size([512])
Loading params value_encoder.norm.weight with shape torch.Size([512])
Loading params value_encoder.norm.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.0.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.0.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.0.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.0.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.1.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.1.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.1.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.1.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.2.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.2.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.2.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.2.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.3.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.3.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.3.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.3.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.4.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.4.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.4.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.4.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.5.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.5.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.5.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.5.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.6.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.6.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.6.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.6.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.7.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.7.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.7.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.7.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.8.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.8.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.8.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm2.weight with shape torch.Size([512])

Loading params transformer_encoder.layers.8.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.9.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.9.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.9.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.9.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.10.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.10.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.10.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.10.norm2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.self_attn.out_proj.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.11.self_attn.out_proj.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.linear1.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.11.linear1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.linear2.weight with shape torch.Size([512, 512])
Loading params transformer_encoder.layers.11.linear2.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm1.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm1.bias with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.11.norm2.bias with shape torch.Size([512])
Loading params decoder.fc.0.weight with shape torch.Size([512, 512])
Loading params decoder.fc.0.bias with shape torch.Size([512])
Loading params decoder.fc.2.weight with shape torch.Size([512, 512])
Loading params decoder.fc.2.bias with shape torch.Size([512])
Loading params decoder.fc.4.weight with shape torch.Size([1, 512])
Loading params decoder.fc.4.bias with shape torch.Size([1])

Out[4]:

TransformerModel(
  (encoder): GeneEncoder(
    (embedding): Embedding(36574, 512, padding_idx=36571)
    (enc_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (value_encoder): ContinuousValueEncoder(
    (dropout): Dropout(p=0.5, inplace=False)
    (linear1): Linear(in_features=1, out_features=512, bias=True)
    (activation): ReLU()
    (linear2): Linear(in_features=512, out_features=512, bias=True)
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-11): 12 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=512, bias=True)
        (dropout): Dropout(p=0.5, inplace=False)
        (linear2): Linear(in_features=512, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.5, inplace=False)
        (dropout2): Dropout(p=0.5, inplace=False)
      )
    )
  )
  (decoder): ExprDecoder(
    (fc): Sequential(
      (0): Linear(in_features=512, out_features=512, bias=True)
      (1): LeakyReLU(negative_slope=0.01)
      (2): Linear(in_features=512, out_features=512, bias=True)
      (3): LeakyReLU(negative_slope=0.01)
      (4): Linear(in_features=512, out_features=1, bias=True)
    )
  )
  (cls_decoder): ClsDecoder(
    (_decoder): ModuleList(
      (0): Linear(in_features=512, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Linear(in_features=512, out_features=512, bias=True)
      (4): ReLU()
      (5): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (out_layer): Linear(in_features=512, out_features=1, bias=True)
  )
  (sim): Similarity(
    (cos): CosineSimilarity()
  )
  (creterion_cce): CrossEntropyLoss()
)

Load dataset of interest

In [ ]:

Copied!





# Paths to the compressed and uncompressed files
compressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_compressed.h5ad.gz'
uncompressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_uncompressed.h5ad'

# Decompress the file
with gzip.open(compressed_file, 'rb') as f_in:
    with open(uncompressed_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Read the decompressed file using scanpy
adata = sc.read(uncompressed_file)

# Optionally, remove the uncompressed file if no longer needed
os.remove(uncompressed_file)
# Paths to the compressed and uncompressed files
compressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_compressed.h5ad.gz'
uncompressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_uncompressed.h5ad'

# Decompress the file
with gzip.open(compressed_file, 'rb') as f_in:
    with open(uncompressed_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Read the decompressed file using scanpy
adata = sc.read(uncompressed_file)

# Optionally, remove the uncompressed file if no longer needed
os.remove(uncompressed_file)

In [6]:

Copied!

adata
adata

Out[6]:

AnnData object with n_obs × n_vars = 3540 × 25198
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

Check that the gene variable names are in the correct format i.e. gene symbol names and not Ensembl IDs.

In [7]:

Copied!

adata.var
adata.var

Out[7]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length	highly_variable	means	dispersions	dispersions_norm
feature_name
MIR1302-2HG	False	MIR1302-2HG	NCBITaxon:9606	gene	1021	False	1.000000e-12	NaN	NaN
FAM138A	False	FAM138A	NCBITaxon:9606	gene	1219	False	1.000000e-12	NaN	NaN
OR4F5	False	OR4F5	NCBITaxon:9606	gene	2618	False	1.000000e-12	NaN	NaN
OR4F29	False	OR4F29	NCBITaxon:9606	gene	939	False	1.000000e-12	NaN	NaN
OR4F16	False	OR4F16	NCBITaxon:9606	gene	939	False	1.000000e-12	NaN	NaN
...	...	...	...	...	...	...	...	...	...
TTTY17C	False	TTTY17C	NCBITaxon:9606	gene	776	False	1.000000e-12	NaN	NaN
SEPTIN14P23	False	SEPTIN14P23	NCBITaxon:9606	gene	1192	False	2.503351e-04	-0.120703	-1.405214
CDY1	False	CDY1	NCBITaxon:9606	gene	2670	False	1.000000e-12	NaN	NaN
TTTY3	False	TTTY3	NCBITaxon:9606	gene	344	False	1.000000e-12	NaN	NaN
MAFIP	False	MAFIP	NCBITaxon:9606	gene	1599	False	6.082392e-03	0.002191	0.189369

25198 rows × 9 columns

In [8]:

Copied!





# Preprocess the data following the scGPT data pre-processing pipeline
preprocessor = Preprocessor(
    use_key="X",  # the key in adata.layers to use as raw data
    filter_gene_by_counts=3,  # step 1
    filter_cell_by_counts=False,  # step 2
    normalize_total=1e4,  # step 3 whether to normalize the raw data and to what sum
    result_normed_key="X_normed",  # the key in adata.layers to store the normalized data
    log1p=data_is_raw,  # step 4 whether to log1p the normalised data
    result_log1p_key="X_log1p",
    subset_hvg=n_hvg,  # 5. whether to subset the raw data to highly variable genes
    hvg_flavor="seurat_v3" if data_is_raw else "cell_ranger",
    binning=n_bins,  # 6. whether to bin the raw data and to what number of bins
    result_binned_key="X_binned",  # the key in adata.layers to store the binned data
)
preprocessor(adata)
# Preprocess the data following the scGPT data pre-processing pipeline
preprocessor = Preprocessor(
    use_key="X",  # the key in adata.layers to use as raw data
    filter_gene_by_counts=3,  # step 1
    filter_cell_by_counts=False,  # step 2
    normalize_total=1e4,  # step 3 whether to normalize the raw data and to what sum
    result_normed_key="X_normed",  # the key in adata.layers to store the normalized data
    log1p=data_is_raw,  # step 4 whether to log1p the normalised data
    result_log1p_key="X_log1p",
    subset_hvg=n_hvg,  # 5. whether to subset the raw data to highly variable genes
    hvg_flavor="seurat_v3" if data_is_raw else "cell_ranger",
    binning=n_bins,  # 6. whether to bin the raw data and to what number of bins
    result_binned_key="X_binned",  # the key in adata.layers to store the binned data
)
preprocessor(adata)

scGPT - INFO - Filtering genes by counts ...
scGPT - INFO - Normalizing total counts ...
scGPT - INFO - Subsetting highly variable genes ...
scGPT - WARNING - No batch_key is provided, will use all cells for HVG selection.
scGPT - INFO - Binning data ...

Step 2: Retrieve Scgpt's Gene Embeddings¶

Overall, the pre-trained foundation model contains 30+K genes. Here for simplicity, we focus on a subset of HVGs specific to the data at hand.

In [9]:

Copied!





# Retrieve the data-independent gene embeddings from scGPT
gene_ids = np.array([id for id in gene2idx.values()])
gene_embeddings = model.encoder(torch.tensor(gene_ids, dtype=torch.long).to(device))
gene_embeddings = gene_embeddings.detach().cpu().numpy()
# Retrieve the data-independent gene embeddings from scGPT
gene_ids = np.array([id for id in gene2idx.values()])
gene_embeddings = model.encoder(torch.tensor(gene_ids, dtype=torch.long).to(device))
gene_embeddings = gene_embeddings.detach().cpu().numpy()

In [10]:

Copied!

# Filter on the intersection between the Blood HVGs found and scGPT's 30+K foundation model vocab
gene_embeddings = {gene: gene_embeddings[i] for i, gene in enumerate(gene2idx.keys()) if gene in adata.var.index.tolist()}
print('Retrieved gene embeddings for {} genes.'.format(len(gene_embeddings)))
# Filter on the intersection between the Blood HVGs found and scGPT's 30+K foundation model vocab
gene_embeddings = {gene: gene_embeddings[i] for i, gene in enumerate(gene2idx.keys()) if gene in adata.var.index.tolist()}
print('Retrieved gene embeddings for {} genes.'.format(len(gene_embeddings)))

Retrieved gene embeddings for 962 genes.

In [11]:

Copied!

# Construct gene embedding network
embed = GeneEmbedding(gene_embeddings)
# Construct gene embedding network
embed = GeneEmbedding(gene_embeddings)

100%|████████████████████████████████████████████████████████████████████████████| 962/962 [00:00<00:00, 1095850.20it/s]

Step 3: Extract gene programmes from gene embedding network¶

Perform Louvain clustering on the gene embedding network.

In [12]:

Copied!





# Perform Louvain clustering with desired resolution; here we specify resolution=40
gdata = embed.get_adata(resolution=40)
# Retrieve the gene clusters
metagenes = embed.get_metagenes(gdata)
# Perform Louvain clustering with desired resolution; here we specify resolution=40
gdata = embed.get_adata(resolution=40)
# Retrieve the gene clusters
metagenes = embed.get_metagenes(gdata)

In [13]:

Copied!

gdata
gdata

Out[13]:

AnnData object with n_obs × n_vars = 962 × 512
    obs: 'leiden'
    uns: 'neighbors', 'leiden', 'umap'
    obsm: 'X_umap'
    obsp: 'distances', 'connectivities'

In [14]:

Copied!

metagenes
metagenes

Out[14]:

defaultdict(list,
            {'32': ['ZNF880', 'SNHG21', 'APOM', 'SC5D'],
             '214': ['ZNF837', 'SMYD5'],
             '215': ['ZNF713', 'MIR762HG'],
             '254': ['ZNF618', 'ZNF235'],
             '144': ['ZNF608', 'APBB2', 'ZNF43'],
             '57': ['ZNF594-DT', 'TSPYL5', 'ZNF530', 'MTFR2'],
             '120': ['ZNF528-AS1', 'WDR27', 'CORT'],
             '37': ['ZNF426', 'ZFP14', 'TTC28-AS1', 'ZNF76'],
             '182': ['ZNF415', 'KRCC1', 'PTPRN2'],
             '109': ['ZNF391', 'ZCWPW1', 'IFT88'],
             '74': ['ZNF318', 'SIPA1L3', 'KBTBD8', 'GUCD1'],
             '21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
             '91': ['ZNF140', 'SMIM8', 'ZBED5-AS1', 'FCF1'],
             '99': ['ZMYND8', 'ITPR1', 'CDKAL1'],
             '268': ['ZMAT1', 'EFCAB13'],
             '12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
             '217': ['ZEB2-AS1', 'PTCH2'],
             '95': ['ZDHHC19', 'SERHL2', 'NEK8'],
             '218': ['ZC3H12C', 'PACS2'],
             '211': ['ZBTB32', 'IL2RA'],
             '5': ['ZBTB16',
              'TLE1',
              'TBC1D5',
              'CRIM1',
              'PRKCB',
              'VAV3',
              'MCTP2'],
             '175': ['XIST', 'ITGA4', 'LPP'],
             '114': ['XCL2', 'TFPI2', 'CCL4L2'],
             '126': ['XCL1', 'GNG4', 'TNFRSF18'],
             '166': ['WDR86', 'CNKSR2', 'GRAP'],
             '53': ['WBP4', 'SREBF2-AS1', 'COQ7', 'KMT2E-AS1'],
             '98': ['WASF1', 'SGO1', 'CCDC18'],
             '100': ['VPREB3', 'CD24', 'FLI1'],
             '101': ['USP7-AS1', 'CATSPERZ', 'CERNA1'],
             '205': ['USF1', 'RASSF2', 'RIN3'],
             '102': ['UMAD1', 'SCN3A', 'BEND5'],
             '103': ['UACA', 'PDE7B', 'EPS8'],
             '158': ['TYROBP', 'FCER1G', 'HCST'],
             '137': ['TUSC1', 'MYL5', 'LINC01003'],
             '51': ['TUBB6', 'EIF1AY', 'PDLIM7', 'HMGN5'],
             '104': ['TTC9', 'CXCR5', 'DBNDD1'],
             '105': ['TTC23L-AS1', 'LNP1', 'MSS51'],
             '106': ['TSC22D1', 'MEG3', 'H2AC6'],
             '107': ['TRIM66', 'EPM2A', 'MCF2L'],
             '108': ['TRIM52-AS1', 'NUBPL', 'THAP7-AS1'],
             '22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
             '131': ['TRBC1', 'TRDC', 'C12orf75'],
             '81': ['TRAF4', 'METTL8', 'BTLA', 'ADAM19'],
             '1': ['TNFRSF13B',
              'FCRL2',
              'PNOC',
              'S100B',
              'LINC01781',
              'CD1C',
              'LINC01857',
              'OSBPL10'],
             '110': ['TNFAIP6', 'ADM', 'PTGS2'],
             '111': ['TMEM62', 'CLCN4', 'LETM2'],
             '132': ['TMEM44-AS1', 'LINC02551', 'TNRC6B-DT'],
             '149': ['TMEM38A', 'EIF2AK3-DT', 'GCNT2'],
             '112': ['TMEM131L', 'ACSM3', 'MARCHF3'],
             '10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
             '68': ['TUNAR', 'TVP23A', 'ELL3', 'GLI1'],
             '113': ['THOC1-DT', 'LINGO3', 'GORAB-AS1'],
             '65': ['THBS1', 'SERPINB2', 'CCL7', 'EREG'],
             '40': ['TGM1', 'DTX4', 'ECE1-AS1', 'SERPINB9P1'],
             '56': ['TTC30A', 'ATP10D', 'MUL1', 'SPRYD7'],
             '59': ['TEX9', 'STK33', 'TMEM67', 'PPIL6'],
             '216': ['TESC', 'SCNM1'],
             '201': ['TEDC1', 'ARHGEF18', 'PCGF1'],
             '251': ['TCEAL4', 'INO80C'],
             '77': ['TBXA2R', 'SDK2', 'CR2', 'MYEF2'],
             '97': ['ZNF667', 'WDR5B-DT', 'ZNF10'],
             '174': ['TAT-AS1', 'CUBN', 'LINC02175'],
             '177': ['TAPT1', 'BCL7A', 'IL4R'],
             '94': ['TAL1', 'SMIM5', 'PDZK1IP1'],
             '25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
             '178': ['SUV39H2', 'DPPA4', 'IMPACT'],
             '119': ['SULF2', 'S100Z', 'DOCK8-AS1'],
             '302': ['STAG1-DT'],
             '79': ['UQCRHL', 'GIHCG', 'NREP', 'AZU1'],
             '219': ['SPRY1', 'ZNF667-AS1'],
             '66': ['SPINK2', 'ZNF844', 'ERICH6-AS1', 'ABCB1'],
             '235': ['SPATA6', 'MAPK10'],
             '90': ['SPARC', 'CLU', 'MYH11', 'MYL9'],
             '220': ['SPAG16', 'PTPRK'],
             '2': ['SP2',
              'H2AC21',
              'H2AC11',
              'H2BC13',
              'H2BC18',
              'H2BC4',
              'H2BC7',
              'H3C3'],
             '184': ['SOX2-OT', 'STAR', 'MYL3'],
             '194': ['SOS1-IT1', 'CAPN14', 'MTCL1'],
             '33': ['SOCS1', 'CSRNP1', 'GADD45G', 'RGS16'],
             '115': ['SOBP', 'TPK1', 'KRBOX1'],
             '85': ['SNX29', 'CYSLTR1', 'UVRAG', 'LHPP'],
             '55': ['SNX22', 'TCTN1', 'FAM111B', 'CNFN'],
             '116': ['SNAPC1', 'HEIH', 'MXI1'],
             '117': ['SMAGP', 'SCPEP1', 'ANXA4'],
             '148': ['SLC39A10', 'CEP290', 'ANKRD26'],
             '222': ['ST20-AS1', 'ERMN'],
             '63': ['SLC2A6', 'SIGLEC10', 'SIDT2', 'SCIMP'],
             '62': ['SKAP1', 'NCR3', 'KLRB1', 'HOPX'],
             '186': ['SIMC1', 'LYRM1', 'PIGC'],
             '118': ['SH3BP2', 'FAM120A', 'VPS41'],
             '163': ['SFR1', 'RAB30-DT', 'CHML'],
             '206': ['SETBP1', 'OR2T11', 'LAMC1'],
             '64': ['SERPINI1', 'SLC2A1', 'ANKRD37', 'PWWP3A'],
             '41': ['TPST1', 'RAPGEF5', 'FLT1', 'ST6GALNAC3'],
             '188': ['SEMA6B', 'GASAL1', 'INSIG1-DT'],
             '17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
             '139': ['SAV1', 'USP6NL', 'POU2F1'],
             '223': ['TRMT1', 'LINC01678'],
             '123': ['SATB1', 'PDE7A', 'BCL11B'],
             '13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
             '86': ['RUNX1', 'GNA15', 'ANKRD28', 'LRRC75A'],
             '16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
             '224': ['RRAD', 'DNAAF1'],
             '0': ['TXNDC5',
              'IGHA2',
              'IGLC3',
              'IGHA1',
              'IGHG4',
              'JCHAIN',
              'IGHG1',
              'IGHG3',
              'IGLC2'],
             '241': ['RP9', 'AGPAT5'],
             '83': ['SNX10', 'SLC11A1', 'MCEMP1', 'PTPN6'],
             '121': ['RPGR', 'SEMA4B', 'RRM2B'],
             '225': ['USP9Y', 'CCDC141'],
             '34': ['UBE4A', 'OGT', 'TRIB2', 'CCND2'],
             '156': ['TBCK', 'EPG5', 'NOTCH1'],
             '36': ['SBNO1-AS1', 'LAIR1', 'CRELD2', 'CD38'],
             '252': ['WAC-AS1', 'TLR1'],
             '226': ['WWC3', 'ZDHHC21'],
             '35': ['RPAP2', 'DUSP12', 'SLC25A45', 'OSGEPL1'],
             '44': ['ZNF585A', 'EID2B', 'HS3ST3B1', 'AKAP7'],
             '122': ['ZRSR2', 'DIPK1B', 'SCAMP4'],
             '67': ['STEAP1B', 'DTWD2', 'BCAS4', 'CCSER1'],
             '197': ['ZNF83', 'SESN1', 'NUAK2'],
             '269': ['SCAI', 'ARPP21'],
             '296': ['TNFSF13B', 'CLEC7A'],
             '39': ['RNF170', 'BMP3', 'ALKAL2', 'CBR3-AS1'],
             '227': ['WDR19', 'IQCN'],
             '287': ['RNF138', 'NAF1'],
             '69': ['RNF103', 'R3HDM2', 'ZYG11B', 'ZBTB18'],
             '9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
             '80': ['RNASEH2B-AS1', 'SPA17', 'EFHB', 'IQCD'],
             '48': ['RIOK1', 'URB1-AS1', 'LINC00526', 'NUFIP1'],
             '140': ['RGS1', 'DUSP4', 'CCL2'],
             '30': ['REXO1', 'CORO7', 'AKAP8', 'BTBD2'],
             '124': ['RDX', 'TSPAN13', 'IL3RA_ENSG00000185291'],
             '58': ['RBP7', 'NFAM1', 'HCAR2', 'ASGR1'],
             '46': ['RASSF8', 'RBM38-AS1', 'LINC01811', 'PCP2'],
             '280': ['RASSF1-AS1', 'FMO5'],
             '125': ['RAP2C-AS1', 'MAP3K2-DT', 'LINC00494'],
             '19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
             '212': ['TNFAIP2', 'UBXN11'],
             '18': ['RAB33B-AS1',
              'MZF1-AS1',
              'SDR42E2',
              'C16orf74',
              'MIR4458HG'],
             '7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
             '29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
             '60': ['GNMT', 'ERCC2', 'TMEM167B-DT', 'PSMA8'],
             '6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
             '23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
             '180': ['GAPT', 'IGLC5', 'CD200'],
             '4': ['IGHV4-39',
              'IGKV1-9',
              'IGHV4-34',
              'IGHV1-3',
              'IGLV2-14',
              'IGHV1-18',
              'IGHV3-23'],
             '266': ['CLMN', 'TFEB'],
             '15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
             '14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
             '127': ['GALNT6', 'GOLPH3-DT', 'GPR137C'],
             '229': ['CXCL10', 'CXCL2'],
             '27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
             '92': ['BBOF1', 'TWSG1-DT', 'WDR5B', 'CFAP45'],
             '70': ['CLLU1-AS1', 'ARHGAP22', 'DNAJC12', 'AVP'],
             '199': ['CCDC112', 'FADS3', 'ZNF821'],
             '8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
             '246': ['FCER2', 'ABCB4'],
             '190': ['POC1B', 'ASXL2', 'MBNL3'],
             '168': ['FAM241A', 'SDCCAG8', 'CERS6'],
             '230': ['ARL13B', 'HNRNPD-DT'],
             '231': ['NR4A1', 'ATF3'],
             '221': ['FAM111A', 'PLCL2'],
             '232': ['HMGA1P4', 'C1orf220'],
             '233': ['EVI5', 'ATP6V0A1'],
             '128': ['C11orf24', 'LARGE2', 'KCNG1'],
             '234': ['ARFIP1', 'GCNT1'],
             '129': ['DNAL4', 'CD2AP-DT', 'PRXL2B'],
             '147': ['EPB41L2', 'ELL2', 'ABCA1'],
             '3': ['MX1',
              'HERC5',
              'XAF1',
              'IFI44L',
              'IFIT3',
              'MX2',
              'EPSTI1',
              'PLSCR1'],
             '54': ['CYTOR', 'SLAMF7', 'BHLHE41', 'IFNG'],
             '183': ['FAAP24', 'ZNF252P-AS1', 'HCN3'],
             '130': ['PACSIN2', 'DDI2', 'RYBP'],
             '75': ['GBP5', 'GBP1', 'GBP4', 'GBP2'],
             '259': ['DOCK9-DT', 'NUMBL'],
             '150': ['USP11', 'BTN2A2', 'HIP1R'],
             '73': ['DCTN6-DT', 'CSKMT', 'ATF7IP2', 'ILF3-DT'],
             '203': ['NRGN', 'RGS18', 'PRKAR2B'],
             '71': ['DNAAF4',
              'LINC00106_ENSG00000236871',
              'CSGALNACT2-DT',
              'JARID2-AS1'],
             '236': ['DDN-AS1', 'ICOSLG'],
             '204': ['DTX1', 'GPR160', 'HRK'],
             '237': ['ERCC1', 'HDAC9'],
             '45': ['CYB5D2', 'LINC01004', 'ZFP28-DT', 'ALG9'],
             '26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
             '213': ['DYNLT2', 'CAGE1'],
             '289': ['PPP1R14A', 'FAM30A'],
             '52': ['NOTCH2', 'SERTAD2', 'KLF7', 'MAML3'],
             '61': ['TSC22D2', 'ICA1L', 'TTN', 'KCNQ1OT1'],
             '141': ['FAM177B', 'CCL22', 'IL6'],
             '185': ['CYP2R1', 'NHLH1', 'LINC02848'],
             '198': ['IRAK3', 'BAZ2B', 'PECAM1'],
             '238': ['EFHC1', 'METTL7A'],
             '239': ['ZNF165', 'CFAP44'],
             '50': ['IGHV3-15', 'IGLC6', 'IGHV3-49', 'IGLC7'],
             '260': ['NME4', 'MYC'],
             '78': ['SOX4', 'ARMH1', 'MIR181A1HG', 'MTHFD2L'],
             '133': ['VCAN', 'CSF3R', 'CD14'],
             '43': ['DOP1B', 'MIR3936HG', 'DLEU2', 'NRDE2'],
             '249': ['GDAP1', 'CENPP'],
             '291': ['PTGDS', 'IGFBP4'],
             '72': ['CA5B', 'RBM43', 'INTS6-AS1', 'LRRC8C-DT'],
             '240': ['LIX1-AS1', 'CLEC20A'],
             '169': ['NUP58', 'SPPL2B', 'CLASRP'],
             '160': ['CRIP2', 'BDH2', 'PCSK1N'],
             '297': ['COL19A1'],
             '134': ['PIDD1', 'NUP210', 'SPATA33'],
             '292': ['CORO2B', 'AMDHD2'],
             '153': ['CLEC4A', 'GM2A', 'MIR223HG'],
             '208': ['PRCD', 'ZNF528', 'MYBPC2'],
             '243': ['KCNH8', 'LINC02397'],
             '88': ['CNN3', 'CCND1', 'APP', 'PLK2'],
             '162': ['FRY', 'LINC02029', 'GABBR1'],
             '275': ['CNTNAP2', 'MAP3K9'],
             '136': ['GABARAPL1', 'IL1RN', 'PLAUR'],
             '244': ['CLIC4', 'FCHSD2'],
             '159': ['HEXIM1', 'GADD45A', 'IRS2'],
             '38': ['MARCHF1', 'HLA-DRB5', 'CPVL', 'SHTN1'],
             '179': ['ANXA1', 'CD9', 'LGALS3'],
             '193': ['TGFBI', 'RNASE6', 'MPEG1'],
             '245': ['OPN3', 'MAP3K13'],
             '84': ['IGHV3-53', 'IGKV2D-29', 'IGHV3-30', 'IGHV5-51'],
             '242': ['IGHE', 'WNT10A'],
             '191': ['H2BC21', 'H2BC8', 'H2BC5'],
             '87': ['DSP', 'FXYD7', 'SH2D3A', 'LSR'],
             '279': ['ADPGK-AS1', 'LINC00467'],
             '138': ['RHOB', 'CITED2', 'H2AC19'],
             '164': ['KCNIP2', 'RORA-AS1', 'PGBD4'],
             '248': ['CEMIP2', 'KLRD1'],
             '192': ['THRB', 'KLK1', 'MOB3B'],
             '207': ['ADARB1', 'EPN2', 'MICAL3'],
             '247': ['CFL2', 'PAIP2B'],
             '264': ['IFNG-AS1', 'SLAMF1'],
             '24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
             '28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
             '195': ['CCDC168', 'KLF3-AS1', 'NARF-AS2'],
             '228': ['MSMP', 'PLD5'],
             '89': ['CXCL3', 'CCL20', 'MIR3945HG', 'MMP7'],
             '143': ['VAV3-AS1', 'DPEP2', 'IL6-AS1'],
             '285': ['NETO1-DT', 'LRRIQ3'],
             '31': ['ARHGAP9', 'ARHGAP4', 'FAM200B', 'CMTM7'],
             '135': ['GALNT4', 'ARRDC4', 'LPP-AS2'],
             '145': ['FCRL5', 'LINC01480', 'GCM1'],
             '200': ['EGR2', 'C15orf48', 'EGR3'],
             '146': ['FCRL1', 'KLHL14', 'MYO7B'],
             '93': ['GPR35', 'LILRA1', 'LILRA6', 'HYAL3'],
             '11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
             '250': ['AKAP6', 'LINC01685'],
             '271': ['LUCAT1', 'IER5L'],
             '170': ['P2RX5', 'CDCA7L', 'GPR18'],
             '282': ['CLEC17A', 'HLA-DOB'],
             '76': ['DBNDD2', 'MANSC1', 'OXT', 'PPP1R21-DT'],
             '151': ['COLCA1', 'LIX1L-AS1', 'TAGAP-AS1'],
             '152': ['ARHGAP12', 'SLC16A11', 'HAGHL'],
             '154': ['LEKR1', 'OXTR', 'P2RX7'],
             '284': ['CD300E', 'DUSP6'],
             '299': ['E2F5'],
             '265': ['CAMK2D', 'CIITA'],
             '286': ['RAB37', 'C11orf21'],
             '82': ['DNASE1L3', 'KHDRBS2', 'MACROD2', 'MIR4432HG'],
             '288': ['CCDC88A', 'OGFRL1'],
             '47': ['MILR1', 'CBFA2T3', 'PAWR', 'HS2ST1'],
             '278': ['NKILA', 'KLHL22'],
             '283': ['TCL6', 'YPEL2'],
             '294': ['IFRD1', 'NEIL1'],
             '210': ['TMEM176A', 'G0S2', 'MAFB'],
             '253': ['ARHGEF35-AS1', 'FAM24B'],
             '181': ['KLRC1', 'GZMK', 'GPR65'],
             '155': ['IRF4', 'TP53INP1', 'PHF6'],
             '161': ['CSRP2', 'BEX1', 'PEG10'],
             '42': ['ZCCHC9', 'FBXO22', 'KIAA0586', 'PNKD'],
             '157': ['PPP1R14B-AS1', 'TP53RK-DT', 'PTK6'],
             '202': ['OR2A1-AS1', 'PPP1R12A-AS1', 'MYLK-AS1'],
             '209': ['COL9A3', 'N4BP3', 'LCN8'],
             '96': ['CHL1', 'SYT17', 'MAP2'],
             '257': ['ATP2B1-AS1', 'TMEM107'],
             '196': ['ARMCX4', 'MAP3K6', 'NPTX1'],
             '258': ['BCL6', 'ATP11A'],
             '171': ['CTSW', 'GZMM', 'FEZ1'],
             '20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44'],
             '290': ['CBR3', 'LOH12CR2'],
             '165': ['IRAK2', 'LINC00265', 'NRARP'],
             '172': ['H1-10', 'H1-2', 'H1-3'],
             '301': ['KLHDC7B-DT'],
             '274': ['CD180', 'IRAG1-AS1'],
             '167': ['JAG2', 'P2RY14', 'PPM1J'],
             '262': ['CYP2U1-AS1', 'KCTD16'],
             '263': ['CLDN11', 'NTNG1'],
             '293': ['LIX1', 'GNG3'],
             '298': ['IGLC1'],
             '49': ['MED12', 'ERLIN2', 'IFT122', 'P2RX4'],
             '176': ['HES1', 'ID1', 'ID2'],
             '173': ['HES4', 'PPP1R17', 'ODAD3'],
             '267': ['HHIP-AS1', 'LINC01013'],
             '142': ['MZB1', 'DERL3', 'AIM2'],
             '270': ['EAF2', 'PLPP5'],
             '272': ['INSIG2', 'KHDRBS3'],
             '273': ['IGHV3-7', 'STAG3'],
             '189': ['DPYSL3', 'CERCAM', 'NRP2'],
             '300': ['PRMT5-AS1'],
             '276': ['CHST15', 'LINC02728'],
             '277': ['OXCT1-AS1', 'MYZAP'],
             '255': ['PCDH9', 'NIBAN3'],
             '256': ['LINC01703', 'NDUFV2-AS1'],
             '281': ['LINC02132', 'NUDT8'],
             '295': ['LINC02422', 'SLC38A11'],
             '187': ['AMN1', 'NT5E', 'OLMALINC'],
             '261': ['GZMA', 'IL32']})

Filter on clusters with 5 or more genes

In [15]:

Copied!





# Obtain the set of gene programmes from clusters with #genes >= 5
mgs = dict()
for mg, genes in metagenes.items():
    if len(genes) > 4:
        mgs[mg] = genes
# Obtain the set of gene programmes from clusters with #genes >= 5
mgs = dict()
for mg, genes in metagenes.items():
    if len(genes) > 4:
        mgs[mg] = genes

In [16]:

Copied!

# Here are the gene programmes identified
mgs
# Here are the gene programmes identified
mgs

Out[16]:

{'21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
 '12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
 '5': ['ZBTB16', 'TLE1', 'TBC1D5', 'CRIM1', 'PRKCB', 'VAV3', 'MCTP2'],
 '22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
 '1': ['TNFRSF13B',
  'FCRL2',
  'PNOC',
  'S100B',
  'LINC01781',
  'CD1C',
  'LINC01857',
  'OSBPL10'],
 '10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
 '25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
 '2': ['SP2',
  'H2AC21',
  'H2AC11',
  'H2BC13',
  'H2BC18',
  'H2BC4',
  'H2BC7',
  'H3C3'],
 '17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
 '13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
 '16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
 '0': ['TXNDC5',
  'IGHA2',
  'IGLC3',
  'IGHA1',
  'IGHG4',
  'JCHAIN',
  'IGHG1',
  'IGHG3',
  'IGLC2'],
 '9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
 '19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
 '18': ['RAB33B-AS1', 'MZF1-AS1', 'SDR42E2', 'C16orf74', 'MIR4458HG'],
 '7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
 '29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
 '6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
 '23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
 '4': ['IGHV4-39',
  'IGKV1-9',
  'IGHV4-34',
  'IGHV1-3',
  'IGLV2-14',
  'IGHV1-18',
  'IGHV3-23'],
 '15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
 '14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
 '27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
 '8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
 '3': ['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1'],
 '26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
 '24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
 '28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
 '11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
 '20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44']}

Step 4: Visualise Network Connectivity within Desired Gene Programme¶

In [17]:

Copied!





# Retrieve gene programme 3
GP_genes = mgs['3']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
    df = embed.compute_similarities(i, GP_genes)
    df['Gene1'] = i
    df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)
# Retrieve gene programme 3
GP_genes = mgs['3']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
    df = embed.compute_similarities(i, GP_genes)
    df['Gene1'] = i
    df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)

['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1']

100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 236.27it/s]

      Gene  Similarity   Gene1
60  EPSTI1    0.225607  PLSCR1
45  EPSTI1    0.188223     MX2
6   EPSTI1    0.256395     MX1
35  EPSTI1    0.274948   IFIT3
29  EPSTI1    0.241366  IFI44L
19  EPSTI1    0.254587    XAF1
14  EPSTI1    0.119632   HERC5
31   HERC5    0.168532  IFI44L
55   HERC5    0.119632  EPSTI1
47   HERC5    0.163126     MX2
38   HERC5    0.202499   IFIT3
23   HERC5    0.168315    XAF1
63   HERC5    0.091875  PLSCR1
7    HERC5    0.174051     MX1
1   IFI44L    0.372201     MX1
17  IFI44L    0.337581    XAF1
59  IFI44L    0.232156  PLSCR1
11  IFI44L    0.168532   HERC5
34  IFI44L    0.296725   IFIT3
52  IFI44L    0.241366  EPSTI1
41  IFI44L    0.306452     MX2
44   IFIT3    0.198912     MX2
49   IFIT3    0.274948  EPSTI1
9    IFIT3    0.202499   HERC5
28   IFIT3    0.296725  IFI44L
3    IFIT3    0.313495     MX1
58   IFIT3    0.240587  PLSCR1
20   IFIT3    0.240173    XAF1
10     MX1    0.174051   HERC5
50     MX1    0.256395  EPSTI1
33     MX1    0.313495   IFIT3
25     MX1    0.372201  IFI44L
57     MX1    0.266255  PLSCR1
18     MX1    0.323817    XAF1
42     MX1    0.295364     MX2
4      MX2    0.295364     MX1
54     MX2    0.188223  EPSTI1
39     MX2    0.198912   IFIT3
13     MX2    0.163126   HERC5
21     MX2    0.228178    XAF1
27     MX2    0.306452  IFI44L
62     MX2    0.182111  PLSCR1
22  PLSCR1    0.212136    XAF1
30  PLSCR1    0.232156  IFI44L
53  PLSCR1    0.225607  EPSTI1
36  PLSCR1    0.240587   IFIT3
46  PLSCR1    0.182111     MX2
5   PLSCR1    0.266255     MX1
15  PLSCR1    0.091875   HERC5
43    XAF1    0.228178     MX2
37    XAF1    0.240173   IFIT3
26    XAF1    0.337581  IFI44L
12    XAF1    0.168315   HERC5
2     XAF1    0.323817     MX1
61    XAF1    0.212136  PLSCR1
51    XAF1    0.254587  EPSTI1

In [18]:

Copied!





# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)
# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)

In [19]:

Copied!





# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')

elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]

pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)

width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
    if v > thresh:
        width_large[list(widths.keys())[i]] = v*10
    else:
        width_small[list(widths.keys())[i]] = max(v, 0)*10

nx.draw_networkx_edges(G, pos,
                       edgelist = width_small.keys(),
                       width=list(width_small.values()),
                       edge_color='lightblue',
                       alpha=0.8)
nx.draw_networkx_edges(G, pos, 
                       edgelist = width_large.keys(), 
                       width = list(width_large.values()), 
                       alpha = 0.5, 
                       edge_color = "blue", 
                      )
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)

ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()
# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')

elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]

pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)

width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
    if v > thresh:
        width_large[list(widths.keys())[i]] = v*10
    else:
        width_small[list(widths.keys())[i]] = max(v, 0)*10

nx.draw_networkx_edges(G, pos,
                       edgelist = width_small.keys(),
                       width=list(width_small.values()),
                       edge_color='lightblue',
                       alpha=0.8)
nx.draw_networkx_edges(G, pos, 
                       edgelist = width_large.keys(), 
                       width = list(width_large.values()), 
                       alpha = 0.5, 
                       edge_color = "blue", 
                      )
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)

ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()

No description has been provided for this image

Step 6: Reactome Pathway Analysis¶

Reactome pathway is another form of pathway analysis that can be carried out on the data.

In [20]:

Copied!





# Meta info about the number of terms (tests) in the databases
df_database = pd.DataFrame(
data = [['GO_Biological_Process_2021', 6036],
['GO_Molecular_Function_2021', 1274],
['Reactome_2022', 1818]],
columns = ['dataset', 'term'])
# Meta info about the number of terms (tests) in the databases
df_database = pd.DataFrame(
data = [['GO_Biological_Process_2021', 6036],
['GO_Molecular_Function_2021', 1274],
['Reactome_2022', 1818]],
columns = ['dataset', 'term'])

In [21]:

Copied!





# Select desired database for query; here use Reactome as an example
databases = ['Reactome_2022']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m
# Select desired database for query; here use Reactome as an example
databases = ['Reactome_2022']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m

In [22]:

Copied!





# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
                          gene_sets=databases,
                          organism='Human', 
                          outdir='test/enr_Reactome',
                          cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)

# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df
# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
                          gene_sets=databases,
                          organism='Human', 
                          outdir='test/enr_Reactome',
                          cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)

# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df

Out[22]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
0	Reactome_2022	Interferon Signaling R-HSA-913531	5/200	5.197809e-09	8.836276e-08	169.205128	3227.592653	HERC5;MX2;MX1;XAF1;IFIT3
1	Reactome_2022	Interferon Alpha/Beta Signaling R-HSA-909733	4/72	1.068672e-08	9.083711e-08	293.000000	5377.799365	MX2;MX1;XAF1;IFIT3
2	Reactome_2022	Cytokine Signaling In Immune System R-HSA-1280215	5/702	2.693582e-06	1.189224e-05	46.138211	591.705896	HERC5;MX2;MX1;XAF1;IFIT3
3	Reactome_2022	ISG15 Antiviral Mechanism R-HSA-1169408	3/75	2.798175e-06	1.189224e-05	166.000000	2122.566166	HERC5;MX2;MX1
4	Reactome_2022	Antiviral Mechanism By IFN-stimulated Genes R-...	3/83	3.801759e-06	1.292598e-05	149.340000	1863.770163	HERC5;MX2;MX1

In [ ]:

External Reading:

SCGPT paper: https://www.nature.com/articles/s41592-024-02201-0
Feature selection: https://bioconductor.org/books/3.15/OSCA.basic/feature-selection.html
P-values: https://www.nature.com/articles/nmeth.2900
Community detection algorithms: https://www.nature.com/articles/s41598-019-41695-z
Reactome: https://reactome.org/

Exercises:

Describe the process of loading the pre-trained scGPT model. What are the key parameters used?
Explain the significance of the n_hvg and n_bins parameters used in the preprocessing pipeline.
Why is it important to filter the model gene embeddings to match the HVGs found in the input dataset?
Describe the process of constructing the gene embedding network and performing Louvain clustering. What is the role of the resolution parameter?
Explain how the cosine similarity among genes in a specific gene programme is computed.
What is the purpose of performing pathway enrichment analysis in this context?
Explain the significance of the p-value correction applied during the pathway enrichment analysis.
Change the resolution parameter for the Louvain clustering step and analyse how the number of identified gene programmes changes. Plot the distribution of gene programmes for different resolution values.
Choose a different gene programme and create a similar gene similarity network to visualise.
Perform pathway enrichment analysis using a different database, such as GO_Biological_Process_2021 as well as report the top enriched pathways/processes and their significance levels.

Answers:

The process of loading the pre-trained scGPT model involves several steps:
- The vocabulary is loaded from a file and special tokens are appended if they are not already present.
- Model parameters such as embedding size, number of heads, number of layers amongst others are read from a configuration file.
- A transformer model object is intitialised with the retrieved parameters and the vocabulary size.
- The model weights are loaded from a checkpoint file. If there are mismatches in the parameter shapes, only matching parameters are updated in the model state.
n_hvg parameter specifies the number of highly variable genes to retain from the dataset.Selecting HVGs helps in focusing on genes that have most significant variations across cells, which are typically more informative for downstream analyses like clustering and classification. n_bins parameter determines how many bins the gene expression values are divided into. Binning is used to discretise continuous gene expression data into discrete categories, which can simplify the input for models and help in reducing noise.
Filtering genes ensures that only the embeddings of genes that are relevant (HVGs) to the curren analysis are used. HVGs capture the most important variations in the dataset. This alignment also guarantees that the model embeddings correspond to the actual genes present in the dataset, preventing discrepancies that could arise if embeddings for non-existent or non-variable genes were included. By focusing onlhy on HVGs, the computational complexity and memory usage are reduced, making the analysis more efficient.
The process of constructing the network and clustering:
- The gene embeddings obtained from the model are used to create a graph where nodes represent genes, and edges represent similarities (e.g. cosine similarities) between gene embeddings.
- The Louivain method is applied to the graph to partition the network into clusters. The algorithm optimises modularity to find dense clusters of nodes (genes) that are highly interconnected.
- The resolution parameter controls the granularity of the clustering. A higher resolution value typically results in a larger number of smaller clusters, whereas a lower resolution value produces fewer but larger clusters. Adjusting it helps fine-tune the balance between cluster size and number.
The cosine similarity calculation: For each gene in the gene programme, the cosine similarity with every other gene in the same programme is computed. The cosine similarity between two vectors is computed by the dot product of vector A and B divided by both vectors' magnitudes.
The purpose is to help identify which biological pathways are significantly associated with the genes in a given gene programme. This can provide insights into the biological functions and processes that are active or relevant in the dataset. It also aids in generating hypotheses about the roles of specific genes and their interactions within known biological pathways. It also helps in annotating gene clusters with biological functions, which can be important for interpreting results of gene expression studies and understanding the underlying biology.
When performing pathway enrichment analysis, multiple hypothesis tests are conducted (one for each pathway). This increases the likelihood of false positives. P-value correction methods, such as the Bonferroni correction used here, adjust the significance threshold to account for the number of tests performed, reducing the probability of false positive results. The corrected p-value threshold ensures that the results reported as significant are truly so, given the large number of tests. This helps in maintaining the overall type I error rate (false positive rate) at the desired level (typically 0.05).

Answer 8:

In [23]:

Copied!





# Define a range of resolution values to test
resolutions = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

# Dictionary to store the number of gene programmes for each resolution
gene_program_counts = {}

for res in resolutions:
    # Perform Louvain clustering with the current resolution
    gdata = embed.get_adata(resolution=res)
    metagenes = embed.get_metagenes(gdata)
    
    # Count the number of gene programs (clusters with >= 5 genes)
    gene_program_counts[res] = sum(1 for genes in metagenes.values() if len(genes) >= 5)

# Plot the number of gene programs for each resolution
plt.figure(figsize=(10, 6))
plt.plot(resolutions, list(gene_program_counts.values()), marker='o')
plt.xlabel('Resolution')
plt.ylabel('Number of Gene Programmes')
plt.title('Number of Gene Programmes vs. Resolution')
plt.grid(True)
plt.show()
# Define a range of resolution values to test
resolutions = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

# Dictionary to store the number of gene programmes for each resolution
gene_program_counts = {}

for res in resolutions:
    # Perform Louvain clustering with the current resolution
    gdata = embed.get_adata(resolution=res)
    metagenes = embed.get_metagenes(gdata)
    
    # Count the number of gene programs (clusters with >= 5 genes)
    gene_program_counts[res] = sum(1 for genes in metagenes.values() if len(genes) >= 5)

# Plot the number of gene programs for each resolution
plt.figure(figsize=(10, 6))
plt.plot(resolutions, list(gene_program_counts.values()), marker='o')
plt.xlabel('Resolution')
plt.ylabel('Number of Gene Programmes')
plt.title('Number of Gene Programmes vs. Resolution')
plt.grid(True)
plt.show()

As the resolution increases, so does the number of gene programmes.

Answer 9:

In [24]:

Copied!





# Retrieve gene programme 14
GP_genes = mgs['14']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
    df = embed.compute_similarities(i, GP_genes)
    df['Gene1'] = i
    df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)

# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)

# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')

elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]

pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)

width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
    if v > thresh:
        width_large[list(widths.keys())[i]] = v*10
    else:
        width_small[list(widths.keys())[i]] = max(v, 0)*10

nx.draw_networkx_edges(G, pos,
                       edgelist = width_small.keys(),
                       width=list(width_small.values()),
                       edge_color='lightblue',
                       alpha=0.8)
nx.draw_networkx_edges(G, pos, 
                       edgelist = width_large.keys(), 
                       width = list(width_large.values()), 
                       alpha = 0.5, 
                       edge_color = "blue", 
                      )
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)

ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()
# Retrieve gene programme 14
GP_genes = mgs['14']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
    df = embed.compute_similarities(i, GP_genes)
    df['Gene1'] = i
    df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)

# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)

# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')

elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]

pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)

width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
    if v > thresh:
        width_large[list(widths.keys())[i]] = v*10
    else:
        width_small[list(widths.keys())[i]] = max(v, 0)*10

nx.draw_networkx_edges(G, pos,
                       edgelist = width_small.keys(),
                       width=list(width_small.values()),
                       edge_color='lightblue',
                       alpha=0.8)
nx.draw_networkx_edges(G, pos, 
                       edgelist = width_large.keys(), 
                       width = list(width_large.values()), 
                       alpha = 0.5, 
                       edge_color = "blue", 
                      )
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)

ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()

['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM']

100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 240.97it/s]

        Gene  Similarity     Gene1
1       EGR1    0.405794      FOSB
28      EGR1    0.222037  MIR23AHG
34      EGR1    0.179937     MYADM
15      EGR1    0.296500      KLF4
8       EGR1    0.254614       FOS
29       FOS    0.210742  MIR23AHG
21       FOS    0.254614      EGR1
17       FOS    0.245863      KLF4
35       FOS    0.171657     MYADM
5        FOS    0.335028      FOSB
13      FOSB    0.339160      KLF4
7       FOSB    0.335028       FOS
31      FOSB    0.364945     MYADM
19      FOSB    0.405794      EGR1
25      FOSB    0.363793  MIR23AHG
32      KLF4    0.300213     MYADM
20      KLF4    0.296500      EGR1
4       KLF4    0.339160      FOSB
9       KLF4    0.245863       FOS
26      KLF4    0.274826  MIR23AHG
33  MIR23AHG    0.231585     MYADM
3   MIR23AHG    0.363793      FOSB
10  MIR23AHG    0.210742       FOS
16  MIR23AHG    0.274826      KLF4
22  MIR23AHG    0.222037      EGR1
23     MYADM    0.179937      EGR1
2      MYADM    0.364945      FOSB
14     MYADM    0.300213      KLF4
11     MYADM    0.171657       FOS
27     MYADM    0.231585  MIR23AHG

Answer 10:

In [29]:

Copied!





# Set pandas options to display the entire term
pd.set_option('display.max_colwidth', None)

# Select desired database for query
databases = ['GO_Biological_Process_2021']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m

# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
                          gene_sets=databases,
                          organism='Human', 
                          outdir='test/enr_Reactome',
                          cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)

# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df
# Set pandas options to display the entire term
pd.set_option('display.max_colwidth', None)

# Select desired database for query
databases = ['GO_Biological_Process_2021']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m

# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
                          gene_sets=databases,
                          organism='Human', 
                          outdir='test/enr_Reactome',
                          cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)

# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df

Out[29]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
0	GO_Biological_Process_2021	positive regulation of pri-miRNA transcription by RNA polymerase II (GO:1902895)	3/34	8.944946e-08	0.000014	643.967742	10451.333760	EGR1;FOS;KLF4
1	GO_Biological_Process_2021	regulation of pri-miRNA transcription by RNA polymerase II (GO:1902893)	3/45	2.118537e-07	0.000017	475.047619	7300.232566	EGR1;FOS;KLF4
2	GO_Biological_Process_2021	regulation of cell-cell adhesion involved in gastrulation (GO:0070587)	2/10	3.371314e-06	0.000131	1249.125000	15739.234729	MYADM;KLF4
3	GO_Biological_Process_2021	negative regulation of heterotypic cell-cell adhesion (GO:0034115)	2/10	3.371314e-06	0.000131	1249.125000	15739.234729	MYADM;KLF4

The top significant processes are positive regulation of pri-miRNA transcription by RNA polymerase II (p.val <0.05), regulation of cell-cell adhesion involved in gastrulation (p.val <0.05) and negative regulation of heterotypic cell-cell adhesion (p.val <0.5).