Gene Network Analysis¶
This workbook will cover gene network analysis using large language models as an alternative to correlation analysis. Specifically, we will look at using scgpt, which is a foundation model for single-cell multi-omic data and has a wide range of capabilities.
We will use a model pretrained 10.3 million blood and bone marrow cells, since our dataset consists of B cells obtained from peripheral blood samples.
import copy
import json
import os
from pathlib import Path
import sys
import warnings
import gzip
import shutil
import torch
from anndata import AnnData
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pandas as pd
import tqdm
import gseapy as gp
from torchtext.vocab import Vocab
from torchtext._torchtext import (
Vocab as VocabPybind,
)
sys.path.insert(0, "../")
import scgpt as scg
from scgpt.tasks import GeneEmbedding
from scgpt.tokenizer.gene_tokenizer import GeneVocab
from scgpt.model import TransformerModel
from scgpt.preprocess import Preprocessor
from scgpt.utils import set_seed
os.environ["KMP_WARNINGS"] = "off"
warnings.filterwarnings('ignore')
We need to set the parameters of the model for when we use it later in this workbook.
set_seed(42)
pad_token = "<pad>"
special_tokens = [pad_token, "<cls>", "<eoc>"]
n_hvg = 1000
n_bins = 51
mask_value = -1
pad_value = -2
n_input_bins = n_bins
Lets go through the parameters so they are better understood:
- set_seed is used to set the random seed for reproducibility.
- pad_token is a special token used for padding sequences, to ensure that all sequences are of the same length. Therefore it will be inserted where necessary to make sequences of unequal lengths equal.
- special_tokens is a list defining a set of special tokens the model may use, such as pad. Cls is commonly a classification token that represents the beginning of a sequence and is used to aggregate information for classification tasks. Eoc stands for 'end of converstion/content' and is used to denote the end of a sequence.
- n_hvg is a variable representing the number of highly variable genes to be considered.
- n_bins defines the number of bins to group continuous data into discrete intervals.
- mask_value defines a value for masking. Masking is a technique to ignore certain parts of the data, often used in sequence models to indicate positions in a sequence that should be ignored during training or evaluation.
- pad_value defines the value for padding sequences. Similar to mask_value, it indicates positions in the data that are not part of the original sequence and should be ignored.
- n_input_bins assignment set the number of input bins ensuring consistency throughout the code.
Step 1: Load the Pre-Trained Model and Dataset¶
# Specify model path; here we load the pre-trained scGPT blood model
#You will have to directly download the model from here: https://github.com/bowang-lab/scGPT
model_dir = Path("/data/scgpt_blood_model")
model_config_file = model_dir / "args.json"
model_file = model_dir / "best_model.pt"
vocab_file = model_dir / "vocab.json"
vocab = GeneVocab.from_file(vocab_file)
for s in special_tokens:
if s not in vocab:
vocab.append_token(s)
# Retrieve model parameters from config files
with open(model_config_file, "r") as f:
model_configs = json.load(f)
print(
f"Resume model from {model_file}, the model args will override the "
f"config {model_config_file}."
)
embsize = model_configs["embsize"]
nhead = model_configs["nheads"]
d_hid = model_configs["d_hid"]
nlayers = model_configs["nlayers"]
n_layers_cls = model_configs["n_layers_cls"]
gene2idx = vocab.get_stoi()
#Set up Pytorch model to run on specified device (CPU or GPU), intialise the model with specific parameters,
#load pre-trained weight, and then transfer the model to the chosen device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ntokens = len(vocab) # size of vocabulary
model = TransformerModel(
ntokens,
embsize,
nhead,
d_hid,
nlayers,
vocab=vocab,
pad_value=pad_value,
n_input_bins=n_input_bins,
)
try:
model.load_state_dict(torch.load(model_file))
print(f"Loading all model params from {model_file}")
except:
# only load params that are in the model and match the size
model_dict = model.state_dict()
pretrained_dict = torch.load(model_file)
pretrained_dict = {
k: v
for k, v in pretrained_dict.items()
if k in model_dict and v.shape == model_dict[k].shape
}
for k, v in pretrained_dict.items():
print(f"Loading params {k} with shape {v.shape}")
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)
model.to(device)
Loading params encoder.embedding.weight with shape torch.Size([36574, 512]) Loading params encoder.enc_norm.weight with shape torch.Size([512]) Loading params encoder.enc_norm.bias with shape torch.Size([512]) Loading params value_encoder.linear1.weight with shape torch.Size([512, 1]) Loading params value_encoder.linear1.bias with shape torch.Size([512]) Loading params value_encoder.linear2.weight with shape torch.Size([512, 512]) Loading params value_encoder.linear2.bias with shape torch.Size([512]) Loading params value_encoder.norm.weight with shape torch.Size([512]) Loading params value_encoder.norm.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.0.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.0.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.0.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.1.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.1.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.1.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.2.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.2.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.2.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.3.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.3.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.3.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.4.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.4.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.4.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.5.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.5.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.5.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.6.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.6.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.6.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.7.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.7.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.7.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.8.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.8.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.8.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.8.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.9.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.9.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.9.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.10.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.10.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.10.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.11.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.11.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.11.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm2.bias with shape torch.Size([512]) Loading params decoder.fc.0.weight with shape torch.Size([512, 512]) Loading params decoder.fc.0.bias with shape torch.Size([512]) Loading params decoder.fc.2.weight with shape torch.Size([512, 512]) Loading params decoder.fc.2.bias with shape torch.Size([512]) Loading params decoder.fc.4.weight with shape torch.Size([1, 512]) Loading params decoder.fc.4.bias with shape torch.Size([1])
TransformerModel(
(encoder): GeneEncoder(
(embedding): Embedding(36574, 512, padding_idx=36571)
(enc_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(value_encoder): ContinuousValueEncoder(
(dropout): Dropout(p=0.5, inplace=False)
(linear1): Linear(in_features=1, out_features=512, bias=True)
(activation): ReLU()
(linear2): Linear(in_features=512, out_features=512, bias=True)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(transformer_encoder): TransformerEncoder(
(layers): ModuleList(
(0-11): 12 x TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
)
(linear1): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.5, inplace=False)
(linear2): Linear(in_features=512, out_features=512, bias=True)
(norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.5, inplace=False)
(dropout2): Dropout(p=0.5, inplace=False)
)
)
)
(decoder): ExprDecoder(
(fc): Sequential(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): LeakyReLU(negative_slope=0.01)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Linear(in_features=512, out_features=1, bias=True)
)
)
(cls_decoder): ClsDecoder(
(_decoder): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): ReLU()
(2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(3): Linear(in_features=512, out_features=512, bias=True)
(4): ReLU()
(5): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(out_layer): Linear(in_features=512, out_features=1, bias=True)
)
(sim): Similarity(
(cos): CosineSimilarity()
)
(creterion_cce): CrossEntropyLoss()
)
Load dataset of interest.
# Paths to the compressed and uncompressed files
compressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_compressed.h5ad.gz'
uncompressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_uncompressed.h5ad'
# Decompress the file
with gzip.open(compressed_file, 'rb') as f_in:
with open(uncompressed_file, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
# Read the decompressed file using scanpy
adata = sc.read(uncompressed_file)
# Optionally, remove the uncompressed file if no longer needed
os.remove(uncompressed_file)
adata
AnnData object with n_obs × n_vars = 3540 × 25198
obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_umap'
Check that the gene variable names are in the correct format i.e. gene symbol names and not Ensembl IDs.
adata.var
| feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length | highly_variable | means | dispersions | dispersions_norm | |
|---|---|---|---|---|---|---|---|---|---|
| feature_name | |||||||||
| MIR1302-2HG | False | MIR1302-2HG | NCBITaxon:9606 | gene | 1021 | False | 1.000000e-12 | NaN | NaN |
| FAM138A | False | FAM138A | NCBITaxon:9606 | gene | 1219 | False | 1.000000e-12 | NaN | NaN |
| OR4F5 | False | OR4F5 | NCBITaxon:9606 | gene | 2618 | False | 1.000000e-12 | NaN | NaN |
| OR4F29 | False | OR4F29 | NCBITaxon:9606 | gene | 939 | False | 1.000000e-12 | NaN | NaN |
| OR4F16 | False | OR4F16 | NCBITaxon:9606 | gene | 939 | False | 1.000000e-12 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTY17C | False | TTTY17C | NCBITaxon:9606 | gene | 776 | False | 1.000000e-12 | NaN | NaN |
| SEPTIN14P23 | False | SEPTIN14P23 | NCBITaxon:9606 | gene | 1192 | False | 2.503351e-04 | -0.120703 | -1.405214 |
| CDY1 | False | CDY1 | NCBITaxon:9606 | gene | 2670 | False | 1.000000e-12 | NaN | NaN |
| TTTY3 | False | TTTY3 | NCBITaxon:9606 | gene | 344 | False | 1.000000e-12 | NaN | NaN |
| MAFIP | False | MAFIP | NCBITaxon:9606 | gene | 1599 | False | 6.082392e-03 | 0.002191 | 0.189369 |
25198 rows × 9 columns
# Preprocess the data following the scGPT data pre-processing pipeline
preprocessor = Preprocessor(
use_key="X", # the key in adata.layers to use as raw data
filter_gene_by_counts=3, # step 1
filter_cell_by_counts=False, # step 2
normalize_total=1e4, # step 3 whether to normalize the raw data and to what sum
result_normed_key="X_normed", # the key in adata.layers to store the normalized data
log1p=data_is_raw, # step 4 whether to log1p the normalised data
result_log1p_key="X_log1p",
subset_hvg=n_hvg, # 5. whether to subset the raw data to highly variable genes
hvg_flavor="seurat_v3" if data_is_raw else "cell_ranger",
binning=n_bins, # 6. whether to bin the raw data and to what number of bins
result_binned_key="X_binned", # the key in adata.layers to store the binned data
)
preprocessor(adata)
scGPT - INFO - Filtering genes by counts ... scGPT - INFO - Normalizing total counts ... scGPT - INFO - Subsetting highly variable genes ... scGPT - WARNING - No batch_key is provided, will use all cells for HVG selection. scGPT - INFO - Binning data ...
Step 2: Retrieve Scgpt's Gene Embeddings¶
Overall, the pre-trained foundation model contains 30+K genes. Here for simplicity, we focus on a subset of HVGs specific to the data at hand.
# Retrieve the data-independent gene embeddings from scGPT
gene_ids = np.array([id for id in gene2idx.values()])
gene_embeddings = model.encoder(torch.tensor(gene_ids, dtype=torch.long).to(device))
gene_embeddings = gene_embeddings.detach().cpu().numpy()
# Filter on the intersection between the Blood HVGs found and scGPT's 30+K foundation model vocab
gene_embeddings = {gene: gene_embeddings[i] for i, gene in enumerate(gene2idx.keys()) if gene in adata.var.index.tolist()}
print('Retrieved gene embeddings for {} genes.'.format(len(gene_embeddings)))
Retrieved gene embeddings for 962 genes.
# Construct gene embedding network
embed = GeneEmbedding(gene_embeddings)
100%|████████████████████████████████████████████████████████████████████████████| 962/962 [00:00<00:00, 1095850.20it/s]
Step 3: Extract gene programmes from gene embedding network¶
Perform Louvain clustering on the gene embedding network.
# Perform Louvain clustering with desired resolution; here we specify resolution=40
gdata = embed.get_adata(resolution=40)
# Retrieve the gene clusters
metagenes = embed.get_metagenes(gdata)
gdata
AnnData object with n_obs × n_vars = 962 × 512
obs: 'leiden'
uns: 'neighbors', 'leiden', 'umap'
obsm: 'X_umap'
obsp: 'distances', 'connectivities'
metagenes
defaultdict(list,
{'32': ['ZNF880', 'SNHG21', 'APOM', 'SC5D'],
'214': ['ZNF837', 'SMYD5'],
'215': ['ZNF713', 'MIR762HG'],
'254': ['ZNF618', 'ZNF235'],
'144': ['ZNF608', 'APBB2', 'ZNF43'],
'57': ['ZNF594-DT', 'TSPYL5', 'ZNF530', 'MTFR2'],
'120': ['ZNF528-AS1', 'WDR27', 'CORT'],
'37': ['ZNF426', 'ZFP14', 'TTC28-AS1', 'ZNF76'],
'182': ['ZNF415', 'KRCC1', 'PTPRN2'],
'109': ['ZNF391', 'ZCWPW1', 'IFT88'],
'74': ['ZNF318', 'SIPA1L3', 'KBTBD8', 'GUCD1'],
'21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
'91': ['ZNF140', 'SMIM8', 'ZBED5-AS1', 'FCF1'],
'99': ['ZMYND8', 'ITPR1', 'CDKAL1'],
'268': ['ZMAT1', 'EFCAB13'],
'12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
'217': ['ZEB2-AS1', 'PTCH2'],
'95': ['ZDHHC19', 'SERHL2', 'NEK8'],
'218': ['ZC3H12C', 'PACS2'],
'211': ['ZBTB32', 'IL2RA'],
'5': ['ZBTB16',
'TLE1',
'TBC1D5',
'CRIM1',
'PRKCB',
'VAV3',
'MCTP2'],
'175': ['XIST', 'ITGA4', 'LPP'],
'114': ['XCL2', 'TFPI2', 'CCL4L2'],
'126': ['XCL1', 'GNG4', 'TNFRSF18'],
'166': ['WDR86', 'CNKSR2', 'GRAP'],
'53': ['WBP4', 'SREBF2-AS1', 'COQ7', 'KMT2E-AS1'],
'98': ['WASF1', 'SGO1', 'CCDC18'],
'100': ['VPREB3', 'CD24', 'FLI1'],
'101': ['USP7-AS1', 'CATSPERZ', 'CERNA1'],
'205': ['USF1', 'RASSF2', 'RIN3'],
'102': ['UMAD1', 'SCN3A', 'BEND5'],
'103': ['UACA', 'PDE7B', 'EPS8'],
'158': ['TYROBP', 'FCER1G', 'HCST'],
'137': ['TUSC1', 'MYL5', 'LINC01003'],
'51': ['TUBB6', 'EIF1AY', 'PDLIM7', 'HMGN5'],
'104': ['TTC9', 'CXCR5', 'DBNDD1'],
'105': ['TTC23L-AS1', 'LNP1', 'MSS51'],
'106': ['TSC22D1', 'MEG3', 'H2AC6'],
'107': ['TRIM66', 'EPM2A', 'MCF2L'],
'108': ['TRIM52-AS1', 'NUBPL', 'THAP7-AS1'],
'22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
'131': ['TRBC1', 'TRDC', 'C12orf75'],
'81': ['TRAF4', 'METTL8', 'BTLA', 'ADAM19'],
'1': ['TNFRSF13B',
'FCRL2',
'PNOC',
'S100B',
'LINC01781',
'CD1C',
'LINC01857',
'OSBPL10'],
'110': ['TNFAIP6', 'ADM', 'PTGS2'],
'111': ['TMEM62', 'CLCN4', 'LETM2'],
'132': ['TMEM44-AS1', 'LINC02551', 'TNRC6B-DT'],
'149': ['TMEM38A', 'EIF2AK3-DT', 'GCNT2'],
'112': ['TMEM131L', 'ACSM3', 'MARCHF3'],
'10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
'68': ['TUNAR', 'TVP23A', 'ELL3', 'GLI1'],
'113': ['THOC1-DT', 'LINGO3', 'GORAB-AS1'],
'65': ['THBS1', 'SERPINB2', 'CCL7', 'EREG'],
'40': ['TGM1', 'DTX4', 'ECE1-AS1', 'SERPINB9P1'],
'56': ['TTC30A', 'ATP10D', 'MUL1', 'SPRYD7'],
'59': ['TEX9', 'STK33', 'TMEM67', 'PPIL6'],
'216': ['TESC', 'SCNM1'],
'201': ['TEDC1', 'ARHGEF18', 'PCGF1'],
'251': ['TCEAL4', 'INO80C'],
'77': ['TBXA2R', 'SDK2', 'CR2', 'MYEF2'],
'97': ['ZNF667', 'WDR5B-DT', 'ZNF10'],
'174': ['TAT-AS1', 'CUBN', 'LINC02175'],
'177': ['TAPT1', 'BCL7A', 'IL4R'],
'94': ['TAL1', 'SMIM5', 'PDZK1IP1'],
'25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
'178': ['SUV39H2', 'DPPA4', 'IMPACT'],
'119': ['SULF2', 'S100Z', 'DOCK8-AS1'],
'302': ['STAG1-DT'],
'79': ['UQCRHL', 'GIHCG', 'NREP', 'AZU1'],
'219': ['SPRY1', 'ZNF667-AS1'],
'66': ['SPINK2', 'ZNF844', 'ERICH6-AS1', 'ABCB1'],
'235': ['SPATA6', 'MAPK10'],
'90': ['SPARC', 'CLU', 'MYH11', 'MYL9'],
'220': ['SPAG16', 'PTPRK'],
'2': ['SP2',
'H2AC21',
'H2AC11',
'H2BC13',
'H2BC18',
'H2BC4',
'H2BC7',
'H3C3'],
'184': ['SOX2-OT', 'STAR', 'MYL3'],
'194': ['SOS1-IT1', 'CAPN14', 'MTCL1'],
'33': ['SOCS1', 'CSRNP1', 'GADD45G', 'RGS16'],
'115': ['SOBP', 'TPK1', 'KRBOX1'],
'85': ['SNX29', 'CYSLTR1', 'UVRAG', 'LHPP'],
'55': ['SNX22', 'TCTN1', 'FAM111B', 'CNFN'],
'116': ['SNAPC1', 'HEIH', 'MXI1'],
'117': ['SMAGP', 'SCPEP1', 'ANXA4'],
'148': ['SLC39A10', 'CEP290', 'ANKRD26'],
'222': ['ST20-AS1', 'ERMN'],
'63': ['SLC2A6', 'SIGLEC10', 'SIDT2', 'SCIMP'],
'62': ['SKAP1', 'NCR3', 'KLRB1', 'HOPX'],
'186': ['SIMC1', 'LYRM1', 'PIGC'],
'118': ['SH3BP2', 'FAM120A', 'VPS41'],
'163': ['SFR1', 'RAB30-DT', 'CHML'],
'206': ['SETBP1', 'OR2T11', 'LAMC1'],
'64': ['SERPINI1', 'SLC2A1', 'ANKRD37', 'PWWP3A'],
'41': ['TPST1', 'RAPGEF5', 'FLT1', 'ST6GALNAC3'],
'188': ['SEMA6B', 'GASAL1', 'INSIG1-DT'],
'17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
'139': ['SAV1', 'USP6NL', 'POU2F1'],
'223': ['TRMT1', 'LINC01678'],
'123': ['SATB1', 'PDE7A', 'BCL11B'],
'13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
'86': ['RUNX1', 'GNA15', 'ANKRD28', 'LRRC75A'],
'16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
'224': ['RRAD', 'DNAAF1'],
'0': ['TXNDC5',
'IGHA2',
'IGLC3',
'IGHA1',
'IGHG4',
'JCHAIN',
'IGHG1',
'IGHG3',
'IGLC2'],
'241': ['RP9', 'AGPAT5'],
'83': ['SNX10', 'SLC11A1', 'MCEMP1', 'PTPN6'],
'121': ['RPGR', 'SEMA4B', 'RRM2B'],
'225': ['USP9Y', 'CCDC141'],
'34': ['UBE4A', 'OGT', 'TRIB2', 'CCND2'],
'156': ['TBCK', 'EPG5', 'NOTCH1'],
'36': ['SBNO1-AS1', 'LAIR1', 'CRELD2', 'CD38'],
'252': ['WAC-AS1', 'TLR1'],
'226': ['WWC3', 'ZDHHC21'],
'35': ['RPAP2', 'DUSP12', 'SLC25A45', 'OSGEPL1'],
'44': ['ZNF585A', 'EID2B', 'HS3ST3B1', 'AKAP7'],
'122': ['ZRSR2', 'DIPK1B', 'SCAMP4'],
'67': ['STEAP1B', 'DTWD2', 'BCAS4', 'CCSER1'],
'197': ['ZNF83', 'SESN1', 'NUAK2'],
'269': ['SCAI', 'ARPP21'],
'296': ['TNFSF13B', 'CLEC7A'],
'39': ['RNF170', 'BMP3', 'ALKAL2', 'CBR3-AS1'],
'227': ['WDR19', 'IQCN'],
'287': ['RNF138', 'NAF1'],
'69': ['RNF103', 'R3HDM2', 'ZYG11B', 'ZBTB18'],
'9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
'80': ['RNASEH2B-AS1', 'SPA17', 'EFHB', 'IQCD'],
'48': ['RIOK1', 'URB1-AS1', 'LINC00526', 'NUFIP1'],
'140': ['RGS1', 'DUSP4', 'CCL2'],
'30': ['REXO1', 'CORO7', 'AKAP8', 'BTBD2'],
'124': ['RDX', 'TSPAN13', 'IL3RA_ENSG00000185291'],
'58': ['RBP7', 'NFAM1', 'HCAR2', 'ASGR1'],
'46': ['RASSF8', 'RBM38-AS1', 'LINC01811', 'PCP2'],
'280': ['RASSF1-AS1', 'FMO5'],
'125': ['RAP2C-AS1', 'MAP3K2-DT', 'LINC00494'],
'19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
'212': ['TNFAIP2', 'UBXN11'],
'18': ['RAB33B-AS1',
'MZF1-AS1',
'SDR42E2',
'C16orf74',
'MIR4458HG'],
'7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
'29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
'60': ['GNMT', 'ERCC2', 'TMEM167B-DT', 'PSMA8'],
'6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
'23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
'180': ['GAPT', 'IGLC5', 'CD200'],
'4': ['IGHV4-39',
'IGKV1-9',
'IGHV4-34',
'IGHV1-3',
'IGLV2-14',
'IGHV1-18',
'IGHV3-23'],
'266': ['CLMN', 'TFEB'],
'15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
'14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
'127': ['GALNT6', 'GOLPH3-DT', 'GPR137C'],
'229': ['CXCL10', 'CXCL2'],
'27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
'92': ['BBOF1', 'TWSG1-DT', 'WDR5B', 'CFAP45'],
'70': ['CLLU1-AS1', 'ARHGAP22', 'DNAJC12', 'AVP'],
'199': ['CCDC112', 'FADS3', 'ZNF821'],
'8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
'246': ['FCER2', 'ABCB4'],
'190': ['POC1B', 'ASXL2', 'MBNL3'],
'168': ['FAM241A', 'SDCCAG8', 'CERS6'],
'230': ['ARL13B', 'HNRNPD-DT'],
'231': ['NR4A1', 'ATF3'],
'221': ['FAM111A', 'PLCL2'],
'232': ['HMGA1P4', 'C1orf220'],
'233': ['EVI5', 'ATP6V0A1'],
'128': ['C11orf24', 'LARGE2', 'KCNG1'],
'234': ['ARFIP1', 'GCNT1'],
'129': ['DNAL4', 'CD2AP-DT', 'PRXL2B'],
'147': ['EPB41L2', 'ELL2', 'ABCA1'],
'3': ['MX1',
'HERC5',
'XAF1',
'IFI44L',
'IFIT3',
'MX2',
'EPSTI1',
'PLSCR1'],
'54': ['CYTOR', 'SLAMF7', 'BHLHE41', 'IFNG'],
'183': ['FAAP24', 'ZNF252P-AS1', 'HCN3'],
'130': ['PACSIN2', 'DDI2', 'RYBP'],
'75': ['GBP5', 'GBP1', 'GBP4', 'GBP2'],
'259': ['DOCK9-DT', 'NUMBL'],
'150': ['USP11', 'BTN2A2', 'HIP1R'],
'73': ['DCTN6-DT', 'CSKMT', 'ATF7IP2', 'ILF3-DT'],
'203': ['NRGN', 'RGS18', 'PRKAR2B'],
'71': ['DNAAF4',
'LINC00106_ENSG00000236871',
'CSGALNACT2-DT',
'JARID2-AS1'],
'236': ['DDN-AS1', 'ICOSLG'],
'204': ['DTX1', 'GPR160', 'HRK'],
'237': ['ERCC1', 'HDAC9'],
'45': ['CYB5D2', 'LINC01004', 'ZFP28-DT', 'ALG9'],
'26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
'213': ['DYNLT2', 'CAGE1'],
'289': ['PPP1R14A', 'FAM30A'],
'52': ['NOTCH2', 'SERTAD2', 'KLF7', 'MAML3'],
'61': ['TSC22D2', 'ICA1L', 'TTN', 'KCNQ1OT1'],
'141': ['FAM177B', 'CCL22', 'IL6'],
'185': ['CYP2R1', 'NHLH1', 'LINC02848'],
'198': ['IRAK3', 'BAZ2B', 'PECAM1'],
'238': ['EFHC1', 'METTL7A'],
'239': ['ZNF165', 'CFAP44'],
'50': ['IGHV3-15', 'IGLC6', 'IGHV3-49', 'IGLC7'],
'260': ['NME4', 'MYC'],
'78': ['SOX4', 'ARMH1', 'MIR181A1HG', 'MTHFD2L'],
'133': ['VCAN', 'CSF3R', 'CD14'],
'43': ['DOP1B', 'MIR3936HG', 'DLEU2', 'NRDE2'],
'249': ['GDAP1', 'CENPP'],
'291': ['PTGDS', 'IGFBP4'],
'72': ['CA5B', 'RBM43', 'INTS6-AS1', 'LRRC8C-DT'],
'240': ['LIX1-AS1', 'CLEC20A'],
'169': ['NUP58', 'SPPL2B', 'CLASRP'],
'160': ['CRIP2', 'BDH2', 'PCSK1N'],
'297': ['COL19A1'],
'134': ['PIDD1', 'NUP210', 'SPATA33'],
'292': ['CORO2B', 'AMDHD2'],
'153': ['CLEC4A', 'GM2A', 'MIR223HG'],
'208': ['PRCD', 'ZNF528', 'MYBPC2'],
'243': ['KCNH8', 'LINC02397'],
'88': ['CNN3', 'CCND1', 'APP', 'PLK2'],
'162': ['FRY', 'LINC02029', 'GABBR1'],
'275': ['CNTNAP2', 'MAP3K9'],
'136': ['GABARAPL1', 'IL1RN', 'PLAUR'],
'244': ['CLIC4', 'FCHSD2'],
'159': ['HEXIM1', 'GADD45A', 'IRS2'],
'38': ['MARCHF1', 'HLA-DRB5', 'CPVL', 'SHTN1'],
'179': ['ANXA1', 'CD9', 'LGALS3'],
'193': ['TGFBI', 'RNASE6', 'MPEG1'],
'245': ['OPN3', 'MAP3K13'],
'84': ['IGHV3-53', 'IGKV2D-29', 'IGHV3-30', 'IGHV5-51'],
'242': ['IGHE', 'WNT10A'],
'191': ['H2BC21', 'H2BC8', 'H2BC5'],
'87': ['DSP', 'FXYD7', 'SH2D3A', 'LSR'],
'279': ['ADPGK-AS1', 'LINC00467'],
'138': ['RHOB', 'CITED2', 'H2AC19'],
'164': ['KCNIP2', 'RORA-AS1', 'PGBD4'],
'248': ['CEMIP2', 'KLRD1'],
'192': ['THRB', 'KLK1', 'MOB3B'],
'207': ['ADARB1', 'EPN2', 'MICAL3'],
'247': ['CFL2', 'PAIP2B'],
'264': ['IFNG-AS1', 'SLAMF1'],
'24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
'28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
'195': ['CCDC168', 'KLF3-AS1', 'NARF-AS2'],
'228': ['MSMP', 'PLD5'],
'89': ['CXCL3', 'CCL20', 'MIR3945HG', 'MMP7'],
'143': ['VAV3-AS1', 'DPEP2', 'IL6-AS1'],
'285': ['NETO1-DT', 'LRRIQ3'],
'31': ['ARHGAP9', 'ARHGAP4', 'FAM200B', 'CMTM7'],
'135': ['GALNT4', 'ARRDC4', 'LPP-AS2'],
'145': ['FCRL5', 'LINC01480', 'GCM1'],
'200': ['EGR2', 'C15orf48', 'EGR3'],
'146': ['FCRL1', 'KLHL14', 'MYO7B'],
'93': ['GPR35', 'LILRA1', 'LILRA6', 'HYAL3'],
'11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
'250': ['AKAP6', 'LINC01685'],
'271': ['LUCAT1', 'IER5L'],
'170': ['P2RX5', 'CDCA7L', 'GPR18'],
'282': ['CLEC17A', 'HLA-DOB'],
'76': ['DBNDD2', 'MANSC1', 'OXT', 'PPP1R21-DT'],
'151': ['COLCA1', 'LIX1L-AS1', 'TAGAP-AS1'],
'152': ['ARHGAP12', 'SLC16A11', 'HAGHL'],
'154': ['LEKR1', 'OXTR', 'P2RX7'],
'284': ['CD300E', 'DUSP6'],
'299': ['E2F5'],
'265': ['CAMK2D', 'CIITA'],
'286': ['RAB37', 'C11orf21'],
'82': ['DNASE1L3', 'KHDRBS2', 'MACROD2', 'MIR4432HG'],
'288': ['CCDC88A', 'OGFRL1'],
'47': ['MILR1', 'CBFA2T3', 'PAWR', 'HS2ST1'],
'278': ['NKILA', 'KLHL22'],
'283': ['TCL6', 'YPEL2'],
'294': ['IFRD1', 'NEIL1'],
'210': ['TMEM176A', 'G0S2', 'MAFB'],
'253': ['ARHGEF35-AS1', 'FAM24B'],
'181': ['KLRC1', 'GZMK', 'GPR65'],
'155': ['IRF4', 'TP53INP1', 'PHF6'],
'161': ['CSRP2', 'BEX1', 'PEG10'],
'42': ['ZCCHC9', 'FBXO22', 'KIAA0586', 'PNKD'],
'157': ['PPP1R14B-AS1', 'TP53RK-DT', 'PTK6'],
'202': ['OR2A1-AS1', 'PPP1R12A-AS1', 'MYLK-AS1'],
'209': ['COL9A3', 'N4BP3', 'LCN8'],
'96': ['CHL1', 'SYT17', 'MAP2'],
'257': ['ATP2B1-AS1', 'TMEM107'],
'196': ['ARMCX4', 'MAP3K6', 'NPTX1'],
'258': ['BCL6', 'ATP11A'],
'171': ['CTSW', 'GZMM', 'FEZ1'],
'20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44'],
'290': ['CBR3', 'LOH12CR2'],
'165': ['IRAK2', 'LINC00265', 'NRARP'],
'172': ['H1-10', 'H1-2', 'H1-3'],
'301': ['KLHDC7B-DT'],
'274': ['CD180', 'IRAG1-AS1'],
'167': ['JAG2', 'P2RY14', 'PPM1J'],
'262': ['CYP2U1-AS1', 'KCTD16'],
'263': ['CLDN11', 'NTNG1'],
'293': ['LIX1', 'GNG3'],
'298': ['IGLC1'],
'49': ['MED12', 'ERLIN2', 'IFT122', 'P2RX4'],
'176': ['HES1', 'ID1', 'ID2'],
'173': ['HES4', 'PPP1R17', 'ODAD3'],
'267': ['HHIP-AS1', 'LINC01013'],
'142': ['MZB1', 'DERL3', 'AIM2'],
'270': ['EAF2', 'PLPP5'],
'272': ['INSIG2', 'KHDRBS3'],
'273': ['IGHV3-7', 'STAG3'],
'189': ['DPYSL3', 'CERCAM', 'NRP2'],
'300': ['PRMT5-AS1'],
'276': ['CHST15', 'LINC02728'],
'277': ['OXCT1-AS1', 'MYZAP'],
'255': ['PCDH9', 'NIBAN3'],
'256': ['LINC01703', 'NDUFV2-AS1'],
'281': ['LINC02132', 'NUDT8'],
'295': ['LINC02422', 'SLC38A11'],
'187': ['AMN1', 'NT5E', 'OLMALINC'],
'261': ['GZMA', 'IL32']})
Filter on clusters with 5 or more genes
# Obtain the set of gene programmes from clusters with #genes >= 5
mgs = dict()
for mg, genes in metagenes.items():
if len(genes) > 4:
mgs[mg] = genes
# Here are the gene programmes identified
mgs
{'21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
'12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
'5': ['ZBTB16', 'TLE1', 'TBC1D5', 'CRIM1', 'PRKCB', 'VAV3', 'MCTP2'],
'22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
'1': ['TNFRSF13B',
'FCRL2',
'PNOC',
'S100B',
'LINC01781',
'CD1C',
'LINC01857',
'OSBPL10'],
'10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
'25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
'2': ['SP2',
'H2AC21',
'H2AC11',
'H2BC13',
'H2BC18',
'H2BC4',
'H2BC7',
'H3C3'],
'17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
'13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
'16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
'0': ['TXNDC5',
'IGHA2',
'IGLC3',
'IGHA1',
'IGHG4',
'JCHAIN',
'IGHG1',
'IGHG3',
'IGLC2'],
'9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
'19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
'18': ['RAB33B-AS1', 'MZF1-AS1', 'SDR42E2', 'C16orf74', 'MIR4458HG'],
'7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
'29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
'6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
'23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
'4': ['IGHV4-39',
'IGKV1-9',
'IGHV4-34',
'IGHV1-3',
'IGLV2-14',
'IGHV1-18',
'IGHV3-23'],
'15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
'14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
'27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
'8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
'3': ['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1'],
'26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
'24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
'28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
'11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
'20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44']}
Step 4: Visualise Network Connectivity within Desired Gene Programme¶
# Retrieve gene programme 3
GP_genes = mgs['3']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
df = embed.compute_similarities(i, GP_genes)
df['Gene1'] = i
df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)
['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1']
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 236.27it/s]
Gene Similarity Gene1 60 EPSTI1 0.225607 PLSCR1 45 EPSTI1 0.188223 MX2 6 EPSTI1 0.256395 MX1 35 EPSTI1 0.274948 IFIT3 29 EPSTI1 0.241366 IFI44L 19 EPSTI1 0.254587 XAF1 14 EPSTI1 0.119632 HERC5 31 HERC5 0.168532 IFI44L 55 HERC5 0.119632 EPSTI1 47 HERC5 0.163126 MX2 38 HERC5 0.202499 IFIT3 23 HERC5 0.168315 XAF1 63 HERC5 0.091875 PLSCR1 7 HERC5 0.174051 MX1 1 IFI44L 0.372201 MX1 17 IFI44L 0.337581 XAF1 59 IFI44L 0.232156 PLSCR1 11 IFI44L 0.168532 HERC5 34 IFI44L 0.296725 IFIT3 52 IFI44L 0.241366 EPSTI1 41 IFI44L 0.306452 MX2 44 IFIT3 0.198912 MX2 49 IFIT3 0.274948 EPSTI1 9 IFIT3 0.202499 HERC5 28 IFIT3 0.296725 IFI44L 3 IFIT3 0.313495 MX1 58 IFIT3 0.240587 PLSCR1 20 IFIT3 0.240173 XAF1 10 MX1 0.174051 HERC5 50 MX1 0.256395 EPSTI1 33 MX1 0.313495 IFIT3 25 MX1 0.372201 IFI44L 57 MX1 0.266255 PLSCR1 18 MX1 0.323817 XAF1 42 MX1 0.295364 MX2 4 MX2 0.295364 MX1 54 MX2 0.188223 EPSTI1 39 MX2 0.198912 IFIT3 13 MX2 0.163126 HERC5 21 MX2 0.228178 XAF1 27 MX2 0.306452 IFI44L 62 MX2 0.182111 PLSCR1 22 PLSCR1 0.212136 XAF1 30 PLSCR1 0.232156 IFI44L 53 PLSCR1 0.225607 EPSTI1 36 PLSCR1 0.240587 IFIT3 46 PLSCR1 0.182111 MX2 5 PLSCR1 0.266255 MX1 15 PLSCR1 0.091875 HERC5 43 XAF1 0.228178 MX2 37 XAF1 0.240173 IFIT3 26 XAF1 0.337581 IFI44L 12 XAF1 0.168315 HERC5 2 XAF1 0.323817 MX1 61 XAF1 0.212136 PLSCR1 51 XAF1 0.254587 EPSTI1
# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)
# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')
elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]
pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)
width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
if v > thresh:
width_large[list(widths.keys())[i]] = v*10
else:
width_small[list(widths.keys())[i]] = max(v, 0)*10
nx.draw_networkx_edges(G, pos,
edgelist = width_small.keys(),
width=list(width_small.values()),
edge_color='lightblue',
alpha=0.8)
nx.draw_networkx_edges(G, pos,
edgelist = width_large.keys(),
width = list(width_large.values()),
alpha = 0.5,
edge_color = "blue",
)
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)
ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()
Step 6: Reactome Pathway Analysis¶
Reactome pathway is another form of pathway analysis that can be carried out on the data.
# Meta info about the number of terms (tests) in the databases
df_database = pd.DataFrame(
data = [['GO_Biological_Process_2021', 6036],
['GO_Molecular_Function_2021', 1274],
['Reactome_2022', 1818]],
columns = ['dataset', 'term'])
# Select desired database for query; here use Reactome as an example
databases = ['Reactome_2022']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m
# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
gene_sets=databases,
organism='Human',
outdir='test/enr_Reactome',
cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)
# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df
| Gene_set | Term | Overlap | P-value | Adjusted P-value | Old P-value | Old Adjusted P-value | Odds Ratio | Combined Score | Genes | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Reactome_2022 | Interferon Signaling R-HSA-913531 | 5/200 | 5.197809e-09 | 8.836276e-08 | 0 | 0 | 169.205128 | 3227.592653 | HERC5;MX2;MX1;XAF1;IFIT3 |
| 1 | Reactome_2022 | Interferon Alpha/Beta Signaling R-HSA-909733 | 4/72 | 1.068672e-08 | 9.083711e-08 | 0 | 0 | 293.000000 | 5377.799365 | MX2;MX1;XAF1;IFIT3 |
| 2 | Reactome_2022 | Cytokine Signaling In Immune System R-HSA-1280215 | 5/702 | 2.693582e-06 | 1.189224e-05 | 0 | 0 | 46.138211 | 591.705896 | HERC5;MX2;MX1;XAF1;IFIT3 |
| 3 | Reactome_2022 | ISG15 Antiviral Mechanism R-HSA-1169408 | 3/75 | 2.798175e-06 | 1.189224e-05 | 0 | 0 | 166.000000 | 2122.566166 | HERC5;MX2;MX1 |
| 4 | Reactome_2022 | Antiviral Mechanism By IFN-stimulated Genes R-... | 3/83 | 3.801759e-06 | 1.292598e-05 | 0 | 0 | 149.340000 | 1863.770163 | HERC5;MX2;MX1 |
External Reading:
- SCGPT paper: https://www.nature.com/articles/s41592-024-02201-0
- Feature selection: https://bioconductor.org/books/3.15/OSCA.basic/feature-selection.html
- P-values: https://www.nature.com/articles/nmeth.2900
- Community detection algorithms: https://www.nature.com/articles/s41598-019-41695-z
- Reactome: https://reactome.org/
Exercises:
- Describe the process of loading the pre-trained scGPT model. What are the key parameters used?
- Explain the significance of the n_hvg and n_bins parameters used in the preprocessing pipeline.
- Why is it important to filter the model gene embeddings to match the HVGs found in the input dataset?
- Describe the process of constructing the gene embedding network and performing Louvain clustering. What is the role of the resolution parameter?
- Explain how the cosine similarity among genes in a specific gene programme is computed.
- What is the purpose of performing pathway enrichment analysis in this context?
- Explain the significance of the p-value correction applied during the pathway enrichment analysis.
- Change the resolution parameter for the Louvain clustering step and analyse how the number of identified gene programmes changes. Plot the distribution of gene programmes for different resolution values.
- Choose a different gene programme and create a similar gene similarity network to visualise.
- Perform pathway enrichment analysis using a different database, such as GO_Biological_Process_2021 as well as report the top enriched pathways/processes and their significance levels.
Answers:
Answer 8:
# Define a range of resolution values to test
Answer 9:
# Retrieve gene programme
Answer 10:
# Set pandas options to display the entire term
pd.set_option('display.max_colwidth', None)
# Select desired database for query