Gene Network Analysis¶
This workbook will cover gene network analysis using large language models as an alternative to correlation analysis. Specifically, we will look at using scgpt, which is a foundation model for single-cell multi-omic data and has a wide range of capabilities.
We will use a model pretrained 10.3 million blood and bone marrow cells, since our dataset consists of B cells obtained from peripheral blood samples.
import copy
import json
import os
from pathlib import Path
import sys
import warnings
import gzip
import shutil
import torch
from anndata import AnnData
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pandas as pd
import tqdm
import gseapy as gp
from torchtext.vocab import Vocab
from torchtext._torchtext import (
Vocab as VocabPybind,
)
sys.path.insert(0, "../")
import scgpt as scg
from scgpt.tasks import GeneEmbedding
from scgpt.tokenizer.gene_tokenizer import GeneVocab
from scgpt.model import TransformerModel
from scgpt.preprocess import Preprocessor
from scgpt.utils import set_seed
os.environ["KMP_WARNINGS"] = "off"
warnings.filterwarnings('ignore')
We need to set the parameters of the model for when we use it later in this workbook.
set_seed(42)
pad_token = "<pad>"
special_tokens = [pad_token, "<cls>", "<eoc>"]
n_hvg = 1000
n_bins = 51
mask_value = -1
pad_value = -2
n_input_bins = n_bins
Lets go through the parameters so they are better understood:
- set_seed is used to set the random seed for reproducibility.
- pad_token is a special token used for padding sequences, to ensure that all sequences are of the same length. Therefore it will be inserted where necessary to make sequences of unequal lengths equal.
- special_tokens is a list defining a set of special tokens the model may use, such as pad. Cls is commonly a classification token that represents the beginning of a sequence and is used to aggregate information for classification tasks. Eoc stands for 'end of converstion/content' and is used to denote the end of a sequence.
- n_hvg is a variable representing the number of highly variable genes to be considered.
- n_bins defines the number of bins to group continuous data into discrete intervals.
- mask_value defines a value for masking. Masking is a technique to ignore certain parts of the data, often used in sequence models to indicate positions in a sequence that should be ignored during training or evaluation.
- pad_value defines the value for padding sequences. Similar to mask_value, it indicates positions in the data that are not part of the original sequence and should be ignored.
- n_input_bins assignment set the number of input bins ensuring consistency throughout the code.
Step 1: Load the Pre-Trained Model and Dataset¶
# Specify model path; here we load the pre-trained scGPT blood model
#You will have to directly download the model from here: https://github.com/bowang-lab/scGPT
model_dir = Path("/data/scgpt_blood_model")
model_config_file = model_dir / "args.json"
model_file = model_dir / "best_model.pt"
vocab_file = model_dir / "vocab.json"
vocab = GeneVocab.from_file(vocab_file)
for s in special_tokens:
if s not in vocab:
vocab.append_token(s)
# Retrieve model parameters from config files
with open(model_config_file, "r") as f:
model_configs = json.load(f)
print(
f"Resume model from {model_file}, the model args will override the "
f"config {model_config_file}."
)
embsize = model_configs["embsize"]
nhead = model_configs["nheads"]
d_hid = model_configs["d_hid"]
nlayers = model_configs["nlayers"]
n_layers_cls = model_configs["n_layers_cls"]
gene2idx = vocab.get_stoi()
#Set up Pytorch model to run on specified device (CPU or GPU), intialise the model with specific parameters,
#load pre-trained weight, and then transfer the model to the chosen device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ntokens = len(vocab) # size of vocabulary
model = TransformerModel(
ntokens,
embsize,
nhead,
d_hid,
nlayers,
vocab=vocab,
pad_value=pad_value,
n_input_bins=n_input_bins,
)
try:
model.load_state_dict(torch.load(model_file))
print(f"Loading all model params from {model_file}")
except:
# only load params that are in the model and match the size
model_dict = model.state_dict()
pretrained_dict = torch.load(model_file)
pretrained_dict = {
k: v
for k, v in pretrained_dict.items()
if k in model_dict and v.shape == model_dict[k].shape
}
for k, v in pretrained_dict.items():
print(f"Loading params {k} with shape {v.shape}")
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)
model.to(device)
Loading params encoder.embedding.weight with shape torch.Size([36574, 512]) Loading params encoder.enc_norm.weight with shape torch.Size([512]) Loading params encoder.enc_norm.bias with shape torch.Size([512]) Loading params value_encoder.linear1.weight with shape torch.Size([512, 1]) Loading params value_encoder.linear1.bias with shape torch.Size([512]) Loading params value_encoder.linear2.weight with shape torch.Size([512, 512]) Loading params value_encoder.linear2.bias with shape torch.Size([512]) Loading params value_encoder.norm.weight with shape torch.Size([512]) Loading params value_encoder.norm.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.0.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.0.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.0.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.0.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.1.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.1.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.1.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.1.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.2.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.2.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.2.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.2.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.3.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.3.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.3.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.3.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.4.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.4.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.4.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.4.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.5.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.5.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.5.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.5.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.6.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.6.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.6.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.6.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.7.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.7.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.7.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.7.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.8.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.8.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.8.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.8.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.8.norm2.weight with shape torch.Size([512])
Loading params transformer_encoder.layers.8.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.9.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.9.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.9.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.9.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.10.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.10.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.10.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.10.norm2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.self_attn.out_proj.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.11.self_attn.out_proj.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.linear1.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.11.linear1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.linear2.weight with shape torch.Size([512, 512]) Loading params transformer_encoder.layers.11.linear2.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm1.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm1.bias with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm2.weight with shape torch.Size([512]) Loading params transformer_encoder.layers.11.norm2.bias with shape torch.Size([512]) Loading params decoder.fc.0.weight with shape torch.Size([512, 512]) Loading params decoder.fc.0.bias with shape torch.Size([512]) Loading params decoder.fc.2.weight with shape torch.Size([512, 512]) Loading params decoder.fc.2.bias with shape torch.Size([512]) Loading params decoder.fc.4.weight with shape torch.Size([1, 512]) Loading params decoder.fc.4.bias with shape torch.Size([1])
TransformerModel(
(encoder): GeneEncoder(
(embedding): Embedding(36574, 512, padding_idx=36571)
(enc_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(value_encoder): ContinuousValueEncoder(
(dropout): Dropout(p=0.5, inplace=False)
(linear1): Linear(in_features=1, out_features=512, bias=True)
(activation): ReLU()
(linear2): Linear(in_features=512, out_features=512, bias=True)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(transformer_encoder): TransformerEncoder(
(layers): ModuleList(
(0-11): 12 x TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
)
(linear1): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.5, inplace=False)
(linear2): Linear(in_features=512, out_features=512, bias=True)
(norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.5, inplace=False)
(dropout2): Dropout(p=0.5, inplace=False)
)
)
)
(decoder): ExprDecoder(
(fc): Sequential(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): LeakyReLU(negative_slope=0.01)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Linear(in_features=512, out_features=1, bias=True)
)
)
(cls_decoder): ClsDecoder(
(_decoder): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): ReLU()
(2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(3): Linear(in_features=512, out_features=512, bias=True)
(4): ReLU()
(5): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(out_layer): Linear(in_features=512, out_features=1, bias=True)
)
(sim): Similarity(
(cos): CosineSimilarity()
)
(creterion_cce): CrossEntropyLoss()
)
Load dataset of interest
# Paths to the compressed and uncompressed files
compressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_compressed.h5ad.gz'
uncompressed_file = '/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered_uncompressed.h5ad'
# Decompress the file
with gzip.open(compressed_file, 'rb') as f_in:
with open(uncompressed_file, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
# Read the decompressed file using scanpy
adata = sc.read(uncompressed_file)
# Optionally, remove the uncompressed file if no longer needed
os.remove(uncompressed_file)
adata
AnnData object with n_obs Ć n_vars = 3540 Ć 25198
obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_umap'
Check that the gene variable names are in the correct format i.e. gene symbol names and not Ensembl IDs.
adata.var
| feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length | highly_variable | means | dispersions | dispersions_norm | |
|---|---|---|---|---|---|---|---|---|---|
| feature_name | |||||||||
| MIR1302-2HG | False | MIR1302-2HG | NCBITaxon:9606 | gene | 1021 | False | 1.000000e-12 | NaN | NaN |
| FAM138A | False | FAM138A | NCBITaxon:9606 | gene | 1219 | False | 1.000000e-12 | NaN | NaN |
| OR4F5 | False | OR4F5 | NCBITaxon:9606 | gene | 2618 | False | 1.000000e-12 | NaN | NaN |
| OR4F29 | False | OR4F29 | NCBITaxon:9606 | gene | 939 | False | 1.000000e-12 | NaN | NaN |
| OR4F16 | False | OR4F16 | NCBITaxon:9606 | gene | 939 | False | 1.000000e-12 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTY17C | False | TTTY17C | NCBITaxon:9606 | gene | 776 | False | 1.000000e-12 | NaN | NaN |
| SEPTIN14P23 | False | SEPTIN14P23 | NCBITaxon:9606 | gene | 1192 | False | 2.503351e-04 | -0.120703 | -1.405214 |
| CDY1 | False | CDY1 | NCBITaxon:9606 | gene | 2670 | False | 1.000000e-12 | NaN | NaN |
| TTTY3 | False | TTTY3 | NCBITaxon:9606 | gene | 344 | False | 1.000000e-12 | NaN | NaN |
| MAFIP | False | MAFIP | NCBITaxon:9606 | gene | 1599 | False | 6.082392e-03 | 0.002191 | 0.189369 |
25198 rows Ć 9 columns
# Preprocess the data following the scGPT data pre-processing pipeline
preprocessor = Preprocessor(
use_key="X", # the key in adata.layers to use as raw data
filter_gene_by_counts=3, # step 1
filter_cell_by_counts=False, # step 2
normalize_total=1e4, # step 3 whether to normalize the raw data and to what sum
result_normed_key="X_normed", # the key in adata.layers to store the normalized data
log1p=data_is_raw, # step 4 whether to log1p the normalised data
result_log1p_key="X_log1p",
subset_hvg=n_hvg, # 5. whether to subset the raw data to highly variable genes
hvg_flavor="seurat_v3" if data_is_raw else "cell_ranger",
binning=n_bins, # 6. whether to bin the raw data and to what number of bins
result_binned_key="X_binned", # the key in adata.layers to store the binned data
)
preprocessor(adata)
scGPT - INFO - Filtering genes by counts ... scGPT - INFO - Normalizing total counts ... scGPT - INFO - Subsetting highly variable genes ... scGPT - WARNING - No batch_key is provided, will use all cells for HVG selection. scGPT - INFO - Binning data ...
Step 2: Retrieve Scgpt's Gene Embeddings¶
Overall, the pre-trained foundation model contains 30+K genes. Here for simplicity, we focus on a subset of HVGs specific to the data at hand.
# Retrieve the data-independent gene embeddings from scGPT
gene_ids = np.array([id for id in gene2idx.values()])
gene_embeddings = model.encoder(torch.tensor(gene_ids, dtype=torch.long).to(device))
gene_embeddings = gene_embeddings.detach().cpu().numpy()
# Filter on the intersection between the Blood HVGs found and scGPT's 30+K foundation model vocab
gene_embeddings = {gene: gene_embeddings[i] for i, gene in enumerate(gene2idx.keys()) if gene in adata.var.index.tolist()}
print('Retrieved gene embeddings for {} genes.'.format(len(gene_embeddings)))
Retrieved gene embeddings for 962 genes.
# Construct gene embedding network
embed = GeneEmbedding(gene_embeddings)
100%|āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā| 962/962 [00:00<00:00, 1095850.20it/s]
Step 3: Extract gene programmes from gene embedding network¶
Perform Louvain clustering on the gene embedding network.
# Perform Louvain clustering with desired resolution; here we specify resolution=40
gdata = embed.get_adata(resolution=40)
# Retrieve the gene clusters
metagenes = embed.get_metagenes(gdata)
gdata
AnnData object with n_obs Ć n_vars = 962 Ć 512
obs: 'leiden'
uns: 'neighbors', 'leiden', 'umap'
obsm: 'X_umap'
obsp: 'distances', 'connectivities'
metagenes
defaultdict(list,
{'32': ['ZNF880', 'SNHG21', 'APOM', 'SC5D'],
'214': ['ZNF837', 'SMYD5'],
'215': ['ZNF713', 'MIR762HG'],
'254': ['ZNF618', 'ZNF235'],
'144': ['ZNF608', 'APBB2', 'ZNF43'],
'57': ['ZNF594-DT', 'TSPYL5', 'ZNF530', 'MTFR2'],
'120': ['ZNF528-AS1', 'WDR27', 'CORT'],
'37': ['ZNF426', 'ZFP14', 'TTC28-AS1', 'ZNF76'],
'182': ['ZNF415', 'KRCC1', 'PTPRN2'],
'109': ['ZNF391', 'ZCWPW1', 'IFT88'],
'74': ['ZNF318', 'SIPA1L3', 'KBTBD8', 'GUCD1'],
'21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
'91': ['ZNF140', 'SMIM8', 'ZBED5-AS1', 'FCF1'],
'99': ['ZMYND8', 'ITPR1', 'CDKAL1'],
'268': ['ZMAT1', 'EFCAB13'],
'12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
'217': ['ZEB2-AS1', 'PTCH2'],
'95': ['ZDHHC19', 'SERHL2', 'NEK8'],
'218': ['ZC3H12C', 'PACS2'],
'211': ['ZBTB32', 'IL2RA'],
'5': ['ZBTB16',
'TLE1',
'TBC1D5',
'CRIM1',
'PRKCB',
'VAV3',
'MCTP2'],
'175': ['XIST', 'ITGA4', 'LPP'],
'114': ['XCL2', 'TFPI2', 'CCL4L2'],
'126': ['XCL1', 'GNG4', 'TNFRSF18'],
'166': ['WDR86', 'CNKSR2', 'GRAP'],
'53': ['WBP4', 'SREBF2-AS1', 'COQ7', 'KMT2E-AS1'],
'98': ['WASF1', 'SGO1', 'CCDC18'],
'100': ['VPREB3', 'CD24', 'FLI1'],
'101': ['USP7-AS1', 'CATSPERZ', 'CERNA1'],
'205': ['USF1', 'RASSF2', 'RIN3'],
'102': ['UMAD1', 'SCN3A', 'BEND5'],
'103': ['UACA', 'PDE7B', 'EPS8'],
'158': ['TYROBP', 'FCER1G', 'HCST'],
'137': ['TUSC1', 'MYL5', 'LINC01003'],
'51': ['TUBB6', 'EIF1AY', 'PDLIM7', 'HMGN5'],
'104': ['TTC9', 'CXCR5', 'DBNDD1'],
'105': ['TTC23L-AS1', 'LNP1', 'MSS51'],
'106': ['TSC22D1', 'MEG3', 'H2AC6'],
'107': ['TRIM66', 'EPM2A', 'MCF2L'],
'108': ['TRIM52-AS1', 'NUBPL', 'THAP7-AS1'],
'22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
'131': ['TRBC1', 'TRDC', 'C12orf75'],
'81': ['TRAF4', 'METTL8', 'BTLA', 'ADAM19'],
'1': ['TNFRSF13B',
'FCRL2',
'PNOC',
'S100B',
'LINC01781',
'CD1C',
'LINC01857',
'OSBPL10'],
'110': ['TNFAIP6', 'ADM', 'PTGS2'],
'111': ['TMEM62', 'CLCN4', 'LETM2'],
'132': ['TMEM44-AS1', 'LINC02551', 'TNRC6B-DT'],
'149': ['TMEM38A', 'EIF2AK3-DT', 'GCNT2'],
'112': ['TMEM131L', 'ACSM3', 'MARCHF3'],
'10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
'68': ['TUNAR', 'TVP23A', 'ELL3', 'GLI1'],
'113': ['THOC1-DT', 'LINGO3', 'GORAB-AS1'],
'65': ['THBS1', 'SERPINB2', 'CCL7', 'EREG'],
'40': ['TGM1', 'DTX4', 'ECE1-AS1', 'SERPINB9P1'],
'56': ['TTC30A', 'ATP10D', 'MUL1', 'SPRYD7'],
'59': ['TEX9', 'STK33', 'TMEM67', 'PPIL6'],
'216': ['TESC', 'SCNM1'],
'201': ['TEDC1', 'ARHGEF18', 'PCGF1'],
'251': ['TCEAL4', 'INO80C'],
'77': ['TBXA2R', 'SDK2', 'CR2', 'MYEF2'],
'97': ['ZNF667', 'WDR5B-DT', 'ZNF10'],
'174': ['TAT-AS1', 'CUBN', 'LINC02175'],
'177': ['TAPT1', 'BCL7A', 'IL4R'],
'94': ['TAL1', 'SMIM5', 'PDZK1IP1'],
'25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
'178': ['SUV39H2', 'DPPA4', 'IMPACT'],
'119': ['SULF2', 'S100Z', 'DOCK8-AS1'],
'302': ['STAG1-DT'],
'79': ['UQCRHL', 'GIHCG', 'NREP', 'AZU1'],
'219': ['SPRY1', 'ZNF667-AS1'],
'66': ['SPINK2', 'ZNF844', 'ERICH6-AS1', 'ABCB1'],
'235': ['SPATA6', 'MAPK10'],
'90': ['SPARC', 'CLU', 'MYH11', 'MYL9'],
'220': ['SPAG16', 'PTPRK'],
'2': ['SP2',
'H2AC21',
'H2AC11',
'H2BC13',
'H2BC18',
'H2BC4',
'H2BC7',
'H3C3'],
'184': ['SOX2-OT', 'STAR', 'MYL3'],
'194': ['SOS1-IT1', 'CAPN14', 'MTCL1'],
'33': ['SOCS1', 'CSRNP1', 'GADD45G', 'RGS16'],
'115': ['SOBP', 'TPK1', 'KRBOX1'],
'85': ['SNX29', 'CYSLTR1', 'UVRAG', 'LHPP'],
'55': ['SNX22', 'TCTN1', 'FAM111B', 'CNFN'],
'116': ['SNAPC1', 'HEIH', 'MXI1'],
'117': ['SMAGP', 'SCPEP1', 'ANXA4'],
'148': ['SLC39A10', 'CEP290', 'ANKRD26'],
'222': ['ST20-AS1', 'ERMN'],
'63': ['SLC2A6', 'SIGLEC10', 'SIDT2', 'SCIMP'],
'62': ['SKAP1', 'NCR3', 'KLRB1', 'HOPX'],
'186': ['SIMC1', 'LYRM1', 'PIGC'],
'118': ['SH3BP2', 'FAM120A', 'VPS41'],
'163': ['SFR1', 'RAB30-DT', 'CHML'],
'206': ['SETBP1', 'OR2T11', 'LAMC1'],
'64': ['SERPINI1', 'SLC2A1', 'ANKRD37', 'PWWP3A'],
'41': ['TPST1', 'RAPGEF5', 'FLT1', 'ST6GALNAC3'],
'188': ['SEMA6B', 'GASAL1', 'INSIG1-DT'],
'17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
'139': ['SAV1', 'USP6NL', 'POU2F1'],
'223': ['TRMT1', 'LINC01678'],
'123': ['SATB1', 'PDE7A', 'BCL11B'],
'13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
'86': ['RUNX1', 'GNA15', 'ANKRD28', 'LRRC75A'],
'16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
'224': ['RRAD', 'DNAAF1'],
'0': ['TXNDC5',
'IGHA2',
'IGLC3',
'IGHA1',
'IGHG4',
'JCHAIN',
'IGHG1',
'IGHG3',
'IGLC2'],
'241': ['RP9', 'AGPAT5'],
'83': ['SNX10', 'SLC11A1', 'MCEMP1', 'PTPN6'],
'121': ['RPGR', 'SEMA4B', 'RRM2B'],
'225': ['USP9Y', 'CCDC141'],
'34': ['UBE4A', 'OGT', 'TRIB2', 'CCND2'],
'156': ['TBCK', 'EPG5', 'NOTCH1'],
'36': ['SBNO1-AS1', 'LAIR1', 'CRELD2', 'CD38'],
'252': ['WAC-AS1', 'TLR1'],
'226': ['WWC3', 'ZDHHC21'],
'35': ['RPAP2', 'DUSP12', 'SLC25A45', 'OSGEPL1'],
'44': ['ZNF585A', 'EID2B', 'HS3ST3B1', 'AKAP7'],
'122': ['ZRSR2', 'DIPK1B', 'SCAMP4'],
'67': ['STEAP1B', 'DTWD2', 'BCAS4', 'CCSER1'],
'197': ['ZNF83', 'SESN1', 'NUAK2'],
'269': ['SCAI', 'ARPP21'],
'296': ['TNFSF13B', 'CLEC7A'],
'39': ['RNF170', 'BMP3', 'ALKAL2', 'CBR3-AS1'],
'227': ['WDR19', 'IQCN'],
'287': ['RNF138', 'NAF1'],
'69': ['RNF103', 'R3HDM2', 'ZYG11B', 'ZBTB18'],
'9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
'80': ['RNASEH2B-AS1', 'SPA17', 'EFHB', 'IQCD'],
'48': ['RIOK1', 'URB1-AS1', 'LINC00526', 'NUFIP1'],
'140': ['RGS1', 'DUSP4', 'CCL2'],
'30': ['REXO1', 'CORO7', 'AKAP8', 'BTBD2'],
'124': ['RDX', 'TSPAN13', 'IL3RA_ENSG00000185291'],
'58': ['RBP7', 'NFAM1', 'HCAR2', 'ASGR1'],
'46': ['RASSF8', 'RBM38-AS1', 'LINC01811', 'PCP2'],
'280': ['RASSF1-AS1', 'FMO5'],
'125': ['RAP2C-AS1', 'MAP3K2-DT', 'LINC00494'],
'19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
'212': ['TNFAIP2', 'UBXN11'],
'18': ['RAB33B-AS1',
'MZF1-AS1',
'SDR42E2',
'C16orf74',
'MIR4458HG'],
'7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
'29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
'60': ['GNMT', 'ERCC2', 'TMEM167B-DT', 'PSMA8'],
'6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
'23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
'180': ['GAPT', 'IGLC5', 'CD200'],
'4': ['IGHV4-39',
'IGKV1-9',
'IGHV4-34',
'IGHV1-3',
'IGLV2-14',
'IGHV1-18',
'IGHV3-23'],
'266': ['CLMN', 'TFEB'],
'15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
'14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
'127': ['GALNT6', 'GOLPH3-DT', 'GPR137C'],
'229': ['CXCL10', 'CXCL2'],
'27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
'92': ['BBOF1', 'TWSG1-DT', 'WDR5B', 'CFAP45'],
'70': ['CLLU1-AS1', 'ARHGAP22', 'DNAJC12', 'AVP'],
'199': ['CCDC112', 'FADS3', 'ZNF821'],
'8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
'246': ['FCER2', 'ABCB4'],
'190': ['POC1B', 'ASXL2', 'MBNL3'],
'168': ['FAM241A', 'SDCCAG8', 'CERS6'],
'230': ['ARL13B', 'HNRNPD-DT'],
'231': ['NR4A1', 'ATF3'],
'221': ['FAM111A', 'PLCL2'],
'232': ['HMGA1P4', 'C1orf220'],
'233': ['EVI5', 'ATP6V0A1'],
'128': ['C11orf24', 'LARGE2', 'KCNG1'],
'234': ['ARFIP1', 'GCNT1'],
'129': ['DNAL4', 'CD2AP-DT', 'PRXL2B'],
'147': ['EPB41L2', 'ELL2', 'ABCA1'],
'3': ['MX1',
'HERC5',
'XAF1',
'IFI44L',
'IFIT3',
'MX2',
'EPSTI1',
'PLSCR1'],
'54': ['CYTOR', 'SLAMF7', 'BHLHE41', 'IFNG'],
'183': ['FAAP24', 'ZNF252P-AS1', 'HCN3'],
'130': ['PACSIN2', 'DDI2', 'RYBP'],
'75': ['GBP5', 'GBP1', 'GBP4', 'GBP2'],
'259': ['DOCK9-DT', 'NUMBL'],
'150': ['USP11', 'BTN2A2', 'HIP1R'],
'73': ['DCTN6-DT', 'CSKMT', 'ATF7IP2', 'ILF3-DT'],
'203': ['NRGN', 'RGS18', 'PRKAR2B'],
'71': ['DNAAF4',
'LINC00106_ENSG00000236871',
'CSGALNACT2-DT',
'JARID2-AS1'],
'236': ['DDN-AS1', 'ICOSLG'],
'204': ['DTX1', 'GPR160', 'HRK'],
'237': ['ERCC1', 'HDAC9'],
'45': ['CYB5D2', 'LINC01004', 'ZFP28-DT', 'ALG9'],
'26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
'213': ['DYNLT2', 'CAGE1'],
'289': ['PPP1R14A', 'FAM30A'],
'52': ['NOTCH2', 'SERTAD2', 'KLF7', 'MAML3'],
'61': ['TSC22D2', 'ICA1L', 'TTN', 'KCNQ1OT1'],
'141': ['FAM177B', 'CCL22', 'IL6'],
'185': ['CYP2R1', 'NHLH1', 'LINC02848'],
'198': ['IRAK3', 'BAZ2B', 'PECAM1'],
'238': ['EFHC1', 'METTL7A'],
'239': ['ZNF165', 'CFAP44'],
'50': ['IGHV3-15', 'IGLC6', 'IGHV3-49', 'IGLC7'],
'260': ['NME4', 'MYC'],
'78': ['SOX4', 'ARMH1', 'MIR181A1HG', 'MTHFD2L'],
'133': ['VCAN', 'CSF3R', 'CD14'],
'43': ['DOP1B', 'MIR3936HG', 'DLEU2', 'NRDE2'],
'249': ['GDAP1', 'CENPP'],
'291': ['PTGDS', 'IGFBP4'],
'72': ['CA5B', 'RBM43', 'INTS6-AS1', 'LRRC8C-DT'],
'240': ['LIX1-AS1', 'CLEC20A'],
'169': ['NUP58', 'SPPL2B', 'CLASRP'],
'160': ['CRIP2', 'BDH2', 'PCSK1N'],
'297': ['COL19A1'],
'134': ['PIDD1', 'NUP210', 'SPATA33'],
'292': ['CORO2B', 'AMDHD2'],
'153': ['CLEC4A', 'GM2A', 'MIR223HG'],
'208': ['PRCD', 'ZNF528', 'MYBPC2'],
'243': ['KCNH8', 'LINC02397'],
'88': ['CNN3', 'CCND1', 'APP', 'PLK2'],
'162': ['FRY', 'LINC02029', 'GABBR1'],
'275': ['CNTNAP2', 'MAP3K9'],
'136': ['GABARAPL1', 'IL1RN', 'PLAUR'],
'244': ['CLIC4', 'FCHSD2'],
'159': ['HEXIM1', 'GADD45A', 'IRS2'],
'38': ['MARCHF1', 'HLA-DRB5', 'CPVL', 'SHTN1'],
'179': ['ANXA1', 'CD9', 'LGALS3'],
'193': ['TGFBI', 'RNASE6', 'MPEG1'],
'245': ['OPN3', 'MAP3K13'],
'84': ['IGHV3-53', 'IGKV2D-29', 'IGHV3-30', 'IGHV5-51'],
'242': ['IGHE', 'WNT10A'],
'191': ['H2BC21', 'H2BC8', 'H2BC5'],
'87': ['DSP', 'FXYD7', 'SH2D3A', 'LSR'],
'279': ['ADPGK-AS1', 'LINC00467'],
'138': ['RHOB', 'CITED2', 'H2AC19'],
'164': ['KCNIP2', 'RORA-AS1', 'PGBD4'],
'248': ['CEMIP2', 'KLRD1'],
'192': ['THRB', 'KLK1', 'MOB3B'],
'207': ['ADARB1', 'EPN2', 'MICAL3'],
'247': ['CFL2', 'PAIP2B'],
'264': ['IFNG-AS1', 'SLAMF1'],
'24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
'28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
'195': ['CCDC168', 'KLF3-AS1', 'NARF-AS2'],
'228': ['MSMP', 'PLD5'],
'89': ['CXCL3', 'CCL20', 'MIR3945HG', 'MMP7'],
'143': ['VAV3-AS1', 'DPEP2', 'IL6-AS1'],
'285': ['NETO1-DT', 'LRRIQ3'],
'31': ['ARHGAP9', 'ARHGAP4', 'FAM200B', 'CMTM7'],
'135': ['GALNT4', 'ARRDC4', 'LPP-AS2'],
'145': ['FCRL5', 'LINC01480', 'GCM1'],
'200': ['EGR2', 'C15orf48', 'EGR3'],
'146': ['FCRL1', 'KLHL14', 'MYO7B'],
'93': ['GPR35', 'LILRA1', 'LILRA6', 'HYAL3'],
'11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
'250': ['AKAP6', 'LINC01685'],
'271': ['LUCAT1', 'IER5L'],
'170': ['P2RX5', 'CDCA7L', 'GPR18'],
'282': ['CLEC17A', 'HLA-DOB'],
'76': ['DBNDD2', 'MANSC1', 'OXT', 'PPP1R21-DT'],
'151': ['COLCA1', 'LIX1L-AS1', 'TAGAP-AS1'],
'152': ['ARHGAP12', 'SLC16A11', 'HAGHL'],
'154': ['LEKR1', 'OXTR', 'P2RX7'],
'284': ['CD300E', 'DUSP6'],
'299': ['E2F5'],
'265': ['CAMK2D', 'CIITA'],
'286': ['RAB37', 'C11orf21'],
'82': ['DNASE1L3', 'KHDRBS2', 'MACROD2', 'MIR4432HG'],
'288': ['CCDC88A', 'OGFRL1'],
'47': ['MILR1', 'CBFA2T3', 'PAWR', 'HS2ST1'],
'278': ['NKILA', 'KLHL22'],
'283': ['TCL6', 'YPEL2'],
'294': ['IFRD1', 'NEIL1'],
'210': ['TMEM176A', 'G0S2', 'MAFB'],
'253': ['ARHGEF35-AS1', 'FAM24B'],
'181': ['KLRC1', 'GZMK', 'GPR65'],
'155': ['IRF4', 'TP53INP1', 'PHF6'],
'161': ['CSRP2', 'BEX1', 'PEG10'],
'42': ['ZCCHC9', 'FBXO22', 'KIAA0586', 'PNKD'],
'157': ['PPP1R14B-AS1', 'TP53RK-DT', 'PTK6'],
'202': ['OR2A1-AS1', 'PPP1R12A-AS1', 'MYLK-AS1'],
'209': ['COL9A3', 'N4BP3', 'LCN8'],
'96': ['CHL1', 'SYT17', 'MAP2'],
'257': ['ATP2B1-AS1', 'TMEM107'],
'196': ['ARMCX4', 'MAP3K6', 'NPTX1'],
'258': ['BCL6', 'ATP11A'],
'171': ['CTSW', 'GZMM', 'FEZ1'],
'20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44'],
'290': ['CBR3', 'LOH12CR2'],
'165': ['IRAK2', 'LINC00265', 'NRARP'],
'172': ['H1-10', 'H1-2', 'H1-3'],
'301': ['KLHDC7B-DT'],
'274': ['CD180', 'IRAG1-AS1'],
'167': ['JAG2', 'P2RY14', 'PPM1J'],
'262': ['CYP2U1-AS1', 'KCTD16'],
'263': ['CLDN11', 'NTNG1'],
'293': ['LIX1', 'GNG3'],
'298': ['IGLC1'],
'49': ['MED12', 'ERLIN2', 'IFT122', 'P2RX4'],
'176': ['HES1', 'ID1', 'ID2'],
'173': ['HES4', 'PPP1R17', 'ODAD3'],
'267': ['HHIP-AS1', 'LINC01013'],
'142': ['MZB1', 'DERL3', 'AIM2'],
'270': ['EAF2', 'PLPP5'],
'272': ['INSIG2', 'KHDRBS3'],
'273': ['IGHV3-7', 'STAG3'],
'189': ['DPYSL3', 'CERCAM', 'NRP2'],
'300': ['PRMT5-AS1'],
'276': ['CHST15', 'LINC02728'],
'277': ['OXCT1-AS1', 'MYZAP'],
'255': ['PCDH9', 'NIBAN3'],
'256': ['LINC01703', 'NDUFV2-AS1'],
'281': ['LINC02132', 'NUDT8'],
'295': ['LINC02422', 'SLC38A11'],
'187': ['AMN1', 'NT5E', 'OLMALINC'],
'261': ['GZMA', 'IL32']})
Filter on clusters with 5 or more genes
# Obtain the set of gene programmes from clusters with #genes >= 5
mgs = dict()
for mg, genes in metagenes.items():
if len(genes) > 4:
mgs[mg] = genes
# Here are the gene programmes identified
mgs
{'21': ['ZNF274', 'OMA1', 'POLM', 'HELQ', 'LINC00339'],
'12': ['ZFAND2A', 'HSPB1', 'DNAJB1', 'HSPA1A', 'HSPA6', 'HSPA1B'],
'5': ['ZBTB16', 'TLE1', 'TBC1D5', 'CRIM1', 'PRKCB', 'VAV3', 'MCTP2'],
'22': ['TRIM23', 'ERV3-1', 'RNGTT', 'MBD5', 'CCNT1'],
'1': ['TNFRSF13B',
'FCRL2',
'PNOC',
'S100B',
'LINC01781',
'CD1C',
'LINC01857',
'OSBPL10'],
'10': ['TIMP1', 'VSIR', 'AIF1', 'FABP5', 'CST3', 'LST1'],
'25': ['TACC3', 'MKI67', 'CKS2', 'CENPF', 'SMC4'],
'2': ['SP2',
'H2AC21',
'H2AC11',
'H2BC13',
'H2BC18',
'H2BC4',
'H2BC7',
'H3C3'],
'17': ['SEMA3C', 'PID1', 'EMP1', 'CREB5', 'CD109'],
'13': ['S100A8', 'FCN1', 'S100A12', 'LYZ', 'MNDA', 'S100A9'],
'16': ['RSPH1', 'CFAP126', 'IQCG', 'MORN2', 'MOK'],
'0': ['TXNDC5',
'IGHA2',
'IGLC3',
'IGHA1',
'IGHG4',
'JCHAIN',
'IGHG1',
'IGHG3',
'IGLC2'],
'9': ['TRAC', 'GIMAP7', 'CD3G', 'CD3D', 'CD2', 'IL7R'],
'19': ['RAB3IP', 'DZIP3', 'ALDH7A1', 'PGM2L1', 'PAX8-AS1'],
'18': ['RAB33B-AS1', 'MZF1-AS1', 'SDR42E2', 'C16orf74', 'MIR4458HG'],
'7': ['GP1BB', 'PPBP', 'HBD', 'CAVIN2', 'GNG11', 'PF4'],
'29': ['TRBV28', 'CD8A', 'CCND3', 'CD8B', 'LINC02446'],
'6': ['GNLY', 'CST7', 'FGFBP2', 'CCL5', 'GZMB', 'NKG7', 'GZMH'],
'23': ['JMY', 'DUSP2', 'ZNF331', 'NR4A2', 'DDX3Y'],
'4': ['IGHV4-39',
'IGKV1-9',
'IGHV4-34',
'IGHV1-3',
'IGLV2-14',
'IGHV1-18',
'IGHV3-23'],
'15': ['CENPH', 'CLSPN', 'RFC3', 'CDCA5', 'ITGB3BP'],
'14': ['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM'],
'27': ['TCL1A', 'FCRLA', 'CD19', 'CD22', 'CD72'],
'8': ['FCGRT', 'MS4A7', 'CTSL', 'CD68', 'LIPA', 'NINJ1'],
'3': ['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1'],
'26': ['PF4V1', 'GP9', 'CMTM5', 'PTGS1', 'MPIG6B'],
'24': ['AREG', 'INSIG1', 'MAP3K8', 'PLEK', 'PTPRE'],
'28': ['MIR155HG', 'CD70', 'TNFSF9', 'NR4A3', 'PIF1'],
'11': ['HBB', 'CAT', 'HBA2', 'HBA1', 'SLC25A37', 'MT2A'],
'20': ['IGHV2-5', 'IGHV3-33', 'IGHV4-59', 'IGKV4-1', 'IGLV1-44']}
Step 4: Visualise Network Connectivity within Desired Gene Programme¶
# Retrieve gene programme 3
GP_genes = mgs['3']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
df = embed.compute_similarities(i, GP_genes)
df['Gene1'] = i
df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)
['MX1', 'HERC5', 'XAF1', 'IFI44L', 'IFIT3', 'MX2', 'EPSTI1', 'PLSCR1']
100%|āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā| 8/8 [00:00<00:00, 236.27it/s]
Gene Similarity Gene1 60 EPSTI1 0.225607 PLSCR1 45 EPSTI1 0.188223 MX2 6 EPSTI1 0.256395 MX1 35 EPSTI1 0.274948 IFIT3 29 EPSTI1 0.241366 IFI44L 19 EPSTI1 0.254587 XAF1 14 EPSTI1 0.119632 HERC5 31 HERC5 0.168532 IFI44L 55 HERC5 0.119632 EPSTI1 47 HERC5 0.163126 MX2 38 HERC5 0.202499 IFIT3 23 HERC5 0.168315 XAF1 63 HERC5 0.091875 PLSCR1 7 HERC5 0.174051 MX1 1 IFI44L 0.372201 MX1 17 IFI44L 0.337581 XAF1 59 IFI44L 0.232156 PLSCR1 11 IFI44L 0.168532 HERC5 34 IFI44L 0.296725 IFIT3 52 IFI44L 0.241366 EPSTI1 41 IFI44L 0.306452 MX2 44 IFIT3 0.198912 MX2 49 IFIT3 0.274948 EPSTI1 9 IFIT3 0.202499 HERC5 28 IFIT3 0.296725 IFI44L 3 IFIT3 0.313495 MX1 58 IFIT3 0.240587 PLSCR1 20 IFIT3 0.240173 XAF1 10 MX1 0.174051 HERC5 50 MX1 0.256395 EPSTI1 33 MX1 0.313495 IFIT3 25 MX1 0.372201 IFI44L 57 MX1 0.266255 PLSCR1 18 MX1 0.323817 XAF1 42 MX1 0.295364 MX2 4 MX2 0.295364 MX1 54 MX2 0.188223 EPSTI1 39 MX2 0.198912 IFIT3 13 MX2 0.163126 HERC5 21 MX2 0.228178 XAF1 27 MX2 0.306452 IFI44L 62 MX2 0.182111 PLSCR1 22 PLSCR1 0.212136 XAF1 30 PLSCR1 0.232156 IFI44L 53 PLSCR1 0.225607 EPSTI1 36 PLSCR1 0.240587 IFIT3 46 PLSCR1 0.182111 MX2 5 PLSCR1 0.266255 MX1 15 PLSCR1 0.091875 HERC5 43 XAF1 0.228178 MX2 37 XAF1 0.240173 IFIT3 26 XAF1 0.337581 IFI44L 12 XAF1 0.168315 HERC5 2 XAF1 0.323817 MX1 61 XAF1 0.212136 PLSCR1 51 XAF1 0.254587 EPSTI1
# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)
# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')
elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]
pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)
width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
if v > thresh:
width_large[list(widths.keys())[i]] = v*10
else:
width_small[list(widths.keys())[i]] = max(v, 0)*10
nx.draw_networkx_edges(G, pos,
edgelist = width_small.keys(),
width=list(width_small.values()),
edge_color='lightblue',
alpha=0.8)
nx.draw_networkx_edges(G, pos,
edgelist = width_large.keys(),
width = list(width_large.values()),
alpha = 0.5,
edge_color = "blue",
)
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)
ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()
Step 6: Reactome Pathway Analysis¶
Reactome pathway is another form of pathway analysis that can be carried out on the data.
# Meta info about the number of terms (tests) in the databases
df_database = pd.DataFrame(
data = [['GO_Biological_Process_2021', 6036],
['GO_Molecular_Function_2021', 1274],
['Reactome_2022', 1818]],
columns = ['dataset', 'term'])
# Select desired database for query; here use Reactome as an example
databases = ['Reactome_2022']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m
# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
gene_sets=databases,
organism='Human',
outdir='test/enr_Reactome',
cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)
# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df
| Gene_set | Term | Overlap | P-value | Adjusted P-value | Old P-value | Old Adjusted P-value | Odds Ratio | Combined Score | Genes | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Reactome_2022 | Interferon Signaling R-HSA-913531 | 5/200 | 5.197809e-09 | 8.836276e-08 | 0 | 0 | 169.205128 | 3227.592653 | HERC5;MX2;MX1;XAF1;IFIT3 |
| 1 | Reactome_2022 | Interferon Alpha/Beta Signaling R-HSA-909733 | 4/72 | 1.068672e-08 | 9.083711e-08 | 0 | 0 | 293.000000 | 5377.799365 | MX2;MX1;XAF1;IFIT3 |
| 2 | Reactome_2022 | Cytokine Signaling In Immune System R-HSA-1280215 | 5/702 | 2.693582e-06 | 1.189224e-05 | 0 | 0 | 46.138211 | 591.705896 | HERC5;MX2;MX1;XAF1;IFIT3 |
| 3 | Reactome_2022 | ISG15 Antiviral Mechanism R-HSA-1169408 | 3/75 | 2.798175e-06 | 1.189224e-05 | 0 | 0 | 166.000000 | 2122.566166 | HERC5;MX2;MX1 |
| 4 | Reactome_2022 | Antiviral Mechanism By IFN-stimulated Genes R-... | 3/83 | 3.801759e-06 | 1.292598e-05 | 0 | 0 | 149.340000 | 1863.770163 | HERC5;MX2;MX1 |
External Reading:
- SCGPT paper: https://www.nature.com/articles/s41592-024-02201-0
- Feature selection: https://bioconductor.org/books/3.15/OSCA.basic/feature-selection.html
- P-values: https://www.nature.com/articles/nmeth.2900
- Community detection algorithms: https://www.nature.com/articles/s41598-019-41695-z
- Reactome: https://reactome.org/
Exercises:
- Describe the process of loading the pre-trained scGPT model. What are the key parameters used?
- Explain the significance of the n_hvg and n_bins parameters used in the preprocessing pipeline.
- Why is it important to filter the model gene embeddings to match the HVGs found in the input dataset?
- Describe the process of constructing the gene embedding network and performing Louvain clustering. What is the role of the resolution parameter?
- Explain how the cosine similarity among genes in a specific gene programme is computed.
- What is the purpose of performing pathway enrichment analysis in this context?
- Explain the significance of the p-value correction applied during the pathway enrichment analysis.
- Change the resolution parameter for the Louvain clustering step and analyse how the number of identified gene programmes changes. Plot the distribution of gene programmes for different resolution values.
- Choose a different gene programme and create a similar gene similarity network to visualise.
- Perform pathway enrichment analysis using a different database, such as GO_Biological_Process_2021 as well as report the top enriched pathways/processes and their significance levels.
Answers:
- The process of loading the pre-trained scGPT model involves several steps:
- The vocabulary is loaded from a file and special tokens are appended if they are not already present.
- Model parameters such as embedding size, number of heads, number of layers amongst others are read from a configuration file.
- A transformer model object is intitialised with the retrieved parameters and the vocabulary size.
- The model weights are loaded from a checkpoint file. If there are mismatches in the parameter shapes, only matching parameters are updated in the model state.
- n_hvg parameter specifies the number of highly variable genes to retain from the dataset.Selecting HVGs helps in focusing on genes that have most significant variations across cells, which are typically more informative for downstream analyses like clustering and classification. n_bins parameter determines how many bins the gene expression values are divided into. Binning is used to discretise continuous gene expression data into discrete categories, which can simplify the input for models and help in reducing noise.
- Filtering genes ensures that only the embeddings of genes that are relevant (HVGs) to the curren analysis are used. HVGs capture the most important variations in the dataset. This alignment also guarantees that the model embeddings correspond to the actual genes present in the dataset, preventing discrepancies that could arise if embeddings for non-existent or non-variable genes were included. By focusing onlhy on HVGs, the computational complexity and memory usage are reduced, making the analysis more efficient.
- The process of constructing the network and clustering:
- The gene embeddings obtained from the model are used to create a graph where nodes represent genes, and edges represent similarities (e.g. cosine similarities) between gene embeddings.
- The Louivain method is applied to the graph to partition the network into clusters. The algorithm optimises modularity to find dense clusters of nodes (genes) that are highly interconnected.
- The resolution parameter controls the granularity of the clustering. A higher resolution value typically results in a larger number of smaller clusters, whereas a lower resolution value produces fewer but larger clusters. Adjusting it helps fine-tune the balance between cluster size and number.
- The cosine similarity calculation: For each gene in the gene programme, the cosine similarity with every other gene in the same programme is computed. The cosine similarity between two vectors is computed by the dot product of vector A and B divided by both vectors' magnitudes.
- The purpose is to help identify which biological pathways are significantly associated with the genes in a given gene programme. This can provide insights into the biological functions and processes that are active or relevant in the dataset. It also aids in generating hypotheses about the roles of specific genes and their interactions within known biological pathways. It also helps in annotating gene clusters with biological functions, which can be important for interpreting results of gene expression studies and understanding the underlying biology.
- When performing pathway enrichment analysis, multiple hypothesis tests are conducted (one for each pathway). This increases the likelihood of false positives. P-value correction methods, such as the Bonferroni correction used here, adjust the significance threshold to account for the number of tests performed, reducing the probability of false positive results. The corrected p-value threshold ensures that the results reported as significant are truly so, given the large number of tests. This helps in maintaining the overall type I error rate (false positive rate) at the desired level (typically 0.05).
Answer 8:
# Define a range of resolution values to test
resolutions = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
# Dictionary to store the number of gene programmes for each resolution
gene_program_counts = {}
for res in resolutions:
# Perform Louvain clustering with the current resolution
gdata = embed.get_adata(resolution=res)
metagenes = embed.get_metagenes(gdata)
# Count the number of gene programs (clusters with >= 5 genes)
gene_program_counts[res] = sum(1 for genes in metagenes.values() if len(genes) >= 5)
# Plot the number of gene programs for each resolution
plt.figure(figsize=(10, 6))
plt.plot(resolutions, list(gene_program_counts.values()), marker='o')
plt.xlabel('Resolution')
plt.ylabel('Number of Gene Programmes')
plt.title('Number of Gene Programmes vs. Resolution')
plt.grid(True)
plt.show()
As the resolution increases, so does the number of gene programmes.
Answer 9:
# Retrieve gene programme 14
GP_genes = mgs['14']
print(GP_genes)
# Initialise an empty list to collect DataFrames
df_list = []
# Compute cosine similarities among genes in this gene programme
for i in tqdm.tqdm(GP_genes):
df = embed.compute_similarities(i, GP_genes)
df['Gene1'] = i
df_list.append(df)
# Concatenate all the DataFrames in the list
df_GP = pd.concat(df_list, ignore_index=True)
# Filter out edges from each gene to itself and sort by 'Gene'
df_GP_sub = df_GP[df_GP['Similarity'] < 0.99].sort_values(by='Gene')
# Display the resulting DataFrame
print(df_GP_sub)
# Creates a graph from the cosine similarity network
input_node_weights = [(row['Gene'], row['Gene1'], round(row['Similarity'], 2)) for i, row in df_GP_sub.iterrows()]
G = nx.Graph()
G.add_weighted_edges_from(input_node_weights)
# Plot the cosine similarity network; strong edges (> select threshold) are highlighted
thresh = 0.4
plt.figure(figsize=(20, 20))
widths = nx.get_edge_attributes(G, 'weight')
elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > thresh]
esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= thresh]
pos = nx.spring_layout(G, k=0.4, iterations=15, seed=3)
width_large = {}
width_small = {}
for i, v in enumerate(list(widths.values())):
if v > thresh:
width_large[list(widths.keys())[i]] = v*10
else:
width_small[list(widths.keys())[i]] = max(v, 0)*10
nx.draw_networkx_edges(G, pos,
edgelist = width_small.keys(),
width=list(width_small.values()),
edge_color='lightblue',
alpha=0.8)
nx.draw_networkx_edges(G, pos,
edgelist = width_large.keys(),
width = list(width_large.values()),
alpha = 0.5,
edge_color = "blue",
)
# node labels
nx.draw_networkx_labels(G, pos, font_size=25, font_family="sans-serif")
# edge weight labels
d = nx.get_edge_attributes(G, "weight")
edge_labels = {k: d[k] for k in elarge}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=15)
ax = plt.gca()
ax.margins(0.08)
plt.axis("off")
plt.show()
['FOSB', 'FOS', 'KLF4', 'EGR1', 'MIR23AHG', 'MYADM']
100%|āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā| 6/6 [00:00<00:00, 240.97it/s]
Gene Similarity Gene1 1 EGR1 0.405794 FOSB 28 EGR1 0.222037 MIR23AHG 34 EGR1 0.179937 MYADM 15 EGR1 0.296500 KLF4 8 EGR1 0.254614 FOS 29 FOS 0.210742 MIR23AHG 21 FOS 0.254614 EGR1 17 FOS 0.245863 KLF4 35 FOS 0.171657 MYADM 5 FOS 0.335028 FOSB 13 FOSB 0.339160 KLF4 7 FOSB 0.335028 FOS 31 FOSB 0.364945 MYADM 19 FOSB 0.405794 EGR1 25 FOSB 0.363793 MIR23AHG 32 KLF4 0.300213 MYADM 20 KLF4 0.296500 EGR1 4 KLF4 0.339160 FOSB 9 KLF4 0.245863 FOS 26 KLF4 0.274826 MIR23AHG 33 MIR23AHG 0.231585 MYADM 3 MIR23AHG 0.363793 FOSB 10 MIR23AHG 0.210742 FOS 16 MIR23AHG 0.274826 KLF4 22 MIR23AHG 0.222037 EGR1 23 MYADM 0.179937 EGR1 2 MYADM 0.364945 FOSB 14 MYADM 0.300213 KLF4 11 MYADM 0.171657 FOS 27 MYADM 0.231585 MIR23AHG
Answer 10:
# Set pandas options to display the entire term
pd.set_option('display.max_colwidth', None)
# Select desired database for query
databases = ['GO_Biological_Process_2021']
m = df_database[df_database['dataset'].isin(databases)]['term'].sum()
# p-value correction for total number of tests done
p_thresh = 0.05/m
# Perform pathway enrichment analysis using the gseapy package in the Reactome database
df_list = []
enr_Reactome = gp.enrichr(gene_list=GP_genes,
gene_sets=databases,
organism='Human',
outdir='test/enr_Reactome',
cutoff=0.5)
out = enr_Reactome.results
out = out[out['P-value'] < p_thresh]
df_list.append(out)
# Concatenate all DataFrames in the list
df = pd.concat(df_list, ignore_index=True)
df
| Gene_set | Term | Overlap | P-value | Adjusted P-value | Old P-value | Old Adjusted P-value | Odds Ratio | Combined Score | Genes | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GO_Biological_Process_2021 | positive regulation of pri-miRNA transcription by RNA polymerase II (GO:1902895) | 3/34 | 8.944946e-08 | 0.000014 | 0 | 0 | 643.967742 | 10451.333760 | EGR1;FOS;KLF4 |
| 1 | GO_Biological_Process_2021 | regulation of pri-miRNA transcription by RNA polymerase II (GO:1902893) | 3/45 | 2.118537e-07 | 0.000017 | 0 | 0 | 475.047619 | 7300.232566 | EGR1;FOS;KLF4 |
| 2 | GO_Biological_Process_2021 | regulation of cell-cell adhesion involved in gastrulation (GO:0070587) | 2/10 | 3.371314e-06 | 0.000131 | 0 | 0 | 1249.125000 | 15739.234729 | MYADM;KLF4 |
| 3 | GO_Biological_Process_2021 | negative regulation of heterotypic cell-cell adhesion (GO:0034115) | 2/10 | 3.371314e-06 | 0.000131 | 0 | 0 | 1249.125000 | 15739.234729 | MYADM;KLF4 |
The top significant processes are positive regulation of pri-miRNA transcription by RNA polymerase II (p.val <0.05), regulation of cell-cell adhesion involved in gastrulation (p.val <0.05) and negative regulation of heterotypic cell-cell adhesion (p.val <0.5).