How to find, load and process snRNA-seq data¶
#import libraries
import wget
import pandas as pd
import numpy as np
import scanpy as sc
import anndata
Gene network analysis is a method for identifying sub-networks (modules) of correlated genes, which are likely to be co-expressed. This can help to identify modules of genes that contribute to disease. In this example, we will cover how to create a pairwise correlation matrix of genes, as well as how to associate them with disease.
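As a rough illustration of what a pairwise gene correlation matrix looks like, the sketch below builds one from a small made-up expression matrix using NumPy. The matrix values and the gene labels in the comments are invented for illustration and are not taken from the dataset used in this tutorial.

```python
import numpy as np

# Toy expression matrix: 6 cells (rows) x 4 genes (columns); all values are made up.
rng = np.random.default_rng(0)
base = rng.normal(size=6)
expr = np.column_stack([
    base,                 # gene_a
    base * 2.0 + 0.1,     # gene_b: perfectly correlated with gene_a
    -base,                # gene_c: perfectly anti-correlated with gene_a
    rng.normal(size=6),   # gene_d: unrelated noise
])

# rowvar=False tells NumPy that each column (gene) is a variable
# and each row (cell) is an observation.
gene_corr = np.corrcoef(expr, rowvar=False)
print(np.round(gene_corr, 2))
```

The result is a symmetric genes × genes matrix with ones on the diagonal; groups of genes with high mutual correlation are the candidates for modules.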
First we will cover how to find, load and process the snRNA-seq data.
Find a Dataset¶
For this tutorial, we will use a freely available, open-access dataset generated from human peripheral blood mononuclear cells (PBMCs) from patients with clonal hematopoiesis and controls. The dataset is available from the cellxgene portal at https://cellxgene.cziscience.com/collections/0aab20b3-c30c-4606-bd2e-d20dae739c45 under the title "Multiomic Profiling of Human Clonal Hematopoiesis Reveals Genotype and Cell-Specific Inflammatory Pathway Activation". The associated paper has the same title and is available at: https://ashpublications.org/bloodadvances/article/8/14/3665/515374/Multiomic-profiling-of-human-clonal-hematopoiesis Single-cell RNA sequencing (scRNA-seq) was performed on samples from patients with clonal hematopoiesis and controls. This dataset was chosen for its compatibility with the purpose of the pipeline. The data will be available in the data/test/ directory and is stored in h5ad format. By the end of this section, we will have loaded and explored the dataset.
Download a Dataset¶
Start by downloading the dataset from the original portal. Note that this step is optional: to save time, the filtered dataset has already been placed in the GitHub repository within /dataset.
# URL of the dataset
url = "https://datasets.cellxgene.cziscience.com/6094cddd-de51-4891-8841-43e25120c336.h5ad"
# Destination path where the dataset will be saved
destination_path = "/ReCoDE-Gene-Network-Analysis/data/data/pbmc.h5ad"
# Download the dataset
wget.download(url, destination_path)
#Alternatively, the dataset can be found in the directory stated in the next cell.
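Because the download is large and optional, it can be convenient to skip it when the file is already present. The sketch below is one way to do that. It swaps the `wget` package for the standard library's `urllib.request`, and the helper name `download_if_missing` is hypothetical rather than part of this tutorial's repository.

```python
import os
import urllib.request


def download_if_missing(url, destination_path):
    """Download url to destination_path unless the file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(destination_path):
        return False  # file already on disk, nothing to do
    parent = os.path.dirname(destination_path)
    if parent:
        # Create the target directory if it does not exist yet.
        os.makedirs(parent, exist_ok=True)
    urllib.request.urlretrieve(url, destination_path)
    return True


# Example call (commented out so that nothing is downloaded by accident):
# download_if_missing(url, destination_path)
```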
# Load the dataset
# Please be aware that you will need to download the dataset yourself in order to work with it
pbmc = sc.read("/ReCoDE-Gene-Network-Analysis/data/data/pbmc.h5ad")
#inspect the loaded data
pbmc
AnnData object with n_obs × n_vars = 66985 × 36263
obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_umap'
As can be seen, there are 66985 cells and 36263 genes within the dataset. For the purposes of these exercises, we will filter the dataset further to focus on one cell type and to reduce its size for ease of use.
pbmc.obs['cell_type']
0002_AAACCCACAAGTCCCG-1 CD4-positive, alpha-beta T cell
0002_AAACCCAGTAGTCGTT-1 CD16-positive, CD56-dim natural killer cell, h...
0002_AAACCCATCTACACAG-1 dendritic cell
0002_AAACGAAAGAATTTGG-1 CD4-positive, alpha-beta T cell
0002_AAACGCTAGCGACTGA-1 CD14-positive monocyte
...
079_TTTGGAGTCAGAGTGG-1 CD14-positive monocyte
079_TTTGGAGTCGACATAC-1 CD14-positive monocyte
079_TTTGGTTAGGTTATAG-1 CD16-positive, CD56-dim natural killer cell, h...
079_TTTGGTTCACACCAGC-1 natural killer cell
079_TTTGTTGGTTGTTGCA-1 CD4-positive, alpha-beta T cell
Name: cell_type, Length: 66985, dtype: category
Categories (9, object): ['platelet', 'B cell', 'dendritic cell', 'natural killer cell', ..., 'CD8-positive, alpha-beta T cell', 'erythroid lineage cell', 'CD16-positive, CD56-dim natural killer cell, ..., 'CD14-positive monocyte']
As can be seen, this dataset contains nine different cell types. We will focus on B cells for the purposes of our exercises.
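Before picking a cell type, it can help to check how many cells each type contributes. The sketch below shows the idea on a small made-up categorical Series; the labels are borrowed from the output above, but the counts are invented.

```python
import pandas as pd

# Toy stand-in for pbmc.obs['cell_type']; the values are made up.
cell_types = pd.Series(
    ["B cell", "CD14-positive monocyte", "B cell",
     "platelet", "CD14-positive monocyte", "B cell"],
    dtype="category",
    name="cell_type",
)

# Number of cells per cell type, largest first.
counts = cell_types.value_counts()
print(counts)

# Number of cells that a boolean filter on one label would keep.
n_b_cells = int((cell_types == "B cell").sum())
print(n_b_cells)
```

On the real object, the equivalent call would be `pbmc.obs['cell_type'].value_counts()`.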
# Filter the AnnData object for B cells
Bcell = pbmc[pbmc.obs['cell_type'] == 'B cell']
Bcell
View of AnnData object with n_obs × n_vars = 3540 × 36263
obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_umap'
# Check whether the gene names are gene symbols rather than Ensembl IDs, which are also common.
Bcell.var
| | feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length |
|---|---|---|---|---|---|
| ENSG00000243485 | False | MIR1302-2HG | NCBITaxon:9606 | gene | 1021 |
| ENSG00000237613 | False | FAM138A | NCBITaxon:9606 | gene | 1219 |
| ENSG00000186092 | False | OR4F5 | NCBITaxon:9606 | gene | 2618 |
| ENSG00000238009 | False | ENSG00000238009.6 | NCBITaxon:9606 | gene | 3726 |
| ENSG00000239945 | False | ENSG00000239945.1 | NCBITaxon:9606 | gene | 1319 |
| ... | ... | ... | ... | ... | ... |
| ENSG00000277836 | False | ENSG00000277836.1 | NCBITaxon:9606 | gene | 288 |
| ENSG00000278633 | False | ENSG00000278633.1 | NCBITaxon:9606 | gene | 2404 |
| ENSG00000276017 | False | ENSG00000276017.1 | NCBITaxon:9606 | gene | 2404 |
| ENSG00000278817 | False | ENSG00000278817.1 | NCBITaxon:9606 | gene | 1213 |
| ENSG00000277196 | False | ENSG00000277196.4 | NCBITaxon:9606 | gene | 2405 |
36263 rows × 5 columns
As can be seen from the gene features dataframe, the rows are currently indexed by Ensembl gene IDs. These are not helpful for our analyses because they are not intuitive to interpret: you would need to look up each Ensembl ID to identify that gene's name and function. The second column, feature_name, shows that the original authors have already converted the Ensembl IDs to gene symbols where possible.
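To get a feel for how many genes have a symbol and how many are still labelled with an Ensembl ID, one option is to count the `feature_name` entries that start with "ENSG". The sketch below does this on a four-row toy table built from rows shown in the output above; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-in for Bcell.var, using four rows from the table above.
var = pd.DataFrame(
    {"feature_name": ["MIR1302-2HG", "FAM138A", "ENSG00000238009.6", "OR4F5"]},
    index=["ENSG00000243485", "ENSG00000237613",
           "ENSG00000238009", "ENSG00000186092"],
)

# True where the "symbol" is really just an Ensembl ID.
looks_like_ensembl = var["feature_name"].str.startswith("ENSG")

n_unmapped = int(looks_like_ensembl.sum())
n_symbols = int((~looks_like_ensembl).sum())
print(f"{n_unmapped} without a gene symbol, {n_symbols} with a gene symbol")
```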
Process the Dataset to Correct Format for Analysis¶
#Let's go ahead and map the values in the feature_name column to the rownames of the dataframe:
# Set the "feature_name" column as the index (row names)
Bcell.var.set_index("feature_name", drop = False, inplace=True)
Note that not all Ensembl IDs map to gene symbols, as can be seen near the top of the dataframe, where some feature_name entries are still Ensembl IDs. We will remove these genes from the dataset, as they would be difficult to interpret in subsequent analyses.
# Filter rows where the index does not start with "ENSG" i.e. the Ensembl IDs.
# Define the condition for filtering genes
filter_genes = ~Bcell.var.index.str.startswith("ENSG") # Exclude genes starting with "ENSG"
filter_genes
# Filter genes based on the condition
Bcell = Bcell[:, filter_genes]
Bcell
View of AnnData object with n_obs × n_vars = 3540 × 25198
obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_umap'
Bcell.var
| | feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length |
|---|---|---|---|---|---|
| ENSG00000243485 | False | MIR1302-2HG | NCBITaxon:9606 | gene | 1021 |
| ENSG00000237613 | False | FAM138A | NCBITaxon:9606 | gene | 1219 |
| ENSG00000186092 | False | OR4F5 | NCBITaxon:9606 | gene | 2618 |
| ENSG00000284733 | False | OR4F29 | NCBITaxon:9606 | gene | 939 |
| ENSG00000284662 | False | OR4F16 | NCBITaxon:9606 | gene | 939 |
| ... | ... | ... | ... | ... | ... |
| ENSG00000223641 | False | TTTY17C | NCBITaxon:9606 | gene | 776 |
| ENSG00000228786 | False | SEPTIN14P23 | NCBITaxon:9606 | gene | 1192 |
| ENSG00000172288 | False | CDY1 | NCBITaxon:9606 | gene | 2670 |
| ENSG00000231141 | False | TTTY3 | NCBITaxon:9606 | gene | 344 |
| ENSG00000274847 | False | MAFIP | NCBITaxon:9606 | gene | 1599 |
25198 rows × 5 columns
As can be seen, the number of genes has been reduced from 36263 to 25198, as every gene whose feature_name was still an Ensembl ID has been removed. Next, let's change the variable names (var_names) to the gene symbols, as they are easier to work with.
# Update var_names with feature names from var DataFrame
Bcell.var_names = Bcell.var['feature_name']
Bcell.var
| feature_name (index) | feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length |
|---|---|---|---|---|---|
| MIR1302-2HG | False | MIR1302-2HG | NCBITaxon:9606 | gene | 1021 |
| FAM138A | False | FAM138A | NCBITaxon:9606 | gene | 1219 |
| OR4F5 | False | OR4F5 | NCBITaxon:9606 | gene | 2618 |
| OR4F29 | False | OR4F29 | NCBITaxon:9606 | gene | 939 |
| OR4F16 | False | OR4F16 | NCBITaxon:9606 | gene | 939 |
| ... | ... | ... | ... | ... | ... |
| TTTY17C | False | TTTY17C | NCBITaxon:9606 | gene | 776 |
| SEPTIN14P23 | False | SEPTIN14P23 | NCBITaxon:9606 | gene | 1192 |
| CDY1 | False | CDY1 | NCBITaxon:9606 | gene | 2670 |
| TTTY3 | False | TTTY3 | NCBITaxon:9606 | gene | 344 |
| MAFIP | False | MAFIP | NCBITaxon:9606 | gene | 1599 |
25198 rows × 5 columns
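Once gene symbols are used as the variable names, it is worth checking that they are unique, because two Ensembl IDs can in principle share one symbol. The sketch below shows such a check on a made-up index; the duplicated entry is invented for illustration and does not reflect the real dataset.

```python
import pandas as pd

# Toy gene symbol index with one deliberately duplicated, made-up entry.
symbols = pd.Index(["OR4F5", "OR4F29", "CDY1", "OR4F29"], name="feature_name")

# keep=False marks every occurrence of a repeated symbol, not just the later ones.
is_duplicate = symbols.duplicated(keep=False)
duplicated_symbols = symbols[is_duplicate].unique().tolist()

print("All symbols unique:", symbols.is_unique)
print("Repeated symbols:", duplicated_symbols)
```

If duplicates do turn up on the real object, AnnData's `var_names_make_unique()` method is one way to disambiguate them.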
We also need to identify the highly variable genes.
Calculating highly variable genes on gene expression data that has not been log-transformed or normalised appropriately can lead to issues, including the presence of infinity values. Log transformation is a common preprocessing step for scRNA-seq data, especially when dealing with count data, to stabilise the variance and make the data more amenable to downstream analysis. It helps to mitigate the impact of high expression values and reduce the influence of technical noise.
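The effect of the log transformation can be seen on a handful of made-up count values. The sketch below uses NumPy's `log1p`, which computes log(1 + x), and contrasts it with a plain logarithm, which is where infinite values come from when a count is zero.

```python
import numpy as np

# Made-up counts spanning several orders of magnitude.
counts = np.array([0, 1, 10, 100, 1000], dtype=float)

# log(1 + x): zero counts stay at zero and large values are compressed.
logged = np.log1p(counts)
print(np.round(logged, 3))

# A plain log of a zero count is -inf, which breaks downstream statistics.
with np.errstate(divide="ignore"):
    naive_log_of_zero = np.log(counts[0])
print(naive_log_of_zero)
```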
# Log-transform the gene expression data: log1p computes log(1 + x)
sc.pp.log1p(Bcell)
# Calculate highly variable genes
sc.pp.highly_variable_genes(Bcell, n_top_genes = 1000)
Bcell
AnnData object with n_obs × n_vars = 3540 × 25198
obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_umap'
Bcell.var
| feature_name (index) | feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length | highly_variable | means | dispersions | dispersions_norm |
|---|---|---|---|---|---|---|---|---|---|
| MIR1302-2HG | False | MIR1302-2HG | NCBITaxon:9606 | gene | 1021 | False | 1.000000e-12 | NaN | NaN |
| FAM138A | False | FAM138A | NCBITaxon:9606 | gene | 1219 | False | 1.000000e-12 | NaN | NaN |
| OR4F5 | False | OR4F5 | NCBITaxon:9606 | gene | 2618 | False | 1.000000e-12 | NaN | NaN |
| OR4F29 | False | OR4F29 | NCBITaxon:9606 | gene | 939 | False | 1.000000e-12 | NaN | NaN |
| OR4F16 | False | OR4F16 | NCBITaxon:9606 | gene | 939 | False | 1.000000e-12 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTY17C | False | TTTY17C | NCBITaxon:9606 | gene | 776 | False | 1.000000e-12 | NaN | NaN |
| SEPTIN14P23 | False | SEPTIN14P23 | NCBITaxon:9606 | gene | 1192 | False | 4.027802e-04 | 0.354964 | -1.686367 |
| CDY1 | False | CDY1 | NCBITaxon:9606 | gene | 2670 | False | 1.000000e-12 | NaN | NaN |
| TTTY3 | False | TTTY3 | NCBITaxon:9606 | gene | 344 | False | 1.000000e-12 | NaN | NaN |
| MAFIP | False | MAFIP | NCBITaxon:9606 | gene | 1599 | False | 1.053585e-02 | 0.594371 | 0.297863 |
25198 rows × 9 columns
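In the table above, genes with a mean of essentially zero have NaN dispersions. The sketch below illustrates why, using a simplified variance-to-mean ratio on a made-up matrix. It is only an illustration of the idea of ranking genes by dispersion, not a reproduction of scanpy's procedure, which additionally normalises dispersions across genes with similar mean expression.

```python
import numpy as np

# Toy matrix: 4 cells (rows) x 4 genes (columns); all values are made up.
expr = np.array([
    [0.0, 1.0, 5.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 7.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
])

means = expr.mean(axis=0)
variances = expr.var(axis=0, ddof=1)

# Variance-to-mean ratio; a gene never detected gives 0/0, which is NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    dispersions = variances / means

# Rank genes from most to least dispersed, sending NaN to the end.
n_top = 2
order = np.argsort(-np.nan_to_num(dispersions, nan=-np.inf))
top_genes = order[:n_top]
print(dispersions, top_genes)
```

Here the first gene is never detected (NaN), the second is constant (dispersion 0), and the last two vary between cells, so they are the ones selected.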
# Let's save the filtered object
Bcell.write_h5ad('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered.h5ad')
Process Associated Metadata¶
We will now explore the associated metadata.
Bcell.obs.columns
Index(['nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID',
'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample',
'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP',
'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT',
'nFeature_SCT', 'scType_celltype', 'pANN',
'development_stage_ontology_term_id', 'cell_type_ontology_term_id',
'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id',
'suspension_type', 'is_primary_data', 'tissue_type',
'tissue_ontology_term_id', 'organism_ontology_term_id',
'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism',
'sex', 'tissue', 'self_reported_ethnicity', 'development_stage',
'observation_joinid'],
dtype='object')
As can be seen, this dataset contains 3540 cells and 25198 genes. It also has relevant metadata in the obs section, such as MUTATION. The metadata may need to be encoded into the correct format for subsequent analyses, so let's have a look at the current format.
Bcell.obs
| | nCount_RNA | nFeature_RNA | nCount_HTO | nFeature_HTO | HTO_maxID | HTO_secondID | HTO_margin | HTO_classification.global | sample | donor_id | ... | disease_ontology_term_id | cell_type | assay | disease | organism | sex | tissue | self_reported_ethnicity | development_stage | observation_joinid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | 97.0 | 2 | sample-2 | sample-5 | 3.146440 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | FrFs19`Dsw |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | 21.0 | 3 | sample-2 | sample-5 | 1.314667 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | MMer^rOrRY |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | 110.0 | 4 | sample-2 | sample-6 | 2.556420 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | ^dC2N0DTU| |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | 20.0 | 4 | sample-2 | sample-3 | 0.705259 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | F>Ad_32l$> |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | 237.0 | 4 | sample-2 | sample-5 | 3.121787 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | +@dOztSS*d |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | 43.0 | 3 | sample-5 | sample-6 | 2.155876 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | GMZ)5R6Eh* |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | 89.0 | 4 | sample-5 | sample-4 | 2.725727 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | lxd{TRji23 |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | 37.0 | 2 | sample-5 | sample-2 | 1.818129 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | KN2ItXPkR4 |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | 74.0 | 3 | sample-5 | sample-1 | 2.510466 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | +VU%_s11(N |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | 67.0 | 5 | sample-5 | sample-3 | 2.200155 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | fc(s`v}4U! |
3540 rows × 41 columns
Let's create a separate dataframe containing the metadata, as this will be needed for the correlation analysis.
# Work on a copy of the metadata so that the original AnnData object is not altered.
metadata = Bcell.obs.copy()
metadata
| | nCount_RNA | nFeature_RNA | nCount_HTO | nFeature_HTO | HTO_maxID | HTO_secondID | HTO_margin | HTO_classification.global | sample | donor_id | ... | disease_ontology_term_id | cell_type | assay | disease | organism | sex | tissue | self_reported_ethnicity | development_stage | observation_joinid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | 97.0 | 2 | sample-2 | sample-5 | 3.146440 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | FrFs19`Dsw |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | 21.0 | 3 | sample-2 | sample-5 | 1.314667 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | MMer^rOrRY |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | 110.0 | 4 | sample-2 | sample-6 | 2.556420 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | ^dC2N0DTU| |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | 20.0 | 4 | sample-2 | sample-3 | 0.705259 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | F>Ad_32l$> |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | 237.0 | 4 | sample-2 | sample-5 | 3.121787 | Singlet | sample-2 | CH-20-002 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 68-year-old human stage | +@dOztSS*d |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | 43.0 | 3 | sample-5 | sample-6 | 2.155876 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | GMZ)5R6Eh* |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | 89.0 | 4 | sample-5 | sample-4 | 2.725727 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | lxd{TRji23 |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | 37.0 | 2 | sample-5 | sample-2 | 1.818129 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | KN2ItXPkR4 |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | 74.0 | 3 | sample-5 | sample-1 | 2.510466 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | +VU%_s11(N |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | 67.0 | 5 | sample-5 | sample-3 | 2.200155 | Singlet | sample-5 | CH-21-079 | ... | MONDO:0100542 | B cell | 10x 3' v3 | clonal hematopoiesis | Homo sapiens | male | blood | European | 78-year-old human stage | fc(s`v}4U! |
3540 rows × 41 columns
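The reason for calling `.copy()` can be shown on a toy dataframe: plain assignment only creates a second name for the same object, whereas a copy can be modified freely. The dataframe and variable names below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for Bcell.obs.
obs = pd.DataFrame({"sex": ["male", "female"], "nCount_RNA": [1192.0, 849.0]})

alias = obs               # same object under a second name
independent = obs.copy()  # separate object with its own data

# Adding a column to the copy leaves the original untouched.
independent["male"] = (independent["sex"] == "male").astype(int)

print(alias is obs)                    # the alias is the original object
print("male" in obs.columns)           # the original did not gain the new column
print("male" in independent.columns)   # only the copy did
```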
Many of these columns are not needed for the correlation analysis.
#Let's remove these columns
columns_to_remove = ['nCount_HTO', 'nFeature_HTO', 'HTO_maxID',
'HTO_secondID', 'HTO_margin', 'HTO_classification.global',
'sample', 'sex_ontology_term_id', 'assay_ontology_term_id',
'suspension_type', 'is_primary_data', 'tissue_ontology_term_id',
'organism_ontology_term_id', 'disease_ontology_term_id', 'assay',
'organism', 'self_reported_ethnicity', 'observation_joinid',
'CHIP', 'LANE', 'ProjectID', 'HTOID',
'nCount_SCT', 'nFeature_SCT', 'pANN',
'development_stage_ontology_term_id', 'cell_type_ontology_term_id',
'self_reported_ethnicity_ontology_term_id']
# inplace=True modifies the DataFrame in place. With inplace=False (the default),
# drop() returns a new DataFrame with the specified columns removed and leaves
# the original DataFrame unchanged.
metadata.drop(columns=columns_to_remove, inplace=True)
metadata
| | nCount_RNA | nFeature_RNA | donor_id | MUTATION | MUTATION.GROUP | percent.mt | scType_celltype | tissue_type | cell_type | disease | sex | tissue | development_stage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 3.803975 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.969349 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 4.029404 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.138810 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 13.945409 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 4.876033 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.510031 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.495584 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 6.130157 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 3.212387 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage |
3540 rows × 13 columns
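With the unneeded columns dropped, a quick way to see which of the remaining columns are already numeric and which still hold labels is pandas' `select_dtypes`. The sketch below uses a small made-up dataframe with a few of the column names from the table above.

```python
import pandas as pd

# Toy stand-in for the trimmed metadata table; values are illustrative.
metadata_toy = pd.DataFrame({
    "nCount_RNA": [1192.0, 849.0],
    "nFeature_RNA": [629, 548],
    "sex": pd.Categorical(["male", "male"]),
    "disease": pd.Categorical(["clonal hematopoiesis", "clonal hematopoiesis"]),
})

# Columns that can be correlated directly.
numeric_cols = metadata_toy.select_dtypes(include="number").columns.tolist()

# Columns that still need to be encoded as numbers.
non_numeric_cols = metadata_toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```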
Looking at the metadata dataframe, some columns contain numerical data and some contain character strings. The columns with character strings will need to be converted to numbers so that they can be used in the correlation analysis. Let's first identify the unique labels within each column.
metadata['sex'].unique()
['male', 'female'] Categories (2, object): ['female', 'male']
Both male and female patients are included in this dataset. Sex will need to be numerically encoded so that it can be used in the downstream correlation analysis.
metadata['male'] = metadata['sex'].apply(lambda x: 1 if x == "male" else 0)
metadata['female'] = metadata['sex'].apply(lambda x: 1 if x == "female" else 0)
metadata
| | nCount_RNA | nFeature_RNA | donor_id | MUTATION | MUTATION.GROUP | percent.mt | scType_celltype | tissue_type | cell_type | disease | sex | tissue | development_stage | male | female |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 3.803975 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.969349 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 4.029404 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.138810 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 13.945409 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 4.876033 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.510031 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.495584 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 6.130157 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 3.212387 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 |
3540 rows × 15 columns
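The same 0/1 encoding can also be written without `apply`, by comparing the whole column at once and casting the boolean result to integers. The sketch below shows this on a made-up three-row Series and is offered as an equivalent alternative rather than a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for metadata['sex'].
sex = pd.Series(pd.Categorical(["male", "female", "male"]), name="sex")

# Element-wise comparison gives booleans; astype(int) turns them into 0/1.
male = (sex == "male").astype(int)
female = (sex == "female").astype(int)

print(male.tolist(), female.tolist())
```

Because every cell is labelled either male or female, the two columns always sum to 1, so a correlation computed against one is simply the negative of the correlation against the other.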
Now let's have a look at the disease variable.
metadata['disease'].unique()
['clonal hematopoiesis', 'normal'] Categories (2, object): ['normal', 'clonal hematopoiesis']
# The disease column has two labels, so it can be encoded as binary (0/1) variables.
metadata['CH'] = metadata['disease'].apply(lambda x: 1 if x == "clonal hematopoiesis" else 0)
metadata['normal'] = metadata['disease'].apply(lambda x: 1 if x == "normal" else 0)
metadata
| | nCount_RNA | nFeature_RNA | donor_id | MUTATION | MUTATION.GROUP | percent.mt | scType_celltype | tissue_type | cell_type | disease | sex | tissue | development_stage | male | female | CH | normal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 3.803975 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 | 1 | 0 |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.969349 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 | 1 | 0 |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 4.029404 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 | 1 | 0 |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.138810 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 | 1 | 0 |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 13.945409 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68-year-old human stage | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 4.876033 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 | 1 | 0 |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.510031 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 | 1 | 0 |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.495584 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 | 1 | 0 |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 6.130157 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 | 1 | 0 |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 3.212387 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78-year-old human stage | 1 | 0 | 1 | 0 |
3540 rows × 17 columns
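To show why a 0/1 encoding is enough for correlation, the sketch below computes a Pearson correlation between a binary disease indicator and a made-up expression vector for six cells. All numbers are invented, and the variable names are illustrative.

```python
import numpy as np

# 1 = clonal hematopoiesis, 0 = normal (toy labels for six cells).
ch_status = np.array([1, 1, 1, 0, 0, 0])

# Made-up expression of one gene in the same six cells.
gene_expr = np.array([2.1, 1.9, 2.4, 0.5, 0.7, 0.4])

# Pearson correlation between the trait and the gene.
r_ch = np.corrcoef(ch_status, gene_expr)[0, 1]

# The complementary 'normal' indicator gives the same value with the sign flipped.
r_normal = np.corrcoef(1 - ch_status, gene_expr)[0, 1]

print(round(r_ch, 3), round(r_normal, 3))
```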
Now let's sort out the development_stage column.
print(metadata['development_stage'].cat.categories)
Index(['39-year-old human stage', '48-year-old human stage',
'50-year-old human stage', '58-year-old human stage',
'60-year-old human stage', '61-year-old human stage',
'65-year-old human stage', '67-year-old human stage',
'68-year-old human stage', '70-year-old human stage',
'71-year-old human stage', '73-year-old human stage',
'74-year-old human stage', '77-year-old human stage',
'78-year-old human stage', '80-year-old human stage',
'81-year-old human stage', '83-year-old human stage',
'85-year-old human stage', '89-year-old human stage',
'91-year-old human stage'],
dtype='object')
# There are 21 categories, one per distinct donor age. Let's encode them numerically.
# Recode development_stage
development_stage_mapping = {
'39-year-old human stage': 39,
'48-year-old human stage': 48,
'50-year-old human stage': 50,
'58-year-old human stage': 58,
'60-year-old human stage': 60,
'61-year-old human stage': 61,
'65-year-old human stage': 65,
'67-year-old human stage': 67,
'68-year-old human stage': 68,
'70-year-old human stage': 70,
'71-year-old human stage': 71,
'73-year-old human stage': 73,
'74-year-old human stage': 74,
'77-year-old human stage': 77,
'78-year-old human stage': 78,
'80-year-old human stage': 80,
'81-year-old human stage': 81,
'83-year-old human stage': 83,
'85-year-old human stage': 85,
'89-year-old human stage': 89,
'91-year-old human stage': 91
}
# Map the labels to ages, then cast the categorical result to plain integers
metadata['development_stage'] = metadata['development_stage'].map(development_stage_mapping).astype(int)
metadata
| | nCount_RNA | nFeature_RNA | donor_id | MUTATION | MUTATION.GROUP | percent.mt | scType_celltype | tissue_type | cell_type | disease | sex | tissue | development_stage | male | female | CH | normal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 3.803975 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.969349 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 4.029404 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.138810 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 13.945409 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 4.876033 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.510031 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.495584 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 6.130157 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 3.212387 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 |
3540 rows × 17 columns
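As an alternative to the hand-written mapping, the age can be parsed directly from each label. The following is only a minimal sketch on a toy series (the three labels are taken from the category list above); it assumes every label starts with the age followed by `-year-old`.

```python
import pandas as pd

# Toy stand-in for metadata['development_stage'] (categorical, as in the dataset)
stages = pd.Series(
    ["39-year-old human stage", "68-year-old human stage", "78-year-old human stage"],
    dtype="category",
)

# Pull the leading digits out of each label and cast them to integers
ages = stages.astype(str).str.extract(r"^(\d+)-year-old", expand=False).astype(int)
print(ages.tolist())  # [39, 68, 78]
```

This avoids having to extend the dictionary when a dataset contains ages that are not listed in it.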
metadata['MUTATION.GROUP'].unique()
['DNMT3A', 'none', 'TET2'] Categories (3, object): ['DNMT3A', 'TET2', 'none']
metadata['DNMT3A'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "DNMT3A" else 0)
metadata['TET2'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "TET2" else 0)
metadata['NoMutation'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "none" else 0)
metadata
| | nCount_RNA | nFeature_RNA | donor_id | MUTATION | MUTATION.GROUP | percent.mt | scType_celltype | tissue_type | cell_type | disease | sex | tissue | development_stage | male | female | CH | normal | DNMT3A | TET2 | NoMutation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 3.803975 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.969349 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 4.029404 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 7.138810 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | DNMT3A | 13.945409 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 4.876033 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.510031 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 5.495584 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 6.130157 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | CH-21-079 | DNMT3A M880V (5%) | DNMT3A | 3.212387 | Naive B cells | tissue | B cell | clonal hematopoiesis | male | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
3540 rows × 20 columns
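The three indicator columns above can also be produced in one step with `pd.get_dummies`. This is a sketch on a toy series rather than the tutorial's own code; renaming `none` to `NoMutation` simply mirrors the column name used above.

```python
import pandas as pd

# Toy stand-in for metadata['MUTATION.GROUP']
groups = pd.Series(["DNMT3A", "none", "TET2", "DNMT3A"], name="MUTATION.GROUP")

# One 0/1 column per mutation group; astype(str) drops any categorical dtype first
dummies = pd.get_dummies(groups.astype(str), dtype=int).rename(columns={"none": "NoMutation"})
print(dummies)
```

Each row contains exactly one 1, so the three columns are mutually exclusive by construction.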
# Drop unnecessary columns
metadata = metadata.drop(['disease', 'MUTATION.GROUP', 'sex'], axis=1)
metadata
| | nCount_RNA | nFeature_RNA | donor_id | MUTATION | percent.mt | scType_celltype | tissue_type | cell_type | tissue | development_stage | male | female | CH | normal | DNMT3A | TET2 | NoMutation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 3.803975 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AACAACCAGGGTTAGC-1 | 849.0 | 548 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 7.969349 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AACCCAAAGGGCCTCT-1 | 2492.0 | 1188 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 4.029404 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AACGAAACACAAAGTA-1 | 1060.0 | 608 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 7.138810 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0002_AAGCGTTTCTTGGGCG-1 | 1270.0 | 716 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 13.945409 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 1925.0 | 1054 | CH-21-079 | DNMT3A M880V (5%) | 4.876033 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TGAATCGAGATTCGAA-1 | 2026.0 | 1097 | CH-21-079 | DNMT3A M880V (5%) | 5.510031 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TGCGATAAGGTAGATT-1 | 1594.0 | 933 | CH-21-079 | DNMT3A M880V (5%) | 5.495584 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TGCTCGTAGGGTTGCA-1 | 1840.0 | 1101 | CH-21-079 | DNMT3A M880V (5%) | 6.130157 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 079_TTCCTCTAGAGCTTTC-1 | 2643.0 | 1197 | CH-21-079 | DNMT3A M880V (5%) | 3.212387 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
3540 rows × 17 columns
#Save the metadata dataframe
metadata.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_metadata.csv', index = True)
metadata = pd.read_csv('data/Bcell_metadata.csv', index_col = 0)
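When saving and reloading, the cell barcodes only survive as the index if `index=True` is used on write and `index_col=0` on read. The sketch below shows that round trip on a toy dataframe; the temporary directory and the file name `toy_metadata.csv` are placeholders for illustration, not paths from the repository.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy metadata with cell barcodes as the index
toy = pd.DataFrame({"development_stage": [68, 78]}, index=["cell_a", "cell_b"])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    out_path = out_dir / "toy_metadata.csv"

    toy.to_csv(out_path, index=True)                    # write the index as the first column
    reloaded = pd.read_csv(out_path, index_col=0)       # read that column back as the index

print(reloaded)
```

Building paths with `pathlib` relative to a single project folder is one way to keep the write and read locations consistent.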
Single-cell data naturally contain many cells from the same donor, so we cannot simply correlate the gene expression data in its current form: doing so would mix within-donor and between-donor correlations. The data must therefore first be pseudobulked, that is, aggregated to one expression profile per donor. This not only speeds up computation but, more importantly, removes the effect of within-sample correlation. Pseudobulking also helps to mitigate issues common in single-cell data, such as dropouts and a high proportion of zero counts.
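To make the point about zero counts concrete, here is a small sketch with made-up numbers (two hypothetical donors, `A` and `B`, and two hypothetical genes). It compares the fraction of zero entries before and after summing the cells of each donor.

```python
import pandas as pd

# Toy cell-level counts: six cells from two donors, mostly zeros
cells = pd.DataFrame(
    {
        "donor_id": ["A", "A", "A", "B", "B", "B"],
        "gene1": [0, 2, 0, 1, 0, 0],
        "gene2": [0, 0, 3, 0, 0, 1],
    }
)

# Fraction of zero entries at the single-cell level
expr = cells.drop(columns="donor_id")
zero_frac_cells = (expr == 0).to_numpy().mean()

# Sum the cells of each donor, then recompute the zero fraction
pseudobulk = cells.groupby("donor_id").sum()
zero_frac_pseudobulk = (pseudobulk == 0).to_numpy().mean()

print(round(zero_frac_cells, 3), zero_frac_pseudobulk)
```

In this toy case two thirds of the cell-level entries are zero, while the donor-level sums contain no zeros at all.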
Pseudobulk the Metadata¶
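First, we reduce the metadata dataframe to one row per donor, since the expression data will be aggregated per donor. Note that per-cell columns such as nCount_RNA and percent.mt will then hold the values of a single cell from that donor only.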
# Convert row names to a column named 'cell_id'
metadata['cell_id'] = metadata.index
# Group by 'donor_id' and select the first row of each group
rows = metadata.groupby('donor_id').first().reset_index()
rows
| | donor_id | nCount_RNA | nFeature_RNA | MUTATION | percent.mt | scType_celltype | tissue_type | cell_type | tissue | development_stage | male | female | CH | normal | DNMT3A | TET2 | NoMutation | cell_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CH-20-001 | 2490.0 | 1403 | DNMT3A R882C | 6.119578 | Naive B cells | tissue | B cell | blood | 60 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 001_AAAGAACGTTCTCAGA-1 |
| 1 | CH-20-002 | 1192.0 | 629 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 3.803975 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0002_AAAGGGCAGCAGCACA-1 |
| 2 | CH-20-004 | 1833.0 | 985 | TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95... | 5.335196 | Naive B cells | tissue | B cell | blood | 85 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 004_AACCTGATCTTTGATC-1 |
| 3 | CH-20-005 | 1966.0 | 886 | TET2 V1900F (2%) | 5.314136 | Naive B cells | tissue | B cell | blood | 58 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 005_AACAACCAGAGCTGAC-1 |
| 4 | CH-21-002 | 1912.0 | 938 | none | 5.657238 | Naive B cells | tissue | B cell | blood | 48 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 002_AAAGGTACACATTGTG-1 |
| 5 | CH-21-006 | 1356.0 | 709 | DNMT3A R882H (13%) | 5.211849 | Naive B cells | tissue | B cell | blood | 67 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 006_AACGAAACAGAGTTCT-1 |
| 6 | CH-21-008 | 1117.0 | 575 | none | 8.398348 | Naive B cells | tissue | B cell | blood | 70 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 008_AACAGGGTCTTCTCAA-1 |
| 7 | CH-21-013 | 1321.0 | 816 | none | 4.663212 | Naive B cells | tissue | B cell | blood | 73 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 013_AACCAACAGGTAGCCA-1 |
| 8 | CH-21-014 | 1064.0 | 623 | SRSF2 P95R (40%), TET2 L957Ifs*15 (51%) | 4.146577 | Naive B cells | tissue | B cell | blood | 74 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 014_AAAGTCCGTTTGACAC-1 |
| 9 | CH-21-017 | 1880.0 | 953 | DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27... | 6.519922 | Naive B cells | tissue | B cell | blood | 65 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 017_AAACGAAAGGCGAACT-1 |
| 10 | CH-21-020 | 5325.0 | 2286 | none | 5.631046 | Naive B cells | tissue | B cell | blood | 61 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 020_AAACGAATCGATTTCT-1 |
| 11 | CH-21-021 | 1671.0 | 943 | none | 3.214286 | Naive B cells | tissue | B cell | blood | 83 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 021_AAAGGTAGTTGTTGAC-1 |
| 12 | CH-21-028 | 1690.0 | 866 | none | 6.053894 | Naive B cells | tissue | B cell | blood | 89 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 028_AAAGTGACATAGACTC-1 |
| 13 | CH-21-029 | 2180.0 | 1073 | TET2 G68X (2%) | 2.570194 | Naive B cells | tissue | B cell | blood | 83 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 029_AAAGGTAAGCCGTTAT-1 |
| 14 | CH-21-031 | 1592.0 | 887 | none | 6.734398 | Naive B cells | tissue | B cell | blood | 78 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 031_AAACGCTAGTTTGTCG-1 |
| 15 | CH-21-033 | 2219.0 | 1138 | TET2 (33%) | 5.670567 | Naive B cells | tissue | B cell | blood | 81 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 033_AAACGCTGTAAGCGGT-1 |
| 16 | CH-21-034 | 2010.0 | 974 | DNMT3A Q816X (8%) | 7.937365 | Naive B cells | tissue | B cell | blood | 39 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 034_AAACCCAAGCGTCTCG-1 |
| 17 | CH-21-036 | 2686.0 | 1337 | DNMT3A splice (7%) | 3.909544 | Naive B cells | tissue | B cell | blood | 91 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 036_AAAGGGCTCCCTCTAG-1 |
| 18 | CH-21-037 | 3546.0 | 1645 | TET2 (6.2%) | 4.473764 | Naive B cells | tissue | B cell | blood | 71 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 037_AAAGGTAAGCGCCATC-1 |
| 19 | CH-21-046 | 1918.0 | 907 | DNMT3A W305X (24%) | 4.807084 | Naive B cells | tissue | B cell | blood | 80 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 046_AACCATGCAGATCATC-1 |
| 20 | CH-21-073 | 2148.0 | 1096 | SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742... | 5.174489 | Naive B cells | tissue | B cell | blood | 77 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 073_AAACGCTGTAACCCGC-1 |
| 21 | CH-21-074 | 1322.0 | 708 | TET2 C1378Y (23%) | 3.328561 | Naive B cells | tissue | B cell | blood | 70 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 074_AATGGCTGTCCAGAAG-1 |
| 22 | CH-21-077 | 1715.0 | 934 | DNMT3A R749C (9.1%) | 6.539510 | Naive B cells | tissue | B cell | blood | 50 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 077_AACAAGAGTAAGTTAG-1 |
| 23 | CH-21-079 | 1354.0 | 793 | DNMT3A M880V (5%) | 6.386293 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 079_AAAGGATCAAGCCCAC-1 |
# Extract row indices corresponding to the first cell from each donor
row_list = []
for i, row in rows.iterrows():
    row_idx = metadata.index.get_loc(row['cell_id'])
    row_list.append(row_idx)
row_list
[100, 0, 187, 276, 145, 421, 473, 663, 842, 915, 1070, 1589, 1651, 1700, 1849, 1992, 2409, 2954, 3028, 3228, 3304, 3337, 3353, 3506]
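The same selection can be sketched without an explicit loop. The example below uses a toy dataframe (the donor IDs `CH-1` and `CH-2` and the cell names are made up) and `drop_duplicates`, which keeps the first row seen for each donor; it is an alternative under these assumptions, not the tutorial's own method.

```python
import pandas as pd

# Toy metadata: four cells from two donors, cell names as the index
toy = pd.DataFrame(
    {"donor_id": ["CH-2", "CH-2", "CH-1", "CH-1"], "development_stage": [68, 68, 60, 60]},
    index=["c1", "c2", "c3", "c4"],
)

# Keep the first cell of each donor, then order the rows by donor_id
per_donor = toy.drop_duplicates(subset="donor_id", keep="first").sort_values("donor_id")

# Integer positions of those cells in the original dataframe
positions = toy.index.get_indexer(per_donor.index)
print(positions.tolist())  # [2, 0]
```

Sorting by `donor_id` reproduces the ordering that `groupby` gives, which is why the positions are not in ascending order.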
# Select the rows for the first cell of each donor
metadata2 = metadata.iloc[row_list, :].copy()
metadata2
| | nCount_RNA | nFeature_RNA | donor_id | MUTATION | percent.mt | scType_celltype | tissue_type | cell_type | tissue | development_stage | male | female | CH | normal | DNMT3A | TET2 | NoMutation | cell_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 001_AAAGAACGTTCTCAGA-1 | 2490.0 | 1403 | CH-20-001 | DNMT3A R882C | 6.119578 | Naive B cells | tissue | B cell | blood | 60 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 001_AAAGAACGTTCTCAGA-1 |
| 0002_AAAGGGCAGCAGCACA-1 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 3.803975 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0002_AAAGGGCAGCAGCACA-1 |
| 004_AACCTGATCTTTGATC-1 | 1833.0 | 985 | CH-20-004 | TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95... | 5.335196 | Naive B cells | tissue | B cell | blood | 85 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 004_AACCTGATCTTTGATC-1 |
| 005_AACAACCAGAGCTGAC-1 | 1966.0 | 886 | CH-20-005 | TET2 V1900F (2%) | 5.314136 | Naive B cells | tissue | B cell | blood | 58 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 005_AACAACCAGAGCTGAC-1 |
| 002_AAAGGTACACATTGTG-1 | 1912.0 | 938 | CH-21-002 | none | 5.657238 | Naive B cells | tissue | B cell | blood | 48 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 002_AAAGGTACACATTGTG-1 |
| 006_AACGAAACAGAGTTCT-1 | 1356.0 | 709 | CH-21-006 | DNMT3A R882H (13%) | 5.211849 | Naive B cells | tissue | B cell | blood | 67 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 006_AACGAAACAGAGTTCT-1 |
| 008_AACAGGGTCTTCTCAA-1 | 1117.0 | 575 | CH-21-008 | none | 8.398348 | Naive B cells | tissue | B cell | blood | 70 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 008_AACAGGGTCTTCTCAA-1 |
| 013_AACCAACAGGTAGCCA-1 | 1321.0 | 816 | CH-21-013 | none | 4.663212 | Naive B cells | tissue | B cell | blood | 73 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 013_AACCAACAGGTAGCCA-1 |
| 014_AAAGTCCGTTTGACAC-1 | 1064.0 | 623 | CH-21-014 | SRSF2 P95R (40%), TET2 L957Ifs*15 (51%) | 4.146577 | Naive B cells | tissue | B cell | blood | 74 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 014_AAAGTCCGTTTGACAC-1 |
| 017_AAACGAAAGGCGAACT-1 | 1880.0 | 953 | CH-21-017 | DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27... | 6.519922 | Naive B cells | tissue | B cell | blood | 65 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 017_AAACGAAAGGCGAACT-1 |
| 020_AAACGAATCGATTTCT-1 | 5325.0 | 2286 | CH-21-020 | none | 5.631046 | Naive B cells | tissue | B cell | blood | 61 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 020_AAACGAATCGATTTCT-1 |
| 021_AAAGGTAGTTGTTGAC-1 | 1671.0 | 943 | CH-21-021 | none | 3.214286 | Naive B cells | tissue | B cell | blood | 83 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 021_AAAGGTAGTTGTTGAC-1 |
| 028_AAAGTGACATAGACTC-1 | 1690.0 | 866 | CH-21-028 | none | 6.053894 | Naive B cells | tissue | B cell | blood | 89 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 028_AAAGTGACATAGACTC-1 |
| 029_AAAGGTAAGCCGTTAT-1 | 2180.0 | 1073 | CH-21-029 | TET2 G68X (2%) | 2.570194 | Naive B cells | tissue | B cell | blood | 83 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 029_AAAGGTAAGCCGTTAT-1 |
| 031_AAACGCTAGTTTGTCG-1 | 1592.0 | 887 | CH-21-031 | none | 6.734398 | Naive B cells | tissue | B cell | blood | 78 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 031_AAACGCTAGTTTGTCG-1 |
| 033_AAACGCTGTAAGCGGT-1 | 2219.0 | 1138 | CH-21-033 | TET2 (33%) | 5.670567 | Naive B cells | tissue | B cell | blood | 81 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 033_AAACGCTGTAAGCGGT-1 |
| 034_AAACCCAAGCGTCTCG-1 | 2010.0 | 974 | CH-21-034 | DNMT3A Q816X (8%) | 7.937365 | Naive B cells | tissue | B cell | blood | 39 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 034_AAACCCAAGCGTCTCG-1 |
| 036_AAAGGGCTCCCTCTAG-1 | 2686.0 | 1337 | CH-21-036 | DNMT3A splice (7%) | 3.909544 | Naive B cells | tissue | B cell | blood | 91 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 036_AAAGGGCTCCCTCTAG-1 |
| 037_AAAGGTAAGCGCCATC-1 | 3546.0 | 1645 | CH-21-037 | TET2 (6.2%) | 4.473764 | Naive B cells | tissue | B cell | blood | 71 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 037_AAAGGTAAGCGCCATC-1 |
| 046_AACCATGCAGATCATC-1 | 1918.0 | 907 | CH-21-046 | DNMT3A W305X (24%) | 4.807084 | Naive B cells | tissue | B cell | blood | 80 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 046_AACCATGCAGATCATC-1 |
| 073_AAACGCTGTAACCCGC-1 | 2148.0 | 1096 | CH-21-073 | SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742... | 5.174489 | Naive B cells | tissue | B cell | blood | 77 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 073_AAACGCTGTAACCCGC-1 |
| 074_AATGGCTGTCCAGAAG-1 | 1322.0 | 708 | CH-21-074 | TET2 C1378Y (23%) | 3.328561 | Naive B cells | tissue | B cell | blood | 70 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 074_AATGGCTGTCCAGAAG-1 |
| 077_AACAAGAGTAAGTTAG-1 | 1715.0 | 934 | CH-21-077 | DNMT3A R749C (9.1%) | 6.539510 | Naive B cells | tissue | B cell | blood | 50 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 077_AACAAGAGTAAGTTAG-1 |
| 079_AAAGGATCAAGCCCAC-1 | 1354.0 | 793 | CH-21-079 | DNMT3A M880V (5%) | 6.386293 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 079_AAAGGATCAAGCCCAC-1 |
metadata2.set_index('donor_id', inplace = True, drop = False)
metadata2
| donor_id (index) | nCount_RNA | nFeature_RNA | donor_id | MUTATION | percent.mt | scType_celltype | tissue_type | cell_type | tissue | development_stage | male | female | CH | normal | DNMT3A | TET2 | NoMutation | cell_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CH-20-001 | 2490.0 | 1403 | CH-20-001 | DNMT3A R882C | 6.119578 | Naive B cells | tissue | B cell | blood | 60 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 001_AAAGAACGTTCTCAGA-1 |
| CH-20-002 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 3.803975 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0002_AAAGGGCAGCAGCACA-1 |
| CH-20-004 | 1833.0 | 985 | CH-20-004 | TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95... | 5.335196 | Naive B cells | tissue | B cell | blood | 85 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 004_AACCTGATCTTTGATC-1 |
| CH-20-005 | 1966.0 | 886 | CH-20-005 | TET2 V1900F (2%) | 5.314136 | Naive B cells | tissue | B cell | blood | 58 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 005_AACAACCAGAGCTGAC-1 |
| CH-21-002 | 1912.0 | 938 | CH-21-002 | none | 5.657238 | Naive B cells | tissue | B cell | blood | 48 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 002_AAAGGTACACATTGTG-1 |
| CH-21-006 | 1356.0 | 709 | CH-21-006 | DNMT3A R882H (13%) | 5.211849 | Naive B cells | tissue | B cell | blood | 67 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 006_AACGAAACAGAGTTCT-1 |
| CH-21-008 | 1117.0 | 575 | CH-21-008 | none | 8.398348 | Naive B cells | tissue | B cell | blood | 70 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 008_AACAGGGTCTTCTCAA-1 |
| CH-21-013 | 1321.0 | 816 | CH-21-013 | none | 4.663212 | Naive B cells | tissue | B cell | blood | 73 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 013_AACCAACAGGTAGCCA-1 |
| CH-21-014 | 1064.0 | 623 | CH-21-014 | SRSF2 P95R (40%), TET2 L957Ifs*15 (51%) | 4.146577 | Naive B cells | tissue | B cell | blood | 74 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 014_AAAGTCCGTTTGACAC-1 |
| CH-21-017 | 1880.0 | 953 | CH-21-017 | DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27... | 6.519922 | Naive B cells | tissue | B cell | blood | 65 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 017_AAACGAAAGGCGAACT-1 |
| CH-21-020 | 5325.0 | 2286 | CH-21-020 | none | 5.631046 | Naive B cells | tissue | B cell | blood | 61 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 020_AAACGAATCGATTTCT-1 |
| CH-21-021 | 1671.0 | 943 | CH-21-021 | none | 3.214286 | Naive B cells | tissue | B cell | blood | 83 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 021_AAAGGTAGTTGTTGAC-1 |
| CH-21-028 | 1690.0 | 866 | CH-21-028 | none | 6.053894 | Naive B cells | tissue | B cell | blood | 89 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 028_AAAGTGACATAGACTC-1 |
| CH-21-029 | 2180.0 | 1073 | CH-21-029 | TET2 G68X (2%) | 2.570194 | Naive B cells | tissue | B cell | blood | 83 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 029_AAAGGTAAGCCGTTAT-1 |
| CH-21-031 | 1592.0 | 887 | CH-21-031 | none | 6.734398 | Naive B cells | tissue | B cell | blood | 78 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 031_AAACGCTAGTTTGTCG-1 |
| CH-21-033 | 2219.0 | 1138 | CH-21-033 | TET2 (33%) | 5.670567 | Naive B cells | tissue | B cell | blood | 81 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 033_AAACGCTGTAAGCGGT-1 |
| CH-21-034 | 2010.0 | 974 | CH-21-034 | DNMT3A Q816X (8%) | 7.937365 | Naive B cells | tissue | B cell | blood | 39 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 034_AAACCCAAGCGTCTCG-1 |
| CH-21-036 | 2686.0 | 1337 | CH-21-036 | DNMT3A splice (7%) | 3.909544 | Naive B cells | tissue | B cell | blood | 91 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 036_AAAGGGCTCCCTCTAG-1 |
| CH-21-037 | 3546.0 | 1645 | CH-21-037 | TET2 (6.2%) | 4.473764 | Naive B cells | tissue | B cell | blood | 71 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 037_AAAGGTAAGCGCCATC-1 |
| CH-21-046 | 1918.0 | 907 | CH-21-046 | DNMT3A W305X (24%) | 4.807084 | Naive B cells | tissue | B cell | blood | 80 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 046_AACCATGCAGATCATC-1 |
| CH-21-073 | 2148.0 | 1096 | CH-21-073 | SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742... | 5.174489 | Naive B cells | tissue | B cell | blood | 77 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 073_AAACGCTGTAACCCGC-1 |
| CH-21-074 | 1322.0 | 708 | CH-21-074 | TET2 C1378Y (23%) | 3.328561 | Naive B cells | tissue | B cell | blood | 70 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 074_AATGGCTGTCCAGAAG-1 |
| CH-21-077 | 1715.0 | 934 | CH-21-077 | DNMT3A R749C (9.1%) | 6.539510 | Naive B cells | tissue | B cell | blood | 50 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 077_AACAAGAGTAAGTTAG-1 |
| CH-21-079 | 1354.0 | 793 | CH-21-079 | DNMT3A M880V (5%) | 6.386293 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 079_AAAGGATCAAGCCCAC-1 |
#Remove the cell_id column
metadata2.drop(columns = 'cell_id', inplace = True)
metadata2
| donor_id (index) | nCount_RNA | nFeature_RNA | donor_id | MUTATION | percent.mt | scType_celltype | tissue_type | cell_type | tissue | development_stage | male | female | CH | normal | DNMT3A | TET2 | NoMutation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CH-20-001 | 2490.0 | 1403 | CH-20-001 | DNMT3A R882C | 6.119578 | Naive B cells | tissue | B cell | blood | 60 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| CH-20-002 | 1192.0 | 629 | CH-20-002 | DNMT3A R729W (4%), DNMT3A R736C (2%) | 3.803975 | Naive B cells | tissue | B cell | blood | 68 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| CH-20-004 | 1833.0 | 985 | CH-20-004 | TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95... | 5.335196 | Naive B cells | tissue | B cell | blood | 85 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| CH-20-005 | 1966.0 | 886 | CH-20-005 | TET2 V1900F (2%) | 5.314136 | Naive B cells | tissue | B cell | blood | 58 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| CH-21-002 | 1912.0 | 938 | CH-21-002 | none | 5.657238 | Naive B cells | tissue | B cell | blood | 48 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| CH-21-006 | 1356.0 | 709 | CH-21-006 | DNMT3A R882H (13%) | 5.211849 | Naive B cells | tissue | B cell | blood | 67 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| CH-21-008 | 1117.0 | 575 | CH-21-008 | none | 8.398348 | Naive B cells | tissue | B cell | blood | 70 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| CH-21-013 | 1321.0 | 816 | CH-21-013 | none | 4.663212 | Naive B cells | tissue | B cell | blood | 73 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| CH-21-014 | 1064.0 | 623 | CH-21-014 | SRSF2 P95R (40%), TET2 L957Ifs*15 (51%) | 4.146577 | Naive B cells | tissue | B cell | blood | 74 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| CH-21-017 | 1880.0 | 953 | CH-21-017 | DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27... | 6.519922 | Naive B cells | tissue | B cell | blood | 65 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| CH-21-020 | 5325.0 | 2286 | CH-21-020 | none | 5.631046 | Naive B cells | tissue | B cell | blood | 61 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| CH-21-021 | 1671.0 | 943 | CH-21-021 | none | 3.214286 | Naive B cells | tissue | B cell | blood | 83 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| CH-21-028 | 1690.0 | 866 | CH-21-028 | none | 6.053894 | Naive B cells | tissue | B cell | blood | 89 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| CH-21-029 | 2180.0 | 1073 | CH-21-029 | TET2 G68X (2%) | 2.570194 | Naive B cells | tissue | B cell | blood | 83 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| CH-21-031 | 1592.0 | 887 | CH-21-031 | none | 6.734398 | Naive B cells | tissue | B cell | blood | 78 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| CH-21-033 | 2219.0 | 1138 | CH-21-033 | TET2 (33%) | 5.670567 | Naive B cells | tissue | B cell | blood | 81 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| CH-21-034 | 2010.0 | 974 | CH-21-034 | DNMT3A Q816X (8%) | 7.937365 | Naive B cells | tissue | B cell | blood | 39 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| CH-21-036 | 2686.0 | 1337 | CH-21-036 | DNMT3A splice (7%) | 3.909544 | Naive B cells | tissue | B cell | blood | 91 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| CH-21-037 | 3546.0 | 1645 | CH-21-037 | TET2 (6.2%) | 4.473764 | Naive B cells | tissue | B cell | blood | 71 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| CH-21-046 | 1918.0 | 907 | CH-21-046 | DNMT3A W305X (24%) | 4.807084 | Naive B cells | tissue | B cell | blood | 80 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| CH-21-073 | 2148.0 | 1096 | CH-21-073 | SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742... | 5.174489 | Naive B cells | tissue | B cell | blood | 77 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| CH-21-074 | 1322.0 | 708 | CH-21-074 | TET2 C1378Y (23%) | 3.328561 | Naive B cells | tissue | B cell | blood | 70 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| CH-21-077 | 1715.0 | 934 | CH-21-077 | DNMT3A R749C (9.1%) | 6.539510 | Naive B cells | tissue | B cell | blood | 50 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| CH-21-079 | 1354.0 | 793 | CH-21-079 | DNMT3A M880V (5%) | 6.386293 | Naive B cells | tissue | B cell | blood | 78 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
#Save the metadata
metadata2.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_metadata_pseudobulk.csv', index = True)
The metadata dataframe for the pseudobulk is now complete.
Pseudobulk the Corresponding Data¶
Let's proceed to aggregate the gene expression data. This involves summing the expression values of each gene across all cells from the same donor.
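As a rough sketch of what this aggregation looks like, the toy example below sums expression per donor and then reorders the result to match a donor-level metadata table. The donor IDs `CH-A` and `CH-B`, the cell names, and the values are invented for illustration; only the gene names come from the dataset.

```python
import pandas as pd

# Toy expression matrix: cells as rows, genes as columns
expr = pd.DataFrame(
    {"ISG15": [0.0, 1.0, 2.0], "TSC22D3": [1.5, 0.5, 0.0]},
    index=["cell1", "cell2", "cell3"],
)

# Donor of each cell, indexed by the same cell names
donors = pd.Series(["CH-B", "CH-A", "CH-B"], index=expr.index, name="donor_id")

# Sum every gene over the cells of each donor
pseudobulk = expr.groupby(donors).sum()

# Toy donor-level metadata; reindex so both tables list donors in the same order
meta = pd.DataFrame({"CH": [1, 0]}, index=pd.Index(["CH-A", "CH-B"], name="donor_id"))
pseudobulk = pseudobulk.reindex(meta.index)

print(pseudobulk)
```

Checking that the expression and metadata tables share the same donor order is worthwhile before any correlation with traits is computed.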
First, the gene expression matrix needs to be extracted from our AnnData object (Bcell).
Single-cell expression data are stored as a sparse matrix, which must be converted to a dense matrix before it can be turned into a dataframe.
# Convert the sparse matrix to a dense matrix
dense_matrix = Bcell.X.todense()
datExpr = pd.DataFrame(dense_matrix, index=Bcell.obs_names, columns=Bcell.var_names)
datExpr
| feature_name | MIR1302-2HG | FAM138A | OR4F5 | OR4F29 | OR4F16 | LINC01409 | FAM87B | LINC01128 | LINC00115 | FAM41C | ... | BPY2B | DAZ3 | DAZ4 | BPY2C | TTTY4C | TTTY17C | SEPTIN14P23 | CDY1 | TTTY3 | MAFIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0002_AACAACCAGGGTTAGC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0002_AACCCAAAGGGCCTCT-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0002_AACGAAACACAAAGTA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0002_AAGCGTTTCTTGGGCG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 079_TGAATCGAGATTCGAA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 079_TGCGATAAGGTAGATT-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 079_TGCTCGTAGGGTTGCA-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 079_TTCCTCTAGAGCTTTC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3540 rows × 25198 columns
# Save the single-cell expression dataframe (datExpr)
datExpr.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_singlecell.csv', index = True)
Highly variable genes tend to be the most informative, so we use them to filter the expression matrix further. This also reduces the dimensionality of the data, making downstream analyses more computationally efficient.
hvg = Bcell.var_names[Bcell.var['highly_variable']]
hvg
CategoricalIndex(['ISG15', 'LINC01342', 'TTLL10-AS1', 'TNFRSF18', 'CALML6',
'CHD5', 'ICMT-DT', 'MIR34AHG', 'RBP7', 'MTOR-AS1',
...
'FRMPD3', 'TSC22D3', 'KLHL13', 'AKAP14', 'RHOXF1-AS1',
'TMEM255A', 'SMIM10L2B-AS1', 'IL9R_ENSG00000124334', 'DDX3Y',
'EIF1AY'],
categories=['A1BG', 'A1BG-AS1', 'A1CF', 'A2M', 'A2M-AS1', 'A2ML1', 'A2ML1-AS1', 'A2ML1-AS2', ...], ordered=False, dtype='category', name='feature_name', length=1000)
datExpr = datExpr.loc[:,hvg]
datExpr
| feature_name | ISG15 | LINC01342 | TTLL10-AS1 | TNFRSF18 | CALML6 | CHD5 | ICMT-DT | MIR34AHG | RBP7 | MTOR-AS1 | ... | FRMPD3 | TSC22D3 | KLHL13 | AKAP14 | RHOXF1-AS1 | TMEM255A | SMIM10L2B-AS1 | IL9R_ENSG00000124334 | DDX3Y | EIF1AY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002_AAAGGGCAGCAGCACA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.513502 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 0002_AACAACCAGGGTTAGC-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.583828 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 0002_AACCCAAAGGGCCTCT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.344490 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.271980 | 0.960117 |
| 0002_AACGAAACACAAAGTA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.207486 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 0002_AAGCGTTTCTTGGGCG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.435951 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.157864 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.582282 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.038052 | 0.000000 |
| 079_TGAATCGAGATTCGAA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.445902 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.022813 | 0.000000 |
| 079_TGCGATAAGGTAGATT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.282646 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.282646 | 0.000000 |
| 079_TGCTCGTAGGGTTGCA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.245300 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 079_TTCCTCTAGAGCTTTC-1 | 0.942032 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.516629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.942032 | 0.942032 |
3540 rows × 1000 columns
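The filtering step above can be checked on a small scale. The following is a minimal sketch on toy data; the gene names, cell names and values are made up for illustration, and `var` is only a stand-in for `Bcell.var`. It shows the same pattern of boolean flag, then gene names, then column subset, followed by a sanity check that no cells were lost.

```python
import pandas as pd

# Toy expression matrix: 3 cells x 4 genes (hypothetical values)
expr = pd.DataFrame(
    [[0.0, 1.2, 0.0, 0.5],
     [0.3, 0.0, 0.0, 1.1],
     [0.0, 2.0, 0.0, 0.0]],
    index=["cell_1", "cell_2", "cell_3"],
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# Toy per-gene annotation, standing in for Bcell.var['highly_variable']
var = pd.DataFrame(
    {"highly_variable": [False, True, False, True]},
    index=expr.columns,
)

# Boolean flag -> names of flagged genes -> column subset
hvg_toy = var.index[var["highly_variable"].to_numpy()]
expr_hvg = expr.loc[:, hvg_toy]

# Sanity checks: every cell is kept, and only flagged genes remain
assert expr_hvg.shape[0] == expr.shape[0]
assert expr_hvg.shape[1] == int(var["highly_variable"].sum())
print(expr_hvg)
```

A check of this kind on the real data would confirm that the number of columns equals the number of flagged genes (1000 here) and that the number of rows is unchanged.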
Add the donor_id column to the gene expression dataframe so that we know which donor each cell came from.
# Reset the index of 'datExpr' DataFrame to make the row names (cell names) a column
datExpr_donor = datExpr.reset_index()
datExpr_donor
| feature_name | index | ISG15 | LINC01342 | TTLL10-AS1 | TNFRSF18 | CALML6 | CHD5 | ICMT-DT | MIR34AHG | RBP7 | ... | FRMPD3 | TSC22D3 | KLHL13 | AKAP14 | RHOXF1-AS1 | TMEM255A | SMIM10L2B-AS1 | IL9R_ENSG00000124334 | DDX3Y | EIF1AY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002_AAAGGGCAGCAGCACA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.513502 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 1 | 0002_AACAACCAGGGTTAGC-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.583828 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 2 | 0002_AACCCAAAGGGCCTCT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.344490 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.271980 | 0.960117 |
| 3 | 0002_AACGAAACACAAAGTA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.207486 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 4 | 0002_AAGCGTTTCTTGGGCG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.435951 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.157864 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3535 | 079_TCTCCGAAGCTATCTG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.582282 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.038052 | 0.000000 |
| 3536 | 079_TGAATCGAGATTCGAA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.445902 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.022813 | 0.000000 |
| 3537 | 079_TGCGATAAGGTAGATT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.282646 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.282646 | 0.000000 |
| 3538 | 079_TGCTCGTAGGGTTGCA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.245300 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 3539 | 079_TTCCTCTAGAGCTTTC-1 | 0.942032 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.516629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.942032 | 0.942032 |
3540 rows × 1001 columns
# Merge 'datExpr_donor' with 'metadata', matching the 'index' column to 'cell_id'
datExpr_donor = pd.merge(datExpr_donor, metadata[['cell_id', 'donor_id']], left_on='index', right_on='cell_id', how='left')
datExpr_donor
| | index | ISG15 | LINC01342 | TTLL10-AS1 | TNFRSF18 | CALML6 | CHD5 | ICMT-DT | MIR34AHG | RBP7 | ... | KLHL13 | AKAP14 | RHOXF1-AS1 | TMEM255A | SMIM10L2B-AS1 | IL9R_ENSG00000124334 | DDX3Y | EIF1AY | cell_id | donor_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002_AAAGGGCAGCAGCACA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0002_AAAGGGCAGCAGCACA-1 | CH-20-002 |
| 1 | 0002_AACAACCAGGGTTAGC-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0002_AACAACCAGGGTTAGC-1 | CH-20-002 |
| 2 | 0002_AACCCAAAGGGCCTCT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.271980 | 0.960117 | 0002_AACCCAAAGGGCCTCT-1 | CH-20-002 |
| 3 | 0002_AACGAAACACAAAGTA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0002_AACGAAACACAAAGTA-1 | CH-20-002 |
| 4 | 0002_AAGCGTTTCTTGGGCG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.157864 | 0.000000 | 0002_AAGCGTTTCTTGGGCG-1 | CH-20-002 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3535 | 079_TCTCCGAAGCTATCTG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.038052 | 0.000000 | 079_TCTCCGAAGCTATCTG-1 | CH-21-079 |
| 3536 | 079_TGAATCGAGATTCGAA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.022813 | 0.000000 | 079_TGAATCGAGATTCGAA-1 | CH-21-079 |
| 3537 | 079_TGCGATAAGGTAGATT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.282646 | 0.000000 | 079_TGCGATAAGGTAGATT-1 | CH-21-079 |
| 3538 | 079_TGCTCGTAGGGTTGCA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 079_TGCTCGTAGGGTTGCA-1 | CH-21-079 |
| 3539 | 079_TTCCTCTAGAGCTTTC-1 | 0.942032 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.942032 | 0.942032 | 079_TTCCTCTAGAGCTTTC-1 | CH-21-079 |
3540 rows × 1003 columns
# Set the cell names as the index again
datExpr_donor.set_index('index', inplace=True)
datExpr_donor
| | ISG15 | LINC01342 | TTLL10-AS1 | TNFRSF18 | CALML6 | CHD5 | ICMT-DT | MIR34AHG | RBP7 | MTOR-AS1 | ... | KLHL13 | AKAP14 | RHOXF1-AS1 | TMEM255A | SMIM10L2B-AS1 | IL9R_ENSG00000124334 | DDX3Y | EIF1AY | cell_id | donor_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||||||||||||||
| 0002_AAAGGGCAGCAGCACA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0002_AAAGGGCAGCAGCACA-1 | CH-20-002 |
| 0002_AACAACCAGGGTTAGC-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0002_AACAACCAGGGTTAGC-1 | CH-20-002 |
| 0002_AACCCAAAGGGCCTCT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.271980 | 0.960117 | 0002_AACCCAAAGGGCCTCT-1 | CH-20-002 |
| 0002_AACGAAACACAAAGTA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0002_AACGAAACACAAAGTA-1 | CH-20-002 |
| 0002_AAGCGTTTCTTGGGCG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.157864 | 0.000000 | 0002_AAGCGTTTCTTGGGCG-1 | CH-20-002 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.038052 | 0.000000 | 079_TCTCCGAAGCTATCTG-1 | CH-21-079 |
| 079_TGAATCGAGATTCGAA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.022813 | 0.000000 | 079_TGAATCGAGATTCGAA-1 | CH-21-079 |
| 079_TGCGATAAGGTAGATT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.282646 | 0.000000 | 079_TGCGATAAGGTAGATT-1 | CH-21-079 |
| 079_TGCTCGTAGGGTTGCA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 079_TGCTCGTAGGGTTGCA-1 | CH-21-079 |
| 079_TTCCTCTAGAGCTTTC-1 | 0.942032 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.942032 | 0.942032 | 079_TTCCTCTAGAGCTTTC-1 | CH-21-079 |
3540 rows × 1002 columns
# Remove the 'cell_id' column if needed
datExpr_donor.drop(columns=['cell_id'], inplace=True)
datExpr_donor
| | ISG15 | LINC01342 | TTLL10-AS1 | TNFRSF18 | CALML6 | CHD5 | ICMT-DT | MIR34AHG | RBP7 | MTOR-AS1 | ... | TSC22D3 | KLHL13 | AKAP14 | RHOXF1-AS1 | TMEM255A | SMIM10L2B-AS1 | IL9R_ENSG00000124334 | DDX3Y | EIF1AY | donor_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||||||||||||||
| 0002_AAAGGGCAGCAGCACA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.513502 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | CH-20-002 |
| 0002_AACAACCAGGGTTAGC-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.583828 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | CH-20-002 |
| 0002_AACCCAAAGGGCCTCT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.344490 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.271980 | 0.960117 | CH-20-002 |
| 0002_AACGAAACACAAAGTA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.207486 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | CH-20-002 |
| 0002_AAGCGTTTCTTGGGCG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.435951 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.157864 | 0.000000 | CH-20-002 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 079_TCTCCGAAGCTATCTG-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.582282 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.038052 | 0.000000 | CH-21-079 |
| 079_TGAATCGAGATTCGAA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.445902 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.022813 | 0.000000 | CH-21-079 |
| 079_TGCGATAAGGTAGATT-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.282646 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.282646 | 0.000000 | CH-21-079 |
| 079_TGCTCGTAGGGTTGCA-1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.245300 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | CH-21-079 |
| 079_TTCCTCTAGAGCTTTC-1 | 0.942032 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.516629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.942032 | 0.942032 | CH-21-079 |
3540 rows × 1001 columns
#Save the expression matrix with donor_id
datExpr_donor.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_donorid_singlecell.csv', index = True)
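The reset, merge, set-index and drop steps above can also be expressed as a single join on the index. The following is a sketch on toy data, not the notebook's own method; `expr` and `meta` are hypothetical stand-ins for `datExpr` and `metadata`. It also checks that every cell received a donor, which a left merge does not guarantee on its own.

```python
import pandas as pd

# Toy stand-ins for datExpr (cells x genes) and metadata (one row per cell)
expr = pd.DataFrame(
    {"GENE_A": [0.0, 1.5, 0.7], "GENE_B": [1.2, 0.0, 0.0]},
    index=["c1", "c2", "c3"],
)
meta = pd.DataFrame(
    {"cell_id": ["c1", "c2", "c3"], "donor_id": ["D1", "D1", "D2"]}
)

# Build a cell_id -> donor_id lookup; duplicated cell IDs would duplicate rows
donor_lookup = meta.set_index("cell_id")["donor_id"]
assert donor_lookup.index.is_unique

# Left join on the index keeps every cell and appends a donor_id column
expr_donor = expr.join(donor_lookup, how="left")

# A left join leaves NaN where no match is found, so check for unmatched cells
assert expr_donor["donor_id"].notna().all()
print(expr_donor)
```

Because the join works directly on the index, the cell names never leave the index and no temporary `cell_id` column has to be dropped afterwards.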
With the gene expression dataframe annotated by donor, we can aggregate the data per donor (pseudobulking).
# Aggregate expression by donor ID (summing the values)
pseudobulk_df = datExpr_donor.groupby('donor_id').sum()
pseudobulk_df
| | ISG15 | LINC01342 | TTLL10-AS1 | TNFRSF18 | CALML6 | CHD5 | ICMT-DT | MIR34AHG | RBP7 | MTOR-AS1 | ... | FRMPD3 | TSC22D3 | KLHL13 | AKAP14 | RHOXF1-AS1 | TMEM255A | SMIM10L2B-AS1 | IL9R_ENSG00000124334 | DDX3Y | EIF1AY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| donor_id | |||||||||||||||||||||
| CH-20-001 | 6.380902 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 53.239479 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21.632603 | 17.641195 |
| CH-20-002 | 12.606750 | 2.33599 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.089918 | 0.000000 | 1.158743 | 1.173824 | ... | 0.000000 | 112.643967 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 45.432411 | 22.809191 |
| CH-20-004 | 12.302510 | 0.00000 | 0.000000 | 21.512184 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 42.873409 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.570595 | 20.173725 |
| CH-20-005 | 18.603716 | 1.16925 | 1.232658 | 4.975880 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.112746 | 190.337738 | 0.000000 | 0.0000 | 1.191559 | 0.000000 | 0.000000 | 0.000000 | 6.931139 | 1.071742 |
| CH-21-002 | 13.705297 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 44.942261 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 1.323198 | 0.000000 | 0.000000 |
| CH-21-006 | 4.377715 | 0.00000 | 0.000000 | 23.782143 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 1.023552 | 0.000000 | ... | 0.000000 | 12.741602 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.349793 | 12.981407 |
| CH-21-008 | 18.058025 | 0.00000 | 0.000000 | 44.614342 | 0.00000 | 1.201673 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 76.893723 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 1.080360 | 1.188176 | 2.377049 |
| CH-21-013 | 21.395964 | 0.00000 | 0.000000 | 30.426510 | 0.00000 | 0.000000 | 1.235703 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 54.458328 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 1.117969 | 1.236817 | 23.543072 | 53.250420 |
| CH-21-014 | 13.436963 | 0.00000 | 0.000000 | 11.067089 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 32.248600 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 17.709280 | 21.636190 |
| CH-21-017 | 22.916807 | 0.00000 | 0.000000 | 9.076924 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 2.478934 | 0.000000 | ... | 0.000000 | 188.600067 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 49.114601 | 40.454937 |
| CH-21-020 | 197.794693 | 0.00000 | 0.000000 | 122.788269 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 1.047435 | 0.000000 | ... | 0.000000 | 197.616577 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.765914 | 88.202682 | 173.938080 |
| CH-21-021 | 13.898113 | 0.00000 | 0.000000 | 11.169237 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 20.431047 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 11.017612 | 21.158054 |
| CH-21-028 | 7.210576 | 0.00000 | 1.066841 | 1.321003 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 57.428059 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.989964 | 0.000000 |
| CH-21-029 | 9.007506 | 0.00000 | 0.000000 | 1.928463 | 1.21185 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 157.941895 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.745571 | 2.051687 |
| CH-21-031 | 30.211197 | 0.00000 | 0.000000 | 40.325451 | 0.00000 | 0.000000 | 0.000000 | 2.130981 | 0.000000 | 0.000000 | ... | 1.244156 | 12.550498 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.227465 | 0.906813 |
| CH-21-033 | 21.972580 | 0.00000 | 0.000000 | 84.504501 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 152.658005 | 1.167664 | 1.2505 | 0.000000 | 1.199426 | 0.000000 | 0.000000 | 45.661453 | 142.600739 |
| CH-21-034 | 54.934029 | 0.00000 | 0.000000 | 147.552780 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.886594 | 167.975906 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.075679 |
| CH-21-036 | 17.018766 | 0.00000 | 0.000000 | 2.483573 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 88.924919 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 39.051266 | 10.004631 |
| CH-21-037 | 150.473450 | 0.00000 | 0.000000 | 53.255013 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 38.325649 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 33.786545 | 58.778214 |
| CH-21-046 | 9.337872 | 0.00000 | 0.000000 | 28.949800 | 0.00000 | 1.123670 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 28.600826 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.238980 | 12.119887 |
| CH-21-073 | 4.982193 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 40.201653 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 27.179262 | 2.954510 |
| CH-21-074 | 3.954194 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.127058 | 0.000000 | 0.000000 | ... | 0.000000 | 18.354240 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.401694 | 2.069999 |
| CH-21-077 | 33.969109 | 0.00000 | 0.000000 | 3.333775 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 1.115637 | 0.000000 | ... | 0.000000 | 161.007568 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.646068 | 0.000000 |
| CH-21-079 | 7.030363 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 40.449474 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 18.332941 | 11.980942 |
24 rows × 1000 columns
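To make the aggregation concrete, here is a minimal sketch of the same `groupby(...).sum()` pattern on toy data (the gene names, donors and values are hypothetical). It also counts the cells per donor, because a summed value grows with the number of cells a donor contributes, which is worth keeping in mind when comparing donors.

```python
import pandas as pd

# Toy per-cell expression with a donor_id column (hypothetical values)
expr_donor = pd.DataFrame(
    {
        "GENE_A": [1.0, 2.0, 0.5],
        "GENE_B": [0.0, 1.0, 3.0],
        "donor_id": ["D1", "D1", "D2"],
    },
    index=["c1", "c2", "c3"],
)

grouped = expr_donor.groupby("donor_id")

# One row per donor: expression summed over that donor's cells
pseudobulk = grouped.sum()

# Number of cells behind each pseudobulk row
cells_per_donor = grouped.size()

print(pseudobulk)
print(cells_per_donor)
```

In this toy case donor D1 has two cells and D2 has one, so the D1 row is the sum of two cells while the D2 row is a single cell's values.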
#Save the pseudobulk expression matrix with donor_id
pseudobulk_df.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_pseudobulk.csv', index = True)
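When a file saved with `index=True` is read back later, the first column has to be restored as the index, otherwise the donor IDs come back as an ordinary data column. A small sketch of the round trip follows; it uses an in-memory buffer in place of the real file path, and toy values.

```python
import io

import pandas as pd

# Toy pseudobulk matrix with donor IDs as the index (hypothetical values)
pseudobulk = pd.DataFrame(
    {"GENE_A": [3.0, 0.5], "GENE_B": [1.0, 3.0]},
    index=pd.Index(["D1", "D2"], name="donor_id"),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
pseudobulk.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the index
reloaded = pd.read_csv(buffer, index_col=0)

assert reloaded.index.name == "donor_id"
assert reloaded.shape == pseudobulk.shape
print(reloaded)
```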
We now have the pseudobulked data and the corresponding metadata dataframe, ready to start the correlation network analysis.