How to find, load and process scRNA-seq data

In [1]:

#import libraries
import wget
import pandas as pd
import numpy as np
import scanpy as sc
import anndata

Gene network analysis is a method for identifying sub-networks (modules) of correlated genes, which are likely to be co-expressed. This can help to identify modules of genes that contribute to disease. In this example, we will cover how to create a pairwise correlation matrix of genes, as well as how to associate them with disease.
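
As a minimal sketch of the idea, a pairwise gene correlation matrix can be computed with pandas on a small, made-up expression table. The gene names and values below are purely illustrative and are not taken from the dataset used in this tutorial.

```python
import pandas as pd

# Toy expression matrix: rows are cells, columns are genes (illustrative values only)
expression = pd.DataFrame(
    {
        "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
        "GENE_B": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with GENE_A
        "GENE_C": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with GENE_A
    }
)

# Pearson correlation between every pair of genes (columns)
gene_correlation = expression.corr(method="pearson")
print(gene_correlation.round(2))
```

The result is a symmetric genes × genes matrix. Groups of genes with high mutual correlation are candidates for a module.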

First, we will cover how to find, load and process the scRNA-seq data.

Find a Dataset

For this tutorial, we will use a freely available, open-access dataset generated from human peripheral blood mononuclear cells (PBMCs) from patients with clonal hematopoiesis and from controls. The dataset is available from the cellxgene portal as the collection "Multiomic Profiling of Human Clonal Hematopoiesis Reveals Genotype and Cell-Specific Inflammatory Pathway Activation": https://cellxgene.cziscience.com/collections/0aab20b3-c30c-4606-bd2e-d20dae739c45 The associated paper has the same title and is available at: https://ashpublications.org/bloodadvances/article/8/14/3665/515374/Multiomic-profiling-of-human-clonal-hematopoiesis

scRNA-seq was performed for patients with clonal hematopoiesis and for controls. This dataset was chosen for its compatibility with the purpose of the pipeline. The data will be available in the data/test/ directory, and the dataset is stored in h5ad format. By the end of this section, we will have loaded and explored the dataset.

Download a Dataset

Start by downloading the dataset from the original portal. Note that this step is optional: to save time, the filtered dataset has already been placed in the GitHub repository within /dataset.
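
Because the download is optional, it can be guarded so that it only runs when the file is not already present. The helper below is a hypothetical sketch, not part of the original notebook: the name download_if_missing is made up, and it uses the standard library's urllib in place of wget.

```python
import os
import urllib.request


def download_if_missing(url, destination_path):
    """Download url to destination_path unless the file already exists.

    Returns True if a download was performed and False if it was skipped.
    """
    if os.path.exists(destination_path):
        return False
    # Create the parent directory if needed, then fetch the file
    os.makedirs(os.path.dirname(destination_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, destination_path)
    return True
```

It would be called with the same url and destination_path values that are defined in the download cell.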

In [ ]:

# URL of the dataset
url = "https://datasets.cellxgene.cziscience.com/6094cddd-de51-4891-8841-43e25120c336.h5ad"

# Destination path where the dataset will be saved
destination_path = "/ReCoDE-Gene-Network-Analysis/data/data/pbmc.h5ad"

# Download the dataset
wget.download(url, destination_path)

#Alternatively, the dataset can be found in the directory stated in the next cell.

In [ ]:

# Load the dataset
#Please be aware that you will have to personally download the dataset to work with
pbmc = sc.read("/ReCoDE-Gene-Network-Analysis/data/data/pbmc.h5ad")

In [4]:

#inspect the loaded data
pbmc

Out[4]:

AnnData object with n_obs × n_vars = 66985 × 36263
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

As can be seen, there are 66,985 cells in the dataset. For these exercises, we will filter the dataset further to focus on one cell type, which also reduces its size and makes it easier to work with.

In [5]:

pbmc.obs['cell_type']

Out[5]:

0002_AAACCCACAAGTCCCG-1                      CD4-positive, alpha-beta T cell
0002_AAACCCAGTAGTCGTT-1    CD16-positive, CD56-dim natural killer cell, h...
0002_AAACCCATCTACACAG-1                                       dendritic cell
0002_AAACGAAAGAATTTGG-1                      CD4-positive, alpha-beta T cell
0002_AAACGCTAGCGACTGA-1                               CD14-positive monocyte
                                                 ...                        
079_TTTGGAGTCAGAGTGG-1                                CD14-positive monocyte
079_TTTGGAGTCGACATAC-1                                CD14-positive monocyte
079_TTTGGTTAGGTTATAG-1     CD16-positive, CD56-dim natural killer cell, h...
079_TTTGGTTCACACCAGC-1                                   natural killer cell
079_TTTGTTGGTTGTTGCA-1                       CD4-positive, alpha-beta T cell
Name: cell_type, Length: 66985, dtype: category
Categories (9, object): ['platelet', 'B cell', 'dendritic cell', 'natural killer cell', ..., 'CD8-positive, alpha-beta T cell', 'erythroid lineage cell', 'CD16-positive, CD56-dim natural killer cell, ..., 'CD14-positive monocyte']

As can be seen, the dataset contains nine annotated cell types. We shall focus on B cells for our exercises.
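
Before choosing a cell type, it can help to check how many cells each type contributes. The sketch below uses a small made-up dataframe as a stand-in for pbmc.obs, so the labels and counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for pbmc.obs with a categorical cell_type column (illustrative labels only)
obs = pd.DataFrame(
    {"cell_type": pd.Categorical(["B cell", "platelet", "B cell", "dendritic cell", "B cell"])}
)

# Number of cells per annotated cell type, largest first
cells_per_type = obs["cell_type"].value_counts()
print(cells_per_type)

# Boolean mask selecting one cell type, of the kind used to subset the AnnData object
is_b_cell = obs["cell_type"] == "B cell"
print(int(is_b_cell.sum()))
```

On the real object, the same calls would be applied to pbmc.obs['cell_type'].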

In [6]:

# Filter the AnnData object for B cells
Bcell = pbmc[pbmc.obs['cell_type'] == 'B cell']

In [7]:

Bcell

Out[7]:

View of AnnData object with n_obs × n_vars = 3540 × 36263
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

In [8]:

#Check if the gene names are in the correct format of gene symbols and not Ensembl IDs which are also common.
Bcell.var

Out[8]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length
ENSG00000243485	False	MIR1302-2HG	NCBITaxon:9606	gene	1021
ENSG00000237613	False	FAM138A	NCBITaxon:9606	gene	1219
ENSG00000186092	False	OR4F5	NCBITaxon:9606	gene	2618
ENSG00000238009	False	ENSG00000238009.6	NCBITaxon:9606	gene	3726
ENSG00000239945	False	ENSG00000239945.1	NCBITaxon:9606	gene	1319
...	...	...	...	...	...
ENSG00000277836	False	ENSG00000277836.1	NCBITaxon:9606	gene	288
ENSG00000278633	False	ENSG00000278633.1	NCBITaxon:9606	gene	2404
ENSG00000276017	False	ENSG00000276017.1	NCBITaxon:9606	gene	2404
ENSG00000278817	False	ENSG00000278817.1	NCBITaxon:9606	gene	1213
ENSG00000277196	False	ENSG00000277196.4	NCBITaxon:9606	gene	2405

36263 rows × 5 columns

As can be seen from the gene features dataframe, the rows are currently indexed by Ensembl gene IDs. These are not easy to interpret in our analyses: you would need to look up each Ensembl ID to identify that gene's name and function. The second column, feature_name, shows that the original authors have already mapped the Ensembl IDs to gene symbols.

Process the Dataset to Correct Format for Analysis

In [9]:

#Let's go ahead and map the values in the feature_name column to the rownames of the dataframe:
# Set the "feature_name" column as the index (row names)
Bcell.var.set_index("feature_name", drop = False, inplace=True)

It is important to note that not all Ensembl IDs map to gene symbols, as can be seen at the top of the dataframe, where some feature_name entries are still Ensembl IDs. Since these genes would be difficult to interpret in subsequent analyses, we shall remove them from the dataframe.
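
The same filter can also be built directly from the feature_name column, which does not depend on the index having been replaced first. This is a sketch on a toy dataframe that mimics the first rows of Bcell.var. It is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for Bcell.var: Ensembl IDs as the index, gene symbols in feature_name.
# Unmapped genes keep a versioned Ensembl ID as their feature_name.
var = pd.DataFrame(
    {"feature_name": ["MIR1302-2HG", "FAM138A", "ENSG00000238009.6", "OR4F5"]},
    index=["ENSG00000243485", "ENSG00000237613", "ENSG00000238009", "ENSG00000186092"],
)

# True for genes that have a gene symbol, False for unmapped Ensembl IDs
has_symbol = ~var["feature_name"].astype(str).str.startswith("ENSG")

print(f"{has_symbol.sum()} of {len(var)} genes have a gene symbol")
print(var.loc[has_symbol, "feature_name"].tolist())
```

Note that a prefix test like this assumes no genuine gene symbol starts with "ENSG".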

In [10]:

# Build a boolean mask that is False for genes whose index starts with "ENSG",
# i.e. the Ensembl IDs that have no gene symbol
filter_genes = ~Bcell.var.index.str.startswith("ENSG")

# Keep only the genes that pass the filter
Bcell = Bcell[:, filter_genes]

In [11]:

Bcell

Out[11]:

View of AnnData object with n_obs × n_vars = 3540 × 25198
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

In [12]:

Bcell.var

Out[12]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length
ENSG00000243485	False	MIR1302-2HG	NCBITaxon:9606	gene	1021
ENSG00000237613	False	FAM138A	NCBITaxon:9606	gene	1219
ENSG00000186092	False	OR4F5	NCBITaxon:9606	gene	2618
ENSG00000284733	False	OR4F29	NCBITaxon:9606	gene	939
ENSG00000284662	False	OR4F16	NCBITaxon:9606	gene	939
...	...	...	...	...	...
ENSG00000223641	False	TTTY17C	NCBITaxon:9606	gene	776
ENSG00000228786	False	SEPTIN14P23	NCBITaxon:9606	gene	1192
ENSG00000172288	False	CDY1	NCBITaxon:9606	gene	2670
ENSG00000231141	False	TTTY3	NCBITaxon:9606	gene	344
ENSG00000274847	False	MAFIP	NCBITaxon:9606	gene	1599

25198 rows × 5 columns

As can be seen, the number of genes has now been reduced from 36,263 to 25,198, as any rows without a gene symbol have been removed. However, the index still shows Ensembl IDs, so let's update the variable names to contain the gene symbols, as they are easier to work with.
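
Gene symbols are not guaranteed to be unique, so it may be worth checking for repeats before using them as variable names. The sketch below uses made-up symbols. Whether this dataset actually contains duplicate symbols is not established here.

```python
import pandas as pd

# Toy gene symbols with one deliberate duplicate (illustrative values only)
symbols = pd.Series(["GENE_A", "GENE_B", "GENE_A", "GENE_C"])

# Flag every occurrence of a repeated symbol
is_duplicated = symbols.duplicated(keep=False)
print(symbols[is_duplicated].tolist())

# One possible fix: append a numeric suffix to repeated occurrences
occurrence = symbols.groupby(symbols).cumcount()
unique_symbols = symbols.where(occurrence == 0, symbols + "-" + occurrence.astype(str))
print(unique_symbols.tolist())
```

AnnData objects also provide a var_names_make_unique() method that applies a similar suffixing scheme to the variable names.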

In [ ]:

# Update var_names with feature names from var DataFrame
Bcell.var_names = Bcell.var['feature_name']

In [14]:

Bcell.var

Out[14]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length
feature_name
MIR1302-2HG	False	MIR1302-2HG	NCBITaxon:9606	gene	1021
FAM138A	False	FAM138A	NCBITaxon:9606	gene	1219
OR4F5	False	OR4F5	NCBITaxon:9606	gene	2618
OR4F29	False	OR4F29	NCBITaxon:9606	gene	939
OR4F16	False	OR4F16	NCBITaxon:9606	gene	939
...	...	...	...	...	...
TTTY17C	False	TTTY17C	NCBITaxon:9606	gene	776
SEPTIN14P23	False	SEPTIN14P23	NCBITaxon:9606	gene	1192
CDY1	False	CDY1	NCBITaxon:9606	gene	2670
TTTY3	False	TTTY3	NCBITaxon:9606	gene	344
MAFIP	False	MAFIP	NCBITaxon:9606	gene	1599

25198 rows × 5 columns

We also need to calculate the highly variable genes.

Calculating highly variable genes on gene expression data that has not been log-transformed or normalised appropriately can lead to issues, including the presence of infinity values. Log transformation is a common preprocessing step for scRNA-seq data, especially when dealing with count data, to stabilise the variance and make the data more amenable to downstream analysis. It helps to mitigate the impact of high expression values and reduce the influence of technical noise.
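
To illustrate the effect of the transformation, the sketch below applies log1p, that is log(1 + x), to a handful of made-up counts with NumPy. The values are illustrative and not drawn from the dataset.

```python
import numpy as np

# Toy counts for one gene across six cells, including a zero and one extreme value
counts = np.array([0.0, 1.0, 3.0, 10.0, 100.0, 1000.0])

# log1p computes log(1 + x), so zero counts map to zero instead of -infinity
log_counts = np.log1p(counts)

print(log_counts.round(2))
print(f"range before: {counts.max() - counts.min():.0f}")
print(f"range after:  {log_counts.max() - log_counts.min():.2f}")
```

The extreme value is pulled much closer to the rest, and every result is finite, which is why the transformation is applied before calculating highly variable genes.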

In [ ]:

# Log-transform the gene expression data (computes log(1 + x))
sc.pp.log1p(Bcell)

In [16]:

# Calculate highly variable genes
sc.pp.highly_variable_genes(Bcell, n_top_genes = 1000)

In [17]:

Bcell

Out[17]:

AnnData object with n_obs × n_vars = 3540 × 25198
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

In [18]:

Bcell.var

Out[18]:

	feature_is_filtered	feature_name	feature_reference	feature_biotype	feature_length	highly_variable	means	dispersions	dispersions_norm
feature_name
MIR1302-2HG	False	MIR1302-2HG	NCBITaxon:9606	gene	1021	False	1.000000e-12	NaN	NaN
FAM138A	False	FAM138A	NCBITaxon:9606	gene	1219	False	1.000000e-12	NaN	NaN
OR4F5	False	OR4F5	NCBITaxon:9606	gene	2618	False	1.000000e-12	NaN	NaN
OR4F29	False	OR4F29	NCBITaxon:9606	gene	939	False	1.000000e-12	NaN	NaN
OR4F16	False	OR4F16	NCBITaxon:9606	gene	939	False	1.000000e-12	NaN	NaN
...	...	...	...	...	...	...	...	...	...
TTTY17C	False	TTTY17C	NCBITaxon:9606	gene	776	False	1.000000e-12	NaN	NaN
SEPTIN14P23	False	SEPTIN14P23	NCBITaxon:9606	gene	1192	False	4.027802e-04	0.354964	-1.686367
CDY1	False	CDY1	NCBITaxon:9606	gene	2670	False	1.000000e-12	NaN	NaN
TTTY3	False	TTTY3	NCBITaxon:9606	gene	344	False	1.000000e-12	NaN	NaN
MAFIP	False	MAFIP	NCBITaxon:9606	gene	1599	False	1.053585e-02	0.594371	0.297863

25198 rows × 9 columns
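
Many genes in the table above have a mean of 1e-12 and NaN dispersions, which presumably reflects genes with no detected expression in the B cells. As a simplified illustration of why the dispersion is undefined in that case (this is an assumption-based sketch, not scanpy's exact implementation), dispersion can be thought of as the variance divided by the mean:

```python
import numpy as np

# Toy log-transformed expression: rows are cells, columns are three genes.
# The last gene is not expressed in any cell.
expression = np.array(
    [
        [0.0, 1.0, 0.0],
        [2.0, 1.5, 0.0],
        [0.0, 0.5, 0.0],
        [4.0, 1.0, 0.0],
    ]
)

mean = expression.mean(axis=0)
variance = expression.var(axis=0, ddof=1)

# Variance-to-mean ratio, left as NaN where the mean is zero
dispersion = np.full_like(mean, np.nan)
expressed = mean > 0
dispersion[expressed] = variance[expressed] / mean[expressed]

print(dispersion.round(3))
```

Genes with a NaN dispersion cannot be ranked, so they are never flagged as highly variable.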

In [19]:

# Let's save the filtered object
Bcell.write_h5ad('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered.h5ad')

Process Associated Metadata

We will now explore the associated metadata.

In [20]:

Bcell.obs.columns

Out[20]:

Index(['nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID',
       'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample',
       'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP',
       'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT',
       'nFeature_SCT', 'scType_celltype', 'pANN',
       'development_stage_ontology_term_id', 'cell_type_ontology_term_id',
       'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id',
       'suspension_type', 'is_primary_data', 'tissue_type',
       'tissue_ontology_term_id', 'organism_ontology_term_id',
       'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism',
       'sex', 'tissue', 'self_reported_ethnicity', 'development_stage',
       'observation_joinid'],
      dtype='object')

As can be seen, this dataset contains 3540 cells and 25198 genes. It also has relevant metadata in the obs section, such as MUTATION. The metadata may need to be encoded into the correct format for subsequent analyses, so let's have a look at the current format.

In [21]:

Bcell.obs

Out[21]:

	nCount_RNA	nFeature_RNA	nCount_HTO	nFeature_HTO	HTO_maxID	HTO_secondID	HTO_margin	HTO_classification.global	sample	donor_id	...	disease_ontology_term_id	cell_type	assay	disease	organism	sex	tissue	self_reported_ethnicity	development_stage	observation_joinid
0002_AAAGGGCAGCAGCACA-1	1192.0	629	97.0	2	sample-2	sample-5	3.146440	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	FrFs19`Dsw
0002_AACAACCAGGGTTAGC-1	849.0	548	21.0	3	sample-2	sample-5	1.314667	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	MMer^rOrRY
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	110.0	4	sample-2	sample-6	2.556420	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	^dC2N0DTU\|
0002_AACGAAACACAAAGTA-1	1060.0	608	20.0	4	sample-2	sample-3	0.705259	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	F>Ad_32l$>
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	237.0	4	sample-2	sample-5	3.121787	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	+@dOztSS*d
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	43.0	3	sample-5	sample-6	2.155876	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	GMZ)5R6Eh*
079_TGAATCGAGATTCGAA-1	2026.0	1097	89.0	4	sample-5	sample-4	2.725727	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	lxd{TRji23
079_TGCGATAAGGTAGATT-1	1594.0	933	37.0	2	sample-5	sample-2	1.818129	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	KN2ItXPkR4
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	74.0	3	sample-5	sample-1	2.510466	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	+VU%_s11(N
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	67.0	5	sample-5	sample-3	2.200155	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	fc(s`v}4U!

3540 rows × 41 columns

Let's create a separate dataframe with the metadata, as this will be needed for the correlation analysis.

In [22]:

# Create a copy of the metadata so as not to alter the original AnnData object
metadata = Bcell.obs.copy()
metadata

Out[22]:

	nCount_RNA	nFeature_RNA	nCount_HTO	nFeature_HTO	HTO_maxID	HTO_secondID	HTO_margin	HTO_classification.global	sample	donor_id	...	disease_ontology_term_id	cell_type	assay	disease	organism	sex	tissue	self_reported_ethnicity	development_stage	observation_joinid
0002_AAAGGGCAGCAGCACA-1	1192.0	629	97.0	2	sample-2	sample-5	3.146440	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	FrFs19`Dsw
0002_AACAACCAGGGTTAGC-1	849.0	548	21.0	3	sample-2	sample-5	1.314667	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	MMer^rOrRY
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	110.0	4	sample-2	sample-6	2.556420	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	^dC2N0DTU\|
0002_AACGAAACACAAAGTA-1	1060.0	608	20.0	4	sample-2	sample-3	0.705259	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	F>Ad_32l$>
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	237.0	4	sample-2	sample-5	3.121787	Singlet	sample-2	CH-20-002	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	68-year-old human stage	+@dOztSS*d
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	43.0	3	sample-5	sample-6	2.155876	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	GMZ)5R6Eh*
079_TGAATCGAGATTCGAA-1	2026.0	1097	89.0	4	sample-5	sample-4	2.725727	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	lxd{TRji23
079_TGCGATAAGGTAGATT-1	1594.0	933	37.0	2	sample-5	sample-2	1.818129	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	KN2ItXPkR4
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	74.0	3	sample-5	sample-1	2.510466	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	+VU%_s11(N
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	67.0	5	sample-5	sample-3	2.200155	Singlet	sample-5	CH-21-079	...	MONDO:0100542	B cell	10x 3' v3	clonal hematopoiesis	Homo sapiens	male	blood	European	78-year-old human stage	fc(s`v}4U!

3540 rows × 41 columns

There are many columns that are not needed.

In [23]:

#Let's remove these columns
columns_to_remove = ['nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 
                     'HTO_secondID', 'HTO_margin', 'HTO_classification.global',
                     'sample', 'sex_ontology_term_id', 'assay_ontology_term_id', 
                     'suspension_type', 'is_primary_data', 'tissue_ontology_term_id',
                     'organism_ontology_term_id', 'disease_ontology_term_id', 'assay', 
                     'organism', 'self_reported_ethnicity', 'observation_joinid',
                    'CHIP', 'LANE', 'ProjectID', 'HTOID',
                    'nCount_SCT', 'nFeature_SCT', 'pANN',
                    'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 
                    'self_reported_ethnicity_ontology_term_id']

In [24]:

# inplace=True modifies the DataFrame in place. With inplace=False (the default),
# drop() returns a new DataFrame with the specified columns removed and leaves
# the original DataFrame unchanged.
metadata.drop(columns=columns_to_remove, inplace=True)

In [25]:

metadata

Out[25]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	MUTATION.GROUP	percent.mt	scType_celltype	tissue_type	cell_type	disease	sex	tissue	development_stage
0002_AAAGGGCAGCAGCACA-1	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	3.803975	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage
0002_AACAACCAGGGTTAGC-1	849.0	548	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.969349	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	4.029404	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage
0002_AACGAAACACAAAGTA-1	1060.0	608	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.138810	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	13.945409	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage
...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	CH-21-079	DNMT3A M880V (5%)	DNMT3A	4.876033	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage
079_TGAATCGAGATTCGAA-1	2026.0	1097	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.510031	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage
079_TGCGATAAGGTAGATT-1	1594.0	933	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.495584	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	CH-21-079	DNMT3A M880V (5%)	DNMT3A	6.130157	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	CH-21-079	DNMT3A M880V (5%)	DNMT3A	3.212387	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage

3540 rows × 13 columns

Inspecting the metadata dataframe shows that some columns contain numerical data and others contain character strings. The string columns will need to be numerically encoded so that they can be used in the correlation analysis. Let's first identify the unique labels within each column.
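
Rather than inspecting the columns one at a time, the non-numeric columns can be selected and summarised in one pass. The sketch below runs on a small made-up dataframe. The variable name metadata_example and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the metadata dataframe (illustrative values only)
metadata_example = pd.DataFrame(
    {
        "nCount_RNA": [1192.0, 849.0, 2492.0],
        "sex": pd.Categorical(["male", "female", "male"]),
        "disease": pd.Categorical(["clonal hematopoiesis", "normal", "normal"]),
    }
)

# Select the columns that are not numeric and list their unique labels
non_numeric = metadata_example.select_dtypes(exclude="number")
unique_labels = {column: sorted(non_numeric[column].unique()) for column in non_numeric.columns}

for column, labels in unique_labels.items():
    print(f"{column}: {labels}")
```

Columns with only a few labels are straightforward to encode as binary indicators. Columns with many labels, such as free-text mutation descriptions, need more thought.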

In [26]:

metadata['sex'].unique()

Out[26]:

['male', 'female']
Categories (2, object): ['female', 'male']

Both male and female patients are included in this dataset. This column will need to be numerically encoded so that it can be used in the downstream correlation analysis.
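
Two equivalent ways of producing such 0/1 indicator columns are sketched below on a made-up series. Both are alternatives to a row-wise apply with a lambda, and the values are illustrative only.

```python
import pandas as pd

# Toy categorical column (illustrative values only)
sex = pd.Series(pd.Categorical(["male", "female", "male"]), name="sex")

# Vectorised comparison: True/False converted to 1/0
male = (sex == "male").astype(int)
female = (sex == "female").astype(int)
print(male.tolist(), female.tolist())

# get_dummies creates one indicator column per category in a single call
indicators = pd.get_dummies(sex, dtype=int)
print(indicators)
```

With only two categories, the two indicator columns are exact complements of each other, so a correlation with one is the negative of the correlation with the other.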

In [27]:

metadata['male'] = metadata['sex'].apply(lambda x: 1 if x == "male" else 0)
metadata['female'] = metadata['sex'].apply(lambda x: 1 if x == "female" else 0)

In [28]:

metadata

Out[28]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	MUTATION.GROUP	percent.mt	scType_celltype	tissue_type	cell_type	disease	sex	tissue	development_stage	male	female
0002_AAAGGGCAGCAGCACA-1	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	3.803975	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0
0002_AACAACCAGGGTTAGC-1	849.0	548	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.969349	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	4.029404	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0
0002_AACGAAACACAAAGTA-1	1060.0	608	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.138810	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	13.945409	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	CH-21-079	DNMT3A M880V (5%)	DNMT3A	4.876033	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0
079_TGAATCGAGATTCGAA-1	2026.0	1097	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.510031	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0
079_TGCGATAAGGTAGATT-1	1594.0	933	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.495584	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	CH-21-079	DNMT3A M880V (5%)	DNMT3A	6.130157	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	CH-21-079	DNMT3A M880V (5%)	DNMT3A	3.212387	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0

3540 rows × 15 columns

Now let's have a look at the disease variable.

In [29]:

metadata['disease'].unique()

Out[29]:

['clonal hematopoiesis', 'normal']
Categories (2, object): ['normal', 'clonal hematopoiesis']

In [30]:

#The disease column can be encoded into a binary variable. 
metadata['CH'] = metadata['disease'].apply(lambda x: 1 if x == "clonal hematopoiesis" else 0)
metadata['normal'] = metadata['disease'].apply(lambda x: 1 if x == "normal" else 0)

In [31]:

metadata

Out[31]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	MUTATION.GROUP	percent.mt	scType_celltype	tissue_type	cell_type	disease	sex	tissue	development_stage	male	female	CH	normal
0002_AAAGGGCAGCAGCACA-1	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	3.803975	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0	1	0
0002_AACAACCAGGGTTAGC-1	849.0	548	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.969349	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0	1	0
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	4.029404	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0	1	0
0002_AACGAAACACAAAGTA-1	1060.0	608	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.138810	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0	1	0
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	13.945409	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68-year-old human stage	1	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	CH-21-079	DNMT3A M880V (5%)	DNMT3A	4.876033	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0	1	0
079_TGAATCGAGATTCGAA-1	2026.0	1097	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.510031	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0	1	0
079_TGCGATAAGGTAGATT-1	1594.0	933	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.495584	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0	1	0
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	CH-21-079	DNMT3A M880V (5%)	DNMT3A	6.130157	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0	1	0
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	CH-21-079	DNMT3A M880V (5%)	DNMT3A	3.212387	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78-year-old human stage	1	0	1	0

3540 rows × 17 columns

Now let's recode the development_stage column as a numeric age.

In [32]:

print(metadata['development_stage'].cat.categories)

Index(['39-year-old human stage', '48-year-old human stage',
       '50-year-old human stage', '58-year-old human stage',
       '60-year-old human stage', '61-year-old human stage',
       '65-year-old human stage', '67-year-old human stage',
       '68-year-old human stage', '70-year-old human stage',
       '71-year-old human stage', '73-year-old human stage',
       '74-year-old human stage', '77-year-old human stage',
       '78-year-old human stage', '80-year-old human stage',
       '81-year-old human stage', '83-year-old human stage',
       '85-year-old human stage', '89-year-old human stage',
       '91-year-old human stage'],
      dtype='object')

In [33]:

# There are 21 age categories. Encode each one as its numeric age in years.
# Recode development_stage
development_stage_mapping = {
    '39-year-old human stage': 39,
    '48-year-old human stage': 48,
    '50-year-old human stage': 50,
    '58-year-old human stage': 58,
    '60-year-old human stage': 60,
    '61-year-old human stage': 61,
    '65-year-old human stage': 65,
    '67-year-old human stage': 67,
    '68-year-old human stage': 68,
    '70-year-old human stage': 70,
    '71-year-old human stage': 71,
    '73-year-old human stage': 73,
    '74-year-old human stage': 74,
    '77-year-old human stage': 77,
    '78-year-old human stage': 78,
    '80-year-old human stage': 80,
    '81-year-old human stage': 81,
    '83-year-old human stage': 83,
    '85-year-old human stage': 85,
    '89-year-old human stage': 89,
    '91-year-old human stage': 91
}
metadata['development_stage'] = metadata['development_stage'].map(development_stage_mapping)
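Because every label follows the same "N-year-old human stage" pattern, the age can also be parsed from the label instead of being listed by hand. This is a minimal sketch on a toy dataframe (the `toy` frame and the `age_years` column name are hypothetical). It assumes every label starts with the age in digits, and the final `astype(int)` would raise an error if any label did not match.

```python
import pandas as pd

# Toy stand-in for the development_stage column (labels copied from the output above)
toy = pd.DataFrame({
    "development_stage": pd.Categorical(
        ["68-year-old human stage", "78-year-old human stage", "39-year-old human stage"]
    )
})

# Pull the leading digits out of each label and convert them to integers.
# expand=False returns a Series because the pattern has a single capture group.
toy["age_years"] = (
    toy["development_stage"]
    .astype(str)
    .str.extract(r"^(\d+)-year-old", expand=False)
    .astype(int)
)

print(toy)
```

An explicit dictionary, as used in the cell above, has the advantage of making every category visible. The pattern-based version avoids having to update the mapping if a dataset contains other ages.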

In [34]:

metadata

Out[34]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	MUTATION.GROUP	percent.mt	scType_celltype	tissue_type	cell_type	disease	sex	tissue	development_stage	male	female	CH	normal
0002_AAAGGGCAGCAGCACA-1	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	3.803975	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0
0002_AACAACCAGGGTTAGC-1	849.0	548	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.969349	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	4.029404	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0
0002_AACGAAACACAAAGTA-1	1060.0	608	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.138810	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	13.945409	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	CH-21-079	DNMT3A M880V (5%)	DNMT3A	4.876033	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0
079_TGAATCGAGATTCGAA-1	2026.0	1097	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.510031	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0
079_TGCGATAAGGTAGATT-1	1594.0	933	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.495584	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	CH-21-079	DNMT3A M880V (5%)	DNMT3A	6.130157	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	CH-21-079	DNMT3A M880V (5%)	DNMT3A	3.212387	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0

3540 rows × 17 columns

In [35]:

metadata['MUTATION.GROUP'].unique()

Out[35]:

['DNMT3A', 'none', 'TET2']
Categories (3, object): ['DNMT3A', 'TET2', 'none']

In [36]:

metadata['DNMT3A'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "DNMT3A" else 0)
metadata['TET2'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "TET2" else 0)
metadata['NoMutation'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "none" else 0)

In [37]:

metadata

Out[37]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	MUTATION.GROUP	percent.mt	scType_celltype	tissue_type	cell_type	disease	sex	tissue	development_stage	male	female	CH	normal	DNMT3A	TET2	NoMutation
0002_AAAGGGCAGCAGCACA-1	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	3.803975	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0	1	0	0
0002_AACAACCAGGGTTAGC-1	849.0	548	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.969349	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0	1	0	0
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	4.029404	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0	1	0	0
0002_AACGAAACACAAAGTA-1	1060.0	608	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	7.138810	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0	1	0	0
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	DNMT3A	13.945409	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	68	1	0	1	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	CH-21-079	DNMT3A M880V (5%)	DNMT3A	4.876033	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0	1	0	0
079_TGAATCGAGATTCGAA-1	2026.0	1097	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.510031	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0	1	0	0
079_TGCGATAAGGTAGATT-1	1594.0	933	CH-21-079	DNMT3A M880V (5%)	DNMT3A	5.495584	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0	1	0	0
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	CH-21-079	DNMT3A M880V (5%)	DNMT3A	6.130157	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0	1	0	0
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	CH-21-079	DNMT3A M880V (5%)	DNMT3A	3.212387	Naive B cells	tissue	B cell	clonal hematopoiesis	male	blood	78	1	0	1	0	1	0	0

3540 rows × 20 columns

In [38]:

# Drop unnecessary columns
metadata = metadata.drop(['disease', 'MUTATION.GROUP', 'sex'], axis=1)
metadata

Out[38]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	percent.mt	scType_celltype	tissue_type	cell_type	tissue	development_stage	male	female	CH	normal	DNMT3A	TET2	NoMutation
0002_AAAGGGCAGCAGCACA-1	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	3.803975	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0
0002_AACAACCAGGGTTAGC-1	849.0	548	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	7.969349	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0
0002_AACCCAAAGGGCCTCT-1	2492.0	1188	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	4.029404	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0
0002_AACGAAACACAAAGTA-1	1060.0	608	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	7.138810	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0
0002_AAGCGTTTCTTGGGCG-1	1270.0	716	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	13.945409	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	1925.0	1054	CH-21-079	DNMT3A M880V (5%)	4.876033	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0
079_TGAATCGAGATTCGAA-1	2026.0	1097	CH-21-079	DNMT3A M880V (5%)	5.510031	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0
079_TGCGATAAGGTAGATT-1	1594.0	933	CH-21-079	DNMT3A M880V (5%)	5.495584	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0
079_TGCTCGTAGGGTTGCA-1	1840.0	1101	CH-21-079	DNMT3A M880V (5%)	6.130157	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0
079_TTCCTCTAGAGCTTTC-1	2643.0	1197	CH-21-079	DNMT3A M880V (5%)	3.212387	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0

3540 rows × 17 columns

In [39]:

# Save the metadata dataframe
metadata.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_metadata.csv', index = True)

In [ ]:

metadata = pd.read_csv('data/Bcell_metadata.csv', index_col = 0)

Because this is single-cell data, each donor contributes many cells. We cannot simply correlate the gene expression data in its current form, because that would mix within-donor and between-donor correlations. The data must therefore be pseudobulked first, that is, aggregated to one expression profile per donor. Pseudobulking speeds up the computation and, more importantly, removes the effect of within-sample correlation. It can also mitigate issues commonly found in single-cell data, such as dropouts and a high proportion of zero counts.

Pseudobulk the Metadata¶

First, we reduce the metadata dataframe to one row per donor, since the expression data will be aggregated to donor level.

In [42]:

# Convert row names to a column named 'cell_id'
metadata['cell_id'] = metadata.index

In [43]:

# Group by 'donor_id' and select the first row of each group
rows = metadata.groupby('donor_id').first().reset_index()

In [44]:

rows

Out[44]:

	donor_id	nCount_RNA	nFeature_RNA	MUTATION	percent.mt	scType_celltype	tissue_type	cell_type	tissue	development_stage	male	female	CH	normal	DNMT3A	TET2	NoMutation	cell_id
0	CH-20-001	2490.0	1403	DNMT3A R882C	6.119578	Naive B cells	tissue	B cell	blood	60	1	0	1	0	1	0	0	001_AAAGAACGTTCTCAGA-1
1	CH-20-002	1192.0	629	DNMT3A R729W (4%), DNMT3A R736C (2%)	3.803975	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0	0002_AAAGGGCAGCAGCACA-1
2	CH-20-004	1833.0	985	TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...	5.335196	Naive B cells	tissue	B cell	blood	85	1	0	1	0	0	1	0	004_AACCTGATCTTTGATC-1
3	CH-20-005	1966.0	886	TET2 V1900F (2%)	5.314136	Naive B cells	tissue	B cell	blood	58	0	1	1	0	0	1	0	005_AACAACCAGAGCTGAC-1
4	CH-21-002	1912.0	938	none	5.657238	Naive B cells	tissue	B cell	blood	48	0	1	0	1	0	0	1	002_AAAGGTACACATTGTG-1
5	CH-21-006	1356.0	709	DNMT3A R882H (13%)	5.211849	Naive B cells	tissue	B cell	blood	67	0	1	1	0	1	0	0	006_AACGAAACAGAGTTCT-1
6	CH-21-008	1117.0	575	none	8.398348	Naive B cells	tissue	B cell	blood	70	0	1	0	1	0	0	1	008_AACAGGGTCTTCTCAA-1
7	CH-21-013	1321.0	816	none	4.663212	Naive B cells	tissue	B cell	blood	73	1	0	0	1	0	0	1	013_AACCAACAGGTAGCCA-1
8	CH-21-014	1064.0	623	SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)	4.146577	Naive B cells	tissue	B cell	blood	74	1	0	1	0	0	1	0	014_AAAGTCCGTTTGACAC-1
9	CH-21-017	1880.0	953	DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...	6.519922	Naive B cells	tissue	B cell	blood	65	1	0	1	0	1	0	0	017_AAACGAAAGGCGAACT-1
10	CH-21-020	5325.0	2286	none	5.631046	Naive B cells	tissue	B cell	blood	61	1	0	0	1	0	0	1	020_AAACGAATCGATTTCT-1
11	CH-21-021	1671.0	943	none	3.214286	Naive B cells	tissue	B cell	blood	83	1	0	0	1	0	0	1	021_AAAGGTAGTTGTTGAC-1
12	CH-21-028	1690.0	866	none	6.053894	Naive B cells	tissue	B cell	blood	89	0	1	0	1	0	0	1	028_AAAGTGACATAGACTC-1
13	CH-21-029	2180.0	1073	TET2 G68X (2%)	2.570194	Naive B cells	tissue	B cell	blood	83	0	1	1	0	0	1	0	029_AAAGGTAAGCCGTTAT-1
14	CH-21-031	1592.0	887	none	6.734398	Naive B cells	tissue	B cell	blood	78	0	1	0	1	0	0	1	031_AAACGCTAGTTTGTCG-1
15	CH-21-033	2219.0	1138	TET2 (33%)	5.670567	Naive B cells	tissue	B cell	blood	81	1	0	1	0	0	1	0	033_AAACGCTGTAAGCGGT-1
16	CH-21-034	2010.0	974	DNMT3A Q816X (8%)	7.937365	Naive B cells	tissue	B cell	blood	39	0	1	1	0	1	0	0	034_AAACCCAAGCGTCTCG-1
17	CH-21-036	2686.0	1337	DNMT3A splice (7%)	3.909544	Naive B cells	tissue	B cell	blood	91	1	0	1	0	1	0	0	036_AAAGGGCTCCCTCTAG-1
18	CH-21-037	3546.0	1645	TET2 (6.2%)	4.473764	Naive B cells	tissue	B cell	blood	71	1	0	1	0	0	1	0	037_AAAGGTAAGCGCCATC-1
19	CH-21-046	1918.0	907	DNMT3A W305X (24%)	4.807084	Naive B cells	tissue	B cell	blood	80	1	0	1	0	1	0	0	046_AACCATGCAGATCATC-1
20	CH-21-073	2148.0	1096	SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742...	5.174489	Naive B cells	tissue	B cell	blood	77	1	0	1	0	0	1	0	073_AAACGCTGTAACCCGC-1
21	CH-21-074	1322.0	708	TET2 C1378Y (23%)	3.328561	Naive B cells	tissue	B cell	blood	70	1	0	1	0	0	1	0	074_AATGGCTGTCCAGAAG-1
22	CH-21-077	1715.0	934	DNMT3A R749C (9.1%)	6.539510	Naive B cells	tissue	B cell	blood	50	0	1	1	0	1	0	0	077_AACAAGAGTAAGTTAG-1
23	CH-21-079	1354.0	793	DNMT3A M880V (5%)	6.386293	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0	079_AAAGGATCAAGCCCAC-1

In [45]:

# Extract row indices corresponding to the first cell from each donor
row_list = []
for i, row in rows.iterrows():
    row_idx = metadata.index.get_loc(row['cell_id'])
    row_list.append(row_idx)

In [46]:

row_list

Out[46]:

In [47]:

# Select the rows (first cell of each donor) from the DataFrame, keeping all columns
metadata2 = metadata.iloc[row_list, :].copy()

In [48]:

metadata2

Out[48]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	percent.mt	scType_celltype	tissue_type	cell_type	tissue	development_stage	male	female	CH	normal	DNMT3A	TET2	NoMutation	cell_id
001_AAAGAACGTTCTCAGA-1	2490.0	1403	CH-20-001	DNMT3A R882C	6.119578	Naive B cells	tissue	B cell	blood	60	1	0	1	0	1	0	0	001_AAAGAACGTTCTCAGA-1
0002_AAAGGGCAGCAGCACA-1	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	3.803975	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0	0002_AAAGGGCAGCAGCACA-1
004_AACCTGATCTTTGATC-1	1833.0	985	CH-20-004	TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...	5.335196	Naive B cells	tissue	B cell	blood	85	1	0	1	0	0	1	0	004_AACCTGATCTTTGATC-1
005_AACAACCAGAGCTGAC-1	1966.0	886	CH-20-005	TET2 V1900F (2%)	5.314136	Naive B cells	tissue	B cell	blood	58	0	1	1	0	0	1	0	005_AACAACCAGAGCTGAC-1
002_AAAGGTACACATTGTG-1	1912.0	938	CH-21-002	none	5.657238	Naive B cells	tissue	B cell	blood	48	0	1	0	1	0	0	1	002_AAAGGTACACATTGTG-1
006_AACGAAACAGAGTTCT-1	1356.0	709	CH-21-006	DNMT3A R882H (13%)	5.211849	Naive B cells	tissue	B cell	blood	67	0	1	1	0	1	0	0	006_AACGAAACAGAGTTCT-1
008_AACAGGGTCTTCTCAA-1	1117.0	575	CH-21-008	none	8.398348	Naive B cells	tissue	B cell	blood	70	0	1	0	1	0	0	1	008_AACAGGGTCTTCTCAA-1
013_AACCAACAGGTAGCCA-1	1321.0	816	CH-21-013	none	4.663212	Naive B cells	tissue	B cell	blood	73	1	0	0	1	0	0	1	013_AACCAACAGGTAGCCA-1
014_AAAGTCCGTTTGACAC-1	1064.0	623	CH-21-014	SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)	4.146577	Naive B cells	tissue	B cell	blood	74	1	0	1	0	0	1	0	014_AAAGTCCGTTTGACAC-1
017_AAACGAAAGGCGAACT-1	1880.0	953	CH-21-017	DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...	6.519922	Naive B cells	tissue	B cell	blood	65	1	0	1	0	1	0	0	017_AAACGAAAGGCGAACT-1
020_AAACGAATCGATTTCT-1	5325.0	2286	CH-21-020	none	5.631046	Naive B cells	tissue	B cell	blood	61	1	0	0	1	0	0	1	020_AAACGAATCGATTTCT-1
021_AAAGGTAGTTGTTGAC-1	1671.0	943	CH-21-021	none	3.214286	Naive B cells	tissue	B cell	blood	83	1	0	0	1	0	0	1	021_AAAGGTAGTTGTTGAC-1
028_AAAGTGACATAGACTC-1	1690.0	866	CH-21-028	none	6.053894	Naive B cells	tissue	B cell	blood	89	0	1	0	1	0	0	1	028_AAAGTGACATAGACTC-1
029_AAAGGTAAGCCGTTAT-1	2180.0	1073	CH-21-029	TET2 G68X (2%)	2.570194	Naive B cells	tissue	B cell	blood	83	0	1	1	0	0	1	0	029_AAAGGTAAGCCGTTAT-1
031_AAACGCTAGTTTGTCG-1	1592.0	887	CH-21-031	none	6.734398	Naive B cells	tissue	B cell	blood	78	0	1	0	1	0	0	1	031_AAACGCTAGTTTGTCG-1
033_AAACGCTGTAAGCGGT-1	2219.0	1138	CH-21-033	TET2 (33%)	5.670567	Naive B cells	tissue	B cell	blood	81	1	0	1	0	0	1	0	033_AAACGCTGTAAGCGGT-1
034_AAACCCAAGCGTCTCG-1	2010.0	974	CH-21-034	DNMT3A Q816X (8%)	7.937365	Naive B cells	tissue	B cell	blood	39	0	1	1	0	1	0	0	034_AAACCCAAGCGTCTCG-1
036_AAAGGGCTCCCTCTAG-1	2686.0	1337	CH-21-036	DNMT3A splice (7%)	3.909544	Naive B cells	tissue	B cell	blood	91	1	0	1	0	1	0	0	036_AAAGGGCTCCCTCTAG-1
037_AAAGGTAAGCGCCATC-1	3546.0	1645	CH-21-037	TET2 (6.2%)	4.473764	Naive B cells	tissue	B cell	blood	71	1	0	1	0	0	1	0	037_AAAGGTAAGCGCCATC-1
046_AACCATGCAGATCATC-1	1918.0	907	CH-21-046	DNMT3A W305X (24%)	4.807084	Naive B cells	tissue	B cell	blood	80	1	0	1	0	1	0	0	046_AACCATGCAGATCATC-1
073_AAACGCTGTAACCCGC-1	2148.0	1096	CH-21-073	SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742...	5.174489	Naive B cells	tissue	B cell	blood	77	1	0	1	0	0	1	0	073_AAACGCTGTAACCCGC-1
074_AATGGCTGTCCAGAAG-1	1322.0	708	CH-21-074	TET2 C1378Y (23%)	3.328561	Naive B cells	tissue	B cell	blood	70	1	0	1	0	0	1	0	074_AATGGCTGTCCAGAAG-1
077_AACAAGAGTAAGTTAG-1	1715.0	934	CH-21-077	DNMT3A R749C (9.1%)	6.539510	Naive B cells	tissue	B cell	blood	50	0	1	1	0	1	0	0	077_AACAAGAGTAAGTTAG-1
079_AAAGGATCAAGCCCAC-1	1354.0	793	CH-21-079	DNMT3A M880V (5%)	6.386293	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0	079_AAAGGATCAAGCCCAC-1

In [49]:

metadata2.set_index('donor_id', inplace = True, drop = False)

In [50]:

metadata2

Out[50]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	percent.mt	scType_celltype	tissue_type	cell_type	tissue	development_stage	male	female	CH	normal	DNMT3A	TET2	NoMutation	cell_id
donor_id
CH-20-001	2490.0	1403	CH-20-001	DNMT3A R882C	6.119578	Naive B cells	tissue	B cell	blood	60	1	0	1	0	1	0	0	001_AAAGAACGTTCTCAGA-1
CH-20-002	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	3.803975	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0	0002_AAAGGGCAGCAGCACA-1
CH-20-004	1833.0	985	CH-20-004	TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...	5.335196	Naive B cells	tissue	B cell	blood	85	1	0	1	0	0	1	0	004_AACCTGATCTTTGATC-1
CH-20-005	1966.0	886	CH-20-005	TET2 V1900F (2%)	5.314136	Naive B cells	tissue	B cell	blood	58	0	1	1	0	0	1	0	005_AACAACCAGAGCTGAC-1
CH-21-002	1912.0	938	CH-21-002	none	5.657238	Naive B cells	tissue	B cell	blood	48	0	1	0	1	0	0	1	002_AAAGGTACACATTGTG-1
CH-21-006	1356.0	709	CH-21-006	DNMT3A R882H (13%)	5.211849	Naive B cells	tissue	B cell	blood	67	0	1	1	0	1	0	0	006_AACGAAACAGAGTTCT-1
CH-21-008	1117.0	575	CH-21-008	none	8.398348	Naive B cells	tissue	B cell	blood	70	0	1	0	1	0	0	1	008_AACAGGGTCTTCTCAA-1
CH-21-013	1321.0	816	CH-21-013	none	4.663212	Naive B cells	tissue	B cell	blood	73	1	0	0	1	0	0	1	013_AACCAACAGGTAGCCA-1
CH-21-014	1064.0	623	CH-21-014	SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)	4.146577	Naive B cells	tissue	B cell	blood	74	1	0	1	0	0	1	0	014_AAAGTCCGTTTGACAC-1
CH-21-017	1880.0	953	CH-21-017	DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...	6.519922	Naive B cells	tissue	B cell	blood	65	1	0	1	0	1	0	0	017_AAACGAAAGGCGAACT-1
CH-21-020	5325.0	2286	CH-21-020	none	5.631046	Naive B cells	tissue	B cell	blood	61	1	0	0	1	0	0	1	020_AAACGAATCGATTTCT-1
CH-21-021	1671.0	943	CH-21-021	none	3.214286	Naive B cells	tissue	B cell	blood	83	1	0	0	1	0	0	1	021_AAAGGTAGTTGTTGAC-1
CH-21-028	1690.0	866	CH-21-028	none	6.053894	Naive B cells	tissue	B cell	blood	89	0	1	0	1	0	0	1	028_AAAGTGACATAGACTC-1
CH-21-029	2180.0	1073	CH-21-029	TET2 G68X (2%)	2.570194	Naive B cells	tissue	B cell	blood	83	0	1	1	0	0	1	0	029_AAAGGTAAGCCGTTAT-1
CH-21-031	1592.0	887	CH-21-031	none	6.734398	Naive B cells	tissue	B cell	blood	78	0	1	0	1	0	0	1	031_AAACGCTAGTTTGTCG-1
CH-21-033	2219.0	1138	CH-21-033	TET2 (33%)	5.670567	Naive B cells	tissue	B cell	blood	81	1	0	1	0	0	1	0	033_AAACGCTGTAAGCGGT-1
CH-21-034	2010.0	974	CH-21-034	DNMT3A Q816X (8%)	7.937365	Naive B cells	tissue	B cell	blood	39	0	1	1	0	1	0	0	034_AAACCCAAGCGTCTCG-1
CH-21-036	2686.0	1337	CH-21-036	DNMT3A splice (7%)	3.909544	Naive B cells	tissue	B cell	blood	91	1	0	1	0	1	0	0	036_AAAGGGCTCCCTCTAG-1
CH-21-037	3546.0	1645	CH-21-037	TET2 (6.2%)	4.473764	Naive B cells	tissue	B cell	blood	71	1	0	1	0	0	1	0	037_AAAGGTAAGCGCCATC-1
CH-21-046	1918.0	907	CH-21-046	DNMT3A W305X (24%)	4.807084	Naive B cells	tissue	B cell	blood	80	1	0	1	0	1	0	0	046_AACCATGCAGATCATC-1
CH-21-073	2148.0	1096	CH-21-073	SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742...	5.174489	Naive B cells	tissue	B cell	blood	77	1	0	1	0	0	1	0	073_AAACGCTGTAACCCGC-1
CH-21-074	1322.0	708	CH-21-074	TET2 C1378Y (23%)	3.328561	Naive B cells	tissue	B cell	blood	70	1	0	1	0	0	1	0	074_AATGGCTGTCCAGAAG-1
CH-21-077	1715.0	934	CH-21-077	DNMT3A R749C (9.1%)	6.539510	Naive B cells	tissue	B cell	blood	50	0	1	1	0	1	0	0	077_AACAAGAGTAAGTTAG-1
CH-21-079	1354.0	793	CH-21-079	DNMT3A M880V (5%)	6.386293	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0	079_AAAGGATCAAGCCCAC-1

In [51]:

# Remove the cell_id column
metadata2.drop(columns = 'cell_id', inplace = True)

In [52]:

metadata2

Out[52]:

	nCount_RNA	nFeature_RNA	donor_id	MUTATION	percent.mt	scType_celltype	tissue_type	cell_type	tissue	development_stage	male	female	CH	normal	DNMT3A	TET2	NoMutation
donor_id
CH-20-001	2490.0	1403	CH-20-001	DNMT3A R882C	6.119578	Naive B cells	tissue	B cell	blood	60	1	0	1	0	1	0	0
CH-20-002	1192.0	629	CH-20-002	DNMT3A R729W (4%), DNMT3A R736C (2%)	3.803975	Naive B cells	tissue	B cell	blood	68	1	0	1	0	1	0	0
CH-20-004	1833.0	985	CH-20-004	TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...	5.335196	Naive B cells	tissue	B cell	blood	85	1	0	1	0	0	1	0
CH-20-005	1966.0	886	CH-20-005	TET2 V1900F (2%)	5.314136	Naive B cells	tissue	B cell	blood	58	0	1	1	0	0	1	0
CH-21-002	1912.0	938	CH-21-002	none	5.657238	Naive B cells	tissue	B cell	blood	48	0	1	0	1	0	0	1
CH-21-006	1356.0	709	CH-21-006	DNMT3A R882H (13%)	5.211849	Naive B cells	tissue	B cell	blood	67	0	1	1	0	1	0	0
CH-21-008	1117.0	575	CH-21-008	none	8.398348	Naive B cells	tissue	B cell	blood	70	0	1	0	1	0	0	1
CH-21-013	1321.0	816	CH-21-013	none	4.663212	Naive B cells	tissue	B cell	blood	73	1	0	0	1	0	0	1
CH-21-014	1064.0	623	CH-21-014	SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)	4.146577	Naive B cells	tissue	B cell	blood	74	1	0	1	0	0	1	0
CH-21-017	1880.0	953	CH-21-017	DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...	6.519922	Naive B cells	tissue	B cell	blood	65	1	0	1	0	1	0	0
CH-21-020	5325.0	2286	CH-21-020	none	5.631046	Naive B cells	tissue	B cell	blood	61	1	0	0	1	0	0	1
CH-21-021	1671.0	943	CH-21-021	none	3.214286	Naive B cells	tissue	B cell	blood	83	1	0	0	1	0	0	1
CH-21-028	1690.0	866	CH-21-028	none	6.053894	Naive B cells	tissue	B cell	blood	89	0	1	0	1	0	0	1
CH-21-029	2180.0	1073	CH-21-029	TET2 G68X (2%)	2.570194	Naive B cells	tissue	B cell	blood	83	0	1	1	0	0	1	0
CH-21-031	1592.0	887	CH-21-031	none	6.734398	Naive B cells	tissue	B cell	blood	78	0	1	0	1	0	0	1
CH-21-033	2219.0	1138	CH-21-033	TET2 (33%)	5.670567	Naive B cells	tissue	B cell	blood	81	1	0	1	0	0	1	0
CH-21-034	2010.0	974	CH-21-034	DNMT3A Q816X (8%)	7.937365	Naive B cells	tissue	B cell	blood	39	0	1	1	0	1	0	0
CH-21-036	2686.0	1337	CH-21-036	DNMT3A splice (7%)	3.909544	Naive B cells	tissue	B cell	blood	91	1	0	1	0	1	0	0
CH-21-037	3546.0	1645	CH-21-037	TET2 (6.2%)	4.473764	Naive B cells	tissue	B cell	blood	71	1	0	1	0	0	1	0
CH-21-046	1918.0	907	CH-21-046	DNMT3A W305X (24%)	4.807084	Naive B cells	tissue	B cell	blood	80	1	0	1	0	1	0	0
CH-21-073	2148.0	1096	CH-21-073	SRSF2 (33%), TET2 Y1245Lfs*22 (27%), TET2 Q742...	5.174489	Naive B cells	tissue	B cell	blood	77	1	0	1	0	0	1	0
CH-21-074	1322.0	708	CH-21-074	TET2 C1378Y (23%)	3.328561	Naive B cells	tissue	B cell	blood	70	1	0	1	0	0	1	0
CH-21-077	1715.0	934	CH-21-077	DNMT3A R749C (9.1%)	6.539510	Naive B cells	tissue	B cell	blood	50	0	1	1	0	1	0	0
CH-21-079	1354.0	793	CH-21-079	DNMT3A M880V (5%)	6.386293	Naive B cells	tissue	B cell	blood	78	1	0	1	0	1	0	0

In [53]:

# Save the metadata
metadata2.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_metadata_pseudobulk.csv', index = True)

The metadata dataframe for the pseudobulk is now complete.
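Taking the first cell of each donor works for columns that are constant within a donor, such as age, sex and mutation status. Per-cell columns such as nCount_RNA, nFeature_RNA and percent.mt, however, keep the value of that first cell rather than a donor-level summary. The sketch below, on a toy dataframe with hypothetical names (`toy`, `donor_level_cols`, `donor_meta`), shows one possible way to check which columns are constant per donor and to build the one-row-per-donor table in fewer steps. It is an illustration under those assumptions, not the notebook's own procedure.

```python
import pandas as pd

# Toy per-cell metadata: two donors, CH and age are donor-level, nCount_RNA is per-cell
toy = pd.DataFrame(
    {
        "donor_id": ["D1", "D1", "D2", "D2", "D2"],
        "CH": [1, 1, 0, 0, 0],
        "development_stage": [68, 68, 48, 48, 48],
        "nCount_RNA": [1192.0, 849.0, 1912.0, 2100.0, 1500.0],
    },
    index=["c1", "c2", "c3", "c4", "c5"],
)

# Count distinct values per donor for every non-key column;
# a column is donor-level if it has exactly one value within each donor
n_unique = toy.drop(columns="donor_id").groupby(toy["donor_id"]).nunique()
is_constant = (n_unique == 1).all().to_numpy()
donor_level_cols = n_unique.columns[is_constant].tolist()

# Keep the first cell of each donor, index by donor, and retain only donor-level columns
donor_meta = toy.drop_duplicates("donor_id").set_index("donor_id")[donor_level_cols]

print(donor_level_cols)
print(donor_meta)
```

If per-cell quality metrics are needed at donor level, they could be summarised separately (for example with a per-donor mean) instead of being carried over from a single cell.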

Pseudobulk the Corresponding Data¶

Let's proceed to aggregate the gene expression data. This involves summing the expression values of each gene across all cells from the same donor.
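As a minimal sketch of what this aggregation does, the toy example below sums a small cell-by-gene table per donor with `groupby`. The gene names, cell names and the `donor_of_cell` Series are hypothetical. Whether raw counts or normalised values should be summed depends on the downstream method and is not decided here.

```python
import pandas as pd

# Toy cell-by-gene expression table (rows are cells, columns are genes)
expr = pd.DataFrame(
    {"GENE_A": [1.0, 0.0, 2.0, 3.0], "GENE_B": [0.0, 4.0, 1.0, 1.0]},
    index=["c1", "c2", "c3", "c4"],
)

# Donor label for each cell, aligned to the expression table by cell ID
donor_of_cell = pd.Series(["D1", "D1", "D2", "D2"], index=expr.index, name="donor_id")

# Sum every gene over the cells of each donor: one row per donor
pseudobulk = expr.groupby(donor_of_cell).sum()

print(pseudobulk)
```

The result has one row per donor and one column per gene, which is the shape needed for a donor-level correlation analysis.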

First, the gene expression matrix needs to be extracted from our adata object.

Single-cell expression data is typically stored as a sparse matrix, so it must first be converted to a dense matrix before it can be wrapped in a dataframe.

In [54]:

# Convert the sparse matrix to a dense matrix
dense_matrix = Bcell.X.todense()

In [55]:

datExpr = pd.DataFrame(dense_matrix, index=Bcell.obs_names, columns=Bcell.var_names)
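The conversion can be illustrated on a small SciPy sparse matrix. In this sketch the matrix, cell names and gene names are made up. It uses `toarray()`, which returns a plain `numpy.ndarray`, whereas `todense()` returns a `numpy.matrix`. Either can be passed to `pd.DataFrame`. The `issparse` check is a hedge for the case where the matrix is already dense.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy sparse expression matrix: 2 cells x 3 genes
counts = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0]]))
cells = ["cell_a", "cell_b"]
genes = ["GENE_A", "GENE_B", "GENE_C"]

# Convert to a dense ndarray only if the input is actually sparse
dense = counts.toarray() if sparse.issparse(counts) else np.asarray(counts)

# Wrap the dense array in a dataframe labelled by cell and gene
df = pd.DataFrame(dense, index=cells, columns=genes)

print(df)
```

A dense matrix stores every zero explicitly, so memory use grows with cells times genes. For a few thousand cells this is manageable, but for larger datasets it may be worth filtering genes before densifying.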

In [56]:

datExpr

Out[56]:

feature_name	MIR1302-2HG	FAM138A	OR4F5	OR4F29	OR4F16	LINC01409	FAM87B	LINC01128	LINC00115	FAM41C	...	BPY2B	DAZ3	DAZ4	BPY2C	TTTY4C	TTTY17C	SEPTIN14P23	CDY1	TTTY3	MAFIP
0002_AAAGGGCAGCAGCACA-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0002_AACAACCAGGGTTAGC-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0002_AACCCAAAGGGCCTCT-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0002_AACGAAACACAAAGTA-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0002_AAGCGTTTCTTGGGCG-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
079_TGAATCGAGATTCGAA-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
079_TGCGATAAGGTAGATT-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
079_TGCTCGTAGGGTTGCA-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
079_TTCCTCTAGAGCTTTC-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

3540 rows × 25198 columns

In [57]:

# Save the single-cell expression dataframe (datExpr)
datExpr.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_singlecell.csv', index = True)

Highly variable genes tend to be the most informative, so they will be used to filter the expression matrix further. This also reduces the dimensionality of the data, which makes downstream analyses more computationally efficient.

In [58]:

hvg = Bcell.var_names[Bcell.var['highly_variable']]
hvg

Out[58]:

CategoricalIndex(['ISG15', 'LINC01342', 'TTLL10-AS1', 'TNFRSF18', 'CALML6',
                  'CHD5', 'ICMT-DT', 'MIR34AHG', 'RBP7', 'MTOR-AS1',
                  ...
                  'FRMPD3', 'TSC22D3', 'KLHL13', 'AKAP14', 'RHOXF1-AS1',
                  'TMEM255A', 'SMIM10L2B-AS1', 'IL9R_ENSG00000124334', 'DDX3Y',
                  'EIF1AY'],
                 categories=['A1BG', 'A1BG-AS1', 'A1CF', 'A2M', 'A2M-AS1', 'A2ML1', 'A2ML1-AS1', 'A2ML1-AS2', ...], ordered=False, dtype='category', name='feature_name', length=1000)

In [59]:

datExpr = datExpr.loc[:,hvg]
datExpr

Out[59]:

feature_name	ISG15	LINC01342	TTLL10-AS1	TNFRSF18	CALML6	CHD5	ICMT-DT	MIR34AHG	RBP7	MTOR-AS1	...	FRMPD3	TSC22D3	KLHL13	AKAP14	RHOXF1-AS1	TMEM255A	SMIM10L2B-AS1	IL9R_ENSG00000124334	DDX3Y	EIF1AY
0002_AAAGGGCAGCAGCACA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.513502	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
0002_AACAACCAGGGTTAGC-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.583828	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
0002_AACCCAAAGGGCCTCT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.344490	0.0	0.0	0.0	0.0	0.0	0.0	1.271980	0.960117
0002_AACGAAACACAAAGTA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.207486	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
0002_AAGCGTTTCTTGGGCG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.435951	0.0	0.0	0.0	0.0	0.0	0.0	1.157864	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.582282	0.0	0.0	0.0	0.0	0.0	0.0	1.038052	0.000000
079_TGAATCGAGATTCGAA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.445902	0.0	0.0	0.0	0.0	0.0	0.0	1.022813	0.000000
079_TGCGATAAGGTAGATT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.282646	0.0	0.0	0.0	0.0	0.0	0.0	1.282646	0.000000
079_TGCTCGTAGGGTTGCA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.245300	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
079_TTCCTCTAGAGCTTTC-1	0.942032	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.516629	0.0	0.0	0.0	0.0	0.0	0.0	0.942032	0.942032

3540 rows × 1000 columns
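The filtering step above amounts to selecting columns with a boolean mask. A minimal sketch on toy data follows. The `var` table mimics the per-gene annotation of an AnnData object, and the gene names and flags are invented for illustration.

```python
import pandas as pd

# Toy cell-by-gene expression table
expr = pd.DataFrame(
    {"GENE_A": [0.0, 1.2], "GENE_B": [0.5, 0.0], "GENE_C": [2.0, 2.1]},
    index=["cell_a", "cell_b"],
)

# Toy per-gene annotation with a boolean highly_variable flag, indexed by gene name
var = pd.DataFrame({"highly_variable": [True, False, True]}, index=expr.columns)

# Names of the genes flagged as highly variable
hvg = var.index[var["highly_variable"].to_numpy()]

# Keep only those gene columns
expr_hvg = expr.loc[:, hvg]

print(expr_hvg)
```

This only works if a `highly_variable` flag already exists for each gene, as it does in the notebook's `Bcell.var`.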

Add the donor_id column to the gene expression dataframe, so we know which cell came from which donor.

In [60]:

# Reset the index of 'datExpr' DataFrame to make the row names (cell names) a column
datExpr_donor = datExpr.reset_index()

In [61]:

datExpr_donor

Out[61]:

feature_name	index	ISG15	LINC01342	TTLL10-AS1	TNFRSF18	CALML6	CHD5	ICMT-DT	MIR34AHG	RBP7	...	FRMPD3	TSC22D3	KLHL13	AKAP14	RHOXF1-AS1	TMEM255A	SMIM10L2B-AS1	IL9R_ENSG00000124334	DDX3Y	EIF1AY
0	0002_AAAGGGCAGCAGCACA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.513502	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
1	0002_AACAACCAGGGTTAGC-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.583828	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
2	0002_AACCCAAAGGGCCTCT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.344490	0.0	0.0	0.0	0.0	0.0	0.0	1.271980	0.960117
3	0002_AACGAAACACAAAGTA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.207486	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
4	0002_AAGCGTTTCTTGGGCG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.435951	0.0	0.0	0.0	0.0	0.0	0.0	1.157864	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3535	079_TCTCCGAAGCTATCTG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.582282	0.0	0.0	0.0	0.0	0.0	0.0	1.038052	0.000000
3536	079_TGAATCGAGATTCGAA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.445902	0.0	0.0	0.0	0.0	0.0	0.0	1.022813	0.000000
3537	079_TGCGATAAGGTAGATT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.282646	0.0	0.0	0.0	0.0	0.0	0.0	1.282646	0.000000
3538	079_TGCTCGTAGGGTTGCA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.245300	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000
3539	079_TTCCTCTAGAGCTTTC-1	0.942032	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.516629	0.0	0.0	0.0	0.0	0.0	0.0	0.942032	0.942032

3540 rows × 1001 columns

In [62]:

# Merge 'datExpr_donor' with 'metadata', matching the 'index' column (cell names) to 'cell_id'
datExpr_donor = pd.merge(datExpr_donor, metadata[['cell_id', 'donor_id']], left_on='index', right_on='cell_id', how='left')

In [63]:

datExpr_donor

Out[63]:

	index	ISG15	LINC01342	TTLL10-AS1	TNFRSF18	CALML6	CHD5	ICMT-DT	MIR34AHG	RBP7	...	KLHL13	AKAP14	RHOXF1-AS1	TMEM255A	SMIM10L2B-AS1	IL9R_ENSG00000124334	DDX3Y	EIF1AY	cell_id	donor_id
0	0002_AAAGGGCAGCAGCACA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0002_AAAGGGCAGCAGCACA-1	CH-20-002
1	0002_AACAACCAGGGTTAGC-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0002_AACAACCAGGGTTAGC-1	CH-20-002
2	0002_AACCCAAAGGGCCTCT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.271980	0.960117	0002_AACCCAAAGGGCCTCT-1	CH-20-002
3	0002_AACGAAACACAAAGTA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0002_AACGAAACACAAAGTA-1	CH-20-002
4	0002_AAGCGTTTCTTGGGCG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.157864	0.000000	0002_AAGCGTTTCTTGGGCG-1	CH-20-002
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3535	079_TCTCCGAAGCTATCTG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.038052	0.000000	079_TCTCCGAAGCTATCTG-1	CH-21-079
3536	079_TGAATCGAGATTCGAA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.022813	0.000000	079_TGAATCGAGATTCGAA-1	CH-21-079
3537	079_TGCGATAAGGTAGATT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.282646	0.000000	079_TGCGATAAGGTAGATT-1	CH-21-079
3538	079_TGCTCGTAGGGTTGCA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	079_TGCTCGTAGGGTTGCA-1	CH-21-079
3539	079_TTCCTCTAGAGCTTTC-1	0.942032	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.942032	0.942032	079_TTCCTCTAGAGCTTTC-1	CH-21-079

3540 rows × 1003 columns
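
A left merge keeps every cell even when no matching metadata row exists, so it can be worth checking the result before moving on. The sketch below uses a small made-up dataframe (the names `expr`, `meta`, `GENE_A` and the cell and donor labels are illustrative, not part of the tutorial data) to show one possible check: `validate="many_to_one"` asks pandas to raise an error if a cell_id appears more than once in the metadata, and counting missing donor_id values reveals unmatched cells.

```python
import pandas as pd

# Toy stand-ins for the expression table and the metadata (values are made up)
expr = pd.DataFrame({"index": ["c1", "c2", "c3"], "GENE_A": [0.0, 1.1, 0.9]})
meta = pd.DataFrame({"cell_id": ["c1", "c2"], "donor_id": ["D1", "D1"]})

# Same merge pattern as above; validate raises MergeError if 'cell_id'
# is duplicated in the metadata, which would otherwise duplicate cells
merged = pd.merge(
    expr,
    meta[["cell_id", "donor_id"]],
    left_on="index",
    right_on="cell_id",
    how="left",
    validate="many_to_one",
)

# Cells without a metadata match end up with a missing donor_id
n_unmatched = int(merged["donor_id"].isna().sum())
# A left merge on unique keys should not change the number of rows
same_row_count = len(merged) == len(expr)

print(f"unmatched cells: {n_unmatched}, row count preserved: {same_row_count}")
```

In this toy case one cell (`c3`) has no metadata entry, so it would be flagged; in the tutorial output above, the row count stays at 3540 before and after the merge.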

In [64]:

# Set the cell names as the index again
datExpr_donor.set_index('index', inplace=True)

In [65]:

datExpr_donor

Out[65]:

	ISG15	LINC01342	TTLL10-AS1	TNFRSF18	CALML6	CHD5	ICMT-DT	MIR34AHG	RBP7	MTOR-AS1	...	KLHL13	AKAP14	RHOXF1-AS1	TMEM255A	SMIM10L2B-AS1	IL9R_ENSG00000124334	DDX3Y	EIF1AY	cell_id	donor_id
index
0002_AAAGGGCAGCAGCACA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0002_AAAGGGCAGCAGCACA-1	CH-20-002
0002_AACAACCAGGGTTAGC-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0002_AACAACCAGGGTTAGC-1	CH-20-002
0002_AACCCAAAGGGCCTCT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.271980	0.960117	0002_AACCCAAAGGGCCTCT-1	CH-20-002
0002_AACGAAACACAAAGTA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0002_AACGAAACACAAAGTA-1	CH-20-002
0002_AAGCGTTTCTTGGGCG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.157864	0.000000	0002_AAGCGTTTCTTGGGCG-1	CH-20-002
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.038052	0.000000	079_TCTCCGAAGCTATCTG-1	CH-21-079
079_TGAATCGAGATTCGAA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.022813	0.000000	079_TGAATCGAGATTCGAA-1	CH-21-079
079_TGCGATAAGGTAGATT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.282646	0.000000	079_TGCGATAAGGTAGATT-1	CH-21-079
079_TGCTCGTAGGGTTGCA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	079_TGCTCGTAGGGTTGCA-1	CH-21-079
079_TTCCTCTAGAGCTTTC-1	0.942032	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.942032	0.942032	079_TTCCTCTAGAGCTTTC-1	CH-21-079

3540 rows × 1002 columns

In [66]:

# Remove the 'cell_id' column, which now duplicates the index
datExpr_donor.drop(columns=['cell_id'], inplace=True)

In [67]:

datExpr_donor

Out[67]:

	ISG15	LINC01342	TTLL10-AS1	TNFRSF18	CALML6	CHD5	ICMT-DT	MIR34AHG	RBP7	MTOR-AS1	...	TSC22D3	KLHL13	AKAP14	RHOXF1-AS1	TMEM255A	SMIM10L2B-AS1	IL9R_ENSG00000124334	DDX3Y	EIF1AY	donor_id
index
0002_AAAGGGCAGCAGCACA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.513502	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	CH-20-002
0002_AACAACCAGGGTTAGC-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.583828	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	CH-20-002
0002_AACCCAAAGGGCCTCT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.344490	0.0	0.0	0.0	0.0	0.0	0.0	1.271980	0.960117	CH-20-002
0002_AACGAAACACAAAGTA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.207486	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	CH-20-002
0002_AAGCGTTTCTTGGGCG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.435951	0.0	0.0	0.0	0.0	0.0	0.0	1.157864	0.000000	CH-20-002
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
079_TCTCCGAAGCTATCTG-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.582282	0.0	0.0	0.0	0.0	0.0	0.0	1.038052	0.000000	CH-21-079
079_TGAATCGAGATTCGAA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.445902	0.0	0.0	0.0	0.0	0.0	0.0	1.022813	0.000000	CH-21-079
079_TGCGATAAGGTAGATT-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.282646	0.0	0.0	0.0	0.0	0.0	0.0	1.282646	0.000000	CH-21-079
079_TGCTCGTAGGGTTGCA-1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.245300	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	CH-21-079
079_TTCCTCTAGAGCTTTC-1	0.942032	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.516629	0.0	0.0	0.0	0.0	0.0	0.0	0.942032	0.942032	CH-21-079

3540 rows × 1001 columns
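
The reset, merge, set-index and drop steps above can also be written more compactly. The following is a minimal sketch on made-up data, not the tutorial's own method: it assumes the cell names in the expression index match the `cell_id` values in the metadata, and the gene values, barcodes and the helper name `donor_lookup` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for 'datExpr' and 'metadata' (all values are made up)
datExpr = pd.DataFrame(
    {"ISG15": [0.0, 1.2, 0.0], "TSC22D3": [1.5, 1.3, 1.6]},
    index=["0002_AAA-1", "0002_AAC-1", "079_TTC-1"],
)
metadata = pd.DataFrame(
    {
        "cell_id": ["0002_AAA-1", "0002_AAC-1", "079_TTC-1"],
        "donor_id": ["CH-20-002", "CH-20-002", "CH-21-079"],
    }
)

# Build a cell_id -> donor_id lookup (hypothetical helper name)
donor_lookup = metadata.set_index("cell_id")["donor_id"]

# Join on the index: cell names stay as the index, donor_id becomes the last column
datExpr_donor = datExpr.join(donor_lookup, how="left")

print(datExpr_donor)
```

Because the join works on the index directly, no temporary `index` or `cell_id` columns are created and there is nothing to drop afterwards.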

In [68]:

# Save the expression matrix with donor_id
datExpr_donor.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_donorid_singlecell.csv', index=True)

Now that the gene expression dataframe carries a donor_id for every cell, we can aggregate the cells of each donor into a single profile (pseudobulking).

In [69]:

# Aggregate expression by donor ID (summing the values)
pseudobulk_df = datExpr_donor.groupby('donor_id').sum()

In [70]:

pseudobulk_df

Out[70]:

	ISG15	LINC01342	TTLL10-AS1	TNFRSF18	CALML6	CHD5	ICMT-DT	MIR34AHG	RBP7	MTOR-AS1	...	FRMPD3	TSC22D3	KLHL13	AKAP14	RHOXF1-AS1	TMEM255A	SMIM10L2B-AS1	IL9R_ENSG00000124334	DDX3Y	EIF1AY
donor_id
CH-20-001	6.380902	0.00000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	53.239479	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	21.632603	17.641195
CH-20-002	12.606750	2.33599	0.000000	0.000000	0.00000	0.000000	1.089918	0.000000	1.158743	1.173824	...	0.000000	112.643967	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	45.432411	22.809191
CH-20-004	12.302510	0.00000	0.000000	21.512184	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	42.873409	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	15.570595	20.173725
CH-20-005	18.603716	1.16925	1.232658	4.975880	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	1.112746	190.337738	0.000000	0.0000	1.191559	0.000000	0.000000	0.000000	6.931139	1.071742
CH-21-002	13.705297	0.00000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	44.942261	0.000000	0.0000	0.000000	0.000000	0.000000	1.323198	0.000000	0.000000
CH-21-006	4.377715	0.00000	0.000000	23.782143	0.00000	0.000000	0.000000	0.000000	1.023552	0.000000	...	0.000000	12.741602	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	5.349793	12.981407
CH-21-008	18.058025	0.00000	0.000000	44.614342	0.00000	1.201673	0.000000	0.000000	0.000000	0.000000	...	0.000000	76.893723	0.000000	0.0000	0.000000	0.000000	0.000000	1.080360	1.188176	2.377049
CH-21-013	21.395964	0.00000	0.000000	30.426510	0.00000	0.000000	1.235703	0.000000	0.000000	0.000000	...	0.000000	54.458328	0.000000	0.0000	0.000000	0.000000	1.117969	1.236817	23.543072	53.250420
CH-21-014	13.436963	0.00000	0.000000	11.067089	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	32.248600	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	17.709280	21.636190
CH-21-017	22.916807	0.00000	0.000000	9.076924	0.00000	0.000000	0.000000	0.000000	2.478934	0.000000	...	0.000000	188.600067	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	49.114601	40.454937
CH-21-020	197.794693	0.00000	0.000000	122.788269	0.00000	0.000000	0.000000	0.000000	1.047435	0.000000	...	0.000000	197.616577	0.000000	0.0000	0.000000	0.000000	0.000000	0.765914	88.202682	173.938080
CH-21-021	13.898113	0.00000	0.000000	11.169237	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	20.431047	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	11.017612	21.158054
CH-21-028	7.210576	0.00000	1.066841	1.321003	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	57.428059	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	2.989964	0.000000
CH-21-029	9.007506	0.00000	0.000000	1.928463	1.21185	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	157.941895	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	3.745571	2.051687
CH-21-031	30.211197	0.00000	0.000000	40.325451	0.00000	0.000000	0.000000	2.130981	0.000000	0.000000	...	1.244156	12.550498	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	1.227465	0.906813
CH-21-033	21.972580	0.00000	0.000000	84.504501	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	152.658005	1.167664	1.2505	0.000000	1.199426	0.000000	0.000000	45.661453	142.600739
CH-21-034	54.934029	0.00000	0.000000	147.552780	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.886594	167.975906	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	0.000000	1.075679
CH-21-036	17.018766	0.00000	0.000000	2.483573	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	88.924919	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	39.051266	10.004631
CH-21-037	150.473450	0.00000	0.000000	53.255013	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	38.325649	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	33.786545	58.778214
CH-21-046	9.337872	0.00000	0.000000	28.949800	0.00000	1.123670	0.000000	0.000000	0.000000	0.000000	...	0.000000	28.600826	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	9.238980	12.119887
CH-21-073	4.982193	0.00000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	40.201653	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	27.179262	2.954510
CH-21-074	3.954194	0.00000	0.000000	0.000000	0.00000	0.000000	0.000000	1.127058	0.000000	0.000000	...	0.000000	18.354240	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	6.401694	2.069999
CH-21-077	33.969109	0.00000	0.000000	3.333775	0.00000	0.000000	0.000000	0.000000	1.115637	0.000000	...	0.000000	161.007568	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	1.646068	0.000000
CH-21-079	7.030363	0.00000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	40.449474	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	18.332941	11.980942

24 rows × 1000 columns
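
To make the aggregation concrete, here is a small sketch of the same `groupby(...).sum()` pattern on made-up data (the gene names, cell labels and donor labels are illustrative). Since a per-donor sum grows with the number of cells a donor contributes, the sketch also counts cells per donor; whether and how to account for that is left as a choice for the analysis, not something the tutorial prescribes.

```python
import pandas as pd

# Toy single-cell table: three cells from two donors (values are made up)
cells = pd.DataFrame(
    {
        "GENE_A": [1.0, 2.0, 0.5],
        "GENE_B": [0.0, 1.0, 3.0],
        "donor_id": ["D1", "D1", "D2"],
    },
    index=["c1", "c2", "c3"],
)

# Sum each gene over all cells belonging to the same donor
pseudobulk = cells.groupby("donor_id").sum()

# Number of cells that went into each donor's profile
cells_per_donor = cells.groupby("donor_id").size()

print(pseudobulk)
print(cells_per_donor)
```

The result has one row per donor and one column per gene, which mirrors the 24 rows × 1000 columns shape of the pseudobulk table above.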

In [71]:

# Save the pseudobulk expression matrix (one row per donor, indexed by donor_id)
pseudobulk_df.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_pseudobulk.csv', index=True)
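
When the saved file is read back in a later step, the donor IDs need to be restored as the index. The sketch below shows one way to do this with `index_col=0`, using a made-up two-donor table and a temporary directory instead of the tutorial's path; the file name `pseudobulk_example.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy pseudobulk table with donor_id as the index (values are made up)
pseudobulk_df = pd.DataFrame(
    {"GENE_A": [3.0, 0.5], "GENE_B": [1.0, 3.0]},
    index=pd.Index(["D1", "D2"], name="donor_id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "pseudobulk_example.csv")
    # index=True writes donor_id as the first column of the CSV
    pseudobulk_df.to_csv(out_path, index=True)
    # index_col=0 turns that first column back into the index on load
    reloaded = pd.read_csv(out_path, index_col=0)

print(reloaded)
```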

We now have the pseudobulked expression data and the corresponding metadata dataframe, which are the inputs needed to start the correlation network analysis.
