Network Analysis¶
To identify which sub-networks i.e. communities, are of interest, we'll calculate the correlation between the networks and the relevant clinical metadata variables
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from sklearn.decomposition import PCA
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests
import pickle
import seaborn as sns
Read in required data
#Load in the required data
datExpr = pd.read_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_pseudobulk.csv', index_col = 0)
metadata = pd.read_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_metadata_pseudobulk.csv', index_col = 0)
datExpr
metadata
gene_names = datExpr.columns
gene_names
with open('/ReCoDE-Gene-Network-Analysis/data/other/separated_communities.pkl', 'rb') as file:
separated_communities = pickle.load(file)
print(separated_communities)
Step 1: Module Eigengene Calculation¶
Module eigengene calculation is a concept used in the analysis of gene expression data.
Module/sub-network: A module refers to a group of genes that exhibit similar expression patterns across samples. Modules are often identified using clustering algorithms applied to gene expression data.
Eigengene: An eigengene represents the overall expression profile of a module. It is calculated as the first principal component of the gene expression profiles within the module. Essentially, the eigengene captures the main axis of variation or the common expression pattern shared by the genes within the module.
We will be using module eigengenes as representations for further downstream analysis.
# Initialise a DataFrame to store module eigengenes with the rownames i.e. the gene names from the expression matrix as the rownames in the new dataframe
module_eigengenes =
# Calculate the module eigengene for each community
for i, community in enumerate(separated_communities):
community_genes = [gene for gene in community if gene in gene_names]
if community_genes:
community_expr = datExpr[community_genes]
pca = PCA(n_components=1)
eigengene = pca.fit_transform(community_expr)
module_eigengenes[f'Module_{i+1}'] = eigengene[:, 0]
Based on the metadata, which columns seem important/interesting and which do you think should be dropped for downstream analysis?
metadata
metadata2 = metadata.drop(columns=[])
Next we want to merge the metadata with the module eigengene dataframe using the pd.concat function:
merged_data =
merged data
Since you have already learned how to calculate pearson correlation in the previous notebook, calculate Pearson correlation on the merged data.
correlation_matrix2 = merged_data
correlation_matrix2
#The correlation matrix needs to be reformatted into the correct format:
correlation_matrix3 = correlation_matrix2.drop(['nCount_RNA', 'nFeature_RNA', 'percent.mt', 'development_stage',
'male', 'female', 'CH', 'normal', 'DNMT3A', 'TET2', 'NoMutation'])
correlation_matrix3
correlation_matrix3 = correlation_matrix3.drop(correlation_matrix3.columns[4:], axis=1)
correlation_matrix3
We now need to calculate the p-values for the correlations on the merged data.
# Initialize an empty DataFrame to store p-values
module_p_values = pd.DataFrame(index=module_eigengenes.columns, columns=metadata2.columns)
# Calculate p-values for correlations between module eigengenes and metadata
for module in module_eigengenes.columns:
for metadata_column in metadata2.columns:
# Calculate correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(module_eigengenes[module], metadata2[metadata_column])
# Store p-value in the DataFrame
module_p_values.loc[module, metadata_column] = p_value
print(module_p_values)
We now need to calculate the adjusted p-values. Calculate adjusted p-values using FDR correction. Use the previous example on how to calculate p-values to create a for loop to calculate the adjusted p-values. Please complete the for loop
adjusted_p_values = pd.DataFrame()
for column in :
p_values = module_p_values[column].astype(float)
# Perform FDR correction
_, adj_p_values, _, _ = multipletests(, method='fdr_bh')
[column] = adj_p_values
print(adjusted_p_values)
Step 2: Correlation Heatmap¶
We want a method to highlight interesting modules. For this purpose we will be creating a heatmap of the correlations and associated adjusted p-values between the modules and the metadata.
# Function to annotate heatmap with correlation values and p-values
def annotate_heatmap_with_p_values(ax, correlation_matrix, p_values, threshold=0.05):
stars = np.empty(p_values.shape, dtype='<U2')
stars[p_values > threshold] = ''
stars[p_values <= threshold] = '*'
for i in range(correlation_matrix.shape[0]):
for j in range(correlation_matrix.shape[1]):
# Format annotation string with correlation value
annotation_corr = f"{correlation_matrix.iloc[i, j]:.2f}"
# Format p-value in brackets
annotation_p_value = f"({p_values.iloc[i, j]:.2f})"
# Add star for significant p-values
if p_values.iloc[i, j] <= threshold:
annotation_p_value += '*'
# Add annotations
ax.text(j+0.5, i+0.4, annotation_corr, ha='center', va='center', color='black')
ax.text(j+0.5, i+0.6, annotation_p_value, ha='center', va='center', color='black')
Visualise the correlation matrix as a heatmap with correlation values and p-values You have already seen how to create figures. Fill in this example to create your heatmap
#Insert figure size
#Use sns.heatmap function to create the heatmap using the formatted correlation matrix
heatmap = sns.heatmap(, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
#Insert a title
# Annotate heatmap with correlation values and p-values
annotate_heatmap_with_p_values(heatmap, correlation_matrix3, adjusted_p_values)
#Show the plot
External Reading:
- PCA: https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186
- Heatmaps: https://www.atlassian.com/data/charts/heatmap-complete-guide#:~:text=What%20is%20a%20heatmap%3F,in%20the%20corresponding%20cell%20range.
- False Discovery Rate: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1716-1
Exercise Questions:
- What is a correlation and why is it used in gene co-expression analysis?
- What is the purpose of calculating module eigengenes in the context of gene expression data analysis?
- Explain the rationale behind using PCA to calculate module eigengenes. How does PCA help in capturing the main variation in the gene expression data within a module?
- Why are adjusted p-values used instead of p-values?
- In the context of merging module eigengenes with clinical metadata, why might it be important to drop certain columns like 'donor_id.1', 'scType_celltype', 'tissue_type', 'cell_type', 'tissue', and 'MUTATION'?
- What is fdr_bh and explain it fully?
- Based on the heatmap, what can you summarise?
- It can be important to check for missing values within the data. How would you do this? Also, if there are missing values, how would you handle them in this context?
- Perform FDR adjustment on the p-values and compare the number of significant correlations before and after correction.
- Plot the expression of eigengenes for a selected module against a specific metadata variable and provide an interpretation of the plot.
Answers: