Gene Set Enrichment Analysis using Fisher Exact Test¶

In [ ]:

Copied!





import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests
import pickle
import seaborn as sns
from scipy.stats import fisher_exact
import gseapy as gp
from scipy.stats import hypergeom
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests
import pickle
import seaborn as sns
from scipy.stats import fisher_exact
import gseapy as gp
from scipy.stats import hypergeom

In [ ]:

Read in the required data

In [ ]:

Copied!

datExpr = pd.read_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_pseudobulk.csv', index_col = 0)
datExpr = pd.read_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_pseudobulk.csv', index_col = 0)

In [ ]:

Copied!

with open('/ReCoDE-Gene-Network-Analysis/data/other/separated_communities.pkl', 'rb') as file:
        separated_communities = pickle.load(file)
with open('/ReCoDE-Gene-Network-Analysis/data/other/separated_communities.pkl', 'rb') as file:
        separated_communities = pickle.load(file)

In [ ]:

Copied!

# Convert each module in separated_communities to a set
modules = {f'Module_{i+1}': set(community) for i, community in enumerate(separated_communities)}
# Convert each module in separated_communities to a set
modules = {f'Module_{i+1}': set(community) for i, community in enumerate(separated_communities)}

In [ ]:

Copied!

modules
modules

Step 1: Load in Gene-Sets¶

This exercise will show how to use GMT files that contain gene-set information downloaded from the Human MSigDB Collections database.

In [ ]:

Copied!





#Load the GMT file
def read_gmt_file(file_path):
    gene_sets = {}
    with open(file_path, 'r') as file:
        for line in file:
            parts = line.strip().split('\t')
            gene_set_name = parts[0]
            genes = parts[2:]
            gene_sets[gene_set_name] = set(genes)
    return gene_sets
#Load the GMT file
def read_gmt_file(file_path):
    gene_sets = {}
    with open(file_path, 'r') as file:
        for line in file:
            parts = line.strip().split('\t')
            gene_set_name = parts[0]
            genes = parts[2:]
            gene_sets[gene_set_name] = set(genes)
    return gene_sets

In [ ]:

Copied!

gmt_file = '/ReCoDE-Gene-Network-Analysis/data/other/h.all.v2023.2.Hs.symbols.gmt'

#Read the file in using the read_gmt_file function
gene_sets = read_gmt_file(gmt_file)
gmt_file = '/ReCoDE-Gene-Network-Analysis/data/other/h.all.v2023.2.Hs.symbols.gmt'

#Read the file in using the read_gmt_file function
gene_sets = read_gmt_file(gmt_file)

In [ ]:

Copied!

gene_sets
gene_sets

Step 2: Calculate Fisher's Exact Test for Enrichment¶

In [ ]:

Copied!





# Container for results
results = []

# Iterate over each module
for module_name, module_genes in modules.items():
    module_genes_in_expr = module_genes.intersection(datExpr.columns)
    
    # Check if module_genes_in_expr is empty to avoid issues
    if len(module_genes_in_expr) == 0:
        continue

    # Iterate over each gene set in the GMT file
    for gene_set_name, gene_set_genes in gene_sets.items():
        overlap_genes = module_genes_in_expr.intersection(gene_set_genes)
        non_overlap_module_genes = module_genes_in_expr.difference(gene_set_genes)
        non_overlap_gene_set_genes = gene_set_genes.difference(module_genes_in_expr)

        # Calculate the contingency table
        a = len(overlap_genes)
        b = len(non_overlap_module_genes)
        c = len(non_overlap_gene_set_genes)
        d = len(set(datExpr.columns)) - (a + b + c)  # Adjust d to only consider genes in datExpr
        
        # Ensure all values in the contingency table are non-negative
        if a < 0 or b < 0 or c < 0 or d < 0:
            continue

        # Perform Fisher's exact test
        odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative='two-sided')
        
        results.append({
            'Module': module_name,
            'Gene Set': gene_set_name,
            'Overlap Count': a,
            'Odds Ratio': odds_ratio,
            'P-value': p_value
        })
# Container for results
results = []

# Iterate over each module
for module_name, module_genes in modules.items():
    module_genes_in_expr = module_genes.intersection(datExpr.columns)
    
    # Check if module_genes_in_expr is empty to avoid issues
    if len(module_genes_in_expr) == 0:
        continue

    # Iterate over each gene set in the GMT file
    for gene_set_name, gene_set_genes in gene_sets.items():
        overlap_genes = module_genes_in_expr.intersection(gene_set_genes)
        non_overlap_module_genes = module_genes_in_expr.difference(gene_set_genes)
        non_overlap_gene_set_genes = gene_set_genes.difference(module_genes_in_expr)

        # Calculate the contingency table
        a = len(overlap_genes)
        b = len(non_overlap_module_genes)
        c = len(non_overlap_gene_set_genes)
        d = len(set(datExpr.columns)) - (a + b + c)  # Adjust d to only consider genes in datExpr
        
        # Ensure all values in the contingency table are non-negative
        if a < 0 or b < 0 or c < 0 or d < 0:
            continue

        # Perform Fisher's exact test
        odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative='two-sided')
        
        results.append({
            'Module': module_name,
            'Gene Set': gene_set_name,
            'Overlap Count': a,
            'Odds Ratio': odds_ratio,
            'P-value': p_value
        })

In [ ]:

Copied!





# Convert results to DataFrame for further analysis
results_df = 

# Adjust p-values for multiple testing using the multipletests function
results_df['Adjusted P-value'] = 

# Display results
print(results_df)
# Convert results to DataFrame for further analysis
results_df = 

# Adjust p-values for multiple testing using the multipletests function
results_df['Adjusted P-value'] = 

# Display results
print(results_df)

Let's focus on one module for the purpose of the enrichment analysis:

In [ ]:

Copied!

# Filter for Module 1 and sort by overlap count.
module_1_results = 

#Sort the module results by overlap count.
top_10_enriched_gene_sets = module_1_results.sort_values(by='Overlap Count', ascending=False).head(10)

print(top_10_enriched_gene_sets)
# Filter for Module 1 and sort by overlap count.
module_1_results = 

#Sort the module results by overlap count.
top_10_enriched_gene_sets = module_1_results.sort_values(by='Overlap Count', ascending=False).head(10)

print(top_10_enriched_gene_sets)

In [ ]:

Copied!





#Extract data for plotting (top 10 pathways with highest overlap count)
top_10_enriched_gene_sets
y_labels = top_10_enriched_gene_sets['Gene Set']
x = top_10_enriched_gene_sets['Odds Ratio']
sizes = top_10_enriched_gene_sets['Overlap Count'] * 20
p_values = top_10_enriched_gene_sets['P-value']
adjusted_p_values = top_10_enriched_gene_sets['Adjusted P-value']
#Extract data for plotting (top 10 pathways with highest overlap count)
top_10_enriched_gene_sets
y_labels = top_10_enriched_gene_sets['Gene Set']
x = top_10_enriched_gene_sets['Odds Ratio']
sizes = top_10_enriched_gene_sets['Overlap Count'] * 20
p_values = top_10_enriched_gene_sets['P-value']
adjusted_p_values = top_10_enriched_gene_sets['Adjusted P-value']

Step 3: Visualise Enriched Pathways¶

In [ ]:

Copied!





# Plot the scatter plot for the top 10 pathways
plt.figure(figsize=(12, 8))

# Main scatter plot
sc = plt.scatter(x, range(10), s=sizes, c=-adjusted_p_values, cmap='viridis', alpha=0.7)

# Set labels and title
plt.yticks(range(10), y_labels)  # Set y-axis labels to pathway names
plt.xlabel('Odds Ratio')
plt.ylabel('Pathway')
plt.title('Top 10 Enriched Gene Sets')

# Add a legend for dot sizes (overlap count) outside the plot
handles, labels = sc.legend_elements(prop="sizes", alpha=0.6)
legend_labels = [f'Overlap Count: {int(float(label.split("{")[1].split("}")[0]) // 20)}' for label in labels]
legend = plt.legend(handles, legend_labels, loc="center left", bbox_to_anchor=(1.3, 0.5), title="Overlap", frameon=False)

# Add colorbar
cbar = plt.colorbar(sc, shrink=0.2)  # Shrink colorbar size
cbar.ax.xaxis.set_label_position('top')
cbar.ax.set_title('Adj.pval', loc='left') 

# Adjust layout to prevent overlap between legends
plt.subplots_adjust(right=0.75)  # Adjust the right margin to make space for the legends

# Show the plot
plt.grid(True)
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
# Plot the scatter plot for the top 10 pathways
plt.figure(figsize=(12, 8))

# Main scatter plot
sc = plt.scatter(x, range(10), s=sizes, c=-adjusted_p_values, cmap='viridis', alpha=0.7)

# Set labels and title
plt.yticks(range(10), y_labels)  # Set y-axis labels to pathway names
plt.xlabel('Odds Ratio')
plt.ylabel('Pathway')
plt.title('Top 10 Enriched Gene Sets')

# Add a legend for dot sizes (overlap count) outside the plot
handles, labels = sc.legend_elements(prop="sizes", alpha=0.6)
legend_labels = [f'Overlap Count: {int(float(label.split("{")[1].split("}")[0]) // 20)}' for label in labels]
legend = plt.legend(handles, legend_labels, loc="center left", bbox_to_anchor=(1.3, 0.5), title="Overlap", frameon=False)

# Add colorbar
cbar = plt.colorbar(sc, shrink=0.2)  # Shrink colorbar size
cbar.ax.xaxis.set_label_position('top')
cbar.ax.set_title('Adj.pval', loc='left') 

# Adjust layout to prevent overlap between legends
plt.subplots_adjust(right=0.75)  # Adjust the right margin to make space for the legends

# Show the plot
plt.grid(True)
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()

KEGG Pathway Enrichment Analysis using Gseapy¶

In [ ]:

Copied!

kegg = gp.get_library(name='KEGG_2016', organism='Human')
kegg = gp.get_library(name='KEGG_2016', organism='Human')

In [ ]:

Copied!

kegg
kegg

Try out libraries other than KEGG too, such as GO_Molecular_Function_2018.

In [ ]:

Copied!

gene_sets
gene_sets

The kegg results need to be reformatted into a dictionary for enrichment analysis:

In [ ]:

Copied!





# Initialize an empty dictionary to store pathway name: gene names pairs
pathway_dict = {}

# Iterate through the pathway_gene_dict and populate the pathway_dict
for pathway_name, gene_list in kegg.items():
    # Convert gene list to a set to remove duplicates
    gene_set = set(gene_list)
    # Assign gene set to pathway name in pathway_dict
    pathway_dict[pathway_name] = gene_set

# Display the resulting dictionary
print(pathway_dict)
# Initialize an empty dictionary to store pathway name: gene names pairs
pathway_dict = {}

# Iterate through the pathway_gene_dict and populate the pathway_dict
for pathway_name, gene_list in kegg.items():
    # Convert gene list to a set to remove duplicates
    gene_set = set(gene_list)
    # Assign gene set to pathway name in pathway_dict
    pathway_dict[pathway_name] = gene_set

# Display the resulting dictionary
print(pathway_dict)

Create your own results dataframe and carry out the Fisher Exact Test on the KEGG results. If you get stuck, you can use the previous example as a template.

In [ ]:

Copied!

#Container for results
#Container for results

In [ ]:

Copied!





# Convert results to DataFrame for further analysis
results_df = 

# Adjust p-values for multiple testing
results_df['Adjusted P-value'] =

# Display results
print(results_df)
# Convert results to DataFrame for further analysis
results_df = 

# Adjust p-values for multiple testing
results_df['Adjusted P-value'] =

# Display results
print(results_df)

In [ ]:

Copied!

# Filter for Module 1 and sort by overlap count
module_1_results = 
top_10_enriched_gene_sets = 

print(top_10_enriched_gene_sets)
# Filter for Module 1 and sort by overlap count
module_1_results = 
top_10_enriched_gene_sets = 

print(top_10_enriched_gene_sets)

In [ ]:

Copied!





# Extract data for plotting (top 10 pathways with highest overlap count)
top_10_enriched_gene_sets
y_labels =
x = 
sizes = 
p_values =
adjusted_p_values =
# Extract data for plotting (top 10 pathways with highest overlap count)
top_10_enriched_gene_sets
y_labels =
x = 
sizes = 
p_values =
adjusted_p_values = 

Visualisation of KEGG Enrichment¶

In [ ]:

Copied!

# Plot the scatter plot for the top 10 pathways
# Plot the scatter plot for the top 10 pathways

In [ ]:

External Reading:

GSEA website: https://www.gsea-msigdb.org/gsea/index.jsp
GSEA paper: https://www.pnas.org/doi/10.1073/pnas.0506580102
Fisher exact test: https://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/
Odd's Ratio: https://www.ncbi.nlm.nih.gov/books/NBK431098/
Alternative GSEA package: https://maayanlab.cloud/Enrichr/

Exercise Questions:

What is gene set enrichment analysis?
What is Fisher Exact Test?
What is the contingency table in Fisher's Exact Test, and how is it constructed for GSEA?
What does the odds ratio represent in the context of Fisher's Exact Test?
What are some potential limitations of using Fisher's Exact Test for enrichment analysis?
What are some alternatives to Fisher's Exact Test for enrichment analysis?
How does GSEA differ from use of Fisher Exact Test?
Enhance the visualisation of enrichment results. Plot a scatter plot for the top 10 enriched gene sets for Module 2. Customise the plot with different color maps and point markers. Annotate the top 3 most significant gene sets on the scatter plot with their names and p-values.
Think of a different way to visualise the enrichment results instead and plot them. For example, plot a heatmap of overlap counts across modules and gene sets.
Carry out a correlation analysis of enrichment scores across modules. Investigate the correlation between enrichment scores across different modules to understand how similarly different gene sets are enriched across various modules.

Answers: