Reproducible Spatial Transcriptomics Pipeline with RSE Best Practices
Introduction
This exemplar details an analysis pipeline for spatial transcriptomics (10X Xenium platform).
Spatial transcriptomics
Below is a representative image of spatial transcriptomics data.
Spatial transcriptomics analysis pipeline
The pipeline covers preprocessing, quality control, dimensionality reduction, clustering, annotation, viewing spatial images, and spatial statistics (squidpy and MuSpAn).
Cell segmentation is not included in this pipeline as it is performed prior to the analysis using the 10X Genomics Xenium software. If you would like to segment the cells yourself, please refer to the 10X Genomics Nucleus and Cell Segmentation Algorithms for more information.
Best Practices for Software Engineering
In addition to the analysis pipeline, we highlight several good software engineering practices, including version control, containerization, linting, and continuous integration. Details on these practices and how to implement them can be found in the Best Practices for Software Engineering section of the documentation.
Author information
This exemplar was developed at Imperial College London by Sara Patti in collaboration with Adrian D'Alessandro from Research Software Engineering and Jesus Urtasun from Research Computing & Data Science at the Early Career Researcher Institute.
Learning Outcomes
After completing this exemplar, students will be able to:
- Describe the key steps in spatial transcriptomic analysis
- Analyze spatial transcriptomic data and apply spatial statistical methods
- Design and build a reproducible analysis pipeline
- Apply research software engineering (RSE) best practices detailed in the RSE Best Practices section
Target Audience
- Scientists interested in analyzing spatial transcriptomics data
- Biologists interested in developing bioinformatic pipelines
Prerequisites
Prior to undertaking this exemplar, learners should have the following skills and knowledge:
- Python
- Command line interface (CLI)
Although not necessary, we recommend the following skills and knowledge to enhance the learning experience:
- Spatial transcriptomics data and underlying principles (e.g. 10X Genomics Xenium)
- Data analysis and statistics
- Familiarity with the scverse ecosystem (e.g. scanpy, squidpy)
- Structuring Python projects and packages
Academic
- Familiarity with biological concepts and principles (e.g. mRNA, gene expression, transcriptomics)
- Basic understanding of spatial transcriptomics platforms and datasets
- Familiarity with single-cell RNA sequencing (scRNA-seq) analysis
System
- Python 3.10+
- Anaconda or Miniconda (required for Intel Mac users). For more details, please refer to the Installation Guide.
Getting Started
- Start by cloning the repository to your local machine in the directory of your choice:
  git clone https://github.com/ImperialCollegeLondon/ReCoDe-spatial-transcriptomics.git
- Download the Xenium Lung FFPE data.
  - Data can be downloaded from the 10x Genomics website, or directly from the command line.
    - If downloading from the website, download the Xenium_V1_Human_Lung_Cancer_Addon_FFPE_outs.zip file.
    - If downloading from the command line, use the following command:
      curl -O https://cf.10xgenomics.com/samples/xenium/2.0.0/Xenium_V1_Human_Lung_Cancer_Addon_FFPE/Xenium_V1_Human_Lung_Cancer_Addon_FFPE_outs.zip
  - Unzip the downloaded file.
    - If you downloaded the file from the website, unzip it using your preferred method.
    - If you downloaded the file from the command line, use the following command:
      unzip Xenium_V1_Human_Lung_Cancer_Addon_FFPE_outs.zip
- Create a new virtual environment using conda or venv. Full details on how to set up the environment and install the necessary packages can be found in the Installation Guide.
  If you are using venv, run the following commands:
  cd ReCoDe-spatial-transcriptomics  # Ensure you are in the root directory of the repo
  python -m venv recode_st  # Create a new virtual environment named recode_st
  source recode_st/bin/activate  # On Windows use: recode_st\Scripts\activate
  pip install -r requirements.txt  # Install required packages
  pip install -e .  # Install the recode_st package in editable mode, as defined by pyproject.toml
  If you are using conda, run the following commands:
  cd ReCoDe-spatial-transcriptomics  # Ensure you are in the root directory of the repo
  conda env create -f environment.yml
  conda activate recode_st
  pip install --no-build-isolation --no-deps -e .
  pip install https://docs.muspan.co.uk/code/latest.zip  # If you need the MuSpAn modules
  The newest versions of some packages do not support older Macs with Intel CPUs, so we recommend using the conda environment on these systems. If you are using an Apple Silicon Mac, you can use either conda or venv.
- Update the config.toml file with the relevant paths and parameters for your analysis. This file contains configuration settings for the analysis pipeline, such as paths to data files and parameters for the various steps in the pipeline. Additional details can be found in the Configuration Management section of the documentation.
- Run the analysis pipeline by executing the main script:
  python -m recode_st config.toml
Software Tools
These dependencies are required to run the exemplar:
- matplotlib
- numpy
- pandas[excel]
- torch
- scanpy[leiden]
- spatialdata
- spatialdata-io
- squidpy
- seaborn
- zarr
- pydantic
These dependencies are required to develop the exemplar:
- mkdocs
- mkdocs-material
- ruff
- pre-commit
- pytest
Project Structure
Overview of code organisation and structure.
├── analysis                      # created by the pipeline; contains the results of the analysis
├── data
│   ├── selected_cells_stats.csv  # subset of cells used for the spatial analysis
│   ├── xenium                    # download and unzip the data here
│   └── xenium.zarr               # created by the pipeline
├── docs
│   ├── installation.md           # additional doc.md files
│   └── assets
├── src
│   └── recode_st
│       ├── __init__.py
│       ├── __main__.py
│       ├── annotate.py
│       ├── config.py
│       ├── dimension_reduction.py
│       ├── format_data.py
│       ├── helper_function.py
│       ├── logging_config.py
│       ├── ms_spatial_graph.py
│       ├── ms_spatial_stat.py
│       ├── muspan.py
│       ├── qc.py
│       ├── spatial_statistics.py
│       └── view_images.py
├── tests
│   ├── test_config.py
│   ├── test_helper_function.py
│   ├── test_logging_config.py
│   └── test_main.py
└── utils
Code is organised into logical components:
- src contains the code for core modules
- data contains the needed datasets (the user must download and unzip the data)
- docs for documentation
- tests for testing scripts
Roadmap
Preprocessing & Quality Control
Goal: Ensure clean, usable spatial gene expression data.
It is critical to preprocess the data and perform quality control before proceeding with the analysis. This step ensures that the data is clean and usable by removing low-quality cells and low-quality transcripts.
Steps:
- Calculate quality metrics
- Filter low-quality genes and cells
- Normalize and transform gene counts
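The steps above can be sketched on a toy count matrix using only the standard library. This is a minimal illustration, not the pipeline's implementation: the counts, thresholds, and function names are made up. scanpy offers comparable operations (`scanpy.pp.calculate_qc_metrics`, `scanpy.pp.filter_cells`, `scanpy.pp.filter_genes`, `scanpy.pp.normalize_total`, `scanpy.pp.log1p`), although whether the pipeline calls exactly these is an assumption.

```python
import math

# Toy count matrix: rows are cells, columns are genes (values are made up).
genes = ["EPCAM", "PTPRC", "PECAM1"]
counts = [
    [12, 0, 3],  # cell 0
    [0, 0, 1],   # cell 1: very few transcripts
    [5, 20, 0],  # cell 2
    [8, 2, 0],   # cell 3
]


def qc_metrics(matrix):
    """Return (total counts, number of detected genes) for each cell."""
    return [(sum(row), sum(1 for value in row if value > 0)) for row in matrix]


def filter_cells(matrix, min_counts):
    """Keep cells whose total count is at least min_counts."""
    return [row for row in matrix if sum(row) >= min_counts]


def filter_genes(matrix, names, min_cells):
    """Keep genes detected in at least min_cells cells."""
    keep = [
        j
        for j in range(len(names))
        if sum(1 for row in matrix if row[j] > 0) >= min_cells
    ]
    return [[row[j] for j in keep] for row in matrix], [names[j] for j in keep]


def normalize_log1p(matrix, target_sum=100.0):
    """Scale each non-empty cell to target_sum total counts, then apply log(1 + x)."""
    return [
        [math.log1p(value * target_sum / sum(row)) for value in row]
        for row in matrix
    ]


metrics = qc_metrics(counts)
kept_cells = filter_cells(counts, min_counts=5)
kept_matrix, kept_genes = filter_genes(kept_cells, genes, min_cells=2)
normalized = normalize_log1p(kept_matrix)
print(metrics, kept_genes)
```

Filtering cells before genes matters here: a gene's detection count is computed only over the cells that survive the first filter.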
Dimensionality Reduction & Clustering
Goal: Identify patterns and groups of similar gene expression profiles.
Dimensionality reduction is a technique used to reduce the number of features (or dimensions) in a dataset while preserving important information. Clustering is a technique used to group similar data points together based on their features. It is important to choose an appropriate number of clusters so that the clusters are meaningful and representative of the data.
Steps:
- Compute PCA and neighbors
- Compute and plot UMAP
- Cluster cells using the Leiden algorithm
- Visualize clusters on UMAP
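To make the "neighbors" step more concrete, the sketch below builds a k-nearest-neighbour graph from a toy two-dimensional embedding using only the standard library. Graph-based clustering methods such as Leiden partition a graph of this kind into communities. The sketch does not implement PCA, UMAP, or Leiden; the coordinates, cell names, and the `knn_graph` helper are made up for illustration. In scanpy the corresponding calls would be `scanpy.pp.pca`, `scanpy.pp.neighbors`, `scanpy.tl.umap`, and `scanpy.tl.leiden`.

```python
import math

# Toy low-dimensional embedding (e.g. two principal components); values are made up.
embedding = {
    "cell_a": (0.0, 0.0),
    "cell_b": (0.1, 0.0),
    "cell_c": (0.0, 0.2),
    "cell_d": (5.0, 5.0),
    "cell_e": (5.1, 5.0),
    "cell_f": (5.0, 5.3),
}


def knn_graph(points, k):
    """Map each cell to its k nearest neighbours by Euclidean distance."""
    graph = {}
    for name, coords in points.items():
        distances = sorted(
            (math.dist(coords, other_coords), other)
            for other, other_coords in points.items()
            if other != name
        )
        graph[name] = [other for _, other in distances[:k]]
    return graph


graph = knn_graph(embedding, k=2)
for cell, neighbours in graph.items():
    print(cell, "->", neighbours)
```

In this toy data the two groups of cells never link to each other, which is the kind of structure a community detection algorithm would then recover as two clusters.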
Annotation & Cell Type Identification
Goal: Assign biological meaning to clusters.
Annotation is the process of assigning biological meaning to clusters. This typically means assigning a cell type to each cluster, that is, identifying the cell types present in the data. Choosing the number of clusters can be challenging and is often seen as more of an art than a science. It is important to choose the number of clusters that best represents the data and the biological question being asked. More information on how to choose the number of clusters can be found in the scRNAseq best practices.
Steps:
- Compute differentially expressed genes for each cluster
- Visualize cluster marker genes
- Identify cell types with marker genes
- Annotate clusters with known cell types
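One simple way to picture marker-based annotation is to score each cluster against a list of marker genes per cell type and pick the best-scoring type. The sketch below does this with made-up mean expression values and a deliberately tiny, illustrative marker list; the `annotate` helper is hypothetical and this is not necessarily how the pipeline assigns labels. In scanpy, per-cluster marker genes are typically obtained with `scanpy.tl.rank_genes_groups` before a step like this.

```python
# Mean expression per cluster for a few genes (made-up values).
cluster_means = {
    "0": {"EPCAM": 4.2, "PTPRC": 0.1, "PECAM1": 0.3},
    "1": {"EPCAM": 0.2, "PTPRC": 3.8, "PECAM1": 0.1},
    "2": {"EPCAM": 0.1, "PTPRC": 0.4, "PECAM1": 2.9},
}

# Illustrative marker lists; a real analysis would use a curated, tissue-specific panel.
markers = {
    "epithelial": ["EPCAM"],
    "immune": ["PTPRC"],
    "endothelial": ["PECAM1"],
}


def annotate(cluster_means, markers):
    """Assign each cluster the cell type with the highest mean marker expression."""
    labels = {}
    for cluster, means in cluster_means.items():
        scores = {
            cell_type: sum(means.get(gene, 0.0) for gene in gene_list) / len(gene_list)
            for cell_type, gene_list in markers.items()
        }
        labels[cluster] = max(scores, key=scores.get)
    return labels


labels = annotate(cluster_means, markers)
print(labels)
```

A scoring rule like this is only a starting point: ambiguous clusters usually need manual inspection of their marker genes.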
Spatial Mapping & Visualization
Goal: Map gene expression and clusters back to their spatial context.
- Overlay expression and clusters on tissue image
- Plot spatially enriched genes
- Map cell types or states in space
Spatial Statistics & Spatially Variable Genes
Goal: Quantify spatial patterns and variability utilizing spatial statistics.
We use two different approaches to spatial statistics: Squidpy and MuSpAn.
- Compute spatial autocorrelation (e.g. Moran's I)
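For intuition, global Moran's I can be computed by hand as I = (N / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where w_ij are spatial weights and W is their sum. The sketch below implements this formula with the standard library on four locations along a line; the `morans_i` and `chain_weights` helpers and the data are made up for illustration. In squidpy the equivalent analysis would normally go through `squidpy.gr.spatial_neighbors` followed by `squidpy.gr.spatial_autocorr` with `mode="moran"`.

```python
def morans_i(values, weights):
    """Compute global Moran's I.

    values  -- one measurement per location (e.g. expression of one gene)
    weights -- square matrix; weights[i][j] > 0 when locations i and j are neighbours
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / total_weight) * (cross / variance_sum)


def chain_weights(n):
    """Binary weights for n locations on a line; adjacent locations are neighbours."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


weights = chain_weights(4)
gradient = morans_i([1.0, 2.0, 3.0, 4.0], weights)       # smooth spatial trend
alternating = morans_i([1.0, -1.0, 1.0, -1.0], weights)  # neighbours always differ
print(gradient, alternating)
```

The smooth gradient gives a positive value (1/3), meaning neighbouring locations tend to be similar, while the alternating pattern gives -1, meaning neighbours are systematically dissimilar.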
TODO: Differential Expression & Functional Analysis
Goal: Discover meaningful biology.
- Spatially variable genes (SVGs)
- DE between regions or conditions
- Pathway or GO enrichment
Data
- Small toy dataset for testing and development (TBD)
- Xenium Lung FFPE data
Best Practice Notes
- Git version control
- Virtual environments (e.g. conda, venv)
- Code modularity (e.g. functions, classes)
- Code documentation (e.g. docstrings, comments)
- Code style (e.g. PEP 8 for Python)
- Code testing
- Linting and continuous integration (pre-commit, ruff)
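Several of these practices can be seen together in a single small function: type hints and a docstring for documentation, a clear error for invalid input, and a pytest-style test. The `fraction_above` helper below is hypothetical and is not part of the recode_st package; it is only meant to show the shape such code might take.

```python
def fraction_above(values: list[float], threshold: float) -> float:
    """Return the fraction of values strictly greater than threshold.

    Raises:
        ValueError: If values is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v > threshold) / len(values)


def test_fraction_above() -> None:
    """A pytest-style test: plain assertions on known inputs."""
    assert fraction_above([1.0, 2.0, 3.0, 4.0], threshold=2.0) == 0.5
    assert fraction_above([0.0], threshold=1.0) == 0.0


# pytest would discover and run test_fraction_above automatically;
# it is called directly here so the sketch runs as a plain script.
test_fraction_above()
```

Placed in a file such as tests/test_helper_function.py, a test like this would be collected by pytest because its name starts with `test_`.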
Estimated Time
| Task | Time |
|---|---|
| Reading | 3 hours |
| Practising | 3 hours |
Additional Resources
Learn more about spatial transcriptomics
- An introduction to spatial transcriptomics for biomedical research
- 10x Genomics Xenium documentation
- Single Cell Spatial Transcriptomics: 10x Genomics Xenium
- Best practices for single cell and spatial transcriptomics
Learn more about networks and spatial statistics
Learn more about our tools and libraries
- 30-days-of-python
- scverse documentation
- squidpy documentation
- scanpy documentation
- MuSpAn documentation
Video Tutorials
We have included bioinformatic bloggers that can help you get started with understanding key concepts in bioinformatics and transcriptomics analysis:
Licence
This project is licensed under the BSD-3-Clause license.