Downstream analyses

Jack Gisby 2022-06-14

Loading the data
Unsupervised analysis and visualisation
Principal components analysis
Differential expression

This markdown document shows how we can perform basic downstream analysis on raw RNA-seq counts. These counts were produced by the data processing pipeline developed in this repository applied to RNA-seq reads measured from soybean plants grown at ambient or elevated ozone levels. If you want to learn more about the preprocessing pipeline, see the documents describing its development in the docs/ directory.

The raw counts files are a lot smaller than the original reads (that were stored in fasta files). They are small enough that we can read the raw counts into R on a standard home computer. After normalising the raw counts data, we will perform some basic analyses to investigate the transcriptome of the soybean samples.

Loading the data

We load the names of the samples from data/files.txt into sample_names before extracting the raw counts for each sample. There is a counts file for each sample, but it would be more convenient to group these together into a single data matrix. Each of these files contains two columns; the first represents gene IDs and the second indicates how many of the reads for that sample mapped to that gene.

We first create a dataframe, counts, that contains a single column: the gene IDs that we extract from the counts file for the first sample (SRR391535).

Then, we loop through each of the samples, reading in that sample’s count file and merging them into the matrix counts. Finally, we can see that we have created a single matrix describing how many reads map to each gene for each sample.

# get the sample names
sample_names <- read.csv("../data/files.txt", header = FALSE)[[1]]

# a dataframe for all of the counts
counts <- data.frame(gene_id = fread("../3_nextflow_pipeline_results/g_count/SRR391535.counts")[[1]])

# for each sample
for (s in sample_names) {

    # get the counts from htseq
    sample_counts <- fread(paste0("../3_nextflow_pipeline_results/g_count/", s, ".counts"))
    colnames(sample_counts) <- c("gene_id", s)

    # remove counts for reads that did not map to a gene
    sample_counts <- sample_counts[which(!grepl("__", sample_counts[[1]])) ,]

    # merge with the main dataframe
    counts <- merge(counts, sample_counts, by = "gene_id")
}

# move the gene_id column to the rownames
rownames(counts) <- counts$gene_id
counts <- subset(counts, select = -gene_id)

# view the combined matrix
print(head(counts))

##           SRR391535 SRR391536 SRR391537 SRR391538 SRR391539 SRR391541
## 1-3-1A           87        41        59        88        79        98
## 1-3-1B          242       150       212       289       249       288
## 100801599        22         9       859      1455      1421        24
## 100805863         0         0         0         0         0         0
## 13-1-1          110        39       163       121       284       194
## 14-1-1           21        26        65        14        63        33

In the following chunk, we create a dataframe, sample_names, describing the condition for each of the samples. Three of the samples were grown at ambient ozone levels, while the other three were grown with elevated ozone.

sample_info <- data.frame(condition = c("ambient", "ambient", "elevated", "elevated", "elevated", "ambient"))
rownames(sample_info) <- sample_names

print(sample_info)

##           condition
## SRR391535   ambient
## SRR391536   ambient
## SRR391537  elevated
## SRR391538  elevated
## SRR391539  elevated
## SRR391541   ambient

The Bioconductor project has a package implementing a special container for this sort of data. We can store the counts data in the SummarizedExperiment object, and we can also store information pertaining to the rows (genes) and columns (samples).

For more information on this data container, see the SummarizedExperiment vignette.

se <- SummarizedExperiment(counts, colData = sample_info)
assayNames(se) <- "counts"

print(se)

## class: SummarizedExperiment
## dim: 54361 6
## metadata(0):
## assays(1): counts
## rownames(54361): 1-3-1A 1-3-1B ... ZTL1 ZTL2
## rowData names(0):
## colnames(6): SRR391535 SRR391536 ... SRR391539 SRR391541
## colData names(1): condition

From this container, we can access information on our samples.

colData(se)

## DataFrame with 6 rows and 1 column
##             condition
##           <character>
## SRR391535     ambient
## SRR391536     ambient
## SRR391537    elevated
## SRR391538    elevated
## SRR391539    elevated
## SRR391541     ambient

And we can extract the counts data as a matrix.

# view the first ten rows of the counts matrix
assay(se)[1:10 ,]

##           SRR391535 SRR391536 SRR391537 SRR391538 SRR391539 SRR391541
## 1-3-1A           87        41        59        88        79        98
## 1-3-1B          242       150       212       289       249       288
## 100801599        22         9       859      1455      1421        24
## 100805863         0         0         0         0         0         0
## 13-1-1          110        39       163       121       284       194
## 14-1-1           21        26        65        14        63        33
## 19-1-5            0         0         1         0         0         0
## 4CL            1309       742       574       860      1279      1445
## 4CL1           9627      5597     16109     28881     32941     10526
## 4CL2           1145       556      2686      4424      5729      1407

Unsupervised analysis and visualisation

Next, we will attempt to visualise the soybean transcriptomes. Currently, we are storing the raw counts with the SummarizedExperiment object. However, these are not necessarily appropriate for all downstream applications. For instance:

Some genes may not be expressed by many/any of the samples. We may not want to include these in our downstream analyses.
Analyses that compare different samples or genes may assume that particular normalisation procedures have been applied.
Some analyses are designed to work with counts after they have had a transformation applied, such as a log transformation.

For each of our analyses, we will consider whether our data has been appropriately processed. The package edgeR is popular for performing differential expression analysis of RNA-seq data. It also has functions for normalising and transforming raw counts data. We will be using it in this notebook.

First, we will filter out genes that have low counts in our samples. The function filterByExpr will keep genes that have sufficient counts in the majority of our samples. We will also pass our experimental condition to the group argument, which makes sure there is sufficient counts in both the ambient and elevated sample groups.

# select genes to keep
fbe_keep <- filterByExpr(se, group = se$condition, min.count = 500)

# remove genes that are not in the list
filtered_se <- se[fbe_keep, ]

print(paste0("Number of genes before filtering: ", nrow(rowData(se))))

## [1] "Number of genes before filtering: 54361"

print(paste0("Number of genes after filtering: ", nrow(rowData(filtered_se))))

## [1] "Number of genes after filtering: 10727"

Secondly, we will calculate normalisation factors. The function calcNormFactors will do this, however the normalisation will not actually be applied to the counts yet. calcNormFactors transforms the SummarizedExperiment object into an edgeR DGEList object, in which it will store the raw counts and the calculated normalisation factors.

edgeR applies TMM normalisation, which accounts for the following factors: - sequencing depth - This refers to the number of reads that have been generated in total. More reads may have been generated for some samples compared to others. It is essential to normalise for sequencing depth before comparing gene expression from different samples. - RNA composition - Highly expressed outliers or contamination can skew some normalisation methods, so RNA composition must be accounted for. - gene length - Longer genes are likely to have more reads map to them. While not relevant to this notebook, if we were to compare expression of different genes within the sample samples we would need to account for gene length.

For more information on TMM normalisation, see the following publication.

# calculate normalisation factors, including TMM normalisation
dge <- calcNormFactors(filtered_se)

# add the experimental condition as the DGEList's group
dge$samples$group <- dge$samples$condition

The SummarizedExperiment can store multiple versions of the same count matrix, for instance with different normalisations or transformations applied. We have stored the raw counts in the first slot, but we can store the normalised data in another slot.

For generating visualisations and performing principal components analysis, we will convert the raw counts to counts per million (CPM). So, the raw counts represent number of mapped reads for each gene, while CPM represents the number of reads mapped to each gene per million total mapped reads.

The cpm function will also apply the normalisation factors calculated in the previous step log transform the data.

# add normalised logCPM data to the summarized experiment object
assay(filtered_se, 2) <- cpm(dge, log = TRUE)

Principal components analysis

Principal components analysis (PCA) is a dimensionality reduction that finds vectors (principal components) that represent the greatest proportion of variance in the data. In practice, we can summarise many (1000s) genes in just a few principal components that still contain most of the information in the full transcriptome.

We can perform PCA in R using the prcomp function, which uses singular value decomposition to carry out the calculation. The function also centers and scales (standardises) the logCPM data. PCA is sensitive to the variance of the input variables, and different genes have very different ranges of expression; so, standardisation is recommended to keep different genes on comparable scales.

# prcomp calculates PCA using singular value decomposition
cpm_pca <- prcomp(t(assay(filtered_se, 2)), center = TRUE, scale. = TRUE)

print(cpm_pca$x)

##                 PC1        PC2        PC3        PC4        PC5          PC6
## SRR391535 -84.67757  59.314154  19.071329 -12.818820 -14.725144 2.078164e-13
## SRR391536 -76.42968  -2.607008 -25.522355  17.769187  35.552955 2.476643e-13
## SRR391537  82.68334  19.575153 -46.849790   4.746674 -18.867265 1.912030e-13
## SRR391538  80.39912   3.084402  39.136481  36.215136   5.418837 4.876828e-14
## SRR391539  59.26173 -15.795781  10.446484 -48.490415  17.783266 1.209415e-13
## SRR391541 -61.23694 -63.570920   3.717851   2.578239 -25.162647 2.254863e-13

While we could plot the principal components manually, it is convenient to use the ggbiplot package to do it for us. This function additionally calculates the variance explained by each component as a percentage, and plots the orientation of some of the genes in PC-space. It also colours our samples by the experimental condition and draws an ellipse around each group.

We can see that we can draw mostly distinct ellipses around each of the experimental conditions, indicating clear differences between the groups. PCA is known as “unsupervised”, because it identifies patterns in the data without knowledge of labels of interest.

pca_df <- data.frame(condition = filtered_se$condition, cpm_pca$x)

# use the ggplot package to plot the PCA
ggplot(pca_df, aes(PC1, PC2, col = condition)) +
    geom_point()

Differential expression

We have performed an unsupervised analysis, PCA, to visualise our data. This demonstrated clear differences between or groups of interest, motivating us to investigate how the transcriptome differs by the experimental condition.

We will now perform differential expression for each of the genes in the transcriptome. This is a supervised method because we will look for differences using the group labels.

In a previous step, we generated a DGEList containing the raw counts and the computed normalisation factors. With edgeR, we never need to directly apply these normalisation factors to the data. The models employed by edgeR take the raw counts and normalisation factors directly.

These models can work with raw counts directly because they assume that the counts follow a negative binomial distribution. In order to fit these models, edgeR must calculate the technical- and biological-specific variation using the estimateDisp function.

The recommended method for a simple two-group comparison is edgeR’s quantile- adjusted conditional maximum likelihood method. This is implemented in the exactTest function. The edgeR Users Guide notes that this test has strong parallels with the Fisher’s exact test.

# estimate dispersion parameter
dge <- estimateDisp(dge)

## Using classic mode.

# fit the models
et <- exactTest(dge)

# get the results as a table
res <- topTags(et, n=60000)

# print the most significant results in the table
print(head(res))

## Comparison of groups:  elevated-ambient
##                 logFC   logCPM        PValue           FDR
## LOC100790507 7.292131 5.709596 1.446266e-222 1.551409e-218
## 100801599    6.053555 4.926274 1.819215e-156 9.757361e-153
## LOC100794841 4.251261 5.920580 3.867659e-151 1.382946e-147
## LOC100798930 4.134225 5.388805 5.977479e-130 1.603010e-126
## LOC100500316 6.323671 3.626933 3.824606e-124 8.205310e-121
## LOC100500550 4.716406 4.923224 4.859905e-116 8.688700e-113

To visualise the results, we will create a volcano plot, which makes it easier to identify genes with particularly high fold changes and small P-values.

# use the EnhancedVolcano package to visualise the results
EnhancedVolcano(
    res$table,
    lab = rownames(res$table),
    x = "logFC",
    y = "PValue",
    pCutoffCol = "FDR",
    pCutoff = 0.05
)

alt