Conducting a Survival Analysis

Using the previous data set we curated, we can now conduct the survival analysis. The Surv() function from the {survival} package creates a survival object for use as the response in a model formula. This object is then used to calculate the survival time of our desired model.

Univariate (unadjusted) model

Firstly, we must build a survival object for all cause mortality, (this can be altered to different types of death see Advanced Survival Analysis).

#Survival analysis people with [disease] vs [no disease] 
library(survival)
library(survminer)
library(ggplot2)
library(gtsummary)
library(knitr)
library(dplyr)

load("data/result.RData")
death$allcause_death <- as.numeric(death$allcause_death)

Surv(death$time_years, death$allcause_death)[1:10]
 [1] 67.56742  58.25051+ 72.41889+ 64.75291+ 82.58453+ 65.41821+ 64.25188+
 [8] 79.67146+ 81.50308+ 67.08556+

From this we can see that only the first person experienced the event at 67.56 years, the rest, as indicated by the “+” have not experienced the event and therefore are said to be censored.

Note: Possible Error

It is common that our code returns an error: “Invalid status value, must be logical or numeric”, if we have not specified the type of variable we want to store our all-cause mortality variable (status). This can be easily corrected with the code below.

#Surv(death$time_years, death$allcause_death): Invalid status value, must be logical or numeric. If this error appears, simply convert to numeric

#death$allcause_death <- as.numeric(death$allcause_death)

Creating survival curves

Let’s now use the survfit() function to create the survival curves for the entire cohort using the Kaplan-Meier method. In this example we defined the outcome as 0/1 for airflow obstruction.

LTRC_ats <- Surv(time =death$Age, time2 = death$time_years, event = death$allcause_death)

#Unadjusted for covariates
fit1 <- survfit(LTRC_ats ~ AO.fev1fvc, data = death)
summary_fit1 <- summary(fit1)

time <- summary_fit1$time
n_risk <- summary_fit1$n.risk
surv <- summary_fit1$surv
lower_ci <- summary_fit1$lower
upper_ci <- summary_fit1$upper

# Display the first few rows of these components
result <- data.frame(
  Time = time[1:10],
  N_Risk = n_risk[1:10],
  Survival = surv[1:10],
  Lower_CI = lower_ci[1:10],
  Upper_CI = upper_ci[1:10]
)

print(result)
       Time N_Risk  Survival  Lower_CI  Upper_CI
1  40.29295   2471 0.9995953 0.9988026 1.0000000
2  40.53388   2470 0.9991906 0.9980700 1.0000000
3  40.78029   2469 0.9987859 0.9974139 1.0000000
4  41.05955   7386 0.9986507 0.9972535 1.0000000
5  41.23751   7385 0.9985155 0.9970936 0.9999394
6  41.34428   7384 0.9983802 0.9969341 0.9998284
7  41.34702   7383 0.9982450 0.9967751 0.9997171
8  41.38809   7382 0.9981098 0.9966164 0.9996054
9  41.39083   7381 0.9979746 0.9964581 0.9994933
10 41.42642   7380 0.9978393 0.9963001 0.9993809

Visualisation of results

Survival analysis results may be presented in different ways depending on the type of analysis. For instance, univariate (unadjusted) models may be presented in a Kaplan-Meier Curve.

Kaplan-Meier Curve and Log-Rank test

The Kaplan-Meier method is a widely used technique for estimating survival times and probabilities. The results are represented as a step function, with each step corresponding to an event occurrence.

To compare survival curves between different groups, the Log-rank test is commonly used. This statistical test evaluates whether there are significant differences in survival distributions by comparing observed and expected event rates across groups.

#Plotting the results 

km_plot <- ggsurvplot(
  fit1, 
  data = death, 
  conf.int = TRUE,
  risk.table = TRUE,
  ggtheme = theme_minimal(),
  title = "Kaplan-Meier Survival Curve",
  xlab = "Time in Years",
  ylab = "Survival Probability",
  legend.labs = c("No Airflow Obstruction", "Airflow Obstruction"),#Change stata
  xlim = c(40, max(fit1$time)),  # Set x-axis limits (start at 40 min age)
  risk.table.fontsize = 3,  # Adjust the font size for the risk table numbers
  tables.theme = theme_survminer(
    font.main = 8, 
    font.submain = 8, 
    font.caption = 8, 
    font.x = 8, 
    font.y = 8, 
    font.tickslab = 8
  )
)

print(km_plot)

Overview of the survival model

We can present an overview of the survival model in a table.

summary_fit1 <- summary(fit1)
# Extract key summary statistics
summary_table <- data.frame(
  Time = summary_fit1$time,
  N.risk = summary_fit1$n.risk,
  N.event = summary_fit1$n.event,
  Survival = summary_fit1$surv,
  Standard.Error = summary_fit1$std.err,
  Lower.CI = summary_fit1$lower,
  Upper.CI = summary_fit1$upper
  )

# Display the summary table
kable(summary_table[1:15, ], caption = "Summary of Survival Fit for Airflow obstruction (AO.fev1fvc)")
Summary of Survival Fit for Airflow obstruction (AO.fev1fvc)
Time N.risk N.event Survival Standard.Error Lower.CI Upper.CI
40.29295 2471 1 0.9995953 0.0004046 0.9988026 1.0000000
40.53388 2470 1 0.9991906 0.0005721 0.9980700 1.0000000
40.78029 2469 1 0.9987859 0.0007005 0.9974139 1.0000000
41.05955 7386 1 0.9986507 0.0007134 0.9972535 1.0000000
41.23751 7385 1 0.9985155 0.0007260 0.9970936 0.9999394
41.34428 7384 1 0.9983802 0.0007384 0.9969341 0.9998284
41.34702 7383 1 0.9982450 0.0007505 0.9967751 0.9997171
41.38809 7382 1 0.9981098 0.0007625 0.9966164 0.9996054
41.39083 7381 1 0.9979746 0.0007743 0.9964581 0.9994933
41.42642 7380 1 0.9978393 0.0007859 0.9963001 0.9993809
41.43737 7379 1 0.9977041 0.0007974 0.9961425 0.9992682
41.46201 7378 1 0.9975689 0.0008087 0.9959852 0.9991551
41.61533 7377 1 0.9974336 0.0008198 0.9958282 0.9990417
41.74675 7376 1 0.9972984 0.0008307 0.9956715 0.9989280
41.80698 7375 1 0.9971632 0.0008416 0.9955151 0.9988140

Calculating Median Survival time

The median survival time is often reported in a survival analysis for each of the groups we are comparing.

median_survival <- fit1

median_survival
Call: survfit(formula = LTRC_ats ~ AO.fev1fvc, data = death)

   1325 observations deleted due to missingness 
             records  n.max n.start events median 0.95LCL 0.95UCL
AO.fev1fvc=0  229908 136318    2471  11251   86.5      NA      NA
AO.fev1fvc=1   21644  13187     246   2164   85.7    85.6      NA

We can see that the median survival time for people without airflow obstruction is 86.5 years compared to 85.7 years in people with airflow obstruction. Meaning that people with AO live less than those without.

Log Rank Test

# Create the survival object for right-censored data
right_censored_surv <- Surv(time = death$time_years, event = death$allcause_death)

# Fit the survival model
fit2 <- survfit(right_censored_surv ~ AO.fev1fvc, data = death)

# Perform log-rank test
log_rank_test <- survdiff(right_censored_surv ~ AO.fev1fvc, data = death)

#Log-rank test results 

print(log_rank_test)
Call:
survdiff(formula = right_censored_surv ~ AO.fev1fvc, data = death)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
AO.fev1fvc=0 231027    12370    13399        79       869
AO.fev1fvc=1  21850     2370     1341       790       869

 Chisq= 869  on 1 degrees of freedom, p= <2e-16 

A significant Log-Rank test results means that there are statistically significant differences between the two groups we are comparing (e.g. People with AO vs people without AO)

Multivariate (adjusted) model

Multivariate (adjusted) models are used to account for confounders that may impact the associations between the exposure and the outcome. It is common to have complex models adjusting for multiple covariates, and therefore presenting these on a KM curve would be confusinve and not informative. Therefore, results from a multivatiate model are better presented in a table or a forest plot.

#Survival analysis people with SAO vs no SAO 
library(survival)
library(survminer)
library(ggplot2)
library(gtsummary)

#Order smoking status correctly
death$Smoking_status <- as.factor(death$Smoking_status)
death$Smoking_status <- relevel(death$Smoking_status, ref = "Never")

#All cause mortality adjusted for covariates
LTRC_ats <- Surv(time =death$Age, time2 = death$time_years, event = death$allcause_death)

coxph_LTRC_airflow <- coxph(LTRC_ats ~ AO.fev1fvc + Sex + bmi + Ethnicity + Smoking_status + UKB_centre, data = death)


summary(coxph_LTRC_airflow)
Call:
coxph(formula = LTRC_ats ~ AO.fev1fvc + Sex + bmi + Ethnicity + 
    Smoking_status + UKB_centre, data = death)

  n= 251138, number of events= 13375 
   (1739 observations deleted due to missingness)

                              coef exp(coef)  se(coef)      z Pr(>|z|)    
AO.fev1fvc                0.505141  1.657220  0.024248 20.832  < 2e-16 ***
SexMale                   0.374856  1.454782  0.017578 21.326  < 2e-16 ***
bmi                       0.026416  1.026768  0.001781 14.834  < 2e-16 ***
EthnicityWhite           -0.060490  0.941303  0.035508 -1.704 0.088462 .  
Smoking_statusCurrent     0.863901  2.372397  0.024971 34.596  < 2e-16 ***
Smoking_statusPrevious    0.171028  1.186524  0.019480  8.780  < 2e-16 ***
UKB_centreBirmingham     -0.002458  0.997545  0.071553 -0.034 0.972595    
UKB_centreBristol        -0.033257  0.967290  0.064998 -0.512 0.608894    
UKB_centreBury            0.169318  1.184497  0.065526  2.584 0.009767 ** 
UKB_centreCardiff         0.048245  1.049428  0.072930  0.662 0.508276    
UKB_centreCroydon         0.003938  1.003946  0.070215  0.056 0.955273    
UKB_centreEdinburgh       0.109151  1.115331  0.071257  1.532 0.125575    
UKB_centreGlasgow         0.145384  1.156484  0.070048  2.076 0.037939 *  
UKB_centreLeeds          -0.081601  0.921640  0.062489 -1.306 0.191607    
UKB_centreLiverpool       0.013492  1.013584  0.065989  0.204 0.837991    
UKB_centreManchester     -0.018731  0.981443  0.075389 -0.248 0.803777    
UKB_centreMiddlesborough  0.018251  1.018419  0.070794  0.258 0.796557    
UKB_centreNewcastle      -0.022029  0.978212  0.066129 -0.333 0.739044    
UKB_centreNottingham      0.020192  1.020398  0.066431  0.304 0.761157    
UKB_centreOxford         -0.060203  0.941574  0.078946 -0.763 0.445715    
UKB_centreReading        -0.094110  0.910182  0.068409 -1.376 0.168916    
UKB_centreSheffield       0.016122  1.016253  0.067750  0.238 0.811907    
UKB_centreStoke          -0.037203  0.963481  0.071415 -0.521 0.602409    
UKB_centreSwansea         0.417125  1.517592  0.117249  3.558 0.000374 ***
UKB_centreWrexham        -0.022106  0.978137  0.306992 -0.072 0.942595    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                         exp(coef) exp(-coef) lower .95 upper .95
AO.fev1fvc                  1.6572     0.6034    1.5803     1.738
SexMale                     1.4548     0.6874    1.4055     1.506
bmi                         1.0268     0.9739    1.0232     1.030
EthnicityWhite              0.9413     1.0624    0.8780     1.009
Smoking_statusCurrent       2.3724     0.4215    2.2591     2.491
Smoking_statusPrevious      1.1865     0.8428    1.1421     1.233
UKB_centreBirmingham        0.9975     1.0025    0.8670     1.148
UKB_centreBristol           0.9673     1.0338    0.8516     1.099
UKB_centreBury              1.1845     0.8442    1.0417     1.347
UKB_centreCardiff           1.0494     0.9529    0.9097     1.211
UKB_centreCroydon           1.0039     0.9961    0.8749     1.152
UKB_centreEdinburgh         1.1153     0.8966    0.9699     1.283
UKB_centreGlasgow           1.1565     0.8647    1.0081     1.327
UKB_centreLeeds             0.9216     1.0850    0.8154     1.042
UKB_centreLiverpool         1.0136     0.9866    0.8906     1.154
UKB_centreManchester        0.9814     1.0189    0.8466     1.138
UKB_centreMiddlesborough    1.0184     0.9819    0.8865     1.170
UKB_centreNewcastle         0.9782     1.0223    0.8593     1.114
UKB_centreNottingham        1.0204     0.9800    0.8958     1.162
UKB_centreOxford            0.9416     1.0621    0.8066     1.099
UKB_centreReading           0.9102     1.0987    0.7960     1.041
UKB_centreSheffield         1.0163     0.9840    0.8899     1.161
UKB_centreStoke             0.9635     1.0379    0.8376     1.108
UKB_centreSwansea           1.5176     0.6589    1.2060     1.910
UKB_centreWrexham           0.9781     1.0224    0.5359     1.785

Concordance= 0.619  (se = 0.003 )
Likelihood ratio test= 2726  on 25 df,   p=<2e-16
Wald test            = 3080  on 25 df,   p=<2e-16
Score (logrank) test = 3237  on 25 df,   p=<2e-16

Export results in a table

From the summary results, we can export the results in a table, where exp(coef) corresponds to the Hazard Ratio (HR), we also report the 95% Confidence Intervals and the P-Value for our outcome and each covariate.

coxph_LTRC_airflow %>%
  tbl_regression(exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  modify_table_styling(
    columns = "label",
    rows = variable == "AO.fev1fvc",
    label = "Airflow obstruction"
  )
Airflow obstruction HR1 95% CI1 p-value
AO.fev1fvc 1.66 1.58, 1.74 <0.001
Sex


    Female
    Male 1.45 1.41, 1.51 <0.001
bmi 1.03 1.02, 1.03 <0.001
Ethnicity


    Other
    White 0.94 0.88, 1.01 0.088
Smoking_status


    Never
    Current 2.37 2.26, 2.49 <0.001
    Previous 1.19 1.14, 1.23 <0.001
UKB_centre


    Barts
    Birmingham 1.00 0.87, 1.15 >0.9
    Bristol 0.97 0.85, 1.10 0.6
    Bury 1.18 1.04, 1.35 0.010
    Cardiff 1.05 0.91, 1.21 0.5
    Croydon 1.00 0.87, 1.15 >0.9
    Edinburgh 1.12 0.97, 1.28 0.13
    Glasgow 1.16 1.01, 1.33 0.038
    Leeds 0.92 0.82, 1.04 0.2
    Liverpool 1.01 0.89, 1.15 0.8
    Manchester 0.98 0.85, 1.14 0.8
    Middlesborough 1.02 0.89, 1.17 0.8
    Newcastle 0.98 0.86, 1.11 0.7
    Nottingham 1.02 0.90, 1.16 0.8
    Oxford 0.94 0.81, 1.10 0.4
    Reading 0.91 0.80, 1.04 0.2
    Sheffield 1.02 0.89, 1.16 0.8
    Stoke 0.96 0.84, 1.11 0.6
    Swansea 1.52 1.21, 1.91 <0.001
    Wrexham 0.98 0.54, 1.79 >0.9
1 HR = Hazard Ratio, CI = Confidence Interval
#Save table as docx
#flex_tbl <- as_flex_table(tb1)
# Save as Word document
#read_docx() %>%
#  body_add_flextable(value = flex_tbl) %>%
#  print(target = "allcause_mortality.docx")

Present results in a forest plot

The results from the Cox Proportional Hazard model may also be presented in a graphical manner using a forest plot.

Forest plot with all variables that were included in the model

Forest plot selecting variables we want to present

# Extract the coefficients and confidence intervals
coefs <- summary(coxph_LTRC_airflow)$coefficients
confint <- summary(coxph_LTRC_airflow)$conf.int

# Exclude UKB_centre variables
exclude_indices <- grepl("UKB_centre", rownames(coefs))
coefs <- coefs[!exclude_indices, ]
confint <- confint[!exclude_indices, ]

# Create a dataframe for plotting
forest_data <- data.frame(
    Variable = rownames(coefs),
    HR = coefs[, "exp(coef)"],
    LowerCI = confint[, "lower .95"],
    UpperCI = confint[, "upper .95"]
)

# Rename the specific variable
# Optionally, clean up other variable names for better readability
forest_data$Variable <- ifelse(forest_data$Variable == "AO.fev1fvc", "Airflow obstruction", forest_data$Variable)

forest_data$Variable <- ifelse(forest_data$Variable == "bmi", "BMI", forest_data$Variable)

forest_data$Variable <- ifelse(forest_data$Variable == "EthnicityWhite", "European (White) Ancestry", forest_data$Variable)

forest_data$Variable <- ifelse(forest_data$Variable == "SexMale", "Males", forest_data$Variable)

forest_data$Variable <- ifelse(forest_data$Variable == "Smoking_statusCurrent", "Current Smokers", forest_data$Variable)

forest_data$Variable <- ifelse(forest_data$Variable == "Smoking_statusPrevious", "Previous Smokers", forest_data$Variable)

# Specify the desired order of variables
desired_order <- c("Airflow obstruction", "BMI", "European (White) Ancestry", "Males", "Current Smokers", "Previous Smokers")

# Set the levels of the Variable column according to the desired order
forest_data$Variable <- factor(forest_data$Variable, levels = rev(desired_order))

# Create the forest plot
forest_plot <- ggplot(forest_data, aes(x = HR, y = Variable)) +
    geom_point() +
    geom_errorbarh(aes(xmin = LowerCI, xmax = UpperCI), height = 0.2) +
    geom_vline(xintercept = 1, linetype = "dashed") +
    labs(title = "Forest Plot for Cox Proportional Hazards Model",
         x = "Hazard Ratio",
         y = "Variables") +
    theme_minimal()

# Display the plot
print(forest_plot)

Interpretation of results

Think, think, think

From these results, we can conclude that:

  • Participants with airflow obstruction have significantly increased all cause mortality risk compared to participants without airflow obstruction (HR = 1.66, 95% CI = 1.58, 1.74)

  • The risk is 1.45 times higher in males compared to females

  • The risk is 2.37 times higher among current smokers and 1.19 times among previous smokers compared to never smokers

  • Model is adjusted for age (as used as the time scale), sex, smoking status, BMI, Ethnicity and UK Biobank Centre

Similar conclusions can be drawn from the other predictor variable used, but the idea is to report the change in risk between the two groups we are comparing.