Cracking time’s code: Survival analysis of large datasets

By Valentina Quintero Santofimio, Imperial College London

Description of the project

This is a special type of analysis that takes into consideration when the event occurred rather than if the event occurred. In other words, we are focused on acquiring the rate, which is the number of events per unit time. In this exemplar will introduce you to the concept of survival analysis (also known as a time-to-event analysis) using large data sets using R.

  1. Firstly, I will highlight the steps needed to effectively clean the data, including correct censoring of the participants.

  2. Secondly, I will demonstrate two approaches: a univariate and a multivariate Cox Proportional regression model (which allows the incorporation of confounders).

  3. Thirdly, we will be able to estimate the hazard ratio of an event occurring between two groups and present our findings in a Kaplan-Meier curve (univariate analysis) or in a table/forest plot (multivariate analysis).

This special type of analysis allows to calculate the risk of the event (disease, death, etc.) occurring at a given time (hazard ratio). A common timescale used in survival analysis is time-to-event, however in large cohort studies data may be left-truncated (participants entering the study at different time points), making the time-scale unsuitable. Instead, age should be considered as the timescale. This is most relevant when exploring age-dependent associations between exposures and outcomes. Therefore, it ensures accurate estimation of hazard risk and median survival times, preventing underestimation.

Quick summary

In summary, survival analysis using large cohort data may be challenging and require comprehensive understanding of the data to ensure correct interpretation of the results.

Learning Outcomes

  • Understand the different types of censoring and how to curate your data
  • Conduct univariate and multivariate survival analysis using R
  • Graphically present the findings of a survival analysis
  • Interpret the results from a survival analysis
Task Time
Introduction 3 hours
Data curation 2 hours
Survival analysis 3 hours
Extension task: Advanced survival analysis 2 hours

Let’s get coding