Data curation

The UK Biobank (synthetic) dataset

Throughout this project, we will use a synthetic version of the UK Biobank data. The UK Biobank is a large cohort study of over 500,000 volunteers recruited across the United Kingdom between 2006 and 2010. For this project, we will use data from 252,877 participants with high-quality lung function data, and identify those with airflow obstruction (AO), defined as FEV1/FVC < LLN (lower limit of normal). We will use AO as our outcome to determine the mortality risk in people with AO vs people without AO.

Essential Variables

Event (death) variables | Description | Coding
cofdmain | Main cause of death, coded using the International Classification of Diseases, 10th revision (ICD-10). This variable will be used later in the course to stratify the analysis by type of death | 0 = No event (alive/censored); 1 = Event (dead)
Date of death | Date of the participant’s death, obtained from the National Death Registry | Character; must be converted to a date for calculations

Outcome variable | Description | Coding
AO.fev1fvc | Airflow obstruction, derived as forced expiratory volume in 1 s / forced vital capacity below the lower limit of normal from NHANES equations (FEV1/FVC < LLN) | 0 = No airflow obstruction; 1 = Airflow obstruction
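
As an illustration of the AO definition (FEV1/FVC < LLN), the following minimal sketch derives a 0/1 flag from toy values. The data frame `lung` and its columns `fev1`, `fvc` and `lln` are hypothetical and not part of the synthetic dataset, where AO.fev1fvc is already provided.

```r
# Toy data: names and values are hypothetical, for illustration only
lung <- data.frame(
  fev1 = c(3.10, 1.80, 2.40),  # forced expiratory volume in 1 s (litres)
  fvc  = c(4.00, 3.20, 3.00),  # forced vital capacity (litres)
  lln  = c(0.70, 0.68, 0.69)   # assumed lower limit of normal for FEV1/FVC
)

# Ratio of FEV1 to FVC for each participant
lung$ratio <- lung$fev1 / lung$fvc

# 1 = airflow obstruction (ratio below the LLN), 0 = no airflow obstruction
lung$AO.fev1fvc <- as.integer(lung$ratio < lung$lln)

lung
```

Only the second toy participant has a ratio below their LLN, so only that row is flagged.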

Additional variables used in the multivariable (adjusted) model

Covariate name | Description | Coding
eid | Unique participant identification number for anonymised data | Integer
Age | Age of participant at recruitment (when they entered the study) | Integer
Sex | Sex of participant | Male, Female
bmi | Body mass index of participant | Numeric
UKB_centre | UK Biobank centre at which the participant was recruited | Character
Ethnicity | Genetically derived ethnicity of the participant | White vs Other
Smoking_status | Smoking status of the participant at recruitment | Never, current, previous

Data Curation

Note: the Surv() function in the {survival} package, which will be used in the Survival analysis chapter, accepts several codings of the event indicator: TRUE/FALSE, where TRUE is the event and FALSE is censored; 1/0, where 1 is the event and 0 is censored; or 2/1, where 2 is the event and 1 is censored.

Please make sure that the event is properly formatted. In this example we will be using 0/1.
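
The codings listed in the note can be compared on a toy example. This is a minimal sketch with made-up follow-up times, assuming the {survival} package is installed. It builds the same three observations with each coding and extracts the status that Surv() stores internally.

```r
library(survival)

time <- c(5, 8, 3)  # toy follow-up times (hypothetical)

# The same three observations (event, censored, event) in each coding
s_01      <- Surv(time, c(1, 0, 1))
s_logical <- Surv(time, c(TRUE, FALSE, TRUE))
s_12      <- Surv(time, c(2, 1, 2))

# Surv objects are stored as a matrix with "time" and "status" columns
status_01      <- unclass(s_01)[, "status"]
status_logical <- unclass(s_logical)[, "status"]
status_12      <- unclass(s_12)[, "status"]

cbind(status_01, status_logical, status_12)
```

All three columns should agree. The 0/1 coding is the least ambiguous of the three, which is one reason to prefer it.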

# Load Data set 

death <- read.table("data/synthetic_data_ReCoDe.txt", header=TRUE, sep="\t")

# Months need to be two-digit numbers ("01" to "12"), not month names
# month.name is R's built-in vector of English month names ("January", ..., "December")
death$monthofbirth <- sprintf("%02d", match(as.character(death$monthofbirth), month.name))

# Build the date of birth as YYYY-MM-DD; the day is set to 01 (only year and month of birth are used)
death$dateofbirth <- paste0(death$yearofbirth,"-", death$monthofbirth, "-01")
death$dateofbirth <- as.Date(death$dateofbirth)
death$date_of_death <- as.Date(death$date_of_death)
death$now_cens <- as.Date("2023-01-01") #Change to "end of study" censoring date

Now that the dates are formatted, we need to calculate the difference between the start and end dates in a chosen unit. In this example, time will be calculated in years.

As participants could enter the study at any point between 2006 and 2010, there is delayed entry into the study. This type of data is known as left-truncated. Because we also define an “end date”, the data will be right-censored as well. We therefore use age as the time scale to account for both.
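
One common way to express delayed entry with age as the time scale is the counting-process form of Surv(), which takes an entry time, an exit time and an event indicator. The sketch below uses made-up participants. The column names `age_entry`, `age_exit` and `event` are hypothetical, and whether the later chapters use this form is an assumption.

```r
library(survival)

# Toy data: hypothetical participants, for illustration only
toy <- data.frame(
  age_entry = c(52.3, 61.0, 47.8),  # age at recruitment (years)
  age_exit  = c(64.1, 70.5, 60.2),  # age at death or at the censoring date (years)
  event     = c(1, 0, 0)            # 1 = died, 0 = censored
)

# Each participant is at risk only between age_entry and age_exit:
# entry after the time origin is left truncation, event = 0 is right censoring
s <- Surv(time = toy$age_entry, time2 = toy$age_exit, event = toy$event)

s
attr(s, "type")  # type of survival object that was created
```

With this form, a participant recruited at age 61 does not contribute to the risk set at younger ages, which is what accounting for left truncation means in practice.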

# Age in days at death and at the censoring date (subtracting two Dates gives days)
death$timetoevent <- as.numeric(death$date_of_death - death$dateofbirth)
death$timetocens <- as.numeric(death$now_cens - death$dateofbirth)

# Final time: age at death if the participant died, otherwise age at the censoring date
death$time <- ifelse(is.na(death$date_of_death), death$timetocens, death$timetoevent)

death$time_years <- death$time / 365.25 # Convert days to years
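
The calculation can be checked by hand on two made-up participants, one censored and one who died. This is a minimal sketch in which the data frame `toy` and its values are hypothetical, and the censoring date matches the one used in this example.

```r
# Toy data: hypothetical participants, for illustration only
toy <- data.frame(
  dateofbirth   = as.Date(c("1950-01-01", "1948-06-01")),
  date_of_death = as.Date(c(NA, "2015-06-01"))  # NA = no death recorded
)
cens_date <- as.Date("2023-01-01")

# Age in days at death and at the censoring date
days_to_event <- as.numeric(toy$date_of_death - toy$dateofbirth)
days_to_cens  <- as.numeric(cens_date - toy$dateofbirth)

# Use age at death when available, otherwise age at censoring, then convert to years
toy$time_years <- ifelse(is.na(toy$date_of_death), days_to_cens, days_to_event) / 365.25

toy
```

The first participant is censored at roughly 73 years of age and the second dies at roughly 67, which matches what the dates suggest.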

# Event indicator: 1 = a cause of death is recorded (event occurred),
# 0 = no cause of death recorded (alive/censored)
# Stored as numeric 0/1, the coding used with Surv() in this example
death$allcause_death <- as.integer(!is.na(death$cofdmain))
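
Because the event indicator is derived from cofdmain while the follow-up time is derived from date_of_death, it can be worth checking that the two agree. The sketch below shows one possible check on made-up records. The data frame `toy` and the `inconsistent` column are hypothetical.

```r
# Toy data: hypothetical records, for illustration only
toy <- data.frame(
  cofdmain      = c("C34", NA, "I21", NA),
  date_of_death = as.Date(c("2015-06-01", NA, NA, "2018-02-10"))
)

# 1 = a cause of death is recorded, 0 = no cause recorded
toy$allcause_death <- as.integer(!is.na(toy$cofdmain))

# Flag records with a cause but no date of death, or a date but no cause
toy$inconsistent <- is.na(toy$cofdmain) != is.na(toy$date_of_death)

table(event = toy$allcause_death, inconsistent = toy$inconsistent)
```

Records flagged as inconsistent would receive an event indicator and a follow-up time that contradict each other, so they are worth inspecting before any model is fitted.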

Overview of the data

library(knitr)

head_data <- head(death[, c("Sex","Smoking_status", "allcause_death", "time_years")], 15)

# Create a table
kable(head_data, caption = "First 15 Observations of Survival Analysis Data")
First 15 Observations of Survival Analysis Data
Sex Smoking_status allcause_death time_years
Female Never 1 67.56742
Female Never 0 58.25051
Male Never 0 72.41889
Female Never 0 64.75291
Female Never 0 82.58453
Male Previous 0 65.41821
Female Previous 0 64.25188
Female Never 0 79.67146
Male Previous 0 81.50308
Female Previous 0 67.08556
Female Never 0 64.08487
Male Never 0 84.00000
Male Current 0 62.16564
Female Never 0 74.58453
Female Never 0 71.33470

How many participants have experienced the event (i.e. died)?

library(dplyr)
library(tidyr)
library(knitr)

# Create a summary table
summary_table <- death %>%
  group_by(Sex, allcause_death) %>%
  summarise(count = n(), .groups = 'drop') %>%
  pivot_wider(names_from = Sex, values_from = count, values_fill = list(count = 0)) %>%
  mutate(Total = Male + Female)

# Display the summary table
kable(summary_table, caption = "Number of participants that have died by sex")
Number of participants that have died by sex
allcause_death Female Male Total
0 142563 95574 238137
1 6986 7754 14740
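
Counts are easier to compare between sexes as proportions. As a minimal sketch, the counts from the table above are entered by hand and converted to column percentages with base R's prop.table(). The object names `counts` and `prop` are arbitrary.

```r
# Counts copied from the table above (rows: allcause_death, columns: Sex)
counts <- matrix(
  c(142563, 6986,   # Female: alive/censored, died
    95574,  7754),  # Male: alive/censored, died
  nrow = 2,
  dimnames = list(allcause_death = c("0", "1"), Sex = c("Female", "Male"))
)

# Proportion of each sex in each event category (each column sums to 1)
prop <- prop.table(counts, margin = 2)

round(100 * prop, 2)  # percentages
```

The row for allcause_death = 1 then gives the percentage of women and of men who died during follow-up.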

How many participants have airflow obstruction (outcome)?

library(dplyr)
library(tidyr)
library(knitr)

summary_table <- death %>%
  group_by(Sex, AO.fev1fvc) %>%
  summarise(count = n(), .groups = 'drop') %>%
  pivot_wider(names_from = Sex, values_from = count, values_fill = list(count = 0)) %>%
  mutate(Total = Male + Female)

# Display the summary table
kable(summary_table, caption = "Number of participants with airflow obstruction by sex")
Number of participants with airflow obstruction by sex
AO.fev1fvc Female Male Total
0 139183 91844 231027
1 10366 11484 21850
#save(death, file = "data/result.RData")