Statistical Analysis

Methodology

For our project, we are interested in using survival analysis. Suppose that there is a true survival time, T, as well as a true censoring time, C. The survival time represents the time at which the event of interest occurs: in this dataset, the time (in days) at which a participant is voted out. The censoring time is the time at which a participant drops out of the game show or survived until the last day of the show.

We observed the survival time T and censoring time C. Suppose there is a random variable Y

\[Y = min(T,C)\]

In other words, if the event occurs before the censoring such that T \(<\) C, then we observed the true survival time T. If censoring occurs before the event such as T \(>\) C, then we observe the censoring time. The status indicator as,

\[\delta = \left\{ \begin{array}{ll} 1 & T\leq C \\ 0 & T > C \\ \end{array} \right. \]

Thus, \(\delta\) = 1 if we observe the true survival time, and \(\delta\) = 0 if we observe the censoring.

We use the Cox-propositional hazard model to evaluate the effect of several factors on survival time in Survivor. It allows us to examine how specified factors influence the rate of the event that we are interested in at a particular point in time. This rate is the hazard rate. Covariates are age, gender, poc, personality type, and region in this model. The Cox model is expressed by the hazard function denoted by h(t). Briefly, the hazard function can be interpreted as the risk of voted out at time t. It can be estimated as follows: \[h(t) = h_0(t)\exp(x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_4\beta_4 + x_4\beta_5)\] where,

\(t\) represents the survival time
\(h(t)\) is the hazard function determined by a set of participants’ covariates \((x_1, x_2, x_3, x_4, x_5)\): \(x_1\) is age during the show, \(x_2\) is gender, \(x_3\) is POC, , \(x_4\) is personality type, and \(x_5\) is region
the coefficients \((\beta1, \beta2, \beta3, \beta4, \beta5)\) measure the impact of covariates
\(h_0(t)\geq 0\) is the baseline hazard function. It is the hazard function for an individual if all \((x_1, x_2, x_3, x_4, x_5)\) are equal to zero.

Connection with the Log-Rank Test

Suppose we have a single predictor \((p=1)\), which we assume to be binary. We aim to determine whether there is a difference between the survival times of the observations in the group. For instance, we consider a test to compare difference survival times between females and males. The null hypothesis is that the two groups have identical hazard functions: \[H_0: h_1(t) = h_2(t)\]

Result

## reading in data
survivor_data_final = 
  read.csv("data/survivor_data_final.csv")
## Create status and time variable
# filtering of seasons moved to data wrangling file
# replaced all survivor_final with survivor_data_final since filtering was moved to use for EDA
status <- c()
time <- c()
for (i in 1:nrow(survivor_data_final))
  {
  if (survivor_data_final[i,17] == "Quit") {
    status[i] <- 0
    time[i] <- survivor_data_final$days_survived[i]
  }
  else if (survivor_data_final[i,17] == "Sole Survivor") {
    status[i] <- 0
    time[i] <- survivor_data_final$days_survived[i]
  } 
  else if (survivor_data_final[i,17] == "Runner-up") {
    status[i] <- 0
    time[i] <- survivor_data_final$days_survived[i]
  } else if (survivor_data_final[i,17] == "Co-runner-up") {
    status[i] <- 0
    time[i] <- survivor_data_final$days_survived[i]
  }
  else {
    status[i] <- 1
    time[i] <- survivor_data_final$days_survived[i]
  }
}
survivor_data_final[,25] <- status
survivor_data_final[,26] <- time
colnames(survivor_data_final)[25] <- "status"
colnames(survivor_data_final)[26] <- "time"

surv_model_cox1 <- coxph(data = survivor_data_final,
  Surv(time, status) ~ age_during_show + gender + poc + personality_type_binary + region)
# summary(surv_model_cox1)
# Clean table for Cox PH
tabcoxph(surv_model_cox1,
         var.labels = list(age_during_show = "Age", gender = "Gender", poc = "POC",  personality_type_binary = "Personality Type",  region = "Region"),
         decimals = 2)

Variable	Beta (SE)	HR (95% CI)	P
Age	-0.00 (0.00)	1.00 (0.99, 1.01)	0.95
Gender
Female (ref)	–	–	–
Male	-0.05 (0.08)	0.95 (0.81, 1.11)	0.50
POC
POC (ref)	–	–	–
White	-0.11 (0.09)	0.89 (0.75, 1.07)	0.21
Personality Type
Extrovert (ref)	–	–	–
Introvert	0.07 (0.08)	1.07 (0.91, 1.25)	0.42
Region
Midwest (ref)	–	–	–
Northeast	-0.00 (0.15)	1.00 (0.75, 1.33)	0.99
South	0.26 (0.14)	1.30 (1.00, 1.70)	0.05
West	0.17 (0.13)	1.18 (0.91, 1.53)	0.20

HR interpretations:

As age increases in years, there is no difference in survival since the hazard ratio is 1. We are 95% confident that the HR falls between 0.9922 to 1.008. However, the HR is not statistically significant since the p-value of 0.998 > 0.05.

For the gender variable, the HR value of 0.96 would suggest that the survival time for Male contestants compared to Female contestants (the reference category) is longer. We are 95% confident that the HR falls between 0.82 and 1.12. However, the HR is not statistically significant since the p-value of 0.57 > 0.05.

For the POC variable, the HR value of 0.8881 would suggest that the survival time for White contestants compared to POC contestants (the reference category) is longer. We are 95% confident that the HR falls between 0.7456 and 1.058. However, the HR is not statistically significant since the p-value of 0.184 > 0.05.

For personality type, the HR value of 1.0631 would suggest that the survival time for Introvert contestants compared to Extrovert contestants (the reference category) is shorter. We are 95% confident that the HR falls between 0.9076 - 1.245. However, the HR is not statistically significant since the p-value of 0.448 > 0.05.

For region, Midwest is the reference category. However, all of these hazard ratios are not statistically significant, p-value > 0.05.
- The HR value of 0.99 for Northeast suggests that the survival time for contestants from the NE is longer compared to contestants from the Midwest.
- The HR value of 1.29 for the South suggests that the survival time for contestants from the South is shorter compared to contestants from the Midwest.
- The HR value of 1.18 for the West suggests that the survival time for contestants from the West is shorter compared to contestants from the Midwest.

Estimation and Statistical Inference

We use Kaplan-Meier Curves to estimate the survival function for contestants. We apply the ggsurvplot function to draw the survival curves and calculate log-rank test p-value. The following plots are the estimated probability of survival for contestants in each group. Based on the log rank test results, most of the log-rank test p-values are not statistically significant at \(\alpha=0.05\). While we perform the log-rank test for contestants from Northeast and non-Northeast regions, the p-value = 0.049, which implies there is significant difference between Northeast and non-Northeast contestants.

Kaplan-Meier plotter-personality

surv_model_per <- survfit(Surv(time, status)~ survivor_data_final$personality_type_binary)
ggsurvplot(
  surv_model_per,
  data = survivor_data_final,
  size = 1,                 # change line size
  palette =
    c("#E7B800", "#2E9FDF"),# custom color palettes
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value show log-rank test
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("Extrovert", "Introvert"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

#basic graph
# plot(surv_model_per, xlab = "Days",
#      ylab = "Estimated Probability of Survival", col = c(2,4))

Kaplan-Meier plotter-White vs Non-White

surv_model_poc <- survfit(Surv(time, status)~ survivor_data_final$poc)
ggsurvplot(
  surv_model_poc,
  data = survivor_data_final,
  size = 1,                 # change line size
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("POC", "White"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

Kaplan-Meier plotter-gender

surv_model_sex <- survfit(Surv(time, status)~ survivor_data_final$gender)
ggsurvplot(
  surv_model_sex,
  data = survivor_data_final,
  size = 1,                 # change line size
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("female", "male"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

Kaplan-Meier plotter-Region

surv_model_Region <- survfit(Surv(time, status)~ survivor_data_final$region)
ggsurvplot(
  surv_model_Region,
  data = survivor_data_final,
  size = 1,                 # change line size
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("Midwest", "Northeast", "South", "West"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

Kaplan-Meier plotter-Northeast

surv_model_NE <- survfit(Surv(time, status)~ survivor_data_final$NE)
ggsurvplot(
  surv_model_NE,
  data = survivor_data_final,
  size = 1,                 # change line size
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("Non-Northeast", "Northeast"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

Kaplan-Meier plotter-South

surv_model_South <- survfit(Surv(time, status)~ survivor_data_final$South)
ggsurvplot(
  surv_model_South,
  data = survivor_data_final,
  size = 1,                 # change line size
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("Non-South", "South"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

Kaplan-Meier plotter-West

surv_model_West <- survfit(Surv(time, status)~ survivor_data_final$West)
ggsurvplot(
  surv_model_West,
  data = survivor_data_final,
  size = 1,                 # change line size
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("Non-West", "West"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

Kaplan-Meier plotter-Midwest

surv_model_Midwest <- survfit(Surv(time, status)~ survivor_data_final$Midwest)
ggsurvplot(
  surv_model_Midwest,
  data = survivor_data_final,
  size = 1,                 # change line size
  conf.int = FALSE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("Non-Midwest", "Midwest"),    # Change legend labels
  risk.table.height = 0.35, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)