Exploratory Data Analysis

Data Cleaning and Processing

In order to analyze the show Survivor, we used data sources from the castaway_details and castaways datasets,¹ as well as others, from the survivoR package.² This fan-made package contains data from all seasons of Survivor. According to the authors of the package, “the data was sourced from Wikipedia and the Survivor Wiki. Other data, such as the tribe colours, was manually recorded and entered by myself and contributors.” As each dataset contained distinct information on the contestants for each season, it was necessary to use joins to combine datasets to produce a final dataset to be analyzed. This was performed using a full join on the contestants’ full names. We primarily rely on contestant occurrences as the unit of analysis in our report, though it should be noted that distinct individuals can appear in multiple seasons and/or in multiple episodes per season. Additionally, in order to standardize our results for the survival analysis and exploratory data analysis, we removed seasons 2, 41, 42, and 43 as the seasons contained data for a different number of days from the standard 39 days. As the original data contains information from several versions around the globe, it was integral to filter and only analyze data from the U.S. edition. In order to best model our covariates of interest, we then created a new personality type variable (extracting whether a person is an introvert or extrovert) and a POC indicator variable (provided by the package) instead of individual races. We also used contestants’ home states to code contestants into a region based on census regions and divisions of the United States.³ Furthermore, we determined that missing data was not an issue, as it occurred in low frequency and was not patterned in nature. The final dataset used in the analysis contains unique information for each castaway for each season, including the following key variables:

version_season: version and season number
full_name: contestant full name
age_during_show : age, in years
poc : POC indicator, if known. Else, marked as White.
gender: 2 levels: Female, Male.
personality_type_binary : Extracted from the Myer-Briggs personality type of the castaway. 2 levels: Extrovert, Introvert.
days_survived: Number of days survived in the show until elimination
region: region in the U.S. where the contestant is from. We created this variable based on the state variable available in the dataset. 4 levels: West, Midwest, Northeast, South. Indicator variables for each of the 4 regions have also been created for analyses.

## cleaning Castaway Details dataset: filtering-out non-US seasons, creating personality type variable
castaway_details_us = castaway_details %>% 
  filter(str_detect(castaway_id, '^US')) %>%
  mutate(
    personality_type_binary = ifelse(
      str_detect(personality_type, '^E'), "Extrovert", "Introvert")) %>%
  select(-c(castaway_id, castaway, personality_type))

## Castaways dataset: filtering-out non-US seasons, renaming variables
castaways_us = castaways %>% filter(version == "US") %>%
  select(-c(version, season_name, season, castaway_id, castaway, jury_status, original_tribe)) %>%
  rename(age_during_show = age, days_survived = day)

## joining datasets
survivor_data_final = full_join(castaway_details_us, castaways_us, by = "full_name")

# check for multiple unique names per season
contestant_count_unique_rec = survivor_data_final %>%
  group_by(version_season, full_name) %>%
  summarise(count = n()) %>%
  mutate(num = row_number())

## summarizing the number of contestants per season & adding to joined dataset
contestant_count_df = survivor_data_final %>%
  group_by(version_season) %>%
  summarise(contestant_count = n_distinct(full_name))

survivor_data_final = full_join(survivor_data_final, contestant_count_df, by = "version_season")

## reordering variables, create new variable ethnicity, fix personality type variable
survivor_data_final = survivor_data_final %>%
  select(c("version_season", "full_name", "age_during_show", "race", "poc", "date_of_birth", "date_of_death", "occupation", "gender", "ethnicity", "personality_type_binary", "episode", "days_survived", "order", "contestant_count", "result", "city", "state")) %>% 
    mutate(ethnicity = ifelse(
      str_detect(poc, 'White'), survivor_data_final$poc, survivor_data_final$race)) %>% 
  arrange(version_season) %>%
  mutate(personality_type_binary = as.factor(personality_type_binary))

## adding region variable
survivor_data_final = survivor_data_final %>%
mutate(region = 
    ifelse(
    state == "Connecticut" 
    | state == "Maine" 
    | state == "Massachusetts" 
    | state == "New Hampshire"
    | state == "Rhode Island"
    | state == "Vermont" 
    | state == "New Jersey" 
    | state == "New York" 
    | state == "Pennsylvania", 
    "Northeast",
    
    ifelse(
    state == "Delaware" 
    | state == "District of Columbia" 
    | state == "Florida" 
    | state == "Georgia" 
    | state == "Maryland" 
    | state == "North Carolina"
    | state == "South Carolina" 
    | state == "Virginia"
    | state == "West Virginia" 
    | state == "Alabama"
    | state == "Kentucky" 
    | state == "Mississippi"
    | state == "Tennessee" 
    | state == "Arkansas" 
    | state == "Louisiana" 
    | state == "Oklahoma" 
    | state == "Texas", 
      "South", 
      
    ifelse(
      state == "Arizona"
      | state == "Colorado"
      | state == "Idaho"
      | state == "New Mexico"
      | state == "Montana"
      | state == "Utah" 
      | state == " Nevada"
      | state == "Wyoming"
      | state == "Alaska"
      | state == "California"
      | state == "Hawaii"
      | state == "Oregon"
      | state == "Washington"
    ,  "West", "Midwest"
    )))
  ) %>%
  mutate(NE = ifelse(region == "Northeast", 1, 0)) %>%
  mutate(South = ifelse(region == "South", 1, 0)) %>%
  mutate(West = ifelse(region == "West", 1, 0)) %>%
  mutate(Midwest = ifelse(region == "Midwest", 1, 0))

survivor_data_final_including_2_41_42_43 = survivor_data_final

write.csv(survivor_data_final_including_2_41_42_43, file = "./data/survivor_data_final_including_2_41_42_43.csv")

# preparing data for survival analysis

## season 41 and 42, the longest survival time is 26 days, exclude these seasons
## season 2, the longest survival time is 42 days, exclude this season
## season 43 is incomplete, exclude this season

survivor_data_final = survivor_data_final %>%
  filter(!(version_season %in% c("US02", "US41", "US42", "US43")))

## Saving final dataset as csv file
write.csv(survivor_data_final, file = "./data/survivor_data_final.csv")

## calculating percent NAs for all variables
survivor_data_final %>% summarise_all(list(name = ~sum(is.na(.))/length(.)))

survivor_data_final = read.csv("data/survivor_data_final.csv")

Summary

As a first step in our exploratory data analysis, we summarize frequencies of contestant occurrences by characteristics of interest, as well as report mean age and survival time across the population of contestant occurrences.

survivor_data_final %>% 
  select(gender, poc, personality_type_binary, age_during_show, days_survived, region) %>% 
  tbl_summary(type = list(gender ~ "categorical", 
                          poc ~ "categorical", 
                          personality_type_binary ~ "categorical", 
                          region ~ "categorical",
                          age_during_show ~ "continuous", 
                          days_survived ~ "continuous"), 
              statistic = list(all_continuous() ~ "{mean} ({sd})"), 
              digits = all_continuous() ~ 1,
              label = list(c(gender) ~ "Gender",
                           c(poc) ~ "Race Identifier",
                           c(personality_type_binary) ~ "Personality Type",
                           c(region) ~ "Region",
                           c(age_during_show) ~ "Age During Show (Years)",
                           c(days_survived) ~ "Survival Time on Show (Days)")) %>% 
  bold_labels()

Characteristic	N = 728¹
Gender
Female	356 (49%)
Male	368 (51%)
Unknown	4
Race Identifier
POC	199 (27%)
White	525 (73%)
Unknown	4
Personality Type
Extrovert	401 (56%)
Introvert	320 (44%)
Unknown	7
Age During Show (Years)	33.4 (10.1)
Survival Time on Show (Days)	23.9 (12.1)
Region
Midwest	99 (14%)
Northeast	153 (21%)
South	207 (28%)
West	269 (37%)
¹ n (%); Mean (SD)

Note: N = 728 refers to the total count of records (i.e. contestant occurrences) in survivor_data_final; distinct persons may be listed in multiple records, across seasons and/or within seasons.

Mean Days Survived, by Contestant Characteristics

We proceed to provide an overview of distinct person counts, contestant occurrences, and mean survival times for contestant occurrences by personality type, POC status, gender, and region.

## Personality Type
survivor_data_final %>% 
  group_by(personality_type_binary) %>% 
  summarize(n_personality_dist = n_distinct(full_name),
            n_personality_occ = n(),
            mean_days_survived = mean(days_survived)) %>% 
    na.omit()  %>%
  knitr::kable(digits = 1, col.names = c("Personality Type", "Distinct Persons", "Contestant Occurrences", "Mean Days Survived"))

Personality Type	Distinct Persons	Contestant Occurrences	Mean Days Survived
Extrovert	309	401	24.0
Introvert	271	320	23.6

## POC Status
survivor_data_final %>% 
  group_by(poc) %>% 
  summarize(n_poc_dist = n_distinct(full_name),
            n_poc_occ = n(),
            mean_days_survived = mean(days_survived, na.rm = TRUE)) %>% 
    na.omit()  %>%
  knitr::kable(digits = 1, col.names = c("POC Status", "Distinct Persons", "Contestant Occurrences", "Mean Days Survived"))

POC Status	Distinct Persons	Contestant Occurrences	Mean Days Survived
POC	164	199	22.6
White	418	525	24.3

## Gender
survivor_data_final %>% 
  group_by(gender) %>% 
  summarize(n_gender_dist = n_distinct(full_name),
            n_gender_occ = n(),
            mean_days_survived = mean(days_survived, na.rm = TRUE)) %>%
    na.omit()  %>%
  knitr::kable(digits = 1, col.names = c("Gender", "Distinct Persons", "Contestant Occurrences", "Mean Days Survived"))

Gender	Distinct Persons	Contestant Occurrences	Mean Days Survived
Female	292	356	23.1
Male	290	368	24.5

Note: We report both distinct person counts and contestant occurrences by personality type, POC status, and gender.

POC and Gender Representation Across Seasons

Next, we run line plots to visualize changes in the relative proportions of distinct individual contestants by POC status and by gender over the course of successive Survivor seasons.

The initial seasons included limited POC representation, with a noticeable spike observed in seasons in 13 and 14. Subsequent seasons have exhibited an increase in overall levels of POC participation, relative to the first twelve seasons of the show.

The proportion of women contestants appearing on the show has remained broadly consistent, hovering either at, or in some cases slightly below, 50 percent across time; season 28 is an exception, with fewer than 40 percent comprised of women.

fill_color = brewer.pal(9, "Reds")[4]
survivor_poc_over_time = survivor_data_final %>% 
  group_by(version_season, poc) %>%
  summarize(count = n_distinct(full_name)) %>%
  mutate(freq = count / sum(count)) %>%
  filter(poc == "POC") %>%
  separate(col = version_season, into = c('NA', 'season'), sep = 2) %>%
  dplyr::select(-"NA") %>%
  mutate(season = as.numeric(season))

ggplot(data = survivor_poc_over_time, aes(x = season, y = freq, group = 1)) +
  geom_line(color = fill_color) +
  geom_point(color = fill_color) +
  geom_smooth(se = FALSE, color = "seashell") + 
  ggtitle("Change in Proportion of POC Over Time ") +
  xlab("Season") + ylab("Proportion of POC")

Note: Distinct person counts by POC status.

fill_color = brewer.pal(9, "Purples")[5]
survivor_gender_over_time = survivor_data_final %>% 
  group_by(version_season, gender) %>%
  summarize(count = n_distinct(full_name)) %>%
  mutate(freq = count / sum(count)) %>%
  filter(gender == "Female") %>%
  separate(col = version_season, into = c('NA', 'season'), sep = 2) %>%
  dplyr::select(-"NA") %>%
  mutate(season = as.numeric(season))

ggplot(data = survivor_gender_over_time, aes(x = season, y = freq, group = 1)) +
  geom_line(color = fill_color) +
  geom_point(color = fill_color) +
  geom_smooth(se = FALSE, color = "purple") + 
  ggtitle("Change in Proportion of Women Over Time") +
  xlab("Season") + ylab("Proportion of Women")

Note: Distinct person counts by gender.

Concentration of Contestants by Age and Geography

Finally, we run a histogram to display the distribution of ages for contestant occurrences, as well as a map showing the geographic concentration of contestant occurrences by state. The distribution of ages is right-skewed, with contestants in their early twenties through early thirties heavily represented on the show. The most popular state of origin for contestants is California, whereas Alaska and West Virginia, among other states, are not represented among contestants.

fill_color = brewer.pal(9,"PuBuGn")[5]
ggplot(survivor_data_final, aes(x = age_during_show)) + 
  geom_histogram(aes(y = after_stat(count/sum(count))), bins = 15, fill = fill_color, col = "white") +
  scale_y_continuous(labels = scales::percent) + 
  ggtitle("Distribution of Ages Across Contestant Occurrences") +
  xlab("Age (During Show)") + ylab("Count")

Note: Since contestants can re-appear across seasons at different ages, we rely on discrete records from survivor_data_final (i.e. contestant occurrences) as the unit of analysis for this plot in order to ensure comprehensiveness of age data.

survivor_state = survivor_data_final %>%
  group_by(state) %>%
  summarize(n = n())

plot_usmap(
  data = survivor_state, values = "n", lines = "blue"
) + 
  scale_fill_continuous(type = "viridis", name = "Contestant Count", label = scales::comma
  ) + 
  labs(title = "US States", subtitle = "Geographic Distribution of Contestants") +
  theme(legend.position = "right")

Notes: (i) Seasons 2, 41, 42, and 43 have been removed from the exploratory analysis due to inconsistent number of days. (ii) Since contestants can re-appear across seasons with different states of residence, we similarly rely on discrete records from survivor_data_final (i.e. contestant occurrences) as the unit of analysis for this plot in order to ensure comprehensiveness of location data.