Friday, July 26, 2019

Baseball Data Science: Part 1 - Gathering and preparing data for modelling


I have 3 main research questions right now:
1) How well can I predict the likelihood of past baseball players making the BBWAA Hall of Fame using different methods like generalized linear models, machine learning models, and heuristic models? (Model fit)
2) What features are most important among HOFers? (Feature selection)
3) Can we create a graphical visualization of the different “tiers” of HOFers? (Visualization)

The goal for this first portion is to aggregate the data using dplyr. I’m going to get career data for players, then filter out players who don’t meet the requirements to be in the Baseball Hall of Fame.

# Packages I'll need that aren't in base R are in this chunk, to keep the clutter down.
### This is a function that my professor (Dr. Bergen) wrote. It computes more robust standard error estimates using some math that I don't fully understand. There is a slight power loss according to our simulation studies, but it handles models with non-constant variance much better than the standard method.
## The following function prints standard lm/glm "summary" output with naive (standard) and robust 'sandwich' (HC3) SEs, p-values, and 95% CIs
summary.hc <- function(model.obj, expo = FALSE) {
  require(sandwich)
  fam <- ifelse(class(model.obj)[1]=='glm',family(model.obj)$family,'gaussian')
  betahats <- model.obj$coefficients
  n <- nrow(model.obj$model)
  p <- length(betahats)
  vvHC <- vcovHC(model.obj)
  robust.SE <- sqrt(diag(vvHC))
  test.stat <- betahats/robust.SE
  # Robust p-values: t reference distribution for gaussian models, normal otherwise
  robust.pval <- if (fam=='gaussian') 2*pt(abs(test.stat), df = n-1, lower.tail=FALSE) else 2*pnorm(abs(test.stat), lower.tail=FALSE)
  crit <- ifelse(fam=='gaussian', qt(.975, n-1), 1.96)
  default.table <- summary(model.obj)$coefficients
  default.CI <- confint(model.obj)
  robust.CI <- matrix(c(betahats - crit*robust.SE, betahats + crit*robust.SE),p,2)
  newtab.naive <- data.frame(cbind(default.table,default.CI))
  newtab.robust <- data.frame(cbind(betahats, robust.SE,test.stat, robust.pval, robust.CI))
  if(expo) {
    newtab.naive$Estimate <- exp(newtab.naive$Estimate)
    newtab.naive$X2.5.. <- exp(newtab.naive$X2.5..)
    newtab.naive$X97.5.. <- exp(newtab.naive$X97.5..)
    newtab.robust$betahats <- exp(newtab.robust$betahats)
    newtab.robust$V5 <- exp(newtab.robust$V5)
    newtab.robust$V6 <- exp(newtab.robust$V6)
  }
  names(newtab.naive) <- names(newtab.robust) <-  c('Estimate','Std Error','t','p-value','95% LCL','95% UCL')
  if(fam!='gaussian')   names(newtab.naive) <- names(newtab.robust) <-  c('Estimate','Std Error','z','p-value','95% LCL','95% UCL')
  if(expo) {
    newtab.naive <- newtab.naive[-1,]
    newtab.robust <- newtab.robust[-1,]
  }
  output <- list('Naive' = round(newtab.naive,4), 'Robust' = round(newtab.robust,4))
  return(output)
}
### This is another function my professor wrote, for linear combinations of regression coefficients. It takes a contrast vector, combines the fitted coefficients, and reports the combined effect with a robust SE.
##The following function computes linear combination of regression coefficients
#using HC3 covariance matrix along with p-values and 95% CI
# Use expo = TRUE if dealing with e.g. logistic regression models
lincom <- function(model.obj, contrast.vector, expo = FALSE, name = NULL,  digits = 4) {
  require(sandwich)
  fam <- ifelse(class(model.obj)[1]=='glm',family(model.obj)$family,'gaussian')
  n <- nrow(model.obj$model)
  vmat <- vcovHC(model.obj)
  betahat <-coef(model.obj)
  if(length(betahat)!=length(contrast.vector)) stop('Contrast vector must include one element for each coefficient')
  cb <- as.numeric(contrast.vector%*%betahat)
  se.cb <- as.numeric(sqrt(contrast.vector%*%vmat%*%contrast.vector))
  test.stat <- cb/se.cb
  crit <- ifelse(fam=='gaussian',qt(.975,n-1),1.96)
  pval <- ifelse(fam=='gaussian',2*pt(abs(test.stat), df = n-1, lower.tail=FALSE), 2*pnorm(abs(test.stat),lower.tail=FALSE))
  ci <- cb+c(-crit, crit)*se.cb
  if(expo) {
    cb <- exp(cb)
    ci <- exp(ci)
  }
  ci <- round(ci, digits)
  output <- data.frame('estimate'=round(cb,digits),'pval'=round(pval,digits), 'LCL.95pct' = ci[1], 'UCL.95pct' = ci[2])
  rownames(output) <- name
  return(output)
}
# Tidyverse will be doing the bulk of the work for the data cleaning and data visualization steps. I'm most familiar with dplyr for data cleaning, but I'll also play around with tibbles and other parts of the tidyverse. I recently bought the R for Data Science book, and I'll read it and apply it where possible.
library(tidyverse)
# The Lahman package is the main place to find the data I need for the analysis
library(Lahman)
# The cvAUC package is used for getting cross-validated AUC estimates via its cvAUC() function.
library(cvAUC)

First step is to load the data sets. Now seems like a good place to talk about where I got the data from - there are several sources. The bulk of the data is from the Lahman database, via the “Lahman” package. There is also data from baseball-reference.com for advanced statistics that aren’t in the Lahman database. I got data from the Mitchell Report (via Wikipedia) on players who have been implicated in or caught using Performance Enhancing Drugs (PEDs). I’m still working on finding a way to include advanced statistics from fangraphs.com; if I had to, I could recreate those formulas from scratch, but that would not be ideal…
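
For anyone following along, the BR WAR file can be pulled straight into R - a minimal sketch, assuming the file is still hosted at that URL (the destination path is just where I keep my baseball files):

# Download the BR WAR file (assumes the URL is still live)
download.file("https://www.baseball-reference.com/data/war_daily_bat.txt",
              destfile = "/Users/nn7886gk/Desktop/Baseball/war_daily_bat.txt")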

# Generate Batting Data
data(Batting)
# Get career data for players
Batting_career <- as_tibble(Batting) %>% 
  group_by(playerID) %>%
  select(-c(yearID)) %>% 
  summarise_if(is.numeric, sum) %>% 
  print()
## # A tibble: 19,428 x 19
##    playerID stint     G    AB     R     H   X2B   X3B    HR   RBI    SB
##    <chr>    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 aardsda…     9   331     4     0     0     0     0     0     0     0
##  2 aaronha…    23  3298 12364  2174  3771   624    98   755  2297   240
##  3 aaronto…     7   437   944   102   216    42     6    13    94     9
##  4 aasedo01    13   448     5     0     0     0     0     0     0     0
##  5 abadan01     3    15    21     1     2     0     0     0     0     0
##  6 abadfe01    10   363     9     0     1     0     0     0     0     0
##  7 abadijo…     3    12    49     4    11     0     0     0     5     1
##  8 abbated…    11   855  3044   355   772    99    43    11   324   142
##  9 abbeybe…     7    79   225    21    38     3     3     0    17     3
## 10 abbeych…     5   452  1756   307   493    67    46    19   280    93
## # … with 19,418 more rows, and 8 more variables: CS <int>, BB <int>,
## #   SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>

What I just did was pull the Batting data from the Lahman package and take the sum of every numeric column for each player (group_by playerID, then summarise_if). If a column isn’t numeric, taking the sum doesn’t make sense, so I dropped it - R can’t guess how I’d want to handle that. One obvious issue is that the sum of yearID is meaningless, so I selected every column except that one.

Now that I have all the career batting stats from the Lahman data, I can work on connecting them to the baseball-reference.com data. One issue I might run into is a difference between how often the baseball-reference (BR from now on) data and the Lahman data are updated. This shouldn’t matter for the majority of my analysis, because later on I’m going to filter out players who aren’t eligible for the HOF (eligibility criteria: played at least 10 years and retired at least 5 years). Current players are typically not retired 5+ years (except Brett Favre. He still may make a comeback yet). The data can be found at www.baseball-reference.com/data.

Since what I have is batting data only, I can filter out the pitchers early on to make the dataset significantly smaller. There is a wrinkle, though: some players were both pitchers and hitters, especially in the dead-ball era. Babe Ruth is a famous example - according to these data, he was classified as a pitcher for his first 4 seasons (ages 19-22) and a hitter for the rest of his career. I’m going to ignore this for now, but a sketch of one way to flag those players is below.
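
Since I’ll want to revisit those two-way players, here’s a rough sketch of how I could flag them later using the Lahman Pitching table - the cutoff of 50 career games pitched is an arbitrary assumption on my part:

# Flag players with substantial pitching careers (50 games is an arbitrary cutoff)
data(Pitching)
two_way_candidates <- as_tibble(Pitching) %>%
  group_by(playerID) %>%
  summarise(games_pitched = sum(G)) %>%
  filter(games_pitched >= 50)
# Anyone in both two_way_candidates and Batting_career deserves a closer look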

# Don't want all the same summary statistics for this, because WAR is cumulative but things like OPS+ need a weighted average to be meaningful
bWAR_batting <- read.csv("/Users/nn7886gk/Desktop/Baseball/war_daily_bat.txt")
bWAR_batting_career <- as_tibble(bWAR_batting) %>%
  filter(pitcher == "N") %>% 
  group_by(player_ID) %>% 
  summarise(career_WAR = sum(WAR), career_oWAR = sum(WAR_off), career_dWAR = sum(WAR_def),
            career_PA = sum(PA), career_Inn = sum(Inn), career_WAA = sum(WAA), career_oWAA = sum(WAA_off),
            career_dWAA = sum(WAA_def), career_OPSplus = weighted.mean(OPS_plus, PA))
# This chunk has eval = FALSE, so it doesn't actually run. I wasn't careful enough when importing the data and was getting the error "'sum' not meaningful for factors" - R had read in columns as factors that were supposed to be numeric, so it refused to add them. I solve this in the next chunk.

Great. I need to fix some columns because right now they’re factors when they should be numeric (plate appearances, for example). Since you can’t sum factors in R, I need to fix this before I can move forward.
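
For reference, the classic pitfall once columns come in as factors: calling as.numeric() on a factor returns the underlying level codes, so the conversion has to go through as.character() first. A generic illustration (not this dataset):

x <- factor(c("12", "NULL", "7"))
as.numeric(x)                # wrong: returns the level codes 1, 3, 2
as.numeric(as.character(x))  # right: 12, NA (with a warning), 7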

# About an hour later, I finally figured out how to make this work - the missing values in this dataset weren't coded "NA" but "NULL", so R treated NULL as a level of each factor instead of as missingness
bWAR_batting <- as_tibble(read.csv("/Users/nn7886gk/Desktop/Baseball/war_daily_bat.txt", na.strings = "NULL"))
# This code takes like 4 minutes to run on my newish Macbook. I'm guessing the summarise function isn't super efficient? 
bWAR_batting_career <- bWAR_batting %>% 
  filter(pitcher == "N") %>% 
  group_by(player_ID) %>% 
  summarise(career_WAR = sum(WAR), career_oWAR = sum(WAR_off), career_dWAR = sum(WAR_def),
            career_PA = sum(PA), career_Inn = sum(Inn), career_WAA = sum(WAA), career_oWAA = sum(WAA_off),
            career_dWAA = sum(WAA_def), career_OPSplus = round(weighted.mean(OPS_plus, PA), 0)) %>%
  print()
## # A tibble: 10,683 x 10
##    player_ID career_WAR career_oWAR career_dWAR career_PA career_Inn
##    <fct>          <dbl>       <dbl>       <dbl>     <int>      <dbl>
##  1 aaronha01    143.         132.         -4.6      13941     27947.
##  2 aaronto01     -2.8         -2.13       -2.66      1045      2158.
##  3 abadan01      -0.37        -0.37       -0.03        25        46 
##  4 abadijo01     -0.06         0.04       -0.09        49        NA 
##  5 abbated01      8.68        15.0        -2.98      3461        NA 
##  6 abbeych01      1.78         2.48       -2.63      1965        NA 
##  7 abbotfr01      0.52         0.41        1.11       560        NA 
##  8 abbotje01     -1.43         0.44       -2.18       651      1337.
##  9 abbotku01      0.540        3.77       -1.06      2227      4545.
## 10 abbotod01     -0.41        -0.42       -0.05        82        NA 
## # … with 10,673 more rows, and 4 more variables: career_WAA <dbl>,
## #   career_oWAA <dbl>, career_dWAA <dbl>, career_OPSplus <dbl>

Now to put them together. I want an inner join because I don’t want a lot of unnecessary missingness in the final dataset; it also drops all the pitchers from the Lahman data and any players I don’t have in the BR data. Then I want to make some new variables and assemble a final career dataset to use for modelling.
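
As a toy illustration of the join choice (made-up mini-tables, not the real data):

a <- tibble(playerID = c("p1", "p2", "p3"), HR = c(10, 20, 30))
b <- tibble(playerID = c("p2", "p3", "p4"), WAR = c(1.5, 2.5, 3.5))
inner_join(a, b, by = "playerID")  # 2 rows: only p2 and p3 appear in both tables
left_join(a, b, by = "playerID")   # 3 rows: p1 is kept, with WAR = NA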

# Add Retirement year
## If this chunk runs more than once, the join re-adds year_ID (you end up with year_ID.x and year_ID.y), so only run it once
Retirement_year <- aggregate(year_ID ~ player_ID, FUN = max, data = bWAR_batting)

bWAR_batting_career <- bWAR_batting_career %>% 
  inner_join(Retirement_year, by = "player_ID") %>% 
  print()
## # A tibble: 10,683 x 11
##    player_ID career_WAR career_oWAR career_dWAR career_PA career_Inn
##    <fct>          <dbl>       <dbl>       <dbl>     <int>      <dbl>
##  1 aaronha01    143.         132.         -4.6      13941     27947.
##  2 aaronto01     -2.8         -2.13       -2.66      1045      2158.
##  3 abadan01      -0.37        -0.37       -0.03        25        46 
##  4 abadijo01     -0.06         0.04       -0.09        49        NA 
##  5 abbated01      8.68        15.0        -2.98      3461        NA 
##  6 abbeych01      1.78         2.48       -2.63      1965        NA 
##  7 abbotfr01      0.52         0.41        1.11       560        NA 
##  8 abbotje01     -1.43         0.44       -2.18       651      1337.
##  9 abbotku01      0.540        3.77       -1.06      2227      4545.
## 10 abbotod01     -0.41        -0.42       -0.05        82        NA 
## # … with 10,673 more rows, and 5 more variables: career_WAA <dbl>,
## #   career_oWAA <dbl>, career_dWAA <dbl>, career_OPSplus <dbl>,
## #   year_ID <int>

I kept running into issues where I would accidentally re-run the whole chunk and break my year_ID variable (the repeated join duplicates the column). I’ll separate the steps to solve that issue. A guard against re-running is sketched below.
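
Another defensive option would be to guard the join so that re-running the chunk is harmless - a sketch:

# Only join year_ID in if it isn't already there
if (!"year_ID" %in% names(bWAR_batting_career)) {
  bWAR_batting_career <- bWAR_batting_career %>%
    inner_join(Retirement_year, by = "player_ID")
}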

# Final dataset, needs only a few further modifications
Batting_career_final <- inner_join(Batting_career, bWAR_batting_career, by = c("playerID" = "player_ID")) 
## Warning: Column `playerID`/`player_ID` joining character vector and factor,
## coercing into character vector
# Adds HOF eligibility, 0 if not eligible and 1 if eligible
Batting_career_final <- mutate(Batting_career_final, HOF_Eligible = ifelse(((stint >= 10) & (year_ID + 6 <= 2019)), 1, 0)) %>% print()
## # A tibble: 10,499 x 30
##    playerID stint     G    AB     R     H   X2B   X3B    HR   RBI    SB
##    <chr>    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 aaronha…    23  3298 12364  2174  3771   624    98   755  2297   240
##  2 aaronto…     7   437   944   102   216    42     6    13    94     9
##  3 abadan01     3    15    21     1     2     0     0     0     0     0
##  4 abadijo…     3    12    49     4    11     0     0     0     5     1
##  5 abbated…    11   855  3044   355   772    99    43    11   324   142
##  6 abbeych…     5   452  1756   307   493    67    46    19   280    93
##  7 abbotfr…     3   160   513    48   107    21     6     1    49    14
##  8 abbotje…     5   233   596    82   157    33     2    18    83     6
##  9 abbotku…    11   702  2044   273   523   109    23    62   242    22
## 10 abbotod…     1    22    70     2    13     2     1     0     6     3
## # … with 10,489 more rows, and 19 more variables: CS <int>, BB <int>,
## #   SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>,
## #   career_WAR <dbl>, career_oWAR <dbl>, career_dWAR <dbl>,
## #   career_PA <int>, career_Inn <dbl>, career_WAA <dbl>,
## #   career_oWAA <dbl>, career_dWAA <dbl>, career_OPSplus <dbl>,
## #   year_ID <int>, HOF_Eligible <dbl>
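
One caveat on the eligibility flag: in the Lahman Batting table, stint indexes a player’s team within a single season (1, 2, …), so the summed stint I used above is only a rough proxy for years played, and it overcounts anyone who changed teams mid-season. A sketch of the stricter check, counting distinct seasons:

# Count distinct seasons per player; n_seasons >= 10 would be the stricter eligibility test
seasons_played <- as_tibble(Batting) %>%
  group_by(playerID) %>%
  summarise(n_seasons = n_distinct(yearID))
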
## How can anyone else get these data? Better data somewhere else? 
# Merge with dataset of PEDs users
PEDs <- read.csv("/Users/nn7886gk/Desktop/Baseball/PEDs.csv")
Batting_career_final2 <- left_join(Batting_career_final, PEDs, by = "playerID") 
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
print(Batting_career_final2)
## # A tibble: 10,504 x 31
##    playerID stint     G    AB     R     H   X2B   X3B    HR   RBI    SB
##    <chr>    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 aaronha…    23  3298 12364  2174  3771   624    98   755  2297   240
##  2 aaronto…     7   437   944   102   216    42     6    13    94     9
##  3 abadan01     3    15    21     1     2     0     0     0     0     0
##  4 abadijo…     3    12    49     4    11     0     0     0     5     1
##  5 abbated…    11   855  3044   355   772    99    43    11   324   142
##  6 abbeych…     5   452  1756   307   493    67    46    19   280    93
##  7 abbotfr…     3   160   513    48   107    21     6     1    49    14
##  8 abbotje…     5   233   596    82   157    33     2    18    83     6
##  9 abbotku…    11   702  2044   273   523   109    23    62   242    22
## 10 abbotod…     1    22    70     2    13     2     1     0     6     3
## # … with 10,494 more rows, and 20 more variables: CS <int>, BB <int>,
## #   SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>,
## #   career_WAR <dbl>, career_oWAR <dbl>, career_dWAR <dbl>,
## #   career_PA <int>, career_Inn <dbl>, career_WAA <dbl>,
## #   career_oWAA <dbl>, career_dWAA <dbl>, career_OPSplus <dbl>,
## #   year_ID <int>, HOF_Eligible <dbl>, Implicated.for.PEDs <int>
# 5 more rows in this data frame for some reason... Further inspection shows that they were duplicated (by a bad join?)

# Extra duplicated rows need to be taken out - I'll do it at the end, but I'll leave the code here for now.
#Batting_career_final2 <- Batting_career_final2[!duplicated(Batting_career_final2[1]),]

# When I joined the tables, I only had the ~200 players suspected of or admitting to using PEDs, so I had a lot of NAs. Those NA values are really 0s - the player was not found to have used PEDs at this time. The code below changes the NAs to 0s, but if I accidentally ran it twice it would flip every 0 to a 1, since nothing is NA on the second pass. A more elegant, rerun-safe option is sketched after the output below.
Batting_career_final2$Implicated.for.PEDs <- ifelse(is.na(Batting_career_final2$Implicated.for.PEDs), 0, 1)
print(Batting_career_final2)
## # A tibble: 10,504 x 31
##    playerID stint     G    AB     R     H   X2B   X3B    HR   RBI    SB
##    <chr>    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 aaronha…    23  3298 12364  2174  3771   624    98   755  2297   240
##  2 aaronto…     7   437   944   102   216    42     6    13    94     9
##  3 abadan01     3    15    21     1     2     0     0     0     0     0
##  4 abadijo…     3    12    49     4    11     0     0     0     5     1
##  5 abbated…    11   855  3044   355   772    99    43    11   324   142
##  6 abbeych…     5   452  1756   307   493    67    46    19   280    93
##  7 abbotfr…     3   160   513    48   107    21     6     1    49    14
##  8 abbotje…     5   233   596    82   157    33     2    18    83     6
##  9 abbotku…    11   702  2044   273   523   109    23    62   242    22
## 10 abbotod…     1    22    70     2    13     2     1     0     6     3
## # … with 10,494 more rows, and 20 more variables: CS <int>, BB <int>,
## #   SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>,
## #   career_WAR <dbl>, career_oWAR <dbl>, career_dWAR <dbl>,
## #   career_PA <int>, career_Inn <dbl>, career_WAA <dbl>,
## #   career_oWAA <dbl>, career_dWAA <dbl>, career_OPSplus <dbl>,
## #   year_ID <int>, HOF_Eligible <dbl>, Implicated.for.PEDs <dbl>
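
The more elegant, rerun-safe option I alluded to above is probably tidyr::replace_na() (or dplyr::coalesce()) - a sketch, assuming the raw column is coded NA/1:

# replace_na only touches the NAs, so running this twice changes nothing
Batting_career_final2 <- Batting_career_final2 %>%
  mutate(Implicated.for.PEDs = replace_na(Implicated.for.PEDs, 0))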

To recap: I joined my baseball-reference data and my Lahman data, then added a retirement year and a binary eligibility variable (1 for eligible, 0 for not). Then I added a PEDs yes/no variable. All that’s left is a few derived statistics that aren’t in the data yet, like batting average and on-base percentage, and of course an indicator variable for induction into the HOF.

# Join in the HOF induction data
data("HallOfFame")
HallOfFame <- HallOfFame %>% select(-c(yearID, needed_note)) %>% filter(category == "Player")

Batting_career_final3 <- Batting_career_final2 %>% 
  left_join(HallOfFame, by = "playerID") %>% 
  # Add career BA, SLG, OBP, OPS
  mutate(BA = round(H/AB, 3), 
         SLG = round(((H-X2B-X3B-HR) + 2*X2B + 3*X3B + 4*HR)/AB, 3),
         OBP = round((H + BB + HBP)/career_PA, 3),
         OPS = OBP + SLG)

# inducted comes through from HallOfFame as a Y/N factor - an "N" just means the player
# appeared on a ballot, so only players with a "Y" row should be coded as inducted
Batting_career_final3$inducted <- ifelse(Batting_career_final3$playerID %in% HallOfFame$playerID[HallOfFame$inducted == "Y"], 1, 0)
Batting_career_final3$votes <- ifelse(is.na(Batting_career_final3$votes), 0, Batting_career_final3$votes)

# Remove duplicate rows
Batting_career_final3 <- Batting_career_final3[!duplicated(Batting_career_final3[1]),]
print(Batting_career_final3)
## # A tibble: 10,499 x 41
##    playerID stint     G    AB     R     H   X2B   X3B    HR   RBI    SB
##    <chr>    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 aaronha…    23  3298 12364  2174  3771   624    98   755  2297   240
##  2 aaronto…     7   437   944   102   216    42     6    13    94     9
##  3 abadan01     3    15    21     1     2     0     0     0     0     0
##  4 abadijo…     3    12    49     4    11     0     0     0     5     1
##  5 abbated…    11   855  3044   355   772    99    43    11   324   142
##  6 abbeych…     5   452  1756   307   493    67    46    19   280    93
##  7 abbotfr…     3   160   513    48   107    21     6     1    49    14
##  8 abbotje…     5   233   596    82   157    33     2    18    83     6
##  9 abbotku…    11   702  2044   273   523   109    23    62   242    22
## 10 abbotod…     1    22    70     2    13     2     1     0     6     3
## # … with 10,489 more rows, and 30 more variables: CS <int>, BB <int>,
## #   SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>,
## #   career_WAR <dbl>, career_oWAR <dbl>, career_dWAR <dbl>,
## #   career_PA <int>, career_Inn <dbl>, career_WAA <dbl>,
## #   career_oWAA <dbl>, career_dWAA <dbl>, career_OPSplus <dbl>,
## #   year_ID <int>, HOF_Eligible <dbl>, Implicated.for.PEDs <dbl>,
## #   votedBy <chr>, ballots <int>, needed <int>, votes <dbl>,
## #   inducted <dbl>, category <fct>, BA <dbl>, SLG <dbl>, OBP <dbl>,
## #   OPS <dbl>
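
One caveat on the OBP above: I divided by career_PA, but the official OBP denominator is AB + BB + HBP + SF, so my version will run slightly low for players with sacrifice bunts or catcher’s-interference plate appearances. A sketch of the textbook formula (OBP_official is a hypothetical column name):

# Official OBP; note SF is NA in Lahman before 1954, so this is NA for older players
Batting_career_final3 <- Batting_career_final3 %>%
  mutate(OBP_official = round((H + BB + HBP) / (AB + BB + HBP + SF), 3))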

I should be ready to fit some preliminary models for batters. My next part of the project is to clean the pitcher data and prepare that for modelling, and get data for the top 7 seasons for each player as an estimate of their “peak” performance.
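
As a teaser, the peak calculation might look something like this with the WAR data I already have loaded - a sketch, where top_n() keeps each player’s seven best single-season WAR values and peak_WAR_7yr is a hypothetical name:

# Sum each non-pitcher's seven best single-season WAR values
peak_WAR <- bWAR_batting %>%
  filter(pitcher == "N") %>%
  group_by(player_ID) %>%
  top_n(7, WAR) %>%
  summarise(peak_WAR_7yr = sum(WAR, na.rm = TRUE))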

# Only fit models for HOF eligible batters 
Eligible_only <- filter(Batting_career_final3, HOF_Eligible == 1)
# Very basic logistic model, only career WAR to predict if they're in the HOF or not
mod_WAR <- glm(inducted ~ career_WAR, data = Eligible_only, family = "binomial")
summary.hc(mod_WAR, expo = TRUE)
## Loading required package: sandwich
## Waiting for profiling to be done...
## $Naive
##            Estimate Std Error       z p-value 95% LCL 95% UCL
## career_WAR   1.1149    0.0046 23.4775       0   1.105  1.1253
## 
## $Robust
##            Estimate Std Error       z p-value 95% LCL 95% UCL
## career_WAR   1.1149    0.0047 23.1542       0  1.1047  1.1252
# Predictions of the model, no validation (yet...)
pred_WAR <- predict(mod_WAR)
# Finds AUC of the super duper basic model
auc_WAR <- cvAUC(predictions = pred_WAR, labels = Eligible_only$inducted)
print(auc_WAR$fold.AUC)
## [1] 0.8652139
# And a plot of the ROC curve for good measure
plot(auc_WAR$perf)

# The estimated increase in the odds of induction from a 1 SD increase in career WAR
lincom(mod_WAR, c(0, sd(Eligible_only$career_WAR)), expo = TRUE)
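
For when I get around to the “(yet…)” above, here’s roughly what cross-validating that AUC could look like - a sketch assuming 5 folds, with fold_id, cv_pred, and fit_i as hypothetical names:

# 5-fold CV: fit on 4 folds, predict the held-out fold, then feed fold ids to cvAUC
set.seed(2019)
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(Eligible_only)))
cv_pred <- numeric(nrow(Eligible_only))
for (i in 1:k) {
  fit_i <- glm(inducted ~ career_WAR, data = Eligible_only[fold_id != i, ], family = "binomial")
  cv_pred[fold_id == i] <- predict(fit_i, newdata = Eligible_only[fold_id == i, ])
}
cv_auc_WAR <- cvAUC(predictions = cv_pred, labels = Eligible_only$inducted, folds = fold_id)
cv_auc_WAR$cvAUC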

So above is a super basic logistic model, with only one predictor variable: WAR. WAR is Wins Above Replacement, a measure of how much better a player is than a replacement-level player - the sort of player any team could acquire for essentially nothing. Note that this is different from WAA, or Wins Above Average, which measures how much better the player is than a league-average player at that position. Both WAR and WAA control for things like the era the player played in, the league (National League vs. American League, since the AL has the designated hitter), the player’s position, and the park the player plays in, because some parks are more hitter- or pitcher-friendly.

My conclusion from that model is that career WAR is a pretty strong differentiator, since the AUC is 0.86: given a randomly chosen inducted player and a randomly chosen non-inducted player, knowing only their career WAR gives about an 86% chance of ranking them correctly. For a 1-unit increase in career WAR, the odds of being inducted into the HOF increase by about 11.5% (p < 0.0001). For a 1 standard deviation increase in career WAR (about 16 career WAR), the odds of being inducted into the Hall of Fame are about 9.4 times higher (95% CI: 7.76, 11.33).
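
As a quick sanity check on that last number, the 1 SD odds ratio is just the slope scaled by the SD and exponentiated:

# Should reproduce the lincom estimate above
exp(coef(mod_WAR)["career_WAR"] * sd(Eligible_only$career_WAR))
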
This is the end of Part 1 of the Baseball Data Science project. It took the better part of 3 months, although my first attempt was not good programming practice and did not use R Markdown. Part 2 will be more data manipulation to derive some other important statistics, plus a first pass at my first and second research questions: which methods are best at predicting the likelihood a player makes the Hall of Fame, and which features are common in the typical Hall of Famer? I’ll fit a bunch of different models to see what I can learn, and do my best to interpret them, if they’re interpretable at all. It should be pretty fun! I’m hoping to have Part 2 done by the end of the summer so I can get feedback from my professors before the school year gets busy and they’re completely swamped.