Hey guys and gals, I have been working on this project for a couple of weeks for my Regression Analysis course (part of my stats major) and thought you might like to take a look at it (hopefully not to tear it to shreds!). We were tasked with finding and using a data set that had at least 5 predictors, including one qualitative predictor. So I decided that if I was going to spend the time on this anyway, I might as well make it something somewhat interesting, and so I picked 18 predictors of "normal" stats in hopes of predicting wins.
I only sampled the team stats and team wins for the 2011 season in these models.
I tried to choose some normal stats and then grab a few more advanced stats to test in a few models, in hopes that they would do a better job of predicting. Thus, I chose predictors consisting of: Runs, AVG, OBP, SLG (Fangraphs), RBI, BABIP, SPD (Fangraphs), wRC (Fangraphs), ERA, FIP, K/9, BB/9, HR/9, AVG Against, Fielding %, UZR and League.
First, I looked at the correlation matrix and scatterplot matrices to get a basic idea of which of these predictors did the best job of predicting wins. I thought about including the full correlation matrix but didn't think that people would appreciate looking at a 324-cell matrix. If you are interested in the collinearity of these predictors you can see a screengrab here:
So I just included the correlation of these predictors for wins below:
Note: The dummy variable for league assigns 1 if NL.
Overall I found the results mildly surprising. As expected, things like BABIP, SPD and league have very little correlation with wins; however, I was a little surprised that BB/9, ERA and AVG against were the strongest single predictors of 2011 wins. That said, comparing single-predictor correlation values only goes so far, as I hoped to build a model that would give me a moderate ability to predict wins going forward.
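For anyone who wants to replicate the screening step, here is a minimal Python sketch (I did my actual work in R). The toy table and column names are made up stand-ins for the 30-team 2011 data, not my real file; it just shows the correlation-with-wins column and the NL dummy coding described above.

```python
# Hedged sketch: correlations with Wins plus the League dummy (1 if NL).
# The six rows of data below are invented examples, not real 2011 teams.
import pandas as pd

teams = pd.DataFrame({
    "Wins":   [97, 90, 73, 102, 71, 80],
    "OBP":    [.340, .322, .309, .349, .302, .326],
    "BB9":    [2.9, 3.1, 3.6, 2.7, 3.9, 3.4],
    "League": ["NL", "AL", "NL", "AL", "NL", "NL"],
})

# Dummy-code League exactly as in the note above: 1 if NL, 0 otherwise.
teams["League"] = (teams["League"] == "NL").astype(int)

# One column of the full correlation matrix: each predictor vs. Wins.
win_corr = teams.corr(numeric_only=True)["Wins"].drop("Wins")
print(win_corr.round(3))
```

With real data you would read the CSV instead of hard-coding rows, but the dummy coding and the single column pulled from `.corr()` work the same way.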
I next tried to fit my own models based on various assumptions I had about baseball, to little avail. I was consistently getting adjusted R^2 values of ~.7, which is alright, but I hoped I could do better. I tried to use my elementary knowledge of the Akaike Information Criterion and the Bayesian Information Criterion in hopes of choosing the best model out of the predictors I had.
For a few reasons I chose to use the AIC, and in R used the stepAIC function (from the MASS package) to wade through the mundane calculations and find the best model for this data. I ignored interaction terms, as in most cases with this data they would not be interpretable. (What would OBP*SLG really tell you?) For those wondering, the AIC equation is:
AIC=n*ln(SSE)-n*ln(n) + 2p
where we are looking for the smallest value, since we want the fewest predictors while still keeping a small sum of squared errors (SSE).
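The formula above is easy to sanity-check in code. This sketch just implements that exact equation and compares two hypothetical 30-team fits; the SSE values are made up for illustration:

```python
# The AIC form quoted above: AIC = n*ln(SSE) - n*ln(n) + 2p.
# (Equivalent, up to a constant, to the common n*ln(SSE/n) + 2p form.)
import math

def aic(sse: float, n: int, p: int) -> float:
    """Smaller is better: the 2p term penalizes each extra predictor."""
    return n * math.log(sse) - n * math.log(n) + 2 * p

# Two hypothetical fits on n = 30 teams (SSE values invented):
full  = aic(sse=900.0, n=30, p=18)  # 18 predictors, slightly smaller SSE
small = aic(sse=950.0, n=30, p=6)   # 6 predictors, slightly larger SSE
```

Here the 6-predictor model wins: its small loss in SSE is more than paid for by dropping twelve predictors, which is exactly the trade-off stepAIC is automating.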
Eventually I came up with the equation:
Wins= 9.85428 + 514.44222(OBP) + 530.83208(SLG) + 0.17120(RBI) - 0.33873(wRC) -15.68742 (BB/9) -537.70089 (AVG Against)
This model had an R^2 of 0.8578 and an adjusted R^2 of 0.8207. None of the values are particularly surprising outside of the negative slope on wRC and the fact that it is important enough to include in the model. I would be interested in hearing your opinions on why that is.
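As a quick sanity check on the fit, the two reported numbers are consistent with each other: with n = 30 teams and p = 6 predictors, the standard adjusted R^2 formula recovers 0.8207 from 0.8578.

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2: float, n: int, p: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj = adjusted_r2(r2=0.8578, n=30, p=6)
print(round(adj, 4))  # → 0.8207
```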
This means that this model does a moderately strong job of using 2011 stats to predict 2011 wins. Extrapolating it to 2012, using 2012 stats through ~25 games to predict 2012 wins, is horrible statistics, but I was curious and we will see how it holds up!
Since wRC and RBI are counting stats, I will pro-rate them to a full season.
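The pro-rating step is just scaling a partial-season total up to a 162-game pace. A minimal sketch, with a made-up RBI total rather than the actual Rockies figure:

```python
# Scale a counting stat accumulated through `games_played` games
# up to a full-season (162-game) pace. Inputs here are illustrative.
def prorate(count: float, games_played: int, season_games: int = 162) -> float:
    return count * season_games / games_played

# e.g. a hypothetical 128 team RBI through 25 games:
full_season_rbi = prorate(128, games_played=25)
```

Rate stats like OBP, SLG, BB/9 and AVG against need no scaling, which is why only wRC and RBI get this treatment.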
9.85428 + 514.44222(.326) + 530.83208(.455) + 0.17120(830.25) - 0.33873(816.75) -15.68742 (3.38) -537.70089 (.281)
Projected wins for the 2012 Rockies: 80.45468
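The plug-in above is easy to verify: evaluating the fitted equation at the pro-rated 2012 Rockies inputs reproduces the projected win total.

```python
# Coefficients from the stepAIC model and the 2012 Rockies inputs above.
coefs = {
    "OBP":  514.44222, "SLG": 530.83208, "RBI": 0.17120,
    "wRC": -0.33873,   "BB9": -15.68742, "AVG_against": -537.70089,
}
rockies_2012 = {"OBP": .326, "SLG": .455, "RBI": 830.25,
                "wRC": 816.75, "BB9": 3.38, "AVG_against": .281}

wins = 9.85428 + sum(coefs[k] * v for k, v in rockies_2012.items())
print(round(wins, 5))  # → 80.45468
```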
Overall this doesn't tell us a lot going forward, but I thought it would at least be interesting to use the 2011 model on 2012 stats. However, like most teams, the Rockies' peripherals are matching up with their current record. Hopefully the Rockies exceed this expectation.
In conclusion, I hope to use this model going forward to see if it has continued viability, as well as start adding more predictors into the AIC and BIC searches to see which ones stick. By adding more predictors and more years, hopefully we will be able to better predict wins going forward. I am interested in hearing your thoughts on this!