Predicting wins with basic stats

Hey guys and gals, I have been working on this project for a couple of weeks for my Regression Analysis class (I'm a stats major) and thought you would like to take a look at it (hopefully not to tear it to shreds!). We were tasked with finding and using a data set that had at least 5 predictors, including one qualitative predictor. So I decided that if I was going to spend my time on this anyway, I might as well make it somewhat interesting, and I picked 17 predictors of "normal" stats in hopes of predicting wins.

I only sampled the team stats and team wins for the 2011 season in these models.

I tried to choose some normal stats and then grab a few more advanced stats to test in a few models, in hopes that they would do a better job of predicting. Thus, I chose 17 predictors: Runs, AVG, OBP, SLG (Fangraphs), RBI, BABIP, SPD (Fangraphs), wRC (Fangraphs), ERA, FIP, K/9, BB/9, HR/9, AVG Against, Fielding %, UZR and League.

First, I looked at the correlation matrix and the scatterplot matrices to get a basic idea of which of these predictors do the best job of predicting wins. I thought about including the full correlation matrix but didn't think that people would appreciate looking at a 324-cell matrix. If you are interested in the collinearity of these predictors you can see a screengrab here:

So I just included the correlation of each predictor with wins below:

Runs 0.44564366
AVG 0.19276043
OBP 0.35779569
SLG 0.45215204
RBI 0.46950504
BABIP -0.08159882
SPD 0.03686307
wRC 0.41034588
ERA -0.62198238
FIP -0.55594998
K.9 0.31280869
BB.9 -0.62168528
HR.9 -0.46516425
AVG.against -0.59902303
Fielding 0.35560152
UZR 0.35130232
League -0.12451507
Wins 1.00000000

Note: The dummy variable for league assigns 1 if NL.
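For anyone who wants to reproduce a column like the one above, here is a minimal sketch of the Pearson correlation in plain Python. The five "teams" are made-up toy numbers, not the 2011 data, and the function name pearson_r is just my own label. The league list shows the dummy coding described in the note (1 if NL, 0 if AL).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT the 2011 numbers): five made-up teams.
era = [3.20, 3.75, 4.10, 4.40, 4.90]
wins = [97, 90, 82, 75, 68]
league = [1, 0, 1, 1, 0]  # dummy variable: 1 if NL, 0 if AL

r_era = pearson_r(era, wins)
r_league = pearson_r(league, wins)
print(f"ERA vs wins:    {r_era:.4f}")
print(f"League vs wins: {r_league:.4f}")
```

In this toy set ERA rises as wins fall, so the first correlation comes out negative, the same sign as in the real table.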

Overall I found the results mildly surprising. As expected, things like BABIP, SPD and League have very little correlation with wins. However, I was a little surprised that BB/9, ERA and AVG Against had the strongest correlations with 2011 wins. Still, comparing single-predictor correlations has limited value, since what I wanted was a model that would give me a moderate ability to predict wins going forward.

I next tried to fit my own models based on various assumptions I had about baseball, to little avail. I was consistently getting adjusted R^2 values of ~.7, which is alright, but I hoped I could do better. So I tried to use my elementary knowledge of the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to choose the best model out of the predictors I had.

For a few reasons I chose to use the AIC, and in R I used the stepAIC function to wade through the mundane calculations and find the best model for this data. I ignored interaction terms, as in most cases with this data they would not be interpretable. (Really, what would OBP*SLG tell you?) For those wondering, the AIC equation is:

AIC=n*ln(SSE)-n*ln(n) + 2p

where we are looking for the smallest value, since we would like to use the fewest predictors while still having a small sum of squared errors.
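To make the trade-off concrete, here is a small sketch of that formula in Python. The SSE values and parameter counts are hypothetical, picked only to show the penalty at work, and n = 30 is my assumption (one row per MLB team). This is not what stepAIC does internally, just the formula above evaluated by hand.

```python
from math import log


def aic(sse, n, p):
    """AIC as written in the post: n*ln(SSE) - n*ln(n) + 2p."""
    return n * log(sse) - n * log(n) + 2 * p


n = 30  # assumption: one observation per MLB team

# Hypothetical models: the bigger one fits slightly better (lower SSE)
# but spends six more parameters to get there.
small_model = aic(sse=1200.0, n=n, p=4)
big_model = aic(sse=1150.0, n=n, p=10)

print(f"small model AIC: {small_model:.2f}")
print(f"big model AIC:   {big_model:.2f}")
print("prefer:", "small" if small_model < big_model else "big")
```

With these made-up numbers the small drop in SSE does not pay for the extra parameters, so the smaller model has the lower AIC.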

Eventually I came up with the equation:

Wins = 9.85428 + 514.44222(OBP) + 530.83208(SLG) + 0.17120(RBI) - 0.33873(wRC) - 15.68742(BB/9) - 537.70089(AVG Against)

This model had an R^2 of 0.8578 and an adjusted R^2 of 0.8207. None of the coefficients is particularly surprising, apart from the negative slope on wRC and the fact that wRC is important enough to be included in the model. I would be interested in hearing your opinions on why this is.
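As a quick sanity check on those two numbers, the usual adjusted R^2 formula can be worked by hand. This sketch assumes n = 30 observations (one per team, which the post does not state outright) and k = 6 predictors, as in the equation above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fit on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R^2 from the post; n = 30 teams is an assumption, k = 6 predictors.
adj = adjusted_r2(0.8578, n=30, k=6)
print(round(adj, 4))
```

Under those assumptions the result rounds to 0.8207, which agrees with the adjusted value reported above.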

This means that the model does a moderately strong job of explaining 2011 wins using 2011 stats. Applying it to 2012 is extrapolation, but with ~25 games played I wanted to plug 2012 stats into it and project 2012 wins. This is horrible statistics, but I was curious and we will see how it holds up!

Since there are two counting stats, wRC and RBI, I will pro-rate them over the entire season.
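Pro-rating just scales a running total up to 162 games. A minimal sketch is below. The raw inputs (123 RBI and 121 wRC through 24 games) are my own back-calculated guesses that happen to reproduce the full-season figures used in the Rockies calculation. They are not numbers stated in this post, and the function name prorate is hypothetical.

```python
def prorate(count, games_played, season_games=162):
    """Scale a counting stat from games_played up to a full season."""
    return count * season_games / games_played


# Assumed inputs (back-calculated, not from the post): totals through 24 games.
rbi_full = prorate(123, games_played=24)
wrc_full = prorate(121, games_played=24)

print(rbi_full)  # full-season RBI pace
print(wrc_full)  # full-season wRC pace
```

Rate stats like OBP, SLG, BB/9 and AVG Against need no scaling, which is why only these two get this treatment.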

Colorado Rockies:

9.85428 + 514.44222(.326) + 530.83208(.455) + 0.17120(830.25) - 0.33873(816.75) -15.68742 (3.38) -537.70089 (.281)

Projected wins for the 2012 Rockies: 80.45468
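If you want to plug in your own team, the arithmetic above is easy to script. This sketch hard-codes the fitted coefficients and the Rockies inputs exactly as they appear in the post. The dictionary keys and the function name predict_wins are just my own labels.

```python
# Fitted coefficients from the model above (2011 team data).
INTERCEPT = 9.85428
COEFFICIENTS = {
    "OBP": 514.44222,
    "SLG": 530.83208,
    "RBI": 0.17120,
    "wRC": -0.33873,
    "BB/9": -15.68742,
    "AVG Against": -537.70089,
}


def predict_wins(stats):
    """Evaluate the linear model for one team's stat line."""
    return INTERCEPT + sum(COEFFICIENTS[name] * stats[name] for name in COEFFICIENTS)


# 2012 Rockies inputs as used in the post (RBI and wRC already pro-rated).
rockies_2012 = {
    "OBP": 0.326,
    "SLG": 0.455,
    "RBI": 830.25,
    "wRC": 816.75,
    "BB/9": 3.38,
    "AVG Against": 0.281,
}

projected = predict_wins(rockies_2012)
print(f"Projected wins: {projected:.5f}")
```

Swapping in another team's six numbers gives that team's projection, with the same caveat that this is a 2011 fit applied to a small 2012 sample.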

Overall this doesn't tell us a lot going forward, but I thought it would at least be interesting to apply the model fit on 2011 stats to 2012. Like most teams, the Rockies' peripherals are matching up with their current record. Hopefully the Rockies exceed this expectation.

In conclusion, I hope to keep using this model going forward to see if it has continued viability, and to start adding more predictors to the AIC and BIC selection to see which ones stick. By adding more predictors and more years, hopefully we will be able to better predict wins. I am interested in hearing your thoughts on this!

Eat. Drink. Be Merry. But the above FanPost does not necessarily reflect the attitudes, opinions, or views of Purple Row's staff (unless, of course, it's written by the staff [and even then, it still might not]).
