Hey everyone,
This is a very long fanpost, so if you just want the goodies (whatever that means), skip to the end. If you want to read and learn about some statistics and projecting our players, then make a sandwich, open a soda/beer/juicebox, and continue after the jump off.
So at some point last week, Andrew Fisher and I had a brief exchange in the comments about the number of home runs Troy Tulowitzki could hit next season (part of a larger series of comments discussing whether Tulo could in fact approach 40 home runs). Basically, I said I expect about 35 home runs from Tulo if he's healthy, and then I wondered what the standard deviation of home runs would be for power hitters (which I defined as any batter who qualified for the batting title and had an ISO of .200 or greater*), to determine whether 40 home runs for our shortstop is feasible or would take a monumental amount of luck. ATF responded that this sounded like a good idea for a fanpost, so here it is.
*ISO is computed by subtracting batting average from slugging percentage (SLG - BA)
So to be clear, this exercise is less about hyping up Tulo, and more about trying to get a good grasp on what can be expected for next season using actual data.
Soooooo, the first thing I did was go to FanGraphs and, using their supercoolfantastic 'export to excel' function, export all the data on their dashboard page for every batter who qualified for the batting title over the past 3 seasons (if this is confusing to you, and it probably is, because what the hell is a dashboard page anyway? go here).
First off, why 3 seasons? Because bigger sample sizes are always better, and originally I wanted to look at guys with an ISO over .200, which would be a very small sample in one season (about 50 people).
Next question then, why only 3 seasons? Because I made an arbitrary cutoff point, and felt that beyond 3 years (though technically it should be 4), there's still a lot of what I consider to be PED stuff in the numbers (which is why, for the earlier part of the 2000s, there were usually something like 9 guys who hit at least 40 home runs, but since the Mitchell report came out, there's been more like 3 guys a year). So maybe it's not PEDs at all, but since I noticed there were generally lower home run totals around the league post-Mitchell report, I decided not to go back too far for this exercise (but let's not make this a discussion about PEDs because those discussions are generally boring and generally suck).
So once these three seasons were collected, I imported all the data into SPSS and got ready to run some analyses! The first thing I did was remove everyone who had an ISO less than .200. Remember what I said about large samples being better? I guess I lied, because this resulted in taking out about 300 batters (and my apologies to all 7 of you scrubs who had .199 ISOs, better luck next year!).
Next, I created an incredibly rudimentary model: a linear regression with home runs as the dependent variable, and games played and ISO as the independent variables. I ran a two-step regression, with the first step having just the number of games predicting home runs, and the second step having both the number of games and ISO, so we could get a better grasp on what ISO adds to the model that games alone can't explain.
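Written out, the two steps look roughly like this. The symbols b0, b1, b2 and e are just notation introduced here for the intercept, the slopes, and the leftover error; they are not taken from the SPSS output.

```latex
\begin{align*}
\text{Step 1:}\quad \mathrm{HR}_i &= b_0 + b_1\,G_i + e_i \\
\text{Step 2:}\quad \mathrm{HR}_i &= b_0 + b_1\,G_i + b_2\,\mathrm{ISO}_i + e_i
\end{align*}
```

The gain in R Square from step 1 to step 2 is the part of the home run variance that ISO explains and games alone can't.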
Here's what I got:
Descriptive Statistics

          Mean      Std. Deviation   N
HR        30.21     6.144            151
G         147.11    11.236           151
ISOnew    239.25    28.067           151

Correlations

                               HR       G        ISOnew
Pearson Correlation   HR       1.000    .455     .805
                      G        .455     1.000    .058
                      ISOnew   .805     .058     1.000
Sig. (1-tailed)       HR       .        .000     .000
                      G        .000     .        .241
                      ISOnew   .000     .241     .
N                     HR       151      151      151
                      G        151      151      151
                      ISOnew   151      151      151

So first off, we see that the average number of home runs for guys with .200 ISOs or greater is 30, and the average ISO is .239. The standard deviation is 6, so essentially, if a dude with 30-HR power were to play the 2011 season 100 times, about 95 of those times his home run total would land within two standard deviations of 30, or between 18 and 42 (this all isn't exact but whatever, I'm rolling with it).
We also see some correlations, specifically that ISO and HR's have a correlation of .81, games and HR's have a correlation of .455, and lastly, games and ISO have a correlation of .058 (so basically, no correlation). All of these numbers are pretty much exactly what we'd expect them to be (though that games and HR correlation is a little high).
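With only two predictors, those three correlations are enough to work out the standardized weights and the R Square of a regression of home runs on games and ISO. The sketch below uses the textbook two-predictor formulas; the function name is made up for illustration, and the inputs are the rounded correlations from the table, so the results should only match the SPSS output to within rounding.

```python
def two_predictor_regression(r_y1, r_y2, r_12):
    """Standardized weights and R^2 for y ~ x1 + x2, from pairwise correlations.

    r_y1, r_y2: correlation of the outcome with each predictor.
    r_12: correlation between the two predictors.
    """
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared

# Rounded correlations from the table: HR-G, HR-ISO, G-ISO.
beta_g, beta_iso, r2_both = two_predictor_regression(0.455, 0.805, 0.058)

# With a single predictor, R^2 is just the squared correlation.
r2_games_only = 0.455 ** 2

print(round(beta_g, 3), round(beta_iso, 3))
print(round(r2_games_only, 3), round(r2_both, 3))
```

This gives weights of about .410 for games and .781 for ISO, and R Square values of about .207 (games only) and .815 (games plus ISO). Because games and ISO are nearly uncorrelated, each weight ends up close to that variable's raw correlation with home runs.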
Next is the regression output:
Model Summary (c)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .455 (a)   .207       .202                5.490
2       .903 (b)   .816       .813                2.655

Change Statistics

Model   R Square Change   F Change   df1   df2   Sig. F Change
1       .207              38.930     1     149   .000
2       .609              489.168    1     148   .000

a. Predictors: (Constant), G
b. Predictors: (Constant), G, ISOnew
c. Dependent Variable: HR

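The F Change figures can be checked by hand from the R Square values and degrees of freedom in the Model Summary. This is a minimal sketch using the standard F-change formula; the function name is hypothetical, and since the inputs are the rounded values printed by SPSS, the results only agree with the table approximately.

```python
def f_change(r2_change, r2_full, df1, df2):
    """F statistic for the R^2 gained by adding df1 predictors.

    r2_full is the R^2 of the model after the predictors are added,
    and df2 is that model's residual degrees of freedom.
    """
    return (r2_change / df1) / ((1.0 - r2_full) / df2)

# Step 1: games only (the "change" is measured from an empty model).
f_step1 = f_change(0.207, 0.207, 1, 149)

# Step 2: ISO added on top of games.
f_step2 = f_change(0.609, 0.816, 1, 148)

print(round(f_step1, 1), round(f_step2, 1))
```

The two values come out near 38.9 and 489.8, against 38.930 and 489.168 in the table.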
Coefficients (a)

                       Unstandardized Coefficients   Standardized Coefficients
Model                  B          Std. Error         Beta                        t          Sig.
1       (Constant)     -6.404     5.885                                          -1.088     .278
        G              .249       .040               .455                        6.239      .000
2       (Constant)     -43.718    3.309                                          -13.214    .000
        G              .224       .019               .410                        11.609     .000
        ISOnew         .171       .008               .781                        22.117     .000

a. Dependent Variable: HR

So among all of this, two things jump out. The first is the adjusted R Square statistic of .813 for model 2: by including games and ISO, this model accounts for 81% of the variance in home run totals (not too shabby!). The second is in the Coefficients table, where the unstandardized coefficient (B) is .224 for games and .171 for ISO. Essentially, this means that when controlling for games, a one point increase in ISO (.001, since ISOnew is ISO times 1000) leads to a .171 increase in home runs. One thing looks like a problem: games apparently has a bigger coefficient (.224) than ISO (.171). But B is measured in each variable's own units (one game vs. one point of ISO), so the two can't be compared directly. The standardized Beta column is the fair comparison, and there ISO (.781) carries more weight than games (.410). The model including the total sample of batters that qualified for the batting title shows the same thing.
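One way to put the games and ISO coefficients on the same footing is to convert each unstandardized B into a standardized Beta by rescaling with the standard deviations from the Descriptive Statistics table. The helper name below is made up for illustration; the numbers are the rounded values from the SPSS output.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized slope into a standardized one."""
    return b * sd_x / sd_y

# Standard deviations from the Descriptive Statistics table.
SD_HR, SD_G, SD_ISO = 6.144, 11.236, 28.067

# Unstandardized B values from model 2.
beta_g = standardized_beta(0.224, SD_G, SD_HR)
beta_iso = standardized_beta(0.171, SD_ISO, SD_HR)

print(round(beta_g, 3), round(beta_iso, 3))
```

This reproduces the Beta column (about .410 for games and .781 for ISO): a one standard deviation swing in ISO moves home runs roughly twice as much as a one standard deviation swing in games.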
So now for the fun part, what I then did was estimate how many home runs Troy Thomas Tulowitzki would hit next season (Editor's note: Thomas is probably not his real middle name, but was essential for alliteration). To do this, I utilized some of the awesome work by RhodeIslandRoxFan which can be found here. RIRF pointed out that Tulo has been a different hitter since he changed his stance, so rather than use his career ISO, I used the ISO that he's put up since he changed the stance.
The Goodies (I promised them didn't I?)
Troy had a .298 ISO over the last 432 at bats in 2009, and a .253 ISO in 2010. I weighted them, and since his stance change, Tulo has had an ISO of .262. Then I entered into my data that Tulo would play 155 games next season with a .262 ISO, and remember my prediction of 35 HRs? Well, the machine spit out a projected 34.8 home runs for 2011 (that's right ladies, I'm 6'2", shock of blonde hair, and can do regressions effortlessly in my head***). Coupling this number of 34.8 with the standard deviation of ~6, Tulowitzki could hit 41 home runs next year and still be within about 1 standard deviation, which is not too hard to do statistically. The converse of course is that he can just as easily hit 29 home runs.
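For anyone who wants to plug in their own numbers, here is a rough sketch of the model 2 equation using the rounded coefficients from the Coefficients table. Two assumptions are baked in: the intercept is taken as negative (the only sign that makes the equation return roughly the sample average of 30 home runs at the sample-average games and ISO), and ISO is multiplied by 1000 to match the ISOnew scale. The function name is hypothetical.

```python
# Rounded model 2 coefficients; the negative intercept is an assumption
# based on the sample means, as explained above.
INTERCEPT = -43.718
B_GAMES = 0.224
B_ISO = 0.171  # per point of ISO, i.e. per .001

def project_hr(games, iso):
    """Projected home runs for a given games total and ISO (e.g. 0.262)."""
    iso_points = iso * 1000  # match the ISOnew scale used in the regression
    return INTERCEPT + B_GAMES * games + B_ISO * iso_points

# Sanity check: the sample-average hitter (147.11 games, .239 ISO).
average_hitter = project_hr(147.11, 0.23925)

# Tulo at 155 games and a .262 ISO.
tulo = project_hr(155, 0.262)

print(round(average_hitter, 1), round(tulo, 1))
```

The sample-average hitter comes out at about 30.1, close to the 30.21 mean. With these rounded coefficients, Tulo at 155 games and a .262 ISO comes out near 35.8 rather than 34.8; it isn't clear from the numbers shown where that one-homer gap comes from, so treat this as an illustration of the equation, not a reproduction of the SPSS run.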
*** nothing in this sentence is factually accurate.
Some odds and ends and caveats.
1. I should have probably used at-bats or plate appearances rather than mere games. However, as these are all batters that qualified for the batting title, they were full-time starters who didn't have a lot of wacky pinch-hit appearances.
2. Home Runs and ISO are incredibly correlated. When I ran these analyses using the full data set (not just over .200 ISO's), home runs and ISO had a .954 correlation, suggesting they are redundant. Ideally, you would not use ISO to then predict home runs because they are essentially the same thing. Oh well.
3. However, in the second model, ISO had a greater beta weight than games, which is what we would expect, so that's good.
4. The model, like any true model, starts to break down at the extremes. For instance, a batter who hit 1 home run was then projected by the model to hit 3. I'm not exactly sure where the range of the model ends, but I'd cut it off at ISOs under .075 since few batters have that number anyway.
5. Some of you may be wondering about park effects. I decided that for Tulo (and most players), the effects of Coors will be captured in the ISO number. Someone like Jose Lopez, however, will be much trickier to peg down.
Even More Goodies
With the caveats in mind, I thought I could make an exercise out of this for all of the people at Purple Row. Since we're much more familiar with our team, I figured we could do our own fan scouting report. So what I'm suggesting is, in addition to whatever discussion ensues, people can post what they think the number of games played and the ISO a person will put up next season for our starting 8. I'll average those numbers then put them into the equation and see what it spits out. Then we can track what actually happens at the end of next season. Of course, if you're going to submit any numbers, don't post what a person could do, but rather what you think they will do. For instance, Ian Stewart could have an ISO of .250, but I personally think he'll have one around .200 next season.
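The averaging step could look something like the sketch below. It reuses the rounded model 2 coefficients, again taking the intercept as negative (an assumption based on the sample means), and the guesses in the example are made up purely to show the mechanics; they are not real submissions. The function name is hypothetical.

```python
from statistics import mean

# Rounded model 2 coefficients; negative intercept is an assumption.
INTERCEPT, B_GAMES, B_ISO = -43.718, 0.224, 0.171

def crowd_projection(guesses):
    """Average (games, iso) guesses for one player, then project home runs."""
    avg_games = mean(g for g, _ in guesses)
    avg_iso = mean(i for _, i in guesses)
    hr = INTERCEPT + B_GAMES * avg_games + B_ISO * avg_iso * 1000
    return avg_games, avg_iso, hr

# Made-up guesses for illustration only: (games played, ISO).
example_guesses = [(150, 0.200), (140, 0.210), (145, 0.190)]
avg_games, avg_iso, hr = crowd_projection(example_guesses)

print(avg_games, round(avg_iso, 3), round(hr, 1))
```

Those three made-up guesses average out to 145 games and a .200 ISO, which the equation turns into roughly 23 home runs.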
So discuss it all and let me know. If you have questions about my shaky methodology, want to suggest different variables to use to predict home runs, or anything else, please let me know. If the feedback is good and people are giving predictions, look for another fan post in a week or two compiling everyone's thoughts.