Any study on the reliability of pitching stats is by default a paper on DIPS. When Voros McCracken wrote the original DIPS paper, he looked at the simple fact that while strikeout rates, walk rates, HR rates, and HBP rates were fairly consistent from year to year, BABIP (batting average on balls in play) was not. Voros, in technical terms, was measuring test-retest reliability. In the few years since his discovery, several folks have picked apart his findings and his basic premise that a pitcher has great control (as opposed to luck) over things that happen without the involvement of the defense (hence “Defense Independent Pitching Statistics”), but little control over what happens when the ball is actually in play. Despite the large amount of work spent on attempting to disprove the theory, it has generally stood up to the tests thrown at it.
A little while ago, I looked at the reliability of batting statistics using split-half reliability. Basically, I coded each plate appearance as even-numbered or odd-numbered and separated the two groups accordingly. I looked at stats like OBP in a player’s even-numbered plate appearances and his odd-numbered plate appearances. If OBP is reliable, then OBP in even-numbered plate appearances and OBP in odd-numbered plate appearances will correlate well with each other. Also, I looked to see when a stat became reliable enough to be useful in analyses, using a standard cutoff of a split-half reliability of .70. The more plate appearances (or for pitchers, batters faced) a player accumulates, the better idea we have of his true talent level, and the more reliable (reproducible) a stat will be over the same time frame in the future. If a batter hit 10 HR in 400 PA, and HR/PA was very reliable, we’d guess that he’s going to hit 10 HR or so in his next 400 PA. If it’s not reliable, the fact that he hit 10 HR in his last 400 PA means nothing in terms of predicting his next 400 PA.
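For readers who want to see the even/odd idea in concrete terms, here is a minimal Python sketch. It is not the code used for this study: the function names (`pearson_r`, `split_half_r`) and the simulated "league" at the bottom are assumptions made purely for illustration, and each plate appearance is reduced to a 1 (the event happened) or a 0 (it didn't).

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_r(players):
    """players: one list of 0/1 outcomes per player, in chronological order.

    Returns the correlation, across players, between each player's rate in
    odd-numbered plate appearances and his rate in even-numbered ones.
    """
    odd_rates, even_rates = [], []
    for outcomes in players:
        odd_half = outcomes[0::2]    # 1st, 3rd, 5th, ... plate appearance
        even_half = outcomes[1::2]   # 2nd, 4th, 6th, ... plate appearance
        odd_rates.append(sum(odd_half) / len(odd_half))
        even_rates.append(sum(even_half) / len(even_half))
    return pearson_r(odd_rates, even_rates)


if __name__ == "__main__":
    # Hypothetical league: 200 hitters whose "true" on-base talent varies,
    # each observed for 400 simulated plate appearances.
    rng = random.Random(0)
    talents = [rng.uniform(0.25, 0.40) for _ in range(200)]
    league = [[1 if rng.random() < p else 0 for _ in range(400)]
              for p in talents]
    print(round(split_half_r(league), 3))
```

A value near 1 means the two halves tell the same story about each player; a value near 0 means one half says almost nothing about the other.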
Now I turn my attention to pitching stats. I’ve hidden the numerical spaghetti behind the cut, and if you want to read it, it’s all there for you. I’ve used a method similar to my previous article, in that I look at the issue in three ways.
- First, I looked to answer the question of what the minimum number of PA or BF should be used in research studies. That is, suppose I want to do work on what strikeout rates predict to. Usually, I would say something like “all pitchers with a minimum of X batters faced” so that the “cup of coffee” call up guys won’t contaminate the sample. What should that minimum number be? To do this, I found the number of BF where the split-half correlation for the sample composed of that minimum number was at least .70.
- The second method is a look at how long until a stat becomes meaningful for a particular player. If Tuffy Rhodes hits 3 HR on Opening Day, we know that’s not a big enough sample size to tell us any meaningful information about him. However, after a few hundred PAs, we can probably make some pretty good conclusions about his ability to hit for power. I went in 50 PA intervals to test this. For example, to test whether a stat was reliable at 50 PAs, I took a player’s first 100 PAs in the database (two 50-PA samples) and calculated whatever stats were of interest from the even-numbered PAs and the odd-numbered PAs. The number where the stat crossed a split-half reliability of .70 was where it became officially certified as appropriately reliable.
- Finally, I looked at what the split-half correlations were for the pitching stats under observation at 300 BF and 750 BF. Using those as rough cutoffs, this gives us an idea of how reliable stats are for a reliever and a starter, respectively.
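The second approach, stepping up in 50 BF intervals until a stat crosses .70, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual analysis code: `reliability_at`, `first_reliable_sample`, and the simulated pitchers are hypothetical, and each batter faced is reduced to a 1 (say, a strikeout) or a 0.

```python
import random
from math import sqrt

CUTOFF = 0.70  # the reliability threshold used throughout this article


def pearson_r(xs, ys):
    """Pearson correlation, or None if it cannot be computed."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def reliability_at(pitchers, n):
    """Split-half r from each pitcher's first 2*n events (two n-event halves)."""
    odd_rates, even_rates = [], []
    for outcomes in pitchers:
        if len(outcomes) < 2 * n:
            continue  # this pitcher has not faced enough batters yet
        window = outcomes[:2 * n]
        odd_rates.append(sum(window[0::2]) / n)   # 1st, 3rd, 5th, ... event
        even_rates.append(sum(window[1::2]) / n)  # 2nd, 4th, 6th, ... event
    return pearson_r(odd_rates, even_rates)


def first_reliable_sample(pitchers, step=50, max_n=750):
    """Smallest n (in steps of `step`) where split-half r reaches CUTOFF."""
    for n in range(step, max_n + 1, step):
        r = reliability_at(pitchers, n)
        if r is not None and r >= CUTOFF:
            return n
    return None  # the stat "never did" get there


if __name__ == "__main__":
    # Hypothetical pitchers with different true strikeout rates.
    rng = random.Random(1)
    talents = [rng.uniform(0.10, 0.30) for _ in range(150)]
    staff = [[1 if rng.random() < p else 0 for _ in range(1500)]
             for p in talents]
    print(first_reliable_sample(staff))
```

Returning `None` when no interval clears the bar mirrors the "never did" entries in the results below.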
Again, all of the data in all its glory is below the cut, but the main findings are:
- Strikeouts are the one outcome over which pitchers seem to have the most control. Walks are slightly less reliable, but still worthy of mention as a reliable/skill-based outcome. This checks out with previous DIPS work, including my own.
- Pitchers are astonishingly reliable in what sort of balls come off the bat when they pitch. At 750 batters faced, the split-half reliabilities for line drives and grounders were above .90. So, to say that once the ball is hit, the pitcher has no control over what happens, is false. The pitcher seems to have a good amount of stability in inducing different types of batted balls. There are going to be ground ball pitchers and fly ball pitchers, and that isn’t the product of random chance. Where that ball in play lands, either in someone’s glove or on the grass for a hit, doesn’t appear to be as reliable. I’ve previously shown that pitchers’ results on fly balls are more consistent (at least as a matter of degree… the reliability numbers themselves aren’t overwhelming) than their results on ground balls. Still, overall BABIP is largely unstable, suggesting that there is little (although not nil) skill involved on the pitcher’s side.
- Contrary to original DIPS theory, home run rate isn’t very stable. In fact, a ball in play stat, singles/PA, is more reliable than HR/PA. This could be something that has to do with the pitchers or perhaps it has something to do with my methodology. Split-half controls for the four gentlemen standing directly behind the pitcher on the infield, so it may be that defense and pitching are once again entangled. Still, HR/PA reliability stats are fairly low. Even with a good sample size (750 BF), the split-half correlations were only .34 or so. Seems like a full season isn’t a good measure of a pitcher’s HR/PA ability.
- HR/FB was very unstable for pitchers. For batters, HR/FB stabilized pretty quickly. This suggests that the pitcher may be the one who gives up the fly ball, but the batter is the one who makes it leave the yard. So, if your favorite pitcher gave up a lot of HR/FB last year, fear not. Chances are he’ll be better next year.
- Relievers are hard to project because at the small sample sizes that relievers have in terms of batters faced, the stats used to describe pitchers are largely unreliable. This means that regression to the mean will take its toll on a reliever very quickly. Relievers who rely mostly on the strikeout are less likely to have this trouble.
- And while I’m in the neighborhood, a post at Lookout Landing rates pitchers in a way very much consistent with what I’ve found here. Worth a read.
Methodology and Results
Some methodological notes: Again, I used Retrosheet files for 2001-2006 in two-year windows. (I lumped 2001-2002 together, 2003-2004, and 2005-2006.) It’s not ideal because for some folks, the plate appearances under study occurred a year and a half apart, but that’s the only way to get enough of a sample of one man’s work to draw any kind of conclusions. At least they’re consecutive years. It does bring up the issue, especially with pitchers, that there is some selective sampling going on. Consistent pitchers (and consistently good pitchers, especially) tend to get more playing time. Alas, baseball is a wonderful data set from a methodological point of view, but it’s not a perfect one.
Again, I realize that .70 is an arbitrary cutoff. I’ve laid out my reasons for using it before and I stick by them. (Short version: .70 means that you have an r-squared of .49. Anything north of that means that the majority of the variance is consistent within a player.)
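The arithmetic behind that short version is quick to check. The snippet below is only a worked example; the helper name `variance_explained` is made up for illustration.

```python
from math import sqrt


def variance_explained(r):
    """Share of variance two measures have in common, given correlation r."""
    return r ** 2


cutoff_r = 0.70
shared_variance = variance_explained(cutoff_r)  # just under half
majority_r = sqrt(0.5)  # the r at which r-squared reaches exactly 0.50

print(f"r = {cutoff_r:.2f} -> r^2 = {shared_variance:.2f}")
print(f"r needed for r^2 = 0.50: {majority_r:.4f}")
# prints:
# r = 0.70 -> r^2 = 0.49
# r needed for r^2 = 0.50: 0.7071
```

So .70 sits right at the tipping point: a hair above it (about .707) is where the shared variance passes one half.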
Also, there’s one minor annoyance. I had to number the events the way that Retrosheet does. So, non-pitching events such as stolen base attempts, as well as passed balls, were counted as events. (As were pitching events that don’t result in the end of a plate appearance, such as balks or wild pitches.) So, when I say 100 batters faced, it’s probably not 100 full batters, but actually something like 95 batters plus 5 other events (balks, WP, PB, SB, CS, etc.). It’s annoying enough that I should mention it, but probably not a big enough deal to affect the major conclusions of the study.
Then there’s another issue of which pitching stats to study. Several of the usual stats used to evaluate pitchers are game-level stats. Wins and saves are generally the yardsticks by which we measure pitchers, but they are a poor gauge of what a pitcher actually did that day. To say that C.C. Sabathia picked up a win might mean that he barely survived five innings, gave up 7 runs, and got bailed out by some run support and the bullpen. It might mean that he threw a two-hit shutout. They’re rather imprecise. ERA is also a puzzler in this methodology, because a pitcher can give up an earned run (or an unearned run) despite the fact that he wasn’t even in the game at the time (and I never liked ERA or the concept of “earned runs” to begin with). I stuck to looking at various rate stats (K rate, walk rate), some one-number stats (AVG, OBP), and the batted ball profile.
Part I: Setting sampling minimum cutoffs for research
A few of the “How often does he” stats:
- K/PA - 50 BF; K/9 - 60 BF
- BB/PA - 250 BF; BB/9 - 300 BF
- K/BB ratio - 250 BF
- HR/PA and HR/9 - never did (at 750 BF, they were at .32 and .34, respectively)
- 1B/PA and 1B/9 - never did (at 750 BF, .57 and .50, respectively)
- 2B + 3B/PA and per 9 - never did (.33 and .36)
- HBP - never did (.53 and .54)
- WP - never did (.54 for both)
Looks like stats measured per PA, rather than per 9 innings, stabilize a bit more quickly, but it also looks like outside of walks and strikeouts, there is little consistency in a sample on issues of balls in play. The surprising finding was that in the 750 BF or more sample, singles were much more consistent than home runs.
Some one-number stats:
- I’ll give you the short version: I looked at AVG, OBP, SLG, OPS, and BABIP. None of them reached the magic cutoff of .70.
- Interestingly enough, good old batting average against was the most reliable stat in the 750 BF sample (split-half r = .569), with OBP and SLG at .52 and .49. OPS was at .49. BABIP was at .238, which is certainly more than zero, but certainly nothing to write home about.
A spin through the batted ball profile.
- Ground balls / Ball in play - less than 50 BF
- Line drives - less than 50 BF
- Fly balls - less than 50 BF
- Pop ups - 325 BF
- HR/FB - never made it to .70; at 750 BF, it had a split-half of .208
Again, the numbers above are for researchers looking to set a cutoff of “X number of batters faced or above” for their studies.
Part II: Evaluating individual players. When does a stat become meaningful for an individual pitcher?
Again, I used 50 BF intervals from 50 to 750. I’ll present the cutoffs and which stats hit the magic .70 mark at each one. I also only calculated stats per PA, rather than per 9 innings, since those seemed to be the more reliable stats, if only by a bit.
- 50 BF - nothing
- 100 BF - nothing
- 150 BF - K/PA, grounder rate, line drive rate
- 200 BF - flyball rate, GB/FB
- 250 BF - nothing
- 300 BF - nothing
- 350 BF - nothing
- 400 BF - nothing
- 450 BF - nothing
- 500 BF - K/BB, pop up rate
- 550 BF - BB/PA
- 600 BF - nothing
- 650 BF - nothing
- 700 BF - nothing
- 750 BF - nothing
You can’t tell a lot about a pitcher by looking at his stats over a single season. You can get a pretty good idea of how often he walks and strikes batters out, and what type of batted balls he gives up generally… but that’s about it.
Part III: How reliable is that stat?
Using the same methodology as part two, I present split-half reliability numbers at two cutoffs: 300 batters faced and 750 batters faced. At 300 batters faced, here are the split-half reliability numbers, in order from most reliable stats to least reliable.
- K/PA - .821
- BB/PA - .597
- K/BB - .575
- 1B/PA - .340
- HR/PA - .262
- 2B+3B/PA - .216
- OBP - .430
- OPS - .386
- AVG - .379
- SLG - .364
- BABIP - .135
Batted ball profile:
- Line drive/ball in play - .861
- Ground ball/BIP - .816
- GB/FB - .788
- Fly ball/BIP - .779
- Pop-up/BIP - .586
- HR/FB - .145
And at 750 Batters Faced, same idea:
- K/PA - .873
- K/BB - .806
- BB/PA - .789
- 1B/PA - .525
- HR/PA - .323
- 2B+3B/PA - .237
- AVG - .527
- OBP - .522
- OPS - .459
- SLG - .455
- BABIP - .188
Batted ball stats:
- Line drives - .936
- Ground balls - .905
- Fly balls - .862
- GB/FB - .852
- Pop ups - .764
- HR/FB - .207
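To make the two sets of numbers easier to compare, here is a small Python sketch that loads the Part III values listed above and reports which stats clear .70 at each sample size. The values are copied from the lists; the dictionary layout and the `reliable_stats` helper are just one hypothetical way to organize them.

```python
CUTOFF = 0.70

# Split-half reliabilities from Part III: (at 300 BF, at 750 BF).
RELIABILITY = {
    "K/PA": (0.821, 0.873),
    "BB/PA": (0.597, 0.789),
    "K/BB": (0.575, 0.806),
    "1B/PA": (0.340, 0.525),
    "HR/PA": (0.262, 0.323),
    "2B+3B/PA": (0.216, 0.237),
    "OBP": (0.430, 0.522),
    "OPS": (0.386, 0.459),
    "AVG": (0.379, 0.527),
    "SLG": (0.364, 0.455),
    "BABIP": (0.135, 0.188),
    "LD/BIP": (0.861, 0.936),
    "GB/BIP": (0.816, 0.905),
    "GB/FB": (0.788, 0.852),
    "FB/BIP": (0.779, 0.862),
    "Pop-up/BIP": (0.586, 0.764),
    "HR/FB": (0.145, 0.207),
}


def reliable_stats(column):
    """Names of stats at or above CUTOFF; column 0 = 300 BF, 1 = 750 BF."""
    return sorted(name for name, rs in RELIABILITY.items()
                  if rs[column] >= CUTOFF)


at_300 = reliable_stats(0)
at_750 = reliable_stats(1)
gained = sorted(set(at_750) - set(at_300))

print("Reliable at 300 BF:", at_300)
print("Reliable at 750 BF:", at_750)
print("Newly reliable by 750 BF:", gained)
```

Run as written, it shows five stats over the bar at 300 BF and eight at 750 BF, with BB/PA, K/BB, and pop-up rate being the ones that get there in between.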