## On the reliability of pitching stats

*Executive Summary*

Any study on the reliability of pitching stats is by default a paper on DIPS. When Voros McCracken wrote the original DIPS paper, he started from the simple fact that while strikeout rates, walk rates, HR rates, and HBP rates were fairly consistent from year to year, BABIP (batting average on balls in play) was not. Voros, in technical terms, was measuring test-retest reliability. In the few years since his discovery, several folks have picked apart his findings and his basic premise: that a pitcher has great control (as opposed to luck) over things that happen without the involvement of the defense (hence “Defense Independent Pitching Statistics”), but little control over what happens when the ball is actually in play. Despite the large amount of work spent attempting to disprove the theory, it has generally stood up to the tests thrown at it.

A little while ago, I looked at the reliability of batting statistics using split-half reliability. Basically, I coded each plate appearance as even-numbered or odd-numbered and separated the two groups accordingly. I looked at stats like OBP in a player’s even-numbered plate appearances and his odd-numbered plate appearances. If OBP is reliable, then OBP in even-numbered plate appearances and OBP in odd-numbered plate appearances will correlate well with each other. Also, I looked to see when a stat became reliable enough to be useful in analyses, using a standard cutoff of a split-half reliability of .70. The more plate appearances (or for pitchers, batters faced) a player accumulates, the better idea we have of his true talent level, and the more reliable (reproducible) a stat will be over the same time frame in the future. If a batter hit 10 HR in 400 PA, and HR/PA was very reliable, we’d guess that he’s going to hit 10 HR or so in his next 400 PA. If it’s not reliable, the fact that he hit 10 HR in his last 400 PA means nothing in terms of predicting his next 400 PA.

Now I turn my attention to pitching stats. I’ve hidden the numerical spaghetti behind the cut, and if you want to read it, it’s all there for you. I’ve used a method similar to my previous article, in that I look at the issue in three ways.

- First, I looked to answer the question of what the minimum number of PA or BF should be used in research studies. That is, suppose I want to do work on what strikeout rates predict to. Usually, I would say something like “all pitchers with a minimum of X batters faced” so that the “cup of coffee” call up guys won’t contaminate the sample. What should that minimum number be? To do this, I found the number of BF where the split-half correlation for the sample composed of that minimum number was at least .70.
- The second method is a look at how long until a stat becomes meaningful for a particular player. If Tuffy Rhodes hits 3 HR on Opening Day, we know that’s not a big enough sample size to tell us any meaningful information about him. However, after a few hundred PA, we can probably make some pretty good conclusions about his ability to hit for power. I went in 50 PA intervals to test this. For example, to test whether a stat was reliable at 50 PA, I took a player’s first 100 PA in the database (two 50-PA samples) and calculated whatever stats were of interest from the even-numbered PA and the odd-numbered PA. The number where the stat crossed a split-half reliability of .70 was where it became officially certified as appropriately reliable.
- Finally, I looked at what the split-half correlations were for the pitching stats under observation at 300 BF and 750 BF. This gives us an idea of how reliable stats are for a starter and a reliever, using some rough cutoffs.
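The even/odd split behind all three approaches can be sketched in a few lines. This is a minimal illustration, not the code that produced the numbers in this article: the function names are made up, the input is assumed to be one chronological 0/1 sequence per pitcher (1 = the event of interest, such as a strikeout), and the pitchers at the bottom are simulated rather than drawn from Retrosheet.

```python
import numpy as np


def split_half_reliability(outcomes_by_pitcher, n_events):
    """Correlate each pitcher's rate in odd- vs. even-numbered events.

    outcomes_by_pitcher: iterable of chronological 0/1 sequences, one per pitcher
    n_events: number of events used per pitcher (pitchers with fewer are dropped)
    """
    odd_rates, even_rates = [], []
    for outcomes in outcomes_by_pitcher:
        if len(outcomes) < n_events:
            continue  # not enough batters faced to qualify for the sample
        clipped = np.asarray(outcomes[:n_events], dtype=float)
        odd_rates.append(clipped[0::2].mean())   # events 1, 3, 5, ...
        even_rates.append(clipped[1::2].mean())  # events 2, 4, 6, ...
    return float(np.corrcoef(odd_rates, even_rates)[0, 1])


def simulate_pitchers(n_pitchers, n_events, mean_rate, sd_rate, seed=0):
    """Synthetic data: each pitcher has a fixed true rate; each event is a coin flip."""
    rng = np.random.default_rng(seed)
    true_rates = np.clip(rng.normal(mean_rate, sd_rate, n_pitchers), 0.01, 0.99)
    return [rng.binomial(1, p, size=n_events) for p in true_rates]


# Pitchers who differ in true rate vs. pitchers who are all identical.
skill = simulate_pitchers(200, 600, mean_rate=0.18, sd_rate=0.05)
no_skill = simulate_pitchers(200, 600, mean_rate=0.18, sd_rate=0.0)
print(split_half_reliability(skill, 600))     # high: the two halves agree
print(split_half_reliability(no_skill, 600))  # near zero: nothing but noise
```

When pitchers truly differ, the two halves track each other. When every pitcher has the same underlying rate, the correlation hovers around zero no matter how many events are counted.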

Again, all of the data in all its glory is below the cut, but the main findings are:

- Strikeouts are the one *outcome* over which pitchers seem to have the most control. Walks are slightly less reliable, but still worthy of mention as a reliable/skill-based outcome. This checks out with previous DIPS work, including my own.
- Pitchers are astonishingly reliable in what sort of balls come off the bat when they pitch. At 750 batters faced, the split-half reliabilities for line drives and grounders were above .90. So, to say that once the ball is hit, the pitcher has no control over what happens, is false. The pitcher seems to have a good amount of stability in inducing different types of batted balls. There are going to be ground ball pitchers and fly ball pitchers, and that isn’t the product of random chance. Where that ball in play lands, either in someone’s glove or on the grass for a hit, doesn’t appear to be as reliable. I’ve previously shown that pitchers’ results on fly balls are more consistent (at least as a matter of degree… the reliability numbers themselves aren’t overwhelming) than their results on ground balls. Still, overall BABIP is largely unstable, suggesting that there is little (although not nil) skill involved on the pitcher’s side.
- Contrary to original DIPS theory, home run rate isn’t very stable. In fact, a ball in play stat, singles/PA, is more reliable than HR/PA. This could be something that has to do with the pitchers or perhaps it has something to do with my methodology. Split-half controls for the four gentlemen standing directly behind him on the infield, so it may be that defense and pitching are once again entangled. Still, HR/PA reliability stats are fairly low. Even with a good sample size (750 BF), the split-half correlations were only .34 or so. Seems like a full season isn’t a good measure of a pitcher’s HR/PA ability.
- HR/FB was very unstable for pitchers. For batters, HR/FB stabilized pretty quickly. This suggests that the pitcher may be the one who gives up the fly ball, but the batter is the one who makes it leave the yard. So, if your favorite pitcher gave up a lot of HR/FB last year, fear not. Chances are he’ll be better next year.
- Relievers are hard to project because at the small sample sizes that relievers have in terms of batters faced, the stats used to describe pitchers are largely unreliable. This means that regression to the mean will take its toll on a reliever very quickly. Relievers who rely mostly on the strikeout are less likely to have this trouble.
- And while I’m in the neighborhood, a post at Lookout Landing rates pitchers in a way very much consistent with what I’ve found here. Worth a read.

*Methodology and Results*

Some methodological notes: Again, I used Retrosheet files for 2001-2006 in two-year windows. (I lumped 2001-2002 together, 2003-2004, and 2005-2006.) It’s not ideal because for some folks, these plate appearances under study occurred a year and a half apart, but that’s the only way to get enough of a sample of one man’s work to draw any kind of conclusions. At least it’s better that they’re consecutive years. It does bring up the issue, especially with pitchers that there is some selective sampling going on. Consistent pitchers (and consistently good pitchers, especially) tend to get more playing time. Alas, baseball is a wonderful data set from a methodological point of view, but it’s not a perfect one.

Again, I realize that .70 is an arbitrary cutoff. I’ve laid out my reasons for using it before and I stick by them. (Short version: .70 means that you have an r-squared of .49. Anything north of that means that the majority of the variance is consistent within a player.)

Also, there’s one minor annoyance. I had to number the events by the way that Retrosheet does events. So, non-pitching events such as stolen base attempts, as well as passed balls, were counted as events. (As were pitching events that don’t result in the end of a plate appearance, such as balks or wild pitches.) So, when I say 100 batters faced, it’s probably not 100 full batters, but actually something like 95 batters plus 5 other events (balks, WP, PB, SB, CS, etc.). It’s annoying enough that I should mention it, but probably not a big enough deal to affect the major conclusions of the study.
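For anyone who wants to avoid this, one option is to drop the rows that do not end a plate appearance before numbering. The sketch below assumes each parsed event row carries a batter-event flag equal to `"T"` when the event ends a plate appearance; the key names `bat_event_fl` and `pitcher_id` are hypothetical and depend on how the Retrosheet files were exported.

```python
def number_batters_faced(events):
    """Keep only batter events and number them 1, 2, 3, ... within each pitcher.

    events: chronological list of dicts with (hypothetical) keys
            "pitcher_id" and "bat_event_fl"
    """
    counts = {}
    numbered = []
    for row in events:
        if row["bat_event_fl"] != "T":
            continue  # stolen bases, balks, wild pitches, passed balls, etc.
        pid = row["pitcher_id"]
        counts[pid] = counts.get(pid, 0) + 1
        numbered.append(
            {**row, "bf_number": counts[pid], "is_even": counts[pid] % 2 == 0}
        )
    return numbered


sample = [
    {"pitcher_id": "p1", "bat_event_fl": "T"},  # plate appearance
    {"pitcher_id": "p1", "bat_event_fl": "F"},  # running event, skipped
    {"pitcher_id": "p1", "bat_event_fl": "T"},  # plate appearance
]
for row in number_batters_faced(sample):
    print(row["bf_number"], row["is_even"])
```

Numbering after the filter means that "100 batters faced" really is 100 completed plate appearances, and the even/odd split is no longer shifted by running events.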

Then there’s another issue of which pitching stats to study. Several of the usual stats used to evaluate pitchers are game-level stats. Wins and saves are generally the yardsticks by which we measure pitchers, but they are a poor gauge of what a pitcher actually did that day. To say that C.C. Sabathia picked up a win might mean that he barely survived five innings, gave up 7 runs, and got bailed out by some run support and the bullpen. It might mean that he threw a two-hit shutout. They’re rather imprecise. ERA is also a puzzler in this methodology, because a pitcher can give up an earned run (or an unearned run) despite the fact that he wasn’t even in the game at the time (and I never liked ERA or the concept of “earned runs” to begin with). I stuck to looking at various rate stats (K rate, walk rate), some one-number stats (AVG, OBP), and the batted ball profile.

*Part I: Setting sampling minimum cutoffs for research*

A few of the “How often does he” stats:

- K/PA - 50 BF; K/9 - 60 BF
- BB/PA - 250 BF; BB/9 - 300 BF
- K/BB ratio - 250 BF
- HR/PA and HR/9 - never did (at 750 BF, they were at .32 and .34, respectively)
- 1B/PA and 1B/9 - never did (at 750 BF, .57 and .50, respectively)
- 2B + 3B/PA and per 9 - never did (.33 and .36)
- HBP - never did (.53 and .54)
- WP - never did (.54 for both)

Looks like stats measured per PA, rather than per 9 innings, stabilize a bit more quickly, but it also looks like outside of walks and strikeouts, there is little consistency in a sample on issues of balls in play. The surprising finding was that in the 750 BF or more sample, singles were much more consistent than home runs.

Some one-number stats:

- I’ll give you the short version: I looked at AVG, OBP, SLG, OPS, and BABIP. None of them reached the magic cutoff of .70
- Interestingly enough, good old batting average against was the most reliable stat in the 750 BF sample (split-half r = .569), with OBP and SLG at .52 and .49. OPS was at .49. BABIP was at .238, which is certainly more than zero, but certainly nothing to write home about.

A spin through the batted ball profile.

- Ground balls / Ball in play - less than 50 BF
- Line drives - less than 50 BF
- Fly balls - less than 50 BF
- Pop ups - 325 BF
- HR/FB - never made it to .70, at 750 BF, it had a split half of .208

Again, the numbers above are for researchers looking to set a cutoff of “X number of batters faced or above” for their studies.

*Part II: Evaluating individual players. When does a stat become meaningful for an individual pitcher?*

Again, I used 50 BF intervals from 50 to 750. I’ll present the cutoffs and which stats hit the magic .70 mark at each one. I also only calculated stats per PA, rather than per 9 innings, since those seemed to be the more reliable stats, if only by a bit.

- 50 BF - nothing
- 100 BF - nothing
- 150 BF - K/PA, grounder rate, line drive rate
- 200 BF - flyball rate, GB/FB
- 250 BF - nothing
- 300 BF - nothing
- 350 BF - nothing
- 400 BF - nothing
- 450 BF - nothing
- 500 BF - K/BB, pop up rate
- 550 BF - BB/PA
- 600 BF - nothing
- 650 BF - nothing
- 700 BF - nothing
- 750 BF - nothing

You can’t tell a lot about a pitcher by looking at his stats over a single season. You can get a pretty good idea of how often he walks and strikes batters out, and what type of batted balls he gives up generally… but that’s about it.
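The "when does it cross .70" lookup used in this part is simple enough to sketch. The helper name is made up for illustration, and the example feeds it only the two checkpoints reported in Part III (300 and 750 BF), so it cannot reproduce the finer 50 BF steps listed above.

```python
def first_reliable_cutoff(reliability_by_bf, cutoff=0.70):
    """Return the smallest BF checkpoint whose split-half r meets the cutoff."""
    for bf in sorted(reliability_by_bf):
        if reliability_by_bf[bf] >= cutoff:
            return bf
    return None  # never reached the cutoff at any checkpoint tested


# Split-half r at 300 and 750 BF, taken from the Part III lists in this article.
bb_per_pa = {300: 0.597, 750: 0.789}
hr_per_pa = {300: 0.262, 750: 0.323}

print(first_reliable_cutoff(bb_per_pa))  # first checkpoint at or above .70
print(first_reliable_cutoff(hr_per_pa))  # None: never got there
```

With a full set of 50 BF checkpoints in the dictionary, the same function would return the cutoffs listed above.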

*Part III: How reliable is that stat?*

Using the same methodology as part two, I present split-half reliability numbers at two cutoffs: 300 batters faced and 750 batters faced. At 300 batters faced, here are the split-half reliability numbers, in order from most reliable stats to least reliable.

Rate stats:

- K/PA - .821
- BB/PA - .597
- K/BB - .575
- 1B/PA - .340
- HR/PA - .262
- 2B+3B/PA - .216

One-number stats:

- OBP - .430
- OPS - .386
- AVG - .379
- SLG - .364
- BABIP - .135

Batted ball profile:

- Line drive/ball in play - .861
- Ground ball/BIP - .816
- GB/FB - .788
- Fly ball/BIP - .779
- Pop-up/BIP - .586
- HR/FB - .145

And at 750 Batters Faced, same idea:

Rate stats:

- K/PA - .873
- K/BB - .806
- BB/PA - .789
- 1B/PA - .525
- HR/PA - .323
- 2B+3B/PA - .237

One-number stats:

- AVG - .527
- OBP - .522
- OPS - .459
- SLG - .455
- BABIP - .188

Batted ball stats:

- Line drives - .936
- Ground balls - .905
- Fly balls - .862
- GB/FB - .852
- Pop ups - .764
- HR/FB - .207

10 Responses to “On the reliability of pitching stats”

January 6th, 2008 at 9:15 pm

Might the /PA stats have their reliability inflated because of strikeouts and walks? That is, suppose that HR/PA were completely random, except that some pitchers give up fewer because, with high SO and BB, the batters don’t get much chance to hit the ball. Wouldn’t you get a higher (perhaps significant) correlation even if HR/BIP had zero reliability?

January 7th, 2008 at 10:33 am

Pitchers that give up a lot of HR do not exist in MLB.

Correlation is not a measure of skill, but a measure of variance. If the variance of the true skill HR is low, correlation will be lower than from a “general population”. Lots of pitchers give up lots of singles. They can still be successful by doing other things.

In short, if you take the top 5000 pitchers in the world as your general population, and get to select 500 of them to play MLB, the singles rate of the MLB population will be a lot closer to the general population than the HR rate.

***

In Retro, there is a “batter event flag” (somewhere around field 50 or so… don’t remember exactly). Select when that = ‘T’, and that gets rid of your running events.

***

For your part 3, can you show what the mean PA is for the 300 PA and 750 PA cutoffs?

***

Great job!

January 7th, 2008 at 11:45 am

Could you run the numbers for ERA? After all, most non-stats people live and die by ERA as a pitcher stat. I know it is not going to show good results, but it sure would help to be able to point to this article and show people just why ERA is not reliable over a single season.

January 7th, 2008 at 11:46 am

Ender, if you want to see some insight into why ERA is not a reliable stat over a season, look to my article from this blog from a few weeks ago, titled “Reevaluating ERA”.

January 7th, 2008 at 4:26 pm

Very interesting stuff, I guess people worrying about Santana b/c of his HR increase can shut up.

January 7th, 2008 at 7:55 pm

If HR/9 or HR/PA fluctuate wildly, that seems to further the idea that pitchers don’t control how many of their fly balls go for home runs (or at least that it’s mostly due to park effects).

January 7th, 2008 at 10:51 pm

BJ—I wonder what the people to whom you refer would have said about the only pitcher in the 500 home run club—as the victim, that is. It didn’t exactly keep Robin Roberts out of the Hall of Fame to have surrendered one lifetime bomb more than Eddie Murray hit.

Don’t see why everyone gets on mah hitting. I go the other way. I’ve thrown some of the longest balls in baseball history. (Weak-hitting, homer-prone Brooklyn Dodger lefthander Preacher Roe.)

Jeff

January 7th, 2008 at 11:59 pm

A few answers, now that I’m finally back in one city (for now…)

Phil - I have no doubt that the per PA stats are inflated a bit in their reliability in response to the excessive reliability of the K and BB rates.

Tango - pitchers who give up a lot of HR don’t exist… for very long… as to part 3, the mean is exactly 300/750 BF on those correlation numbers. In those cases, I artificially clipped everyone down to 300/750 (by taking their first 300/750), so long as they had 300 to give. (So someone who had only 295 was politely excused from the sample.)

Ender - here’s the problem with ERA in this framework. To start off a season, a pitcher gives up a single (odd BF). In the next PA (even), he gives up a double and the runner scores. Then he strikes out the next batter (odd), but the next guy singles in the guy on second (even). So, in the even-numbered plate appearances, two earned runs score, though he hasn’t recorded an out. In the odd numbered ones, he’s thrown 1/3 IP with no runs. But wait, one of the runs that scored got on during an odd-numbered PA. Earned runs are (home runs excepted), the stringing together of a couple of plate appearances. Even if I can get over my dislike for the “earned run”, I can’t figure out a way to appropriately partial things out. I suppose the only way I could do it would be even-numbered innings vs. odd-numbered innings pitched, but that only works for pitchers who complete the inning.

January 8th, 2008 at 10:35 am

Pizza, ok great.

At 750 “PA”, you’ve got K/PA with an r=.873

Using the equation:

r=PA/(PA+x)

.873=750/(750+x)

we get an x=109

So, our general equation for correlation of K/PA is:

r=PA/(PA+109)

So, if you have 300 PA, we can estimate the likely r. Using the above equation, we get r=.73. Your sample data shows .82. I’m not happy with this difference.

***

Let’s continue with BB/PA.

Using 750 PA, the equation is

r=PA/(PA+201)

So, at 300 PA, we’d expect r=.60. Your sample shows r=.60! Bingo!

***

We CANNOT do K/BB. That is a ratio of two independent events. Unless you take the log, you need to reform it as: K/(K+BB), and then run the regression. You need to create a rate stat if you want to apply linear regression. Otherwise, why not do BB/K? You’ll actually get different results.

***

At 750 PA, the equation for HR/PA becomes:

r=PA/(PA+1572)

If you have 300 PA, we expect r=.16. Your sample shows r=.26. Again, not happy here.

***

At 750 PA, the equation for 1B/PA becomes:

r=PA/(PA+679)

If you have 300 PA, we expect r=.31. Your sample shows r=.34. Pretty close.

***

At 750 PA, the equation for XBH/PA becomes:

r=PA/(PA+2415)

At 300 PA, r=.11. Your sample shows r=.22. This is very inconsistent. Your r is very close at both PA=300 and PA=750. This certainly makes little sense, and you have some sort of bias in the data here, be it park, or whatnot.

***

Here is what the batted ball info looks like:

| Event | r at 750 PA | the “x” | expected r at 300 PA | sample r at 300 PA | result |
|---|---|---|---|---|---|
| Line drives | 0.936 | 51 | 0.85 | 0.86 | Bingo! |
| Ground balls | 0.905 | 79 | 0.79 | 0.82 | Pretty close |
| Fly balls | 0.862 | 120 | 0.71 | 0.78 | Eh, not bad |
| Pop ups | 0.764 | 232 | 0.56 | 0.59 | Pretty close |
| HR/FB | 0.207 | 2,873 | 0.09 | 0.15 | a bit off |

I find presenting the general “r” equation as I am doing provides what you need for *any* level of PA.
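The arithmetic in this comment is easy to reproduce. The sketch below solves r = PA/(PA+x) for x from the 750 BF figure and then projects r at 300 BF, using the split-half numbers from Part III of the article. The function names are illustrative only.

```python
def solve_x(r, pa):
    """Solve r = pa / (pa + x) for x."""
    return pa * (1.0 - r) / r


def expected_r(pa, x):
    """Projected reliability at a given number of PA."""
    return pa / (pa + x)


# stat: (sample r at 300 BF, sample r at 750 BF), from Part III of the article
observed = {
    "K/PA": (0.821, 0.873),
    "BB/PA": (0.597, 0.789),
    "1B/PA": (0.340, 0.525),
    "HR/PA": (0.262, 0.323),
    "2B+3B/PA": (0.216, 0.237),
}

for stat, (r300, r750) in observed.items():
    x = solve_x(r750, 750)
    print(f"{stat:9s} x={x:6.0f}  expected r@300={expected_r(300, x):.2f}  sample={r300:.2f}")
```

The same two functions give the expected r at any other number of PA once x is known.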

***

Here’s another way to think about the BB/PA. I took all the players with at least 2000 BFP, from 2001-2006. That’s 158 pitchers. I figure the zScore for each pitcher, from Brad Radke’s -11 standard deviations to Ishii’s +12 SDs. The standard deviation of all those zScores was 4.408. The average PA was 3596.

The r (which is likely the same intraclass correlation that Pizza is talking about) is r=1-(1/4.408)^2= .9485

Plugging this into:

r=PA/(PA+x)

and we get:

.9485=3596/(3596+x)

Solving for x=195

So, our BB/PA equation is:

r=PA/(PA+195)

At PA=300, we’d expect r=.606. Pizza’s sample says r=.597.

At PA=750, we’d expect r=.794. Pizza’s sample says r=.789.

That’s a huge bingo!

The advantage here is that it’s a supersnap to do in Excel. Plus, you get an actual regression equation based on PA (or whatever your denominator is).
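For readers who would rather check this outside Excel, here is a rough sketch of the same steps. The comment does not spell out how each z-score was computed, so the `z_score` function assumes a binomial standard error around the league rate. That formula and the function names are assumptions, not the commenter's stated method.

```python
import math


def z_score(successes, pa, league_rate):
    """Standard deviations from the league rate, assuming binomial noise."""
    se = math.sqrt(league_rate * (1.0 - league_rate) / pa)
    return (successes / pa - league_rate) / se


def reliability_from_z_sd(z_sd):
    """r = 1 - (1 / SD of z-scores)^2, as in the comment above."""
    return 1.0 - (1.0 / z_sd) ** 2


def solve_x(r, pa):
    """Solve r = pa / (pa + x) for x."""
    return pa * (1.0 - r) / r


r = reliability_from_z_sd(4.408)  # SD of z-scores quoted in the comment
x = solve_x(r, 3596)              # average PA quoted in the comment
print(round(r, 4), round(x))
print(round(300 / (300 + x), 3), round(750 / (750 + x), 3))
```

If the z-scores were pure noise, their standard deviation would be about 1 and r would come out near zero. The further the spread exceeds 1, the more of it reflects real differences between pitchers.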

January 8th, 2008 at 10:39 am

Pizza, looks like my post was too long. I’ve posted on my blog (click on my name).
