Monday, April 25, 2011

Did NFL teams discriminate against black coaching candidates? Part II

I posted recently about a "Rooney Rule" study that appeared in the Journal of Sports Economics. In that paper, the authors found that, from 1990 to 2002, NFL teams with black head coaches won 1.1 games per season more than teams with white head coaches. The authors took this as evidence that the NFL was discriminating against black candidates -- hiring only the best black coaches, and not the average ones.

A few more thoughts on the issue:


1. I'm not a subject matter expert (SME) on NFL coaching, but it seems to me very, very unlikely that a sample of 29 coaches, no matter how you selected them, could be, on average, as much as 1.1 games better than average. That seems way too high. Maybe one coach could, sure, under very specific circumstances (say, if he figures out he should start Tom Brady instead of Drew Bledsoe). But the average of 29 coaches? That would be nearly impossible, wouldn't it?

And it's not like the study chose the best 29 coaches -- they chose the only 29 black coaches there were. That means the best black coaches of the 29 would have to be substantially better than 1.1 wins, season after season. That, again, seems implausible.

It's a critical question, because, if the effect is too big to be coaching, the study is no evidence at all -- it literally has zero value!

Here's the logic. If you argue that the 1.1 games is statistically significant, then you're saying that there's evidence that the teams with the black coaches are significantly different, in some way, from the teams with the white coaches. You may believe that the difference is the coach's race. But since 1.1 is too big an effect to be just the coaches, the difference must be, in part, something else. So, since there must be something else going on, you have very little basis for thinking that there's evidence that even *any part of it* is coaching. After all, whatever the "something else" is, it could be just as easily responsible for all of the 1.1 as part of it. In fact, it could be responsible for *more* than 1.1 games, and the black coaches might be *worse* than the white coaches!

If you get an effect size that couldn't possibly be what you're looking for, then all you have evidence for is that there's something else causing the effect. That means there are confounding factors your study hasn't controlled for, which means you have no evidence at all for your particular hypothesis. That doesn't mean you're wrong -- it's not that you have evidence against it, it's just that you have no evidence *for* it.

This is a little bit counterintuitive -- it means a small effect is better evidence than a large effect. If you get statistical significance with a difference of 1.1 wins, that means nothing. But if you get statistical significance with a difference of 0.1 wins, now at least there's a chance that you're seeing something real.


2. In a different post a while ago, I quoted Bill James on psychology:

"... in order to show that something is a psychological effect, you need to show that it is a psychological effect -- not merely that it isn't something else. Which people still don't get. They look at things as logically as they can, and, not seeing any other difference between A and B conclude that the difference between them is psychology."


After this coaching study, it occurs to me that Bill's argument holds for *any* possible cause, not just psychology. Racial bias, for instance. Editing Bill's quote:

"... in order to show that something is a racial bias effect, you need to show that it is a racial bias effect -- not merely that it isn't something else. Which people still don't get. They look at things as logically as they can, and, not seeing any other difference between A and B conclude that the difference between them is racial bias."


The typical study will spend a lot of time and paragraphs and numbers persuading you that there is evidence that A and B are different at a statistically significant level. But then they'll give you only a few sentences *about what that evidence really means*. Shouldn't it be the other way around?

It's as if you're on trial for murder, and the prosecution spends five days nailing down how many millions of dollars you stand to inherit from the victim. They call a stockbroker, a banker, a real estate agent, all of whom testify for hours about how much the guy left you in his will, down to the penny. And then, after all that, the prosecutor says to the judge, "so, obviously, the accused must have done it. We rest our case."

That's backwards. Showing that A and B are different is the easy part -- it's just regression. The hard part is figuring out *why* A and B are different. Most of the effort should go into the argument, not into the statistics.


3. A reader was kind enough to send me a similar study from "Labour Economics." It's called "Moving on up: The Rooney rule and minority hiring in the NFL," by Benjamin L. Solow, John L. Solow, and Todd B. Walker. (Here's a press release.)

The authors create a model to predict whether a "level-two" assistant coach is promoted to head coach, based on performance, age, calendar year, years of experience, and race. It turns out that race is not significant, either before or after the Rooney Rule. Nonetheless, the coefficient for "minority coach" (most are black) is slightly negative (-0.6 SD) before, and slightly positive (+0.8 SD) after.

If you choose to interpret the pre-2003 coefficient at face value, even though it's not statistically significant (which I don't recommend), it's equivalent to two extra years of high-level coaching experience.



Labels: , ,

Friday, April 15, 2011

Can managers induce "career years" from their players?

Over at the "Ask Bill" section of Bill James' website, there was some discussion last week about the 1980 Yankees (subscription required; start at April 7). They finished 103-59 despite a team that didn't look that great on paper. Was it that manager Dick Howser somehow got more out of the players than expected?

A few years ago, I did a study that tried to estimate how much a team was affected by the "career years" or "slump years" of their players. (Go here, look for "1994 Expos".) What I did, basically, was take a weighted average of a player's stats the two years before and two years after, regress it to the mean a bit, and use that as an estimate of what the guy "should have" done that year. Any difference, I attributed to luck. In the 1980 Yankees case, it was 12 games of "career years" from their hitters, and effectively zero for their pitchers.
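To make that concrete, here's a minimal Python sketch of that kind of "should have done" estimate. The weights on the surrounding seasons and the amount of regression to the mean are hypothetical stand-ins for illustration, not the values used in the original study:

```python
# A minimal sketch of the "career year" baseline described above.
# The weights and the regression-to-the-mean fraction are hypothetical.

def expected_performance(two_before, one_before, one_after, two_after,
                         league_mean, regress_fraction=0.2):
    """Estimate what a player 'should have' done in the middle year."""
    # Weight adjacent seasons more heavily than seasons two years away.
    weights = [1, 2, 2, 1]
    seasons = [two_before, one_before, one_after, two_after]
    surround = sum(w * s for w, s in zip(weights, seasons)) / sum(weights)
    # Regress part of the way toward the league mean.
    return (1 - regress_fraction) * surround + regress_fraction * league_mean

# Example: a hitter with surrounding OPS of .780, .800, .790, .770
# in a league where the average is .750.
print(round(expected_performance(0.780, 0.800, 0.790, 0.770, 0.750), 3))  # ~0.781
```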

A bit of discussion followed; Bill James wrote that he wasn't convinced:

"I am leery of describing as luck things that we don't understand. It may well be that players had good years because Howser or someone else was able to help them have good years."


Fair enough. In response, I posted a short statistical argument that if it *was* the manager, it couldn't happen very often, and another reader (Chris DeRosa) disputed what I said (partly, I think, because I didn't say it very well).

Since "Ask Bill" is not a good place for a long explanation, I thought I start again here and better explain what I'm talking about.

----------

Suppose we knew the exact talent level of every team in the majors. That is: for every single game, between any two teams, we know the exact chance either team will win. If both teams have an equal chance, it's exactly like flipping a fair coin. If the favorite has a 64 percent chance of winning, it's like flipping a coin that has a 64 percent chance of landing heads.

In real life, this is pretty much the way it works. If not, the Vegas odds on baseball games wouldn't be so close to even. If you could look at the specifics of a game and have a 90% idea of who would win that day, Vegas would routinely offer 9:1 odds on underdogs. And they don't. That means that a huge part of who wins a baseball game is unpredictable.

So, a team's season record is like a series of 162 coin tosses -- heads is a win, tails is a loss. Mathematically, using the normal approximation to the binomial distribution, you can show that the SD of team wins over a season, for a .500 team, is about 6.3 wins. That is, you expect 81-81, but you could easily wind up 87-75, or even 69-93, just due to luck.

The SD drops as the team gets better or worse than .500, but it doesn't drop much. If it's a .600 team, rather than a .500 team, the SD due to "coin tossing" is still 6.2 wins. Even for a .700 team, the SD is still about six games a season -- 5.8, to be exact.
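Here's a quick check of those figures, using the standard binomial SD formula:

```python
# SD of season wins = sqrt(N * p * (1 - p)), over N = 162 games.
from math import sqrt

for p in (0.500, 0.600, 0.700):
    sd = sqrt(162 * p * (1 - p))
    print(f"true talent {p:.3f}: SD of season wins = {sd:.2f}")

# Prints 6.36, 6.24, and 5.83 -- close to the figures quoted above.
```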

Also, there's no need to keep the assumption that all games are the same. Suppose, before every game starts, you know the exact talent of both teams, and even the exact home field advantage for that game. You can even be omniscient enough to adjust for the weather, and injuries, and the fact that the starting pitcher had a big fight with his wife last night. Before the game starts, you'll have an extremely accurate estimate of the chance of the home team winning.

Still, that chance will be substantially less than 100%. You'll still have a huge amount of luck happening. Your estimate is almost always going to be less than, say, .700. It is absolutely impossible to get much better than that, for the same reason it's impossible to predict what the temperature will be exactly one year from now.

In theory, it could be predictable -- but the predictability is over uncountable numbers of molecules, beyond any possible computing capability humans could ever devise. So what is left is essentially random.


That means that, when we total up your wins and losses for the season compared to talent, no matter how accurate your talent estimates are, you're going to find that your SD is *still* around 6.2. That's an unalterable, natural limit of the universe, like the speed of light.

--------

If you have a model for estimating team talent, a good test of that model is how close your error can get to the natural lower bound of 6.2 wins.

The most naive model is when you predict that every team will wind up 81-81. If you check that, you'll find that the standard error of your estimates is around 11 wins. If you use a prediction method like Tom Tango's "Marcel", you'll get substantially closer. You could also check any other predictions, like the Vegas over/under line. I don't actually know what those are, but I'm guessing they'd be around 8 or 9 wins.
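Here's a short sketch of how that test works. The win totals below are made up for illustration; the real test would use actual standings and actual predictions:

```python
# Compare predicted wins to actual wins; the root-mean-square error is the
# number to compare against the ~6.2-win "coin toss" floor.
from math import sqrt

actual    = [97, 68, 89, 75, 99, 84, 71, 92, 79, 66]   # made-up season win totals
predicted = [81] * len(actual)      # the naive "everybody goes 81-81" model

rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(f"standard error of the naive model: {rmse:.1f} wins")   # ~11 with these numbers
```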

My model is at 7.2 wins. I'm pretty sure it's better than Marcels and Vegas, but that's only because it uses more data. Oddsmakers are predicting the team's talent *before* it happens; I'm predicting it after. Obviously, I have a huge amount more information to work with. From looking at the rest of Norm Cash's career, I know that Norm Cash wasn't as good a player in 1962 as his 1961 suggested, and I can adjust accordingly. Marcel looks only backwards, so it doesn't know that.

If that seems like I'm cheating, well, not really. I'm not using the method to show how good a predictor I am. I'm using it to try to figure out, after the fact, how good a team actually was. I'm not trying to predict the future; I'm trying to explain the past.

---------

My method works like this. Suppose you have a team with a talent of X wins, but, instead, it got Y wins. The difference between Y and X is, by definition, luck. How might we measure that luck?

I think that these five measurements completely add up to the amount of luck, without overlapping:

-- how much the team's hitters got lucky and had a career year;
-- how much the team's pitchers got lucky and had a career year;
-- how much the team differed from its Runs Created estimate;
-- how much the team's opponents differed from their Runs Created estimate; and
-- how much the team's wins differed from its Pythagorean Projection.

The first two items deal with the raw batting and pitching lines. The second two items deal with converting those lines to runs. And the last item deals with converting those runs to wins. (You don't have to consider the opposition's "career year", because the opposition's career year in hitting is your career year in pitching, and vice-versa.)

Any source of luck you can think of winds up in one of those five categories. A pitcher has a lucky BABIP? That shows up as a career year. Team gets lucky and hits unusually well in the clutch? Partly career years, partly beating their Runs Created estimate. Team gets lucky and goes 15-6 in extra inning games? Shows up in their Pythagorean discrepancy. Your shortstop has a lucky defensive year? That shows up in a pitcher's career year (which is based on opposition batting outcomes, and therefore includes defense).

It's all there.
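As a toy illustration of that bookkeeping, with completely made-up numbers:

```python
# The claim is that the five components below add up, without overlapping,
# to the full gap between a team's actual wins and its talent-level wins.
talent_wins = 85.0
actual_wins = 94.0

luck = {
    "hitters' career years":               +4.0,
    "pitchers' career years":              +1.0,
    "own Runs Created discrepancy":        +1.5,
    "opponents' Runs Created discrepancy": +0.5,
    "Pythagorean discrepancy":             +2.0,
}

print(sum(luck.values()))          # 9.0 ...
print(actual_wins - talent_wins)   # ... which matches the 9-win gap
```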

--------

So, for every team since 1961, I figured out their luck in each of the five categories. As I said earlier, the "career year" luck was measured against players' talent estimates based on the four surrounding years. The Runs Created and Pythagorean estimates were straightforward.

After all that, the unexplained discrepancy, as I said above, was 7.2 games.

That seems very close to the law-of-the-universe binomial limit of 6.2 games. The difference, however, is substantial: it's 3.7 games. (It works that way because 7.2 squared minus 6.2 squared equals 3.7 squared).
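Here's that arithmetic spelled out -- independent sources of variation add in quadrature, so the part of the error that isn't the coin-toss floor has SD equal to the square root of the difference of the squares:

```python
from math import sqrt

total_sd = 7.2      # unexplained error of the model
binomial_sd = 6.2   # irreducible coin-toss floor
residual_sd = sqrt(total_sd ** 2 - binomial_sd ** 2)
print(f"residual SD beyond the coin-toss floor: {residual_sd:.1f} wins")  # ~3.7
```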

What does that 3.7 represent? It's not luck we haven't accounted for, because, I think, we've accounted for all the luck. We haven't accounted for it perfectly -- Pythagoras and Runs Created aren't exact. And, of course, the way I estimated a player's talent isn't perfect either.

So, here's what accounts for that extra 3.7 game standard deviation:

1. imperfections in Pythagoras and Runs Created
2. the fact that my method of estimating talent for "career years" is probably not that great
3. managerial influence in temporarily making players better or worse for a single season (Billy Martin's 1980 pitchers?)
4. injury patterns that make players look better or worse (but not injuries affecting playing time; that's reflected in the estimates already)
5. other sources of good or bad single years that aren't luck or injuries (steroids? Steve Blass disease?)
6. other things I'm forgetting (let me know in the comments and I'll add them here).

If I had to guess, I'd say that #2 is the biggest of all these things. My method just looks at four years. It may not be regressing to the mean properly. It doesn't distinguish between starters and relievers. It doesn't consider age (which is fine for most ages, but not for, say, 27, when it should give an extra boost over the average of 25, 26, 28, and 29). It takes previous or future career years at face value, so that, for instance, it predicts Brady Anderson's 1997 expectation based significantly on his 1996. (If you showed a human Brady's entire career, he probably wouldn't weight 1996 quite so high.)

UPDATE: Tango describes it better than I do:

"As for the reason for that 3.7, a large portion of that is almost certainly the uncertainty of the true talent for each player. There’s only so much we can know about a player, given such a small sample as 3000 plate appearances, combined with such a narrow talent base that is MLB."
----------

In light of all that, my point about Dick Howser is this: since the entire unexplained residual SD is only 3.7 games, there can't be a whole lot of manager influence in temporarily increasing a player's talent. It's certainly possible that Dick Howser managed his team into an extra 12 games' worth of talent, but things like that can't happen very often.

If you square the unexplained SD of 3.7, you get an unexplained variance of about 14. Multiply that by the 26 teams that existed in 1980, and you get about 356 total units of unexplained variance.

If Dick Howsers are routine, and there's typically one every season creating a 12-game discrepancy, then that Dick Howser singlehandedly contributes a variance of 144. That's about 40 percent of the total unexplained variance for a typical league. That's a lot.

Furthermore, it's absolutely impossible for there to be an average of two and a half Dick Howsers in MLB per year, each boosting his team by 12 wins worth of talent. If that were the case, then that would account for the entire 356 units of variance, which means all the other sources of error would have to be zero. That's obviously impossible.
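For the record, here's the variance accounting in one place:

```python
residual_sd = 3.7
teams = 26

total_unexplained = residual_sd ** 2 * teams        # the league's "budget" of variance
one_howser        = 12 ** 2                         # one 12-win manager effect

print(round(total_unexplained))                     # ~356
print(round(one_howser / total_unexplained, 2))     # ~0.4: one Howser is 40% of the budget
print(2.5 * one_howser)                             # 360.0: 2.5 Howsers exceed the whole budget
```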

Even if there were only half a Dick Howser every year, that would still be 21 Howsers in the period I studied. In that case, instead of seeing the discrepancies normally distributed, we'd see a normal distribution with 21 outliers.

But we don't.

If "batting career year discrepancy" is normally distributed, we should expect about 24 teams out of 1042 to have discrepancies of 2 SD or more. The actual number of teams at 2 SD or more in the study: 25, almost exactly as expected.

We should also expect 24 teams to have discrepancies of 2 SD or more going the other way. Actual number: 22.
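That expected count comes straight from the normal curve; a quick check:

```python
# Expected number of teams beyond 2 SD in one tail, out of 1,042 team-seasons.
from math import erf, sqrt

teams = 1042
p_tail = 0.5 * (1 - erf(2 / sqrt(2)))   # P(Z > 2), about 0.0228
print(round(teams * p_tail, 1))         # ~23.7, i.e. about 24 teams per tail
```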

So there is no evidence at all that there's anything more than luck going on. That still doesn't mean that Dick Howser can't be a special case ... it could be that career years are just random, *except for 1980 Dick Howser.* But, obviously, the number alone doesn't give us any reason to believe he is. A certain number of managers are going to have as big an effect as the 1980 Yankees, regardless. (And, in fact, three other teams beat them; the 1993 Phillies led the study with a "career year hitting" effect of 13.1 games.)

So, if you think Dick Howser is something other than a random point on the tail of the normal distribution, you have to explain why. It's like when Daphne Weedington, from Anytown, Iowa, wins the $200 million lottery jackpot. You don't know *for sure* that Daphne doesn't have some kind of supernatural power. But, after all, *someone* had to win. Why not Daphne?


Labels: , , ,

Sunday, April 10, 2011

Buck Showalter's $2,000,000 tactic

From Tom Verducci's article on Buck Showalter, in the March 28, 2011 issue of Sports Illustrated:

"Showalter had schooled his players on this: runners at first and third, less than two outs and a ground ball that the second baseman fields near the baseline. Most runners on first are taught either to stop or head toward the infield grass, making it hard for the second baseman to tag them and still have time to throw to first for the double play. Showalter taught the Orioles to slide directly into the second baseman, essentially breaking up a double play in the baseline. "That's six to 10 outs a year if we do it right," Showalter said. Which is 0.2% of the more than 4,000 outs a team gets over a season."


Well, an extra six to ten outs is a lot. Plus, it's not just the outs: it's also the extra runner at first base.

Assuming the runner on third always stays put, and doing a little arithmetic with Tango's base/out matrix:

Suppose there's one out. If the team turns the double play, the inning ends and the run expectancy is zero. If they don't, it's first and third with two outs, which is worth .538 runs.

Suppose there's no outs. Runners on 1st and 3rd with one out is worth 1.243 runs. Runner on 3rd with two outs is worth .387 runs. Difference: .856 runs.

Now, most of the time there'll be one out (it's a lot easier to get two runners on with one out than with no outs). Again from Tango, it's about a 2:1 ratio of one out over no outs. That means the .538 happens twice as often as the .856, which means each broken-up double play averages .644 runs.

"Six to 10" instances of saving .644 runs is 4 to 6 runs. Call it 5.

A free-agent win is worth about $4.5 million. A win is about 10 runs. So, at free-agent rates, 5 runs is worth over two million dollars.

So Buck Showalter has saved his team $2,000,000 -- over half his salary -- in that one small on-field strategy change.
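Here's the whole chain of arithmetic in one place, using the run-expectancy numbers quoted above (I've assumed eight instances per season, the middle of the "six to 10" range):

```python
# Run-expectancy values as quoted in the post, from Tango's base/out matrix.
re_broken_1out = 0.538            # 1st and 3rd, two outs, vs. 0 if the DP is turned
re_saved_0out  = 1.243 - 0.387    # 1st/3rd one out vs. runner on 3rd, two outs

# One-out situations are taken to be about twice as common as no-out ones.
avg_runs_saved = (2 * re_broken_1out + 1 * re_saved_0out) / 3
print(round(avg_runs_saved, 3))              # ~0.644 runs per broken-up double play

runs_per_season = 8 * avg_runs_saved         # "six to 10" instances; call it 8
dollars = runs_per_season / 10 * 4_500_000   # ~10 runs per win, $4.5M per free-agent win
print(round(runs_per_season, 1), round(dollars))   # ~5 runs, ~$2.3 million
```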

------

I don't know anything about on-field strategy, so I have no way to evaluate all that. So these questions are for you SMEs reading this.

Will Showalter's strategy work? Is 6-10 outs a reasonable estimate of what it saves? Are there unstated drawbacks that negate those outs?

By sharing the strategy with Sports Illustrated, Showalter runs the risk that all other teams will adopt it, completely negating the Orioles' $2 million advantage. Why would he do that?

I guess I'm thinking that the story sounds a bit too pat. But, I don't really know. Your comments?



Labels: , ,

Saturday, April 09, 2011

Did NFL teams discriminate against black coaching candidates?

The "Rooney Rule," adopted by the NFL in December, 2002, required all teams searching for a head coach to interview at least one black candidate. Between 2002 and 2009, the number of black coaches roughly doubled. Was this the result of the rule, or not?

A paper in the latest "Journal of Sports Economics," by Janice Fanning Madden and Matthew Ruther, looks at some evidence on the question. It's called "Has the NFL's Rooney Rule Efforts "Leveled the Field" for African American Head Coach Candidates?" A version of the paper can be found here (.pdf).

The authors find that before the Rooney Rule, black head coaches guided their teams to significantly superior records: an average of 9.1 wins (instead of the overall and white coaches' mean of 8). For first-year coaches, the difference was even bigger: 9.6 wins versus 7.1 wins.

They note that these numbers are consistent with the hypothesis that black coaches had to be significantly better than average to get the job. That suggests discrimination on the part of hiring teams.

Again, that was before the Rooney rule. Afterwards, there was no appreciable difference between white and black coaches. Is the difference between the two time periods significant?

The authors start by doing a t-test on the pre-Rooney race difference of 1.1 games, and they find significance at 2.57 standard deviations from the mean. However, I'm not so sure about that. I think their t-test assumes all observations are independent. In real life, they're not. A team's record this year is positively correlated with its record last year. One black coach being hired by one (perennially) good team might have made all the difference.
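To give a rough sense of how much correlated observations could matter, here's a back-of-the-envelope sketch using the standard "design effect" adjustment for clustered data. The number of seasons per coaching stint and the within-stint correlation below are made-up illustrations, not estimates from the paper:

```python
# With m seasons per stint and within-stint correlation rho, the usual design
# effect inflates the standard error by sqrt(1 + (m - 1) * rho).
from math import sqrt

m, rho = 4, 0.3                       # hypothetical: 4 seasons per stint, r = .3
inflation = sqrt(1 + (m - 1) * rho)
naive_t = 2.57                        # the paper's reported t-statistic
print(round(inflation, 2))            # ~1.38
print(round(naive_t / inflation, 2))  # ~1.86: no longer clearly significant
```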

And, indeed, the authors do find that black coaches get hired by better teams. They don't give us the data, but they mention it:

"... the teams that hired African American coaches in the 1990-2002 period had better records prior to the hires ... "


So they run a regression that tries to control for team quality, and they still get a significant result. But that regression uses payroll as a proxy for quality. The relationship between payroll and wins is probably pretty decent, but not as good as other possibilities. I'm sure you could find lots of teams that were excellent despite average payrolls, and, again, all it might take is one black coach to be hired by such a team.

Then, they try a regression that uses the Sports Illustrated preseason prediction as a variable. Again, that's not perfect, but it should be pretty good. Actually, it should be better than pretty good. SI writers are subject matter experts, and will use a wide assortment of data to make their predictions. They're probably not perfect, of course, and they're not as good as Vegas odds might be, but I think this is a pretty good way of doing it.

And, now, the result is no longer significant. It's only 1.43 SD, and probably less when you correct for the fact that seasons aren't independent.

But, in fairness, and as the authors mention, that may understate the significance if the SI staff adjust their predictions for the realization that the coach is of higher quality. I'm guessing that's not much of a factor, though.

------

The authors then look at firings. Controlling for several variables, including wins, whether the team made the playoffs, how many years the coach was with the team (and the square of that figure), they find that, before the Rooney rule, black coaches were more likely to be let go. After the Rooney rule, the difference disappeared.

But ... aren't coaches fired for performance relative to expectations rather than for raw performance? Since the black coaches started with better teams in the first place, you'd expect them to get fired faster for a given record, because it's easier to disappoint from a higher level than from a lower level. If you start 10-6, and then fall to 7-9, your job will be in jeopardy. But if you go from 8-8 to 7-9, you're more likely to be safe.

Since that result is only barely significant (2.15 SD), I'm guessing that if you used more realistic "disappointment" variables, the significance would disappear.

------

Finally, the authors look at offensive and defensive coordinators. They find no significant difference in the performances of black and white coordinators, either before or after the Rooney Rule. However, they do find that in the entire period of the study -- 1984 to 2009 -- not even one black offensive coordinator was promoted to head coach. The authors say that's statistically significant at p=.01.

But, again, I think the authors are assuming independence, which causes the significance level to be overstated. Moreover, the authors' own Table 8 shows that black offensive coordinators worked for worse teams than white offensive coordinators. After the Rooney Rule, for instance, black offensive coordinators worked for teams in the 34th percentile of performance, while black defensive coordinators worked for teams in the 54th percentile. Perhaps that explains part of the difference.

Also, there are many comparisons in the authors' charts, so it becomes more likely that at least one of them will show significance. My unscientific feeling is that this one datapoint is a random anomaly, and, in any case, not all that significant anyway.

------

My overall impression when reading this paper was ... geez, there were only 29 black coaches in the pre-Rooney Rule era. Why not actually look at them and see if their performance was unexpectedly good? That would require the assistance of subject matter experts (SMEs) -- people who knew the NFL -- which, admittedly, is not usual for an academic paper of this sort. And, of course, any SME judgments would necessarily be subjective.

But, still, if you want to get the best answer to the question, instead of the most journal-publishable answer to the question, that's the way to do it. Maybe coach X was hired just when player Y blossomed into a superstar, and so it would be incorrect to attribute the team's playoff success to the coach. Maybe black coaches are unproven, and teams are willing to hire an unproven coach only when they have a hugely disappointing season -- which suggests bad luck, which suggests maybe they bounce back to their previous level of excellence.

If there were thousands of datapoints, you couldn't check all those things. But, 29? That doesn't seem too difficult an obstacle. And, it's telling that the regression that comes closest to doing that -- the one that takes into account the SMEs at Sports Illustrated -- was the one that didn't find statistical significance.



Labels: , ,

Thursday, April 07, 2011

"Pinburgh": pinball sabermetrics

On the weekend of March 18, I competed in the huge (and phenomenally well-run) "Pinburgh" match-play pinball tournament in Pittsburgh.

I finished roughly in the middle of the field of 173 competitors, and, I wondered, if I'm really average, what would my chances be of winning the whole thing next year just by luck?


So I wasted a day or so and wrote a simulation.


It turns out that I'm probably a 2000:1 longshot, unless I get better, or unless I'm *already* better and don't realize it. Still, on average, I should win back half my entry fee.


This is probably of no interest to more than a handful of people in the entire world, but I wrote up a whole bunch of results anyway. They're here.


Labels:

Tuesday, March 29, 2011

"Sabermetrics" and "Analytics"

What is sabermetrics?

We sabermetricians think we know what it means ... one definition is that it's the scientific search for understanding about baseball through its statistics. But, like a lot of things, it's something that's more understood in practice than by strict definition. I think a few years ago Bill James quoted Potter Stewart: "I know it when I see it."

But how we see it seems to be different from how the rest of the world sees it. The recent book "Scorecasting" is full of sabermetrics, isn't it? There are studies on how umpires call more strikes in different situations, on how hitters bat when nearing .300 at the end of the season, on how hitters aren't really streaky even though conventional wisdom says they are, and on how lucky the Cubs have been throughout their history.

So why isn't "Scorecasting" considered a book on sabermetrics? It should be, shouldn't it? None of the reviews I've seen have called it that. The authors don't describe themselves as sabermetricians either. In fact, on page 120, they say,

"Baseball researchers and Sabermetricians have been busily gathering and applying the [Pitch f/x] data to answer all sorts of intriguing questions."


That suggests that they think sabermetricians are somehow different from "baseball researchers".

Consider, also, the "MIT Sloan Sports Analytics Conference," which is about applying sabermetrics to sports management. But, no mention of "sabermetrics" there either -- just "analytics".

What's "analytics"? It's a business term, about using data to inform management decisions. The implication seems to be that the sabermetrician nerds work to provide the data, and then the executives analyze that data to decide whom to draft.

But, really, that's not what's going on at all. The executives make the decisions, sure, but it's the sabermetricians who do the actual analysis. Sabermetrics isn't the field of creating the data, it's the field of scientifically *analyzing* the data in order to produce valid scientific knowledge, both general and specific.

For instance, here's a question a GM needs to consider. How much is free agent batter X worth?

Well, towards that question, sabermetricians have:

-- come up with methods to turn raw batter statistics into runs
-- come up with methods to turn runs into wins
-- come up with methods to estimate future production from past production
-- come up with methods to quantify player defense, based on observation and statistical data
-- come up with methods to compare players at different positions
-- come up with methods to estimate the financial value teams place on wins.

But isn't that also what "analytics" is supposed to do? I don't understand how the two are different. I suppose you could say, the sabermetricians figure out that the best estimate for batter X's value next year will be, say, $10 million a season. And then the analytics guy says, "well, after applying my MBA skills to that, and analyzing the $10 million estimate the sabermetricians have provided, I conclude that the data suggest we offer the guy no more than $10 million."

I don't think that's what the MIT Sloan School of Management has in mind.

Really, it looks like everyone who does sabermetrics knows that they're doing sabermetrics, but they just don't want to call it sabermetrics.

Why not? It's a question of signalling and status. Sabermetrics is a funny, made-up, geeky word, with the flavor of nerds working out of their mother's basements. Serious people, like those who run sports teams, or publish papers in learned journals, are far too accomplished to want to be associated with sabermetrics.

And so, an economist might publish a paper with ten pages of analysis of sports statistics, and three paragraphs evaluating the findings in the light of economic theory. Still, even though that paper is sabermetrics, it's not called sabermetrics. It's called economics.

A psychologist might analyze relay teams' swim times, discover that the first swimmer is slower than the rest, and conclude it's because of group dynamics. Even though the analysis is pure sabermetrics, the paper isn't called sabermetrics. It's called psychology.

A new MBA might get hired by a major-league team to find ways to better evaluate draft prospects. Even though that's pure sabermetrics, it's not called sabermetrics. It's called "analytics," or "quantitative analysis."

I think that word, sabermetrics, is costing us a lot of credibility. My perception is that "sabermetrics" has (incorrectly) come to be considered the lower-level, undisciplined, number crunching stuff, while "analytics" and "sports economics" have (incorrectly) come to symbolize the serious, learned, credible side. If you looked at real life, you might come to the conclusion that the opposite is true.

My perception is that there isn't a lot of enthusiasm for the word "sabermetrics." Most of the most hardcore sabermetric websites -- Baseball Analysts, The Hardball Times, Inside The Book, Baseball Prospectus -- don't use the word a whole lot. Even Bill James, who coined the word, has said he doesn't like it. From 1982 to 1989, Bill James produced and edited a sabermetrics journal. He didn't call it "The Sabermetrician." He called it "The Baseball Analyst." It was a great name. About ten years ago, I suggested resurrecting that name for SABR's publication, to replace "By the Numbers" (.pdf, see page 1). I was voted down (.pdf, page 3). I probably should have tried harder.

In light of all that, I wonder if we should consider slowly moving to accept MIT's word and start calling our field "analytics."

It's a good word. We've got a historical precedent for using it. It will help correct misunderstandings of what it is we do. And it'll put us on equal footing with the MIT presenters and the JQAS academics and the authors of books of statistical analysis -- all of whom already do pretty much exactly what we do, just under a different name.


Labels: ,

Saturday, March 26, 2011

The swimming psychology paper

In the previous post, I wrote about a paper (gated) that showed an anomaly in team swimming relays. It turned out that the first swimmer's times were roughly equivalent to his times in individual events -- but the second through fourth swimmers had relay times that were significantly faster than their individual times.

The paper concluded that this happens because people put more effort into group tasks than individual tasks. They do this because other people are depending on their contributions. However, the leadoff swimmer's time is seen to be less important to the team's finish than the other three swimmers' times, and that explains why swimmers 2-4 are more motivated to do better in the team context.

My point was not really to criticize that individual paper, but to make a broader point -- that just because the results are *consistent* with your hypothesis, doesn't necessarily mean that's what's causing them. In this case, I agreed with an anonymous e-mailer, who speculated that it might have to do with reaction times. The first swimmer starts by a gun, while the other swimmers start by watching the preceding swimmer touch the wall. I said that I didn't know whether the authors of the paper considered this or other possibilities.

Commenter David Barry kindly sent me a copy of the study, and it turns out the authors *did* consider that:


"We corrected both performance times for the swimmer's respective reaction time by subtracting the time the athlete spent on the starting block after the starting signal (also retrieved from [swimrankings.net]). ... Please note, however, that previous research did not find any differences between the individual and relay competition after a swimming distance of 10m. Thus, faster swimming times for relay swimmers are unlikely to be merely due to differences in the starting procedure."


Fair enough. But ... well, the effect is so strong that I'm still skeptical. Could it really be that swimmers, who have trained their entire lives for this one Olympic individual moment, are still sufficiently unmotivated that they can give so much more to their relay efforts?

Here are the results for the four relay positions. (Times are an average of 100m and 200m):

#1: individual 78.19, team 78.38. Diff: -0.19
#2: individual 87.30, team 86.92. Diff: +0.38
#3: individual 87.73, team 87.39. Diff: +0.34
#4: individual 87.40, team 86.66. Diff: +0.74

It seems to me that the 2-4 differences are *huge*. Are the #4 individual swimmers so blase about the Olympics that they swim almost three-quarters of a second slower than they could if they were just more motivated? My gut says: no way.

One thing I wonder, following Damon Rutherford's comment in the previous post: could it be that correcting for the swimmer's reaction time to the starting gun isn't enough? Mr. Rutherford implies that the first swimmer's reaction time is for him to *start moving*. But he implies that subsequent swimmers are already well into their diving motions when the previous swimmers touch the wall. That could explain the large discrepancies, if the reaction time correction only compensates for part of the difference.

Is there anyone who knows swimming and is able to comment?

Oh, and there's one more issue with the differences, and that's a selective sampling issue. The authors write,

"We focused our analysis on the data from the semi-finals to obtain a reasonable sample size. If a swimmer did not advance to the semi-finals in the individual competition, we included his/her performance time from the first heats."


That means the individual times are going to be skewed slow: if the swimmer did poorly in the heats, his unsuccessful time is included in the sample. But if the swimmer did *well* in the heats, his successful result is thrown away in favor of his semi-final time.

That would certainly account for some of the differences observed.


Labels: ,

Sunday, March 20, 2011

Psychology should be your last resort

Bill James, from a 2006 article:

"... in order to show that something is a psychological effect, you need to show that it is a psychological effect -- not merely that it isn't something else. Which people still don't get. They look at things as logically as they can, and, not seeing any other difference between A and B conclude that the difference between them is psychology."


Bill wrote something similar in one of the old Abstracts, too. At the time, I thought it referred to things like clutch hitting, and clubhouse chemistry, where people would just say "psychology" as (in James' words) a substitute for "witchcraft." It was kind of a shortcut for "I don't know what's going on."

Today, it's a little more sophisticated. They don't say "psychology" just like that, as if that one word answers the question. Now, they do a little justification. Here's a recent sports study, described in the New York Times by David Brooks:

"Joachim Huffmeier and Guido Hertel ... studied relay swim teams in the 2008 Summer Olympics. They found that swimmers on the first legs of a relay did about as well as they did when swimming in individual events. Swimmers on the later legs outperformed their individual event times."


Interesting! Why do you think this happens? The authors, of course, say it's psychology. But they have an explanation:

"in the heat of a competition, it seems, later swimmers feel indispensible to their team’s success and are more motivated than when swimming just for themselves."


OK ... but what's the evidence?

"A large body of research suggests it’s best to motivate groups, not individuals. [Other researchers] compared compensation schemes in different manufacturing settings and found that group incentive pay and hourly pay motivate workers more effectively than individual incentive pay."


Well, that paragraph actually makes sense, and I have no objection to the finding that group pressure is a good motivator. Still, that doesn't constitute evidence that that's what's going on in the swimming case. Yes, it shows that the results are *consistent* with the hypothesis, but that's all it shows.

You can easily come up with a similar argument in which the same logic would be obviously ridiculous. Try this:

I've done some research, and I've found that a lot of runs were scored in Colorado's Coors Field in the 1990s -- more than in any other National League ballpark. Why? It's because the Rockies led the league in attendance.

How do I know that? Because there's a large body of research that shows that people are less likely to slack off when lots of other people are watching them. Since Coors Field had so many observers, batters on both teams were motivated to concentrate harder, and so more runs were scored.


See?

The point is that, as Bill James points out, it's very, very hard to prove psychology is the cause, when there are so many other possible causes that you haven't looked at. When the Brooks article came out, someone e-mailed me saying, couldn't it be that later swimmers do better "because they can see their teammate approach the wall, and time their dive, as opposed to reacting to starter's gun"? Well, yes, that would explain it perfectly, and it's very plausible. Indeed, it's a lot better than the psychology theory. Because, why would later swimmers feel more indispensable to their team's success than the first swimmer? Does the second guy really get that much more credit than the first guy?


In fairness, I haven't read the original paper, so I don't know if the authors took any of these arguments into account. They might have. But even so, couldn't there be other factors? Just off the top of my head: maybe in later legs, the swimmers are more likely to be spread farther apart from each other, which creates a difference in the current, which makes everyone faster. Or maybe in later legs, each swimmer has a worse idea of his individual time, because he can't gauge himself by comparing himself to the others. Maybe he's more likely to push his limits when he doesn't know where he stands.

I have no idea if those are plausible, or even if the authors of the paper considered them. But the point is: you can always come up with more. Sports are complicated endeavors, full of picky rules and confounding factors. If you're going to attribute a certain finding to psychology, you need to work very, very hard to understand the sport you're analyzing, and spend a lot of time searching for factors that might otherwise explain your finding.

Your paper should go something along the lines of, "here are all the things I thought of that might make the second through fourth swimmers faster. Here's my research into why I don't think they could be right. Can you think of any more? If not, then maybe, just maybe, it's psychology."


If you don't do that, you're not really practicing science. You're just practicing wishful thinking.


Labels: ,

Sunday, March 13, 2011

An adjusted NHL plus-minus stat

There were a whole bunch of new research papers presented at last week's MIT Sloan Sports Analytics Conference. I actually didn't see any of the research presentations -- I concentrated more on the celebrity panels, as did most of the attendees -- but that doesn't matter much, because every attendee got an electronic copy of all the papers presented. Also, there were poster summaries of most of the presentations, with the authors there to answer questions.

Anyway, I'm slowly going through those papers, and my plan is to summarize a few of them here.

I'll start with a hockey paper. This one (.PDF) is called "An Improved Adjusted Plus-Minus Statistic for NHL Players." It's by Brian Macdonald, a civilian math professor at West Point.

In hockey, the "plus-minus" statistic is the difference between the number of goals (excluding power-play goals) a team scores when a player is on the ice, and the number the opposition scores when the player is on the ice. The idea is great. The problem, though, is that a player's plus-minus depends heavily on his teammates and the quality of the opposition. Even the best player on a bad team would struggle to score a plus, if his linemates are giving up the puck all the time and missing the net.

So what this paper does is try to adjust for that. The author took the past three seasons' worth of hockey data, and ran a huge regression, which tries to predict goals scored based on which players are on the ice. In that regression, every row represents a "shift" -- a period of time in which the same players (for both teams) are on the ice. The regression helps estimate the value of a player by simultaneously teasing out the values of his linemates and opponents, and adjusting for those.

Another improvement Macdonald's stat makes over traditional plus-minus is that he was able to include power-play and shorthanded situations as well. He did that by running separate regressions for those situations and combining them. In addition to the identities of the players on the ice, he also included one more variable -- which zone the originating faceoff was in (if, indeed, the shift started with a faceoff).
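For readers who want to see the general shape of this kind of regression, here's a minimal sketch of a generic adjusted plus-minus setup -- not Macdonald's exact specification, and with random made-up data standing in for real shifts:

```python
# One row per shift, an indicator column per player (+1 if on the ice for the
# team of interest, -1 if on the ice against, 0 if not on the ice), and the
# shift's goal differential rate as the response.
import numpy as np

n_players, n_shifts = 6, 8            # tiny made-up example
rng = np.random.default_rng(0)

X = rng.choice([-1, 0, 1], size=(n_shifts, n_players))   # who was on the ice
y = rng.normal(0, 1, size=n_shifts)                      # goal diff per 60 minutes

# Least-squares fit; each coefficient is a player's estimated impact after
# accounting for who he played with and against.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coefs, 2))
```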

Here are his results. The numbers are "per season", by which Macdonald means the number of minutes the guy actually played on average over the three years. The number in brackets at the end is the standard error of the estimate.

+52.2 Pavel Datsyuk (20.9)
+45.8 Ryan Getzlaf (19.6)
+45.3 Jeff Carter (15.7)
+43.0 Mike Richards (17.2)
+42.6 Joe Thornton (17.6)
+42.1 Marc Savard (15.5)
+40.2 Alex Burrows (13.3)
+40.0 Jonathan Toews (15.5)
+39.8 Nicklas Lidstrom (25.9)
+38.3 Nicklas Backstrom (18.4)

As you can see, the standard errors are pretty big. I'd say you have pretty good assurance that these players are good, but not very much hope that the method gets the order right. You look at the list and see Pavel Datsyuk looks like the best player, but with such wide error bars, it's much more likely that one of the other players is actually better.

The standard errors are large because there's not a whole lot of data available, compared to all the players you're trying to estimate. But why do the standard errors vary so much from player to player? Because they depend on how many different sets of teammates and opponents a player was combined with. The highest overall standard error was Henrik Sedin (+33.8, SE 27.0), because he and his twin brother Daniel "spend almost all of their time on the ice playing together, and the model has difficulty separating the contributions of the two players."

(If I'm not mistaken, this problem is why a similar Adjusted Plus-Minus technique doesn't work well in the NBA. With so few players on a basketball team, and most of the superstars spending a lot of time playing together, there aren't enough "control" shifts to allow the contributions of the various players to be separated.)

However, the imprecision doesn't mean the statistic isn't useful. It's still a lot better than traditional plus-minus. That may not be obvious, because traditional plus-minus doesn't come with estimates of the standard error, like this study does. But if it did, those SEs would be significantly higher -- and the estimates would be biased, too. As far as I know, Macdonald's statistic is the best plus-minus available for hockey, and the fact that it explicitly acknowledges and estimates its shortcomings is a positive, not a negative.

---

Oh, a couple more things.

Macdonald actually ran separate regressions for offense and defense (the numbers above are the sum of the two). It turns out that the way Datsyuk wound up leading the league was, in large part, by virtue of his defense. His +52.2 comprises +37.8 offense and +14.5 defense. But Datsyuk's is not the league's highest defensive score: the superstar of defense turns out to be the Canucks' Alex Burrows, at least among the players Macdonald lists in the paper. Most of Burrows's value came on D: +21.3, versus +18.9 on offense.

And: where's Sidney Crosby, who's supposedly the best player in the NHL? He's down the list at +33.6: +36.4 on offense, and -2.8 on defense. But the standard error is 16.5, so if you tack on two SEs to his score, he pretty much doubles to 66.6. So you can't really say where Crosby really ranks -- it's still very possible that he's the best.

Also, the numbers make it look like Crosby is below-average on defense, which I suppose he might be ... but the relevant statistic is the sum of the two components, not how they're broken up. The idea is to score more than your opponents, whether it's 2-1 or 5-4.

Alexander Ovechkin is similar to Crosby: +38.7 offense, -1.5 defense, total +37.2. Nicklas Backstrom has the best power play results; Alex Burrows is, by far, the highest-ranking penalty killer. Download the paper for lots more.


Labels: , ,

Sunday, March 06, 2011

Is "superstar bias" caused by Bayesian referees?

Would you rather have referees be more accurate, or less biased in favor of superstars?

In the NBA, a foul is called when the player with the ball makes significant contact with a defender while he's moving to take a shot. But which player is charged with the foul? Is it an illegal charge on the offensive player (running into a defender who's set and immobile), or is it an illegal block by the defensive player (who illegally gets in the way of a player in the act of shooting)?

It's a hard one to call, because it depends on the sequence of events. As this Sports Illustrated article says,

"... the often-fractional difference between a charge and a block call is decided by a referee who has to determine, in a split second: a) were the defender's feet set, b) was he outside the court's semicircle, c) who initiated contact, and d) does the contact merit a call at all?"


It seems reasonable to assume that, in a lot of cases, the referee doesn't know for sure, and has to make an uncertain call. Maybe he's 80% sure it's a charge, or 70% sure it's a block, and makes the call according to that best guess. (Not that the ref necessarily thinks in terms of those numbers, but he might have an idea in his mind of what the chances are.)

Now, suppose there's a case where, to the referee's eyes, he sees a 60% chance it was a charge, and only a 40% chance it was a block. He's about to call the charge. But, now, he notices who the players are. Defensive player B ("bad guy") is known as a reckless defender, and gets called for blocks all the time. Offensive player G ("good guy") is known to be a very careful player with his head in the game, who doesn't charge very often at all.

Knowing the characteristics of the two players, the referee now guesses there's an 80% chance it's really a block. Instead of 60/40, the chance is now 20/80.
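If you want to see that flip as an explicit Bayes calculation, here's a sketch. Treating the 60/40 visual read as the likelihood and the players' tendencies as the prior is my own framing, and the 15/85 base rates below are made-up numbers chosen to land near the 20/80 split:

```python
p_visual_given_charge = 0.60          # how strongly the ref's eyes say "charge"
p_visual_given_block  = 0.40

prior_charge = 0.15                   # G rarely charges; B commits blocks all the time
prior_block  = 0.85

posterior_charge = (p_visual_given_charge * prior_charge) / (
    p_visual_given_charge * prior_charge + p_visual_given_block * prior_block
)
print(round(posterior_charge, 2))     # ~0.21: the 60/40 play becomes roughly 20/80
```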

What should the ref do? Should he call the charge, as he originally would have if he hadn't known who the players were? Or should he take into account that G doesn't foul often, while B is a repeat offender, and call the block instead?

------

If the ref calls the foul on player B, he'll be right a lot more often than if he calls it on G. When the NBA office reviews referees on how accurate their calls are, he'll wind up looking pretty good. But, B gets the short end of the stick. He'll be called for a lot more fouls than he actually commits, because, any time there's enough doubt, he gets the blame.

On the other hand, if the ref calls the foul on G, he'll be wrong more often. But, at least there's no "profiling." G doesn't get credit for his clean reputation, and there's no prejudice against B because of his criminal past.

Still, one player gets the short end of the stick, either way. The first way, B gets called for too many fouls. The second way, G gets called for too many fouls. Either way, one group of players gets the shaft. Do we want it to be the good guys, or the bad guys?

Maybe you think it's better that the bad guys, the reckless players, get the unfair calls. If you do, you shouldn't be complaining about "superstar bias," the idea that the best players get favorable calls from referees. Because, I'd guess, superstars are more likely to be Gs than Bs. Tell me if I'm wrong, but here's my logic.

First, they're better players, so they can be effective without fouling, and probably are better at avoiding fouls. Second, because they're in the play so much more than their teammates, they have more opportunities to foul. If they were Bs, they'd foul out of games all the time; this gives them a strong incentive to be Gs. And, third, a superstar fouling out costs his team a lot more than a marginal player fouling out. So superstars have even more incentive to play clean.

So if superstar bias exists, it might not be subconscious, irrational bias on the part of referees. The refs might be completely rational. They might be deciding that, in the face of imperfect information on what happened, they're going to make the call that's most likely to be correct, given the identities and predilections of the players involved. And that happens to benefit the stars.

------

When I started writing this, I thought of it as a tradeoff: the ref can be as accurate as possible, or he can be unbiased -- but not both. But, now, as I write this, I see the referee *can't* be unbiased. If there's any doubt in his mind on any play, his choices are: act in a way in which there will be a bias against the Bs; or act in a way in which there will be a bias against the Gs.

Is there something wrong with my logic? If not, then I have two questions:

1. Which is more fair? Should the ref be as Bayesian as possible, and profile players to increase overall accuracy at the expense of the Bs? Or should the referee ignore the "profiling" information, and reduce his overall accuracy, at the expense of the Gs?

2. For you guys who actually follow basketball -- what do you think refs actually do in this situation?




Labels: , ,

Saturday, February 26, 2011

Why is there no home-court advantage in foul shooting?

There's a home-site advantage in every sport.

Why is that? Nobody knows. One hypothesis is that it's officials favoring the home team. One piece of data that appears to support that hypothesis is that when you look at situations that don't involve referee decisions, the home field advantage (HFA) tends to disappear. In "Scorecasting," for instance, the authors report that, in the NBA, the overall home and road free-throw percentages are an identical .759. Also, in the NHL, shootout results seem to be the same for home and road teams, and likewise for penalty kick results in soccer.

However, there's a good reason for the results to look close to identical even if HFA is caused by something completely unrelated to refereeing.

The reason is that free-throw shooting involves only one player. At the simplest level, you could argue that foul shooting is offense. All other points scored in basketball are a combination of offense and defense. Not only is the offense playing at home, but the defense is playing on the road, which, in a sense, "doubles" the advantage. Therefore, if the home free-throw shooting advantage is X, the home field-goal shooting advantage should be at least 2X.

That's an oversimplification. A better way to think about it is that a foul shot attempt is the work of one player. A field goal attempt, on the other hand, is the end result of the efforts of *ten* players. Not every player is directly involved in the eventual shot attempt, but every player has the potential to be. A missed defensive assignment could lead to an easy two points, and the offense will take advantage regardless of which of the five defensive players is at fault. The same for offense: if a player beats his man and gets open, he's much more likely to be the one who gets the shot away. The weakest or strongest link could be any one of the ten players on the court.

So it might be better to guess that the HFA for a possession is 10X, rather than just X. We can't say that for sure -- it could be that the things a player has to do on a normal possession are so much more complex than a free throw, that the correct number is 20X. Or it could be that a normal possession is less complex than a free throw, so perhaps 5X is better. I don't know the answer to this, but 10X seems like a reasonable first approximation.

------

What would the actual numbers look like?

The home court advantage in basketball is about three points. That means that instead of (say) 100-100, the average game winds up 101.5 to 98.5.

Three points per game, divided by 10 players, is 0.3 points per game per player. Over (say) 200 possessions, that's 0.0015 points per possession per player.

If home-court advantage were made up only of serious mistakes, mistakes that turn a normal 50 percent possession into a 100 percent or zero percent possession, then that works out to exactly one point per mistake. In that case, the average player would make one such extra mistake every 667 possessions. That's a little less than one every three games. If you assume that a mistake is worth only half a point, then it's one mistake per player for every 333 team possessions.

In reality, of course, it's probably not nearly as granular as "mistakes" or "good plays". It's probably something like this: a player plays his role with an overall average of 50 effectiveness units, random between possessions, plus or minus some variation. But that's an average of home, where he plays with an average of 51 effectiveness units, and road, with an average 49 effectiveness units.

Still, that doesn't matter to the argument: the important thing is HFA is one point per player for every 667 total team possessions, regardless of how it manifests itself.
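
Here's that arithmetic in one place, as a quick Python sketch. The 3 points, 10 players, and 200 combined possessions are the same rough figures used above:

HCA_POINTS_PER_GAME = 3.0      # home team outscores expectation by about 3 points
PLAYERS_ON_COURT = 10
POSSESSIONS_PER_GAME = 200     # both teams combined, roughly

per_player_per_game = HCA_POINTS_PER_GAME / PLAYERS_ON_COURT              # 0.3 points
per_player_per_possession = per_player_per_game / POSSESSIONS_PER_GAME    # 0.0015 points

# If the whole edge came in one-point "mistakes", each player would need one
# extra (or one fewer) such mistake every 1/0.0015 = 667 possessions or so.
print(per_player_per_possession, round(1 / per_player_per_possession))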

------

Now, let's go back to free throws. I'm going to assume that a player's HFA on a single possession should be about the same as a player's HFA on a single free throw. Is that OK? It's a big assumption. I don't have any formal justification for it, but it doesn't seem unreasonable. I'd have to admit, though, that there are a lot of alternative assumptions that also wouldn't seem unreasonable.

But the point of this post is that it is NOT reasonable to assume that a player's HFA on a free throw should be the same as the overall HFA for an entire game. That wouldn't make any sense at all. That would be like seeing that the average team wins 50 percent of games, and therefore expecting that the average team should win 50 percent of championships. It would be like seeing that the Cavaliers are winning 17 percent of their games, and expecting that they score 17 percent of the total points.

In any case, the overall argument stays the same even if you argue that the HFA on a single possession should be twice that of a single free throw, or half, or three times. But I'll proceed anyway with the assumption that the ratio is one-to-one.

If the HFA on a free-throw is the same 0.0015 points per player as on a possession, then you'd expect the difference between home and road free throw percentages to be 0.15%. Instead of the observed .759 home and road, it should be something like .75975 home, and .75825 road.
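
Continuing the Python sketch, and splitting the 0.0015 evenly between home and road:

overall_ft_pct = 0.759
hfa_per_attempt = 0.0015     # one player, one free throw, from the possession arithmetic above

home_ft = overall_ft_pct + hfa_per_attempt / 2    # about .75975
road_ft = overall_ft_pct - hfa_per_attempt / 2    # about .75825
print(round(home_ft, 5), round(road_ft, 5))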

Why don't we see this? Well, here's one possible explanation. Visiting teams are behind more often, so will commit more deliberate fouls late in the game. They will try to foul home players who are worse foul shooters. Therefore, the pool of home foul shooters is worse than the pool of road foul shooters, which is why it looks like there's no home field advantage in foul shooting.

Since we're talking about such a very small HFA in the first place, this doesn't seem like an unreasonable explanation. It would be interesting to run the numbers again, controlling for who the shooter is. I suspect that, with enough data, you'd spot a very small home-court advantage in foul shooting.




Monday, February 21, 2011

"Scorecasting" reviews

Coincidentally, Chris Jaffe and I both have reviews of "Scorecasting" out today. Here's Chris, at The Hardball Times. Here's me, at Baseball Prospectus.





Thursday, February 17, 2011

Two issues of "By the Numbers" available

Two new issues of SABR's "By the Numbers" are now available at my website. One came out today, the other two weeks ago.

The issues are pretty thin, due to low submission volume. I hope to get more aggressive in asking online authors to allow us to reprint.


Saturday, February 12, 2011

Bleg: Know any good referee studies?

I've been invited to this year's MIT Sloan Sports Analytics Conference, to participate in the "Referee Analytics" panel. I guess if I'm going to be talking about refereeing, I should try to get up to date on some of the research that's been going on.

So, a bleg: could you guys refer me, either in the comments or online, to what you think is important research on refereeing/umpiring in any sport? Much appreciated.

Oh, and speaking of umpiring ... a couple of years back, I had a nine-post analysis of the study about umpires and racial discrimination. Recently, I distilled those posts into an article that ran in the Fall, 2010 issue of SABR's "Baseball Research Journal."

Here's a .PDF of that article. I recommend it over my original posts ... back then, I was trying to figure it out as I was going along. This article is a distillation of the analysis, and what I actually concluded. (If you do want the original posts, they're linked at my website.)

And, by the way, SABR has an archive of past BRJ articles. It's a pretty good resource. There's at least one Bill James piece, for instance.





Wednesday, February 09, 2011

Packers win, casinos barely profit

It turns out that Las Vegas casinos didn't make a whole lot of money on this year's Super Bowl. According to the State of Nevada Gaming Board (.pdf), the sports books' profit was a mere 0.83 percent of the $87,491,098 wagered on the game. That's a total profit of only $724,176.

For the last ten years, the average profit was about 8 percent, or about 10 times higher.

If you make the assumption that all bettors need to wager $6 to win $5, then, on average, the casino keeps $1 of every $12 bet, which is 8.3%. That's not too far off from the actual amounts for the last decade, so let's assume that's the case, and see where it leads us.

Making that assumption and doing a bit of algebra, I get that the relationship between the amount of profit and the percentage bet on the winning team is

Percent Winners = 6 * (1 - profit percentage) / 11

Plugging 0.83% into the equation gives that 54.09% of bettors won their bets on Sunday.
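
If you want to check the algebra, here's a small Python version. The 6 and 5 come from the bet-$6-to-win-$5 assumption, and the same function handles the 10:11 case in the update below:

def percent_winners(profit_pct, risk=6.0, win=5.0):
    # Every bet risks `risk` to win `win`, and all bets are the same size.
    # A winning bet is returned risk + win, so:
    #   profit_pct = (risk - w * (risk + win)) / risk
    # Solving for w, the fraction of bets that won:
    return risk * (1.0 - profit_pct) / (risk + win)

print(percent_winners(0.0083))               # 5:6 odds   -> about 0.541
print(percent_winners(0.0083, 11.0, 10.0))   # 10:11 odds -> about 0.519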

Here are the numbers for all 10 years:

2011: 54.09% winners, 0.83% profit
2010: 50.01% winners, 8.30% profit
2009: 50.07% winners, 8.20% profit
2008: 56.07% winners, 2.90% loss
2007: 46.96% winners, 13.9% profit
2006: 49.47% winners, 9.30% profit
2005: 45.27% winners, 17.0% profit
2004: 46.20% winners, 15.3% profit
2003: 50.56% winners, 7.30% profit
2002: 52.74% winners, 3.30% profit

The highest proportion of winners was 56.07 percent, in 2008 (when the Giants beat the undefeated Patriots). That's pretty high. Put into baseball team terms (which is how I think all percentages should be expressed), that's almost a 91-71 record over a 162-game season.

However, we need to take those percentages with a grain of salt, for a couple of reasons.

First, we assumed all bettors are betting $6 to win $5. That's not necessarily true. Big bettors probably get better odds than that. And some proposition bets probably pay worse odds than $5 to $6. Without knowing the expected percentage the house takes, the percentages of winners in the table are only rough estimates.

If you instead assume that bettors get a better deal than 5:6, the percentages of winners move closer to 50% in every case.


UPDATE: here are the numbers assuming 10:11 odds:

2011: 51.94% winners, 0.83% profit
2010: 48.03% winners, 8.30% profit
2009: 48.09% winners, 8.20% profit
2008: 53.84% winners, 2.90% loss
2007: 45.10% winners, 13.9% profit
2006: 47.51% winners, 9.30% profit
2005: 43.48% winners, 17.0% profit
2004: 44.37% winners, 15.3% profit
2003: 48.56% winners, 7.30% profit
2002: 50.66% winners, 3.30% profit


Second: the usual assumption is that the casinos want to eliminate risk by having the same amount of money on both sides of the bet. That way, the bookies are certain to win a fixed amount: no matter what happens, they pay the winners with $5 of the losers' money, and keep the remaining $1.

So, one naive conclusion is that the sports books weren't that great at predicting how bettors would behave. They obviously set the spread too low if 54 percent of bettors successfully picked the Packers -- and they nearly lost money because of it.

That might be true: the original line favored the Packers by 2.5 points. In response to the Packers attracting too much action, the bookmakers could have moved the spread to -3. But, because so many games are won by a field goal, 3 points might have been too big a gap from 2.5, and the pendulum might have swung too far towards the Steelers.

However, it's also possible that the bookmakers have an excellent idea of the "true" odds, and are willing to take a certain amount of additional risk if it's in their favor. For instance, suppose the casinos realized that the chance the Packers would beat the spread was only 45 percent. In that case, they might have been happy to take a bit more action on the Packers. They assumed a bit more risk for that game, and it cost them -- but, over time, they make more money on average by going with the odds.

So, we can't really draw any detailed conclusions, because of our assumptions. However, we CAN say that:

1. If all bets were taken at 5:6 odds, then about 54% of "pick 'em" bets on the Super Bowl were winners last Sunday.

2. Regardless, there was much more winning than usual on Sunday, enough to almost wipe out the oddsmakers' profits.

I know there are lots of gambling experts out there who might have enough information to explain what really did happen on Sunday (and correct any bad assumptions I may have made). Anyone?


Hat Tip: The Sports Economist


Friday, February 04, 2011

"Scorecasting" on players gunning for .300

A few months ago, I wrote about a study by two psychology researchers, Devin Pope and Uri Simonsohn. The study found that players hitting .299 going into their last at-bat of the season wound up hitting well over .400 in that at-bat. The authors concluded that it's because .299 hitters really want to get to .300, and, therefore, they try extra hard (and succeed).

But that isn't really the case. It's an illusion caused by selective sampling. When a player hitting .299 gets a hit to push him over .300, he is much more likely to be taken out (or held out) of the lineup, to preserve the .300. Therefore, it's not that they're more likely to get a hit in their last at-bat -- it's that their last at-bat is more likely to be one that results in a hit.

(For an analogy: when a game ends with less than 3 outs, the last batter probably hits well over .500 (since the winning run must have scored on the play). But that's not because the player rises to the situation; it's because, as it were, the situation rises to the player. When he gets a hit, he's the last batter because the game ends. When he doesn't, he's not the last batter.)
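
If you want to see the selection effect in action, here's a toy Python simulation. All the numbers -- the hitter's starting line, how many at-bats he has left, how likely the manager is to sit him once he crosses .300 -- are made up; the point is only that a true .300 talent "hits" far better than .300 in last at-bats he entered at .299:

import random

def simulate_one(true_avg=0.300, hits=149, ab=498, remaining=8, sit_prob=0.8):
    # Play out the rest of a hypothetical season; return the average the
    # player carried into his final at-bat, and whether that at-bat was a hit.
    last_entering, last_hit = None, None
    for _ in range(remaining):
        entering = hits / ab
        got_hit = random.random() < true_avg
        ab += 1
        hits += got_hit
        last_entering, last_hit = entering, got_hit
        # Once he's at .300 or better, the manager (usually) sits him
        # for the rest of the season to preserve the round number.
        if hits / ab >= 0.300 and random.random() < sit_prob:
            break
    return last_entering, last_hit

random.seed(1)
results = [simulate_one() for _ in range(200000)]
last_abs_at_299 = [hit for entering, hit in results if 0.2985 <= entering < 0.2995]
print(sum(last_abs_at_299) / len(last_abs_at_299))   # far higher than .300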

Since the original study and article, the authors have modified their paper a bit, saying that the batting average effect is "likely to be at least partially explained" by selective sampling. However, the data given in the previous posts does suggest that almost the *entire* effect is explained by selective sampling. (PDFs: Old paper; new paper.)

There is one part of the study's findings that's probably partially real, and that's the issue of walks. None of the .299 hitters walked in their last at-bat. That's partially selective sampling -- if they walked, they're still at .299, and stayed in the game, so it's not their last at-bat -- but probably partially real, in that .299 hitters were more likely to swing away.

(My results are in previous posts here and here.)

------

The study is given featured status in "Scorecasting," in the chapter on round numbers. However, while the authors of the original paper mention the selective sampling issue, the authors of "Scorecasting" do not:

"What's more surprising is that when these .299 hitters swing away, they are remarkably successful. According to Pope and Simonsohn, in that final at-bat of the season, .299 hitters have hit almost .430. ... (Why, you might ask, don't *all* batters employ the same strategy of swinging wildly? ... if every batter swung away liberally throughout the season, pitchers would probably adjust accordingly and change their strategy to throw nothing but unhittable junk.) ...

"Another way to achieve a season-ending average of .300 is to hit the goal and then preserve it. Sure enough, players hitting .300 on the season's last day are much more likely to take the day off than are players hitting .299."


"Scorecasting" treats these two paragraphs as two separate effects. In reality, the second causes the first.

You can read an excerpt -- almost the entire thing, actually -- at Deadspin, here.

------

One thing that interested me in the chapter was this:

"But no benchmark is more sacred than hitting .300 in a season. It's the line of demarcation between All-Stars and also-rans. It's often the first statistic cited when making a case for or against a position player in arbitration. Not surprisingly, it carries huge financial value. By our calculations, the difference between two otherwise comparable players, one hitting .299 and the other .300, can be as high as two percent of salary, or, given the average major league salary, $130,000."


The authors don't say how they calculated that, but it seems reasonable. A free-agent win is worth $4.5 million, according to Tom Tango and others. That means a run is worth $450,000. One point of batting average, in 500 AB, is turning half an out into half a hit. Assuming the average hit is worth about 0.6 runs and an out is worth negative 0.25 runs, that means the single point of batting average is worth a bit over 0.4 runs. That's close to $200,000.
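
Here's that back-of-the-envelope calculation in Python, with the $4.5 million per win, the rough ten-runs-per-win conversion, and the hit/out run values all taken as given:

DOLLARS_PER_WIN = 4500000                           # free-agent price of a win
RUNS_PER_WIN = 10                                   # rough rule of thumb
DOLLARS_PER_RUN = DOLLARS_PER_WIN / RUNS_PER_WIN    # $450,000

HIT_RUN_VALUE = 0.60       # approximate run value of an average hit
OUT_RUN_VALUE = -0.25      # approximate run value of an out

# One point of batting average over 500 AB turns half an out into half a hit:
extra_runs = 0.5 * (HIT_RUN_VALUE - OUT_RUN_VALUE)     # about 0.425 runs
print(extra_runs * DOLLARS_PER_RUN)                    # about $191,000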

That figure is higher than the authors' figure of $130,000. The difference is probably just that the authors used the average MLB salary, which includes players not yet free agents (arbs and slaves). However, they imply that the difference between .299 and .300 is worth more than other one-point differences. That might be true, but it would be nice to know how they figured it out and what they found.

------

Finally, two bloggers weigh in. Tom Scocca, at Slate, criticizes the original study. Then, Christopher Shea, at the Wall Street Journal, criticizes Scocca.




Tuesday, February 01, 2011

Scorecasting: are the Cubs unlucky, or is it management's fault?

The Chicago Cubs, it has been noted, have not been a particularly huge success on the field in the past few decades. Is Cubs' management to blame? The last chapter of the recent book "Scorecasting" says yes. I'm not so sure.

The authors, Tobias J. Moskowitz and L. Jon Wertheim, set out to debunk the idea that the Cubs' lack of success -- they haven't won a World Series since 1908 -- is simply due to luck.

How do they check that? How do they try to estimate the effects of luck on the Cubbies? Not the way sabermetricians would. Instead, the authors ... well, I'm not really sure what they did, but I can guess. Here's how they start:

"Another way to measure luck is to see how much of a team's success or failure can't be explaiend. For example, take a look at how the team performed on the field and whether, based on its performance, it won fewer games than it should have."

So far, so good. There is an established way to look at certain aspects of luck. You can look at the team's Pythagorean projection, which estimates its won-lost record from its runs scored and runs allowed. If it beat its projection, it was probably lucky.

You can also compute its Runs Created estimate. The Runs Created formula takes a team's batting line, and projects the number of runs it should have scored. If the Cubbies scored more runs than their projection, they were somewhat lucky. If they scored fewer, they were somewhat unlucky.
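
For concreteness, here's a bare-bones Python version of those two estimates -- the classic exponent-2 Pythagorean formula and Bill James's basic Runs Created. The real versions have refinements, but these are close enough to spot luck:

def pythagorean_wins(runs_scored, runs_allowed, games=162):
    # Expected wins from runs scored and allowed (exponent of 2).
    return games * runs_scored**2 / (runs_scored**2 + runs_allowed**2)

def basic_runs_created(hits, walks, total_bases, at_bats):
    # Basic Runs Created: (H + BB) * TB / (AB + BB).
    return (hits + walks) * total_bases / (at_bats + walks)

# A team that wins fewer games than pythagorean_wins() projects, or scores
# fewer runs than basic_runs_created() projects, was (by these measures) unlucky.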

But that doesn't seem to be what the authors do. At least, it doesn't seem to follow from their description. They continue:

"If you were told that your team led the league in hitting, home runs, runs scored, pitching, and fielding percentage, you'd assume your team won a lot more games than it lost. If it did not, you'd be within your rights to consider it unlucky."

Well, yes and no. Those criteria are not independent. If I were told that my team scored a certain number of runs, I wouldn't care whether it also led the league in home runs, would I? A run is a run, whether it came from leading the league in home runs, or leading the league in "hitting" (by which my best guess is that the authors meant batting average).

The authors do the same thing in the very same paragraph:

"How, for instance, did the 1982 Detroit Tigers finish fourth in their division, winning only 83 games and losing 79, despite placing eighth in the Majors in runs scored that season, seventh in team batting average, fourth in home runs, tenth in runs against, ninth in ERA, fifth in hits allowed, eighth in strikeouts against, and fourth in fewest errors?"

Again, if you know runs scored and runs against, why would you need anything else? Do they really think that if your pitchers give up four runs while striking out a lot of batters, you're more likely to win than if your pitchers give up four runs while striking out fewer batters?

(As an aside, just to answer the authors' question: the 1982 Tigers underperformed their Pythagorean estimate by 3 games. They underperformed their Runs Created estimate by 2 games. But their opponents also underperformed their own Runs Created estimate, by 1 game, which works in Detroit's favor. Combining the three -- 3 plus 2 minus 1 -- shows the '82 Tigers finished four games worse than they "should have".)

Now, we get to the point where I don't really understand their methodology:

"Historically, for the average MLB team, its on-the-field statistics would predict its winning percentage year to year with 93 percent accuracy."

What does that mean? I'm not sure. My initial impression is that they ran a regression to predict winning percentage based on that bunch of stats above (although if they included runs scored and runs allowed, the other variables in the regression should be almost completely superfluous, but never mind). My guess is that's what they did, and they got a correlation coefficient of .93 ... or perhaps an r-squared of .93. But that's not how they explain it:

"That is, if you were to look only at a team's on-the-field numbers each season and rank it based on those numbers, 93 percent of the time you would get the same ranking as if you ranked it based on wins and losses."

Huh? That can't be right. If you were to take the last 100 years of the Cubs, and run a projection for each year, the probability that you'd get *exactly the same ranking* for the projection and the actual would be almost zero. Consider, for instance, 1996, where the Cubs outscored their opponents by a run, and nonetheless wound up 76-86. And now consider 1993, when the Cubs were outscored by a run, and wound up 84-78. There's no way any projection system would "know" to rank 1993 eight games ahead of 1996, and so there's no way the rankings would be the same. The probability of getting the same ranking, then, would be zero percent, not 93 percent.

What I think is happening is that they're really talking about a correlation of .93, and this "93 percent of the time you would get the same ranking" is just an oversimplification in explaining what the correlation means. I might be wrong about that, but that's how I'm going to proceed, because that seems the most plausible explanation.

So, now, from there, how do the authors get to the conclusion that the Cubs weren't unlucky? What I think they did is to run the same regression, but for Cub seasons only. And they got 94 percent instead of 93 percent. And so, they say,

"The Cubs' record can be just as easily explained as those of the majority of teams in baseball. ... Here you could argue that the Cubs are actually less unlucky than the average team in baseball."

What they're saying is, since the regression works just as well for the Cubs as any other team, they couldn't have been unlucky.

But that just doesn't follow. At least, if my guess is correct that they used regression. I think the authors are incorrect about what it means to be lucky and how that relates to the correlation.

The correlation in the data suggests the extent to which the data linearly "explain" the year-to-year differences in winning percentage. But the regression doesn't distinguish luck from other explanations. If the Cubs are consistently lucky, or consistently unlucky, the regression will include that in the correlation.

Suppose I try to guess whether a coin will land heads or tails. And I'm right about half the time. I might run a bunch of trials, and the results might look like this:

1000 trials, 550 correct
200 trials, 90 correct
1600 trials, 790 correct
100 trials, 40 correct

If I run a regression on these numbers, I'm going to get a pretty high correlation -- .9968, to be more precise.

But now, suppose I'm really lucky. In fact, I'm consistently lucky. And, as a result, I do 10 percent better on every trial:

1000 trials, 605 correct
200 trials, 99 correct
1600 trials, 869 correct
100 trials, 44 correct

What happens now? If I run the same regression (try it, if you want), I will get *exactly the same correlation*. Why? Because it's just as easy to predict the number of successes as before. I just do what I did before, and add 10%. It's not the correlation that changes -- it's the regression equation. Instead of predicting that I get about 50% right, the equation will just predict that I get about 55% right. The fact that I was lucky, consistently lucky, doesn't change the r or the r-squared.
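
You can verify this in a couple of lines of Python -- multiplying every result by 1.1 leaves the correlation exactly where it was:

import numpy as np

trials  = np.array([1000, 200, 1600, 100])
correct = np.array([ 550,  90,  790,  40])

print(np.corrcoef(trials, correct)[0, 1])         # 0.9968...
print(np.corrcoef(trials, correct * 1.1)[0, 1])   # same 0.9968..., "luck" and all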

The same thing will happen in the Cubs case. Suppose the Cubs are lucky, on average, by 1 win per season. The regression will "see" that, and simply adjust the equation to somehow predict an extra win per season. It'll probably change all the coefficients slightly so that the end result is one extra win. Maybe if the Cubs are lucky, and a single "should be" worth 0.046 wins, the regression will come up with a value of 0.047 instead, to reflect the fact that, all other things being equal, the Cubs' run total is a little higher than for other teams. Or something like that.

Regardless, that won't affect the correlation much at all. Whether the Cubs were a bit lucky, a bit unlucky, about average in luck, or even the luckiest or unluckiest team in baseball history, the correlation might come out higher than .93, less than .93, or the same as .93.

So, what, then, does the difference between the Cubs' .94, and the rest of the league's .93, tell us? It might be telling us about the *variance* of the Cubs' luck, not the mean. If the Cubs hit the same way one year as the next, but one year they win 76 games and another they win 84 games ... THAT will reduce the correlation, because it will turn out that the same batting line isn't able to very accurately pinpoint the number of wins.

If you must draw a conclusion from the regression in the book -- which I am reluctant to do, but if you must -- it should be only that the Cubs' luck is very slightly *more consistent* than other teams' luck. But it will *not* tell you whether the Cubs' overall luck is good, bad, or indifferent.

------

So, have the Cubs been lucky, or not? The book's study doesn't tell us. But we can just look at the Cubs' Pythagorean projections, and runs created projections. Actually, a few years ago, I did that, and I also created a method to try to quantify a "career year" effect, to tell if the team's players underperformed or overachieved for that season, based on the players' surrounding seasons. (For instance, Dave Stieb's 1986 was marked as an unlucky year, and Brady Anderson's 1996 a lucky year, because both look out of place in the context of the players' careers.)

My study gave a total of a team's luck based on five factors (a rough sketch of how they combine appears below):

-- did it win more or fewer games than expected by its runs scored and allowed?
-- did it score more or fewer runs than expected by its batting line?
-- did its opponents score more or fewer runs than expected by their batting line?
-- did its hitters have over- or underachieving years?
-- did its pitchers have over- or underachieving years?

(Here's a PowerPoint presentation explaining the method, and here's a .ZIP file with full team and player data.)
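
By the way, here's a loose Python sketch of how five components like those might be totalled. It's not the actual method -- the conversions and the career-year calculations are more involved than a flat ten-runs-per-win rule -- but it shows the shape of the bookkeeping:

RUNS_PER_WIN = 10.0    # crude conversion for the run-based factors

def team_luck_wins(wins, pythag_wins, runs, rc_runs, opp_runs, opp_rc_runs,
                   hitter_career_runs, pitcher_career_runs):
    # Career deltas are in runs of benefit to the team (positive = the players
    # did better than their established career levels). Positive total = lucky.
    pythag_luck  = wins - pythag_wins                        # won more than the run totals justify
    offense_luck = (runs - rc_runs) / RUNS_PER_WIN           # scored more than the batting line projects
    defense_luck = (opp_rc_runs - opp_runs) / RUNS_PER_WIN   # opponents scored less than projected
    hitter_luck  = hitter_career_runs / RUNS_PER_WIN         # hitters' career-year effect
    pitcher_luck = pitcher_career_runs / RUNS_PER_WIN        # pitchers' career-year effect
    return pythag_luck + offense_luck + defense_luck + hitter_luck + pitcher_luck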

The results: from 1960 to 2001, the Cubs were indeed unlucky ... by an average of slightly over half a win. That half win was made up of about 1.5 wins of unlucky underperformance by their players, mitigated by about one win of good luck in turning that performance into wins.

But the Cubs never really had seasons in that timespan in which bad luck cost them a pennant or division title. The closest were 1970 and 1971, when, both years, they finished about five games unluckier than they should have (they would have challenged for the pennant in 1970 with 89 wins, but not in 1971 with 88 wins). Mostly, when they were unlucky, they were a mediocre team that bad-lucked their way into the basement. In 1962 and 1966, they lost 103 games, but, with normal luck, would have lost only 85 and 89, respectively.

However, when the Cubs had *good* luck, it was at opportune times. In 1984, they won 96 games and the NL East, despite being only an 80-82 team on paper. And they did it again in 1989, winning 93 games instead of the expected 77.

On balance, I'd say that the Cubs were lucky rather than unlucky. They won two divisions because of luck, but never really lost one because of luck. Even if you want to consider that they lost half a title in 1970, that still doesn't come close to compensating for 1984 and 1989.

------

But things change once you get past 2001. It's not in the spreadsheet I linked to, but I later ran the same analysis for 2002 to 2007, at Chris Jaffe's request for his book. And, in recent years, the Cubs have indeed been unlucky:

2002: 67-95, "should have been" 86-76 (19 games unlucky)
2003: 88-74, "should have been" 86-76 (2 games lucky)
2004: 89-73, "should have been" 90-72 (1 game unlucky)
2005: 79-83, "should have been" 86-76 (6 games unlucky)
2006: 66-96, "should have been" 82-80 (16 games unlucky)
2007: 85-77, "should have been" 88-74 (3 games unlucky)

That's about 43 games of bad luck over six seasons -- an average of about 7 games per season. That's huge. Even if you don't trust my "career year" calculations, just the Pythagorean and Runs Created bad luck sum to almost 5.5 of those 7 games.

So, yes ... in the last few years, the Cubs *have* been unlucky. Very, very unlucky.

------

In summary: from 1960 to 2001, the Cubs were a bit of a below-average team, with about average luck. Then, starting in 2002, the Cubs got good -- but, by coincidence or curse, their luck turned very bad at exactly the same time.

------

But if the "Scorecasting" authors don't believe that the Cubs have been unlucky, then what do they think is the reason for the Cubs' lack of success?

Incentives. Or, more accurately, the lack thereof. The Cubs sell out almost every game, win or lose. So, the authors ask, why should Cubs management care about winning? They gain very little if they win, so they don't bother to try.

To support that hypothesis, the authors show the impact (elasticity) of wins on tickets sold. It turns out that the Cubs have the lowest elasticity in baseball, at 0.6. If the Cubs' winning percentage drops by 10 percent, ticket sales drop by only 6 percent.

On the other hand, their crosstown rivals have one of the highest elasticities in the league, at about 1.2. For every 10 percent drop in winning percentage, White Sox ticket sales drop by 12 percent -- almost twice as much.
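
For what it's worth, elasticity here is just the ratio of percentage changes -- the slope of log attendance against log winning percentage. Here's a toy Python version, with made-up attendance figures:

import math

def elasticity(winpct_1, tickets_1, winpct_2, tickets_2):
    # Arc elasticity of ticket sales with respect to winning percentage,
    # estimated from two (hypothetical) seasons.
    return (math.log(tickets_2) - math.log(tickets_1)) / \
           (math.log(winpct_2) - math.log(winpct_1))

print(elasticity(0.500, 3000000, 0.450, 3000000 * 0.94))   # about 0.6, Cubs-like
print(elasticity(0.500, 2000000, 0.450, 2000000 * 0.88))   # about 1.2, Sox-like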

But ... I find this unconvincing, for a couple of reasons. First, if you look at the authors' tables (p. 245), it looks like it takes a year or so after a good season for attendance to jump. That makes sense. In 2005, it probably took a month or two for White Sox fans to realize the team was genuinely good; in 2006, they all knew beforehand, at season-ticket time.

Now, if you look at the Cubs' W-L record for the past 10 years, it really jumps up and down a lot; from 1998 to 2004, the team seesawed between good and bad. For seven consecutive seasons, they either won 88 games or more (four times), or lost 88 games or more (three times). So, fan expectations were probably never in line with team performance. The authors predicted attendance from current performance rather than lagged performance, and that might be why they didn't see a strong relationship (even if there is one).

But that's a minor reason. The bigger reason I disagree with the authors' conclusions is that, even when they're selling out, the Cubs still have a strong incentive to improve the team -- and that's ticket prices. Isn't it obvious that the better the team, the higher the demand, and the more you can charge? It's no coincidence that the Cubs have the highest ticket prices in the Major Leagues (.pdf) at the same time as they're selling out most games. If the team is successful, and demand rises, the team just charges more instead of selling more.

Also, what about TV revenues, and merchandise sales, which also rise when a team succeeds?

It seems a curious omission that the authors would consider only that the Cubs can't sell more tickets, and not that total revenues would significantly increase in other ways. But that's what they did. And so they argue,

"So, at least financially, the Cubs seem to have far less incentive than do other teams -- less than the Yankees and Red Sox, and certainly less than the White Sox. ... Winning or losing is often the result of a few small things that require extra effort to gain a competitive edge: going the extra step to sign the highly sought-after free agent, investing in a strong farm team with diligent scouting, monitoring talent, poring over statistics, even making players more comfortable. All can make a difference at the margin, and all are costly. When the benefits of making these investments are marginal at best, why undertake them?"

Well, the first argument is the one I just made: the benefits are *not* "marginal at best," because with a winning team, the Cubs would earn a lot more money in other ways. But there's a more persuasive and obvious argument. If the Cubs have so small an incentive to win, if they care so little that they can't even be bothered to hire "diligent" scouts ... then why do they spend so much money on players?

In 2010, the Cubs' payroll was $146 million, third in the majors. In 2009, they were also third. Since 2004, they have never been lower than ninth, in a 30-team league. Going back as far as 1991, there are only a couple of seasons that the Cubs are below average -- and in those cases, just barely. In the past 20 years, the Cubs have spent significantly more money on player salaries than the average major-league team.

It just doesn't make sense to assume that the Cubs don't care about winning, does it, when they routinely spend literally millions of dollars more than other teams, in order to try to win?

