One of the most common requests I get is to write up a complete sample game probability calculation. In this article, I'll explain how the model works and do a full detailed example using the upcoming Super Bowl between the Steelers and Cardinals.
When I originally constructed this model, the goal wasn’t to predict game outcomes but to identify how important the various phases of the game were compared to the others. In order to do that, I had to choose stats that were independent of the others, or at least as independent as possible.
There were several options, such as points scored and allowed, total yards, or first downs. But if I’m trying to measure the true strength of a team’s offensive passing game, passing touchdowns may not tell us much. A team may have a great defense that gives them good field position on most drives, or it might have a spectacular running back that can carry the offense into the red zone frequently. So points or touchdowns won’t work.
The other obvious option is total yards. But losing teams can accumulate lots of total passing yards late in a game's “trash time.” Or a team can generate lots of pass yards simply because it passes more often. That really doesn’t tell us how good a team is at passing. Total rushing yards present a similar problem. A team with a great passing game can build a huge lead through three quarters, and then run out the clock in the 4th quarter, accumulating a lot of rushing yards.
First downs made or allowed tell us a lot about how good an offense or defense is, but they don’t tell us anything about the relative contributions of a team’s running and passing games.
So, the best choice is going to be efficiency stats. Net yards per pass attempt and yards per rush tell us how good a team truly is in those facets of the game. They are also largely independent of one another. Not completely, but about as independent as possible.
Turnovers are also obviously critical. But total turnovers can be misleading just like total yards. Teams that pass infrequently may have few interceptions, but it may only be because they simply have fewer opportunities. So I also use interceptions per attempt, and fumbles per play.
So the model starts with team efficiency stats. But I don’t use all of them. For example, I throw out defensive fumble rate because although it helps explain past wins or losses, it doesn’t predict future games. A team’s defensive fumble rate is wildly inconsistent throughout a season, which suggests it’s very random or mostly due to an opponent’s ability to protect the ball. Forced fumbles and defensive interceptions show the same tendency. In the end, the model is based on:
- Offensive net passing yds per att
- Offensive rushing yds per att
- Offensive interceptions per att
- Offensive fumbles per play
- Defensive net passing yds per att
- Defensive rushing yds per att
- Team penalty yds per play
- Home field advantage
The model is a regression model, specifically a multivariate non-linear (logistic) regression. I know that sounds very technical, but the general idea behind regression is pretty intuitive. If you plotted a graph of a group of students’ SAT scores vs. their GPA, you’d see a rough diagonal line.
We can draw a line that estimates the relationship between SAT scores and GPA, and that line can be mathematically described with a slope and intercept. Here, we could say GPA = 1.5 + 2 * (test score).
Regression is what puts that line where it is. It draws a line that minimizes the error between the estimated GPA and the actual GPA of each case.
We can do the same thing with net passing efficiency and season wins. We can estimate season wins as Wins = -6.5 + 2.4*(off pass eff). Take the Cardinals this year. Their 7.1 net passing yds per attempt produces an estimate of 10.7 wins. They actually won 9, so it’s not a perfect system. We need to add more information, and that’s what multivariate regression can do.
Multivariate regression works the same way but is based on more than one predictor variable. Using both offensive and defensive pass efficiency as predictors, we get:
Wins = 9.6 + 2.3*(off pass eff) – 2.6*(def pass eff)
For the Cardinals, whose defensive pass efficiency was 6.5 yds per att in 2008, we get an estimate of 9.4 wins.
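As a quick check on the arithmetic, here is a minimal sketch that plugs the Cardinals' numbers into the two linear equations above. The function names are mine, purely for illustration. Because the coefficients and stats are rounded as printed, the results can differ by a few tenths of a win from the estimates quoted in the text.

```python
# Linear win estimates using the rounded coefficients quoted in the article.
# Function names are hypothetical; only the numbers come from the text.

def wins_from_offense(off_pass_eff):
    """One-variable estimate: offensive net passing yds per attempt only."""
    return -6.5 + 2.4 * off_pass_eff

def wins_from_pass_game(off_pass_eff, def_pass_eff):
    """Two-variable estimate: offensive and defensive pass efficiency."""
    return 9.6 + 2.3 * off_pass_eff - 2.6 * def_pass_eff

# Arizona, 2008 regular season (values as printed in the article)
ari_off, ari_def = 7.1, 6.5

one_var = wins_from_offense(ari_off)
two_var = wins_from_pass_game(ari_off, ari_def)
print(f"one predictor: {one_var:.1f} wins, two predictors: {two_var:.1f} wins")
```

Either way, adding the defensive term pulls the estimate closer to Arizona's actual 9 wins.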
Adding the rest of the efficiency stats to the regression, we can improve the estimates even further. Unfortunately, linear regression, like we just used, can sometimes give us bad results. A team with the best stats imaginable would still only win 16 games in a season, but a linear regression might tell us they should win 21. Additionally, linear regression can estimate things like the total season wins, but it can’t estimate the chances of one team beating another. That’s where non-linear regression comes in.
Non-linear regression, like the logistic regression I use, is best used for dichotomous outcomes such as win or lose. A logistic regression model can estimate the probabilities of one outcome or the other based on input variables. It does this by using a logarithmic transformation: instead of modeling the probability directly, it models the log of the odds as a weighted sum of the inputs, just like the right-hand side of a linear regression. After computing the model and its output, you “undo” the logarithm by raising e to the power of the result. Technically, logistic regression produces the “log of the odds ratio.” The odds ratio is the familiar “3 to 1” odds used at the race track, which can be translated into a probability of 0.75 (to 0.25).
Logistic regression would be useful if, instead of predicting GPA, you wanted to predict a student’s probability of graduation. Graduation is a yes-or-no dichotomous outcome, and winning an NFL game is no different. We can use the efficiency stats that we already know contribute to winning to estimate the chances one team beats another.
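To make the odds, log-odds, and probability relationship concrete, here is a small sketch using the race track example. The helper names are mine and not part of any library; it only uses Python's standard `math` module.

```python
import math

def odds_to_prob(odds):
    """Convert odds (e.g. 3.0 for '3 to 1') into a probability."""
    return odds / (1.0 + odds)

def prob_to_log_odds(p):
    """The 'logarithmic transformation': log of the odds for probability p."""
    return math.log(p / (1.0 - p))

def log_odds_to_prob(z):
    """Undo the logarithm: turn a log-odds value back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = odds_to_prob(3.0)         # "3 to 1" odds
z = prob_to_log_odds(p)       # the quantity logistic regression models
p_back = log_odds_to_prob(z)  # round trip back to the probability
print(p, z, p_back)
```

A log-odds of zero corresponds to even odds, so a positive result favors the outcome and a negative one goes against it.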
As an example, let’s compute the probability each opponent will win the upcoming Super Bowl based on offensive rushing efficiency alone. Based on the regular season game outcomes from 2002-2007, the regression output tells us that the intercept is zero and the coefficient of rushing efficiency is 0.25. The model can be written:
Log(odds ratio) = 0 + 0.25*(ARI off run eff) – 0.25*(PIT off run eff)
= 0.25*(3.46) – 0.25*(3.67)
= -0.052
The odds ratio would be e^(-0.052) = 0.95. In other words, based on offensive running alone, the odds Arizona wins would be 0.95 to 1. In probability terms, this is 0.49, giving Pittsburgh the slightest edge. Another way of saying this is, holding all other factors equal, Pittsburgh’s advantage in rushing efficiency gives them just a 51% chance of winning.
[Note: You can translate odds ratios into probabilities by using prob = odds/(1+odds).]
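The rushing-only example can be reproduced in a few lines. This is just a sketch of the arithmetic above, with names I made up for illustration. The intercept, coefficient, and yards-per-rush figures are the ones quoted in the text.

```python
import math

INTERCEPT = 0.0   # intercept from the rushing-only regression
RUN_COEF = 0.25   # coefficient of offensive rushing efficiency

def rushing_only_prob(team_run_eff, opp_run_eff):
    """Win probability for a team based on offensive rushing efficiency alone."""
    log_odds = INTERCEPT + RUN_COEF * team_run_eff - RUN_COEF * opp_run_eff
    odds = math.exp(log_odds)    # undo the logarithm
    return odds / (1.0 + odds)   # prob = odds / (1 + odds)

p_ari = rushing_only_prob(3.46, 3.67)  # Arizona vs. Pittsburgh
p_pit = rushing_only_prob(3.67, 3.46)  # Pittsburgh vs. Arizona
print(f"ARI {p_ari:.2f}, PIT {p_pit:.2f}")
```

Swapping the two teams flips the sign of the log-odds, so the two probabilities always sum to one.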
Now we can do the same thing, but with the full list of predictor variables. The independent “input” variables are the efficiency stats for each team, and the dependent variable is the dichotomous outcome of each game—either 1 for a win or 0 for a loss. My handy regression software tells us that the model coefficients come out as:
Coefficient | Value |
Constant | -0.36 |
Home Field | 0.72 |
O Pass | 0.46 |
O Run | 0.25 |
O Int | -19.4 |
O Fum | -19.4 |
D Pass | -0.62 |
D Run | -0.25 |
Pen Rate | -1.53 |
The “logit,” or the change in the log of the odds ratio, can be written as:
Logit = const + home field + Team A logit - Team B logit
or
Logit = -0.36 + 0.72 + 0.46*(team A off pass eff) + 0.25*(team A off run eff) +...
- 0.46*(team B off pass eff) – 0.25*(team B off run eff) - …
We have the constant, the home field advantage adjustment, and the sum of the products of each team’s coefficients and stats. The equation will eventually tell us Team A’s odds of winning, so we add its component logit and we subtract Team B’s. If Team A is the home team, we add the home field adjustment (0.72 * 1). If not, we can leave it out (0.72 * 0).
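Here is one way the general equation could be sketched in code. The dictionary keys, function names, and the sample team are hypothetical, chosen only for illustration. The coefficients are the ones from the table. Two identical teams are used so that the team components cancel and only the constant and home field terms remain.

```python
import math

# Coefficients from the table above (dictionary keys are my own labels)
COEF = {"o_pass": 0.46, "o_run": 0.25, "o_int": -19.4, "o_fum": -19.4,
        "d_pass": -0.62, "d_run": -0.25, "pen": -1.53}
CONSTANT = -0.36
HOME_FIELD = 0.72

def team_logit(stats):
    """Sum of coefficient * stat for one team's efficiency stats."""
    return sum(COEF[k] * stats[k] for k in COEF)

def win_prob(team_a, team_b, a_is_home):
    """Probability that Team A beats Team B."""
    home = HOME_FIELD * (1 if a_is_home else 0)
    logit = CONSTANT + home + team_logit(team_a) - team_logit(team_b)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical team, not real data; any values would do since they cancel
team = {"o_pass": 6.0, "o_run": 4.0, "o_int": 0.03, "o_fum": 0.025,
        "d_pass": 6.0, "d_run": 4.0, "pen": 0.40}

p_home = win_prob(team, team, a_is_home=True)
p_away = win_prob(team, team, a_is_home=False)
print(f"home {p_home:.3f}, away {p_away:.3f}")
```

With evenly matched teams, the logit is +0.36 at home and -0.36 on the road, so the home team wins a bit under 59% of the time in this sketch.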
Now let’s look at Arizona and Pittsburgh in terms of their probability of winning Super Bowl XLIII. I’ll compute both teams’ logit component, combine them in the overall logit equation, then convert it to probabilities. To keep things simple, I’m going to only use team stats from the regular season for this example.
Arizona’s logit component would be:
Logit(ARI) = 0.46*7.1 + 0.25*3.5 – 19.4*0.024 – 19.4*0.028 – 0.62*6.5 – 0.25*4.0 – 1.53*0.39
= -2.45
Pittsburgh’s logit component would be:
Logit(PIT) = 0.46*6.0 + 0.25*3.7 – 19.4*0.030 – 19.4*0.026 – 0.62*4.3 – 0.25*3.3 – 1.53*0.41
= -1.51
Because the Super Bowl is at a neutral site, I’ll only add half of the home field adjustment when I combine the full equation.
Logit = -0.36 + 0.72/2 - 2.45 + 1.51
= -0.93
Therefore the odds ratio is e^(-0.93) = 0.39. That makes the probability of Arizona beating Pittsburgh at a neutral site equal to 0.39/(1+0.39) = 0.28. Pittsburgh’s corresponding probability would be 0.72.
(Notice how the constant and the home field adjustment cancel out to zero for a neutral site.)
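For anyone who wants to follow along, the whole Super Bowl calculation can be sketched as below. The variable and function names are mine. The coefficients and team stats are the rounded values printed in this article, so the intermediate logits can differ in the second decimal place from the hand-computed figures, and the final probability should land within about a percentage point of the 0.28 above.

```python
import math

# Order: O Pass, O Run, O Int, O Fum, D Pass, D Run, Pen Rate
COEF = (0.46, 0.25, -19.4, -19.4, -0.62, -0.25, -1.53)
ARI = (7.1, 3.5, 0.024, 0.028, 6.5, 4.0, 0.39)   # stats as printed above
PIT = (6.0, 3.7, 0.030, 0.026, 4.3, 3.3, 0.41)   # stats as printed above

CONSTANT = -0.36
HOME_FIELD = 0.72

def component(stats):
    """One team's logit component: sum of coefficient * stat."""
    return sum(c * s for c, s in zip(COEF, stats))

logit_ari = component(ARI)
logit_pit = component(PIT)

# Neutral site: add only half of the home field adjustment,
# which cancels the constant exactly.
logit = CONSTANT + HOME_FIELD / 2 + logit_ari - logit_pit

odds = math.exp(logit)
p_ari = odds / (1.0 + odds)
p_pit = 1.0 - p_ari
print(f"ARI {p_ari:.2f}, PIT {p_pit:.2f}")
```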
In part 2 of this article, I'll explain how I factor in opponent adjustments and how I calculate a team's generic win probability (GWP)--the probability a team would win against a league-average opponent at a neutral site.