Open Mind

Alphabet Soup, part 3b: Parameters for ARMA(1,1) (continued)

September 16, 2008 · 14 Comments

We’ll now continue where we left off in the last post, to find out how to estimate ARMA(1,1) parameters from observed data. This post won’t have nearly so many equations! From my point of view, that’s a pity … but most of you will be relieved.

We got as far as noting that if we know the ARMA(1,1) parameters, the white-noise standard deviation \sigma_w and the AR and MA parameters \phi,~\theta, then we can determine the data standard deviation \sigma and the autocorrelations \rho_j (for all j). The data variance is given by

\sigma^2 = \gamma_0 = \sigma_w^2 [1 + (\phi+\theta)^2 / (1-\phi^2)],

and the autocorrelations for lag j \ge 1 are

\rho_j = \phi^j [1 + {\theta/\phi \over 1 + (\phi+\theta)^2 / (1-\phi^2)}] = \lambda \phi^j,

where for convenience I’ve defined

\lambda = [1 + {\theta/\phi \over 1 + (\phi+\theta)^2 / (1-\phi^2)}].

We note that \lambda has no dependence on the lag j, so for j \ge 1 the autocorrelations decay exponentially with decay factor \phi. This enables us to determine the first of our parameters using observed data. We begin by estimating the autocorrelations \rho_j. This can be done in many ways, but most ways are biased low, including the traditional (Yule-Walker) estimate. However, if we have enough data then the bias is small, so I won’t bother considering how to estimate the autocorrelations; with enough data, any method should work well enough.
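If you want to compute these yourself, the traditional estimate can be sketched in a few lines (Python here, purely for illustration; the function name sample_acf is mine):

```python
# Traditional (biased-low) sample autocorrelation:
#   rho_hat(j) = sum_t (x_t - xbar)(x_{t+j} - xbar) / sum_t (x_t - xbar)^2
def sample_acf(x, max_lag):
    n = len(x)
    xbar = sum(x) / n
    d = [v - xbar for v in x]
    c0 = sum(v * v for v in d)  # n times the lag-0 autocovariance
    return [sum(d[t] * d[t + j] for t in range(n - j)) / c0
            for j in range(1, max_lag + 1)]
```

With a long enough series this and a bias-corrected variant give nearly identical numbers.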

We can note that

\ln(\rho_j) = \ln(\lambda) + j \ln \phi.

This indicates that (for j \ge 1) the logarithms of the autocorrelations \rho_j should be a linear function of lag j.

So the first thing I did was take the observed data and subtract the trend as determined by linear regression. Using GISS data from 1975 up to and including June 2008, that gives this:

Then I estimated the low-lag autocorrelations, using both the Yule-Walker estimate and a bias-corrected estimate. They are:

Lag   Yule-Walker   Bias-corrected
 1      0.5976          0.6046
 2      0.5223          0.5300
 3      0.3838          0.3911
 4      0.3727          0.3809
 5      0.3085          0.3165
 6      0.2563          0.2640

At this point we could take advantage of the fact that \rho_2/\rho_1=\phi, to estimate \phi. This gives \phi=0.8740 (Yule-Walker) or \phi=0.8766 (bias-corrected). Both estimates are very similar. But instead, I decided to take advantage of the fact that the logarithm of the autocorrelation is a linear function of lag (for lag \ge 1), by fitting a straight line to \ln(\rho_j) for j from 1 to 6. We have to omit the autocorrelation at lag 0 (which equals 1, so its logarithm is zero) because it does not follow the given equation (it would only do so for an AR(1) model). We must also omit lags higher than a few time steps, because then the autocorrelations become small enough that their uncertainties make the uncertainty in \ln(\rho_j) very large.

The slope of this regression line gives an estimate of \ln(\phi). The result was \ln(\phi)=-0.166866 (Yule-Walker), so \phi=0.8463, or \ln(\phi)=-0.163306 (bias-corrected), so \phi=0.8493. Clearly all the estimates are very close to one another; I adopted the value from linear regression using the bias-corrected autocorrelations, \phi=0.8493.
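The regression step can be sketched the same way: fit a straight line to \ln(\rho_j) for j from 1 to 6, using the bias-corrected autocorrelations from the table above (plain least squares, no libraries; again an illustration, not the code I used):

```python
import math

# Fit ln(rho_j) = ln(lambda) + j*ln(phi) by ordinary least squares,
# using the bias-corrected autocorrelations at lags 1..6 from the table.
rho = [0.6046, 0.5300, 0.3911, 0.3809, 0.3165, 0.2640]
lags = list(range(1, 7))
y = [math.log(r) for r in rho]

n = len(lags)
jbar = sum(lags) / n
ybar = sum(y) / n
slope = (sum((j - jbar) * (yi - ybar) for j, yi in zip(lags, y))
         / sum((j - jbar) ** 2 for j in lags))
intercept = ybar - slope * jbar

phi = math.exp(slope)      # the AR decay factor
lam = math.exp(intercept)  # lambda, from the intercept
```

Running this reproduces the numbers quoted in the text (slope about -0.1633, \phi about 0.8493, and the intercept-based \lambda about 0.7053 mentioned below).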

That’s fine, but what about \theta and \sigma_w? To estimate \theta, we could use the fact that the intercept of the regression line of \ln(\rho_j) against j is an estimate of \ln \lambda. This would probably be more accurate than estimating \lambda as \rho_1/\phi, but I used the latter estimate rather than the former. The linear-regression estimate is \ln(\lambda) = -0.3491, so \lambda=0.7053. Using the estimate \lambda=\rho_1/\phi gives \lambda=0.6046/0.8493=0.7119. Again it’s clear that both estimates are very close; I used \lambda=0.7119.

But \lambda is not \theta. To get that, we note that

\lambda = 1 + {\theta/\phi \over 1 + (\phi+\theta)^2/(1-\phi^2)}.

This can be rearranged, with some algebra, to

\theta^2 + [2\phi + (1/\phi - \phi)/(1-\lambda)] \theta + 1 = 0.

This equation is quadratic in \theta, so it can be solved for \theta. For the GISS data, I get

\theta^2 + 2.8375 \theta + 1 = 0.

There are two roots to any quadratic equation; in this case both are negative, and since their product equals 1 they are reciprocals of each other. I selected the one which is less negative, i.e. the invertible choice with |\theta| < 1, giving \theta=-0.4123.
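Solving the quadratic can be sketched in the same style, using the estimates \phi=0.8493 and \lambda=0.7119 from above (illustrative only):

```python
import math

# Solve theta^2 + b*theta + 1 = 0. The two roots multiply to 1, so
# they're reciprocals; pick the one inside the unit circle (|theta| < 1),
# which gives the invertible MA representation.
phi, lam = 0.8493, 0.7119
b = 2 * phi + (1 / phi - phi) / (1 - lam)
disc = math.sqrt(b * b - 4)
theta = (-b + disc) / 2 if b > 0 else (-b - disc) / 2
```

This reproduces b = 2.8375 and \theta = -0.4123 as quoted in the text.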

We’re almost there! It only remains to determine \sigma_w, which we can compute directly from the equation (a rearranged version of the previous equation for \sigma^2)

\sigma_w = \sigma / \sqrt{1 + (\phi+\theta)^2/(1-\phi^2)}.

The standard deviation of the GISS residuals is \sigma = 0.1476, and using the estimated values of \phi,~\theta we get \sigma_w=0.1137. This shows that I made a typo in my earlier post, where I stated that \sigma_w=0.1147. The difference is small, and not really significant; the probable error in the estimate is larger than that.
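The last step is a one-liner (again a sketch, assuming the residual standard deviation \sigma = 0.1476 quoted above):

```python
import math

# sigma_w = sigma / sqrt(1 + (phi+theta)^2 / (1-phi^2))
phi, theta, sigma = 0.8493, -0.4123, 0.1476
sigma_w = sigma / math.sqrt(1 + (phi + theta) ** 2 / (1 - phi ** 2))
```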

Voila! We have estimates for all three parameters, and more important, we have a method for obtaining the estimates given observed data. There are a number of steps, and it’s easy to make an error, so it would probably be best to write a computer program to compute them. Or, you can fit the ARMA(1,1) model directly using a stats program like R (yes, I’ve recently started using R and I like it very much). That would save a lot of work! But at least, with these posts, we can see how the numbers can be estimated from observed data. I always want to know how things are computed; I’m never satisfied with numbers that “magically” appear out of a “black box.” I will say that R is very powerful, and it’s free for download from the web, so if you want to do a lot of stats I’d recommend it. But then, of course, you’ll have to learn how to use it! It’s actually quite easy to get started, but mastering it takes time because there’s a lot to it.

I was going to apply this analysis to HadCRU data as well, but I think I’ll leave that for the talented readers who are eager to apply all this stuff. If you decide to do that, I would recommend making things simple for yourself. Don’t hesitate to use the Yule-Walker estimate of autocorrelations (if you use R, that’s what you’ll get). Don’t hesitate to estimate \phi by \rho_2/\rho_1. Those estimates are plenty close enough, and you should keep in mind that they’re all only estimates. But they should enable you to model noise as an ARMA(1,1) process that mimics reasonably well the behavior of the noise component in monthly global average temperature.

At this point I must say, these last two posts were a lot of work! In fact it was more work writing the posts than it was working out the equations. And I’m fully aware that there’s a limited audience for this kind of post, namely, those who are keenly interested in the math. But I think it’s well worth the effort. Still, I may take a little break from blogging for a few days…

Categories: mathematics

14 responses so far

  • swade016 // September 17, 2008 at 2:19 pm

    Good stuff. How successful is this method for forecasting, say 10 “intervals” into the future? I tried an ARMA(1,1) on hourly data, and the forecast was basically a flat line. When I switched to an ARMA(24,1) it kept the “daily curve”. Any reason why this would work that way? Is it just because the p=1 isn’t a large enough lag to pick up the daily peaks and valleys when it forecasts?

    [Response: For hourly temperature there's a one-day cycle which is not a completely random process; it's driven by the rotation of the earth. The data should be modeled as (average daily cycle) + (weather noise). Also, the average daily cycle will change with the seasons, on yet another cycle with a period of one year driven by the orbital motion of the earth. The orbital and rotational cycles are not random processes, they're deterministic physical processes.

    Higher-order AR processes can mimic such cycles; you can do so with an AR(2) process, for instance. But eventually the phase of the "cycle" will drift, randomly. For an ARMA(24,1) properly tuned, the mimicry will probably be very good and the drift may take a while to be apparent, but sooner or later the "cycle" will be out of phase with the rotation of the earth. So the daily cycle in temperature is not due to a high-order random process; it's a deterministic signal driven by an external factor with a very precise period and consistent phase.

    Note that in this post the random process to mimic global temperature is fit only after the deterministic signal (trend) has been removed.

    Bear in mind that ARMA(p,q) mimics a random process. Hence such models only permit forecasting a few time steps into the future. Over the long haul, the random nature of the process dominates and prediction is impossible -- just as we can predict the weather a few days ahead, but predicting weather a month in advance is impossible.]

  • swade016 // September 17, 2008 at 7:09 pm

    Tamino, Thanks! That was very helpful.

    Actually, I’m attempting to build an electric load estimation/forecasting system from scratch and have tried a myriad of ways to get anything of substance.

    Some of the customers are weather sensitive, while others are not. For a weather sensitive customer, it’s fairly easy to regress against Temp/Windspeed/Humidity and get a good model. For others, say a school that only runs on 210 days a year, it’s not. In those cases, I’ve tried running through various time-series models and comparing graphs to known data to get anything that resembles what’s going on. For all I know what I’m doing is bunk, but when the ARIMA(24,1) keeps returning R^2 values around .98 it “seems” like it works. Part of me just thinks it’s crap, but then again… it “looks” right.

    Earlier I’d asked you for a good Time-Series Analysis book, and while I’m waiting for yours to be finished, I’d still like to get a head start. Any recommendations on where to start, resources, etc…?

    [Response: Actually a good textbook is Shumway & Stoffer (2006), Time Series Analysis and its Applications, With R Examples, Springer, New York.]

  • george // September 18, 2008 at 9:32 pm

    The calculated white noise value comes out significantly less if you use the other negative root above.

    In that case, the white noise standard dev is only about 1/3 of the standard dev of the residuals, while in the other case it’s about 3/4.

    How did you decide to use the less negative root ?

    Is this based on some additional information?

  • tamino // September 19, 2008 at 1:22 am

    It has come to my attention that in addition to trying the ARMA(1,1) model on HadCRU data in a recent post, Lucia has also tried using a model which is the sum of an AR(1) model and white noise, i.e. the noise component of global temperature is modeled as

    \epsilon_t = \alpha_t + \beta_t,

    where \alpha_t is an AR(1) process and \beta_t is a white-noise process. This model is easily shown to have the same autocorrelation structure as an ARMA(1,1) process.
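    The easy direction of that claim can at least be sketched here; this is just the autocovariance calculation, not the full equivalence proof:

    ```latex
    % Assume \alpha_t is AR(1) with parameter \phi and variance \sigma_\alpha^2,
    % and \beta_t is white noise with variance \sigma_\beta^2, independent of \alpha_t.
    \gamma_0 = \mathrm{var}(\epsilon_t) = \sigma_\alpha^2 + \sigma_\beta^2,
    \qquad
    \gamma_j = \mathrm{cov}(\epsilon_t, \epsilon_{t+j}) = \sigma_\alpha^2 \, \phi^j
    \quad (j \ge 1),
    ```

    so \rho_j = \lambda \phi^j with \lambda = \sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_\beta^2), which is exactly the ARMA(1,1) autocorrelation form. But note this construction forces 0 < \lambda < 1, which is one way to see why the reverse direction fails for some ARMA(1,1) models.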

    I suspect she’s not aware that the two models are equivalent in the sense that every model which is the sum of an AR(1) process and a white-noise process can be reformulated as an ARMA(1,1) process. However, not every ARMA(1,1) model can be reformulated as the sum of an AR(1) process and a white-noise process.

    The proof is rather involved, and a post outlining it would be a lengthy dissertation of mainly equations. I suspect I’ve already exhausted my quota of excessively mathematical posts (at least for the time being).

  • David B. Benson // September 19, 2008 at 1:35 am

    tamino // September 19, 2008 at 1:22 am wrote “I suspect I’ve already exhausted my quota of excessively mathematical posts (at least for the time being).”

    Not for me you haven’t.

  • lucia // September 19, 2008 at 5:16 am


    Thanks for posting the numbers.

    FWIW, I was aware of the equivalence and that the two have the exact same auto-correlation structure. :)

    I also posted for the HadCrut3 series, but based the variables for ARMA(1,1) on data from Nov. 1913 to Dec. 1943. (The dates are discussed.)

    I’m running some numbers to write a more complete post showing distributions of trends for three periods which vary according to the number of volcanic eruptions in the period.

    I was going to show the results for both the ARMA(1,1) and the White+Red noise treatment, with the intention of pointing out the main differences in results one gets arise from the period we use to obtain our parameters.

  • void{} // September 19, 2008 at 4:03 pm

    I suspect she’s not aware that the two models are equivalent in the sense that every model which is the sum of an AR(1) process and a white-noise process can be reformulated as an ARMA(1,1) process.

    Yet another strawman thrown out with no supporting evidence whatsoever.

    Such tactics have no place in discussions of technical topics (or anything for that matter).

    Take a look at the ratio of comments for Open Threads to those for specific technical matters. A very large noise-to-signal ratio. Plus, one or two of the technical threads evolved into Open Threads.

    [Response: There's no "tactic" or "strawman" at all. I suspected she was not aware of the equivalence (but knew she was aware that they have identical autocorrelation structures). She has stated that she was aware. That's all.]

  • lucia // September 19, 2008 at 7:45 pm

    I guess I should clarify further. I was aware they were equivalent. The reason I was aware of the equivalence is that I discussed expressing the process as ARMA or “AR + White Noise” with an econometrician last February, asking why they don’t use these other interpretations. I even scrawled some equations on scrap paper.

    Obviously, those ARMA processes that exhibit negative loops in the autocorrelation or which have positive zero intercepts for Ln(rho) vs lag can’t be described as AR(1)+Noise. So, ARMA is a broader class and as such has some advantages.

    Despite that, I like AR(1)+Noise formulation for the GMST data for a number of reasons. One is when you happen to run the numbers you can compare the magnitude of “white noise” to the size of the 95% confidence intervals for measurement uncertainty in monthly observations GMST suggested by Hadley.

    This comparison can’t be done as easily using ARMA(1,1).

    For what it’s worth, I approve of blogs having open threads. It’s true there is lots of noise on open threads. But so what? It’s a blog.

  • Leif Svalgaard // September 20, 2008 at 10:43 am

    fix the definition of lambda. You don’t mean to have the second equal sign and what follows it in the definition.

    [Response: Done.]

  • Christopher // September 21, 2008 at 4:54 am

    I’ve really enjoyed these posts; I like the “methods” aspect of these and the PCA stuff some months ago. But I’m somewhat confused. I always thought that any trend in a time series invalidated stationarity, and that you need stationarity to use the ARMA paradigm. I’m sure I’ve missed something here, but I’m not sure what.

    [Response: We apply an additive model, in which the data (global temperature) is the sum of a signal (which is responsible for the trend) and noise. We model the signal from 1975 to the present as a straight line. The residuals from that model give an estimate of the noise. We model the noise, and only the noise, as ARMA(1,1).]

  • Cthulhu // September 21, 2008 at 9:51 pm

    For the same data (GISS 1975 to Jul 2008) I get
    autocorrelation(1) = 0.5939
    AR Parameter = 0.8733
    MA Parameter = -0.4704
    White noise parameter = 0.1135

    hadcrut3 I get
    AR Parameter = 0.9097
    MA Parameter = -0.4117
    White noise parameter = 0.0762

    Very rough estimates though.

    From generating 1000 century long periods with 0.18C/decade trend and ARMA(1,1) noise following the Hadcrut parameters above, every one contained at least one 6 year period with a negative temperature trend (OLS).

    998 contained at least one 10 year period with a negative temperature trend.

    369 contained at least one 15 year period with a negative temperature trend.

    20 contained at least one 20 year period with a negative temperature trend.

    To see what difference volcanic influences have, I removed the years 1981-1986 and 1992-1994 from the hadcrut dataset. Surprisingly, the parameters didn’t change much; they’re still very much the same:

    AR Parameter = 0.9171
    MA Parameter = -0.4111
    White noise parameter = 0.0763

    I also attempted to reduce volcanic influences by fixing the residuals for the 1982-1987 and 1991-1996 periods to zero. Hardly the best way of doing it, but if anything it probably overcompensates.

    Sorry about the plot quality, no axis but it’s hadcrut3 starting in 1975, you can make out 1998 and the “volcano years” that have been reduced to trend.

    AR Parameter = 0.9325
    MA Parameter = -0.4882
    White noise parameter = 0.0622

    Using these parameters as before every one of the 1000 century long periods generated contained at least one 6 year period with a negative temperature trend.

    973 contained at least one 10 year period with a negative temperature trend.

    232 contained at least one 15 year period with a negative temperature trend.

    5 contained at least one 20 year period with a negative temperature trend.

    This was quite rushed though, there might be errors in some or all of this.
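    A rough sketch of this kind of experiment (Python, standard library only; the names and the yearly window step are my own simplifications, and exact counts will vary with the random seed):

    ```python
    import random

    # Simulate a century of monthly ARMA(1,1) noise with the HadCRUT-based
    # parameters quoted above, add a 0.18 C/decade trend, and ask whether
    # any 10-year window has a negative OLS slope.
    PHI, THETA, SIGMA_W = 0.9097, -0.4117, 0.0762
    TREND = 0.18 / 120  # degrees C per month

    def simulate(n_months, rng):
        x, w_prev, arma = [], 0.0, 0.0
        for t in range(n_months):
            w = rng.gauss(0.0, SIGMA_W)
            arma = PHI * arma + w + THETA * w_prev  # x_t = phi*x_{t-1} + w_t + theta*w_{t-1}
            w_prev = w
            x.append(TREND * t + arma)
        return x

    def ols_slope(y):
        n = len(y)
        tbar = (n - 1) / 2
        ybar = sum(y) / n
        num = sum((t - tbar) * (v - ybar) for t, v in enumerate(y))
        den = sum((t - tbar) ** 2 for t in range(n))
        return num / den

    def has_negative_window(x, window):
        # step by 12 months rather than 1 to keep the sketch fast
        return any(ols_slope(x[i:i + window]) < 0
                   for i in range(0, len(x) - window + 1, 12))

    rng = random.Random(0)
    series = simulate(1200, rng)  # one century of monthly data
    ```

    Repeating this 1000 times and tallying windows of 6, 10, 15, and 20 years would reproduce the kind of counts reported above.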

  • steven mosher // September 28, 2008 at 12:23 am


    Nice work,

    But for the periods where volcanoes impacted the climate, you can look at GISS forcings or Mitchell and find more periods to remove than those you did.

    Both shown here.

    The unfortunate thing is there are not more years in the past 100 when volcanic activity was not influencing the climate. Have a look at Mitchell’s chart. Finally, as we all know from a recent posting on RC, the GMST dip post-war may be due to collection methodology (for SST only). Anyways, nice approach.

  • Hank Roberts // September 28, 2008 at 6:17 pm

    > quota of excessively mathematical posts

    Your quota has been increased to 700 billion.
    Please continue. It may be beyond my ability to do the math but I learn a lot from reading what you write — I always learn at least a little math, and how to read it — and also learn from what others who understand it write in comments.

    Please keep the level high. You attract people who are way smarter than me who write thoughtful responses.

    It’s rare. Continue.

  • Dave A // September 29, 2008 at 10:32 pm

    “you spend a lot of time working on something, and you are really trying to do the best job you can of simulating what happens in the real world. It is easy to get caught up in it: you start to believe that what happens in your model must be what happens in the real world. And often that is not true…The danger is that you begin to lose some objectivity on the response of the model and begin to believe that the model really works like the real world…”

Leave a Comment