Open Mind

Alphabet Soup, part 2: MA

September 11, 2008 · 3 Comments

In a previous post we looked at a model for random processes, the “AR” (autoregressive) model. Let’s take a look at another random process, the “MA” (moving average) model.

Many of us are familiar with moving averages as a way of smoothing data. In its simplest form, given a time series w_n we take the average of a number of consecutive values of a time series, centered at a particular point, to generate the moving-average value at that point. For simplicity, let’s take the average of 2k+1 consecutive values: the central value, plus k to the left and k to the right, to create the smoothed series x_n:

x_n = {1 \over 2k+1} \sum_{j=n-k}^{n+k} w_j.

Moving averages are a fine way to smooth a time series. But it’s important to keep in mind that two consecutive smoothed values have a lot of data points in common; the values x_{n-1} and x_n are based on entirely the same set of original values, except for w_{n-k-1} which forms part of the average for x_{n-1} but not for x_n, and w_{n+k} which forms part of the average for x_n but not for x_{n-1}. Therefore the process introduces autocorrelation in the smoothed series.

Suppose the original time series is a random process, and the expected value of any given datum w_n is zero, i.e., it’s a zero-mean random time series. Then the autocovariance at lag j is the expected value of a data point with its lag-j (j time steps preceding) value (the angle brackets denote the expected value of the enclosed quantity)

\gamma_j^{(w)} = \langle w_n w_{n-j} \rangle.

The autocovariance at lag zero is just the data variance

\gamma_0^{(w)} = \langle (w_n)^2 \rangle = \sigma_w^2

Now suppose that our original series is white noise, which means that different values are uncorrelated. In this case all the autocovariances are zero, except the lag-zero autocovariance which is the data variance

\gamma_j^{(w)} = 0, for j \ne 0.

If j=0, then the autocovariance at lag zero is just the variance of the white-noise series

\gamma_0^{(w)} = \langle (w_n)^2 \rangle = \sigma_w^2,

Therefore the autocorrelations are also all zero, except the lag-zero autocorrelation which is always equal to 1:

\rho_j^{(w)} = \gamma_j^{(w)} / \gamma_0^{(w)} = 0, for j \ne 0.

Now let’s consider our moving-average time series. We have

\gamma_j = \langle x_{n-j} x_n \rangle.

We can substitute the values for y to get

\gamma_j = \langle 1/(2k+1)^2 \sum_{a=n-j-k}^{n-j+k} w_a \sum_{b=n-k}^{n+k} w_b \rangle.

When we evaluate the double sum, we’ll have a large number of terms which are products of two values of w. If the w-values are different, then the expected value of their product will be zero because all the w values are uncorrelated (it’s white noise!). But if the term is of two w-values which are the same, then the expected value of the product is just the variance of the white noise, \sigma_w^2. If j is less than 2k+1, then there will be exactly 2k+1-j such nonzero terms, but if j is greater than or equal to 2k+1 then there won’t be any terms at all with the same w values. The total autocovariance is therefore the number of nonzero terms times the expected value of each nonzero term, or

\gamma_j = (2k+1-j) \sigma_w^2 / (2k+1)^2, for j < 2k+1, otherwise zero.

These are the autocovariances; the autocorrelations are therefore

\rho_j = \gamma_j / \gamma_0 = (2k+1-j) / (2k+1).

From this we see that the autocorrelation decreases steadily until lag j = 2k+1, when it becomes zero, and is zero for all higher lags. I generated 1000 data points of a white-noise series, then computed a series of 3-point moving averages (which includes the central value and k=1 points before and after), and computed the sample autocorrelation of this series, which is graphed here:

The dashed lines give the 95% confidence limits of no autocorrelation. Clearly the autocorrelation decreases steadily, reaching zero (within its error limits) at lag 3, and is zero (within its error limits) thereafter.

This is the usual definition of a moving average. It computes an average for the time index at the center of the range of values being averaged. One can also, if one wishes, assign the averaged value to the end time of the range of values being averaged (this is what ExCel does, and I find it a bit annoying but so be it). One can also compute a weighted moving average, in which the values being averaged aren’t all weighted equally. In fact it’s not uncommon to use a Gaussian weight function, so values at the center of the range are given greater weight and values far from the center less weight. All these are examples of moving averages.

If one computes a moving average from a finite number of terms, and the values being averaged are a white-noise process, and there are a finite number of values being averaged at each step, and the average is assigned the time index of the final point being averaged, and the weight given to the final point is 1, then we have an example of a moving average process. We can write it as

x_n = w_n + \theta_1 w_{n-1} + \theta_2 w_{n-2} + ... + \theta_q w_{n-q}.

The process w_n is a white-noise process. Note that the coefficient of w_n is one, that there are a finite number (q) of terms, and that the time index of the average is the same as that of the “most recent” value entering into the average. Since there are q+1 terms in the average (from w_n to w_{n-q} in reverse order), this is a moving-average process of order q, or an MA(q) process.

This is not too dissimilar from the AR(p) process we looked at before. Each new x value was a white-noise term (w_n) plus a linear combination of preceding x values, as much as p (the order of the AR process) time steps preceding the new value. Now we have that each new x value is a white-noise term (w_n) plus a linear combination of preceding white-noise terms, as much as q (the order of the MA process) time steps preceding the new value.

We can analyze the autocorrelation structure of the MA(q) process, in much the same way we did for simple moving averages. If we do so, we’ll find that just as before, when we get beyond a certain lag value the autocorrelation is zero. In fact, for an MA(q) process the autocorrelation is zero for all lags greater than q.

In fact the moving average actually is also a moving-average process. Consider the 3-point moving average shown above. This was

x_n = (w_{n+1} + w_n + w_{n-1})/3.

Now define a new white-noise process

\tilde w_n = w_{n+1} / 3.

Then we see that

x_n = \tilde w_n + \tilde w_{n-1} + \tilde w_{n-2}.

Since \tilde w_n is a white-noise process, there are a finite number of terms, and the coefficient of the most recent term is one, we have a moving-average process. It goes as far as 2 time steps prior to the current time index, so in fact it’s an MA(2) process. I’ll take this opportunity to correct a mistake I made in recent comments, where I said that a k-point moving average has MA(k) structure; in fact it actually has MA(k-1) structure.

The fact that the autocorrelation function (ACF) is zero for lags greater than the order of the process is a sign that the process is MA. We noted something similar for AR processes, that the partial autocorrelation function (PACF) is zero for lags greater than the order of the MA process. If you study the autocorrelation and partial autocorrelation functions, and one of them doesn’t just “tail off” gradually, but ends (by which I mean, goes to zero) abruptly, that’s a sign of a pure AR or pure MA process. If it’s the ACF that ends abruptly, the process is MA and the lag for the final nonzero autocorrelation is the order of the MA process. If it’s the PACF that ends abruptly, the process is AR and the lag for the final nonzero partial autocorrelation is the order of the AR process.

A random process doesn’t have to be just AR or MA. It can be both, consisting of a white-noise term (w_n) plus a linear combination of preceding x values up to order p, and preceding w values up to order q. Then we have

x_n = w_n + \sum_{j=1}^p \phi_j x_{n-j} + \sum_{k=1}^q \theta_k w_{n-k}.

This is the general form of an ARMA(p,q) process. For an ARMA process, neither the ACF nor the PACF will end abruptly, values will decay slowly for all lags. So we can’t tell just from the ACF or PACF what the orders of the process are. Instead we can form hypotheses, estimate models for each hypothesis, and test their suitability using some criterion (like the Akaike or Bayesian information criterion).

The residuals from fitting a trend to global average temperature data, at monthly time resolution, are reasonably well approximated as an ARMA(1,1) process. The match isn’t perfect, but it’s pretty good. In an upcoming post, I’ll generate some artificial ARMA(1,1) data, add a trend to it, and we’ll look at how it behaves. In doing so, we’ll gain some insight into how monthly average global temperature can be expected to behave.

And of course we haven’t exhausted the possibilities of alphabet soup either! So there’s more to come on more versions of time series processes.

Categories: mathematics

3 responses so far ↓

  • swade016 // September 12, 2008 at 1:53 pm

    Is the reason they monthly approximations are only “pretty good” is due to temperature being non-stationary? If you were to switch to yearly, would the ARMA(1,1) become better since the “seasonal” component would be minimized?

    Still trying to catch up on time-series stuff, so forgive me if this is a stupid question.

    [Response: The non-stationarity of the temperature is included in the trend; presumably the difference between the trend and the data (the residuals) is stationary. The ARMA(1,1) model applies to the residuals only. It's possible that the residuals themselves are non-stationary, but I don't see any evidence of that.

    I believe that the residuals are not exactly an ARMA(1,1) process primarily because there's some very elaborate physics behind global climate, and such processes tend to induce a complicated autocorrelation structure, so the autocorrelations don't exactly follow the behavior \rho_j = \lambda \phi^j (the signature of an ARMA(1,1). However, they're very close to following that formula. Yet for most models, it's impossible to confirm or deny them with certainty because we can only estimate the autocorrelations, and as the lag gets very large they all become statistically indistinguishable from zero. Also, there's no "perfect" estimator of autocorrelation from observed data; the most common (Yule-Walker) estimate is biased, and even bias-corrected estimates are only unbiased to 1st order.

    We can say with certainty that it's not AR(1).]

  • Gavin's Pussycat // September 12, 2008 at 5:57 pm

    This is fun.

    Two questions:

    1) is the reason that monthly mean temp data is ARMA(1,1) that it is actually constructed by one-month averaging (the MA(1) part) on “point data” (daily means) that are close to AR(1)?

    [Response: No. If you take monthly averages of AR(1) daily data, then you will get autocorrelations of the form \rho_j = \lambda \phi^j, but the factor \lambda will be greater than 1. For GISS monthly data the factor \lambda is less than 1.]

    2) In an earlier post (”To AR(1) or not to AR(1)” IIRC) you used an autocovariance expression of alpha*rho^j. Am I right in recognizing your above mentioned “signature” for ARMA(1,1)?

    [Response: Yes indeed.]

  • Gavin's Pussycat // September 13, 2008 at 6:58 am

    OK, thanks. That means then that daily global mean temp data are not AR(1), right? So… what is it?

Leave a Comment