Open Mind


February 4, 2010 · 21 Comments

First of all, thanks to Scott Mandia for the link to the story of the duck. That sums up what denialists are doing. Plain and simple.

The very first data record in the GHCN (using the raw data, not the adjusted) is for Skikda in northeast Algeria near the Mediterranean Sea.

Actually, the GHCN starts with three data records for Skikda, covering different time spans. But the three have some overlap, so we can check whether they align with each other. Plotting the first in red, the second in green, the third in blue, we get this:

We can see that the match between different records is quite good. In fact it’s so good that most of the data points overlap, obscuring each other. This is also evident if we zoom in on the time span of the overlap:

To make the separate records distinguishable, let’s take the 2nd (green) and 3rd (blue) and reduce them by 2 deg.C:

Plainly the different records tell the same story, although there are some differences. We can easily compute the differences between the data records, and plot them (set 2 minus set 1 in green, set 3 minus set 1 in blue):

Most of the differences are zero, although each record has a single month with a very large difference. I intend to apply (to the whole GHCN) an automated routine to combine different data sets for the same location into a single record, so I’m not going to agonize over which is more correct. I’m just going to use the 1st set of data as a “reference” in order to adjust the other two to match. To adjust the other sets, I’ll offset them by the average difference between them. Then I’ll average the resultant records to get a single data record for Skikda — and here it is:

There’s a strong seasonal cycle in Skikda, with about 13 deg.C difference between winter and summer. We can remove the seasonal pattern by computing anomalies (relative to a baseline 1951-1980):

I’ve added a lowess smooth to the data. A recent warming is plainly evident. Also (although it’s not evident from these graphs) for Skikda the summer has warmed more than the winter, so the size of the seasonal cycle has increased. We can see the same thing in 1-year averages for Skikda:

And the trend is rather blatant in a graph of 5-year averages:

The aforementioned method for combining different data records is “optimal” in the sense that the offset for a given section minimizes the sum of the squared differences between records. For Skikda it’s easy because all the overlap happens between only two records at a time — the overlap sections themselves don’t overlap.

It can happen that, for some time span, there’s overlap for not just two records but for three or more. That doesn’t happen for Skikda, but it does for some locations. I’ve written a routine which does the following: the 1st record is used as a reference. Then the 2nd record is offset by the average difference between it and the reference. Then those records are averaged to produce a new reference. Then the 3rd record is offset by the average difference between it and the reference, then those records are averaged to produce a new reference. Etc. etc. lather rinse repeat.

This is certainly a good way to combine records, and when there are no times at which 3 or more records overlap it’s the “optimal” way. But when there are times with overlap between 3 or more records it’s not optimal. In such cases we can define the offsets by minimizing the sum of the squares of the differences between the offset records, together with the constraint that the 1st record is not offset (otherwise the system is undetermined).

I doubt this refinement is necessary. But when I get around to it, I’ll program that and test it out on some cases with triple/quadruple/etc. overlap, just to see how big a difference it makes. I expect it’ll be tiny — but only time will tell.

Then it’ll be time to combine records from different locations in order to be able to form a “gridpoint average.” Again, one can use some record as a reference station to compute an offset for successive records, and add them one at a time. Again, that won’t be optimal, but again, one can define offsets which minimize the sum of squared differences. And again, it’ll be interesting to see how big a difference it makes applying the optimal method, as opposed to applying the add-one-at-a-time procedure.

Computing gridpoint averages is crucial, because if we want to know what’s happening worldwide (and we do) it’s essential to compute an area-weighted average; anything else gives undue weight to regions with large numbers of stations. That’s particularly true of the U.S., which accounts for about a quarter of all station records but only covers a small fraction of that in terms of land area.

In the meantime we can say this about Skikda: it’s gotten warmer. By about 2 deg.C.

Categories: Global Warming

21 responses so far ↓

  • Chad // February 4, 2010 at 10:07 pm | Reply

    I was working on something similar in terms of combining overlapping records. I was using the first difference method. But the method you’ve laid out is much easier to implement (same as GISS no?). Mine was overkill.

    [Response: It's based on the GISS method; I'd have to look at their documentation more closely to tell whether it's exactly the same. The overall results may not be identical but should be comparable.]

  • dko // February 4, 2010 at 10:11 pm | Reply

    Thanks, Tamino. I look forward to more of this GHCN analysis!

    [Response: This is just the beginning; there's lots more to come.]

  • Nick Barnes // February 4, 2010 at 10:18 pm | Reply

    The combining procedure you describe is essentially the first part of the combining done by GISTEMP (although computing the offset between two different series is done via annual anomalies, and combining more than two series uses a weighted average). See comb_records() and friends, in of ccc-gistemp (although there is quite a bit more clarification to be done on this section of code).

    Subsequent combining (comb_pieces() in the same file) addresses multiple series for the same station which do not have so much overlap.

    • John Goetz // February 5, 2010 at 4:53 pm | Reply

      Another difference – due to the fact that GISS will use the annual values – is that Tamino’s method does not involve (best I can tell) estimating missing values. This is a good thing.

  • Zeke Hausfather // February 4, 2010 at 10:27 pm | Reply


    Any chance this exercise will peek into some of the stations with GSN data that isn’t making it into GHCN for one reason or another? Carrot Eater and I have been digging into it a bit recently, and a number of places highlighted by skeptics (N. Canada and Bolivia, for example) have started reporting data to the WMO again in the last few years.

    [Response: Can't say yet. It depends on a lot of things, including how much time and effort goes into just examining the GHCN. I hope we can shed some light on a lot of things.]

  • Michael Hauber // February 4, 2010 at 11:41 pm | Reply

    I don’t think different data records in GHCN correspond exactly to actual stations. I noticed this when Darwin hit the big time as a talking point, and I compared GHCN data to the Australia weather bureau data. As a side issue, on Darwin it is amusing and sad to note that the original BOM data shows a warming trend for the most recent Darwin station, and no trend for a different station which ceased about 1940. The final GISS data shows a cooling trend for Darwin.

    In particular for the Green station I would bet good internet reputation points that this is made of three separate stations, which I would not consider to be directly comparable.

    For the very good agreement in the overlap I’ll make another guess that this is really because its exactly the same station, which GHCN has collected via different intermediate sources. And that a few differences have been introduced due to typos, or maybe some processing such as different rules for filling in blank data.

  • carrot eater // February 4, 2010 at 11:51 pm | Reply

    I guess at some point you’ll have to decide whether to re-create the wheel, invent a whole new better wheel, or just kick the tyres a bit. This could be a huge project, if you want it to be. Sounds like you’re headed towards the GISS method for the spatial average as well, not just the duplicate combination.

    Perhaps its time to update the Hansen 1987 study on the correlation between station pairs and the distance between them. The linear weighting to 1200 km is somewhat arbitrary; one could use some model results to provide a perfect field to test methods for gridding and spatial averaging.

    Good luck; I’m eager to see what you make of it.

  • Chad // February 5, 2010 at 12:33 am | Reply

    Perhaps its time to update the Hansen 1987 study on the correlation between station pairs and the distance between them. The linear weighting to 1200 km is somewhat arbitrary; one could use some model results to provide a perfect field to test methods for gridding and spatial averaging.

    The 1987 paper used a relatively course GCM to get error estimates on the global mean using the methods they described. There are a few modern GCMs that have much higher resolution so it may be worth applying the same methodology and re-evaluate their conclusions. I think I just found my next project.

  • Chad // February 5, 2010 at 12:46 am | Reply

    The GISS method sorts the duplicates in order of record length and that’s the order in which they are combined.

    • drj11 // February 5, 2010 at 9:55 am | Reply

      @Chad: Yes. Starting the “best” record, which is the one that came from Monthly Climatic Data for the World, if that’s available. As Nick B says, see and in particular the functions comb_records and get_best from our Clear Climate Code project.

    • John Goetz // February 5, 2010 at 4:50 pm | Reply

      As drj11 says, if there is an MCDW station in the record GISS starts with that. Most non-US stations have such a record, and all such records begin in 1987 or later. Interestingly, this means MCDW are often the shortest records available for a station. The net result is that GISS tends to combine records from newest to oldest regardless of length, because as records are combined the combined record tends to be longer than what remains to be combined. Not 100% true, but true more often than not.

  • carrot eater // February 5, 2010 at 3:00 am | Reply

    Chad, yes, that’s what I had in mind. I re-read Hansen/Lebedeff recently and was taken by just how much was in that thing. That aspect could be usefully revisited, though there have been spatial degrees of freedom analyses since then.

    Can anybody point me to a description of how the NCDC and CRU come up with their spatial averages? I understand there are some differences from GISS in this.

  • billy t // February 5, 2010 at 4:14 am | Reply

    It would be an interesting exercise to see the results of doing your combining exercise on the 1-degree offset data (ie after you’ve added in an artificial station move offset). On first reading I thought that’s what you did, but then I realised that the graph of the offsets between the records is zero.

    Presumably, you will get exactly the same final result if the original 3 records have different offsets on them?

  • Chad // February 5, 2010 at 8:42 am | Reply

    I looked at Skikda and was able to replicate your figures for the temperature anomaly. But I couldn’t smooth the data because there’s lots of missing data. I see that there are straight line segments covering the missing data. Wouldn’t it be more appropriate to just let those remain as missing values? If you take this a step further and combine this series with another station’s combined series, the straight line segments could create spurious effects on the resulting anomalies and trend calculations. I understand that this is a first shot at the problem so I may be jumping a bit to ahead of you on this.

  • Berbalang // February 5, 2010 at 2:30 pm | Reply

    I love the Duck story! It pretty much has the denier tactics and the media pegged!

  • doskonaleszare // February 5, 2010 at 5:31 pm | Reply

    @carrot eater
    You can find nice discussion of GISS, NCDC and CRU methods here:

    Jones et al 1999
    Peterson et al 1998
    Jones and Moberg 2003
    Brohan et al 2006

    You can also check CRU perl scripts published here:

    As far as I understand, CRUTEM3 (without variance adjustements) is pretty simple: first it calculates anomalies for each station wrt 1961–1990 climatology, then averages all anomalies in each 5×5 gridbox, and calculates hemispheric and global means.

    Perl implementation of the above procedure seems to replicate CRUTEM3 pretty well with 1500 and 3000 stations, so I guess it *is* the CRU code (or reimplementation of it).

    JMA uses CAM too, with 1971-2000 normals.

  • carrot eater // February 5, 2010 at 8:32 pm | Reply

    doskonaleszare: Thanks; I’d come to that conclusion as well – they calculate each station’s anomaly on its own, then average the stations that fall within each grid box. As opposed to calculating the anomaly at a grid point through interpolation.

    So GISS lets you use more stations (the stations don’t need to have data within a fixed baseline), and their interpolation lets you cover more area.

  • AndyB // February 5, 2010 at 9:29 pm | Reply

    I know y’all are familiar with Clear Climate Code. I love the idea of having the same in R.

  • Eli Rabett // February 5, 2010 at 9:57 pm | Reply

    What the hell happened in summer ~1976 (or 1977)

    [Response: Missing data. The graph is set to connect points and lines, and the absence of points in that time leads to an errant line.]

    • carrot eater // February 6, 2010 at 12:19 am | Reply

      It’s the year Algeria went without a summer.

      Anyway, if I were working manually, I’d treat the >1966 data and 1930s data as if they were two different stations, since there’s no guarantee of any continuity with a gap like that. Is there? Using anybody’s method, the 1930s fragment would then end up being discarded.

      [Response: It occurred to me also that perhaps they should be treated as different records. Perhaps as the project progresses, I'll get a more reliable answer.]

Leave a Comment