Open Mind

The Power of Large Numbers

July 5, 2007 · 26 Comments

A commenter (“Citizen”) on another post opines that because thermometers are imprecise, and not continually calibrated, it’s not possible to discern temperature changes precisely enough to support the temperature increase claimed by climate researchers. In fact, he (she?) states:

Do you think that continually averaging imprecise information makes that information more precise? Don’t you think that’s a stretch? It’s not rational you know.

It’s certainly not intuitive. But in fact it is true.


Let me give an example. For over a century, amateur astronomers have observed variable stars (stars which change brightness over time) and reported their observations to the American Association of Variable Star Observers (AAVSO). Most of the over 10 million observations in the database are brightness estimates by eye. There are no fancy photometers, no CCD cameras, no photographs, just an observer using a pair of eyes. The observer looks at the star, compares it to other stars of known brightness (comparison stars), and “guesstimates” what the brightness of the target star is.

Measuring brightness of stars is the process called photometry. Doing it by eye is visual photometry. Astronomical brightness is measured on a scale called magnitude, and the human eye is definitely a low-precision photometer. The probable error in a visual brightness estimate is 0.2 to 0.3 magnitude; it can easily be 0.5 magnitude or more. For this reason, visual observers generally don’t bother reporting estimates to more than one digit after the decimal; the resolution of visual estimates is generally 0.1 magnitudes.

On the face of it, one would probably not believe that visual estimates, precise only to 0.2-0.3 magnitude and reported only to the nearest 0.1 magnitude, could detect features in the brightness evolution (the “light curve”) of a variable star smaller than that. However, it turns out to be a fundamental property of statistics that the average of a large number of estimates is more precise than any single estimate. The more data that go into the average, the more precise the average becomes — even though the source data are all imprecise.
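To make this concrete, here is a minimal Python sketch (with made-up numbers, not AAVSO data): it simulates visual-style brightness estimates with a probable error of about 0.25 magnitude, rounds them to the 0.1-magnitude reporting resolution, and shows the error of the average shrinking roughly as 1/sqrt(N).

import numpy as np

rng = np.random.default_rng(42)
true_mag = 6.43      # hypothetical "true" brightness, in magnitudes
sigma = 0.25         # assumed error of a single visual estimate
resolution = 0.1     # visual estimates are reported to the nearest 0.1 magnitude

for n in (1, 10, 100, 1000, 10000):
    # n independent noisy estimates, rounded to the reporting resolution
    estimates = np.round((true_mag + rng.normal(0, sigma, n)) / resolution) * resolution
    mean = estimates.mean()
    print(f"N = {n:5d}   mean = {mean:.4f}   error of the mean = {abs(mean - true_mag):.4f}")

With ten thousand such estimates the error of the average drops to a few thousandths of a magnitude, far below both the 0.1-magnitude reporting resolution and the 0.25-magnitude error of a single estimate.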

A good example is the variable star X Cygni. It’s a Cepheid-type variable, a pulsating variable which regularly repeats a cycle of brightness changes every 16.4 days. This star has been observed visually for many years by amateurs, who have contributed thousands of observations. Professional brightness estimates (using photoelectric photometers), each far more precise, number barely over 100.

One of the most prolific observers of X Cygni was Wayne Lowder (observer ID “LX”), former president of the AAVSO. Taking his (several thousand) visual observations throughout the cycle of the star, and averaging them, gives an average light curve based on visual data. We can also plot the photoelectric photometry throughout the cycle to produce a light curve based on high-precision instruments. Then we can compare the two. Plotting the photoelectric photometry in red, and the average of visual data as black squares, we get this result (click the graph for a clearer view):

[Figure (fig22.jpg): the light curve of X Cygni, with photoelectric photometry in red and the averaged visual estimates as black squares.]

There are clear features in the visual-average light curve that are considerably smaller than 0.1 magnitude. For example, around phase 0.8 of the cycle, there’s a “bump” so that the brightness decreases by about 0.04 magnitudes. In fact there are many small features in the visual light curve whose size is smaller than the limiting resolution of 0.1 magnitudes, and far smaller than the basic precision of a single visual observation, 0.2-0.3 magnitude.

That these features are real, not accidental errors, is attested by the fact that the same features are seen in the light curve from photoelectric photometry. There is a noticeable offset between the photoelectric photometry and the visual: the two light curves don’t have the same zero point. But they do show precisely the same changes. It turns out that the fundamental precision of the light curve from the average of visual observations is about 0.02 magnitudes — more than ten times better than the precision of any single visual observation. For determining the changes in brightness of the variable star X Cygni, the visual data, because of the large number of observations available, give a result every bit as precise as that from photoelectric photometers attached to professional observatory telescopes.
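For readers curious how such an average light curve is constructed, here is a rough sketch of the idea (the observation arrays are simulated stand-ins, not Wayne Lowder’s actual data, and a sine wave stands in for the real, asymmetric Cepheid light curve): each observation time is folded by the 16.4-day period into a phase between 0 and 1, and the visual estimates are averaged within phase bins.

import numpy as np

period = 16.4        # pulsation period of X Cygni, in days
epoch = 2440000.0    # arbitrary reference Julian Date (hypothetical)

# Simulated stand-ins for observation times (JD) and visual magnitude estimates
jd = epoch + np.sort(np.random.uniform(0, 2000, 5000))
mag = 6.4 + 0.5 * np.sin(2 * np.pi * (jd - epoch) / period) + np.random.normal(0, 0.25, jd.size)

# Fold each observation onto a phase in [0, 1)
phase = ((jd - epoch) / period) % 1.0

# Average the estimates within 50 phase bins to get the mean light curve
edges = np.linspace(0, 1, 51)
index = np.digitize(phase, edges) - 1
mean_curve = np.array([mag[index == b].mean() for b in range(50)])

# For display, the folded curve is conventionally plotted twice in succession,
# so that no part of the cycle is broken at phase 0.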

This is a powerful illustration of one of the most useful, and most basic, facts of statistics: the statistical power of large numbers. For decades, professional astronomers looked askance at the visual observations made by amateurs because of the low precision of a single visual observation. But over the last decade or so, as more and more researchers have tapped into the vast number of low-precision visual observations available from amateur observers and published important scientific discoveries based on them, the professional community has begun to appreciate the astounding precision that can be attained from so many imprecise estimates. Amateur observations of variable stars are finally getting the credit they’re due, and the world of variable star astrophysics is benefitting as a result.

The same thing applies to temperature measurements. Thermometers are generally read only to the nearest degree. But the monthly average temperature is the average of 30 daily estimates; the annual average is the average of 365 (or 366) daily measurements. The trend estimated at a single location with a century of daily measurements benefits from the information contained in 36,525 daily measurements. And the global average temperature trend is based on data from several thousand different locations around the earth. The statistical power of large numbers enables us to discern, with precision, trends and changes which are far smaller than those that can be measured on a single day with a single thermometer.
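As a back-of-envelope illustration only (assuming, purely for argument’s sake, a half-degree error on each reading and independent errors), the familiar sigma/sqrt(N) rule gives:

import math

sigma = 0.5    # assumed error of a single thermometer reading, in degrees
for label, n in [("one month", 30),
                 ("one year", 365),
                 ("a century at one station", 36525),
                 ("a century at 2000 stations", 36525 * 2000)]:
    print(f"{label:27s} N = {n:>9d}   uncertainty of the mean ~ {sigma / math.sqrt(n):.4f} deg")

Real measurement errors are not perfectly independent, so the actual gain is smaller than this idealized calculation suggests, but it shows why averages and trends can be far more precise than any single thermometer reading.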

Categories: Global Warming · astronomy · climate change

26 responses so far ↓

  • stewart // July 5, 2007 at 12:49 pm

    Another example (in my field) is IQ scores. IQ scores are precise to only about +/- 4 points, but the recent birth order study was making a big (and overstated) fuss about a 3 point difference. Is it real? Absolutely. Is it meaningful? Not for the person, but it may affect theory, and our understanding of the basic processes.

    This is why one single measurement or event is not ‘proof’ of climate change. We need to combine measurements, look at trends, and see how they compare with expectations. Then, seemingly smaller changes can be much more powerful evidence than a few large changes. The few ‘honest skeptics’ I’ve read seem to have a poor knowledge of statistics, and don’t understand things such as standard errors or the law of large numbers.

  • the power of statistics « 95_Theses // July 5, 2007 at 1:09 pm

    […] on the statistical power of large numbers, according to this explanation, we could add up all the other occurrences of equally stupid operations and come to the very […]

  • J // July 5, 2007 at 1:54 pm

    This is all well and good, but to some degree you’re taking advantage of “Citizen’s” ignorance. The comment you’re responding to doesn’t seem to display awareness of the distinction between *precision* and *accuracy*. You’re addressing the issue of precision, but not the possibility of systematic biases in the data.

    Imagine a hypothetical case where two measurement techniques, A and B, provide biased estimates of the true temperature T in a given location. Say that T is stable at 300K; individual measurements with technique A result in a mean of 299.5K and a standard deviation of 2K, while measurements with B yield a mean of 300.5K and a standard deviation of 1K.

    Now, suppose that the difference in variance between A and B is a result of technological improvements in temperature measurement, and that method B gradually comes to supplant method A. At time t1, when 100% of the measurements are made with technique A, our estimate of T is something like 299.504K plus or minus 0.01K. At time t2, when 100% of the measurements are made with technique B, our estimate of T is perhaps 300.4997K, plus or minus 0.005K. During the interval, when A is being phased out and B is being phased in, there is a gradual increase in the estimate of T.

    The end result (if anyone is still reading this…) is that we have a completely spurious 1K warming trend, despite Tamino’s (correct) assurance that the confidence interval around the mean at both t1 and t2 is quite small, thanks to the power of large numbers.

    Obviously there are many ways of addressing this, but it all gets considerably more complicated. This is why I accused you (in a friendly way, of course) of taking advantage of “Citizen’s” ignorance. More-informed skeptics won’t be disputing the fact that a large sample size can produce a more precise estimate of the mean; they’ll assert that systematic biases in the measurements haven’t been addressed.

    [Response: You’re quite right that “Citizen” seems not to know the difference between precision and accuracy. I was responding to his comment in my blog (quoted above), asking “Do you think that continually averaging imprecise information makes that information more precise?” In fact it does — but clearly he seems not to understand this either (and still doesn’t, based on his latest blog post).

    You are correct that if a change of instruments occurs, and there exists a bias between the two instruments, then the change will introduce a spurious trend into the analysis. What it seems you don’t know is that temperature time series, as analyzed by GISS, take this into account. As is well documented in Hansen et al. 1999 and Hansen et al. 2001, factors which can introduce such a bias (such as instrument changes, station moves, time-of-observation changes) are searched for, both in the data themselves (the sudden introduction of a bias causes a noticeable “step change” in the data) and in available metadata (written records of instrument changes/station moves/etc.).

    When identified, the bias between data series is estimated and a correction applied, so that the separate sections of data are brought onto the same scale. Then we can reliably analyze the data for trends which are truly part of the signal rather than instrumental/other biases.

    Another thing that most denialists don’t want to think about (or at least, never talk about) is the fact that such biases go both ways. Denialists always imply that biases introduce a false warming trend into the data, so the real trend is less than estimated. In fact in most cases they can equally well introduce a false cooling trend. As the data are examined more correctly and more erroneous factors are corrected for, it’s just as likely that the trend in temperature will be found to be even greater than currently estimated, rather than less. An exception is time-of-observation bias; U.S. stations have consistently migrated from morning/afternoon observations to maximum/minimum recording systems, not the other way, and this migration tends to introduce a false cooling trend. Fortunately, this too is corrected for in the GISS analysis.]
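    A quick simulation of both points, with entirely hypothetical numbers loosely following J’s scenario above (this is not the actual GISS procedure, just a sketch of estimating and removing a documented step change):

    import numpy as np

    rng = np.random.default_rng(0)
    years = np.arange(1900, 2000)
    true_temp = np.full(years.size, 300.0)    # a perfectly stable "true" temperature, in K

    # Instrument A (bias -0.5 K) is used until 1950, instrument B (bias +0.5 K) afterwards
    bias = np.where(years < 1950, -0.5, +0.5)
    obs = true_temp + bias + rng.normal(0, 1.0, years.size)

    raw_trend = np.polyfit(years, obs, 1)[0] * 100
    print(f"uncorrected (spurious) trend: {raw_trend:+.2f} K/century")

    # Homogenization sketch: estimate the step at the known changeover and remove it
    step = obs[years >= 1950].mean() - obs[years < 1950].mean()
    adjusted = np.where(years < 1950, obs + step, obs)
    adj_trend = np.polyfit(years, adjusted, 1)[0] * 100
    print(f"trend after step adjustment:  {adj_trend:+.2f} K/century")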

  • J // July 5, 2007 at 2:04 pm

    I should add, of course, that the skeptical position I mention in my last paragraph doesn’t really make sense as a response to climate change.

    Saying that there might be (unknown and uncharacterized) sources of error in the calculated temperature trends doesn’t provide much comfort. For some reason, the skeptics always seem to assume that errors act to inflate the perceived trend (along the lines of my artificial example above). But they could equally well be acting in the opposite direction.

    Likewise, claiming that observed 20th century warming was just a result of unspecified “natural variation” is a bogus argument. How do we know that the underlying “natural variation” wasn’t actually producing *cooling*, in which case the anthropogenic warming would be *greater* than what we think we have observed?

  • SomeBeans // July 5, 2007 at 2:30 pm

    …but the HadCRU and GISTEMP products are both anomaly measures where steps are taken to remove temperature “jumps” which might be introduced by changing a particular instrument from A to B, and where it is immaterial whether thermometer A is showing 300K in 2001 and 301K in 2002 whilst thermometer B is showing 299K in 2001 and 300K in 2002 because it is the difference not the absolute value that counts.

    I was quite surprised that some of the more confident skeptical posters on Real Climate seemed to have no clue about the most basic statistical procedures…

    For my own amusement I’m currently working on plotting the brightness indices of stations appearing at surfacestations.org along with the brightness indices of the whole set…

  • Eli Rabett // July 5, 2007 at 3:25 pm

    An even (to me) more powerful example is how you can get 10/12 bit and even higher precision out of an 8 bit device by oversampling (measuring multiple times). Useful when your ambitions extend much further than your budget.
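    A tiny sketch of the oversampling idea Eli describes (hypothetical 8-bit converter; some input noise, comparable to one least-significant bit, is assumed, since that is what makes the trick work):

    import numpy as np

    rng = np.random.default_rng(7)
    signal = 123.37     # true input, in ADC code units; the fractional part is below 8-bit resolution
    noise_lsb = 0.8     # assumed input noise, in least-significant bits

    # 4096 quantized 8-bit readings of the noisy input
    samples = np.clip(np.round(signal + rng.normal(0, noise_lsb, 4096)), 0, 255)
    print(f"a single 8-bit reading: {samples[0]:.0f}    mean of 4096 readings: {samples.mean():.3f}")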

  • George // July 5, 2007 at 4:47 pm

    Is that really more than a single cycle shown on that graph? It looks like a single one repeated. The data looks too perfectly repeated (down to the “scatter” in the instrumental record).

    If it is indeed a single cycle repeated, what is the purpose for showing more than one cycle?

    I know more data improves the result, but not more of the very same data! (just kidding)

    [Response: It is a single cycle repeated. The reason it’s customary to show two full cycles (one of which is a repeat of the other) is so that every part of the cycle can be viewed unbroken. If the graph showed only one cycle, we couldn’t visually inspect the cycle shape near phase 0 (because the graph is “broken” there). By repeating the cycle, we can view any phase region without interruption.]

  • J // July 5, 2007 at 4:52 pm

    Our most patient and diligent host wrote:

    “What it seems you don’t know is that temperature time series, as analyzed by GISS, take this into account. […]”

    Actually, I do know that. Really!

    I think that the claim you’re addressing here is one of the … weaker … of the denialist arguments (and that’s saying a lot). I raised the issue of measurement biases because I think that’s where the less-uninformed denialists are camped out.

    Maybe this post is a necessary first step. My response up-thread was kind of an attempt to play Devil’s advocate.

    Also … I just want to note for the record that my post up-thread (2:04) was written before Tamino’s in-line comments to the previous post (1:54). As it stands now, I look like a bit of an idiot, appearing to repeat more or less what you say in your in-line comments right above the 2:04 post. Oh well.

    [Response: When I posted my response to your first comment, I noticed that your 2nd comment appeared — clearly the two were near-simultaneous, neither aware of the other. It’s happened before that between clicking “edit” to respond, and clicking “edit comment” to post my response, the commenter has made another comment — and they usually repeat the same information! If it’s any consolation, I felt a bit like an idiot too.]

  • J // July 5, 2007 at 4:58 pm

    Off-topic: At this moment I am listening to a recording of Claudio Abbado conducting the Mahler Chamber Orchestra in Die Zauberflöte.

    It makes reading this site a much weirder experience … having a trio of comely young women singing about Tamino in the background!

    [Response: Small world! Die Zauberflöte is my favorite opera, and that’s why I chose the moniker “tamino”.]

  • SomeBeans // July 5, 2007 at 5:00 pm

    It’s okay, J, we understand ;-)

  • J // July 5, 2007 at 5:17 pm

    > … that’s why I chose the moniker “tamino”

    Since you don’t have any pictures of yourself on this site, I have been subconsciously imagining you as looking somewhat like Josef Köstlinger (?) in the 1970s-era Bergman film version.

    Anyway, it’s an appropriate name, given what you do on this blog. You are developing your own understanding (of climate science), overcoming challenges (trolls and denialists), and bringing light to the masses.

  • Steve Bloom // July 5, 2007 at 6:18 pm

    Unfortunately, the perfectly clear line of argument above will only get you so far with the denialists, as they are convinced that Jim Hansen (along with Phil Jones and less well-known people at NCDC) is cooking the books on the adjustments.

  • george // July 6, 2007 at 1:51 am

    “The probable error in a visual brightness estimate is 0.2 to 0.3 magnitude; it can easily be 0.5 magnitude or more.”

    I don’t debate the general principle you are illustrating, or that the most probable error may indeed be 0.2 - 0.3 with the visual estimates of magnitudes.

    But it is also possible that in the specific case you gave, the observer (Wayne Lowder) was better than the “average” observer — perhaps even much better.

    I know they are probably not available, but I am curious about the individual measurements that were averaged to give each data point for the visual estimates.

    It is not unreasonable to assume that there might be significant differences in the quality of the data produced by different observers, since eyesight plays such a key role and since expertise and practice (perhaps even an individual “system”) may also play roles in the data quality.

    Also, though you say that the estimates are only reported to the nearest 0.1 magnitude, isn’t it possible in some cases (perhaps Mr. Lowder’s) that the observers were actually estimating to better than the 0.1 magnitude that is the standard for reporting? For someone who is very skilled at measurement and who has done it a lot, this may at least be plausible.

    Don’t get me wrong. I’m not trying to cast doubt on the basic principle you are illustrating here. I am quite familiar with this and have used it in practice (and I will attest that it is quite real, though it never ceases to surprise me!).

    It’s just that sometimes, without seeing all the data, it’s hard to separate the effect one is interested in from other effects.

    [Response: I suspect it’s true that Mr. Lowder, an experienced and enthusiastic observer (I knew him well, may God rest his soul), was a cut above the average. But I can assure you the resolution with which the data were reported is indeed 0.1 magnitude.]

  • Ender // July 6, 2007 at 7:22 am

    Steve Bloom - “Unfortunately, the perfectly clear line of argument above will only get you so far with the denialists, as they are convinced that Jim Hansen (along with Phil Jones and less well-known people at NCDC) is cooking the books on the adjustments.”

    I think that it is worse than this. The denialists seem to be searching for a new wedge issue along the lines of the “Hockey stick is broken therefore global warming is wrong”. They seem to have shifted to “the temperature record is broken therefore Global Warming is wrong” and I fully expect McIntyre to have a post on every weather station in the US where he will find something wrong and then trumpet that the whole theory of AGW is incorrect because the back of Bourke (an Australian expression) weather station had a cup of coffee spilled on it in 1950.

    Nothing will stop this, just as rational argument had no chance in the Hockey Stick wars of a few years ago, which have thankfully died down because even M&M drained every single possibility of doubt mongering in the years and millions of blog comments that constituted the war.

    [Response: I suspect you may be right on the money with this comment. That’s why I like the idea of discussing the science *without* paying much attention to what the denialists are saying. It’s time to stop letting them set the agenda for discussion.

    I’m perfectly willing (as this post shows) to address issues as denialists raise them — or not. The choice is up to me! But the fact is, with Al Gore’s advocacy, George W. Bush’s concessions (lip service at least), and the Supreme Court ruling in Massachusetts vs EPA, the war is over. Also, all the denialists’ efforts may be backfiring on them; global warming is one of the hottest topics in the blogosphere, and if they want the public to ignore the issue they should stop talking about it so much!]

  • J // July 6, 2007 at 2:44 pm

    If someone really doubts this, here’s an experiment you can have them do. All they need is a computer with Excel (or another similar spreadsheet) … no fancy statistical tools required.

    (1) In the first cell of column A, type the following formula:
    =ROUND(RAND(),1)
    The “RAND()” part generates a random number between 0 and 1. The “ROUND(…,1)” part rounds it off to one decimal place.

    (2) Select the contents of that cell, copy them, and paste them into the next 9,999 cells down the column. You should now have 10,000 “measurements”, each with 1 significant figure of precision.

    (3) Somewhere else on the spreadsheet, type the following functions:
    =AVERAGE(A1:A2)
    =AVERAGE(A1:A10)
    =AVERAGE(A1:A100)
    =AVERAGE(A1:A1000)
    =AVERAGE(A1:A10000)
    These will calculate the mean of the first 2, 10, 100, 1,000, and 10,000 “measurements”.

    The exact details will differ depending on what random numbers are generated, but here is a representative example of the results:

    Mean of 2 measurements: 0.3
    Mean of 10 measurements: 0.41
    Mean of 100 measurements: 0.45
    Mean of 1000 measurements: 0.498
    Mean of 10000 measurements: 0.4995

    Note how, as the number of measurements increases, the sample mean generally approaches the “true” population mean of 0.5. Eventually, you’re able to get three significant figures worth of precision for the mean, despite the fact that each individual measurement has only one significant figure.
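    For anyone without a spreadsheet handy, the same experiment takes a few lines of Python (the exact numbers will differ from run to run, since they are random):

    import numpy as np

    x = np.round(np.random.rand(10000), 1)    # 10,000 "measurements", each rounded to 0.1
    for n in (2, 10, 100, 1000, 10000):
        print(f"mean of first {n:5d} measurements: {x[:n].mean():.4f}")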

  • george // July 7, 2007 at 5:32 pm

    There are clearly practical limitations to this general principle. How well it works is essentially limited by how well the reality of the measurement fits the underlying assumptions of the statistical principle (namely, that the measurement errors are normally distributed).

    There is clearly more going on in the case of a human observer than in the case of an instrument that in principle is capable of resolving a measurement to the desired precision (that can detect single photons, for example), but whose result is degraded by electronic noise.

    For example, if I have a ruler that measures to the nearest sixteenth of an inch, can I measure something to the nearest millionth of an inch merely by taking a large number of measurements to the nearest 1/16″ and then averaging them?

    I seriously doubt it — but I am open to proof otherwise (documented with the results from an NIST CMM, of course)

    At its basic core, this principle depends on the idea that the errors are truly random (that multiple measurements comprise a gaussian distribution), which is never true in practice, not even in the case given immediately above, since RAND is only “pseudo”-random. Excel 2003 RAND even generated negative numbers for the values that they claimed were “random values between 0 and 1”! (And some people wonder why their Windows computer crashes on a regular basis.)

    As I stated above, I don’t question the general principle, but I think in practice one has to be very careful to look at how close the reality fits the assumptions of the principle — and with human observations in particular, one has to be especially careful.

    I don’t doubt the results shown above with the visual photometry (particularly since the results have been verified instrumentally), but as a general rule, I’d say that one has to be careful that the underlying assumptions of the principle are being (approximately) met and that one does not take it too far.

    [Response: It is indeed possible to achieve higher precision than the gradations on the ruler, given a large number of measurements. I doubt one could get to a millionth of an inch (but then, I doubt the object’s size is constant to one millionth of an inch, given thermal expansion and all). But with a trillion measurements, you can get *much* better precision than the gradations of the ruler.

    The increased precision does *not* depend on the errors being Gaussian. But it does depend on the rms error being as large as or larger than the resolution. If the probable error of the instrument is much smaller than the reporting resolution, the increased precision doesn’t apply — so this is a case for which increasing the precision of the instrument degrades information! Hence if the instrumental precision improves to a finer level than the reporting resolution, it is essential to increase the reporting resolution. I delivered a paper on this very topic at a meeting of the AAVSO, when they were considering increasing the report resolution from 1 to 2 significant digits.]
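    A small sketch of that last point (hypothetical numbers): when the measurement noise is comparable to or larger than the reporting resolution, averaging beats the resolution; when the noise is much smaller than the resolution, every reading rounds to the same value and averaging gains nothing.

    import numpy as np

    rng = np.random.default_rng(1)
    true_value = 0.537
    resolution = 0.1

    for sigma in (0.15, 0.001):   # rms error larger than, then much smaller than, the resolution
        readings = np.round((true_value + rng.normal(0, sigma, 100000)) / resolution) * resolution
        mean = readings.mean()
        print(f"sigma = {sigma:.3f}   mean of 100,000 readings = {mean:.4f}   error = {abs(mean - true_value):.4f}")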

  • Wolfgang Flamme // July 7, 2007 at 9:51 pm

    One of the problems with sub-least-count instrument readings AFAIR: there is a significant accumulation of readings at the 1/4 LC, 1/2 LC, 3/4 LC, 1/3 LC and 2/3 LC values. Again, it’s different if there is a supporting Vernier scale involved.

  • Eli Rabett // July 8, 2007 at 2:53 am

    Wolfgang, amazing as it may seem, if you take enough readings you will get much better precision than the least significant digit (LC I presume is something like letzte Zahler). You don’t need a Vernier, just a lot of readings.

  • John Cook // July 10, 2007 at 6:14 am

    Great post. Two quick questions - is there a relationship between the # of measurements and the uncertainty? The power of numbers makes sense but I’m wondering if it’s quantifiable?

    Secondly, does this blog have an RSS feed? Can’t find one anywhere. Thanks!

    [Response: Generally, if the uncertainty in a single estimate is \sigma, and the number of data points used in an average is N, the uncertainty of the average itself will be \sigma / \sqrt{N}.

    I don’t know whether wordpress automatically provides RSS feeds, or I have to set up such a thing — but I’ll look into it.]

  • Wolfgang Flamme // July 10, 2007 at 11:12 am

    Eli,
    ‘better precision’ might well be the result of a stronger/more uniform bias and does not imply improved accuracy at all.

    Simply imagine two groups of students using mm-scaled rulers, one measuring a rod of length 100.005 mm, the other one of length 97.318 mm …

    [Response: Large numbers of data result in better precision, but indeed they do not improve accuracy. In the example given in the post, the large number of variable-star observations don’t enable us to determine the star’s absolute brightness with super-accuracy, but they do enable us to plot the brightness *changes* with stunning precision. Hence the entire visual light curve could be off by an entire magnitude! (more likely it’s off by 0.2 mag, the difference in zero-point between the visual and photoelectric data). But the size and shape of the light curve (period/amplitude/signal shape) and its fine details can be determined reliably and precisely, which is what’s important for understanding the star’s behavior. Likewise for temperature data, it’s the changes (not the absolute value) that are of primary interest.]

  • Wolfgang Flamme // July 10, 2007 at 11:13 am

    Eli,
    PS: LC=Least Count

  • Adam // July 10, 2007 at 12:47 pm

    “But the monthly average temperature is the average of 30 daily estimates; the annual average is the average of 365 (or 366) daily measurements. The trend estimated at a single location with a century of daily measurements, benefits from the information contained in 36,525 daily measurements. And the global average temperature trend is based on data from several thousand different locations around the earth. ”

    I may have this wrong, and of course there will be different datasets, but I thought that the daily average was calculated from the average of the max & min, and those daily averages were applied to the month etc. (or maybe the individual max & mins are used in the monthly etc. calculation). Doesn’t affect the point of the post, of course.

    [Response: You have it right.]

  • Eli Rabett // July 11, 2007 at 10:52 pm

    A better image would be that the students use the same ruler, or two rulers that have been calibrated against a common standard.

  • Bloggers for Positive Global Change at Glittering Muse // July 14, 2007 at 6:16 am

    […] Tamino does his small but important part to educate a skeptical reader who doubts the validity of the science of averaging temperatures to predicts global warming. […]

  • Heiko Gerhauser // July 18, 2007 at 5:04 pm

    Maybe the graph isn’t resolved that well, but the offset between visual and experimental observations doesn’t appear to be constant to me.

    For example, at the bottom of the brightness curves (at about time 0.6) the difference in brightness seems to be less than 0.1, but at about time 0.8 the offset has risen to above 0.2.

    So, while individual features of the curve less than 0.1 in brightness difference are recognisable, the accuracy of the trends seems much closer to 0.1-0.2 (e.g. between time t=0.6 and t=0.8 the visual increase is 0.15 and the photometer-measured increase is 0.3)?

  • Timothy Chase // July 19, 2007 at 2:52 am

    John Cook had asked about your RSS Feed.

    I have a page that collects feeds from a number of blogs (easier to keep track of new posts, favorite authors, etc.), and what I am using is:

    http://tamino.wordpress.com/feed/rss2
