Open Mind

Love/Hate Relationship

December 18, 2008 · 44 Comments

I have a love/hate relationship with Microsoft Excel, and with the R statistical language.


They’re both immensely powerful tools for exploring and manipulating data. They’re easy to use. Excel is designed for general-purpose use by a wide range of people, but it does include a lot of statistical tools. R is a professional-grade statistical analysis package; it’s easy to learn and easy to use, and it’s free — I’ve been using it more and more since I started with it a couple of months ago. I love ‘em.

But like all extremely powerful tools, they’re easy to abuse. Any idiot can load data into Excel and fit a trend line, which can lead to statements like this recent reader comment on RealClimate:


Between 1935 and 2007, the Arctic appears to have warmed about half a degree …

Well, that ain’t right. The explanation came in a later comment:


I did a quick and hurried least squares fit on 1935-2007 (GISS 64N-90N) then multiplied the annual rise by 72 (number of years).

I had to reply:


The arctic trend from 1935 to 2007 is demonstrably nonlinear — a linear trend can easily be shown to underestimate the warming. Severely.

I don’t know if Excel, or R, or something else was used to fit the trend line in question, and it doesn’t really matter. What’s germaine is that somebody used an easily available computer tool to analyze data, then drew a conclusion which isn’t justified.

Let’s take a look at GISS data for the arctic zone (64N to 90N), consisting of annual average temperature for that latitude band. Here’s a plot (created with R):

[Figure: GISS annual mean temperature anomaly, 64N-90N]
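
For anyone who wants to follow along at home, here’s a minimal R sketch of that first step. The file name and column layout are my assumptions (I simply saved the GISS 64N-90N annual means to a plain two-column text file); they are not anything GISS supplies in that form.

    # Assumed file: two columns, year and annual temperature anomaly (deg C),
    # saved by hand from the GISS 64N-90N zonal-mean series
    giss <- read.table("giss_64N-90N.txt", header = FALSE,
                       col.names = c("year", "temp"))

    # Plot the raw annual values
    plot(giss$year, giss$temp, type = "l",
         xlab = "Year", ylab = "Temperature anomaly (deg C)",
         main = "GISS annual mean, 64N-90N")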

Clearly there’s a strong signal present, and a lot of noise too. It would be nice to separate the signal from the noise, and one of the ways to do that (at least approximately) is to apply a smoothing filter.

Smoothing data is one of my main interests. To apply my own methods, I have to write my own programs (you can do that in R, but I prefer to write standalone programs). I currently use what amounts to a “lowess” smooth (“locally weighted scatterplot smoothing”) but with a different “window” than is usually applied. I’m also researching the use of regression functions other than polynomials (the basis for lowess smoothing), because polynomials have increased uncertainty near the edge of the data window.

But you don’t have to be an active researcher to apply sophisticated smoothing methods. You just have to apply some of the tools available in R, or in Excel. In fact R incorporates the lowess smooth, and makes it ridiculously easy to apply. But doing so can return some ridiculous results: here’s what you get if you apply a lowess smooth to the arctic temperature data, with the default parameters (i.e., no parameters at all):

[Figure: lowess smooth of the arctic data with default parameters]
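
For reference, this is all it takes; the data frame and column names carry over from the sketch above (they’re my assumed names, not part of R or of the GISS data):

    # lowess() with no smoothing parameter supplied; the default is f = 2/3,
    # i.e. each local fit uses two-thirds of the data
    fit_default <- lowess(giss$year, giss$temp)

    plot(giss$year, giss$temp, type = "l",
         xlab = "Year", ylab = "Temperature anomaly (deg C)")
    lines(fit_default, col = "red", lwd = 2)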

It’s evident, just by looking, that this isn’t a good approximation to the signal. It does indicate (correctly) that there’s an overall increase, but it doesn’t capture the mid-century cooling or do a very good job of estimating the overall pattern. If we used this smooth to estimate the arctic warming from 1935 to 2007, we’d underestimate it. Severely.

The problem is that lowess requires a “time scale” on which to smooth, which is controlled by a parameter, the fraction of data used for the local fit. Fortunately the user doesn’t have to supply that parameter, but unfortunately the default doesn’t always work very well — as this example illustrates.

We’ll do a lot better if we set the parameter to f=0.15:

[Figure: lowess smooth with f = 0.15]
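
Again for reference, the only change from the default call is the f argument (same assumed data frame as before):

    # f = 0.15: each local fit uses 15% of the data, a much shorter time scale
    fit <- lowess(giss$year, giss$temp, f = 0.15)

    plot(giss$year, giss$temp, type = "l",
         xlab = "Year", ylab = "Temperature anomaly (deg C)")
    lines(fit, col = "red", lwd = 2)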

Clearly the fit looks better, so we might have found an appropriate time scale. But can we verify that? We can plot the residuals from this fit:

[Figure: residuals from the f = 0.15 lowess fit]
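
Computing the residuals takes one line, because lowess() returns its fitted values at the (sorted) input x values; with annual data already in year order they line up one-for-one with the observations. A minimal sketch, continuing with the assumed names:

    # Residuals = data minus smoothed values (the years are already sorted)
    res <- giss$temp - fit$y

    plot(giss$year, res, type = "l",
         xlab = "Year", ylab = "Residual (deg C)")
    abline(h = 0, lty = 2)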

That looks good — the residuals look random. We can even analyze them, computing the ACF (autocorrelation function) and PACF (partial autocorrelation function):

[Figure: ACF and PACF of the lowess-fit residuals]
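
The ACF and PACF are one call each in R:

    # Autocorrelation and partial autocorrelation of the lowess residuals
    par(mfrow = c(2, 1))      # put both plots on one page
    acf(res,  main = "ACF of lowess residuals")
    pacf(res, main = "PACF of lowess residuals")
    par(mfrow = c(1, 1))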

This confirms that the residuals are behaving at least approximately like noise; in fact they’re pretty close to white noise, although the preponderance of negative values of the ACF and PACF at low lags tells us there’s something more going on. The residuals are probably not pure noise (let alone white noise), but they are at least approximately so.

What if we had fit a straight line to the data from 1935 to 2007? Then we’d get this (linear fit in green):

[Figure: linear fit 1935-2007 (green) superimposed on the data and the lowess smooth]

Clearly that’s not a good fit. Using this linear fit to estimate the change from 1935 to 2007 gives just a hair over 0.47 deg.C. And it’s plainly wrong. Using the lowess fit (the good one) to estimate the change gives 1.28 deg.C — more than twice as big, in fact it’s 2.7 times as big.
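
If you want to reproduce that comparison, a sketch along these lines will do it (still using the assumed data frame; the exact numbers will depend on which version of the GISS series you download):

    # Linear fit over 1935-2007 and the implied change: slope times 72 years
    giss35 <- subset(giss, year >= 1935 & year <= 2007)
    lin <- lm(temp ~ year, data = giss35)
    change_linear <- coef(lin)[["year"]] * (2007 - 1935)

    # Change implied by the f = 0.15 lowess smooth over the same span
    change_lowess <- fit$y[fit$x == 2007] - fit$y[fit$x == 1935]

    c(linear = change_linear, lowess = change_lowess)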

In case you’re curious about my latest smoothing technique, here it is compared to the lowess smooth:

[Figure: my latest smoothing method compared to the lowess smooth]

It’s a little choppier than the lowess smooth because I’m using a “boxcar” window; I just haven’t gotten around to applying a smooth window yet. Hey — it’s still in development. But in some respects it might be better than the lowess smooth, especially at the beginning and end of the data. Note I said “might” — the tests are far from complete. And in case you’re interested, this smooth indicates a warming of the arctic since 1935 of a mere 1.16 deg.C, only 2.4 times as large as indicated by the linear fit.

If we plot the residuals from the linear fit, we get this:

[Figure: residuals from the linear fit]
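
The same residual check applies to the linear fit, and the leftover structure shows up immediately in the ACF (a sketch, same assumed names as before):

    # Residuals from the 1935-2007 linear fit
    res_lin <- residuals(lin)

    plot(giss35$year, res_lin, type = "l",
         xlab = "Year", ylab = "Residual (deg C)")
    abline(h = 0, lty = 2)

    # Strong low-lag autocorrelation here means the straight line left real signal behind
    acf(res_lin, main = "ACF of linear-fit residuals")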

It’s quite obvious that these residuals are not just noise; there’s still signal there, and a rather strong signal at that. I could apply any number of analyses to demonstrate this with overwhelming statistical significance — but you get the picture.

There are a few lessons to be learned here. One is that you should look at your results. The graph with the green line (the linear fit from 1935 to 2007) superimposed on the data and the lowess fit is sufficient to show that the linear model isn’t right. The graph of the default-parameter lowess fit is sufficient to reveal that it’s not right either. I have to wonder whether the claimant bothered to look. Believe it or not, visual inspection of graphs is one of the most powerful analytical tools around. It can also lead to false results, so you can’t rely on it to establish significance — but it can often reveal false results as well. Use it!

Another lesson is to appreciate that there are more things in heaven and earth, Horatio, than are dreamt of in your application of Excel and R. Yes they’re extraordinarily powerful tools, but they’re not “black boxes” into which you can shove data and expect correct answers to appear as if by magic. You might even get correct results and not know it, instead interpreting them incorrectly! For instance, a linear fit might be strongly statistically significant even though the trend isn’t linear; this leads to the correct result that the data are not just random noise. But only if we try (real hard) to find nonlinear patterns and fail, and establish that the residuals are consistent with a noise process, could we claim that the trend is well approximated as linear and use it to estimate the net change over a given time span.
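
One simple way to make that concrete (a sketch, and certainly not the only test you’d want): fit the straight line, confirm that its slope is significant, then ask whether allowing curvature improves the fit significantly. If the F-test on the added polynomial terms is also highly significant, the “significant linear trend” is clearly not the whole story.

    # Is the linear slope statistically significant? (It is, but that's not the end of the story.)
    summary(lin)

    # Does allowing curvature do significantly better than the straight line?
    curved <- lm(temp ~ poly(year, 3), data = giss35)
    anova(lin, curved)   # a significant F here means the trend is not well described as linear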

I guess it’s inevitable that powerful tools will fall into the wrong hands. But with a little common sense (look!) and a bit of humility, you can avoid drawing mistaken conclusions most of the time. And if you’ve been careful, and you still think you’ve established a given result, unless you’re very knowledgeable and experienced, remain open to the possibility that there’s more than meets the eye. In fact, even if you are knowledgeable and experienced…

Categories: Global Warming

44 responses so far

  • Ross // December 18, 2008 at 7:57 pm

    Speaking of visually inspecting a graph, I did this to the RSS AMSU mid-tropo monthly means graph of the past 30 years… and decided it is basically a flat line. Linear regression might say that the temperature is increasing at 0.06 or 0.09 degrees per decade, but is that an appropriate conclusion when the temperature swings about as much as it does?

    [Response: That's why we apply statistical tests to determine significance. In this case the test says yes, it's not just accidental. Visual inspection is a necessary (in my opinion) but hardly sufficient technique; you can't just "look at it" and *decide*. Visual inspection often gives clues and insights from the remarkable pattern-recognition capabilities of the human brain. But the brain is easy to fool, often failing to see patterns that are present and often "seeing" patterns that aren't there.

    And as was pointed out, the mid-troposphere temperature data is contaminated by the cooling signal of the stratosphere, so the real mid-troposphere trend is even greater.]

  • John Mashey // December 18, 2008 at 8:11 pm

    Yes, good stuff, and John Tukey would have approved - see the QUOTES section of
    http://en.wikipedia.org/wiki/John_W._Tukey

  • Bob North // December 18, 2008 at 8:16 pm

    Good post, Tamino. I agree that regressions and such get misused quite a bit (and not just in the climate sciences) and that people really need to plot the data, smooths, etc. to make sure that they intuitively “make sense” before drawing any substantial conclusions. This also helped me understand why you have maintained that the trend in arctic sea ice extent/area during the satellite era is “demonstrably nonlinear”. (From this post, I presume it is based on an evaluation of the residuals.)

  • Phil Scadden // December 18, 2008 at 8:49 pm

    Still really looking to see some of this work on smoothing published. (BTW - what journals are good for looking at current research on this?) Parameter settings for smoothing filters seem more like black magic than science - terribly subjective. Surely you’d want to work back from the desired properties of the residuals to the values of the parameters?

    [Response: Unfortunately, the literature on smoothing is scattered over a lot of journals, especially since methods are often developed with a particular problem in mind, and end up published in journals related to the problem under study. Thus (IIRC) the original Savitzky-Golay paper was in a journal about analytical chemistry, and has turned out to be one of that journal's most oft-cited papers; the groundbreaking work of Bill Cleveland (on lowess) was first published by AT&T Bell Labs; and my favorite wavelet method is in the Astronomical Journal. There's a lot of literature about irregularly sampled time series in astronomy journals, because astronomers have to analyze irregularly sampled data so much (what with the weather and all). I know -- it's frustrating and makes things both harder to find, and harder to keep track of.

    It's also noteworthy that there are (at least) two purposes for smoothing. One is exactly what it says: to smooth data, which may remove signal but that's OK. The other is to separate the signal from the noise -- and for that purpose you definitely want to be able to establish that the residuals are consistent with a noise process.

    Parameter setting is indeed as much an art as a science; perhaps future insights will improve that situation.]

  • The Wonderer // December 18, 2008 at 9:27 pm

    Apparently your text editing tools are lacking also, since they allowed you to misspell germane using another word.

  • S2 // December 18, 2008 at 9:31 pm

    Thanks for that.

    It hadn’t occurred to me that plotting the residuals could be quite such a powerful tool - that’s one to remember (I’ll be starting my first statistics module in January).

    One of my favourite visualisations of “linear trends” is the animated gif by Atmoz at
    http://atmoz.org/blog/2008/03/14/first-assume-a-spherical-enso/

  • Phil Scadden // December 19, 2008 at 1:15 am

    Thanks Tamino. Your “favourite wavelet method”
    was a revelation to me. I’m finding the method useful in other contexts (trying to extract morphometric parameters from a family of gastropod fossils). S-G is computationally cheap for long series, which I like, but the requirement of regular sampling and the handling of the ends of the series worry me; still, I see other people doing it the same way. I’ll look over some astronomical journals next time I’m at the library. [This is just personal comms/thanks - doesn’t need to go on the blog]

    [Response: It's nice to have a little friendly "chat" that's also on-topic.]

  • Ray Ladbury // December 19, 2008 at 2:09 am

    Great Post! It shows why we keep hearing that old saw by Disraeli about lies, damnable lies and statistics. Instead I say: Any damn fool can use statistics to lie. What takes skill is using them to tease out the truth.

  • Chad // December 19, 2008 at 2:30 am

    Hey Tamino, great post. I’ve used lowess before fairly often. Then I found the smooth function in Matlab has a modified implementation called rlowess, which is outlier-resistant. What are your thoughts on rlowess? Advantages/disadvantages?

    [Response: Outlier treatment can be very tricky. Should it be thought of as an "outlier" that deserves less weight in analysis, or is it a crucial indicator of change?

    I'd say that both the "robust" and "non-robust" smoothing methods should give good results. If they're notably different, then there are "outliers" (meaning, points very deviant from the "norm" -- whatever that means!), and you should *look* at the data to try to get some idea of how important the "outliers" might be. They could be just plain mistakes, and should definitely be omitted -- or they could be the most informative data points of all. I know that doesn't help much!]

  • Gavin's Pussycat // December 19, 2008 at 7:38 am

    These posts are lovely, and I always learn something from them.

    One subject that might be worth writing a post on someday, with examples, is Monte Carlo resampling methods like bootstrapping and jackknifing. I see them used a lot, and tried to read the Wikipedia articles on them, but they didn’t come to life for me.

    > Instead I say: Any damn fool can use statistics to lie. What takes skill is using them to tease out the truth.

    Ray, so true, so true.

  • Uli // December 19, 2008 at 10:11 am

    Hello Tamino,
    is there the possibility of using the parameters of the ARMA(1,1) from
    http://tamino.wordpress.com/2008/09/12/dont-get-fooled-again/
    to generate improved smooth curves, compared to general methods like the running mean?
    (A running mean may be optimal for datasets without autocorrelation and with a normally distributed random part.)

  • Horatio Algeranon // December 19, 2008 at 7:48 pm

    there are more things in heaven and earth, Horatio, than are dreamt of in your application of Excel and R.

    Words to live by.

  • Joe W // December 19, 2008 at 8:53 pm

    As neatly explained in one of your earlier posts, many geophysical time series appear to lack a natural time-scale. For example, 663 years of Nile flood minima are available in the longmemo package in R, data(NileMin). Scale analysis of the variance of NileMin shows the famous power law with Hurst parameter about 0.8.

    What happens if, as an alternative analysis, NileMin is separated into a smooth component and a residual along the lines of your post? Use of a lowess parameter f=0.15 introduces an artificial time-scale into the problem. The time-scale is about 30 years in the case of NileMin with f=0.15. Scale analysis of the variance of the residuals fails to show Hurst scaling for time-scales longer than 30y.

    Is it expected that the temperature time-series show long memory effects? If so, an analysis into “signal” and “noise” parts will be problematic.

  • Hank Roberts // December 20, 2008 at 1:23 am

    Side question — those working with Excel, if you do large worksheets, how much RAM, and what kind of processor(s) and speed makes you happy?

    I could tell you stories, except that I can’t, about IT departments cheeseparing on RAM ….

  • Ray Ladbury // December 20, 2008 at 2:15 am

    Hank, The largest spreadsheets I’ve used have been nearly 200 Mbytes. I’ve even used Excel to do Monte Carlo calculations. The computer I have via my employer is totally inadequate for such tasks–I have to use my home machine. I find that memory is generally more important than processor speed (within limits), but I’ve had some spreadsheets that would chug for 10 minutes when you updated them.

    So, why, you may ask, am I using Microsoft Excel? Well,
    1) It is what I have to work with.
    2) I can make it work–I’ve yet to find a problem I couldn’t solve in Excel.
    3) If I can make it work, anybody in my field can adopt the technique.

  • John Mashey // December 21, 2008 at 1:02 am

    Apropos Excel, albeit from another field that doesn’t have physics underneath:

    I recommend Sam Savage’s “The Flaw of Averages”, http://www.stanford.edu/~savage/flaw/ .

    The page has a bunch of Excel-based animations, the article is worth reading:
    http://www.stanford.edu/~savage/flaw/Article.htm

    Note: one can draw parallels between running out of retirement money, and getting just warm enough to melt too much permafrost…

  • Hank Roberts // December 21, 2008 at 2:11 am

    Funny how often adding RAM can turn a recalcitrant spreadsheet into something that can be improved without just crashing. That’s why I asked about how much y’all find useful for what size spreadsheets. (Data points may be coming from my beloved spouse who makes a living fixing other people’s spreadsheets, one company after another, on their always starveling equipment.)

    Relevant, hat tip to /. who point to this today:
    http://www.codinghorror.com/blog/archives/001198.html

  • dhogaza // December 21, 2008 at 4:50 am

    Funny how often adding RAM can turn a recalcitrant spreadsheet into something that can be improved without just crashing

    Not really … the total memory space available will be RAM + virtual memory space.

    Not that I use Windows, and Unix-based systems are more likely to degrade gracefully, but the basic principle holds true everywhere.

  • EliRabett // December 21, 2008 at 4:33 pm

    Eli has used large Excel spreadsheets to model state to state scattering. A referee objected, to which the reply was: You want me to use some buggy code written by a grad student?

    A more serious point: as Tamino will tell you, this is but the first step, or maybe the second. The more important question is why you get the observed behavior. You need a consistent interpretation of the observations.

  • Hank Roberts // December 21, 2008 at 5:37 pm

    >> Funny how often adding RAM
    > Not really

    Really, actually, this is the only change company IT people often are willing to make — they go steal a RAM chip from some other computer and move it to the one the visiting consultant is trying to use. Sure messing with virtual memory might help. You expect that? Not.

    TMOT.

    And it’s, yes, really, surprising how often going from 256k or 512k to a meg or more of RAM turns something in Excel that stutters and crashes into something that crawls.

    This is considered “enough” in corporate IT.

    This is why I asked the question (grin). I’ve been through it myself trying to index large books. It can be done. Barely. With support.

    It’s far easier to take the work home and use a really good machine, of course!

    Of course these examples are from Windows.
    Enough.

  • Hank Roberts // December 21, 2008 at 6:57 pm

    Thanks to John Mashey for the reminder about Tukey, which led to this:

    http://en.wikipedia.org/wiki/Uncomfortable_science

    That’s also why before doing climate experiments with coal, we should have mounted a scratch planet.
    http://edp.org/monkey.htm

  • Hank Roberts // December 23, 2008 at 4:13 pm

    Ok, examples from close to home, hoping they’ll encourage others to add details:

    30meg spreadsheet
    512k RAM on machine provided to consultant
    5 minutes to open

    3 megs RAM, on a faster multi-core machine borrowed from someone else in the company
    5 seconds to open

    IT department provided the consultant with a machine left over from their server closet with 2 megs of RAM after this demonstration; that made the difference between crashing and crawling — still slow but can do more interesting stuff without crashing.

    None of the above is exactly true, being filtered through me, but it’s approximately the story.

    So, how much RAM do you all find useful?
    What kind of CPU?
    How big spreadsheets?

    Maybe the summary question is — are there _benchmark_ routines prepared that one can use in Excel, some standard spreadsheet set up that can be run through its paces and timed, on a variety of machines, to give some number that says how ‘useful’ the machine is to do this work?

  • Dano // December 23, 2008 at 5:03 pm

    Hank,

    In a past life I was a systems analyst for a large bank, making sure ~135 users had what they needed to do their jobs, including PCs (among other duties).

    Anyway, amen bruddah. What Hank said.

    Best,

    D

  • dhogaza // December 23, 2008 at 7:09 pm

    Well, my laptop has two GIGAbytes of RAM, are you sure those figures are 512K and 3M respectively? That’s like so last century and all :)

    I’m guessing 512M and 3GB respectively, simply because XP and Vista aren’t going to run with 3MB RAM, much less 512K.

    Unless you’re running DOS, I guess :)

    You can get non-ECC 2GB DDR2, which would work in a wide variety of dual-core Intel-based boxes, from Crucial for $22. Memory’s gotten so cheap that a 2GB stick doesn’t even qualify for free shipping from Crucial any more! Requires a $40 purchase for their UPS 2-day free delivery, sheesh!

    Of course your dual-core is also simply faster than the older machine, and probably has faster SATA rather than IDE drives.

    However, that huge a speed difference means there’s a vastly greater amount of swapping (shuffling RAM to disk and vice-versa) going on while loading the spreadsheet. Since modern graphics-oriented operating systems are memory pigs (but with memory being so cheap, who cares, really?) it’s not surprising that the smaller machine would choke on a large spreadsheet.

  • dhogaza // December 23, 2008 at 8:53 pm

    Hank, if you’re routinely working with large spreadsheets such as you describe, the best benchmark for you to test with is most likely your large spreadsheet itself.

    A canned benchmark may or may not be representative of your spreadsheet(s), may do far less or far more computation per cell, or computations that may or may not be representative of the kind of work you do (business types are a lot less likely to be whumping on transcendentals or the like than techie types).

    In your case, the size of the spreadsheet is probably the most important factor and you can never have too much RAM, especially given modern prices, there’s really no price-performance to argue when you can totally load up a PC for $50-$100…

  • Hank Roberts // December 23, 2008 at 10:02 pm

    I’m not. One near and dear to me is. That’s why I’m asking if people would be willing to comment more on what they’re doing, with more specifics about the spreadsheets and hardware.

    This is also an attempt to entice said near and dear Excel expert to get more involved in climate work, which she’d like to learn about.

    C’mon, y’all, brag a little bit about Excel, loves and hates? Details! Do your Tom Sawyer fence-painting best to make this sound like fun?

    Pointers to public copies of interesting climate-related spreadsheets, maybe? References to academic work published using it? Organizations that rely on it?

    What’s “totally load up a PC” mean to you all, these days?

    (I am merely a conduit for this, being a Macish and grammarian observer.)

  • Hank Roberts // December 23, 2008 at 10:06 pm

    > 512M and 3GB

    Er, ah, yeah. Blush. The only PCs I own still run DOS 3.3, teaching typing (Typing Tutor IV).

    I’m a bystander here, ineptly asking for more about what works in Excel, just hoping for tempting details to be posted.

  • P. Lewis // December 24, 2008 at 12:30 am

    Excel climate models

  • Ray Ladbury // December 24, 2008 at 3:01 am

    Hank, OK, you asked for it. I typically work in Excel–in part because it’s all my employer will provide and in part because it’s all I can count on the employers of others in my field to provide. Actually, I have yet to find anything I wasn’t able to do in Excel–albeit, sometimes slowly and maybe having to use a few tricks.
    I’ve done reliability studies on ASICs for a detector signal processor that was flying literally thousands of ASICs. Such calculations involve factorials, which give errors when the numbers get too large. However, you can use Stirling’s approximation for the factorial for large numbers and then simplify as much as possible–voilà, convergence.
    I have also developed a Likelihood-based fitting routine that not only finds best fits, but also confidence intervals. I used this to estimate worst-case memory upset rates due to cosmic rays as a function of confidence level. I validated the method using Monte Carlo methods–all in Excel.
    I’ve found that the biggest limitation for file size is actually the operating system. I have some files larger than 100 Mbytes that I could run on a Windows 2000 computer with 256 Mbytes of memory that won’t even open on a Windows XP machine with faster processor and 512 M.
    Would it be easier to do this in C or even Mathematica–sure. But I have yet to find something I couldn’t do in Excel.

    I have a book that is pretty good:
    http://www.amazon.com/Excel-Scientists-Engineers-Numerical-Methods/dp/0471387347/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1230087537&sr=8-1

  • Hank Roberts // December 24, 2008 at 4:54 am

    P. Lewis, try that link again? There’s no URL in the page source.

  • Hank Roberts // December 24, 2008 at 5:09 am

    I know my methods, why don’t I apply them?
    Any of these, in particular?
    http://www.google.com/search?q=Excel+climate+models

  • dhogaza // December 24, 2008 at 5:55 am

    I’ve found that the biggest limitation for file size is actually the operating system. I have some files larger than 100 Mbytes that I could run on a Windows 2000 computer with 256 Mbytes of memory that won’t even open on a Windows XP machine with faster processor and 512 M.

    Yep, as I said above, modern operating systems (I meant that actually as a sneer against recent MS efforts) are RAM pigs.

    But, in their defense, if 2GB RAM chips cost $22, rather than pounding the pavement in annoyance, just give in …

  • michel // December 24, 2008 at 6:35 am

    HR -

    Three issues with Excel.

    One, it used to have real deficiencies in its built in stats functions, maybe still does.

    http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf

    You can find other references.

    Second, people don’t realize that what they are doing in Excel is writing programs. Mostly of course they never have written programs in a real language. And Excel permits and even encourages the most basic programming style faults. It’s hard to document - you have no ‘comment lines’. Go to’s or their equivalent are almost universal. The IDE, if you can call it that, is beyond primitive. Most people have no idea they are even looking at an IDE; they think they are just looking at their data. And the data and the code are inextricably mixed up with each other. The result is that studies of spreadsheets in actual use have shown them to be commonly riddled with errors. Start here:

    http://www.eusprig.org/

    A third issue is that it can be intolerably slow. Minutes to load has been mentioned here - that is quite easy to bring about. It’s not just Excel; if it takes 5 minutes in Excel, OpenOffice will be the same. It’s spreadsheets.

    So, what’s the answer? After all, Excel is used by Finance departments all over the world, used for what seems like sophisticated financial analysis, business case modelling etc. Charts generated by it are being shown in Powerpoint presentations at this very moment probably, and Boards and senior managers have been making decisions about the disposition of billions on the basis of them for years. With results that are perhaps all too apparent today, and which were apparent at the conclusion of the spreadsheet fuelled dot.com bubble!

    The answer is one that most Excel enthusiasts really do not want to hear. It is, get the right tool for the job. For statistics, R. For general programming, Python is one possibility. Both Lutz and Hetland have written excellent books.

    For databases many will recommend FileMaker to the end user. It’s so easy. Yes, and like Excel, you have this awful inextricable muddling of program and data, and the similar fatal ability to write relational database programs without having a clue that that is what you’re doing. Python with SQLite. Yes, it will be a lot harder to learn.

    If you can manage to keep a spreadsheet down to one or two well structured small sheets and not jump around, they are fine. Or maybe if, like some posters may be, you have the kind of mind that can work without errors in that very counterproductive environment. But most people will in the long run be better off minimizing dependency on and use of spreadsheets. Especially for business planning and business case use. And that is not too far removed from what is being done in climate studies.

    Get her Hetland’s book for the New Year. Python is a free download. Or if Hetland is a bit too much to start with, also get the new Python version of ‘How to Think Like a Computer Scientist’ by Allen Downey. It’s out in February. Excellent.

    Happy Christmas all.

  • michel // December 24, 2008 at 6:56 am

    HR

    The new edition of Downey’s book has changed its title and is

    Python for Software Design. You can get a description, links to an e-version of the earlier one and the drafts of this one at

    http://www.greenteapress.com/thinkpython/

  • Barry Brook // December 24, 2008 at 7:25 am

    A problem with frequentist smoothers can be the difficulty of parameter counting. GAMs, for instance, can provide tight ‘fits’ but risk being heavily over-parameterised — making it difficult to determine how much precision you are trading off for your bias reduction.

    Model selection (or ideally), multi-model inference can help here with well-specified models. Concepts such as AIC (Akaike Information Criterion — technically an estimate of the relative Kullback-Leibler information, with the bias correction term acting to reinforce parsimony) or BIC (Bayesian Information Criterion — essentially an approximation of the posterior model probability using an MLE, under the assumption of uninformative priors) are useful for evaluating this trade-off.

    DIC (Deviance Information Criterion — a true Bayesian model selection criterion) helps where it is difficult to ‘parameter count’ (as is the case with smoothers) — because an estimate of the effective number of parameters can be derived from the difference between the point estimate of the deviance and the mean of the posterior.

    Leave-one-out or k-fold cross-validation is another way to get at the overfit problem, but it turns out that AIC is an asymptotic approximation of C-V anyway, and is much less computationally intensive to calculate.

    These methods are moving the statistical modelling of time series — at least in my area of speciality — away from the days of ‘black magic’. I’ve written about it recently in BioScience.

    Gavin’s Pussycat — I use bootstrapping regularly (there is little need for jackknifing these days), but it has such a variety of uses that it would be difficult to do it justice in a single post!

  • P. Lewis // December 24, 2008 at 4:01 pm

    Oops! Sure I did it right. Oh well… link as text:

    http://rowland.worc.ac.uk/mec/Excel/Index.html

  • Hank Roberts // December 24, 2008 at 5:06 pm

    The article Barry coauthored looks worth a topic, Tamino, if the statistically literate have more on it.

    Alternatives — yep, but love/hate/persistence with Excel is strong for anyone working with businesses.

    I found this long since and my Excel consultant has shown it to every employer she’s worked for, to their edification:

    Ray Panko’s Spreadsheet Research Website
    http://panko.shidler.hawaii.edu. …
    spreadsheet research focused on errors

  • Barton Paul Levenson // December 25, 2008 at 10:19 am

    michel,

    To write coherent programs in Excel, use the built-in programming language, VBA. It comes with a complete IDE including point help references, and a huge range of statements, operators, and aids to structured programming such as the ability to write separate subroutines and functions. In addition, there is specialty code available to control Excel spreadsheets in particular. And the ability to create forms for user interaction.

  • Barton Paul Levenson // December 25, 2008 at 10:20 am

    Note — to get to VBA (Visual Basic for Applications), go through the Macro menu item.

  • Hank Roberts // December 25, 2008 at 7:47 pm

    Aside — Joseph’s blog here also offers some Excel spreadsheets related to climate:
    http://residualanalysis.blogspot.com/2008/06/anthropogenic-global-warming-is.html

    On VBA, I can speak well of that from having learned a bit writing Word macros (for example taking large tab-delimited ASCII database reports and producing documents formatted for readability in print form). I learned a lot by asking in the microsoft.public. newsgroups, much now available at http://www.mvps.org/

    My first introduction to the delight of online help — I found I could ask a question at 1200 baud at 2am, and someone (John McGhie, usually) in Australia would come up with a first answer, and several people in Europe would improve on it, and someone in Pennsylvania would find ways to better the answer, and I’d wake up to find my question had gone all the way around the world and come back as answers, explanations, and usable code.

    The future did work as I’d hoped as a kid, in some ways.

    As Joseph says over at that link:

    “… The nice thing about this technique is that it is completely accessible to anyone with Excel installed. It can also be illustrated graphically, as the reader will see….”

  • michel // December 26, 2008 at 7:46 am

    Oh dear. Oh dear. Guys, yes, you can write macros in Excel. In VBA no less. Yes. That is part of the problem, not part of the solution. It makes your Excel programs even harder to structure and even more of a black box. But as here, you can tell people this till you are blue in the face, and they will still think Excel is the solution.

    The combination of Powerpoint for the presentation of investment appraisals, and Excel (with VBA) for doing investment appraisal and modelling has been an essential ingredient in the collective madness of decision making that led to the bubble and has now led to the bust.

    The idea of people seriously trying to write climate models in Excel, with or without VBA, well, it’s a nightmare. The only worse thing would be if they were to then create Powerpoint presentations based on their models, and invite us all to meetings where we had to make big money investment decisions based on the results.

    You are better off with one page of numbers generated on a scientific calculator. Yes, that means you can’t model in as much detail as you want. That, if the alternative is as much detail as you want in Excel, is an advantage.

    Don’t use Excel. Use a real language. Though one knows, sadly, that in response to this people will write in and say that they are using Excel for this and that enormous and complicated scientific or investment model, and it’s fine. In our culture, someone is in denial about Excel. The question is who.

  • michel // December 26, 2008 at 7:52 am

    What I’m basically saying on the VBA issue is, use a real programming language, and if you insist on using an MS language, fine, use Visual Basic. Just don’t insist on using it embedded within Excel!

    Well, that is probably enough on this topic. Good luck with it, however you do it.

  • Barton Paul Levenson // December 27, 2008 at 11:47 am

    michel, I don’t understand your point. How is VBA not a programming language, and how is it not structured? The language gives you input, processing, and output on a fairly sophisticated level, and it allows subroutines and functions and sophisticated branching code, which is what “structured” originally meant. Are you objecting to the fact that it’s not object oriented? Fortran isn’t, either.

    For the record, I write my own climate apps as console routines in Fortran-95. But that’s just because I prefer console output to putting the answers in Excel spreadsheet cells.

    The only real objection I can find to climate modeling in Excel is that the speed is compromised because the macros are interpreted and not compiled. You therefore couldn’t use it for apps like radiative-convective models, which is what I mostly work with. But even there, a person with patience could let his model run for several hours compared to the several minutes it would take with a compiled language.

  • michel // December 28, 2008 at 7:31 am

    I’m suggesting using proper Visual Basic, if that’s the language you like, but not in Excel macros. People have arguments about the merits of VB as opposed to other languages, but that is a refinement. Excel, with or without macros, whether VBA or not, is the thing I am suggesting avoiding.

    But look, good luck with it, it will be an interesting and instructive project whatever.
