[Archive: 23 May 1998]


Hidden truths

Illustration: Ewan Fraser


From missing butterflies to drugs trials that were never published, the data sleuths can track them down. Robert Matthews is hot on the trail

HOW many butterflies have you never seen? It sounds like a riddle from Alice's Adventures in Wonderland, a patently silly question without any rational answer.

Yet perfectly sensible people ask similar questions all the time. Military intelligence analysts want to know how many tanks the other side has, without much hope of being allowed to count them. Health planners need to know how big a drugs problem a city faces, but drug users aren't exactly keen to talk to pollsters. These are sensible questions, but can they ever have sensible answers? Surely no one can tell you how many butterflies--or tanks, or drug addicts--are out there that you have never seen?

Step forward the data sleuths: mathematicians armed with a whole array of ingenious methods for solving the Case of the Missing Data. Feed them just a few titbits of data gleaned by a researcher--or a spy--and they'll show you how they fit into the big picture you have yet to see. Or hand them a pile of published evidence apparently backing some new miracle treatment, and they'll tell you if there is an even bigger pile of unpublished evidence pointing the other way.

Wielding that sort of power, it's no wonder that much of the work of the data sleuths is strictly hush-hush. During the Second World War, the Allies used their expertise to discover clues to the capabilities of the German war machine (see "Data sleuths go to war", below). And in 1987, spycatcher Peter Wright, the former assistant director of MI5, Britain's counterintelligence organisation, revealed that data-sleuthing methods were used to analyse intercepted transmissions and estimate the number of Soviet spies active in Britain.

But the skills of the data sleuths are now being recognised by a much wider clientele. After all, researchers trying to gauge the extent of social problems like drugs and vice face similar problems to MI5: people they are interested in generally aren't keen to talk to the authorities.

Data sleuths have devised some clever ways of identifying such ghost populations, and one of their favourites is "capture-recapture"--a technique borrowed from ecology. Suppose health officials are trying to gauge the number of prostitutes working the streets of a large city. Using capture-recapture, they go out on the streets and interview any prostitutes they bump into ("capture"). Then a few weeks later they conduct a second survey to see how many of the prostitutes from the first survey turn up again ("recapture"). Clearly, the bigger the total population of prostitutes, the smaller the chances of encountering the same ones twice.

And it's this simple fact that gives a measure of the size of the total population. For example, suppose they "capture" and record the details of 100 prostitutes on the first survey, wait a few weeks and then carry out another survey of 100 prostitutes. If the researchers find that, say, 10 per cent of those in the second survey had been seen before, this would imply that the original 100 prostitutes must constitute 10 per cent of the unknown total population--which must therefore be 1000.
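The arithmetic behind that worked example is the classic Lincoln-Petersen estimator from ecology. A minimal sketch (the function name is mine, not a standard one):

```python
def capture_recapture(first_sample, second_sample, recaptured):
    """Lincoln-Petersen estimate of a total population size.

    first_sample:  number recorded in the first survey
    second_sample: number seen in the second survey
    recaptured:    how many of the second survey were already known
    """
    if recaptured == 0:
        raise ValueError("no recaptures: population too large to estimate")
    # The fraction of the second sample already seen estimates the
    # fraction of the whole population covered by the first survey.
    return first_sample * second_sample / recaptured

# The article's worked example: 100 captured, 100 resampled, 10 recaptured.
print(capture_recapture(100, 100, 10))  # → 1000.0
```

In practice researchers use refinements of this formula (the estimate is unstable when recaptures are rare), but the logic is exactly the one described above.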

In 1991, Neil McKeganey and his colleagues at Glasgow University used this approach to estimate the number of HIV-positive prostitutes in Glasgow--a doubly stigmatised population but a key one in determining the spread of HIV. Previous studies suggested that around 2 per cent of prostitutes were HIV-positive, but this figure was all but useless without some idea of how many prostitutes there were ("Stalking HIV in the red light area", New Scientist, 12 June 1993, p 22).

Their survey pointed to a total of around 1150, and saliva tests carried out at the same time suggested that around 29 of the prostitutes were HIV-positive. That gave an overall HIV prevalence of around 2.5 per cent--in line with the results of the earlier prevalence studies.

Changing perspectives

In 1996, following the success of the HIV survey, McKeganey and Gordon Hay of the Centre for Drug Misuse Research at Glasgow University used a sophisticated multiple-sample version of capture-recapture to turn attendance data from drug centres and police arrest records into an estimate of the number of drug abusers in Dundee. The files revealed 900 directly, but the data-sleuthing methods suggested that the true population was around 2700. According to McKeganey, this means that health officials in Dundee could be facing as big a drugs problem as that in Glasgow, which has one of the worst drugs problems in Britain.

Like many data-sleuthing methods, capture-recapture was originally devised to help ecologists get a handle on the size of animal populations: they literally capture, release and recapture animals. Yet some of the cleverest applications have emerged when mathematicians spot quirky analogies between trapping animals and trapping more abstract beasts.

An intriguing example centres on the works of William Shakespeare. For centuries, controversy has raged among literary scholars about the origins of the Bard's works. Some insist he collaborated with contemporaries on some of his plays, while others claim that Shakespeare is merely a pseudonym for a whole host of contemporaries, such as Christopher Marlowe or Francis Bacon.

Traditionally, these arguments have focused on the words used in plays attributed to Shakespeare. But literary scholar Ward Elliott and mathematician Robert Valenza of Claremont McKenna College in Claremont, California, have been using data-sleuthing methods to find new clues among the words Shakespeare might have used, but didn't. To do this they have exploited a method invented more than 50 years ago by the pioneering statistician Ronald Fisher to solve our original data-sleuthing riddle: just how many butterflies are there out there which have never been seen?

In 1943, a naturalist who had just returned from a butterfly-hunting expedition to Malaya posed this riddle to Fisher, then a professor of genetics at Cambridge University. Having spent several months out in the jungle, the naturalist had trapped a number of species new to science. He couldn't help feeling, however, that if he had spent more time out there, he would have caught yet more unknown butterflies. But how many more?

With characteristic genius, Fisher realised that this "silly" question could be solved by applying the laws of probability. The idea was that the bigger the population of a species, the greater the chances of catching a member of that species. By noting how many different butterflies of each species were caught in a given time--say, a three-month expedition--Fisher showed how to estimate the total populations.

A few decades later, Ronald Thisted of the University of Chicago and Bradley Efron at Stanford University in California reported a radically different application of this technique ("A bard by any other name", New Scientist, 22 January 1994, p 23). They pointed out that every piece of text written by an author is an "expedition" into the author's total vocabulary, with the words of the text being the "butterflies" sought by Fisher's naturalist. Extending the analogy, Efron and Thisted showed that there are different "species" of word, identifiable by the frequency with which they appear. For example, out of the total of 885 000 words in the known works of Shakespeare, around 4400 appear twice, 2300 three times and so on.

But it's the 14 400 that appear just once that are the most intriguing. When each of these words makes its debut in an individual play, it's as if the word had been "trapped" from the species pool of words that Shakespeare knew, but never used. Thisted and Efron realised that they could look at how many new words were trapped in a series of undisputed Shakespeare plays, and then use Fisher's basic idea to estimate how many unused words were still out there in Shakespeare's hidden vocabulary. Armed with this information, they argued, it should be possible to predict how many of these words should appear in each new, disputed play, sonnet or whatever. In other words, Shakespeare's first-time use of these words could provide a "fingerprint" of his writing style.
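A simplified sketch of the underlying extrapolation, in the Good-Toulmin style that Efron and Thisted's work built on, can be run on the frequency counts quoted above. This is not their exact statistic, and the 25 000-word play length is an assumption of mine, but the alternating series is the standard form of the estimate:

```python
def expected_new_words(freq_counts, extra_fraction):
    """Good-Toulmin-style estimate of how many never-before-seen words
    should turn up if the corpus grows by `extra_fraction` of its size.

    freq_counts[k-1] is the number of distinct words seen exactly k times.
    The estimate is the alternating series n1*t - n2*t^2 + n3*t^3 - ...
    """
    t = extra_fraction
    total = 0.0
    for k, n_k in enumerate(freq_counts, start=1):
        total += ((-1) ** (k + 1)) * n_k * t ** k
    return total

# Counts quoted in the article: 14 400 words appear once, 4400 twice,
# 2300 three times. A new play of ~25 000 words (an assumed length)
# would enlarge the 885 000-word canon by about 2.8 per cent.
t = 25_000 / 885_000
print(round(expected_new_words([14_400, 4_400, 2_300], t)))
```

The answer comes out at roughly 400 new words, pleasingly close to the 300-to-400 "fingerprint" described below.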

The Bard's fingerprints

Although early attempts to exploit this idea met with mixed results, in 1996 Elliott and Valenza succeeded in turning it into a technique capable of casting light on many literary mysteries. Applying it to Shakespeare's works, they found a reliable "fingerprint": in each play, Shakespeare typically used around 300 to 400 new words from his "hidden" vocabulary. Crucially, however, when Elliott and Valenza applied the same test to Shakespeare's contemporaries, such as Marlowe, Thomas Middleton and Ben Jonson, they found rates of new word use quite different from those of Shakespeare, reflecting their different hidden vocabularies. This let them pour cold water on the claims that Shakespeare was merely a pseudonym for one of his contemporaries. Shakespeare, it seems, really was Shakespeare.

Similar methods are now pouring neat petrol on a far more inflammable issue: whether doctors can trust the medical literature when it bombards them with apparently impressive evidence of major breakthroughs. Everyone expects the media to hype new medical findings. But data sleuths are finding worrying evidence that even the academic literature gives a slanted view of the truth about new treatments--one that is often unjustifiably optimistic.

Testing a drug typically starts in small trials, with just a few dozen patients given the drug or a placebo. Being small, such trials cannot give the precise results of huge international studies, and their findings tend to show a lot of "scatter": some point to no benefit, while others seem startlingly impressive.

But doctors have long suspected that the studies showing no significant effect tend to get locked in the filing cabinets of hospitals and drugs companies and quietly forgotten. After all, it is hard to get enthusiastic about negative results, and even academic journals have limited space to publish research, and an eye for big media coverage.

Yet the danger of this "publication bias" is obvious: it can give a completely misleading impression of the value of a new drug. Worse, that impression can harden into statistical fact if the results of the small studies are pooled in a meta-analysis designed to reach supposedly more reliable conclusions.

Suspicions about publication bias received worrying confirmation last year with the publication in the British Medical Journal of a study by Jerome Stern and John Simes of the University of Sydney. The researchers chose more than 200 medical studies carried out between 1979 and 1988 and then followed them through the publication process. They found that those which uncovered positive effects were more than twice as likely to be published as those that failed to find anything. The bias was even stronger for clinical trials of new treatments, with positive results being more than three times as likely to find their way into the academic literature.

What's missing?

All of which came as little surprise to medical data sleuths, who have developed some natty methods for finding evidence of missing data. Armed with their mathematical crowbars, they have found lots of evidence that negative studies do end up forgotten in filing cabinets.

One of their favourite tools is the funnel plot. This exploits the basic statistical fact that the more data you collect, the more precise your findings will be. For medical trials of, say, drug efficacy, small studies typically give results scattered around the true figure, while the large ones tend to be quite closely clustered around it. Plotting published findings against study size on a graph should thus give a kind of "funnel", with the small, imprecise studies at its base, and the large, accurate studies at its apex (see Diagram).

If the funnel looks distinctly bent, however, it means the published data aren't giving the full picture. Results that should have been reported have somehow gone missing--and doctors are in danger of getting a biased view of reality. Many data sleuths rely on their eyes alone to detect that data are missing from funnel plots. A gaping hole in the funnel plot, where lots of small studies reporting no effect should be, is hard to miss. But Matthias Egger and colleagues at Bristol University recently found a way to quantify just how bent funnel plots are, and the nature of the bias.
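Egger's published test works by regressing each study's standardised effect on its precision; a simplified sketch of that idea is below. The function names and the symmetric demonstration data are mine, for illustration only:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def egger_intercept(effects, std_errors):
    """Simplified Egger-style asymmetry check: regress the standardised
    effect (effect/SE) on precision (1/SE). An intercept well away from
    zero signals a 'bent', asymmetric funnel."""
    zs = [e / se for e, se in zip(effects, std_errors)]
    precisions = [1 / se for se in std_errors]
    intercept, _slope = linear_fit(precisions, zs)
    return intercept

# A symmetric funnel: same underlying effect, only precision varies,
# so the intercept comes out at (floating-point) zero.
print(egger_intercept([0.5, 0.5, 0.5], [0.1, 0.2, 0.5]))
```

When small studies with weak results are missing, the published small studies are skewed towards strong effects, and the intercept drifts away from zero, which is what Egger's team quantified.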

Their findings, published last September in the BMJ, make disturbing reading. Analysing the funnel plots of 75 published medical studies, they found that no fewer than a quarter showed significant signs of missing data: small studies with negative effects had a habit of staying locked in filing cabinets.

To show how crucial these missing data are in judging the true effectiveness of new treatments, Egger and his colleagues then focused on eight research projects where in each case lots of small studies had culminated in a single, huge study aimed at settling the issue once and for all.

All eight--which ranged from treatments for heart attacks to the use of aspirin to fight pregnancy disorders--had shown great promise in small studies. But this promise evaporated in four of the big studies. And, sure enough, those that failed to live up to early expectations had seriously skewed funnel plots: negative results had been filed away and forgotten.

Egger's findings have added weight to calls for medical scientists to be given access to all study results, so that they have a better idea about whether new therapies cut the mustard. More than a hundred leading medical journals, including the BMJ and The Lancet, now contribute to a register of unreported trials to fill in the gaps. Even so, data sleuths like Egger still want to see published research findings routinely scanned for signs of publication bias.

Knowing about the existence of missing data is one thing, but what everyone really wants is some idea of what all those missing data say, and what effect they would have on the big picture. Some data sleuths are now taking on this ultimate challenge, with results that are already proving controversial.

Passive smoking

At Colorado State University in Fort Collins, Geof Givens and his colleagues have been investigating mathematical ways of deducing things about missing studies. One approach is a probabilistic model based on the notion that the less statistically impressive a study's finding, the less likely it is to get published. Combined with a model for how such negative studies emerge as research into a new therapy continues, Givens and his colleagues have found they can estimate the number of missing studies, their likely message, and how they affect any conclusions that have been based only on published results.
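Givens' actual model is more sophisticated, but the notion it rests on is easy to demonstrate with a toy simulation (entirely my own construction, with made-up numbers): if "impressive" results are more likely to be published, the published literature alone overstates the true effect.

```python
import random

random.seed(1998)

TRUE_EFFECT = 0.2          # assumed real benefit of a hypothetical drug
published, all_studies = [], []

for _ in range(500):
    se = random.uniform(0.05, 0.5)          # small studies have big SEs
    effect = random.gauss(TRUE_EFFECT, se)  # each study's observed effect
    all_studies.append(effect)
    z = effect / se
    # 'Impressive' results (z > 1.96) always appear in print; the rest
    # make it out of the filing cabinet only one time in five.
    if z > 1.96 or random.random() < 0.2:
        published.append(effect)

mean = lambda xs: sum(xs) / len(xs)
print(f"mean over all studies:     {mean(all_studies):.3f}")
print(f"mean over published alone: {mean(published):.3f}")
```

The published-only average comes out well above the true effect; working backwards from that kind of distortion to the number and message of the missing studies is, in essence, what Givens and his colleagues do.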

As a test-bed for their technique, Givens and his colleagues picked one of the most controversial of all medical debates: the link between passive smoking and lung cancer. Over the years, dozens of studies of passive smoking have been published and in 1992 the US Environmental Protection Agency gained huge media attention with an analysis of studies that pointed to a 19 per cent higher risk of lung cancer among passive smokers. The funnel plot for the EPA data hints at a more complex story, however. It is far from symmetric--hinting at the presence of missing data showing no significant effect. But how do these missing data affect the case against passive smoking?

Givens and his colleagues decided to find out. In a paper published in Statistical Science late last year, they concluded that the EPA had missed about five sets of unpublished results. And while this might not sound a lot, they were sufficient to bring the overall risk estimate for passive smoking down by about 30 per cent, making the case against passive smoking much more equivocal than many believe.

Both the conclusion and the way it was reached are still very controversial. Health officials working for the American and British governments remain adamant that the case against passive smoking is solid, and the techniques used by Givens and his colleagues have been criticised for being based on too many questionable assumptions.

There's a twist to the story, however. Previous worries about the effect of publication bias in the passive smoking debate had already prompted attempts to solve the missing results problem the low-tech way, by physically tracking them down. Sure enough, in 1994 some more US data were unearthed in the form of five unpublished negative studies that had been left out of the EPA analysis--just the number that Givens' team calculated had gone missing.

A small success, no doubt. But it helps the cause of data sleuths such as Givens and his colleagues, and supports their maxim for life in a data-swamped world: what you don't know can be just as important as what you do.

Robert Matthews is the science correspondent of The Sunday Telegraph

Data sleuths go to war

DURING the Second World War, the Allies used data-sleuthing methods to deduce the productivity of Germany's armament factories using nothing more than the serial numbers found on captured equipment.

It worked like this. Suppose a new tank starts appearing on battlefields, and that some have been captured. Close examination of the tank reveals a serial number tucked away on the gearbox. For simplicity, assume that the serial number runs from 1 to N, where N is the latest tank off the production line. The question is: how big is N--or, more bluntly, how many tanks has the enemy built so far?

One insight comes immediately: if the biggest serial number seen is, say, 2355, then there must be at least 2355 tanks. But the randomness of the capture process allows a sharper estimate. Suppose the biggest serial number found is B, and the smallest is S. On average, there should be as many unseen serial numbers above B as there are below S--and since numbering starts at 1, there are S - 1 of those. So N - B will be roughly equal to S - 1, and the total number of tanks N is about B + S - 1.

Roger Johnson of Carleton College in Northfield, Minnesota, has recently shown that there is an even better estimate: N = (1 + 1/C)B - 1, where C is the number of captured tanks. But even the rough formula is clearly not as widely known as it should be. According to Johnson, during the 1980s an American military official was given access to the production line of Israel's Merkava tank. When the official asked how many Merkavas were being produced, he was told the information was classified. "I found it amusing," the official later told Johnson, "because there was a serial number on each tank chassis."
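Both estimates are one-liners to compute. The captured serial numbers below are invented for illustration:

```python
def rough_tank_estimate(serials):
    """N ≈ B + S - 1: on average, as many unseen serial numbers sit
    above the biggest seen as sit below the smallest."""
    return max(serials) + min(serials) - 1

def johnson_tank_estimate(serials):
    """Johnson's sharper estimate N = (1 + 1/C)B - 1, where C is the
    number of captured tanks."""
    c = len(serials)
    return (1 + 1 / c) * max(serials) - 1

captured = [19, 40, 42, 60]   # made-up serial numbers for illustration
print(rough_tank_estimate(captured))    # → 78
print(johnson_tank_estimate(captured))  # → 74.0
```

Johnson's version uses only the biggest serial number but corrects it upwards by the average gap between captures, which is why it typically beats the rough formula.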

From New Scientist, 23 May 1998


© Copyright New Scientist, RBI Limited 2001