
Tooth Fairy Science and Other Pitfalls: Applying Rigorous Science to Messy Medicine

A presentation by Harriet Hall, MD
Original PowerPoint (3.9M PPT)



Title Slide

Many of you know I wrote this book, entitled “Women Aren’t Supposed to Fly: the Memoirs of a Female Flight Surgeon.” I thought I would share with you the REAL reason women aren’t supposed to fly.

That’s her purse hanging from the instrument panel.

This workshop is called the Skeptic’s Toolbox, and every year we talk about what makes some people skeptics and others not. You can take two people who believe in dowsing and explain the ideomotor effect and show how dowsing has consistently failed every properly controlled test. One person will accept the evidence and stop believing in dowsing. The other will ignore the evidence and continue to believe. Why is that? Bob Carroll, author of The Skeptic’s Dictionary, has said that critical thinking is an unnatural act. Science doesn’t come naturally.

Jerry Andrus used to come to the Toolbox and show us his close-up magic tricks and optical illusions. He would tell us “The reason I can fool you is that you have a wonderful brain.” Our brain takes the odd contraption on the right, and from another point of view on the left it assembles it into a nonexistent, impossible box.

Psychiatrist Morgan Levy said, “‘Thinking like a human’ is not a logical way to think, but it is not a stupid way to think either. You could say that our thinking is intelligently illogical. Millions of years of evolution did not result in humans that think like a computer. It is precisely because we think in an intelligently illogical way that our predecessors were able to survive.”

He also said “Scientists expend an enormous amount of time and energy going to school in order to learn how to undo the effects of evolution so that they can investigate natural phenomena in a logical way.” Education helps, but it isn’t enough. We all know highly educated people who are not skeptics.

Ray Hyman has suggested that skeptics are mutants. Has something new evolved in our brains to help us overcome our intelligently illogical thinking processes? Well, Ray, I’m here to tell you you are right. Skeptics ARE mutants, and I have proof…

This is a picture of my daughter Kimberly.

But to get back to the subject of my talk… I’m going to explain some of the pitfalls we encounter when we apply the scientific method to clinical medicine. I’m going to try to give you a feel for how I evaluate a study to see if it is credible. I’m going to suggest some questions you may want to ask the next time you visit your doctor. This is the Skeptic’s Toolbox, and I’m hoping to offer you some tools so that the next time you see a report in the media claiming that broccoli causes cancer you will have a better handle on how to evaluate the report, what questions to ask, and how to decide whether you should immediately stop eating broccoli.

Before we had science we had to rely on these things. Folk remedies, old wives’ tales, herb women, witch doctors, superstitions. The plague doctor in the upper left picture wore a beak-like mask stuffed with herbs in the belief that it would keep him from catching the disease. Instead of scientific trials we had only case reports and testimonials – for centuries we used fleams like the one in the upper right picture for bloodletting and we kept doing it because both patients and doctors told us it worked. The proclamations of authorities like Hippocrates and Galen were never tested or questioned, and their errors were passed down for centuries.

Eventually, we crossed the line from proto-science to modern science, for instance from alchemy to chemistry.

Here’s a short history of medicine: 2000 BC: eat this root. 1000 AD: that root is heathen, say this prayer. 1850 AD: that prayer is superstition, here, drink this potion. 1920 AD: that potion is snake oil, here, take this pill. 1965 AD: that pill is ineffective, here, take this antibiotic – and then…

Back to square one: 2000 AD: that antibiotic is artificial, here, eat this root. But that’s only in some circles. Mostly we stick to science-based medicine today.

Scientific method is nothing special; it’s really just a way of thinking about a problem, forming a hypothesis, selecting one variable at a time, and testing it. It is the only discipline that allows us to reliably distinguish myth from fact. It can be as simple as trouble-shooting to find out why the radio isn’t working. Hypothesis: it’s not plugged in. Test: check the plug. And so on.

We have gradually learned better ways to test. The picture on the left is of Don Quixote. In the novel, he is assembling some old armor to wear on his quest and discovers that the visor is no longer attached to the helmet. He attaches it with ribbons, and when he tests it by striking it with his sword, the visor falls off. He re-attaches it more carefully and the second time he decides he doesn’t really need to test it again. Modern science has developed more rigorous ways to do tests.

Hard science is easy, soft science is hard. When you mix chemicals A and B you always get chemical C, and you can calculate exactly how much will be produced. But medical studies are more problematic. Every molecule of a chemical is the same, but humans are not all the same. We have genetic differences, and there are confounders like age, diet, alcohol, and concurrent diseases that may all influence the response to a treatment. Subjects in a study may not take all their assigned pills. In medical studies, A plus B may appear to equal C or D or E.

The first modern clinical trial was done by James Lind of the Royal Navy in 1747. Back then, the sailing ships went out for years at a time and sailors had no access to fresh foods. Many of them developed scurvy, where they became weak, unable to work, and had internal bleeding, bleeding gums, and other symptoms. Today we know this was due to a deficiency of vitamin C. But vitamins weren’t discovered for another 2 centuries. Lind had heard reports of successful treatment, and he had developed the hypothesis that scurvy was due to putrefaction and could be prevented by acids. He divided 12 sick sailors into 6 groups, kept them all on the same diet, and gave each group a different test remedy: a quart of cider, 25 drops of elixir of vitriol (sulfuric acid – I hope he diluted it in water before he gave it to them!), 6 spoons of vinegar, half a pint of seawater, two oranges and a lemon, or a spicy paste plus a drink of barley. The winning combination was two oranges and a lemon. It worked, but he still didn’t understand HOW it worked. He tried sending ships out with bottled juice to save storage space, and that didn’t work – the bottling process heated the juice and destroyed the vitamin C.

Applying scientific method to medicine led to lots of great discoveries. Vaccines are probably responsible for saving more lives than anything else. Reliable birth control allowed women to take control of their lives and contribute to society in all kinds of occupations. Antibiotics reduced the toll of infections. We developed fantastic imaging methods (x-rays, ultrasound, CT, MRI, PET scans) (that’s Homer Simpson’s skull x-ray) that enabled us to see inside the living body and make diagnoses without waiting for the autopsy. Diabetes used to kill all its victims: insulin keeps them alive today. We can even transplant organs. And, since I mentioned birth control pills for the women, I’ll mention Viagra for the men. Some men probably think that’s one of medical science’s greatest inventions.

Modern medicine has accomplished a lot.

But the greatest discovery of all was probably the randomized controlled trial (RCT). It was the method that enabled us to make most of those great discoveries.

The R in RCT is randomization. When comparing the responses of two groups, you want to make sure the groups are comparable. If you steered all the sick, old people into the treatment group and all the young, healthy ones into the placebo group, an effective treatment might appear to be a dud. To make sure there is no bias in group assignment, it’s best to use concealed allocation, where the researcher who assigns patients to groups 1 and 2 doesn’t know which is the placebo group; only another researcher knows that. Once you have assigned patients randomly to the groups, you still need to go back and check that the two groups really are similar.
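To make the idea concrete, here is a minimal sketch (in Python) of randomization with concealed allocation. The patient list and group labels are hypothetical, invented only to illustrate the mechanics, not taken from any real trial:

```python
import random

# Hypothetical patient IDs; in a real trial these come from enrollment records.
patients = [f"patient_{i:03d}" for i in range(1, 21)]

# A fixed seed just makes this example reproducible.
random.seed(42)
shuffled = random.sample(patients, k=len(patients))
group_1 = shuffled[:len(shuffled) // 2]
group_2 = shuffled[len(shuffled) // 2:]

# Concealed allocation: the clinician enrolling patients sees only
# "Group 1" / "Group 2".  A separate researcher holds the key below.
code_key = {"Group 1": "placebo", "Group 2": "treatment"}

print("Group 1:", group_1)
print("Group 2:", group_2)
```

After random assignment you would still go back and compare the two groups on age, sex, and other baseline characteristics, which is exactly what the table on the next slide shows.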

In reports of well-designed studies, you will usually see a table like this.

Here’s a close-up. The details aren’t important. The point is that they looked at all kinds of parameters like age, sex, ethnic group, weight, height, blood pressure, heart rate, etc. and calculated statistical measures of how similar the groups were.

The second letter of RCT stands for Controlled. Treatments almost always “work” – even quack treatments often seem to work due to the placebo response and due to the natural course of disease where some people improve without treatment. There is a phenomenon called the Hawthorne effect, where just being enrolled in a study leads to improvement. So instead of just treating subjects and showing that they improve, you need to compare them to a control group of patients who get a placebo, or a known effective treatment, or no treatment at all.

Here’s why that’s important. Almost any disease has a fluctuating course. Symptoms wax and wane unpredictably. Here’s an example of a woman with osteoarthritis pain in her knees. At the beginning of this particular month she hardly had any pain, then it got worse for a while, then it subsided again. This is what it does with no treatment at all. Let’s say she decides to take a pill for the pain. She’s not likely to try it during the first few days when the pain is hardly noticeable. She’s more likely to try it when the pain peaks. And look what happens:

Her pain goes away! Of course this is exactly the same as the previous graph with the left half cut off. But it sure is convincing. It really looks like whatever she did worked.

Now let’s say she has a placebo response to the treatment. The pain was going away anyway, but now it goes away really fast. Of course she’s going to conclude that the treatment worked wonders. So if you want to prove that a new treatment works, you’re going to have to show that it produces more improvement than the natural course of disease and the placebo response.

The best RCTs are double blind. The subject doesn’t know whether he’s getting the treatment or the placebo, and the researcher doesn’t know which he’s giving the patient. This minimizes any unconscious effects of bias. But even double blinding isn’t foolproof. Sometimes patients can guess which group they were in due to side effects. Sometimes they even take the capsules apart and check whether the contents taste like sugar. So you really need to do an exit poll, asking patients which group they thought they were in. If they can guess better than chance, you didn’t have an adequate placebo and your results are tainted.

Now let’s look at how these RCTs are used in practice to develop new drugs.

The first step is to decide what’s worth testing. If an Amazon explorer reports that a tribe chews a certain leaf to treat infections, you don’t just bring those leaves home and give them to people. First you might test it in a lab to see what the components are and whether they suppress growth in a bacterial culture. You might try it on animals with infections. In vitro is Latin for “in glass” and refers to lab testing with test tubes and Petri dishes. In vivo means in a living body.

Animal studies may not be valid. Animals are not always equivalent to humans. Aspirin would have been rejected on the basis of animal studies. It causes congenital defects in mice, but not in humans.

Once you’ve decided that a drug is worth testing, the next step is what we call Phase I trials. These look for toxicity to determine if the drug is safe. You give a single small dose of the drug to healthy volunteers. If they have no adverse effects, you test larger doses over longer periods. Sometimes there are unpleasant surprises. A monoclonal antibody called TGN1412 was tested in animals and appeared to be safe. Researchers gave a dose 500 times lower than the one found safe in animals to six healthy volunteers. They were all hospitalized with multiple organ failure, and some required organ transplants to save their lives.

If the drug passes the Phase I safety trials, the next step is a Phase II trial to see if the drug is effective. Phase I is the first trial in humans; Phase II is the first trial in patients. It tests large numbers of patients with the target disease and compares different doses. Ideally, several trials are done with different patient groups.

If the Phase II trials show the drug works, the next step is a Phase III trial to see if it works better than other treatments. You compare the new drug to an older drug or to a placebo.

Placebo trials may not be ethical. If you have an effective treatment for a disease, you can’t risk patients’ lives by denying them that treatment and assigning them to a placebo group. We couldn’t do a trial today comparing appendectomy to a placebo for acute appendicitis.

The trials don’t end with approval and marketing. They continue with post-marketing Phase IV trials to see what more we can learn. The company may want to demonstrate that the drug works for other illnesses, or show that it works better than a competitor’s product.

And then there’s post-marketing surveillance. The number of subjects in a research trial is small compared to the number of patients who will be taking a drug after it is marketed. Some of them will be different from the people in the trials in various ways, for instance they may have concurrent diseases. The first rotavirus vaccine was taken off the market when they discovered that 1 in 10,000 children developed intussusception, a telescoping of the bowel that is life-threatening. 1 in 100,000 people who got the 1976 swine flu vaccine developed a paralysis called Guillain-Barré syndrome. How many people would you have to enroll in a premarketing study to detect a one-in-100,000 complication? One of the weirdest drug effects I came across was that in men taking a drug called Flomax for prostate symptoms, if they have cataract surgery they can develop a complication called “floppy iris syndrome.” No one could have predicted that!

A researcher named Ioannidis recently wrote a seminal paper showing that most published research findings are wrong. And he explained why.

There are any number of things that can go wrong in research. Here are some of them. If a drug company funds the research, it’s more likely to support their drug than if an independent lab does the study. If the researchers are true believers, all kinds of psychological factors come into play and even if they do their best to be objective, they are at risk of fooling themselves. People who volunteer for a study of acupuncture are likely to believe it might work; people who think acupuncture is nonsense probably won’t sign up. Maybe 3 studies were done and only one showed positive results and that’s the one they submitted for publication (the file drawer effect). Most researchers delegate the day-to-day details of research to subordinates. Sometimes the peons in the trenches are just doing a job and trying to please their boss. They may feed false data to the author or suppress information they know he doesn’t want to hear. Sometimes when you read the conclusion of a study and go back and look at the actual data, the data don’t justify the conclusion. The report can’t possibly contain every detail of the research – what are they not telling us? Were there a lot of dropouts? Maybe when it didn’t work, they quit, and only the ones who got results were left to be counted.

Here are just a few of the other things that can go wrong. Some countries only publish studies with positive results. In China, if you published a study showing something didn’t work you would lose face and lose your job. So I won’t trust any results out of China until they are confirmed in other countries. If there are only a few subjects, errors are more likely. If you study the net worth of 5 people and Bill Gates is one of the 5, you get skewed results. In general, the more subjects, the more you can trust the results. When they calculate the statistics, they can use the wrong method or make mistakes. They can misinterpret the findings. The file drawer effect is when negative studies are not submitted for publication; publication bias is when the journals are less likely to publish negative studies. Inappropriate data mining is when the study doesn’t show what they wanted, and they look at subgroups and tweak the data every which way until they get something that looks positive. Sometimes researchers outright lie and commit fraud to further their careers. Sometimes they get caught, sometimes they don’t.

In thinking about a study, there are several things to consider. Was the endpoint a lab value or a clinical benefit? The diabetes drug Avandia decreased the levels of Hemoglobin A1C, a blood test indicating that the disease is under control. Unfortunately, it increased the mortality rate. Good blood tests aren’t very important if you’re dead. We try to look for POEMS – Patient Oriented Evidence that Matters. Instead of looking at cholesterol levels, we look at numbers of heart attacks. What kind of study was done? RCT, cohort, case-control, epidemiologic? Was it blinded? Was the study well-designed? Did they use an intention to treat analysis to correct for dropouts?

How much background noise was there? If you look at enough endpoints you’re almost certain to find some correlation just by chance. In a study of Gulf War Syndrome they looked at veterans’ wives to see if the husbands had brought anything home to harm them. They found that wives were less likely to have moles and benign skin lesions. Obviously, that was just noise, not a sign that Gulf War Syndrome improves the skin of spouses. Small effects (a 5% improvement) are not as trustworthy as large effects (60% improvement). It’s always better if different types of evidence from different sources arrive at the same conclusion. You should never trust a single paper, but should look at the entire body of published evidence. And you should trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.
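The multiple-endpoint problem is easy to put numbers on. Here is a small sketch of the arithmetic (purely illustrative, not data from the Gulf War study): if you test many unrelated endpoints where nothing real is going on, the chance that at least one comes out “significant” at p=0.05 climbs quickly.

```python
# Chance of at least one spurious "significant" finding when testing
# many independent endpoints that have no real effect.
alpha = 0.05
for n_endpoints in (1, 5, 10, 20, 50):
    p_at_least_one = 1 - (1 - alpha) ** n_endpoints
    print(f"{n_endpoints:>2} endpoints: "
          f"P(at least one false positive) = {p_at_least_one:.0%}")
```

With 20 endpoints, the chance of at least one fluke “correlation” is already about 64%.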

I want to mention the null hypothesis because it can be hard to understand – it has always bothered me because it seems sort of like a double negative. Instead of testing the claim itself – that X cures Y – you test the null hypothesis – that X doesn’t cure Y. There are only 2 options: you can reject the null hypothesis (more people are cured with X than with placebo) or you can not reject it (the same number of people are cured with X as with placebo). You can’t ACCEPT the null hypothesis, because it’s hard to prove a negative. If the null hypothesis is that there are no black swans, all it takes is one black swan to disprove it. No matter how many times you try to fly, you can never prove that you can’t fly. The more times you try it, the less likely, but science can never say absolutely never. It can only say that based on current evidence, the likelihood of a human being able to fly is so vanishingly small that no one in his right mind would jump off a cliff to try it. Of course, it remains open to new evidence and would be willing to reconsider if people could actually show that they could fly.
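Here is a minimal sketch of what rejecting (or failing to reject) the null hypothesis looks like in practice, using invented cure counts and an ordinary two-proportion chi-square test:

```python
from scipy.stats import chi2_contingency

# Hypothetical trial: cured vs not cured, treatment X vs placebo.
# Null hypothesis: X does not cure Y (the cure rates are the same).
observed = [[60, 40],   # treatment X: 60 cured, 40 not cured
            [45, 55]]   # placebo:     45 cured, 55 not cured

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis: X did better than placebo here.")
else:
    print("Fail to reject the null hypothesis: no detectable difference.")
    # Note: failing to reject is NOT the same as proving X does nothing.
```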

All this is very discouraging. If most published research is wrong, when CAN you believe the results? You can look for good quality studies published in good journals, studies that have been confirmed by other studies, studies that are consistent with other knowledge, and studies that have a reasonably high prior probability. As Carl Sagan said, extraordinary claims require extraordinary proof.

Look at this structure. I think you can see that the prior probability of your being able to take 12 dice and construct this is so low that you would not waste your time trying.

Clinical research usually uses the arbitrary level of p=0.05 as the cut-off for statistical significance. Research in physics demands much more. The p=0.05 level essentially means that if the treatment actually does nothing, a result this strong would still turn up by chance in about 1 trial out of 20. In other words, something that doesn’t work might appear to work in 1 out of 20 studies. Does this mean that a study that is statistically significant at p=0.05 is 95% likely to be correct? NO! It may be much less likely to be correct if there is low prior probability. To understand why, let’s look at Bayes’ Theorem.

Bayes’ theorem allows us to use prior probability to calculate posterior probability. Suppose there is a co-ed school having 60% boys and 40% girls as students. The female students wear trousers or skirts in equal numbers; the boys all wear trousers. An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers. What is the probability this student is a girl? The correct answer can be computed using Bayes' theorem. (1 in 4)
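Written out with the slide’s numbers, the calculation looks like this:

```python
# P(girl | trousers) via Bayes' theorem, using the slide's numbers.
p_girl = 0.40
p_boy = 0.60
p_trousers_given_girl = 0.50   # girls split evenly between trousers and skirts
p_trousers_given_boy = 1.00    # all the boys wear trousers

p_trousers = (p_trousers_given_girl * p_girl) + (p_trousers_given_boy * p_boy)
p_girl_given_trousers = (p_trousers_given_girl * p_girl) / p_trousers
print(p_girl_given_trousers)   # 0.25, i.e. 1 in 4
```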

Here’s a table showing what this means for clinical research. In the top row, if the p-value is 0.05 and the prior probability is 1%, the probability that the results are correct is only 3%. If the prior probability is 50%, the posterior probability is still only 73%. In the bottom row, if the p-value shows a phenomenally high significance level of 0.001, and the prior probability is 1%, the posterior probability is still only 50%. Most of us think the prior probability of homeopathy studies is well under 1%. This shows why the homeopathy studies that claim statistical significance can’t be trusted.

If you don’t consider prior probability, you can end up doing what I call Tooth Fairy Science. You can study whether leaving the tooth in a baggie generates more Tooth Fairy money than leaving it wrapped in Kleenex. You can study the average money left for the first tooth versus the last tooth. You can correlate Tooth Fairy proceeds with parental income. You can get reliable data that are reproducible, consistent, and statistically significant. You think you have learned something about the Tooth Fairy. But you haven’t. Your data has another explanation, parental behavior, that you haven’t even considered. You have deceived yourself by trying to do research on something that doesn’t exist.

Ray Hyman’s categorical directive: “Before we try to explain something, we should be sure it actually happened.” Hall’s corollary is “Before we do research on something, we should make sure it exists.”

There’s been a lot of research on the meridians and qi of acupuncture. Here’s an early acupuncture patient.

In acupuncture fairy science, the hypothesis is that sticking needles in specific acupuncture points along acupuncture meridians affects the flow of qi, which improves health. There is no evidence that specific acupuncture points or meridians exist, and no evidence that qi exists. If it did exist, there’s no reason to think it could flow, or that sticking needles in people could affect that flow, or that the flow could improve health. With acupuncture, what could you use as a placebo control? People generally notice when you stick a needle in them, so blinding is very difficult. They have come up with ingenious placebos – comparing random points on the skin to traditional acupuncture points or using sham needles that work like those stage daggers, where they appear to be penetrating the skin but actually just retract into the handle. One recent study used toothpicks that pricked but did not penetrate the skin. No matter what control you pick, a double blind study is impossible, because the acupuncturist has to know what he’s doing. The best studies using sham acupuncture controls have consistently shown that acupuncture works better than no treatment, and that sham acupuncture works just as well as real acupuncture. It doesn’t matter where you stick the needle or whether you use a needle at all. The only thing that seems to matter is whether the patient believes he got acupuncture. The acupuncture fairy believer’s conclusion is that we know acupuncture is effective so sham acupuncture must be effective too. The rational conclusion is that acupuncture works no better than placebo.

I write for the Science-Based Medicine blog, where we make a distinction between evidence-based medicine and science-based medicine. Evidence based medicine simply looks at the published clinical research and accepts the findings. Science based medicine considers preclinical research, prior probability, consistency with the rest of the body of scientific knowledge, and the fallibility of most research. I like to think of it as this formula: SBM = EBM + CT (critical thinking)

One of the pitfalls in evaluating published studies is disease clusters. If the mean incidence of cancer is X, that means some communities will have fewer than X cases and some will have more than X. If you spill rice on a grid, there will be an average number of grains per square, but some squares will have 0 grains and some will have lots of grains. It can be very tricky to determine whether a cluster is due to chance or whether it represents an increased risk in that particular area. In the Love Canal incident, a panel of distinguished doctors recently reviewed the scientific findings to date. They issued a surprising verdict. In their view, no scientific evidence has been offered that the people of Love Canal have suffered "acute health effects" from exposure to the hazardous wastes, nor has the threat of long-term damage been conclusively demonstrated.

Another tricky thing is confidence intervals. The top of each column shows the actual measurement found, and there is a 95% confidence that the true value falls somewhere along the red line. In the example on the left, the lowest end of the red line for the brown column is still higher than the highest end of the red line for the white column, so we can be confident brown is really greater than white. In the example on the right, the red lines overlap: some values within the white column’s line are as high as values within the brown column’s line, so even though brown measured higher, we can’t be confident that brown is really greater than white.
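For the curious, here is roughly how such intervals are computed and compared. The measurements are invented for illustration, and the simple normal-approximation interval here stands in for whatever method a real analysis would use:

```python
import statistics as stats

def mean_and_ci(values, z=1.96):
    """Return the mean and an approximate 95% confidence interval."""
    m = stats.mean(values)
    sem = stats.stdev(values) / (len(values) ** 0.5)  # standard error of the mean
    return m, (m - z * sem, m + z * sem)

# Invented measurements for two groups.
brown = [52, 55, 58, 54, 57, 56, 53, 55]
white = [41, 44, 40, 43, 42, 45, 41, 43]

(m_b, ci_b), (m_w, ci_w) = mean_and_ci(brown), mean_and_ci(white)
print("brown:", m_b, ci_b)
print("white:", m_w, ci_w)
if ci_b[0] > ci_w[1]:
    print("Intervals do not overlap: brown is convincingly higher.")
else:
    print("Intervals overlap: the difference could be due to chance.")
```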

One of the most common human errors is forgetting that correlation doesn’t mean causation. The rise in autism was correlated with a rise in the number of pirates, but I doubt if anyone thinks pirates cause autism or autism causes pirates.

The logical fallacy here is post hoc, ergo propter hoc: assuming that because B follows A, it was caused by A. It’s easy to see why this is a fallacy when you show that the rooster crows every morning, followed by the sun rising. There is a consistent correlation, but we all know it was not the rooster’s crowing that made the sun come up. But think about this: I took a pill, and I got better; therefore the pill made me better. Suddenly it seems perfectly reasonable – but we have to remember that it might be just another rooster.

Does this make sense? It’s from the Scientific American in 1858. A doctor studied the curative effects of light in hospitals, and found that four times as many patients were cured in properly lighted rooms as in dark ones. He said this was “Due to the agency of light, without a full supply of which plants and animals maintain but a sickly and feeble existence.” The editors commented that the health statistics of all civilized countries had improved in the preceding century – maybe because houses were better built to admit more light. I think most of us can see that this correlation did not prove causation – we can come up with other explanations.

A scientist named Hill came up with a list of criteria to determine whether a correlation from epidemiologic studies showed causation. I’ll illustrate with the example of smoking and lung cancer. There is no way we could do a randomized controlled study of smoking – you can’t divide children in two groups and make one group smoke for decades and the other not, and you certainly couldn’t have a blinded study. So we had to approach the question by other routes. 1. There was a temporal relationship: people smoked first and got lung cancer later. 2. There was a strong relationship – lots more smokers than nonsmokers got lung cancer. 3. There was a dose-response relationship – the more cigarettes smoked, the higher the rate of lung cancer. 4. The results of various kinds of studies were all consistent. 5. The mechanism was plausible: we know there are cancer-causing compounds in cigarette smoke. 6. Alternate explanations were considered and ruled out. 7. They did experiments where they exposed lab animals to cigarette smoke and the animals developed cancer. 8. Specificity: cigarettes produced specific types of lung cancers, not a mixture of various unrelated symptoms. 9. Coherence. The data from different kinds of epidemiologic and lab studies and from all sources of information held together in a coherent body of evidence.

Media reports of medical studies usually make findings sound more dramatic than they are by citing relative risk rather than absolute risk. Rather than percentages, we want to know how many actual people are harmed or helped.

Here’s an example. A study showed that the risk of breast cancer increases by 6% for every daily drink of alcohol. The relative risk increase is 6%. So would 2 drinks a day give 12% of women breast cancer? NO. 9% of women get breast cancer overall. In every 100 women, 9 will get breast cancer. If every woman had 2 extra drinks a day, 10 of them would get breast cancer. Only 1 woman in 100 would be harmed; the other 99 would not be harmed. The absolute risk increase is 1 in 100, not 12 in 100.
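Spelling out that arithmetic (using the slide’s figures, and treating two daily drinks as roughly a 12% relative increase, as the slide does):

```python
baseline_risk = 0.09        # 9 of every 100 women get breast cancer
relative_increase = 0.12    # about 6% per daily drink, so ~12% for two drinks

new_risk = baseline_risk * (1 + relative_increase)   # about 0.101
absolute_increase = new_risk - baseline_risk         # about 0.011

print(f"Risk goes from {baseline_risk:.0%} to {new_risk:.1%}")
print(f"Extra cases per 100 women: {absolute_increase * 100:.1f}")
```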

Another example. Eating bacon increases the risk of colorectal cancer by 21% (relative risk). 5 men in 100 get cancer in their lifetime. If each ate a couple of slices of bacon every day, 6 would get cancer – only one more man, for an absolute risk of 1 in 100.

A recent study showed that using a cell phone doubled the risk of acoustic neuroma (a tumor in the ear). The relative risk was reported as 200% and alarmed parents took their children’s phones away. But the baseline risk of acoustic neuroma is 1:100,000. 200% of 1 is 2. The absolute risk was 1 more tumor per 100,000 people. Acoustic neuroma is a treatable, non-malignant tumor. The lead researcher said she would rather accept the risk and know where her kids were. She let them keep their cell phones. She warned that the results were provisional, the study small, and that different results might be found with a larger study. She was vindicated when a later, larger study found no increased risk.

What’s wrong with this? A British Heart Foundation press release said “We know that regular exposure to second-hand smoke increases the chances of developing heart disease by around 25%. This means that for every four non-smokers who work in a smoky environment like a pub, one of them will suffer disability and premature death from a heart condition because of second-hand smoke.”

If 4 in 100 nonsmokers have heart disease, a 25% increase means 5 will have it. The risk is not 1 in 4, but 1 in 100.

As well as asking for absolute risk rather than relative risk, we can ask for NNT and NNH – the number needed to treat and the number needed to harm. When you use Tylenol for post-op pain, you have to give it to 3.6 patients for one to benefit. You have to treat 16 dog-bite patients with antibiotics for one to benefit: 15 out of 16 will take the antibiotics needlessly. The clot-buster drug tPA has to be given to 3.1 patients for one to benefit, and for every 30.1 patients, one will be harmed.
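NNT and NNH are just the reciprocal of the absolute risk difference between treated and untreated groups. A quick sketch, with hypothetical event rates chosen only to show the arithmetic:

```python
def nnt(control_event_rate, treated_event_rate):
    """Number needed to treat: 1 / absolute risk reduction."""
    arr = control_event_rate - treated_event_rate
    return 1 / arr

# Hypothetical example: a bad outcome that hits 20% of untreated patients
# and 15% of treated patients.
print(round(nnt(0.20, 0.15)))    # ARR = 5 percentage points -> NNT = 20

# NNH works the same way, with harms instead of benefits:
# e.g. a side effect in 0.5% of treated vs 0% of untreated patients.
print(round(1 / (0.005 - 0.0)))  # NNH = 200
```

Working backwards, the Tylenol figure above (NNT of 3.6) corresponds to an absolute benefit of about 1 in 3.6, or roughly 28 patients helped per 100 treated.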

Lipitor is one of the statin drugs used to lower cholesterol and prevent heart attacks and strokes. When it is used for secondary prevention (in patients who already have heart disease) somewhere between 16 and 23 patients must be treated for one to benefit. When it is used for primary prevention (in patients who are at risk but don’t yet have heart disease) the NNT rises to somewhere between 70 and 250, depending on age, other risk factors, etc. Of every 200 patients taking the drug, one is harmed.

One cynic put it this way "What if you put 250 people in a room and told them they would each pay $1,000 a year for a drug they would have to take every day, that many would get diarrhea and muscle pain, and that 249 would have no benefit? And that they could do just as well by exercising? How many would take that?" This is an exaggeration, but it illustrates that these drugs should be used selectively, based on individual factors like a total risk assessment and the patient’s personal preferences to take his chances vs. taking a drug for insurance.

We all tend to assume that a positive test means someone has a disease and a negative test means he doesn’t. But it’s not that simple. There’s no such thing as a definitive test. There are false positives, false negatives, and lab errors. The diagnostic value of a test depends on the pre-test probability that the patient has the disease. One lesson I learned over and over in my years of practice was never to believe one lab test. My mother was a case in point. On the basis of one sky-high blood glucose test, she was told she had diabetes. They wanted to start her on treatment, but I persuaded them to wait. We bought a home monitor and checked her regularly and never ever got a single abnormal reading. When the lab checked her again, she was normal. We still don’t know what happened – maybe her blood sample got switched with someone else’s.

If your mammogram is positive, how likely is it that you actually have breast cancer? They’ve done surveys asking this question, and most laypeople and even many doctors guess 90%. Actually it’s only 10%.

Mammograms are 90% accurate in spotting those who have cancer (this is called the sensitivity of the test). They are 93% accurate in spotting those who don’t have cancer (this is called the specificity of the test). 0.8% of women getting routine mammograms have cancer (this is the prevalence of the disease).

This means that of every 1000 women getting mammograms, 8 of them have cancer. Of those 8 women with cancer, 7 of them will have true positive results, and one will have a false negative result and be falsely reassured that she does not have cancer. 992 of the 1000 women do not have cancer. Of those, 70 will have false positive results and 922 will have true negative results. So in all, there will be 77 positive test results, and only 7 of those will actually have cancer – roughly 10%.
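Here is the same mammogram arithmetic written out, using the sensitivity, specificity, and prevalence figures from the previous slide:

```python
sensitivity = 0.90   # detects 90% of women who have cancer
specificity = 0.93   # correctly clears 93% of women who don't
prevalence = 0.008   # 0.8% of screened women actually have cancer

n = 1000
with_cancer = n * prevalence                  # 8 women
without_cancer = n - with_cancer              # 992 women

true_positives = with_cancer * sensitivity            # ~7
false_negatives = with_cancer - true_positives        # ~1
false_positives = without_cancer * (1 - specificity)  # ~70
true_negatives = without_cancer - false_positives     # ~922

ppv = true_positives / (true_positives + false_positives)
print(f"Positive results: {true_positives + false_positives:.0f}")
print(f"Chance a positive mammogram means cancer: {ppv:.0%}")   # roughly 10%
```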

It gets worse. How many lives are saved by mammography? If 1000 women are screened for 10 years starting at age 50, one life will be saved. 2-10 women will be overdiagnosed and treated needlessly. 10-15 women will be told that they have breast cancer earlier than they would otherwise have been told, but this will not affect their prognosis. 100-500 women will have at least one false alarm, and about half of them will undergo a biopsy they didn’t really need.

It gets worse when you screen for multiple diseases. The PLCO Trial tested for cancers of the prostate, lung, colon, and ovary. After only 4 tests, 37% of men and 26% of women had false positives. After 14 tests, it went up to 60% and 49%. After 14 tests, 28% of men and 22% of women had undergone unnecessary invasive tests to determine whether they really had cancer – this included biopsies, exploratory surgery and even hysterectomy.
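The piling up of false positives over repeated screening is the multiple-endpoint problem again. A rough sketch, with an assumed per-round false positive rate that is only illustrative and not taken from the PLCO data:

```python
# If each screening round has some chance of a false alarm in a healthy person,
# the chance of at least one false alarm grows quickly with repeated testing.
per_test_false_positive = 0.06   # assumed, illustrative rate per round

for n_tests in (4, 14):
    p_any = 1 - (1 - per_test_false_positive) ** n_tests
    print(f"After {n_tests:>2} tests: "
          f"P(at least one false positive) = {p_any:.0%}")
```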

We hear a lot about risks, but the media seldom put those risks into perspective for us. The swine flu had killed 263 people in the US as of July 10, 2009. Regular flu kills 36,000 each year. Smoking kills 440,000 each year. We should be far more afraid of cigarettes than of swine flu.

We are warned about the obesity epidemic and we are told the ideal body mass index is 19-25. For a BMI of 18 or lower, there are 34,000 excess deaths. For a BMI over 30, there are 112,000 excess deaths. But lo and behold, for BMIs only mildly elevated at 25-29, there are 86,000 FEWER deaths. Makes you wonder what “ideal weight” really means.

We all know how easy it is to lie with statistics. Here are some statistical benchmarks to keep in mind that will help you detect some of the lies. The US population is a little over 300 million, with 4 million births and 2.4 million deaths a year. Half of the deaths are from heart disease and cancer, and you can read the rest. The report that “more than 4 million women are battered to death by their husbands or boyfriends every year” can’t possibly be true because total homicides are only 17,000. 4 million exceeds the total 2.4 million mortality.

Another area of confusion is the difference between prevalence and incidence. Prevalence is how many people in a population currently have the disease; incidence is how many people are newly diagnosed each year. One of my colleagues on the blog was discussing autism rates with an anti-vaccine activist. He cited the incidence figures for Denmark and his opponent tried to compare them to the prevalence figures in the US and ended up with egg on his face.

It’s very important to remember that probabilities are not predictions. Children who consistently spend more than 4 hours a day watching TV are more likely to be overweight. But that doesn’t mean you can make a child fat by making him watch TV. Or that you can make a child thinner by turning the TV off.

When there are a lot of small studies that don’t show statistical significance, sometimes it is possible to combine data to make one big study that does show statistical significance. Usually, the studies are not similar enough to justify doing that. If the small studies are not well-designed, combining them just makes the Garbage In Garbage Out (GIGO) problem worse. Systematic reviews examine all the published data using predetermined criteria and trying to weigh the quality of evidence. Because of the characteristics of a systematic review, if the results are negative they are probably true, but if they are positive they may still have a GIGO problem. Sometimes the reviewers are biased – they may have homeopaths reviewing the literature on homeopathy.

A systematic review of homeopathy studies showed that overall it worked better than placebo, but it didn’t work better for any specific condition. Whaaat? That’s like saying broccoli is good for everyone but is not good for men, women or children.

One of my readers educated me about how this can happen, and I thought it was neat so I’ll share it with you. It’s called Simpson’s paradox. Looking across the first row, testing treatments for symptom A, 30 patients were given placebo; 24 improved for an 80% improvement. 50 patients were given a homeopathic remedy and 40 improved, for an 80% improvement. In the rows for symptom B and C the same thing happened: the improvement percentages for placebo and homeopathy were exactly equal. But look what happens in the bottom row. When you add up rows A, B, and C, it looks like you got a 35% improvement with the placebo and a 48% improvement with homeopathy. This is statistical shenanigans, NOT evidence that homeopathy works better than placebo.
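Here is a sketch of how that can happen. Only the symptom A row comes from the slide; the rows for symptoms B and C below are invented to reproduce the effect, so the pooled percentages won’t match the study’s exactly:

```python
# (treated, improved) counts per symptom.  Within every row the placebo and
# homeopathy improvement rates are identical; only the group sizes differ.
placebo = {"A": (30, 24), "B": (100, 10), "C": (70, 7)}    # rates: 80%, 10%, 10%
homeopathy = {"A": (50, 40), "B": (20, 2), "C": (10, 1)}   # rates: 80%, 10%, 10%

def pooled_rate(groups):
    total = sum(n for n, _ in groups.values())
    improved = sum(k for _, k in groups.values())
    return improved / total

for symptom in "ABC":
    p_n, p_k = placebo[symptom]
    h_n, h_k = homeopathy[symptom]
    print(f"Symptom {symptom}: placebo {p_k/p_n:.0%}, homeopathy {h_k/h_n:.0%}")

print(f"Pooled:    placebo {pooled_rate(placebo):.0%}, "
      f"homeopathy {pooled_rate(homeopathy):.0%}")
```

Every row shows identical improvement rates, yet the pooled totals make homeopathy look far better than placebo, simply because the group sizes differ from row to row.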

If a test is negative, how likely is it that the patient really doesn’t have the disease? If the test is positive, what is the likelihood that he really has the disease?

We can calculate likelihood ratios. Here’s a complex example, where the thickness of the lining of the uterus is measured on ultrasound to predict the likelihood of cancer of the uterus in women who have postmenopausal bleeding. If the thickness is less than 4 mm, the likelihood is 0.2%; if it is 21-25, the probability of cancer is 50%.

Some tests are better at ruling out disease and some are better at ruling it in. There is a blood test called the D-dimer test that is used to help rule out pulmonary embolus. If the doctor thinks there is a 10% probability that the patient has a PE, a positive D-dimer test raises the probability to 17%; a negative D-dimer test lowers the probability to 0.2%.
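Post-test probabilities like these come from multiplying the pre-test odds by the test’s likelihood ratio and converting back to a probability. A minimal sketch; the likelihood ratios below are round numbers assumed for illustration, not the published D-dimer figures:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert probability -> odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

pre_test = 0.10   # doctor's estimate: 10% chance of a pulmonary embolus

# Assumed, illustrative likelihood ratios for a positive and a negative test.
lr_positive = 1.8
lr_negative = 0.02

print(f"Positive test: {post_test_probability(pre_test, lr_positive):.1%}")
print(f"Negative test: {post_test_probability(pre_test, lr_negative):.1%}")
```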

You have a sore throat. You get a throat culture. If the culture is positive, you have strep throat. If it’s negative, you don’t have strep throat. Right? WRONG! You might be an asymptomatic strep carrier, and your symptoms might be due to a virus. There could be a lab error. False positive and false negative test results are possible.

Rather than depending on a throat culture, we’ve developed some clinical decision rules. We can calculate a strep score based on points for age, exudate on tonsils, swollen lymph nodes, fever, and absence of cough.

If the score is 0-1 point, the probability of strep is essentially zero and no culture or treatment is necessary. If the score is 2-3 points, the probability of strep rises to 17-35%, and we get a throat culture and give antibiotics if the culture is positive. If the score is 4-5 points, the probability is over 50% and we can just give antibiotics and not bother doing a culture.
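Here is a sketch of how a clinical decision rule like this can be coded. The point values and age bands follow the common McIsaac-style scoring, but treat them as an illustrative assumption rather than the exact rule on the slide:

```python
def strep_score(age, exudate, swollen_nodes, fever, cough):
    """McIsaac-style sore-throat score (illustrative point values)."""
    score = 0
    score += 1 if exudate else 0          # exudate on the tonsils
    score += 1 if swollen_nodes else 0    # tender/swollen anterior neck nodes
    score += 1 if fever else 0            # temperature over 38 C
    score += 1 if not cough else 0        # absence of cough
    if age < 15:
        score += 1
    elif age >= 45:
        score -= 1
    return score

def recommendation(score):
    if score <= 1:
        return "no culture, no antibiotics"
    if score <= 3:
        return "throat culture; antibiotics only if positive"
    return "antibiotics; culture optional"

s = strep_score(age=8, exudate=True, swollen_nodes=True, fever=True, cough=False)
print(s, "->", recommendation(s))   # 5 -> antibiotics; culture optional
```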

Medicine is not an art or a science, but an applied science. It can be tricky to apply the evidence from published studies to the individual patient in the doctor’s office. Often there are no pertinent studies to go by. We can’t just use blind guesswork or intuition. We have to make the best possible judgment based on the best available evidence.

Medical science is flawed, but it’s better than the alternatives: testimonials, personal experience, belief-based treatments, or hypothetical, untested treatments. As the cartoon says, “I need an antidote, not an anecdote.”

If you don’t have a reliable way to evaluate evidence, you can end up being fooled by quackery like the blue dot cure. A quack can do anything useless and ridiculous, like painting a blue dot on the patient’s nose, and can “show” that it works. There are only 3 things that can happen: the patient can get better, worse, or stay the same. If he gets better, the quack claims that the blue dot worked. If he gets worse, the quack laments “If only you’d come to me sooner, the blue dot would have had time to work.” If the patients stays the same, the quack says “The blue dot kept you from getting worse; we need to continue the treatment.”

People are easily fooled into believing that quack treatments work.

Real science and pseudoscience may be hard for the layman to tell apart, but it’s important to know the difference, because…

Quackery takes your money.

Pseudoscientific explanations may seem to make superficial sense if you don’t know anything about science.

In the scientific method, the scientist says “Here are the facts. What conclusions can we draw from them?” In the pseudoscientific method (here represented by the creationist method) the pseudoscientist says “Here’s the conclusion. What facts can we find to support it?” The scientist asks IF something works; the pseudoscientist tries to SHOW that it DOES work.

As Bill Nye, the Science Guy, says: Science rules!

Here’s a brief overview of how to read a study. The first thing is the abstract, a summary of what the study showed. It is only as good as the person who wrote it, and it may be inaccurate or it may draw conclusions not warranted by the data. The introduction explains what we already knew about the subject and why they decided to do this particular study to learn more. It usually cites previous studies, and if the authors are biased, it may cherry-pick the literature and mention only studies that support their bias. Then there is a Methods section that should describe in detail exactly what they did, so a reader could replicate their experiment in his own lab. Then they give their Results. They should give us the raw data so we can do our own analysis. This is where they apply statistical tests and tell us whether their results are statistically significant and at what level of significance. The Discussion section tells us what they think their data shows. If they are good scientists, they will point out the limitations of their study and try to anticipate the objections that others might raise. Then with the Conclusion, they will tell us what they think the implications of their study are. A good scientist will usually say something like “If other studies confirm these findings, this treatment may turn out to be clinically useful.” A bad scientist might say something like “We have proved this treatment works and everyone should start using it right away.” The references will be listed, and it’s worth checking them. Sometimes you can tell from the title alone that the reference has little to do with the claim it is meant to support. For instance, the statement “homeopathy cures rheumatoid arthritis” might be referenced by a homeopathy handbook or by a study entitled “The prevalence of a positive RA factor in a population with advanced rheumatoid arthritis.” I have even found references that attested to the exact opposite of the claim made in the article. Then, at the very end, there should be a mention of the source of funding and a disclosure of any conflicts of interest on the part of the authors.

I’m going to go over a couple of examples of how I go about evaluating a study, first from an abstract alone, and then from an entire study. Let’s say you read in the newspaper about a new study that shows that taking a multivitamin will reduce your risk of a heart attack. How can you know if the newspaper report has represented the study fairly? Most studies are listed in PubMed at www.pubmed.gov with their abstracts. You can usually get enough clues from the news report to use the search function and find the abstract. You will see something like this:

Don’t worry about the details. I’m just going to go through and mention some of the kinds of things I look for. The red highlighted phrases caught my attention. This was a study in Sweden. They took 1296 people who had had a heart attack (myocardial infarction or MI) and compared them to 1685 controls picked from the general population and matched to the patients by sex, age, and catchment area. They asked them to self-report their use of diet supplements, and they found that patients who had not had a heart attack were significantly more likely to have used diet supplements. This was a case-control study, which is less reliable than a randomized controlled study. It depended on recall rather than measuring in any way what the patients had actually taken. It did not try to assess whether any of the patients had been vitamin deficient. It only studied patients between the ages of 45 and 70, so any conclusions drawn from it might not apply to people older and younger than that. It was done in Sweden; they themselves point out that the consumption of fruits and vegetables is relatively low in that country, so it might not have implications for other countries with different diets. Of those who said they used supplements, 80% said they used multivitamins. What about the other 20%? Could they have been taking some other supplement that had a larger effect than vitamins? They checked for some possible confounding factors, and they found that “never smoking” outweighed the effect of vitamins in women. The abstract concludes “Findings from this study indicate that use of low dose multivitamin supplements may aid in the primary prevention of MI.” I don’t think the data support that conclusion at all. At any rate, the headline is clearly wrong. This study does NOT show that you should start taking a multivitamin to reduce your risk of a heart attack. More importantly, when you search the medical literature you don’t find any other studies showing that multivitamins prevent heart attacks; in fact, there are several studies showing that supplemental vitamins either have no effect or make things worse.

Now let’s go over an entire study. I picked this one: Dominican Children with HIV not Receiving Antiretrovirals: Massage Therapy Influences their Behavior and Development. It was published in an online journal, “Evidence-Based Complementary and Alternative Medicine” (eCAM). The very title of the journal suggests bias, since the definition of CAM is that it is not supported by the kind of evidence that would lead to its adoption by mainstream medicine.

In short, they took children who were HIV positive and divided them into two groups, comparing massage therapy to play therapy. Play therapy was the placebo control they chose: you can judge for yourselves whether that is an adequate placebo.

The abstract explains that they studied 48 Dominican children between the ages of 2 and 8. They all had untreated HIV/AIDS. They were randomized to receive either massage or play therapy for 12 weeks. The massage group improved in self-help abilities and communication. Children over the age of 6 showed a decrease in depressive/anxious behaviors and negative thoughts.

The first thing I noticed was a number of careless errors. Caribbean was misspelled. Their phrase “for enhancing varying behavioral and developmental domains” was so vague as to be essentially meaningless. They said “A second objective of our work was to determine the absence of antiretroviral treatment on the impact of HIV infected Dominican children’s mood and behavior.” The abstract said the sessions lasted 20 minutes, while the text said 30 minutes. By themselves, errors like these may not mean much, but they make me wonder if the researchers were as careless about their experimental methods as they were about writing up the results. It also makes me wonder what the peer reviewers and editors were doing – they certainly weren’t doing their job, or they would have caught and corrected errors like these.

In the Introduction section they offered a brief review of the literature. They summarized a few cherry-picked studies that didn’t really support their claims, they omitted other studies that contradicted their claims, and they didn’t address plausibility. The rationale for doing this study was unclear. They talked about massage enhancing immune function, but this study did not even try to measure immune function. It only looked for psychological and developmental effects.

In the Methods section, they described 30 minute sessions with either a standardized massage protocol or a play therapy protocol. The play therapy consisted of giving the child a choice of coloring/drawing, playing with blocks, playing cards, or reading children’s books. The parents were present throughout, and the parents’ reports were used to judge improvement.

They used a couple of scales to measure improvement. This one, the CBCL, measured these 8 items, and another scale measured several others. So we have a multiple end point problem and there is no indication that they tried to correct for this.

The Results section reported that for children under 5 there were no significant differences between the massage and play groups. For children over 6, they reported “significant” improvement in the massage group for anxious/depressed behavior, negative thoughts, and overall internalizing scores. Look at the p values. The value for negative thoughts is p=0.059, higher than the usual cut-off of 0.05. Most scientists would not report this as “significant.” And note that among all the many things they measured, only two showed true statistical significance.

They seemed to be confused about the meaning of significance. They reported that 100% of the children in the play group showed an increase in their score on rule breaking behaviors, significant at p<0.05. They commented that “this significant change was not clinically meaningful.” How could they determine that? And if that wasn’t clinically meaningful, how did they determine that their other findings WERE clinically meaningful? They characterized a change in IQ data as “marginally significant for the massage group at p=0.07.” That’s like being a “little bit pregnant.” The cut-off for significance is p=0.05, and most researchers would simply call anything over the cut-off not significant.

On the Developmental Profile scales, they found that the massage group improved in self-help and communication. The play group improved in social development, while the massage group showed decreased social development. They found no significant differences in the physical or academic scores. There is a bar graph printed in the article that appears to show that the play group did better with a gain of 9 points compared to 8 for the massage group.

In the Discussion section they said that “massage therapy was effective in reducing maladaptive internalizing behaviors in children aged 6 and over” and “children 2-8 years of age who received massage demonstrated enhanced self-help and communication skills.” They found it “interesting” that children in the massage group remained at the same social developmental level, suggesting that it was because those children had little or no play activity at home. (Did they? We don’t know.)

They were “puzzled” by their failure to find any effect on behaviors in the under-5 age group because they were so sure massage therapy improves children’s moods and anxiety levels. They tried to rationalize what might have gone wrong. They commented that “anecdotally, the nurses who conducted the massages reported changes in the children over time, including better mood.” This kind of anecdotal report is meaningless and has no place in an objective scientific study. It is a blatant attempt to make massage look better than what the data showed.

They recommended massage therapy as a cost-effective option to improve symptoms and functioning in children with untreated HIV. Sorry, but the study doesn’t even begin to support such a recommendation.

Why was this study really done? We already know that children need human interaction, play, touch, and TLC. Why massage? Were they really interested in improving the lot of these children, or were they just trying to create data to support their chosen profession of massage? Does this study justify using massage as a “band-aid” on children who are denied life-saving anti-AIDS drugs?

I could have taken the data they collected and used it to support a very different conclusion: We studied a bunch of outcomes and found that massage is ineffective for all but a couple of them (and those are probably not clinically meaningful). One outcome was worse with massage (decreased social development). We did not show that massage is any better than TLC. Money would be better spent saving lives with effective drugs.

In the syllabus we have reprinted one top-of-the-line study and one worthless study. In your group discussion sessions, I’d like you to compare the two. Some of it is technical: don’t worry about the parts you don’t understand. Look for reasons one is better than the other.