WISE DECISIONS, INC.
11466 Laurelcrest Road
Studio City, CA 91604
818-985-4094
wedwards@mizar.usc.edu

August 8, 1995

How to Estimate Thousands of Reliable, Valid Probabilities

For Use in Normative Systems[1]

Ward Edwards

Introduction. Ever since Savage (1954) brought to our attention the idea that probabilities are orderly opinions, and therefore should be assessed judgmentally, system designers have been interested in using such judgments for practical purposes. In 1962, Edwards proposed the idea of PIP systems; the acronym stands for Probabilistic Information Processing. Such systems are explicitly built around human judgments of probability--or perhaps only of ratios of probabilities. In 1968, Edwards, Phillips, Hays, and Goodman presented experimental evidence that systems built on likelihood ratio judgments can work well.

But the version of Bayes's Theorem used in the late 1960's and early 1970's, mainly for lack of better options then available, was what some now deride as "idiot Bayes," in which all inputs are assumed conditionally independent of one another given the hypotheses of interest. A likelihood ratio (or, better, its logarithm) measures the diagnostic impact of a conditionally independent datum or of a group of data, collectively conditionally independent, treated as a single datum. It is not adequate to express diagnostic impact in general; such impacts must be expressed as probabilities, not as ratios of probabilities. (See Edwards, Schum, and Winkler, 1990, for a full discussion, with Schum's ghost as one of the discussants.) The PIP researchers well understood the importance of conditionally dependent data, but the graph-theoretic tools we now use to represent such complex structures were not then available. The only representation of the inevitable complexities of non-independent data structures known to Edwards, Schum, and their colleagues at that time was the symbolic representation of probability theory. All except Schum and his colleague Anne Martin rejected that representation as too hard to work with. This limitation combined with the end of the DARPA program that had funded the work to cause most people to forget about the PIP idea.
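The additive character of this "idiot Bayes" machinery is easy to sketch: under conditional independence, the posterior odds are the prior odds times the product of the data's likelihood ratios, so log likelihood ratios simply add. The following is a minimal illustration with hypothetical numbers, not a reconstruction of the PIP systems themselves:

```python
import math

def posterior_odds(prior_odds, likelihood_ratios):
    """'Idiot Bayes': when the data are conditionally independent
    given the hypotheses, the log posterior odds equal the log prior
    odds plus the sum of the data's log likelihood ratios."""
    log_odds = math.log(prior_odds) + sum(math.log(lr) for lr in likelihood_ratios)
    return math.exp(log_odds)

# Hypothetical example: prior odds of 1:4 against H1, then three
# data whose likelihood ratios P(d|H1)/P(d|H2) were judged 2, 3, and 0.5.
odds = posterior_odds(0.25, [2.0, 3.0, 0.5])
prob = odds / (1 + odds)  # convert posterior odds to a probability
print(round(odds, 3), round(prob, 3))  # 0.75 0.429
```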

Fortunately, a quite different group, located mainly at Stanford and trained in both decision analysis and artificial intelligence, was not deterred by the failure of PIP, mainly because they had never heard about it. So they resurrected the basic idea, bringing to it the marvelous structural representation tools of graph theory. The result was Bayes Nets (BNs) and Influence diagrams (IDs). (Matzkevich and Abramson, 1995, is an excellent technical review.) These tools have two special merits:

  1. They explicitly separate the representation of the intellectual structure of an inference problem from the assessment of its numerical parameters.
  2. They permit adept users to incorporate into a model whatever structural complexities the real problem being modeled may have; they do not force people into working with idiot Bayes.
My personal opinion is that decision analysis will never be the same. These are the tools we have needed all my adult life!

By now we know that IDs and BNs, wonderful as they are, have problems. In particular, models of realistic size and complexity may require thousands of probabilities. The idea of accumulating relative frequencies in order to estimate this large number of probabilities is ridiculous on its face for most possible applications. So, desirable as BN and ID representations of inference and decision problems are, they are also tightly linked with the idea that their numerical inputs must be provided by human judgments--simply because they cannot be had in any other way.
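The arithmetic behind "thousands of probabilities" is worth a glance: each node in a BN needs one probability distribution over its states for every combination of its parents' states, so conditional probability tables grow multiplicatively. A minimal counting sketch (the second example's sizes are hypothetical):

```python
from math import prod

def cpt_entries(node_states, parent_state_counts):
    """Size of a node's conditional probability table: one
    distribution over the node's states for every combination
    of its parents' states."""
    return node_states * prod(parent_state_counts)

# A 6-state node with one 11-state parent (the shape of the
# wind-field tables later in this paper) takes 66 judgments:
print(cpt_entries(6, [11]))       # 66
# Tables grow multiplicatively with parents: a 5-state node with
# three 4-state parents already needs 320 numbers.
print(cpt_entries(5, [4, 4, 4]))  # 320
```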

Conceptually this fact poses no problem. We all know that probabilities are orderly opinions--i.e., judgments. So, in order to get such big sets of probabilities, all that is required is a judge able and willing to give appropriate numbers.

Do such judges exist? This paper argues that the answer is yes. To defend that answer, I must deal with a large body of research on a class of phenomena known as the cognitive illusions (for good critical reviews, see von Winterfeldt and Edwards, 1986; Koehler, in press; Lopes, in press). That literature, associated with the names of Kahneman, Tversky, and their collaborators, provides evidence of a variety of non-Bayesian forms of human behavior. Although many details are now beginning to be challenged and revised (e.g., by Koehler, in press; Lopes, in press; Gigerenzer, in press), I have little doubt that at least some of the kinds of non-Bayesian behavior found by the researchers on the cognitive illusions are real, reliable, and reproducible. When I put the question of whether such phenomena are real to a distinguished group of decision scientists assembled to confer about utility theory, I got unanimous agreement (see Edwards, 1992, p. 254).

I think that, while some of the phenomena reported in the cognitive illusions literature are real, reliable, and reproducible, they are not relevant. That is, they do not apply to the situation in which a designer needs thousands of probabilities for a BN or ID-based system. Why not? I shall discuss three reasons: domain expertise, probability judgment expertise, and consistency checks.

Domain expertise. Designers of normative systems aren't much interested in the abstract question of whether probabilities are appropriately usable to represent Everyone's experiences of uncertainty. My own answer is that they are, if properly interpreted. But the crux of this first point is that we aren't interested in Everyone's uncertainties because our respondents are not Everyone. Instead, we want to work with domain experts, people who know all there is to know about the subject matter of the probabilities being judged. This purpose deprives the cognitive illusions literature, mostly concerned with the behavior of college undergraduates, of any claim to obvious face validity for this problem.

Some (not all) of the cognitive illusions work uses a research tool intentionally designed to avoid any possibility of expertise: almanac questions. Example: what is the probability that the population of Addis Ababa, Ethiopia, is greater than 1 million? Anyone with access to appropriate sources of obscure data can design batteries of unrelated questions like that one, guaranteed to produce the experience of uncertainty in anyone, no matter what his or her expertise. While such tools may be useful for studying human response to the experience of uncertainty, data obtained using them are irrelevant to our problem.

This argument is correct and important, but it isn't enough. Many experimental psychologists have had the sad experience of making the we-want-realism argument, finally getting access to real contexts, and then finding that people in those contexts behave surprisingly much like college student subjects in laboratory experiments. I do not know whether that would be the result of experiments on probability assessment by experts. Nor do I know how to find out, since I could easily imagine the answer being "It depends on the domain." Few of us would have much enthusiasm for a research program that required access to a substantial number, say 50, of domain experts in each of 20 different domains. Getting access to one expert in one domain is hard enough! In any case, such research would be beside the point. Domain expertise is necessary, but not at all sufficient.

Expertise in judging probabilities. A domain expert may have the necessary knowledge. But he or she is extremely unlikely to have the experience and practice needed to translate that knowledge into probability judgments. Research in the psychophysics of human judgment makes it quite clear that most judgmental skills are response mode specific. You must learn to judge probabilities, just as you must learn to evaluate a pig, throw a baseball, or remember a nonsense syllable.

Our own literature has not paid much attention to the requirement for response mode expertise as well as domain expertise. Hoping to find a domain expert who is also a trained and experienced probability judge is absurd. The only available option is to train domain experts in probability assessment before eliciting their judgments. How? The experience of the National Weather Service (Murphy and Winkler, 1977) suggests an answer: induce the domain experts to make daily probability estimates as a part of their jobs, provide scoring rule feedback (see von Winterfeldt and Edwards, 1986, pp. 122-131), and make that feedback important by making job rewards contingent on aggregated scores. Given this kind of treatment, Murphy and Winkler's data would lead us to expect spectacular performance. But few if any of us are in a position to create such circumstances.
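Scoring-rule feedback of the kind Murphy and Winkler describe can be illustrated with the Brier score, a strictly proper scoring rule for probability forecasts. This is a sketch, not the Weather Service's actual procedure, and the forecasts shown are hypothetical:

```python
def brier_score(forecasts, outcomes):
    """Mean Brier score: the average squared gap between the forecast
    probability and the 0/1 outcome. It is strictly proper, so a
    forecaster minimizes the expected penalty only by reporting
    honest probabilities. Lower is better."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical week of rain forecasts and observed outcomes.
forecasts = [0.9, 0.2, 0.7, 0.1, 0.4]
outcomes = [1, 0, 1, 0, 1]
print(round(brier_score(forecasts, outcomes), 3))  # 0.102
```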

In the National Weather Service context, scoring rule feedback is possible for at least some probability judgments. You can evaluate a probability of rain tomorrow by waiting till tomorrow and seeing whether or not it rains. Unfortunately, no such possibility exists for most of the probabilities one needs in a BN--even a meteorological BN. The events whose probabilities are being assessed are often not observable, or not easily observable. And even if they were observable, the conditioning events required for the BN may complicate the task of exploiting them by requiring far too many observations.

I have no panacea for such problems. But I can tell you what my colleagues in the HAILFINDER Project and I did about them. HAILFINDER is a system designed to predict severe weather in the plains of Eastern Colorado in the summertime. (For a fairly full description, see Abramson, Brown, Edwards, Murphy, and Winkler, in press.) Our domain expert was a distinguished, highly knowledgeable, highly competent meteorologist. His understanding of probability was what you would expect of a deeply thoughtful physical scientist with a Ph.D. in Meteorology. That is, he knew the formal ideas very well, but had not thought much about how to obtain or use judgment-based probabilities. He had the normal initial insecurities about whether or not he could estimate the probabilities that he knew the system would require. Fortunately, we anticipated this early, and so insisted that his own ability to judge the probabilities required should be one of the criteria he used in evaluating the emerging structure for the BN. The final structure required, and he eventually provided, 3700 probability estimates.

The project was organizationally complicated; the domain expert lived in Boulder, Colorado, but the two main elicitors, Abramson and myself, lived in Los Angeles. We managed to make a two-day trip to Boulder to work with the expert about once a month. This procedure delayed the work, but contributed to its orderliness and planning. It took about 4 months to obtain the BN in well-specified form, without any probabilities.

By the time serious probability assessments were required, the domain expert had a great deal of commitment to the project--and considerable trepidation still about his ability to make them. I told him that he needed to become expert at making probability judgments in his own domain. He asked how one could know whether one was doing a good job or not; I explained about proper scoring rules and their uselessness in this context. Then I suggested that the one feature common to all successful learning experiences was practice--diligent, conscientious, and careful practice. He recognized that there were indeed skills to acquire, and that practice was needed; in particular, he needed to develop and internalize ground rules in order to be able to make consistent judgments. So, with me present and responding to any questions he had, he practiced. The judgments he made for practice were those required for the HAILFINDER BN, but he knew that we would not use them as final numbers.

As he practiced, he found himself inventing numerical response categories, revising them, catching himself in inconsistencies both of categorization and in the numbers themselves, and generally doing the things that an active, intelligent mind does when it must figure out how to do an unfamiliar number-judging task. I watched carefully, mainly over his shoulder. When I saw a judgment that, on the basis of my now-extensive acquaintance with how he thought, looked to me too high or too low, I occasionally asked him to compare it with another number that he had estimated a while back. He quickly learned from such comparisons that a necessary condition for this task is a strongly held and consistent set of standards about what various numbers mean--what experience of predictability or its opposite each possible judgment corresponds to. I did not specify what the numerical properties of his judgments should be, other than that they should be numbers between 0 and 1, and that they should sum to 1 over the relevant partition. He started thinking of a 10-point category scale, but soon realized that he needed finer discrimination than that to be able to express what he knew about the phenomena he was considering. He ended up using what looked to me a bit like a 17-point category scale between roughly .1 and .9. (That is, he preferred two-digit estimates ending in 0 or 5. By no means did he confine himself to such numbers, however.) I had emphasized the logarithmic nature of evidence and the log odds-log likelihood ratio version of Bayes's Theorem, so he used appropriate logarithmic scales for the extremes. We worked hard on getting the transition from extreme scales to middle-of-the-continuum scales right, so that no discontinuity would exist.
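The "logarithmic nature of evidence" mentioned above is easy to see numerically: in the log odds form of Bayes's Theorem, a given amount of evidence moves log odds, not probability, by a fixed step, which is why a response scale needs finer gradations near 0 and 1. A small illustration:

```python
import math

def log_odds(p):
    """Log odds (logit) of a probability. In the log odds form of
    Bayes's Theorem, a fixed amount of evidence shifts log odds by
    a fixed step, regardless of the starting point."""
    return math.log(p / (1 - p))

# Equal probability steps carry very unequal evidential weight:
# moving from .95 to .99 takes far more evidence than from .50 to .60.
for p in (0.50, 0.60, 0.90, 0.95, 0.99):
    print(f"p = {p:.2f}   log odds = {log_odds(p):+.2f}")
```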

After a full 2 days of such practice, I felt that he had reached a stable enough approach to the task that at least the initial learning phase was over. (And in any case that particular visit to Boulder, Colorado, was over for Bruce Abramson, Allan Murphy, and me.) However, it took four more intensive days of handholding before he was confident enough of his own judgments so that he was willing to make them in my absence.

Consistency checks. Since I could not provide him with a definition of a correct judgment, I tried to provide him with as many prompts for reconsidering his judgments as I could. The first of these was the sum check. For each mutually exclusive and exhaustive set of events, the probabilities were required to sum to exactly 1--no automatic normalization allowed. Moreover, to the limited extent that I could, I tried to encourage him to treat each event in a partition as a topic of thought independent of the other events in that partition. To the extent that he was able to do so, his probabilities frequently did not sum to 1. Whenever that occurred, I urged him to think carefully about each element of the partition. Was its probability too high or too low? I discouraged him from changing only one number when adjusting to make the numbers sum to 1.
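A sum check of this kind is trivial to mechanize, and the point is precisely that it should flag rather than fix. A sketch with hypothetical judged numbers (the event labels are borrowed from Table 1 only for flavor):

```python
def sum_check(partition, tol=1e-9):
    """Return (passes, total) for a judged probability distribution.
    Deliberately no automatic renormalization: a failed check is a
    prompt to rethink each element, not a rounding step."""
    total = sum(partition.values())
    return abs(total - 1.0) <= tol, total

# Hypothetical judged distribution over a four-event partition.
judged = {"Westerly": 0.45, "E-NE": 0.20, "SE Quad": 0.15, "Other": 0.15}
ok, total = sum_check(judged)
print(ok, round(total, 2))  # False 0.95 -- every element needs a second look
```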

A far more important consistency check arose out of the fact that most of the probabilities being assessed were conditional. Suppose a six-element partition is being assessed conditional on each of the eleven states of an antecedent variable. Suppose the partition is concerned with Wind Fields in the Plains and the antecedent events have to do with what kind of day it is. The assessor will have assigned a probability to the wind being from the E-NE given that today is a Denver Convergence Vorticity Zone day, and also given that today is an Indonesian Monsoon day. Which is more likely?

This is a judgment that the assessor has not made before, other than implicitly. Yet it is easy for him to make. (In the example, E-NE winds have p = .30 on an Indonesian Monsoon day, but only p = .05 on a Denver Convergence Vorticity Zone day.) We are very familiar with thinking that the likelihood of a subsequent event depends on antecedent events. No formal rules link such numbers. But it is easy to consider two probabilities for the same event, conditional on different antecedents, and ask if they are in the appropriate rank order. (Example: Is a major stock market crash more likely during the next President's term in office if that person is a Democrat or a Republican?)

Such comparisons offer a new standpoint for evaluating one's own judgments. As kibitzer, I knew enough about how the expert thought so that I could pick out a few early examples for him to look at. He soon found that most of the inequalities were in the right direction. But some were not. I diabolically insisted that each such problem be identified and fixed, but gave no advice about how to do the fixing. As the unfortunate expert immediately realized, one cannot change the probability of only one element of a partition! The fact that the numbers sum to 1 now, and must do so also after the changes, requires changing of at least 2. More often than not, fixing one inequality reverses another, which in turn must also be fixed. I insisted that all such possibilities be checked and, if need be, resolved. This process was initially extremely frustrating. But when the expert learned that any incorrect inequality should lead to rethinking of its topic, rather than to an after-the-fact effort to adjust numbers, he came to value it as the most useful tool available to him for evaluating his own judgments. And, over time, the incidence of incorrect inequalities went way down, indicating that the habit of careful thought to avoid creating them in the first place was setting in.
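The inequality check itself can be written down in a few lines: given the judged conditional table and the expert's stated qualitative orderings, list the orderings the numbers contradict. The sketch below uses the two probabilities quoted in the example above; the function and data structure are illustrative, not HAILFINDER's:

```python
def ordering_violations(cpt, claims):
    """cpt[antecedent][event] -> judged probability.
    claims: (event, more_likely_given, less_likely_given) triples
    recording the expert's qualitative orderings. Returns the claims
    the numbers contradict. A violation calls for rethinking its
    topic, not mechanical renumbering: each conditional distribution
    must still sum to 1, so changing one entry forces other changes."""
    return [(event, hi, lo) for event, hi, lo in claims
            if cpt[hi][event] <= cpt[lo][event]]

# The two probabilities quoted in the example above: E-NE winds are
# judged more likely on an Indonesian Monsoon day than on a DCVZ day.
cpt = {"Indonesian Monsoon": {"E-NE": 0.30},
       "DCVZ day": {"E-NE": 0.05}}
claims = [("E-NE", "Indonesian Monsoon", "DCVZ day")]
print(ordering_violations(cpt, claims))  # [] -- the ordering holds
```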

Test-retest reliability. The word most often used to express what property a number-generating process should have is "validity." Very roughly, this means that the number should be appropriate for its intended purpose. Psychometrics offers many ways of evaluating the validities of test scores. In this context, too, our main hope is that the probabilities we elicit will be valid, in the sense that they will enable the BN to do what it is supposed to do.

An enlightening way to think of the validity of an empirical measure is to ask how well that measure correlates with an imaginary perfect measure. An obvious point is that it cannot correlate with that perfect measure any better than it correlates with itself. That is, reliability sets an upper bound for validity. Moreover, measuring validity is often difficult-to-impossible; measuring, say, test-retest reliability is much easier.
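In classical test theory this bound has a standard form, stated here for reference (X is the observed measure, X' a parallel retest, T the true score, and C any external criterion):

```latex
% Classical test theory: reliability bounds validity.
r_{XT} = \sqrt{r_{XX'}}
\qquad \Longrightarrow \qquad
r_{XC} \le \sqrt{r_{XX'}}
```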

The linkages among the ideas in this BN, rather thoroughly incorporated into the probability judgments via consistency checks, might lead to very good test-retest reliability. The time cost of a thorough study of that question would have been prohibitive, even if I could have obtained the expert's agreement to be checked up on. But a small and very informal test was quite possible. Early in the probability elicitation process, I re-elicited some already-assessed probabilities. Elicitation and re-elicitation were two months apart. At the time of the original elicitation, the expert was not aware that I would want to re-elicit. Prior to the second elicitation, I explained what I was doing and obtained his permission to do it.

                    DCVZSW  DCVZ   LA     CYS    R.M w/SW  Indo M  DMS    PFU    FH     RA     Other

Wind Fields in the Mountains
Westerly            30/30   25/20  85/85  90/70  55/40     5/5     65/70  20/05  40/50  25/15  40/60
L/V or other        70/70   75/80  15/15  10/30  45/60     95/95   35/30  80/95  60/50  75/85  60/40

Wind Fields in the Plains
L/V                 3/5     3/5    5/10   25/20  55/40     35/30   30/35  10/07  10/20  65/65  5/10
Denver Cyclone      80/77   80/75  0/0    3/10   5/17      20/20   15/05  05/05  35/20  05/10  15/05
Longmont Anticycl.  0/0     0/0    90/80  30/15  15/13     05/05   25/10  10/02  05/10  20/10  25/20
E-NE                05/03   05/05  0/0    02/10  10/10     30/25   05/05  75/78  30/25  0/0    15/20
SE Quad             12/15   12/15  0/0    20/25  10/15     10/20   05/05  0/08   15/25  0/10   10/15
Wide Downslope      0/0     0/0    05/10  20/20  05/05     0/0     20/40  0/0    05/0   10/05  30/30

Table 1. A few probabilities elicited twice, two months apart. Entries are probability x 100, first elicitation/second elicitation; columns are the eleven Scenario (kind-of-day) states.

Table 1 shows the result. While these data do not easily lend themselves to statistical study, they certainly hint at an unusually high test-retest reliability. This topic should be studied much more carefully, in an experiment designed for the purpose rather than through anecdotal observation. But this preliminary fragment of information is certainly encouraging.
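As an informal numerical companion to that eyeball judgment (not the careful study just called for), one can correlate the first and second elicitations. The sketch below does so for two rows of Table 1, Westerly (Mountains) and E-NE (Plains), reading each entry as probability x 100; a value close to 1 is consistent with the impression of unusually high test-retest reliability.

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# First and second elicitations (probability x 100) for the
# Westerly (Mountains) and E-NE (Plains) rows of Table 1.
first = [30, 25, 85, 90, 55, 5, 65, 20, 40, 25, 40,
         5, 5, 0, 2, 10, 30, 5, 75, 30, 0, 15]
second = [30, 20, 85, 70, 40, 5, 70, 5, 50, 15, 60,
          3, 5, 0, 10, 10, 25, 5, 78, 25, 0, 20]
print(round(pearson_r(first, second), 2))
```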

Conclusion. I have made two plausibility arguments, one having to do with respondent domain expertise and the other having to do with experience and training in making judgments of probability, about why the research literature on the cognitive illusions is not relevant to the task of providing probability assessments for use in BNs. I have discussed how I taught probability assessment expertise to one already very sophisticated meteorologist. And I have suggested an approach to consistency checking for probability estimates that, so far as I know, has not been discussed previously in the literature except in other publications about the same test-retest reliability data. The title of this talk aggressively asserts that this paper is going to provide a method for obtaining high quality probability assessments, such as are required for BN-based systems.

Have I delivered on the promise implicit in my title? In one sense, clearly yes; I have proposed a set of procedures, and argued for their merits. I have hinted at the need for reliability studies, and implicitly sketched how they should be done.

But the question remains: are such numbers valid enough to be a secure basis for normative system design?

System performance is a confounded criterion. The obvious answer is: build the system, use it, and see how well it does its job. This program has major difficulties, since BNs have probabilities as their outputs, and it takes a lot of data to validate one probability. But, in addition, the system performance criterion is severely confounded. BNs and IDs contain structural information as well as probabilities; the combination of the two, not either alone, controls how well they perform.

This problem is familiar from the psychometric literature on validation. What is an IQ? After more than 100 years of debate, many researchers define intelligence as "what the intelligence tests test," which attains validity at the cost of emptiness.

We may not know what intelligence is, except that it is useful in predicting academic and job performance and is not linked to specific knowledge and skills. Yet we don't hesitate to use intelligence tests when we need them.

The point is this: while it would be pleasant to know whether judged probabilities are "correct," it is not essential. Nor is it essential to know that a BN or ID is the best possible one for its purpose. What is essential is to know that such systems outperform competitors that have other intellectual bases. Research comparing the performance of normative systems with that of rule-based expert systems or with direct human expert judgments or decisions has been reported in at least half-a-dozen contexts, mostly not yet published. The differences are large, and so far they invariably have favored the normative systems.

References

Abramson, B., Brown, J., Edwards, W., Murphy, A. H., and Winkler, R. L. HAILFINDER: A Bayesian system for forecasting extreme weather. Journal name, in press.

Edwards, W. Dynamic decision making and probabilistic information processing. Human Factors, 1962, 4, 59-73.

Edwards, W. Toward the demise of economic man and woman: Bottom lines from Santa Cruz. In Edwards, W. (Ed.) Utility theories: Measurements and applications. Norwell, MA: Kluwer Academic Publishers, 1992.

Edwards, W., Phillips, L. D., Hays, W. L., and Goodman, B. C. Probabilistic information processing systems: Design and evaluation. IEEE Transactions on Systems Science and Cybernetics, 1968, SSC-4, 248-265.

Edwards, W., Schum, D. A., and Winkler, R. L. Murder and (of?) the Likelihood Principle: A trialogue. Journal of Behavioral Decision Making, 1990, 3, 75-87.

Gigerenzer, G., and Hoffrage, U. How to improve Bayesian reasoning without instructions: Frequency formats. Psychological Review, in press.

Koehler, J. J. The base rate fallacy reconsidered: Descriptive, normative, and methodological challenges. Behavioral and Brain Sciences, in press.

Lopes, L. L. Algebra and process in the modeling of risky choice. In Busemeyer, J. R., Medin, D. L., and Hastie, R. (Eds.) Decision making from the perspective of cognitive psychology. Academic Press, in press.

Matzkevich, I., and Abramson, B. Decision analytic networks in Artificial Intelligence. Management Science, 1995, 41, 1-23.

Murphy, A. H., and Winkler, R. L. Can weather forecasters formulate reliable forecasts of precipitation and temperature? National Weather Digest, 1977, 2, 2-9.

Savage, L. J. The Foundations of Statistics. New York: Wiley, 1954.

von Winterfeldt, D., and Edwards, W. Decision Analysis and Behavioral Research. New York: Cambridge University Press, 1986.

[1] This paper is a product of the HAILFINDER project, a close collaboration among Bruce Abramson, John Brown, Allan H. Murphy, Robert L. Winkler, and myself. I list myself as the only author simply because I wrote it and originated many of its ideas. But all of us had a hand in shaping their final form and implementation. The others could and perhaps should have been listed as co-authors. The project was sponsored by National Science Foundation grant SBR-9106440 between NSF and the University of Southern California, and the work was conducted in the Social Science Research Institute of USC.