Daubert, Rule 26(g) and the eDiscovery Turkey

Herbert L. Roitblat, Ph.D. CTO, Chief Scientist, OrcaTec


In which I argue that how you cooked your eDiscovery turkey in the laboratory may not be a good indicator of its taste when served from your kitchen.

In an article in Federal Court Law Review, Maura Grossman and Gordon Cormack critique Karl Schieneman and Thomas Gricks’ analysis of the implications of Federal Rules of Civil Procedure, Rule 26(g) for the use of Computer Assisted Review (CAR or Technology Assisted Review, TAR). Put simply, Schieneman and Gricks argue (among other things) that one should measure the outcome of eDiscovery efforts to assess their reasonableness, while Grossman and Cormack argue that such measurement is unnecessary under certain conditions.

Rather, Grossman and Cormack argue, attorneys should rely on scientific studies of the efficacy of CAR/TAR systems based on an analogy to the Daubert standard. They argue that evaluating the success of eDiscovery is burdensome and can be misleading. They liken the process of eDiscovery to that of roasting a turkey.

By my analysis, the turkey analogy may be appropriate, but Grossman and Cormack’s analysis does not stand up to its own criteria. They are mistaken about the burden required to measure the efficacy of eDiscovery processes. In this discussion, I will focus on their scientific arguments.

Applying Daubert

The central tenet of Grossman and Cormack’s thesis is “Relying on the elements of the Daubert test, we argue that the principal focus of validation should be on (i) prior scientific evaluation of the TAR method, (ii) ensuring its proper application by qualified individuals, and (iii) proportionate post hoc sampling for confirmation purposes” (p. 289). In their conclusion, they seem to drop the third focus.

To be clear, Grossman and Cormack are not claiming that the Daubert standard is literally applicable to eDiscovery, but that the principles it articulates are relevant to our assessment. The parts of Daubert that seem most applicable are “whether the theory or technique in question can be (and has been) tested, whether it has been subjected to peer review and publication, its known or potential error rate, and the existence and maintenance of standards controlling its operation, and whether it has attracted widespread acceptance within a relevant scientific community.”

The stronger version of their argument is that the CAL system, which was designed by Cormack and Mojdeh, is the only CAR/TAR system that has received this prior scientific evaluation (by Cormack and Grossman), and the only one that has been subject to peer review and publication. Its error rate has been presented. Therefore, it is the CAR/TAR system that should be used.

There is no evidence, however, that there are any standards controlling its operations or that it has attracted widespread acceptance in any scientific community. Grossman and Cormack do not say how important they consider these factors.

Their CAL (Continuous Active Learning) process was presented at TREC 2009 and in a paper prepared for the recent SIGIR conference (Special Interest Group on Information Retrieval of the Association for Computing Machinery). I have previously written about the SIGIR paper. Because of serious methodological flaws, it neither supports nor denies the effectiveness of the CAL system.

To summarize some of these flaws:

  • They used the CAL method in the original production and then used the CAL method again to determine how well it predicted the original production. This is circular and invalid.
  • Any document not presented for review by the CAL method in the original review had little chance to be identified as responsive and was defined to be non-responsive in the study. (There was a potential for some responsive documents to be added to the evaluation set through judgmental sampling or random sampling, but there were not likely to be many of them.) Therefore, the study provided no estimate of its true Recall, only its ability to reproduce the results it originally produced. Any document categorized as responsive by any of the other systems that was not categorized by the CAL system was defined as non-responsive in the study, eliminating the possibility of estimating their true Recall values as well.
  • The measure that they used for assessing the various systems was the method used by the CAL system, not the others that they tested. The measure was the ability of the system to present documents predicted to be responsive before other documents. Only the CAL system selected documents for review according to this pattern, so only it achieved what appeared to be high levels of performance.
  • They tested a specific machine-learning algorithm with a specific method and three different means of choosing the training set presented to that single algorithm. Whether inadvertent or not, their arguments have been taken to imply that their results speak to the training set selection methods (Continuous Active Learning, Simple Active Learning, and Simple Passive Learning) across all types of machine learning when, at best, they could only make claims for the one algorithm that they used.

Given these flaws, their SIGIR paper cannot be considered as providing scientific support for the efficacy of their system, at least for the data derived from actual legal matters. These flaws do not mean that the CAL system was necessarily ineffective. Rather, it means that the study simply cannot be used as strong support for the CAL system. The data derived from the TREC studies may be a little bit better, but they cannot be seen as independent in any sense, in that the TREC legal track was overseen by Grossman and Cormack as well.

Is a Daubert-like standard the right way to think about predictive coding?

Even if their CAL system met the criteria they derive from Daubert, is that the right standard to apply? Is it adequate?

The general idea is that we can minimize the effort spent on assessing the quality of an eDiscovery process if that process can otherwise be shown to be adequate. According to Grossman and Cormack, once a system has been certified as passing a Daubert-like standard that should be enough, provided that it is operated by a qualified person. They don’t spell out how we know whether a person is qualified. They could mean, for example, that any attorney is qualified to run the CAL process or that only the attorney who ran it in the studies, presumably Grossman, is qualified. If the latter, the implication would be that the only justifiable use of CAR/TAR would be Grossman using the CAL system. That position is too self-serving to be likely.

On the other hand, if any lawyer is considered qualified, then variability in their search skills may have a significant impact on the quality of the results. If some lawyers are qualified and others are not, then we need a way to identify their qualifications.

Scientific investigations are important to developing CAR/TAR systems that work, but eDiscovery is not a science project. CAR/TAR systems can be shown to be effective in a number of ways, including their performance in actual cases. The purpose of eDiscovery systems is to help their users meet their obligations, not to support research projects. There is nothing unique about scientific papers that gives them a monopoly on ways of knowing. Scientific principles can be applied in many different ways to build CAR/TAR systems. There are many ways of vetting a process, including pilot projects and assessment of the final results.

Grossman and Cormack compare eDiscovery using CAR/TAR with the process of roasting a turkey:

When cooking a turkey, one can be reasonably certain that it is done, and hence free from salmonella, when it reaches a temperature of at least 165 degrees throughout. One can be reasonably sure it has reached a temperature of at least 165 degrees throughout by cooking it for a specific amount of time, depending on the oven temperature, the weight of the turkey, and whether the turkey is initially frozen, refrigerated, or at room temperature. Alternatively, when one believes that the turkey is ready for consumption, one may probe the turkey with a thermometer at various places.

They argue that we do not need to measure the temperature of the turkey in order to cook it properly, that we can be reasonably sure if we roast a turkey of a specific weight and starting temperature for a specific time at a specific oven temperature. This example is actually contrary to their position. Instead of one measure, using a meat thermometer to assess directly the final temperature of the meat, their example calls on four measures: roasting time, oven temperature, turkey weight, and the bird’s starting temperature to guess at how it will turn out. All four of these have to be known to reasonable tolerance in order to roast a turkey that is done enough to be safe, but not roasted into shoe leather.

To be consistent with their argument, they would have to claim that we would not have to measure anything, provided that we had a scientific study of our oven and a qualified chef to oversee the cooking process. This example points up the essential problem with their position. It may be great to have a good history of cooking turkeys, but the quality of tonight’s dinner depends on the specifics. Grossman and Cormack recognize that there are many variables that go into a successful eDiscovery project, but then like the turkey roaster, ignore those problems in their suggested “way forward.”

eDiscovery Variability

[Flow chart: Sources of variability in the eDiscovery process]

Some of these sources of variability are diagrammed out in the flow chart at the left.  The completeness of any eDiscovery process can be affected by the document set and case issues. The prevalence of responsive documents in a collection is the most obvious factor that is associated with the document set, but other factors can also contribute to the ease or difficulty of the eDiscovery process, including the number of documents that have to be OCRed, the percentage of obviously junk documents, and others. The subtlety of the distinctions between responsive and non-responsive documents as determined by the case issues or the request for production can also affect the ultimate ease or difficulty of the process.

Some situations and some attorneys are more generous than others in terms of what should be produced. Sometimes it may be more appropriate to apply strict criteria to the documents that are considered relevant, sometimes it may be more appropriate to be less strict. There is an inherent tradeoff between Precision and Recall. Improving Recall may come at the expense of decreasing Precision, producing more non-responsive documents along with the responsive ones. These and other strategic decisions can affect the ultimate judged accuracy of the eDiscovery process.
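The Precision/Recall tradeoff can be made concrete with a toy sketch. The scores and labels below are hypothetical, not drawn from any actual system; the point is only that lowering the production threshold raises Recall while Precision falls.

```python
# Toy illustration of the Precision/Recall tradeoff using hypothetical
# relevance scores and responsiveness labels (1 = responsive).

def precision_recall(scores, labels, threshold):
    """Treat every document scoring at or above `threshold` as produced."""
    produced = [lab for s, lab in zip(scores, labels) if s >= threshold]
    true_pos = sum(produced)            # responsive documents produced
    all_pos = sum(labels)               # responsive documents in the collection
    precision = true_pos / len(produced) if produced else 0.0
    recall = true_pos / all_pos if all_pos else 0.0
    return precision, recall

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

strict = precision_recall(scores, labels, 0.75)   # Precision 2/3, Recall 1/2
lenient = precision_recall(scores, labels, 0.25)  # Precision 4/7, Recall 1.0
```

Producing more of the collection captured every responsive document (Recall rose from 0.5 to 1.0), but at the cost of sweeping in more non-responsive ones (Precision fell from 2/3 to 4/7).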

The quality of the seed set that is used to train the system can also affect the ultimate accuracy of eDiscovery. Some systems are more dependent than others on the quality of the seed set and attorneys vary in their ability to construct these sets.

The quality and consistency of the reviewers can also play an important role in the accuracy of eDiscovery. The variability or inconsistency of the reviewers can affect the training of the system and can affect the judgments about the success of the process.

Technologies used in CAR/TAR also differ in their abilities to produce accurate results. Some of this variability is the extent to which the specific technology is sensitive to the other factors, which may argue that certain technologies might be better suited to different situations. We generally do not know what portion of the success of a system can be attributed to the technology per se, but it is certainly less than 100%.

The system’s performance in a “scientific study” provides no information about any of these sources of variability, except the technology. Moreover, that technology is tested in a single situation, with a single set of queries, and so on.

Another study of CAR/TAR systems, conducted by the eDiscovery Institute and Oracle, found that the same underlying technology used by different vendors could end up performing well or performing poorly. (Although I am a member of the EDI board, I played no role in any aspect of the study or its analysis.)

As Patrick Oot reports: “Technology providers using similar underlying technology, but different human resources, performed in both the top and bottom tiers of all categories.” Just knowing about the technology is, therefore, not enough to predict how well a given CAR/TAR project will work out.

Grossman and Cormack say, “Perhaps a gifted searcher (a ‘TAR Whisperer’) could judgmentally select precisely the right set of examples to correctly train the SPL tool, but can the typical responding party be relied upon to conduct an adequate search?” The same question applies to any tool that depends on a seed set. The CAL system depends strongly on the seed set because it never presents for review documents that are not already predicted to be responsive (see Cormack and Mojdeh, fig. 4). Documents that score far below the currently predicted responsive set, that is documents that are very different from those already predicted to be responsive, are truncated from the distribution. There is little chance that they can ever be judged by the reviewer and thus added to the set of positive examples. The dependence of the system on its seed set follows directly from the description of the CAL model, and they present no contrary evidence.
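That dynamic can be sketched in a few lines of Python. This is my own toy simplification with a hypothetical word-overlap scorer, not Cormack and Mojdeh's implementation; it is meant only to show why a document very different from the seed set may never be presented for review.

```python
# Toy sketch of a CAL-style review loop (my own simplification). Each
# round, only the top-scored unreviewed documents are presented; documents
# dissimilar to everything already judged responsive keep low scores and
# may never be reviewed at all.

def similarity(doc, exemplars):
    """Hypothetical relevance score: word overlap with known-responsive docs."""
    vocab = set().union(*[set(d.split()) for d in exemplars]) if exemplars else set()
    words = set(doc.split())
    return len(words & vocab) / len(words)

def cal_loop(documents, seed_responsive, is_responsive, batch_size=2, rounds=2):
    responsive = list(seed_responsive)
    unreviewed = [d for d in documents if d not in responsive]
    for _ in range(rounds):
        # Rank unreviewed documents by similarity to known responsive docs ...
        unreviewed.sort(key=lambda d: similarity(d, responsive), reverse=True)
        # ... and present only the top of the ranking for human review.
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        responsive += [d for d in batch if is_responsive(d)]
    # Anything still unreviewed is, in effect, treated as non-responsive.
    return responsive, unreviewed

docs = ["contract payment due", "payment schedule contract",
        "invoice payment late", "picnic holiday party",
        "secret side agreement kickback"]
found, missed = cal_loop(docs, ["contract payment terms"],
                         lambda d: "payment" in d or "kickback" in d)
# The responsive "kickback" document shares no vocabulary with the seed,
# so it is never presented for review and ends up in `missed`.
```

Under these (contrived) conditions a responsive document phrased in entirely different vocabulary from the seed set is never surfaced, which is the seed-set dependence described above.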

Scientific studies or other generic validation methods cannot account for the variability that derives from the employment of the technology in specific matters by specific individuals. Grossman and Cormack themselves argue, in their description of the Daubert-like framework, for proportionate post hoc sampling for confirmation purposes, but then do not follow through to offer any.

Given these sources of variability or uncertainty in the execution of the CAR/TAR process, it seems clear that we cannot dispense with the need to evaluate the outcome of each individual use of the process. There may be matters where it makes sense for the parties simply to accept each other’s production, but not all cases enjoy that level of trust. For those, it may be better to “trust, but verify.” That verification requires analysis of the results of the entire process, as Grossman and Cormack suggest, and that requires measurement. In part two of this blog, I consider some of the measures that could be employed with manageable levels of effort.
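As a preview of how modest that measurement burden can be, one common approach is to sample the discard pile and estimate the rate of responsive documents left behind (sometimes called elusion). The sketch below is a minimal illustration with a simple normal-approximation 95% interval; the function name and interface are my own.

```python
import math
import random

# Sketch of post hoc verification by random sampling: draw a random sample
# from the documents NOT being produced and estimate the proportion of
# responsive documents missed, with a normal-approximation 95% interval.

def elusion_estimate(discard_pile, is_responsive, sample_size, rng=random):
    sample = rng.sample(discard_pile, sample_size)
    hits = sum(1 for doc in sample if is_responsive(doc))
    p = hits / sample_size
    half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical discard pile: 3% of it is actually responsive.
pile = ["responsive"] * 30 + ["non-responsive"] * 970
p, lo, hi = elusion_estimate(pile, lambda d: d == "responsive", 400,
                             rng=random.Random(7))
```

Reviewing a few hundred sampled documents, rather than re-reviewing the whole discard pile, is typically enough to bound the elusion rate, which is why the burden argument against measurement is overstated.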

