Daubert, Rule 26(g) and the eDiscovery Turkey: Tasting the eDiscovery Turkey, Part 2

Herbert L. Roitblat, Ph.D.

CTO, Chief Scientist, OrcaTec

In which I continue to argue that how you cooked your eDiscovery turkey in the laboratory may not be a good indicator of its taste or wholesomeness when served from your kitchen.  

This is the second post (see the first one here) concerning Maura Grossman and Gordon Cormack’s article in Federal Courts Law Review about how to meet one’s obligations under Rule 26(g) of the Federal Rules of Civil Procedure. Their article is a critique of an earlier one by Schieneman and Gricks, who argued that measuring the outcome of an eDiscovery effort can be helpful in assessing the reasonableness of the effort expended to search for responsive documents.

In their critique, Grossman and Cormack argue that such measurement may be unnecessary under certain conditions. They argue that attorneys should rely on scientific studies of the efficacy of CAR/TAR systems based on an analogy to the Daubert standard, rather than on the evaluation of each particular case. Evaluating the success of eDiscovery, they say, is burdensome and can be misleading. They liken the process of eDiscovery to that of roasting a turkey. You don’t need a meat thermometer, they argue, if you are skilled at cooking turkeys. On the other hand, you might be successful at properly cooking a turkey without a meat thermometer, or you might not. It’s the same with eDiscovery. There may be cases where detailed measurement is unnecessary, but most of the time, the parties would benefit from measurements of their eDiscovery process.

Grossman and Cormack consider several post-hoc measures to assess the outcome of the eDiscovery process. They are correct to point out that the best measures will be those that include the entirety of the process, not just the software used. But then they ignore their own advice and focus solely on the technology. They mention several potential measures, but then dismiss them as requiring an unreasonable amount of effort or as being inaccurate.

They state correctly that the most direct method used to estimate Recall can be highly burdensome. Recall is the percentage of responsive documents that were found by the system, but we don’t really know the actual number or identity of these documents, so we have to estimate it with a sample. Estimating Recall directly requires a sample of about 385 responsive documents for an estimate with a confidence interval of ±5% at a 95% confidence level. Finding that set is where the burden comes in. We cannot use the CAR/TAR process itself to find them, because we want to assess the ability of that process to find these documents. We need an outside standard against which to assess the process. Instead, we have to sift through a random selection of documents until we find 385 responsive ones. To yield a valid measure, this random set should be drawn without knowledge of what the CAR/TAR system identified.
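The figure of about 385 can be reproduced with the usual normal-approximation formula for the sample size of a proportion. The sketch below is only an illustration: the function name `sample_size` is hypothetical, and it assumes the worst-case proportion of 0.5, which is the conventional choice when the true value is unknown.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence_level, proportion=0.5):
    """Sample size for estimating a proportion (normal approximation).

    proportion=0.5 is the worst case and is an assumption of this sketch.
    """
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    n = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)


print(sample_size(0.05, 0.95))
```

Loosening the margin of error shrinks the required sample quickly, which is part of why the choice of confidence interval matters so much for the cost of measurement.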

If the prevalence of responsive documents in the whole collection is 1%, as Grossman and Cormack say that they typically find, then we will have to sift through an average of 38,500 (385 is 1% of 38,500) randomly selected documents in order to find 385 that are responsive. This level of effort may be unreasonably high for most matters, typically costing more than the effort to do the CAR/TAR training whose outcome it is measuring.
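The arithmetic behind the 38,500 figure is simply the number of responsive documents needed divided by the prevalence. A small sketch (the function name is hypothetical) makes it easy to see how the burden changes with prevalence:

```python
def expected_documents_to_review(responsive_needed, prevalence):
    """Average number of random documents to review to find the needed responsive ones."""
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be greater than 0 and at most 1")
    return responsive_needed / prevalence


# 385 responsive documents at the 1% prevalence cited in the text,
# and at a hypothetical 10% prevalence for comparison.
for prevalence in (0.01, 0.10):
    print(prevalence, round(expected_documents_to_review(385, prevalence)))
```

The lower the prevalence, the more random documents must be reviewed to reach the same number of responsive ones.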

Misinterpretation and eRecall

Grossman and Cormack erroneously dismiss other measures of accuracy. One of them, eRecall, estimates Recall by comparing an estimate of responsive document prevalence before using the CAR/TAR process with an estimate of Elusion, which is the prevalence of responsive documents in the set after the CAR/TAR process has done its work to identify the responsive documents. Elusion is the proportion of the documents that are erroneously classified as non-responsive. It is the proportion of responsive documents in the so-called discard set.

eRecall is much easier to estimate than traditional Recall because it requires a simple sample of all of the documents to find the proportion that are responsive (Prevalence) and a simple sample of those designated as non-responsive by the process to estimate Elusion. Good estimates of Recall can be obtained by evaluating a few hundred documents rather than the many thousands that could be needed for traditional measures of Recall.

They incorrectly dismiss eRecall as being mathematically unsound and biased. There is nothing mathematically unsound about eRecall. It is based on the ratio of the proportion of responsive documents remaining after the process has run (Elusion) to the proportion of responsive documents in the collection as a whole (Prevalence). It is slightly biased in that, when true Recall is around 75%, it under-estimates Recall by a tiny amount. The amount of the bias decreases as the sample size increases.
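As a rough sketch of the calculation, assume the commonly used formulation eRecall = 1 - Elusion / Prevalence. This post does not spell the formula out, so treat that form, the function name `erecall`, and the counts in the example as assumptions made for illustration.

```python
def erecall(prevalence_hits, prevalence_n, elusion_hits, elusion_n):
    """Estimate Recall from two simple random samples.

    prevalence_hits / prevalence_n: responsive documents found in a random
        sample of the whole collection.
    elusion_hits / elusion_n: responsive documents found in a random sample
        of the set designated non-responsive (the discard set).
    Assumes eRecall = 1 - Elusion / Prevalence.
    """
    prevalence = prevalence_hits / prevalence_n
    elusion = elusion_hits / elusion_n
    if prevalence == 0:
        raise ValueError("no responsive documents in the prevalence sample")
    return 1 - elusion / prevalence


# Hypothetical counts: 40 of 400 responsive overall, 10 of 400 in the discard set.
print(erecall(40, 400, 10, 400))
```

With these hypothetical counts, Prevalence is 10%, Elusion is 2.5%, and the estimate comes out at about 0.75 after reviewing 800 documents in total.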

In statistics, we would prefer unbiased measures, but many statistical estimators have some, usually small, bias. Statistical bias does not mean that a measure is unsound. A biased measure is like a clock that runs a little fast or slow.

The reason that eRecall appears biased is that its distribution has a long tail at the low end (a few samples return eRecall estimates that are near zero) and is truncated at the top end (eRecall can never be greater than 1.0).

Simplifying a bit, in the following graph, the blue line represents the likelihood of obtaining a specific eRecall value when the true Recall is equal to 0.75 (75% Recall, the red line). The most likely eRecall value (the peak of the blue line) is right at the expected value with some spread around that. Other values are less likely, and less likely still the further they are from the maximum.  If you average all of these likelihoods, the average will be slightly below 0.75, meaning that the measure has a bias to under-estimate Recall at 75%.

The expected distribution of sampled measures of eRecall for a true Recall level of 75%.
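The size of this bias can be checked numerically. The sketch below computes the exact average of the estimate under several assumptions that are mine rather than the post's: eRecall is taken as 1 - Elusion / Prevalence, the two samples are independent and of equal size, true Elusion is set to Prevalence × (1 - Recall), and samples with no responsive documents at all are excluded. The function names and the 10% prevalence are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Probability of k hits in n draws, computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def expected_erecall(true_recall, prevalence, n):
    """Average value of 1 - elusion_hat / prevalence_hat for samples of size n."""
    elusion = prevalence * (1 - true_recall)  # simplifying assumption
    # The sample Elusion is unbiased, so only the average of 1 / prevalence_hat
    # needs to be computed, conditional on at least one responsive document.
    weights = [binom_pmf(k, n, prevalence) for k in range(1, n + 1)]
    mean_inverse = sum(w * n / k for k, w in enumerate(weights, start=1)) / sum(weights)
    return 1 - elusion * mean_inverse


for n in (200, 400, 2000):
    print(n, expected_erecall(0.75, 0.10, n))
```

Under these assumptions the average falls a little below 0.75 and moves closer to it as the sample size grows, which is the pattern described above.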


Grossman and Cormack claim that an assumption of eRecall is that it has the same confidence interval as directly measured Recall. This is incorrect. It is not clear why they think that it is necessary to assume that the two measures have the same confidence interval. They do not, and they do not have to.

In use, eRecall has a larger confidence interval than directly measured Recall because it involves the ratio of two random samples. On the other hand, the large number of documents that must be reviewed to compute Recall directly means that they will necessarily be reviewed by many reviewers, which leads to inconsistency. In contrast, the samples needed to estimate eRecall can be evaluated by a single person, thereby reducing the inconsistency of the review. Reviewer inconsistency would arguably contribute more variability than sampling would. In any case, a larger confidence interval does not make the measure mathematically unsound as Grossman and Cormack claim.
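To get a feel for how much wider the interval is, here is a rough comparison using the delta method for the variance of a ratio of two independent sample proportions. This approximation is my choice for illustration and is not taken from either article; the function names, the formula eRecall = 1 - Elusion / Prevalence, and the 10% prevalence are all assumptions.

```python
import math

Z_95 = 1.96  # two-sided critical value for a 95% confidence level


def direct_recall_margin(recall, n_responsive):
    """Half-width of a normal-approximation interval for directly sampled Recall."""
    return Z_95 * math.sqrt(recall * (1 - recall) / n_responsive)


def erecall_margin(prevalence, elusion, n_prevalence, n_elusion):
    """Approximate half-width for 1 - elusion / prevalence (delta method)."""
    ratio = elusion / prevalence
    # Relative variances of the two independent sample proportions.
    rel_var_elusion = (1 - elusion) / (n_elusion * elusion)
    rel_var_prevalence = (1 - prevalence) / (n_prevalence * prevalence)
    return Z_95 * ratio * math.sqrt(rel_var_elusion + rel_var_prevalence)


# True Recall of 75% with a hypothetical 10% prevalence and 2.5% Elusion.
print(direct_recall_margin(0.75, 385))
print(erecall_margin(0.10, 0.025, 385, 385))
```

In this hypothetical case the eRecall interval is several times wider, but it is obtained from 770 reviewed documents, whereas finding 385 responsive documents at 10% prevalence would take about 3,850. The approximation becomes unreliable when the expected number of responsive documents in either sample is small.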

Even a measure with a large confidence interval may still be useful for assessing the success of an eDiscovery process. Grossman and Cormack note that “it is neither feasible nor necessary to expend the effort to estimate recall with a margin of error of ±5%, at a 95% confidence level—the standard required for scientific publication.” Rather, as I have argued, the confidence interval adopted for any given measurement should be one that is consequential.

eDiscovery is not a science project. We are less interested in estimating the exact Recall level achieved than in being confident that the process has been reasonably complete, that is, that it has met certain standards. Even if a highly precise measure is not obtainable, a measure is still desirable. It would be erroneous to conclude that no measure is better than one that is not guaranteed to be accurate.

eRecall depends on an estimate of the overall prevalence of responsive documents in the collection as well as an estimate of the prevalence of erroneously classified responsive documents in the set designated as non-responsive.

For Cormack and Grossman’s CAL system, the overall estimate of collection prevalence can be a burdensome requirement because it requires a separate sample, on top of their search for a seed set. For systems that use random sampling, on the other hand, the estimate of prevalence comes automatically from judgment of the random samples. Therefore, it is much more efficient to calculate eRecall with random sampling than with continuous active learning.


Grossman and Cormack try to argue that assessment of individual eDiscovery processes is overly burdensome and unnecessary. They suggest that reliance on prior demonstration of a system’s efficacy as run by a qualified user is sufficient to establish the reasonableness of an eDiscovery process. Although they note that a proper assessment will include more than the technology, they fail to recognize that their proposal ignores these very factors that they claim are important.

They noted that the variance for a given measure is a function of the sampling error and the measurement error. By relying on previous scientific studies, they effectively compound that variability with an approach that includes:

Variance due to characteristics of the study

  • The original document set
  • The query set (the RFP)
  • The special conditions, if any, of the investigation
  • The system’s users
  • The quality of the keywords used to build the seed set
  • Statistical sampling error
  • Assessment error

Plus variance due to the characteristics of the eDiscovery for which the system will be used:

  • The document set
  • The sampling error
  • The query set (the RFP)
  • The special conditions, if any, of the collection
  • The system’s users
  • The quality of the keywords used to build the seed set
  • Statistical sampling error
  • Assessment error

Their proposal would not satisfy the very Daubert-like criteria that they propose to use because it ignores many important variables and because of the serious flaws in the scientific research that they use to support their position.

If we follow their logic, we would be left with an analysis of the reasonableness of a CAR/TAR process that consists mostly of “The process is reasonable because I said it is reasonable.”  We can do better.

