Random Sampling as an Effective Predictive Coding Training Strategy

By Herbert L. Roitblat, Ph.D.

I don’t usually comment on competitors’ claims, but I thought that I needed to address some potentially serious misunderstandings that could come out of Cormack and Grossman’s latest article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” Although they find in this paper that an active learning process is superior to random sampling, it would be a mistake to think their conclusions would apply to all random sampling predictive coding regimens. The random sampling training regimen used by OrcaTec, for example, achieves higher levels of Recall* with less training than Cormack and Grossman’s best learning algorithm.

Cormack and Grossman compare three methods for going through documents in an eDiscovery collection. Their best method is called CAL (Continuous Active Learning). It starts with a seed set, and then presents for review those documents that are predicted to be most likely to be responsive. The second method that they use is the SAL method (Simple Active Learning), which also begins with a seed set derived from a keyword search. The SAL algorithm presents for review those documents about which the algorithm is least certain. Training with this method continues until the benefit of adding more training documents is outweighed by the cost of reviewing them.

Both of these methods are examples of “active learning,” in that they choose for reviewer feedback certain documents based on the predictions of the respective algorithm. These methods contrast with the third method they study, which Cormack and Grossman call the SPL method (Simple Passive Learning). It uses either random sampling or the operator’s choice to select documents to be reviewed and added to the training set. They find that the SPL method yields very poor results compared to the other two methods. They claim that these results “call into question not only its [random sampling’s] utility, but also commonly held beliefs about TAR.” About this they are mistaken. Random sampling can and does yield better results than their CAL method.
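To make the difference between the three selection rules concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Cormack and Grossman's implementation: the function name `select_batch`, the `scores` dictionary of predicted responsiveness probabilities, and the use of distance from 0.5 as the measure of uncertainty are all hypothetical choices made for this example. Only the random-sampling variant of SPL is modeled, not the operator's-choice variant.

```python
import random

def select_batch(scores, labeled, batch_size, strategy, rng=None):
    """Choose the next documents to present for review.

    scores  -- dict mapping doc id -> predicted probability of responsiveness
               (hypothetical classifier output)
    labeled -- set of doc ids that have already been reviewed
    """
    rng = rng or random.Random(0)
    candidates = [doc for doc in scores if doc not in labeled]

    if strategy == "CAL":
        # Most likely to be responsive first.
        ranked = sorted(candidates, key=lambda d: scores[d], reverse=True)
    elif strategy == "SAL":
        # Least certain first; assumed here to mean closest to 0.5.
        ranked = sorted(candidates, key=lambda d: abs(scores[d] - 0.5))
    elif strategy == "SPL":
        # Random sampling: the classifier's scores are ignored.
        ranked = list(candidates)
        rng.shuffle(ranked)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return ranked[:batch_size]

# Made-up scores for five documents; "a" has already been reviewed.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.70, "e": 0.45}
print(select_batch(scores, {"a"}, 2, "CAL"))  # ['d', 'b']
print(select_batch(scores, {"a"}, 2, "SAL"))  # ['b', 'e']
print(select_batch(scores, {"a"}, 2, "SPL"))  # two unreviewed documents at random
```

The two active rules depend on the current state of the classifier, while the random rule does not, which is why a randomly drawn training set can double as an unbiased sample of the collection.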

The first graph below replots their CAL results for the four legal matters that they reported. I have not included the results from the TREC series of questions because these examples are a bit contrived and because the four curves in this graph represent actual legal matters on which their software was used. On the same graph, I also show comparable results from three matters on which the OrcaTec predictive coding system (OrcaPredict) was used.

OrcaPredict achieves higher accuracy than the CAL algorithm

The x-axis shows the number of training documents that were reviewed. The y-axis shows the level of Recall obtained. The best performance is represented by points in the upper left hand corner, where high levels of Recall are obtained after reviewing only a small number of documents.

With the exception of Matter C, the results of the OrcaTec predictive coding system (OrcaPredict) were substantially superior to the results obtained with the CAL algorithm. OrcaPredict required between 1,200 and 5,000 training documents to be reviewed to achieve high levels of Recall.  There is certainly no support for the claim that the CAL algorithm is better than the random sampling approach embodied in OrcaPredict.

The next graph shows the number of documents that had to be reviewed to achieve the observed Recall level. For the three OrcaTec results, the observed Recalls were 0.907, 0.965, and 0.81. For the Cormack and Grossman matters, the number of documents needed to achieve a given Recall was taken from the first point on the corresponding curve in the preceding graph at which Recall exceeded either 0.80 (80%) or 0.90 (90%). In general, many more documents needed to be reviewed to achieve a high level of accuracy in the Cormack and Grossman results than in the OrcaTec projects.

The number of documents to review to achieve either 80% or 90% Recall

The third graph shows Cormack and Grossman’s results from the CAL algorithm and the three OrcaTec matters in terms of Recall and Precision**. Perfect performance on this graph is in the upper right-hand corner. Cormack and Grossman achieve high levels of Recall, but at the expense of low levels of Precision. OrcaPredict achieves high Recall and Precision, with small levels of effort. Higher Precision means that fewer documents need to be reviewed in order to find the responsive ones. The CAL process required (simulated) reviewers to examine many more documents to identify responsive ones than the OrcaPredict reviewers had to examine.

Precision vs Recall for Cormack & Grossman and for OrcaPredict

In summary, it is very difficult to support the claim that Cormack and Grossman’s CAL procedure is superior to random sampling in general. Although in their investigation, they do find that their implementation of random sampling yields notably poorer Recall than the other methods they employ, that difference does not extend to other implementations. In particular, it does not extend to OrcaPredict.

Why does OrcaPredict perform so well when Cormack and Grossman’s SPL algorithm performs so poorly?

We don’t have enough information to explain definitively why their SPL process is so much poorer than OrcaTec’s random sampling-based process, but there are a few factors that might merit further investigation.

First, we engage in different tasks. Although both approaches are intended to separate responsive from non-responsive documents, their approach involves continuous active learning. In their view, “the TAR process typically begins with no knowledge of the dataset and continues until most of the relevant documents have been identified and reviewed. A classifier is used only incidentally for the purpose of identifying documents for review.” In contrast, the OrcaTec process includes exploratory analysis so that the subject matter expert does, in fact, know something about the data set before beginning training.

Further, OrcaPredict separates review and training into separate steps. The classifier is not incidental to review; it is an essential part of identifying the responsive documents. Training typically requires the subject matter expert to review 1,200 – 5,000 documents, following which the system can accurately predict the status of the remaining documents in the collection. In contrast, Cormack and Grossman’s CAL algorithm combines training and review into a single step, so the subject matter expert(s) using their system must review many thousands of non-responsive documents (reviewing a total of 4,000 – 18,000 documents to find 1,000 to 16,000 relevant documents). OrcaPredict subject matter experts review only a relatively small number of non-responsive documents.

Cormack and Grossman used a simulated feedback process, whereas the OrcaTec results were derived from an actual case. We do not know what effect the simulation had on the accuracy of the results.

Evaluation of the OrcaPredict projects was done directly by the attorneys running the project, but Cormack and Grossman relied on a so-called gold standard. As I understand it, Grossman uses the CAL algorithm in her engagements with her clients, so their production, on which this gold standard was based, relied on this specific algorithm, plus some second-pass review and “quality assurance efforts.” In other words, they used the CAL process to identify the very documents that the CAL process was so good at finding. They do not know if there were other documents in the set that were not identified by the CAL process, because they only looked at documents that were selected by the CAL process. What we learn from using the CAL process as both the test and the standard is that it is repeatable. If this is correct, then it is a very serious flaw in their study.

Another potentially important difference is that OrcaPredict represents the documents that it classifies by the words, in Unicode, that they contain. In contrast, CAL used the first 30,000 bytes of the ASCII text representation (including sender, recipient, cc, bcc, subject and date) broken into overlapping 4-byte shingles, where each 4-byte sequence was represented by one of about a million different values. This shingle representation would tend, I think, to destroy any semantics (meaning) in the documents, which could help to make the documents more difficult to classify for any algorithm other than CAL.

The actual words used in a document may be critical for other machine-learning algorithms, for example, by allowing words learned in one kind of document, such as an email, to be gainfully used in another kind of document, such as a spreadsheet. It is not clear, for example, how useful letter strings like “cker” or “ters” are for distinguishing responsive from non-responsive documents.
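For readers unfamiliar with shingling, the following Python sketch contrasts the two representations. It is only an approximation of what is described above: the hash function (CRC32), the exact bucket count of 1,000,000, and the function names are assumptions made for this example, since the only details given are overlapping 4-byte sequences mapped to about a million values. The word-based function is a simple whitespace split, not OrcaPredict's actual tokenizer.

```python
import zlib

MAX_BYTES = 30_000       # first 30,000 bytes of the ASCII text
NUM_BUCKETS = 1_000_000  # "about a million" values; exact count assumed

def byte_shingles(text, width=4):
    """Return every overlapping `width`-byte sequence in the ASCII text."""
    data = text.encode("ascii", errors="ignore")[:MAX_BYTES]
    return [data[i:i + width] for i in range(len(data) - width + 1)]

def shingle_features(text):
    """Map each shingle to a bucket id (CRC32 is an assumed stand-in hash)."""
    return {zlib.crc32(s) % NUM_BUCKETS for s in byte_shingles(text)}

def word_features(text):
    """Word-based representation: lowercase tokens split on whitespace."""
    return set(text.lower().split())

print(byte_shingles("hackers"))
# [b'hack', b'acke', b'cker', b'kers']

# Two unrelated words share shingles but not word features.
shared = set(byte_shingles("hackers")) & set(byte_shingles("checkers"))
print(sorted(shared))                                       # [b'cker', b'kers']
print(word_features("hackers") & word_features("checkers"))  # set()
```

In this toy case the fragment “cker” appears in both words, so a shingle-based feature cannot by itself tell them apart, whereas the word-based features keep them distinct.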

These and other differences might account for why their random sampling performs so poorly. If theirs were the only random-sample predictive coding results available, then they would be justified in claiming that random samples are not suitable for training predictive coding, but they ignore the success of this approach when used by their competitors. Random sampling is not only a viable method of training predictive coding, but it can also yield superior results. Therefore it would be wrong to conclude that random sampling is not an appropriate method for training predictive coding.

Random sampling is still preferable

Even if random sampling were somewhat less efficient than other approaches, there would still be good reasons to adopt it.

eDiscovery is not just an academic exercise. It’s not clear how to generalize the results of Cormack and Grossman’s study to the real world of day-to-day eDiscovery operations. In real eDiscovery, it may not be clear how to determine when it would be reasonable to stop looking for additional responsive documents, though clearly Grossman uses their process for that purpose. Unless a separate effort is made to assess the proportion of responsive documents in the collection (using a random sample), it is impossible even to estimate how well the system is doing.

OrcaPredict, in contrast, generates a new random sample after each iteration that allows the attorneys to evaluate the current level of Recall and Precision. Within statistical confidence limits, it is possible to know how well the system is doing. Measurement is an intrinsic part of the training method, not something that has to be tacked on separately. The quality of the process is immediately and transparently available at each iteration.
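As a rough illustration of how a reviewed random sample can yield Recall and Precision estimates with confidence limits, consider the Python sketch below. It is not OrcaPredict's actual procedure, which is not described here: the choice of the Wilson score interval, the 95% level (z = 1.96), and the function names are assumptions made for this example, and the sample counts are invented.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    if n == 0:
        raise ValueError("cannot compute an interval from zero observations")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def estimate_from_sample(sample):
    """Estimate Recall and Precision from a reviewed random sample.

    sample -- list of (predicted_responsive, truly_responsive) booleans,
              one pair per sampled document
    """
    tp = sum(1 for pred, truth in sample if pred and truth)
    fp = sum(1 for pred, truth in sample if pred and not truth)
    fn = sum(1 for pred, truth in sample if not pred and truth)

    result = {}
    # Recall: of the truly responsive documents, how many were predicted?
    if tp + fn:
        result["recall"] = tp / (tp + fn)
        result["recall_ci"] = wilson_interval(tp, tp + fn)
    # Precision: of the predicted-responsive documents, how many were right?
    if tp + fp:
        result["precision"] = tp / (tp + fp)
        result["precision_ci"] = wilson_interval(tp, tp + fp)
    return result

# Invented sample of 500 documents: 45 true positives, 5 false negatives,
# 5 false positives, and 445 true negatives.
sample = ([(True, True)] * 45 + [(False, True)] * 5 +
          [(True, False)] * 5 + [(False, False)] * 445)
print(estimate_from_sample(sample))
```

Note that the Recall interval depends on the number of responsive documents in the sample, not on the total sample size, so collections with few responsive documents need larger samples for the same precision of estimate.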

eDiscovery is often conducted in an adversarial situation. The receiving party may have concerns about the operation of the technology, but they may also have concerns about the willingness or ability of the producing party to select the information that is required. These concerns may be inappropriate, but some of the most complicated predictive coding protocols that have been published have been built around assuring a level of trust in the efforts of the producing party. Random sampling for training seems to help somewhat in establishing that trust. Receiving parties, for better or for worse, seem to believe that there is less chance that the producing party can game the system when the training set is chosen randomly.

If the training set needs to be turned over to the other side as part of building confidence, then the fact that it was randomly generated provides no additional attorney work product beyond whether the document was considered responsive or not. Courts have been inconsistent on whether these training sets must be shared.

The CAL protocol presents for review the highest-ranking remaining documents at the moment, so how the initial set is chosen can have a profound effect, because the documents that are presented for review are those that are most like those already identified. Their paper is not sufficient to investigate this question because of the circular evaluation it uses. The CAL algorithm was a key component in producing the so-called “gold standard,” so documents not identified by this algorithm would be unlikely to be considered for review. Analyzing performance with different initial seed sets would help (which they begin to investigate in their paper), but more information is needed.

Three factors are important determinants of the effectiveness of predictive coding: Validity, Consistency, and Representativeness. Validity means that the training examples of responsive documents need to actually be responsive. They need to be valid examples of responsive documents. Consistency means that the same documents need to be categorized in the same way each time they are encountered, or similar documents need to be treated in similar ways. Representativeness means that the example responsive documents need to reflect the breadth of responsive documents in the collection. These three factors are not absolutes. Most systems can tolerate some deviation in any or all of them, but the more valid, consistent, and representative the training set is, the higher the accuracy of the system will be, all other things being equal.

Training conducted by a subject matter expert contributes to validity. Training by a single subject matter expert contributes to consistency. Random sampling contributes to representativeness.  These are the methods adopted by OrcaPredict and they appear to contribute to its effectiveness.


Although Cormack and Grossman found in their investigation that the SPL method with random sampling performed poorly, there is no reason to think that this pattern of results would apply to random sampling in general. Actual matters in which random sampling was used to train the predictive coding system yielded higher levels of Precision and Recall than their CAL system, not the very low levels of accuracy they found with SPL. Furthermore, there are advantages to using random sampling as a training method. Some receiving parties consider random sampling harder to game than motivated sampling. It provides representativeness of the training examples, and it allows one to assess the current accuracy of the system throughout training.

In short, it would be a serious mistake to take one vendor’s (Cormack and Grossman’s) failure to observe effective results with a competitor’s method as evidence that they have identified the key limitations in the products of other vendors (e.g., OrcaTec).

*Recall is the proportion of responsive documents in the entire collection that have been retrieved.
**Precision is the proportion of documents the computer determined to be responsive that were found upon review to be, in fact, responsive.

