By Herbert L. Roitblat, Ph.D.
Let’s face it. “Active learning,” where the computer picks the training examples, sounds cooler than “passive learning,” where the training examples are chosen randomly. Who wants to think that they are passively sitting by when they can be actively going out and finding responsive documents? But when you get past the feel-good aspects of the name, there are some real advantages to a system based on “passive” random sampling.
Predictive coding uses machine learning algorithms to construct computational criteria for separating responsive from non-responsive documents. There are many protocols and algorithms that can be used to make that distinction. Many of them start with an initial set of documents that are labeled responsive or non-responsive by some subject matter expert or group of experts. This initial set is typically called a “seed set.” I prefer not to use a seed set because it usually relies on keyword searches, which have been known for nearly 30 years to be relatively ineffective (Blair & Maron, 1985). I explain more about that below.
After processing the seed set, if any, most predictive coding systems go through a series of steps in which the subject matter expert is presented with additional example documents to label as responsive or non-responsive, and the learning system is adjusted to better approximate the ultimate distinction between document categories in the hope of extending the distinction accurately to as-yet unseen documents.
Here, I concentrate on the way in which example sets are selected for review by the subject matter expert(s).
When the additional examples presented for labeling are selected by the computational algorithm, then the process is called “active learning.” When the examples are selected randomly, the process is called “passive learning.” There may also be situations where the examples are chosen by the subject matter expert, but we will not be considering them further here.
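To make the distinction concrete, here is a minimal Python sketch of the two selection strategies. The function names, the document IDs, and the scores are all hypothetical, and "uncertainty sampling" is only one of several ways an algorithm might pick its own examples; it stands in here for active learning in general.

```python
import random

def select_passive(unlabeled_ids, batch_size, rng):
    """Passive learning: pick the next documents uniformly at random."""
    return rng.sample(unlabeled_ids, batch_size)

def select_active(unlabeled_ids, scores, batch_size):
    """Active learning (uncertainty sampling, one common variant):
    pick the documents whose predicted probability of being
    responsive is closest to 0.5, i.e. where the model is least sure."""
    ranked = sorted(unlabeled_ids, key=lambda d: abs(scores[d] - 0.5))
    return ranked[:batch_size]

# Hypothetical collection of eight documents, identified by integer IDs.
doc_ids = [0, 1, 2, 3, 4, 5, 6, 7]

# Hypothetical model scores: estimated probability that each is responsive.
scores = {0: 0.02, 1: 0.10, 2: 0.48, 3: 0.55,
          4: 0.91, 5: 0.35, 6: 0.97, 7: 0.05}

rng = random.Random(0)  # fixed seed so the random draw is repeatable
passive_batch = select_passive(doc_ids, 3, rng)
active_batch = select_active(doc_ids, scores, 3)

print("passive:", passive_batch)
print("active: ", active_batch)
```

Note that the passive selector never looks at the model's scores, so the examples it picks cannot be biased by what the model currently believes.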
No matter what technology you use, there are three factors that affect the success of a predictive coding process: validity, consistency, and representativeness. The more your training set conforms to these three goals, the better the accuracy of the system will be, all other things being equal.*
Validity means that the distinctions in the training set between responsive and non-responsive documents correspond to the actual distinctions between responsive and non-responsive documents.
Consistency means that the same information is treated in the same way if it is encountered multiple times and that similar information is treated in similar ways every time it is encountered. High validity cannot be achieved without high consistency.
Representativeness means that the set of training examples represents the variability among the documents to be classified, not just subsets of them, for example, not just responsive documents that are easy to find.
Although other approaches can achieve high representativeness, one critical reason for using random sampling to pick the training set is that it automatically maximizes the representativeness of the training set. With other approaches, you need to take additional steps to ensure representativeness (for example, having an initial seed set that is sufficiently representative).
There are certain conditions under which some document categorization algorithms are more efficient with active learning than with passive learning, for example, support vector machines (see Tong and Koller, 2001). But this generalization may not apply to all tasks and all algorithms.
For example, Hanneke (2012) reports that “It is now well-established that active learning can sometimes provide significant practical and theoretical advantages over passive learning, in terms of the number of labels required to obtain a given accuracy. However, our current understanding of active learning in general is still quite limited in several respects.” He further notes that “for many problems, the minimax label complexity [the number of documents that must be labeled responsive or non-responsive] of active learning will be no better than that of passive learning. In fact, Balcan, Hanneke, and Vaughan (2010) later showed that, for a certain type of active learning algorithm — namely, self-verifying algorithms, which themselves adaptively determine how many label requests they need to achieve a given accuracy — there are even particular target concepts and data distributions for which no active learning algorithm of that type can outperform passive learning.”
My purpose in raising these ideas is not to criticize active learning, but to note that its advantages may not apply universally to every task and every machine learning algorithm, let alone to every predictive coding process.
Even if there is a speed advantage for active learning in a particular situation, that difference may be small and may be outweighed by other considerations. For example, a process that required few training examples, but took weeks to compute might not be preferable to one that took a few more training examples, but required only minutes or hours to compute. Similarly, a process that required fewer labeled examples, but ended up with more non-responsive documents to review might not be preferable to one that was slower to train, but ultimately resulted in the review of fewer total documents.
The goal of a predictive coding system is to help attorneys save time, cost and effort while increasing their accuracy. In addition to the accuracy of the technology, the total cost of a predictive coding effort also depends on legal, ethical, and rhetorical factors. All of these factors should be weighed in choosing the most effective process for a particular matter.
Predictive coding is often conducted in an adversarial situation. What this means is that the science alone is not enough to determine how predictive coding should be conducted. Science is a critical part of the answer, but it is not the only part.
From a scientific perspective, the goal of predictive coding is to identify as many of the responsive documents in a collection as possible (high Recall) and to misidentify as few of the non-responsive documents as responsive as possible (high Precision). Economically, the goal for the producing party is to meet its obligations with the lowest possible total cost in the least amount of time. For the receiving party, the goal is to get the information needed at the lowest cost, also in the least amount of time.
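For readers who want the two measures spelled out, the following short Python sketch computes Recall and Precision from a set of true labels and predicted labels. The ten example labels are made up for illustration only.

```python
def recall_precision(true_labels, predicted_labels):
    """Return (recall, precision) for boolean responsive/non-responsive labels."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t and p)       # responsive, found
    false_neg = sum(1 for t, p in pairs if t and not p)  # responsive, missed
    false_pos = sum(1 for t, p in pairs if not t and p)  # non-responsive, flagged

    # Recall: share of the truly responsive documents that were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # Precision: share of the flagged documents that are truly responsive.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    return recall, precision

# Made-up example: 4 responsive documents out of 10.
truth     = [True, True, True, True,  False, False, False, False, False, False]
predicted = [True, True, True, False, True,  False, False, False, False, False]

recall, precision = recall_precision(truth, predicted)
print(f"Recall = {recall:.2f}, Precision = {precision:.2f}")
```

In this example three of the four responsive documents are found (Recall of 0.75), and three of the four documents flagged as responsive really are (Precision of 0.75).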
The producing party has a legal and ethical obligation to produce responsive documents, but there can be differences of opinion concerning just what has to be produced. The receiving party sometimes has concerns that these differences of opinion will be exploited by the producing party to avoid delivering information that would be damaging to its case unless it is compelled to do so.
The producing party sometimes has a concern that they will inadvertently produce unnecessary documents that could either violate privilege or provide ammunition for the receiving party’s next legal action. All of these considerations, and others, figure into choosing a predictive coding approach. Available predictive coding approaches differ not just on the accuracy of their results and speed of training, but they also differ with respect to how they address these other potential concerns.
When I first considered the predictive coding process that would be used by OrcaTec, I considered a large number of candidates, including some that used active learning. I determined at the time, using realistic data, that we could either make the users work harder or make the computer work harder than the model I eventually chose, but not get significantly better results.
I prefer simple, scalable solutions, and the one we selected delivers on more than a theoretical basis. To be frank, we discovered some of these advantages through using the system. I cannot claim that all of this was explicitly considered in the original selection. I’m not that smart.
OrcaTec predictive coding is related to Bayesian categorization, but is based directly on language modeling. It separates the predictive coding processing into a learning stage and a categorization stage. During the learning stage the system gets examples of responsive and non-responsive documents and forms a language model for each of these two categories. During the categorization stage, it applies these language models to each document in the collection and determines whether that document is more like the responsive or the non-responsive language models.
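The two-stage idea can be illustrated with a very small Python sketch. This is not OrcaTec's implementation. It assumes the simplest possible language model (single-word counts with add-one smoothing), ignores how common each category is, and uses four made-up training documents. All function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def learn(labeled_docs):
    """Learning stage: build one word-count model per category."""
    counts = {True: Counter(), False: Counter()}
    for text, is_responsive in labeled_docs:
        counts[is_responsive].update(tokenize(text))
    vocabulary = set(counts[True]) | set(counts[False])
    return counts, vocabulary

def log_likelihood(text, word_counts, vocabulary):
    """Log-probability of the text under a unigram model (add-one smoothing)."""
    total = sum(word_counts.values())
    denom = total + len(vocabulary) + 1  # the +1 reserves mass for unseen words
    return sum(math.log((word_counts[w] + 1) / denom) for w in tokenize(text))

def categorize(text, model):
    """Categorization stage: True if the text is more like the responsive model."""
    counts, vocabulary = model
    return (log_likelihood(text, counts[True], vocabulary)
            > log_likelihood(text, counts[False], vocabulary))

# Made-up training examples labeled by a hypothetical subject matter expert.
training = [
    ("pricing agreement with competitor", True),
    ("meeting to discuss pricing strategy", True),
    ("lunch menu for friday", False),
    ("office party on friday", False),
]
model = learn(training)

print(categorize("pricing discussion with competitor", model))
print(categorize("friday lunch party", model))
```

Because learning and categorization are separate functions, the model can be rebuilt from the labeled examples at any time and then applied to every document in the collection in a single pass.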
During the course of training, the subject matter expert sees randomly chosen documents and must decide whether each one is responsive or non-responsive. The subject matter expert needs to know how to recognize responsive documents, but does not need to know how to find them.
I chose this approach for a number of reasons. One of them was research on the psychology of memory showing that people are much better at recognizing correct answers than at recalling them. They are much poorer at saying who the actor was who played opposite Tom Hanks in the movie “Charlie Wilson’s War” than at recognizing that it was Philip Seymour Hoffman. Multiple choice tests are usually easier than essay tests.
A second reason to use random sampling was because of its statistical “guarantee” of representativeness. I had found in my previous work and from studies that I had seen that people were not necessarily accurate when guessing the right words to search for (analogous to memory recall) and that they seldom evaluated the accuracy of their guesses.
In the famous Blair and Maron study, mentioned above, the lawyers thought that they had found 75% of the responsive documents, when they had actually found only about 20%, an example of the overconfidence effect. Random sampling minimizes the dependence of the outcome on the subject matter expert’s search skills and on the subject matter expert’s tendency toward overconfidence. Achieving a representative training set is difficult with any other method, but I think that it is critically important for effective results.
A third advantage of using random sampling is that it does not require the effort of constructing a seed set. In Kleen Products, for example, one defendant said that they spent 1400 hours coming up with a set of keywords. Any time and cost savings of using predictive coding could be wiped out by the effort of constructing a conscientious keyword list to ensure representativeness.
A side effect of constructing a seed set is the opportunity for the two sides to meet and confer about exactly how it should be constructed. These negotiations can also take an unreasonable amount of time. If no seed set is required, there is no need to negotiate about how it will be constructed.
Some receiving parties see random sampling as a more direct, simpler, process that is more difficult to game than using a seed set is. The problem is lack of trust in opposing counsel to put in the effort needed to identify responsive documents. Their thinking might go like this: I believe that the producing party will be ethical in that if they see a responsive document, they will accept their responsibility and mark it as such. On the other hand, their ethical obligation does not extend to finding new responsive documents, so they may not be willing to extend their search, but would be willing to accept responsive documents that they were shown. Therefore, a passive system, where we take the producing party’s level of motivation out of the equation, would be a better choice than an active one, where they have to actively create a seed set. If the receiving party cannot control the process of constructing a seed set, then their best choice is random sampling.
Many attorneys also recognize that they do not have enough information at the start of a litigation to effectively build a seed set, even if they had the opportunity, and they may be handicapping themselves by forcing this selection without that information.
A fourth advantage of using random sampling is that each training iteration involves a completely new, completely independent set of documents. The system’s ability to predict how these new documents should be categorized, then, gives a fresh estimate of how the system as a whole would behave if training were stopped. Without random sampling it is very difficult to estimate how well the system will do. Has there been enough training? Have enough documents been presented for review? How will the system do?
Some predictive coding systems rely on a so-called control set to measure their predictive coding progress. The control set is typically randomly selected at the start of training and the progress of the predictive coding system is assessed by measuring its ability to predict the responsive documents in this control set.
Control sets are generally effective, but they raise three risks. First, if the control set is reasonably small (say a few hundred documents), then the progress assessments are based on the ability of the system to correctly classify these few documents. If 10% of the documents in the collection are responsive and the control set consists of 500 total documents, then estimates of Recall are based on the same 50 or so documents, on every iteration. Second, any machine learning system that is evaluated repeatedly against a single standard set has a risk of “overfitting.” It can learn to effectively categorize the documents in the control set, but not be able to predict other documents. The third, and bigger, risk is that the labels applied to the control set are applied at the beginning when the subject matter expert knows the least about the distinction between responsive and non-responsive documents in the particular document set.
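To see why 50 responsive documents is a thin basis for a Recall estimate, the following Python sketch works through the arithmetic. It assumes a simple normal-approximation margin of error, which is only rough at this sample size, and a hypothetical observed Recall of 80%. The control set size and prevalence come from the example above.

```python
import math

def recall_margin_of_error(n_responsive, observed_recall, z=1.96):
    """Approximate 95% margin of error for a Recall estimate.

    Uses the normal approximation to the binomial; Recall is a
    proportion measured over the responsive documents only.
    """
    std_error = math.sqrt(observed_recall * (1 - observed_recall) / n_responsive)
    return z * std_error

control_set_size = 500
prevalence = 0.10  # 10% of the collection is responsive
n_responsive = round(control_set_size * prevalence)  # about 50 documents

# Hypothetical observed Recall of 80% on the control set.
margin = recall_margin_of_error(n_responsive, observed_recall=0.80)
print(f"{n_responsive} responsive documents, margin of error about ±{margin:.1%}")
```

Under these assumptions the margin of error is roughly eleven percentage points either way, so an observed Recall of 80% is compatible with a true Recall anywhere from about 69% to about 91%.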
Whether these risks are acceptable is a matter of judgment, but predictive coding based on random sampling does not face them at all.
A new random sample on each iteration avoids all three. The number of responsive documents that can be used to estimate Recall grows throughout training. New documents are always being predicted, so the system cannot over-fit to a fixed evaluation set. And if there is some change in the criteria used by the subject matter expert, the system takes those changes into account as it learns.
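The predict-first, then-train pattern can be sketched in a few lines of Python. This is an illustrative loop under stated assumptions, not a description of any particular product: the function names, the toy word-set "model," and the made-up collection are all hypothetical, and a real system would plug in its own learning and prediction steps.

```python
import random

def train_with_fresh_samples(collection, oracle, fit, predict,
                             batch_size, n_rounds, seed=0):
    """On each round, draw new random documents, score the current model
    on them BEFORE using their labels, then add them to the training set."""
    rng = random.Random(seed)
    remaining = list(collection)
    rng.shuffle(remaining)  # slicing a shuffled list = sampling without replacement
    labeled, model, history = [], None, []
    for _ in range(n_rounds):
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        truth = [oracle(doc) for doc in batch]  # expert's labels
        if model is not None:
            guesses = [predict(model, doc) for doc in batch]
            found = sum(1 for t, g in zip(truth, guesses) if t and g)
            history.append((sum(truth), found))  # (responsive seen, responsive found)
        labeled.extend(zip(batch, truth))
        model = fit(labeled)
    return model, history

# Toy stand-ins (hypothetical): the model is the set of words seen only
# in responsive training documents.
def fit(labeled):
    resp_words, other_words = set(), set()
    for doc, is_responsive in labeled:
        (resp_words if is_responsive else other_words).update(doc.split())
    return resp_words - other_words

def predict(model, doc):
    return any(word in model for word in doc.split())

collection = ["contract renewal terms"] * 20 + ["lunch menu friday"] * 180
model, history = train_with_fresh_samples(
    collection, oracle=lambda doc: "contract" in doc,
    fit=fit, predict=predict, batch_size=40, n_rounds=4)

for round_no, (n_responsive, n_found) in enumerate(history, start=2):
    print(f"round {round_no}: {n_found} of {n_responsive} responsive documents predicted")
```

Each entry in the history is measured on documents the model had never seen, and the labels it relies on were assigned in that same round rather than at the start of the project.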
Finally, if we are to measure the accuracy of the predictive coding system, we will most likely need to use a random sampling process. We can draw a random sample from the set of all of the documents that have been considered, or we can draw samples of documents predicted to be responsive and those predicted to be non-responsive, but they will be random samples drawn from some population of documents. If a random sample is a good means of assessing the success of the predictive coding process, then it is not clear why it would not be an equally good means of training the system. The sample properties that make random sampling a good approach to assessment (e.g., representativeness), make it a good approach to training as well.
From the above, it is clear that using random samples as the basis for training predictive coding is certainly a viable option and there are some very good reasons to adopt it over other training methods. Predictive coding involves more than just finding the responsive documents. It typically involves interaction and negotiation with opposing parties. Any savings achieved through the use of technology to lessen the effort of selecting responsive documents could be eaten up by the increased cost of negotiating over the training set or other aspects of the predictive coding methods.
Keeping the process as simple as possible not only helps to improve its accuracy, it makes the process easiest to understand, easiest to explain, and easiest to implement.
*Highlighting and color changes were not added by Dr. Roitblat, but by his editor for readability and emphasis.
Balcan, M.-F., Hanneke, S., and Vaughan, J. W. (2010). The true sample complexity of active learning. Machine Learning, 80(2-3), 111-139.
Blair, D. C. and Maron, M. E. (1985). An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System. Communications of the ACM, 28(3), 289-297.
Hanneke, S. (2012). Activized Learning: Transforming Passive to Active with Improved Label Complexity. Journal of Machine Learning Research, 13, 1469-1587.
Tong, S. and Koller, D. (2001). Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2, 45-66.