by Herbert L. Roitblat, Ph.D.
This is my second blog post about Cormack and Grossman’s latest article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” In their article they compared several versions of TAR (or, as I prefer, CAR — Computer Assisted Review), one of which was based on using simple random sampling as the training set. In my previous blog, I argued that they are CAR vendors, they have a vested interest in making their algorithm appear better than others, and that the system that they tested using random sampling bears almost no resemblance to any system using it to actually do predictive coding.
In this post, I want to focus more on the science in the Cormack and Grossman article. It seems that several flaws in their methodology render their conclusions not just externally invalid — they don’t apply to systems that they did not study, but internally invalid as well — they don’t apply to the systems they did study.
Their paper has been accepted for presentation at the annual conference of SIGIR (The Special Interest Group on Information Retrieval) of the ACM (The Association for Computing Machinery). I am also a member of this organization.
Poor performance on a caricature of a random sampling system or one using simple active learning does not indict all systems that use random sampling or simple active learning. As a demonstration that this inference is not warranted, I showed results from some OrcaTec projects that used random sampling. Rather than returning poor results, these projects returned results that were better than Cormack and Grossman’s favored algorithm. My point was not to show that their CAL algorithm is not good, but rather, in contrast to their claims, to show that random sampling can lead to high levels of predictive accuracy.
Some of the differences between the system that they tested and the OrcaTec system include:
What they describe as the “TAR task” is different from the task that most predictive coding systems seem to use, including OrcaTec’s. Cormack and Grossman’s task is to continuously optimize the documents presented for review, whereas it is more common to divide the predictive coding process into two parts, one where the system is trained, and a second where the system’s predictions are reviewed. The continuous task is well suited to their CAL model, but not to the other two systems that they tested.
Cormack and Grossman’s SPL system uses a support vector machine as its underlying machine learning model. The support vector machine is common among several predictive coding systems, but it is not used in the OrcaTec system. The different underlying models in the two systems means that it is not valid to extend the results obtained with one to the other. They have different learning properties.
The Cormack and Grossman system used a peculiar way of representing the documents in the collection, whereas the OrcaTec system uses words. I discuss this issue in more detail below, but this difference means that the results of their SPL method cannot be validly applied to other forms of random-sampling training.
These differences limit the validity of any comparisons between the results that Cormack and Grossman obtained and the results of competing predictive coding providers, including OrcaTec.
All other things being equal, one might be willing to say that in their specific tasks, with their specific collections, and their specific models, that their Continuous Active Learning (CAL) system performed better than a Simple Active Learning (SAL) or Simple Passive Learning (SPL), but even this comparison turns out to be invalid.
As mentioned, the Cormack and Grossman task is well suited to CAL, but because neither of the other two systems are designed to present the most relevant documents for review at each of the training stages, their task does not apply to them. Presenting random documents for review (SPL) or the most uncertain documents for review (SAL) is simply not designed to produce highly responsive documents during the initial stages of training. These systems are designed for a two stage process of learning followed by prediction. Whatever the technical merits of a system, pitting one designed to produce responsive documents at the beginning of the task against two systems that are designed to get information about documents at the beginning of the task is not a valid comparison.
Their choice of task strongly biases their results in favor of the CAL task, meaning that their internal comparisons, within their study, are invalid. It is metaphorically like comparing a dolphin and a horse and saying, oh look, that dolphin does not run so fast in the desert.
The bigger issue is with how they constructed the set of documents that they used to measure the performance of the system. Their so-called “gold standard,” unless I am seriously mistaken, guarantees that their CAL system and not any of the others will be successful.
The measurement of Recall (the completeness of the process identifying responsive documents) and Precision (the selectivity of the process identifying responsive documents), requires that we know the true status (responsive or non-responsive) of the documents being assessed by the TAR/CAR system. How that standard set is constructed is critically important. As they note: “To evaluate the result, we require a ‘gold standard’ indicating the true responsiveness of all, or a statistical sample, of the documents in the collection.”
Here I want to focus on how the true set, the so-called “gold standard” was derived for the four matters they present. They say that for the “true” responsiveness values “for the legal-matter-derived tasks, we used the coding rendered by the first-pass reviewer in the course of the review. Documents that were never seen by the first-pass reviewer (because they were never identified as potentially responsive) were deemed to be coded as non-responsive.”
If I am reading this correctly, it means that the CAL algorithm and (perhaps) the keyword search were used to identify the set of responsive documents. If true, this approach is a very serious problem. They used the tool that they want to measure to derive the standard against which they were planning to measure it. They used the prediction of the CAR model as the standard against which to measure the prediction of the CAR model. The reviewers could decrease the number of documents that were deemed responsive, but they could not apparently increase the number of documents that were deemed responsive.
It is not at all surprising that the documents predicted to be responsive during the standard setting were the same documents identified by the same system during the assessment. They were both predicted by the same system.
The double use of the same measurement means that only the documents that were identified as responsive by the CAL algorithm or by the keyword search used to create the seed set could be considered responsive. If either of the other two algorithms identified any other documents as responsive, they would have matched an unseen document, and so by Cormack and Grossman’s definition, they would have been counted as an error. Again, if true, it means that only the CAL algorithm could consistently produce documents that would be called responsive. The other algorithms could produce correctly classified responsive document only if they happened to agree with the CAL algorithm or the keyword search. Only those documents would have been reviewed and so only those could be candidates for being responsive.
Cormack and Grossman report the relative contributions of the keyword search and the CAL algorithms. Recall and Precision for the keyword searches used to generate the seed sets are shown in their Table 3. For two of the matters (B and D), the keyword searches yielded high Recall (presumably when judged against the merged standard of keyword and CAL), but low to moderate Precision. In fact, in these two matters, the keyword search resulted in more responsive documents than the CAL algorithm did. For the other two matters (A and C), the CAL algorithm identified more responsive documents than the keyword search did.
Their keyword seed set sometimes showed better Recall than their CAL results.
|Table 3: Keyword seed queries|
|Recall||Precision||Max CAL Recall||Recall 90||Precision|
Two matters out of four found poorer Recall after using the CAL method than the keyword search used to generate the seed set achieved.
Keyword Recall and Precision are from their Table 3.
Max CAL Recall is the maximum that CAL achieved in the study.
Recall 90 is their first Recall above 0.90, or max Recall if it never reached 0.90
Precision at Recall 90 is the last column
The CAL algorithm selects for review the top 1,000 documents that it predicts to be responsive. So this algorithm does not present any documents for review that are not predicted to be responsive. If a document is responsive, but does not resemble any of the documents already predicted to be responsive, then it will not be found, except, perhaps at the very end of training when the number of remaining similar documents is exhausted. Documents that are similar to the seed set are considered, documents that are not similar to the seed set are unlikely to ever be considered, let alone predicted to be responsive. This limitation is hidden by the apparent fact that only similar documents were contained in the validation set (the “gold standard”). The system can appear to be performing well, even when it is not. An invalid standard set means that any measures of accuracy compared to this standard are also, necessarily, invalid.
Cormack and Grossman do not provide, therefore, even a valid measure of what their CAL system has accomplished. The CAL system may be quite good, but we cannot tell that from this study without validation on an independent set of documents that were not generated by the CAL system itself.
In text classification, the more disjoint the relevant topics are (where disjoint topics share few target words or other features), the more difficult it will be for a method like Cormack and Grossman’s to find them all. Their method depends on the starting condition (the seed set) because it ranks documents by their resemblance to the seed set. It is very difficult to add to the seed set in their situation, because documents that are not similar to the seed set are not encountered until the similar ones have been exhausted. And in their study, those documents would automatically have been counted as non-responsive.
Random sampling is known to be slower than certain other training algorithms, under some conditions, but more resistant to noise. As mentioned, it is also more resistant to training bias because the items selected for review are not determined by the success or failure of previous predictions. Whether it will, in fact, be poorer depends on the complexity of the space. Even if it is somewhat slower, though, it still has the advantage of being open to identifying disjoint responsive documents.
Another possible limitation of Cormack and Grossman’s study is the document representation that they used. Their CAL algorithm could be run with any kind of document representation, but they chose one that may be particularly difficult for other approaches, such as random sampling.
Cormack and Grossmans method destroys any semantics (meaning) in the documents. By breaking the text into 4-character shingles (including spaces and punctuation, I think), it destroys the morphemes. Morphemes are the basic units of meaning in a language. Words consist of one or more morphemes. Cormack and Grossman’s method treats documents as sequences of characters, not words.
They do two other things that make training more difficult. The number of distinct words in a document collection is typically on the order of around 100,000 for small document sets, growing by roughly one new word per document beyond this. About half of those words occur exactly once in the corpus. The number of four-character shingles is much larger. Most, if not all machine learning algorithms are sensitive to the number of distinct “dimensions” on which they are trained. A dimension is simply a variable on which the items may differ, for example the presence or absence of a word, o the presence/absence of a letter shingle.
Typically information retrieval systems work to reduce the number of distinct features to classify, but Cormack and Grossman appear to have chosen a representation that increases them. Instead of limiting the dimensions to a few hundred thousand words, Cormack and Grossman use shingles. They note that there are 4,294,967,296 potential shingle representations , which they reduce through hashing to 1,000,081 distinct codes. Each code (each of the 1 million final values) represents about four thousand different four-character shingles (4 billion potential shingles to 1 million codes).
If I understand correctly, they encode each shingle originally as a 32-bit binary string (whether a character, including some punctuation, is present or not in the shingle), throwing away repeat characters in the shingle. They then encode the shingles into a million-bit binary string, throwing away the number of times each shingle occurred in a document. Thats a lot of information to throw away. This representation may work well for the CAL algorithm, but it is not optimal for other machine learning algorithms.
In conclusion, Cormack and Grossman claim that
None of these claims is supported by the evidence that they present. Their use of an inadequate and biased evaluation standard, a so-called “gold standard,” makes it very difficult to draw any real conclusions from this study. Other design decisions in the study also strongly bias the results to favor their CAL system.
There is no reason to think that their results imply that random sampling per se is the issue. The best that they could reasonably claim is that it did not work well in the situation that they arbitrarily constructed, with the algorithm that they chose for it, and measured against a task that was also biased in favor of their CAL algorithm. The fact that random sampling does work well in other situations belies their conclusion. Random sampling can, in fact be very effective at identifying responsive documents and it can be verified by an independent standard of judgment.
Cormack and Grossman raise a straw-man argument that others claim that only random sampling is valid for training predictive coding. I don’t know who actually makes this claim. I believe that random sampling is sufficient to produce accurate results, but I do not argue that random sampling is necessary.
At OrcaTec, we frequently use combinations of pre-categorized documents and random sampling, but random sampling alone is efficient and sufficient for most of our matters. Further, there are other reasons to want to use random sampling for training predictive coding.
As Cormack and Grossman note, random sampling can give more representative training examples, their trumped up example of a system learning that spreadsheets are not responsive aside. They are correct in noting that random sampling is more likely to capture common categories of documents than rare ones, but unless there is some method for identifying rare documents, no statistical learning system is going to be very good at identifying them. If the users know of examples of rare responsive documents, then these can be included in a training set, followed by random sampling. As Cormack and Grossman note, however, there is no scientifically useful protocol available for identifying these rare instances.
Random sampling gives direct, unbiased, and transparent feedback as to how well the system is predicting responsive documents
I am truly puzzled that Cormack and Grossman argue that their CAL system is not affected by starting bias. They argue that a new classifier is generated after every 1,000 documents. This classifier derives its positive examples from the documents that have previously been identified as responsive and then presents more like those for further review. I see no opportunity for a rare unanticipated document type to be selected.
Random sampling, on the other hand, presents for review a representative sample of documents without regard for the system’s predictions about which are responsive. New and rare topics, thus, have the opportunity to be included in the training set, but they are not guaranteed to be included.
Finally, to paraphrase Mark Twain, any reports concerning the death of random sampling for training predictive coding are grossly exaggerated. Far from being a questionable practice, if your goal is to minimize the number of unnecessary non-responsive documents to review while maximizing the accuracy of your selection, then random sampling can provide a powerful, defensible, means to achieve that goal.