Business Affiliate Programs · SEO · Personals · Advertising · Resources

A Critique of “Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps”

Comments welcome

The paper “Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps”[1] by Tan Apaydin, Guadalupe Canahuate, Hakan Ferhatosmanoglu, and Ali Saman Tosun makes a number of unsupported claims. It proposes a new indexing method for answering queries on large scientific datasets. However, the serious flaws in these claims make the paper significantly below the standard of others published in the same conference. Here we review these erroneous claims and suggest how one might avoid the same mistakes. The paper has been passed through a supposedly rigorous peer-reviewing process, which reveals some significant holes in the existing review process. A critical review of these lapses will benefit the whole community.

The “Approximate Encoding” described in the paper being reviewed was named Approximate Bitmap or AB for short. In the remaining of this review, we will refer to the paper as the AB paper. This remaining of this review is organized in three sections. We describe the major flaws in the AB paper in Section 1, discuss some notable but arguably less serious problems in Section 2, and finally we discuss the lapses that let the AB paper passes through the VLDB review process and suggest possible ways to close these holes in the review process.

1. Major Flaws

We see at least two major mistakes in the paper, one in the abstract regarding the performance advantage of the new indexing method, and the other is a misrepresentation of the current state of art. We will discuss each of these in turn.

1.1. Fantastic performance claim

The last sentence of the abstract in the AB paper claims that Approximate Bitmap scheme “improves the speed of query processing from 1 to 3 orders of magnitude compared to WAH.” This leaves no doubt that AB is at least one order of magnitude faster than the WAH compressed bitmap index. However, this is not supported at all by the evidence presented in the "Experimental Results" section. For example, the third paragraph in Section 6.3 states, “In Figure 14(b), for a query targeting 10K rows of the uniform data, AB execution time is 76.09 msec, which is two thirds of WAH execution time.” This statement clearly contradicts the claim in the abstract! From Figures 14 (b)–(d), where timing results for WAH are presented, we see that AB is more than 10 times faster than WAH in 14 (d). In other word, in two out of three cases, the difference between WAH and AB is less than one order of magnitude. This evidence can not support the unqualified claim in the abstract of the paper!

1.2. Misrepresenting current state-of-art

The fourth paragraph of the Introduction section starts with the sentence “While many other approaches, including compressed bitmaps, compute the answer in O(N) time, where N is the number of points in the grid, we want to be able to compute the answers in the optimal O(c) time, where c is the number of points in the region queried.” This is a misrepresentation of the current state-of-art because many of the commonly used indexing schemes such as variants of B+-tree and B*-tree are indeed optimal in the sense of having time complexity of O(c) [2]. The AB paper does not refer to reference [2] or any paper on B-tree indices; we could excuse this as an oversight. However, there is no such excuse regarding compressed bitmap indices, because the AB paper specifically refers to a number of articles on the WAH compressed bitmap index [3] [4] [5] and those articles state that WAH compressed bitmap index can achieve the same optimal time complexity O(c) for query answering. In fact, reference [3] makes a much stronger claim about the expected time complexity of WAH compressed indices on the same type of spatial data that the above quoted statements refers to (for the context of the quoted statement, refer to the third paragraph in the Introduction section). Clearly, the quoted statement is wrong.

In Figures 14 (b)–(d) where WAH compressed bitmap index is presented, the time values for WAH stayed constant. The three references [3] [4] [5] cited in the AB paper demonstrated with both theoretical analyses and timing measurements that the elapsed time used by a WAH compressed index is bounded from above by a linear function of c. On uniform random data used in Figure 14 (b), the elapsed time used by a WAH compressed bitmap index is a linear function of the number of hits, not a constant as shown in Figure 14 (b).

2. Other Notable Problems

The next few errors are not as apparent, but as they relate to the core contributions of the paper, we will briefly discuss them as well.

2.1. How likely will a user specify a list of rows in a query?

One core motivation of designing the Approximate Bitmap is to speed up the processing of queries where the user have specified an arbitrary set of rows in their query. During query processing we may end up with this type of situation, but it is unlikely that a user will specify a list of rows in their query. In relational algebra, the rows are not assigned a number, therefore there those rows must of select through some keys, likely in the form of range conditions. Even if a DBMS system assigns a row number to each row, it is often hard for a user to make direct use of the row numbers.

2.2. Is it fair to call AB a compressed bitmap?

The Approximate

Bitmap scheme uses a set of Bloom filters to represent bit matrices. One can argue that this is a form of lossy data compression. However, it would be much more appropriate to just call them Bloom filters as they are. This would make it abundantly clear that the analysis Bloom filter applies to the proposed scheme. After one realizes that AB is simply a set of Bloom filters, the analyses presented in Section 4 are obvious.

2.3. What is a bitmap index?

In the first paragraph of the

introduction, it states, “The idea of a bitmap is basically to partition the data into bins.” As the authors demonstrated later, the key idea of a bitmap index is to represent the base data as a set of bitmaps. In this particular case, they happen to use a set of Bloom filters as a lossy data compression method to compress the bitmaps. The core of a bitmap index is not binning.

2.4 Have they compared against the obvious choice?

For the type of problems given in the AB paper, the simplest possible indexing method for performance comparison should be the projection index [6]. Since the AB paper refers to reference [6], it is natural to assume that the authors are aware of the method. Because the projection index is much simpler, it would be much easier to get the details right and produce a meaningful comparison. It is commendable that they attempted to implement a more sophisticated method, namely the WAH compressed bitmap index. However, this also raises an issue whether they have done an adequate job to implement the more complicated method. Based on reference [5], which was referred in the AB paper, the WAH compressed bitmap index can answer one-dimensional range queries on 2.2 million records of a HEP data set in an average of 52 milliseconds (cf. Table 1 in reference [5]). It is hard to believe that the same WAH compressed bitmap index should take 1400 milliseconds to answer the queries presented in the AB paper.

2.5 Why report CPU time instead of elapsed time?

In database research, because the bottleneck is usually the disk I/O, it is much more common to report elapsed time instead of CPU time. Even if we accept that there are rationale for reporting CPU time, shouldn't the authors tell us what was the clock speed? In reference [5], a 400 MHz UltraSPARC II processor was used for timing measurements, is it possible that the authors of AB paper used a 20 MHz Intel 386 processor. That would explain the fact that their WAH implement is almost 30 times slower. However, that seems hardly plausible since 386s are common in late 1980s, whiile the AB paper was published in 2006.

3. Comments and Suggestions

Academic publication is based on an honour system of self-discipline and peer-review. In this particular case, we feel that authors have not put in enough effort to examine their own work and the peer-review process has failed to catch the serious mistakes. Next, we postulate a plausible explanation for how the system might have failed and give three recommendations to address the problem.

Since the AB paper comes out two universities, very likely it was primarily written by an enthusiastic graduate student who is not very familiar with the database research field. It would be easy to attribute the mistake on the current state-of-art to this inexperience. However, this should have been caught by the internal review among the authors or by the conference program committee who are responsible for ensuring the quality of the papers selected.

Though we can see an excuse for misrepresentation of the current state-of-art, the error “Fantastic performance claim” should have been much easier to catch. Since the paper itself contains evidence that contradicts the claim, the inconsistency could be easily noticed.

There is a common belief that the conference program committees never read more than a few pages of a conference paper when they make a decision on whether to accept or reject a paper. Since the statement contradicting the claim in the abstract is near the end of the tenth page, it would be easy to for such a referee to miss the contradiction. Being on the program committee for a conference such as VLDB is generally regarded as a privilege, at least one of the committee members should have read the whole paper. Our first recommendation is for the program committee to assign one member to be responsible for a paper. To encourage committee members to take this responsibility seriously, the accepted paper should be printed with the name of the responsible person. Many journal articles are printed with a footnote to the effect of “Accepted on the recommendation of So-and-So.” Similar footnote could be printed on the final copy of the conference proceedings.

Our second recommendation is for authors to read their own papers. This seems to be the minimal requirement for being an author of a paper. In this particular case, the more senior authors did not pay enough attention to catch an obvious inconsistency.

Our final recommendation is for authors that propose new methods to include the most straightforward method in their comparisons. Of course, one should always compare against the closely related methods and the currently best (or best-known) methods. In the case of answering multi-dimensional range queries as in the AB paper, the most straightforward method should be the projection index [6]. Since the project index is very easy to understand and implement, it is much easier to conduct performance measurements and discuss the relative performances of a new method.

John Wu © John Wu