9:00 a.m. After arriving at work, you receive urgent email about an emergency meeting at 10 a.m. to discuss the disappointing sales that have occurred since the company changed its ad campaign. Your task is to prepare a breakdown report on sales data explaining the effects of the campaign.
To generate this report, you'll have to run some "big picture" aggregation queries against hundreds of gigabytes of data, comparing the sales figures before the ad campaign to those afterward. But the sales data contains dozens of dimensions to explore. Because each aggregation query rolls up a large percentage of the database, you only have time to run three or four queries between now and the meeting. But which three or four?
Perhaps the ad campaign was more effective in urban areas than rural areas. You fire off a query to test this hypothesis and stare impatiently at the hourglass cursor, waiting for results. After a few idle minutes, you dash off to get a cup of coffee.
9:15 a.m. Fully caffeinated, you return to your desk and wait impatiently for the final answer. When it arrives, it shows that revenue was down by 2.635274 percent in cities with a million people or more and down by 2.671134 percent in cities of less than a million people. While this result is highly accurate, it does not help explain the effects of the ad campaign. The database system chewed up 15 minutes computing an uninteresting answer to six decimal places of accuracy. If the database software were more like a Web browser--providing a rough picture of the results while it's running--you could have given up on this hypothesis and quickly tried another one.
As this scenario highlights, database systems are too much like black boxes: You give them an input, they compute silently for a long time, and then they spit out accurate output. The running time of these queries is long because the system is computing an accurate answer. And in most systems, there is no way to get a ballpark estimate of the answer while the query is running. The overhead of these queries prevents analysts from interactively exploring their data, resulting in conservatively chosen analyses and cookie-cutter reports.
A far more attractive model for databases would be that of crystal balls that let users look into the future of a long-running query. With such a crystal ball interface, users could watch the query being processed, seeing a running estimate of the final answer and watching that estimate being refined as the query runs. If the results did not appear to be promising, the query could be modified or canceled. This sort of system would encourage analysts to explore their data freely, interactively testing and refining radical hypotheses.
New research centered at the University of California at Berkeley is focusing on crystal ball behavior for various data intensive applications, from the back office (such as decision support and OLAP), to the desktop (spreadsheets), to new applications (data mining and visualization). The research focuses on a combination of new ideas in query processing, statistics, and user interfaces. This work is occurring in partnership with major database vendors such as Informix and IBM--meaning that there is probably a crystal ball in your data analysis future.
Data analysis queries are inherently long-running operations. Currently, the best interface for these queries is the hourglass cursor--that is, the system makes users wait while it uses old-fashioned "batch" processing with an obsessive focus on accuracy. Lack of interactive performance results in an inability to perform serious data exploration. Much of the value of data collection and decision-support systems is wasted, because the overhead per query discourages users from testing hypotheses. As a result, users tend to run a set of canned reports and never really explore the data.
Clearly, users would prefer their analysis queries to run at the "speed of thought" rather than as batch jobs reminiscent of punch cards and machine operators. Prior to the research described here, there have been two standard techniques for making big queries interactive:
Precomputation. This is the trick used in many OLAP systems: They in fact do not process (P) anything online (OL); rather, they precompute answers in advance (in batch, black box mode) and let users browse the results online. One could argue that these systems would be better termed precomputed analytic processing (PAP) systems.
When applicable, precomputation is effective. Unfortunately, it is not a scalable technique--the OLAP products that use it cannot handle databases larger than a couple of dozen gigabytes. Worse, precomputation restricts the class of queries available to those that have been precomputed. Finally, these systems are typically decoupled from operational data stores, using slightly out-of-date information. Precomputation is appropriate only for limited, time-delayed analysis of moderate-sized databases.
Sampling. A few commercial database servers and OLAP tools are beginning to support queries over samples (random subsets) of the database. Sampling is a powerful technique commonly used to analyze hard-to-collect data sets; opinion polls, for example, use statistical samples of the population. Sampling is especially effective for getting a rough picture of a data set.
Traditional database sampling can be frustrating to use, however. First, deciding how much to sample to achieve a sufficiently precise answer requires some statistical sophistication. While some professional data analysts are comfortable with these issues, many users find setting parameters like "precision" and "confidence" daunting. Worse, these parameters must be set before the query is processed and cannot be calibrated on the fly. Finally, for breakdown queries with multiple answers (GROUP BY queries, for example), these techniques do not let you set parameters differently for different groups--you must compute each group to the same level of precision, even if you're more interested in some groups than in others.
The research efforts at U.C. Berkeley are tackling these issues through the Continuous Output and Navigation Technology with Refinement On-Line (CONTROL) project. The goal of CONTROL is to give users a crystal ball interface to long-running, data intensive operations. This goal of crystal ball behavior is more crisply defined as three subgoals:
1. To provide continuous, progressively refining estimates of final answers so users can get quick feedback
2. To provide nontechnical, intuitive reporting of the current precision of a particular estimate
3. To allow on-the-fly control of query processing so users can act on the feedback.
CONTROL-based systems have several advantages over traditional approaches. First, unlike precomputation-based systems, they allow ad hoc querying and do not constrain users to simple slice-and-dice analysis. Second, unlike traditional sampling systems, CONTROL-based systems let users trade off precision against time on the fly in an intuitive manner. Finally, these systems scale arbitrarily--the time required for an answer at a particular precision is independent of the data set's size. Moreover, because this response time is typically quite short, CONTROL system users consume fewer computer resources. This short response time can lower the burden on the data store, and hence potentially lower costs.
As an example of the difference between CONTROL systems and traditional systems, consider an ad hoc aggregation query of the sort used in relational OLAP (ROLAP). In keeping with data we understand well at Berkeley, let's use a data set containing enrollment records and grades for students at a major university. Our general task here is to prepare a breakdown report on grade inflation. As an initial query, we ask for the average grade (on a four-point scale) given in courses within each college at the university. (Examples of colleges include Engineering, Letters & Sciences, Education, and so on.) The SQL for this query is:
SELECT college, AVG(grade) FROM ENROLL GROUP BY college;
If we issue this query in a standard query tool over a traditional database system, we get an interface like the one from Microsoft Query shown in Figure 1 --essentially, it provides no information beyond the hourglass until the query completes. This screen is particularly frustrating: it provides no early answers, no sense of how long the query will run, and not even a button to cancel the processing.
Some tools on the market today are slightly more sophisticated and warn users that such queries will take a long time. Informix's MetaCube Explorer does so and offers a mechanism to run the query in a disconnected batch mode called QueryBack. This user interface is preferable to black box processing in the server. But clearly the main problem is not merely a user interface issue, but the fact that a traditional database server provides no information while a query is running.
In contrast, Figure 2 shows the interface to a CONTROL-based server running the query over an actual university's data. The server provides ever-refining estimates of the query answer per group; these are displayed in an animated fashion in the client tool. On the bottom, the client shows a chart of current estimates per college (the dots), and vertical bars (error bars) that indicate windows of possible final values with 95 percent probability. You can set the desired probability setting via the confidence slider on the lower left--statisticians call the combination of a probability and the bars around a point a confidence interval, which is depicted here in an easily understood graphical manner. As the query runs, the dots move up and down within the vertical bars, and the bars themselves shrink. When the bars are sufficiently small, the user can terminate the processing of this query and proceed to a new one.
In practice, it takes many minutes or even hours to complete a query like this one, while it takes only a few seconds to get an acceptable estimate. Figure 2 shows how, within seconds, it's obvious that two colleges (Education and Air/Military/Naval Science) seem to give out the highest grades--a quick insight that could lead to a subsequent query that drills down into those colleges.
An additional CONTROL feature is highlighted in the upper part of Figure 2; next to a running tabular estimate of the answer you can see a set of controls for each group. While the query is running, these controls let the user pause, slow down, or speed up each group. Pausing a group tells the server that it should devote no resources to that group; the speed buttons tell the server to devote more or less to a group relative to others. These controls let the user interactively guide the refinement of the query to favor some groups over others--which can be useful before proceeding to a new drill-down query at finer resolution.
In summary, this interface illustrates the three key features of CONTROL technology: progressively refining estimates, intuitive interfaces for precision, and on-the-fly control of processing.
User interfaces are critical aspects of a good crystal ball system, but most of the technology that drives the interface resides in a modified database server being built at Berkeley. In this section, I'll briefly describe the algorithms used to enable crystal ball behavior. The key distinction in this new technology results from a shift in performance goals: while black box database systems are optimized for quick time to query completion, CONTROL tries to maximize the precision and interactivity of estimates over time.
Data delivery: index and heap stride. The buttons shown in the upper left of Figure 2 highlight the user's ability to control data delivery from the server; for example, users may favor one group over another. This ability is provided by special access methods in the server--index stride and heap stride--that consider the user's preferences and modify the data delivery to suit.
Index stride access uses a B-tree index over the grouping columns (enroll.college in our example). It internally opens multiple cursors into the B-tree--one per college--and fetches from those different cursors at rates corresponding to client preferences.
When no index is available, heap stride provides the same function. Heap stride scans a sequential file of records but postpones delivering any record until the record's group is due for an updated estimate. As a result, time spent in network traffic and display is devoted to the records of greatest interest to the user. While heap stride is not as effective as index stride in all scenarios--in particular, it cannot quickly identify groups with few values--in most cases it serves as an effective approximation.
Ripple joins. Join algorithms also need to be modified to provide crystal ball behavior. Traditional join algorithms such as sort-merge join and hash join spend the first half of their run time setting up scratch files--they deliver no records for quite some time. This is classic black box behavior.
Traditional nested-loops join delivers records at a steadier rate but is relatively slow.
Figure 3 shows how nested-loops join works: It grabs a record of one relation (R) and scans through all of another relation (S) looking for matches. As Figure 3a illustrates, this scan sweeps out the space of pairs of R and S records in a sequence of long lines. An alternative CONTROL scheme is shown in Figure 3b. Here, R and S records are combined in a more interleaved fashion: A record of R is combined with a small set of records from S, then a record of S is combined with a small set of records from R. We call this operation a ripple join because the pairs of records are swept out like ripples in a pond.
Ripple joins have several advantages over nested-loop joins. Most significantly, they let the system control the relative rates at which records are delivered from R and S; either both can be delivered at the same rate, or the statistically more influential relation can be delivered more aggressively to shrink the confidence interval as quickly as possible. Our experiments have shown that ripple joins provide tight estimates far faster than nested-loop joins and offer very precise approximations in better than 1/100 the time needed to compute an accurate answer using the best black box algorithm.
Estimation techniques: statistics. The statistical techniques used to estimate results and produce confidence intervals are a key aspect of these interfaces. The statistics are based on a type of sampling. Basically, at any point during crystal ball processing, the system has examined some subset of the database. This subset can be viewed as a random sample, and you can apply statistical methods to use the sample as a representative of the whole data set. CONTROL research has resulted in a set of estimators and confidence intervals for CONTROL systems that work on aggregates such as COUNT, SUM, AVG, and STDE that can applied over arbitrary expressions (for example, AVG((salary+benefits)/12).
Note that these statistical techniques do not support some aggregates, particularly MIN and MAX. These aggregates are essentially looking for a needle in a haystack: the one outrageous value in the complete data set. Determining these values based on only partial knowledge of the data set is impossible. In essence, these aggregates are not asking for the "big picture;" hence, they can't be improved with CONTROL techniques. A CONTROL-based system can, however, provide a MAX-SO-FAR interface in these scenarios. Although this technique does not provide a reliable estimator of the accurate result, it does offer some running insight into the general properties of the data set.
CONTROL techniques are not only applicable in the setting of relational databases; they apply to all long-running, data intensive processing as well. In related work, researchers in the CONTROL project are attacking topics such as data visualization and the development of user interface widgets for desktop applications that operate on large data sets, such as spreadsheets.
Data visualization. Data visualization is the task of creating pictures from data. There are numerous tools and methodologies for doing so--from simple charts in spreadsheets to more complex 3D "data space" walk-through environments. Generating these visualizations for large data sets is a laborious process akin to running a large query and then a complex graphics program. Many tools also let users pan and zoom into such a picture, which involves scanning, aggregating, and rendering large data sets. While these operations should operate at point-and-click speeds, they typically degrade to batch-style performance if there is too much data.
CONTROL can enhance the data visualization process in much the same way it helps in aggregation queries. In data visualization, the "big picture" that the user desires is literally a picture. CONTROL techniques deliver the most significant data to the visualization tool first, and then the visualization tool must be able to render ever-refining approximations of the final image.
For example, Figure 4 shows a running visualization of cities in the United States being plotted on a map as they are delivered from the database, ordered alphabetically by state (from Alabama to Georgia at the time of the screenshot).
Figure 5 shows a visualization that has been running for the same amount of time but uses CONTROL techniques to deliver the data in a better-distributed order and provide rough "clouds" of shading to better approximate the query's final answer. Note that the CONTROL technique provides a quick approximation of the final image: In particular, it shows the relative density of cities in different regions. One particular advantage is that blank spaces are quickly identified as being blank, which means that boundaries are quickly drawn and shapes easily identified. Moreover, you can quickly distinguish densely populated regions from lightly populated regions; an insight that lets you identify regions deserving a closer "zoomed" look.
User interface widgets. One common criticism of database systems is that they are much harder to use than desktop applications such as spreadsheets. A common criticism of spreadsheets is that they do not gracefully handle large data sets; many spreadsheet behaviors are painfully slow on such data sets despite impressive improvements from the vendors. The problems typically have to do with the point-and-click nature of the GUI widgets involved.
Consider the common list box widget--it lets the user bring up a list and scroll through it or jump to particular points by typing prefixes of words in the list. Now imagine a list box over gigabytes of unsorted, unindexed data resulting from an ad hoc query. This list is likely to be rather unpleasant to use; similarly unpleasant behavior arises when sorting, grouping, recalculating, or cross-tabulating over large data sets-activities that are usually interactive in spreadsheets. Today's desktop applications carefully, almost imperceptibly discourage these behaviors over large unindexed data sets.
Such behavior is desirable in many scenarios, however. It should be possible to put online processing behind many of the widgets found in desktop applications so they continue to run smoothly over large data sets. For example, a list box should let users browse as many items as are currently available, while gracefully adding new items to the list as they're delivered.
Many development environments and toolkits include libraries of GUI widgets, so a CONTROL GUI widget kit is a natural thought. The CONTROL group at Berkeley is developing such a kit as an extension to Java's Swing widget set. In some cases, CONTROL GUI widgets could present different interfaces than standard widgets. In others, the widget would be similar to its original interface but add capabilities that let users manipulate the data as it is delivered.
The CONTROL project's vision is to spare users from information-free wait time. With the proliferation of inexpensive processing, wait time today arises largely in data intensive applications that require significant disk or network I/O. As the CONTROL research group begins to hand off its initial work on aggregation queries to the industry, it is continuing to explore the interaction of CONTROL with new research issues such as parallelism, distributed and heterogeneous data, mobile clients, and new data analysis metaphors such as data visualization and mining.
One attractive aspect of CONTROL is that it lowers the burden on systems because most CONTROL-based queries need not run to conclusion to provide useful information. The result is that large-scale decision-support queries become short tasks that needn't noticeably intrude on the performance of operational data stores. Thus one promising application space for CONTROL systems is that of "virtual data warehouses": CONTROL can, in principle, provide large-scale decision-support processing and perform data cleaning on the fly without imposing a large processing burden on operational data stores.
Obviously, computers will become faster and cheaper as time passes. Conversely, we will always collect more than sufficient data to ensure that data intensive processing remains a slow and laborious process. Given these trends, the importance of CONTROL seems likely to remain constant for the foreseeable future. Given the industrial interest in CONTROL technology, you can expect a crystal ball to appear on your desktop soon--and stay there for the foreseeable future.
The CONTROL group at Berkeley includes Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, and Kirk Wylie. We work closely with Peter Haas of IBM Almaden Research Center and with Mike Stonebraker, Cristi Garvey, and Chih-Po Wen of Informix. CONTROL is funded by Informix, a California MICRO grant, and the National Science Foundation.
You can find additional information about CONTROL on the Web at control.cs.berkeley.edu.
Joseph Hellerstein is an assistant professor of computer science at U.C. Berkeley. His research interests include query processing, query optimization, indexing, and distributed systems, especially as applied to object database systems and online data analysis. He is on the ACM SIGMOD advisory board and was recently named an Alfred P. Sloan Research Fellow. You can reach him at firstname.lastname@example.org, or via the Web at www.cs.berkeley.edu/~jmh.
The CONTROL project is the result of combined efforts among university researchers, commercial researchers, and commercial developers. This cooperation has been beneficial for all parties and can serve as a model for seeding radical new information technologies. The initial research vision came from Berkeley and is aided by an ongoing collaboration with IBM Research. Simultaneously, Informix provides research funding for students as well as access to its server and client software so that research ideas can be tested in the context of serious commercial-grade systems. More recently, the National Science Foundation has bestowed a grant for additional academic research. The net results of this broad collaboration are that the CONTROL project has made significant progress by leveraging existing commercial technology, and that the companies involved have benefited from the infrastructure and expertise of students and faculty at Berkeley. Furthermore, this research is likely to hit the streets in the near future.
Industry and academia often don�t see eye to eye on future development ideas, but it doesn�t have to be that way. When industrial and academic researchers work in concert with product teams, exciting technology is often the result�a result that none of those institutions would have been capable of producing on its own.
The database community has a long history of healthy research and development and industrial-academic collaboration going back to the early relational days at Berkeley and IBM. This kind of synergy is a rare national asset that requires continuing cultivation.