Over the last couple of years, there's been a lot of talk about the advantages of column-oriented databases for data warehousing, with Michael Stonebraker from Vertica being particularly vociferous and bold in his claims that row-based databases are going to be completely replaced. As a vendor of a row-based database, I obviously have a vested interest in refuting his claims, but I'm going to try my best to be even-handed in discussing this issue.

The Claims
If you were new to database technology and read some of Stonebraker's articles, you might be forgiven for thinking that column-oriented databases were a completely new invention and were set to sweep row-oriented databases from the data warehousing market.

He claims that column-oriented databases are 10-50x faster than traditional row-oriented systems and offer significantly higher compression ratios, thereby bringing down the cost. Benchmarks against Oracle are usually put forward to back up these claims.

The Reality
The fact is that column-oriented databases have been around for some time. In the data warehousing market, long-established (but not very successful) examples include Sybase IQ and Sand.

Column orientation does have some advantages for DW workloads. For example, data compresses slightly better when stored in columns (DATAllegro compresses between 2:1 and 6:1 depending on the content of the rows, whereas column-oriented systems claim 4:1 to 10:1). Also, some queries (i.e., those that access only a few columns) will perform better.

However, in most real-world implementations, these advantages don't make a great deal of difference.

At the end of the day, column orientation is just one approach to limiting the amount of data read for a given query. In effect, it's an extreme form of vertical partitioning of the data. In modern row-oriented systems such as DATAllegro, we use sophisticated horizontal partitioning to limit the number of rows read for each query. We're also working on clever usage of materialized view technology to limit the number of columns we need to read. The end result is performance very similar to that claimed by Stonebraker, i.e., 10 to 50x that of traditional databases such as Oracle.
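To make the comparison concrete, here is a rough back-of-the-envelope sketch of how much data each approach reads. Every number in it (table size, column count, value width, partition count) is an invented assumption, and it ignores compression, indexes and caching, so treat it as an illustration of the argument rather than a model of any real product.

```python
# Back-of-the-envelope model of bytes read per query.
# All table dimensions below are made-up, illustrative assumptions.

ROWS = 1_200_000_000      # rows in a hypothetical fact table
COLUMNS = 50              # columns per row
BYTES_PER_VALUE = 8       # assume fixed-width, uncompressed values


def full_scan_bytes():
    """Row store, no partitioning: every column of every row is read."""
    return ROWS * COLUMNS * BYTES_PER_VALUE


def horizontal_pruning_bytes(partitions_read, total_partitions):
    """Row store with equal-sized horizontal partitions: only the matching
    partitions are read, but each row still comes with all its columns."""
    rows_read = ROWS * partitions_read // total_partitions
    return rows_read * COLUMNS * BYTES_PER_VALUE


def column_store_bytes(columns_needed):
    """Column store: all rows are visited, but only the needed columns."""
    return ROWS * columns_needed * BYTES_PER_VALUE


if __name__ == "__main__":
    full = full_scan_bytes()
    pruned = horizontal_pruning_bytes(partitions_read=1, total_partitions=24)
    columnar = column_store_bytes(columns_needed=2)
    print(f"full scan:          {full:>15,} bytes")
    print(f"1 of 24 partitions: {pruned:>15,} bytes ({full / pruned:.0f}x less)")
    print(f"2 of 50 columns:    {columnar:>15,} bytes ({full / columnar:.0f}x less)")
```

With these invented numbers, a query that touches 1 of 24 partitions reads 24x less than a full scan, and a query that touches 2 of 50 columns reads 25x less. Which approach wins in practice depends entirely on how selective the query is in each direction.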

My name is Stuart Frost. I founded DATAllegro in 2003 and I've been the CEO of the company from the beginning.

As CEOs go, I'm pretty technical and still get heavily involved in specifying the architecture of the product, although I haven't written any of the DATAllegro code (much to the relief of the engineering team).

I have a degree in electronic engineering and started my career as a programmer in the telecoms and defense industries back in England, writing low level code for such things as phone exchanges and sonar and radar systems. While I didn't know it at the time, I guess this mix of software and hardware was an ideal grounding for what I do now—leading an appliance vendor.

I started my first company, SELECT Software Tools, in 1988 and ran it as CEO & Founder for 10 years, through several rounds of funding and a Nasdaq IPO that brought me to the US. The VC that backed SELECT made a 26x return. After leaving that company in 1998, I took a couple of years off and missed most of the Internet boom. Great timing!

By late 2002, I was looking for my next startup idea. While at SELECT, I'd been involved in several large database design projects (SELECT was a software design tools company), so I started studying the DBMS market to see if there were any disruptive opportunities and quickly started focusing on the data warehousing sector.

The database market in general was a no-go area for VCs through the 1990s. After all, Oracle had won, hadn't they? This started to change with the introduction of a couple of strong open source databases (MySQL and Postgres) and accelerated when Netezza attacked the data warehousing market.

Netezza came to market with an interesting business model and value proposition:

1. It leveraged an open source DBMS (Postgres) to reduce engineering costs and time to market.
2. It used an appliance business model to create a tightly integrated software and hardware stack, thereby removing a significant area of complexity for DBAs and system admin staff.
3. It shifted to sequential I/O from the more typical random I/O generated by the incumbents. This allowed the use of much larger and cheaper SATA disk drives and led to a highly competitive price/performance ratio.

However, there is a significant flaw in Netezza's strategy - in achieving #3, they created a highly proprietary hardware platform and, effectively, a proprietary software platform (with little of Postgres remaining).

Netezza secured its first few customers around the time DATAllegro was being founded. Looking at the Netezza architecture, I realized that there was an opportunity to create a similar value proposition while using a completely non-proprietary platform. Hence, my vision was to create a massively parallel DW appliance with an embedded, off-the-shelf open source DBMS (Ingres) running on Linux and using completely standard servers, networking and storage from major vendors.

DATAllegro

Almost five years after starting DATAllegro, I'm very pleased to see that my vision has become a reality. We now have a highly competitive DW appliance that uses an array of Dell servers (or Bull servers in Continental Europe), Cisco networking and EMC storage.

Each server runs a highly tuned copy of the Ingres DBMS on SUSE Linux. Our proprietary software turns these separate databases into a massively parallel, shared-nothing database system that offers incredibly good performance, especially under complex mixed workloads.
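For readers new to the shared-nothing idea, the general pattern can be sketched in a few lines. This is a generic illustration, not DATAllegro's actual software: the function names, the modulo distribution rule and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

# Generic shared-nothing sketch: each "node" owns a disjoint slice of the
# rows and aggregates it independently; a coordinator merges the results.
# The modulo distribution rule is an assumption chosen for simplicity.


def distribute(rows, num_nodes):
    """Assign each (customer_id, amount) row to exactly one node."""
    nodes = [[] for _ in range(num_nodes)]
    for customer_id, amount in rows:
        nodes[customer_id % num_nodes].append((customer_id, amount))
    return nodes


def local_totals(node_rows):
    """Work done on one node, using only the rows that node owns."""
    totals = defaultdict(int)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


def parallel_totals(rows, num_nodes):
    """Coordinator step: merge per-node results. Because rows are
    distributed on the grouping key, no key appears on two nodes."""
    merged = {}
    for node_rows in distribute(rows, num_nodes):
        merged.update(local_totals(node_rows))
    return merged


if __name__ == "__main__":
    sample = [(1, 10), (2, 5), (1, 7), (3, 1), (5, 2)]
    print(parallel_totals(sample, num_nodes=4))
```

The loop over nodes runs sequentially here for clarity; in a real system each node would do its local work at the same time on its own server and storage, which is where the parallel speedup comes from.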

The appliance model is key to getting great performance. Tuning a large database using traditional approaches is extremely difficult and requires highly skilled DBAs. One of the main problems is the difficulty of understanding and tuning the interface between the DBMS software and the underlying OS and hardware platform. Database vendors such as Oracle and Microsoft have to build their software to run on any hardware. Hence there is a plethora of tuning parameters and options for the DBA and sys admins to set up. In the appliance model, we have the luxury of controlling the entire software and hardware stack from SQL to storage. As a result, we can hide all of the complexity.

Another very important aspect of performance is ensuring sequential reads under a complex workload. Traditional databases do not do a good job in this area - even though some of the management tools might tell you that they do! What we typically see is that the combination of RAID arrays and intervening storage infrastructure conspires to break even large reads by the database into very small reads against each disk. The end result is that most large DW installations have very large arrays of expensive, high-speed disks behind them - and still suffer from poor performance.
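A minimal sketch shows how one large read gets chopped up. It assumes a plain RAID-0 style layout with a fixed stripe unit; the stripe size and disk count are invented for the example, and real arrays (with parity, caches and SAN layers in between) are more complicated.

```python
# Split one logical read into the per-disk reads a simple striped
# (RAID-0 style) layout would issue. Stripe unit and disk count are
# illustrative assumptions, not figures from any particular array.


def split_striped_read(offset, length, stripe_unit, num_disks):
    """Return a list of (disk, disk_offset, size) tuples for one read."""
    requests = []
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_unit   # which stripe unit overall
        within = offset % stripe_unit          # position inside that unit
        disk = stripe_index % num_disks        # units rotate across disks
        disk_offset = (stripe_index // num_disks) * stripe_unit + within
        size = min(stripe_unit - within, end - offset)
        requests.append((disk, disk_offset, size))
        offset += size
    return requests


if __name__ == "__main__":
    KIB = 1024
    reads = split_striped_read(0, 1024 * KIB, 64 * KIB, 8)
    print(len(reads), "per-disk reads")
    for disk, disk_offset, size in reads[:3]:
        print(f"disk {disk}: offset {disk_offset}, {size // KIB} KiB")
```

Under these assumptions, a single 1 MiB read becomes sixteen 64 KiB reads, two per disk. With one query running that is harmless, but once many queries do this at the same time, each disk sees small reads from different queries interleaved, and the access pattern starts to look random.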

Through a lot of trial and error, smart engineering and code changes to the database engine, we've been able to create a platform that sustains sequential reads - even under very high levels of concurrency. This allows us to use relatively low-cost, high-capacity SATA disk drives and therefore to provide a very high price/performance ratio.

Exciting Times

It's an exciting time to be involved in the data warehousing market. It's rare to see a $30bn market go through such a rapid transition, with a few powerful incumbents under attack from several fast-moving, innovative disruptors.

In my next few blog entries, I'll be talking about the various players in the market and how I think they fit in and stack up. Don't worry, it won't be the usual self-serving PR blog - I'll be honest and straightforward about how I see the strengths and weaknesses of the various players, including DATAllegro.

DATA-Beat: A DATAllegro Blog on Data Warehouse Appliances

May 1, 2008

Columns & Rows

April 10, 2008

Who I am and why I'm here.
