This is the first post in the series about tools and how you can leverage them to increase your ability to scale. In this post we're going to take a look how it is possible to store a large quantity of data in the Hypertable. The data store chosen here is for a reason, as you will see in a moment.
Consider the following scenario, that you have a lot of data exposed in some internet location that you want to have at local site and be able to sift through it. Perhaps you are creating a web crawler and you must have this data on site, not in a remote location to avoid latency factor when processing it. Let's take Wikipedia for example. You could get the dump of the whole Wikipedia and try one of the two: search through the exported wikipedia in a XML file, maybe even extract articles into separate files (that would be inefficient and would probably be a real pain for your filesystem) or store it in some more clever way. What you need in such scenario is a data storage that can handle this amount of data with great ease. It turns out a Hypertable is a good tool for doing this kind of work.
If you want to follow this example then I'm assuming that you have hypertable compiled, installed and configured on some cluster already, and if you are having trouble then please leave me a note in the comments.
We're going to consider the latest data dump of an english version of Wikipedia that can be downloaded at the following address: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This file is 4.1 gigabytes in size, and it's about 20 gigabytes after decompression so you'll need at least 25 gigabytes of free space to decompress it. After decompression you can remove the compressed file.
Downloading the Wikipedia can be done with a standard set of tools
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 $ bunzip2 enwiki-latest-pages-articles.xml.bz2 && rm enwiki-latest-pages-articles.xml.bz2
Once you are done, then grab this script and save it somewhere.
The next step is to prepare Python thrift library so that we can connect to Hypertable using Python.
This can be done by copying the gen-py and hyperthrift directories from the source code directory you've compiled Hypertable at. For example
$ mkdir ~/wiki_example $ cd ~/wiki_example $ wget http://meme.pl/convert_wiki.py $ cp -R ~/hypertable/src/py/ThriftClient/gen-py ~/wiki_example/gen-py/ $ cp -R ~/hypertable/src/py/ThriftClient/hypertable ~/wiki_example/hypertable/ $ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 $ bunzip2 enwiki-latest-pages-articles.xml.bz2 && rm enwiki-latest-pages-articles.xml.bz2
Next step is to create a wikipedia table and we do so by typing in following commands into the
hypertable command line utility
hypertable> create table wikipedia (title,article, ACCESS GROUP meta (title), ACCESS GROUP text (article));
The above command will create a table called wikipedia with just two column families:
- title – containing the title of the article
- article – containing the page content
and two access groups
These two access groups tell the hypertable to store their member column families in a separate physical file. With the above type of configuration we're going to have titles stored in a different file than full text of the articles. This will enable us to search through titles much faster as the files containing article text will not be processed at all during this process.
Once done with this setup we go back to the command shell and run the conversion script
$ cd ~/wiki_example $ env PYTHONPATH="gen-py" python convert_wiki.py
This step is a last step and it can take a few hours depending on your cluster configuration, but once everything goes well you'll have a wikipedia dump imported into your hypertable installation.
The import script uses articles' titles as record keys so they are used for querying.
So let's search for some articles which titles start with Dirichlet:
hypertable> select title from wikipedia where row =^ "Dirichlet"; [... output truncated for brevity ...] Elapsed time: 0.21 s Avg value size: 22.02 bytes Avg key size: 23.02 bytes Throughput: 13245.13 bytes/s Total cells: 63 Throughput: 294.13 cells/s
We can see there're 63 articles starting with the word Dirichlet.
Doing the same type of query for the word Time yields following results
hypertable> select title from wikipedia where row =^ "Time"; [... output truncated for brevity ...] Elapsed time: 0.04 s Avg value size: 26.74 bytes Avg key size: 27.74 bytes Throughput: 4348717.61 bytes/s Total cells: 3585 Throughput: 79815.66 cells/s
You can clearly see the throughput numbers are really high but to be honest with the readers I have to admit that this is a heavy caching in action, as doing the query for the first time is not going to yield such impressive numbers. For example running the query for articles starting with ZX yields these results:
hypertable> select title from wikipedia where row =^ "ZX"; [... output truncated for brevity ...] Elapsed time: 0.19 s Avg value size: 9.51 bytes Avg key size: 10.51 bytes Throughput: 3783.18 bytes/s Total cells: 35 Throughput: 188.89 cells/s
Now, running it for the second time gives us much higher numbers
Elapsed time: 0.01 s Avg value size: 9.51 bytes Avg key size: 10.51 bytes Throughput: 47612.58 bytes/s Total cells: 35 Throughput: 2377.23 cells/s
Let's assume, for the sake of example , that you're interested in articles starting with a letter 'A'. You can get these easily by running this query
hypertable> select title from wikipedia where ('A' <= ROW < 'B');
and the numbers you get for cached and uncached version are really high in both cases. Initial query results in these performance numbers
[... output truncated for brevity ...] Elapsed time: 14.17 s Avg value size: 18.42 bytes Avg key size: 19.42 bytes Throughput: 1005019.91 bytes/s Total cells: 376198 Throughput: 26557.52 cells/s
Subsequent queries result in this kind of performance
Elapsed time: 2.77 s Avg value size: 18.42 bytes Avg key size: 19.42 bytes Throughput: 5131230.21 bytes/s Total cells: 376197 Throughput: 135591.75 cells/s
As you can see clearly for yourself the numbers are really high. They could be even better! Trying to be objective I have to confess that the first test was much slower due to output going to the console over a wireless network which results in delays when performing output.
This post's goal was not to be a thorough performance review of hypertable so if you are interested in numbers then you should definitely take hypertable for spin yourself, but if you're interested in how you can utilize extra scalability then stay tuned for the next post in the Tools series. I'm going to explain how to make wikipedia imported into the hypertable searchable.
Update: I forgot to mention that these tests were performed on a x86_64 Ubuntu 8.10 Linux with hypertable compiled and operating in 64 bit mode on a single Intel Core 2 Quad Q6600 processor clocked at 2.4 GHz with 4 GB of physical memory and a single 500 GB SATA II disk.