### Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek | January 21st, 2008 | Filed Under: Uncategorized

## 1.

Some remedial computer science, for the PCI auditors in my audience.

I hand you an array of random integers. How can you tell if the number three is in it?

Well, there’s the obvious way: check the numbers sequentially until you find the “3” or exhaust the array. Linear search. Given 10 numbers, you have to assume it could take 10 steps; N numbers, N steps.

Linear search is bad. It’s hard to do worse than linear. Let’s improve on it. Sort the array.

A sorted array suggests a different strategy: jump the middle of the array and see if the value you’re looking for is less than (to the left) or greater than (to the right). Repeat, cutting the array in half each time, until you find the value.

Binary search. Given 10 numbers, it will take as many as 3 steps —- log2 of 10 —- to find one of them in a sorted array. O(log n) search is awesome. If you have 65,000 elements, it’s only going to take 16 steps to find one of them. Double the elements, and it’s 17 steps.

But sorted arrays suck; for one thing, sorting is more expensive than linear search. So we don’t use binary search much; instead, we use binary trees.

To search a binary tree, you start at the top, and ask yourself “is my key less than (left) or greater than (right) the current node”, and repeat until ok, ok, ok, you know this stuff already. But that tree is pretty, isn’t it?

Search with a (balanced) binary tree is O(log n), like binary search, varying with the number of elements in the tree. Binary trees are awesome: you get quick lookup and sorted traversal, something you don’t get out of a hash table. Binary trees are a better default table implementation than hash tables.

## 2.

But binary trees aren’t the only tree-structured lookup mechanism. Binary radix tries, also called a PATRICIA trees, work like binary trees with one fundamental difference. Instead of comparing greater-than/less-than at each node, you check to see if a bit is set, branching right if it’s set and left if it isn’t.

I’m leaving out a lot about how binary radix tries work. This is a shame, because radix tries are notoriously underdocumented —- Sedgewick infamously screwed them up in “Algorithms”, and the Wikipedia page for them sucks. People still argue about what to call them! In lieu of an explanation of backlinks and bit-position-labelled edges, here’s a tiny Ruby implementation.

Here’s why radix tries are cool:

Search performance varies with the

*key size*, not the number of elements in the tree. With 16 bit keys, you’re guaranteed 16 steps regardless of the number of the elements in the tree, without balancing.More importantly, radix tries give you

*lexicographic matching*, which is a puffed-up way of saying “search with trailing wildcard”, or “command-line-completion-style search”. In a radix tree, you can quickly search for “ro*” and get “rome” and “romulous” and “roswell”.

## 3.

I’ve lost you.

Let’s put this in context. Tries are a crucial data structure for Internet routing. The routing problem goes like this:

You have a routing table with entries for “10.0.1.20/32 -> a” and “10.0.0.0/16 -> b”.

You need packets for 10.0.1.20 to go to “a”

You need packets for 10.0.1.21 to to to “b”

This is a hard problem to solve with a basic binary tree, but with a
radix trie, you’re just asking for
“1010.0000.0000.0000.0000.0001.0100” (for 10.0.1.20) and “1010.*” (for 10.0.0.0).
Lexicographic search gives you “best match” for routing. You can try
it in the Ruby code above; add *”10.0.0.0”.to_ip* to the trie, and
search for *“10.0.0.1”.to_ip*.

The correspondance between routing and radix tries is so strong that the most popular general-purpose radix trie library (the one from CPAN) is actually stolen out of GateD. It’s a mess, by the way, and don’t use it.

If you understand how a trie works, you also understand how regular expressions work. Tries are a special case of deterministic finite automata (DFAs), where branches are based exclusively on bit comparisons and always branch forwards. A good regex engine is just handling DFAs with more “features”. If my pictures make sense to you, the pictures in this excellent article on Thompson’s NFA-DFA reduction algorithm will too, and that article will make you smarter.

## 4.

You’re a network operator at a backbone ISP. Your world largely consists of “prefixes” —- IP network/netmask pairs. The netmasks in those prefixes are hugely important to you. For instance, 121/8 belongs to Korea; 121.128/10 belongs to Korea Telecom, 121.128.10/24 belongs to a KT customer, and 121.128.10.53 is one computer inside that customer. If you’re tracking down a botnet or a spamming operation or worm propagation, that netmask number is pretty important to you.

Unfortunately, important though they are, nowhere on an IP packet is there stamped a “netmask” —- netmasks are entirely a configuration detail. So, when you’re watching traffic, you essentially have this data to work with:

Surprisingly, given enough packets to look at, this is enough information to guess netmasks with. While working at Sony, Kenjiro Cho came up with a really elegant way to do that, based on tries. Here’s how:

Take a basic binary radix trie, just like the ones used by software routers. But bound the number of nodes in the tree, say to 10,000. On a backbone link, recording addresses out of IP headers, you’ll exhaust 10,000 nodes in moments.

Store the list of nodes in a list, sorted in LRU order. In other words, when you match an IP address with a node, “touch” the node, sticking it at the top of the list. Gradually, frequently seen addresses bubble up to the top, and infrequently seen nodes sink to the bottom.

Now the trick. When you run out of nodes and need a new one, reclaim from the bottom of the list. But when you do, roll the data from the node up into its parent, like so:

10.0.1.2 and 10.0.1.3 are sibling /32s, the two halves of 10.0.1.2/31. To reclaim them, merge them into 10.0.1.2/31. If you need to reclaim 10.0.1.2/31, you can merge it with 10.0.1.0/31 to form 10.0.1.0/30.

Do this over, say, a minute, and the standout sources will defend their position in the tree by staying at the top of the LRU list, while ambient /32 noise bubbles up to /0. For the raw list of IP’s above, with a 100 node tree, you get this.

Cho calls this heuristic Aguri.

## 5.

Aguri is BSD licensed. You can go download it, and a driver program that watches packets via pcap, from Cho’s old home page.

## 6.

I’m going somewhere with this, but I’m 1300 words into this post now, and if you’re an algorithms person you’re tired of me by now, and if you’re not, you’re tired of me by now. So, let Aguri sink in, and I’ll give you something cool and useless to do with it later this week.

## Add New Comment

## Viewing 58 Comments

Thanks. Your comment is awaiting approval by a moderator.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

Do you already have an account? Log in and claim this comment.

## Add New Comment

## Trackbacks