Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek | January 21st, 2008 | Filed Under: Uncategorized

1.

Some remedial computer science, for the PCI auditors in my audience.

I hand you an array of random integers. How can you tell if the number three is in it?

Well, there’s the obvious way: check the numbers sequentially until you find the “3” or exhaust the array. Linear search. Given 10 numbers, you have to assume it could take 10 steps; N numbers, N steps.
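
As a throwaway Ruby sketch:

    # Linear search: check each element in turn until we hit the target or run out.
    def linear_search(arr, target)
      arr.each_with_index do |x, i|
        return i if x == target   # found it after i + 1 looks
      end
      nil                         # exhausted the array: n looks
    end

    linear_search([9, 4, 3, 7, 1], 3)   # => 2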

[Figure: scanning an unsorted array for the target, one element at a time]

Linear search is bad. It’s hard to do worse than linear. Let’s improve on it. Sort the array.

[Figure: the same array, sorted]

A sorted array suggests a different strategy: jump to the middle of the array and see if the value you’re looking for is less than the middle element (go left) or greater than it (go right). Repeat, cutting the array in half each time, until you find the value.

Binary search. Given 10 numbers, it will take about 4 steps --- log2 of 10, rounded up --- to find one of them in a sorted array. O(log n) search is awesome. If you have 65,000 elements, it’s only going to take 16 steps to find one of them. Double the elements, and it’s 17 steps.
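
A minimal sketch of the same idea, halving the candidate range on every comparison:

    # Binary search over a sorted array: halve the candidate range each step.
    def binary_search(sorted, target)
      lo, hi = 0, sorted.length - 1
      while lo <= hi
        mid = (lo + hi) / 2
        case sorted[mid] <=> target
        when 0  then return mid     # found it
        when -1 then lo = mid + 1   # target is in the right half
        else         hi = mid - 1   # target is in the left half
        end
      end
      nil
    end

    binary_search([1, 3, 4, 7, 9], 7)   # => 3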

But sorted arrays suck; for one thing, sorting is more expensive than linear search. So we don’t use binary search much; instead, we use binary trees.

[Figure: a balanced binary search tree]

To search a binary tree, you start at the top, and ask yourself “is my key less than (left) or greater than (right) the current node”, and repeat until ok, ok, ok, you know this stuff already. But that tree is pretty, isn’t it?

Search with a (balanced) binary tree is O(log n), like binary search, varying with the number of elements in the tree. Binary trees are awesome: you get quick lookup and sorted traversal, something you don’t get out of a hash table. Binary trees are a better default table implementation than hash tables.
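
A minimal, unbalanced sketch (a pathological insert order degrades it back to a linked list, but it shows the shape):

    # A bare-bones binary search tree: less-than goes left, greater-than goes right.
    class BST
      Node = Struct.new(:key, :left, :right)

      def initialize
        @root = nil
      end

      def insert(key)
        @root = insert_at(@root, key)
      end

      def include?(key)
        node = @root
        while node
          return true if key == node.key
          node = key < node.key ? node.left : node.right
        end
        false
      end

      private

      def insert_at(node, key)
        return Node.new(key) if node.nil?
        if key < node.key
          node.left = insert_at(node.left, key)
        elsif key > node.key
          node.right = insert_at(node.right, key)
        end
        node
      end
    end

    t = BST.new
    [5, 2, 8, 1, 3].each { |k| t.insert(k) }
    t.include?(3)   # => true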

2.

But binary trees aren’t the only tree-structured lookup mechanism. Binary radix tries, also called PATRICIA trees, work like binary trees with one fundamental difference. Instead of comparing greater-than/less-than at each node, you check to see if a bit is set, branching right if it’s set and left if it isn’t.

[Figure: a binary radix trie, branching on individual key bits]

I’m leaving out a lot about how binary radix tries work. This is a shame, because radix tries are notoriously underdocumented --- Sedgewick infamously screwed them up in “Algorithms”, and the Wikipedia page for them sucks. People still argue about what to call them! In lieu of an explanation of backlinks and bit-position-labelled edges, here’s a tiny Ruby implementation.
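
As a stand-in, here is a minimal sketch of the idea --- an uncompressed bitwise trie over fixed-width integer keys, so not a true PATRICIA tree with backlinks and bit-position-labelled edges, but the branching works the same way:

    # A toy *uncompressed* bitwise trie over fixed-width integer keys. A real
    # PATRICIA tree collapses one-child chains by labelling nodes with bit
    # positions; this sketch branches on every bit.
    class BitTrie
      Node = Struct.new(:children, :terminal)   # children[0]: bit clear, children[1]: bit set

      def initialize(bits = 32)
        @bits = bits
        @root = Node.new([nil, nil], false)
      end

      def insert(key)
        node = @root
        (@bits - 1).downto(0) do |i|
          b = (key >> i) & 1
          node.children[b] ||= Node.new([nil, nil], false)
          node = node.children[b]
        end
        node.terminal = true
      end

      # Lexicographic ("trailing wildcard") match: is any stored key an
      # extension of the top `len` bits of `prefix`?
      def prefix?(prefix, len)
        node = @root
        (@bits - 1).downto(@bits - len) do |i|
          node = node.children[(prefix >> i) & 1]
          return false if node.nil?
        end
        true
      end
    end

    t = BitTrie.new(8)
    t.insert(0b1010_0001)
    t.insert(0b1010_0111)
    t.prefix?(0b1010_0000, 4)   # => true: both keys start with "1010"
    t.prefix?(0b1100_0000, 4)   # => false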

Here’s why radix tries are cool:

  • Search performance varies with the key size, not the number of elements in the tree. With 16-bit keys, you’re guaranteed 16 steps regardless of the number of elements in the tree, without balancing.

  • More importantly, radix tries give you lexicographic matching, which is a puffed-up way of saying “search with trailing wildcard”, or “command-line-completion-style search”. In a radix trie, you can quickly search for “ro*” and get “rome” and “romulus” and “roswell”.

3.

I’ve lost you.

Let’s put this in context. Tries are a crucial data structure for Internet routing. The routing problem goes like this:

  • You have a routing table with entries for “10.0.1.20/32 -> a” and “10.0.0.0/16 -> b”.

  • You need packets for 10.0.1.20 to go to “a”

  • You need packets for 10.0.1.21 to go to “b”

This is a hard problem to solve with a basic binary tree, but with a radix trie, you’re just asking for “0000.1010.0000.0000.0000.0001.0001.0100” (for 10.0.1.20) and “0000.1010.0000.0000” (for 10.0.0.0/16). Lexicographic search gives you “best match” for routing. You can try it in the Ruby code above; add “10.0.0.0”.to_ip to the trie, and search for “10.0.0.1”.to_ip.
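
A rough sketch of that longest-prefix match, with a plain ip() helper standing in for the String#to_ip extension used above:

    # Longest-prefix match with a bitwise trie: the deepest node carrying a
    # next hop along the lookup path wins. A sketch, not production code.
    class RouteTrie
      Node = Struct.new(:children, :nexthop)

      def initialize
        @root = Node.new([nil, nil], nil)
      end

      def add(prefix, len, nexthop)
        node = @root
        31.downto(32 - len) do |i|
          b = (prefix >> i) & 1
          node.children[b] ||= Node.new([nil, nil], nil)
          node = node.children[b]
        end
        node.nexthop = nexthop
      end

      def lookup(addr)
        node, best = @root, @root.nexthop
        31.downto(0) do |i|
          node = node.children[(addr >> i) & 1]
          break if node.nil?
          best = node.nexthop if node.nexthop
        end
        best
      end
    end

    # stand-in for "10.0.0.0".to_ip: dotted quad to a 32-bit integer
    def ip(s)
      s.split(".").inject(0) { |a, o| (a << 8) | o.to_i }
    end

    r = RouteTrie.new
    r.add(ip("10.0.1.20"), 32, "a")
    r.add(ip("10.0.0.0"),  16, "b")
    r.lookup(ip("10.0.1.20"))   # => "a"
    r.lookup(ip("10.0.1.21"))   # => "b"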

The correspondence between routing and radix tries is so strong that the most popular general-purpose radix trie library (the one from CPAN) is actually stolen out of GateD. It’s a mess, by the way, and don’t use it.

If you understand how a trie works, you also understand how regular expressions work. Tries are a special case of deterministic finite automata (DFAs), where branches are based exclusively on bit comparisons and always branch forwards. A good regex engine is just handling DFAs with more “features”. If my pictures make sense to you, the pictures in this excellent article on Thompson’s NFA-DFA reduction algorithm will too, and that article will make you smarter.

4.

You’re a network operator at a backbone ISP. Your world largely consists of “prefixes” --- IP network/netmask pairs. The netmasks in those prefixes are hugely important to you. For instance, 121/8 belongs to Korea; 121.128/10 belongs to Korea Telecom, 121.128.10/24 belongs to a KT customer, and 121.128.10.53 is one computer inside that customer. If you’re tracking down a botnet or a spamming operation or worm propagation, that netmask number is pretty important to you.
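
Prefix containment is just a masked comparison; a quick sketch with the numbers from that example:

    # Is an address inside a prefix? Mask both to the prefix length and compare.
    def ip(s)
      s.split(".").inject(0) { |a, o| (a << 8) | o.to_i }
    end

    def in_prefix?(addr, prefix, len)
      mask = len.zero? ? 0 : (~0 << (32 - len)) & 0xFFFFFFFF
      (addr & mask) == (prefix & mask)
    end

    in_prefix?(ip("121.128.10.53"), ip("121.128.10.0"), 24)   # => true
    in_prefix?(ip("121.128.10.53"), ip("121.128.0.0"),  10)   # => true
    in_prefix?(ip("121.128.10.53"), ip("121.0.0.0"),     8)   # => true
    in_prefix?(ip("121.128.10.53"), ip("10.0.0.0"),      8)   # => false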

Unfortunately, important though they are, nowhere on an IP packet is there stamped a “netmask” --- netmasks are entirely a configuration detail. So, when you’re watching traffic, you essentially have this data to work with:

[Figure: a raw list of observed IP addresses, with no netmasks attached]

Surprisingly, given enough packets to look at, this is enough information to guess netmasks with. While working at Sony, Kenjiro Cho came up with a really elegant way to do that, based on tries. Here’s how:

Take a basic binary radix trie, just like the ones used by software routers. But bound the number of nodes in the tree, say to 10,000. On a backbone link, recording addresses out of IP headers, you’ll exhaust 10,000 nodes in moments.

Keep the nodes on a list, sorted in LRU order. In other words, when you match an IP address with a node, “touch” the node, sticking it at the top of the list. Gradually, frequently seen addresses bubble up to the top, and infrequently seen nodes sink to the bottom.

[Figure: trie nodes threaded onto an LRU list]

Now the trick. When you run out of nodes and need a new one, reclaim from the bottom of the list. But when you do, roll the data from the node up into its parent, like so:

[Figure: reclaiming a node by rolling its data up into its parent]

10.0.1.2 and 10.0.1.3 are sibling /32s, the two halves of 10.0.1.2/31. To reclaim them, merge them into 10.0.1.2/31. If you need to reclaim 10.0.1.2/31, you can merge it with 10.0.1.0/31 to form 10.0.1.0/30.
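
The merge itself is just prefix arithmetic: drop one bit of prefix length and clear the bit that told the siblings apart. A quick sketch, using the same addresses:

    # Roll a prefix up into its one-bit-shorter parent.
    def parent_prefix(addr, len)
      new_len = len - 1
      mask = new_len.zero? ? 0 : (~0 << (32 - new_len)) & 0xFFFFFFFF
      [addr & mask, new_len]
    end

    def ip(s); s.split(".").inject(0) { |a, o| (a << 8) | o.to_i }; end
    def dotted(a); [24, 16, 8, 0].map { |s| (a >> s) & 255 }.join("."); end

    addr, len = parent_prefix(ip("10.0.1.3"), 32)
    "#{dotted(addr)}/#{len}"            # => "10.0.1.2/31"
    addr, len = parent_prefix(addr, len)
    "#{dotted(addr)}/#{len}"            # => "10.0.1.0/30"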

Do this over, say, a minute, and the standout sources will defend their position in the tree by staying at the top of the LRU list, while ambient /32 noise aggregates up toward /0. For the raw list of IPs above, with a 100-node tree, you get this.

Cho calls this heuristic Aguri.
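
To make the mechanics concrete, here is a toy sketch of the heuristic (nothing like Cho's actual code, which works on a real radix trie): a bounded table of prefix counters where every address comes in as a /32 and the least-recently-touched entry gets rolled up into its parent whenever the budget is blown.

    # A toy Aguri-flavored aggregator: [prefix, length] => counter entries in a
    # bounded hash, with LRU eviction that merges counts upward.
    class ToyAguri
      Entry = Struct.new(:count, :stamp)

      def initialize(max_entries = 256)
        @max = max_entries
        @entries = {}          # [prefix_int, len] => Entry
        @clock = 0
      end

      def record(addr)
        touch([addr, 32], 1)
        reclaim while @entries.size > @max
      end

      def report(top = 10)
        @entries.sort_by { |_, e| -e.count }.first(top).map do |(prefix, len), e|
          format("%-18s %6d", "#{dotted(prefix)}/#{len}", e.count)
        end
      end

      private

      def touch(key, count)
        e = (@entries[key] ||= Entry.new(0, 0))
        e.count += count
        e.stamp = (@clock += 1)          # most recently used
      end

      def reclaim
        key, entry = @entries.min_by { |_, e| e.stamp }   # least recently used
        @entries.delete(key)
        prefix, len = key
        return if len.zero?              # nothing above 0.0.0.0/0 to merge into
        parent_len = len - 1
        mask = parent_len.zero? ? 0 : (~0 << (32 - parent_len)) & 0xFFFFFFFF
        touch([prefix & mask, parent_len], entry.count)   # roll the count upward
      end

      def dotted(a)
        [24, 16, 8, 0].map { |s| (a >> s) & 255 }.join(".")
      end
    end

    agg = ToyAguri.new(32)
    ipint = ->(s) { s.split(".").inject(0) { |a, o| (a << 8) | o.to_i } }
    200.times do |i|
      agg.record(ipint.call("10.0.1.2"))      # heavy hitter
      agg.record(ipint.call("192.168.7.9"))   # heavy hitter
      agg.record(ipint.call("172.16.0.#{i}")) # one-off noise
    end
    puts agg.report
    # typically the heavy hitters survive as /32s while the one-off noise
    # gets folded into broader 172.16.0.0/nn aggregates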

5.

Aguri is BSD licensed. You can go download it, and a driver program that watches packets via pcap, from Cho’s old home page.

6.

I’m going somewhere with this, but I’m 1300 words into this post now, and if you’re an algorithms person you’re tired of me by now, and if you’re not, you’re tired of me by now. So, let Aguri sink in, and I’ll give you something cool and useless to do with it later this week.

Comments (58)

    Yeah, I gotta say kinda tired. Jose raised awareness back in 2003 and I know you (and presumably other AN folks past or present) undoubtedly have been doing other neat things with it as well, so this teaser is just kind of like the aggravation you get with a cliffhanger, all lead in no pay off.

    I can't say much about how Arbor's software is implemented, but the Aguri discussion predates 2003. And yeah, if you already knew Aguri... =)

    "Binary trees are a better default table implementation than hash tables." Why should I believe that? Some explanation would be nice, or just avoid huge proclamations like that.

    If "compare" is relatively cheap, binary trees provide comparable average performance and better worst-case performance, and provide fast access to sorted subsets --- which is an operation that turns out to be surprisingly useful when you can count on having it.

    It's not a "huge proclamation", nor am I the first person to make it; it's part of the standard C++ library.

    I want to thank you for the excellent post! I don't usually find articles on the net worth reading but this one is.

    Please follow up the post explaining binary radix trees in more detail.

    Again thanks for the top post!

    Thank you. Just so you know --- I had to go check to make sure your comment was real. It is totally within the state of the blog-spamming art to write code to generate sentences like "please follow up the post explaining XXX YYY ZZZ in more detail".

    I just thought that was funny, is all. Maybe it's not. Appreciate the vote of confidence!

    "I've lost you." made me laugh out loud.

    Excellent, interesting, and well-written post. Subscribed.

    Some remedial computer science, for the PCI auditors in my audience

    You don't need a netmask or full IP prefix to run Nessus 2 against a target IP address!

    If you’re tracking down a botnet or a spamming operation or worm propagation, that netmask number is pretty important to you

    Wouldn't NetFlow, cflow, sFlow, etc include the full IP prefix including src and dst network masks? Even IP CEF includes these and have the information exportable via tmstats. I think troubleshooting IP related information without a local route-server running Zebra, Quagga, or XORP is more difficult and exhaustive.

    I do find Aguri interesting - thanks for the post! I'm looking forward to what it can be used for outside of products such as Arbor or open-source projects such as Ourmon.

    the most popular general-purpose radix trie library (the one from CPAN) is actually stolen out of GateD. It’s a mess, by the way, and don’t use it

    Are you talking about Dave Plonka's Net-Patricia module? I recall first reading about radix trees in TCP/IP Illustrated Volume 2 with some updated material from "Inside Cisco IOS Architecture". There hasn't been much in the literature lately on these topics.

    if you’re an algorithms person you’re tired of me by now, and if you’re not, you’re tired of me by now

    Actually I was hoping that you would cover multi-bit tries (mtries) and LC-tries, too.

    Wow! Great article. Perfectly executed. I'm not an algorithm guy, but I couldn't stop reading. Great writing...I'll watch this blog.

    this isn't new, and plenty of people have heard of this data structure. Every Oracle dba knows about B-tree indexes. They use this method.

    Even if this was a post about binary trees, you'd be wrong; a b-tree isn't a binary tree.

    Cool, Kenjiro Cho also wrote ALTQ for *BSD. Great post.

    Dre, Netflow doesn't include netmasks (where would it get them from?). I don't know sFlow.

    Binary trees are a better default table implementation than hash tables.

    Eh. Depends on what you want to do. In particular, though, trees of all sorts suffer from horrible data cache polution on modern processors, so if you're doing a simple lookup, hash tables are pretty much always more efficient.

    For partial lookups hash tables are, of course, completely useless, and a trie of some sort is going to be necessary.

    It's a whispered truism among systems programmers that all the "linked" data structures (I hear this about linked lists more than trees) are hellish on caches, but this strikes me as a hollow argument for two reasons:

    (1) You don't have to settle for the general-purpose allocator when managing tree storage --- in fact, malloc may be really awful for that purpose, given malloc's mandate to maximize memory efficiency and block lookup, pitted against the program's desire to exploit the locality involved in hitting 16 closely-related node structures back to back.

    (2) You don't HAVE to implement a tree structure using the same "struct" definition that Sedgewick wrote in That Book. If you think about the access patterns in a tree search, you can stripe the data in contiguous blocks.

    There's an old paper that I read about the "Suez" software switch design, maybe 4-5 years ago, that went into some detail about cache-optimized tree structures for prefix lookups.

    Well-written, and pretty pictures to boot. Adding this blog to my Netvibes. Can't wait to read the backlog of stuff I've missed so far.

    @ Tom: Dre, Netflow doesn’t include netmasks (where would it get them from?). I don’t know sFlow

    NetFlow V9 / IPFIX has included subnet masks. NetFlow probably gets them from the same place that the IP FIB (e.g. CEF) gets them from. I don't think BGP is required, but NetFlow could also stamp ASN's on a per-prefix basis if a full-table feeds your peer.

    FYI: In Greek "aguri" = cucumber.

    The netmasks exported from the FIB/RIB/CEF are considerably less useful than the "real" netmasks; what's being exported from a core router is essentially the BGP table. You can go to the looking glass for that. Most interesting prefixes don't have ASNs directly associated with them.

    Wait, I'm confused. Why is this something other than a binary radix search splay tree? Moving frequently sought nodes to the head of a structure is old hat.

    Is there even such a thing as a splay radix tree? You can't rotate a classic radix tree; the levels mean something. Can you "rotate" a PATRICIA tree? It's a DFA, the edges are labelled, so... I guess?

    You rotate a splay tree to optimize search. You keep an LRU in an Aguri trie so you can pick nodes to aggregate. It's not JUST an LRU sort; it's also that (1) the trie is bounded in size, and (2) the bounds are deliberately chosen so that the inputs quickly overflow them. An Aguri trie that is "large enough" to contain all inputs is useless.

    You can expand your hash table (in amortized constant time per inserted element) so that it has a very small constant access time per element in the average case. (Making its worst case good can be arbitrarily hard.) Binary trees have both worse big-O performance (O(lg N) instead of O(1) per lookup or insertion) and dramatically worse constant factors. Additionally, they eat memory like there's no tomorrow (you need at least two overhead pointers per inserted value, which are particularly expensive on 64-bit machines, and usually also tag bits to distinguish leaves from internal nodes, plus usually memory allocation overhead), and balanced-binary-tree algorithms are considerably trickier than hashes. (A few years ago, in http://lists.canonical.org/pipermail/kragen-hac... I implemented a move-to-front optimization like that used in splay trees, but for a hash table. The implementation is three lines of code, and the correctness argument is nine words of comments.)

    I assert without proof that sorted traversal is not that important.

    So I don't agree that "Binary trees are a better default table implementation than hash tables."

    On the other hand, I agree that tries rock, at least in theory. I still use hashes most of the time.

    (1) I'm not writing a post about my war on hash tables. I do, in fact, use hash tables. =)

    (2) See N comments ago where I addressed the received wisdom that trees are necessarily hard on memory and cache performance. No, your trees are.

    (3) Account for the memory you burn on keeping load factor down. You're waving away collisions (that's how you claim O(1) for lookup) by spending memory on it. Are hash tables necessarily smaller than tries?

    (4) Sorted traversal is less important than efficient access to ranges. Not all lookup problems fit the "dictionary" ADT. If you can only take one algorithm to the desert island, don't you take the one that supports more than one ADT?

    I did read the comment where you asserted that binary comparison-based trees don't have to be hard on memory and cache performance, but (a) I wish you would be more specific about how you can avoid having them be hard on memory --- do you mean turning them into in-memory B-trees? Or do you mean carving the tree up into, say, four-level-deep triangles that are connected by pointers at the edges, in which case you have a lot of empty spots?, and (b) a naive reader might think that your mentioning this in this context means that you disagree with my assertion that they have "both worse big-O performance and dramatically worse constant factors", which you don't seem to be.

    I'm not "waving away" collisions --- I said you can expand the hash table, which keeps them to a constant level at a fairly low (amortized) cost, and can still tolerate relatively high load factors with a small average number of collisions. (Lua's tables use some hashing algorithm I still don't quite understand which they claim lets them support 100% load factors...).

    Hash tables are not necessarily smaller than tries, for the reasons you point out. But they usually are.

    And, yes, of course, if I could only ever use one of hash tables and binary search trees, I'd take binary search trees, for exactly the reason you point out. Although tries would be nicer.

    If you don't need to handle 4 billion elements in your tree, you don't need 32 bit pointers to store links. We could take that in a million different directions, but I expect we agree on the fundamental point.

    I'm not trying to say that pretty much every linked list and tree library out there doesn't rip up your cache. They clearly do. But there's a lot of ways to mitigate the naive design that causes that; start by just not using malloc for your tree.

    to get excellent information on tries and other optimization info, read George Varghese's excellent book, Network Algorithmics, and in particular chapter 11.

    http://www.amazon.com/Network-Algorithmics-Inte...

    This is indeed an awesome book, though perhaps not my go-to source for how to implement a trie; most of Varghese's trie stuff assumes you have the background already.

    You guys -suck- at computing, do you know that? Use quantum computing, don't just run through the numbers trying to determine whether the number is in there, just -pick it- out of the list.

    Jeezes, no wonder we're still in the goddamn programming stone age!

    That comment right there? At least as funny as the last 5 XKCD comics.

    This reminds me of Octree Colour Quantisation in the way it aggregates the leaves once you hit a certain number of nodes.

    http://www.cubic.org/docs/octree.htm

    I enjoyed your article immensely and, in particular, the follow-on discussion.

    I personally prefer red-black trees since most of the code I've worked on didn't dynamically resize the underlying number of buckets and we'd eventually end up with some pathological data set that'd stuff beaucoup widgets in a single bucket. Or someone would write a hash function that didn't distribute well and we'd degenerate to a godawful linked list.

    That said, there's one thing I see hash tables allowing you to do more naturally -- incremental cleanup of the dataset. If you've got to visit the entire dataset once every, say, 30 iterations of the cleaner, it's pretty clear I must visit 1/30th of the available buckets. On the other hand, I've never found anything equally satisfying* for a tree since a naive solution is icky when an unfortunately chosen object is deleted between runs.

    *elegant suggestions insanely appreciated.

    I think you confused the bounded list's victim selection strategy with the relationship represented in the list contents.

    I think you meant MRU where you said "LRU". Well the unpopular victim is always at the tail of the list, and at completion the most popular node is at the front of the list, so it is surely a Most Recently Used list. Or did I misunderstand you?

    Nice article! It's been a while since I've read up about a nice data structure.

    I like trees for an entirely different reason - they can be made to play nicely with functional programming. They can even be used to implement fairly efficient functional random access lists (http://citeseer.ist.psu.edu/okasaki95purely.html).

    I would be intrigued to see someone comparing well tuned hash and tree lookup structures in software packages that are routinely used.

    Something that might be of interest here, are cache oblivious data structures (http://www.daimi.au.dk/~large/ioS05/ABF.pdf). I stumbled onto the article a couple of months ago and gave it a cursory read-over (and remember being impressed); your article is definitely making me want to go back and read it more thoroughly.

    To those claiming linked data structures are bad, you don't need pointers to implement a tree. Think about it carefully and you'll figure out how to use an array (i.e., no more storage than for the data itself).

    Hint: look at how the old McKusick/Karels malloc works.

    Bonus: implement the freelist using only one additional array element.

    BTW, none of you better have a BSCS or higher. Tree implementation optimization is an undergraduate task at any non-correspondence school.

    (2) See N comments ago where I addressed the received wisdom that trees are necessarily hard on memory and cache performance. No, your trees are.

    Sigh. If you're not doing prefix or neighbor lookups, then a tree is never a win in memory due to cache effects. It's especially not a win on a disk drive as you're causing more seeks.

    This isn't wisdom whispered in halls, this is just plain measurable on any test.

    And yes, I know about optimizing trees to deal with cache effects. I wrote a bone-head boggle solver that creates a ternary tree of all the words, so that I could quickly prune search paths. In part of the optimization of the algorithm, I measured cache misses (to disk, as I had serialized the tree and mapped it directly into memory from disk so that it wouldn't have to be re-run on every invocation). Part of what came up immediately was it was important to create the tree in a nice order so that after the first few comparisons, everything was under a page size. This is your point that you make above.

    But, bottom line, if you have to traverse a tree structure to merely see whether or not a key exists (or just acquire the associated leaf data), it will always be slower, and worse on the entire system (due to cache effects) than a hash.

    I like tries. I use tries. But they aren't always appropriate and you're doing your readers a disservice if you tell them any differently.

    You're really comparing external search to in-memory search? Come on, Ray.

    You’re really comparing external search to in-memory search? Come on, Ray.

    Wow. I must be really miscommunicating here if that's all you took away from that. It was an example to motivate your intuition.

    But to go with it for a moment, yes, I am, as they have parallels. A cache miss in memory is akin to a head seek on a drive. To see why, just think about how the L2 cache is the same as the track cache. The L1 is equivalent to whether a sector is in the RAM cache or not. But this was merely an example. Discard it if it offends you so deeply.

    The point is that cache misses are expensive on modern processors. Back when I started on the 6502 et al (1979), there was no difference. Nowadays, there's a massive difference between code that makes many memory accesses or few, and whether those memory accesses are spread out randomly or in order.

    Try it. Make a large 2-d array of integers (say 1024x1024), and try walking the array in column order first versus row order first, and timing it.

    Anyway, if you don't want your intuition motivated, well then, I guess I'm just going to have to beg you to measure the speed difference between walking a tree and looking up a key in hash.

    And for the third time, there are times when hashes are completely worthless -- such as partial path lookups and needing to traverse neighbors (L/R/Parent) in an ordered collection. But for many things you're generally going to be better off with a hash.

    Don't take my word for it. Go measure it.

    Ray, all you're pointing out is that loading whole 64 bit pointers out of memory and following them to some random other region of memory causes cache misses. You seem insistent on defining "tree structures" as "structs with two child pointers", and "memory" as "what malloc returns".

    You're obviously a smart guy. Why am I having such a hard time getting you past that idea?

    hey tom. going to explain judy trees next? ;-)

    Judy is disgusting.

    Obviously we're both just miscommunicating here, so please bear with me a little longer.

    My initial statement was pretty simple. You said "Binary trees are a better default table implementation than hash tables." and I said it depends. Take the below with that context, okay?

    I don't understand what you mean by "insistent on defining 'tree structures' as 'structs with two child pointers' and 'memory' as 'what malloc returns'." Nothing I said is dependent upon a specific representation. Quite the contrary, I gave the serialized-to-disk example of a ternary tree to counter that possible issue. And as for memory being what malloc returns, all I can say is "huh?". You seem to think that RAM has different characteristics whether it's heap or stack? I'm horribly, horribly confused by what you are trying to get across here.

    So...?

    From my point of view a tree is something that requires a set of traversals. I would define a single "traversal step" as a load from memory and a branch decision (based on a comparison). Whether memory comes from heap or stack or is even purely on disk is immaterial to my argument. Whether it's a b-tree, radix tree, binary tree, or some hybrid is also immaterial.

    So, I guess I just flat out don't understand your point. I'm saying in certain circumstances, a hash is a far better data structure than a tree, whatever kind of tree that may be. Are you disagreeing with that? I just can't fathom how. Are you saying that a tree of some sort is *always* better than a hash? If so, why do you think ComSci profs teach it? Merely to confuse students?

    I mean, in none of my comments have I been strident about this. I'm just trying to say that in your otherwise nice writeup, you gloss over a point that needs to be addressed more carefully for those who are not as versed in data structures as others. And by reading the comments, there are many who just aren't as well trained as we may hope. Don't mislead them. That's all I'm saying. Okay?

    I'm not talking about the storage hierarchy. We agree, external search, different problem from in-memory search.

    I'm specifically talking about tree structure layouts that minimize cache misses and overhead for links.

    You don't even need to store links directly; you can use minimum-bit or succinct encodings. If you're concerned about optimizing a specific memory access pattern --- for instance, increasing the chance that all the traversal fetches are going to be in loaded cache lines --- you can tailor the memory layout to that as well.

    It's true that "struct treenode { void *key; void *data; struct treenode *left; struct treenode *right; };" is not a particularly memory-efficient or cache-efficient layout. I'm just strongly objecting to tarring the whole concept of a tree with that naive implementation. Realtime network production code running at n mpps wouldn't use "struct treenode", or a hash table.

    For a disgusting example of what I'm talking about that Dug Song just mentioned, consider Judy arrays.

    i have no idea how the 4 in the binary tree can be where it seems to be; maybe this has already been pointed out or sumthin but 4 comes from 6 which comes from 5 so 4 is greater than 5? maybe i just didnt get it but 4 shud be a left-wing descendant of 5 instead of being a left wing descendant of 5's right-wing descendant. at least thats how i wudve done it.

    Hey there, interesting post. I was wondering - I've seen two references now stating that Sedgewick got the implementation for patricia tries wrong (the other reference being http://cr.yp.to/critbit.html) - however I've never seen an explanation on just what exactly was wrong with his implementation.

    Todd H

    To be honest, without digging out my copy of Sedgewick, I can't actually back that statement up. In my defense, I first heard the complaint from Danny Dulai when we and Kneel Fachan wrote the Patricia library for libishiboo, which you may still be able to find online.

    You can check the errata on Sedgewick's site, but he only has errata for the third edition on; third edition is from, I think, 2000? and my apocryphal claim about his error would be from around 1995. So, long story short, if you crib your trie code from the current Sedgewick book, you might be aces.

    The graph algorithms volume of the new Sedgewick is also excellent.

    "none of you better have a BSCS or higher"

    Never fear, dude :^)

    "I first heard the complaint from Danny Dulai when we and Kneel Fachan wrote the Patricia library for libishiboo, which you may still be able to find online."

    Ah cool - I found the libishiboo library at his (Dulai's) site. Looks like the code actually implements removal of nodes from the patricia trie too which is neat since I've never been able to find any good explanation of how to go about doing this (sedgewick leaves it as an exercise in his book).

    Todd H

    Don't tell him I told you about it; he may regret leaving that out there. =)

    I've used code derived from Kneel's PATRICIA tree in production, so I'm pretty comfortable with it.

    Hi Thomas,

    nice to see you writing again. This data structure you're describing sounds awfully similar to Binary Decision Diagrams. Basically if I did a binary expansion of that 32-bit IPv4 address and then gave each of those bits a variable name, I could do exactly the same thing you've described with a reduced, ordered BDD. The merging operation you have described is the reduction operation that transforms a binary decision tree into a decision diagram.

    Cheers,
    Ralf

    Good article!

    Correct me if I'm wrong though (it's been a while since my algorithms study), but in response to the original question:

    "I hand you an array of random integers. How can you tell if the number three is in it?"

    Yes, linear search is bad, but isn't it the best way to answer the original question? If you're sorting the array, you're introducing additional complexity that still has to touch each element O(n). From what I remember, there can't be any sorting algorithm better than O(n).

    If we sort the given array we've already spent at least O(n) but we now have to search the array (providing we didn't search while we were sorting). This assumes we're only looking for one element.

    Now, if we can choose to get a sorted array of random integers, that all of course changes.

    Anyway, that's all beside the point of Aguri. Never heard of it but I'll definitely check it out!

    The FPP, Fast Pattern Processor, portion of the Agere Network Processor is Patricia Tree based. The same practice of using Patricia Trees can apply to generic TCP/IP packet processes and "pseudo" regular expression searching. You can leverage them in really neat ways.

    I first used Patricia Trees in creating silicon for routers [NP based Cisco routers] and later on IPS systems. The downfall is the changing of rules. Meaning, reorganizing the tree(s) is an art form [in a sense] and I had to fall back on a genetic algorithm to load the tree based on user settings when it came to 220 bit and beyond Patricia Trees. This seemed to solve the rule switching issue. To show you how bad the rule switching was on a 7200 NP based system (Cisco), it would take up to 120 seconds to reload the BGP table (back when that was hip). Later in 2001 we all switched to creating devices with dual banked memory for Patricia Tree silicon.

    Great topic by the way - would love to see more articles like this.

    "I hand you an array of random integers. How can you tell if the number three is in it?"

    Maybe you should read the manual.

    // assuming the integers live in $numbers
    if (in_array(3, $numbers)) {
        echo "3 is in the array";
    } else {
        echo "3 is not in the array";
    }

    Problem solved. I didn't even have to read the rest of the post about aguri. That is a kind of cactus, right?

    /me sighs

    What if you have to write the in_array() function, jackass?

    Speaking of regex and DFAs, I saw an interesting presentation from a couple of Google researchers at IT Defense this week. They have found a bunch of implementation problems in various regex engines:
    http://www.it-defense.de/itdefense2008_com/page...
    (Scroll down to "Regular Exceptions".)

    No slides online or anything yet, I don't think. If you're interested, I'll come back when they are.

    Good summary on trees and tries!
    I wonder if radix tries could also be used to improve the performance of keyword searches in PCAP files or sniffed traffic (like an Echelon functionality). This type of search can be performed with NetworkMiner, see:
    http://networkminer.wiki.sourceforge.net/Keywor...

    I could burn 8 more blog posts talking about the different theories behind packet filtering --- pcap takes a pretty idiosyncratic approach, which is to frame the problem compiler-theoretically (though the performance of pcap is dominated by the IO channel you use to get packets; by lines of code, pcap performance is overwhelmingly addressed by IR optimizers).

    I say that because tries are a classical approach to speeding up packet filtering. You can consider routing a special case of filtering, of course, in a single dimension. As you add dimensions, you start cross-producting multiple tries.

    And, of course, radix tries (and edge-labelled PATRICIA tries in particular) are just a specialization of DFAs, which brings us back to pcap and optimizers and...

    FYI: aguri in greek means cucumber.. :oS
