Markov Chains using CouchDB's Group Reduce

Markov Chains using CouchDB's Group Reduce
By jchris in Coding 7 months ago.

Markov Chains are statistical constructions of a series of objects or events (in this case words) designed to approximate a sample set. If you’ve ever received spam that looks almost like it was written by a human, you’re probably looking at a Markov Chain.

free the same person place there were to make no 2 the other things to be taken up a book of painting surpasses.

The math is simple: for a given word or words, what words follow most frequently in the sample set? The important requirement is having enough data, so that for most words, there will be enough instances in the sample set to find a plausible follower. Once you have the data, the rest should be simple – let’s see some pseudocode:


word = intial_word
while word
  print(word + ' ')
  word = probable_follower_for(word)
end

Looking into probable_follower_for(word) we can easily see that it requires random access to the sample set. For each word we need to find all of its occurances in the text (and its context), and give higher weight to words which follow it more often in the set. This isn’t rocket science, but doing it with awk, sort, and uniq -c will thrash your disk with each query.

Storing the sample data in a custom index would make a lot of sense for this application, and don’t I remember something about B-trees and Θ(log N)? – time to bust out the Knuth. If you’re like me you’d rather not write code, whenever possible. Luckily, CouchDB’s sweet spot is building indexes like the one we need. Since it’s also got a sexy JSON / HTTP API, we’d be crazy not to use it!

CouchDB’s Incremental Reduce

If you experimented with my word count example you’re familiar with CouchDB’s reduce functions. Today’s new bit is Group Reduce, which takes the results of the map function, and allows you to run them through the reduce function with segmentation defined at query time. Let’s start with an example map on which to run reductions:


["the","alpha","rays"],                     "value":"outline-of-science"},
["the","alterations","effected"],           "value":"ulysses"},
["the","altered"],                          "value":"da-vinci"},
["the","american"],                         "value":"america"},
["the","american"],                         "value":"america"},
["the","american"],                         "value":"america"},
["the","american","federation","of"],       "value":"america"},
["the","american","federation","of"],       "value":"america"},
["the","american","marine","on"],           "value":"america"},
["the","amoeba","is","one"],                "value":"outline-of-science"},
["the","ancestors"],                        "value":"outline-of-science"},
["the","ancient","architects","beginning"], "value":"da-vinci"},
["the","ancient","man"],                    "value":"da-vinci"},

Here’s our a very simple reduce function, which simply returns the total number of rows in its input:


function(keys, values, combine) {
  if (combine) {
    return sum(values);
  } else {
    return values.length;
  }
}

Run on the above map results, the function returns a succinct (if uninformative) 13. Querying the reducer with the group=true parameter returns the value for each unique key, like so:


["the","alpha","rays"],                     1
["the","alterations","effected"],           1
["the","altered"],                          1
["the","american"],                         3
["the","american","federation","of"],       2
["the","american","marine","on"],           1
["the","amoeba","is","one"],                1
["the","ancestors"],                        1
["the","ancient","architects","beginning"], 1
["the","ancient","man"],                    1

But wait, we want to see the relative frequency of just the next following word. Now let’s get interesting! Run the query with group_level=2, and we get the reducer results for each unique set of the first two elements in our key’s array.


["the","alpha"],              1
["the","alterations"],        1
["the","altered"],            1
["the","american"],           6
["the","amoeba","is","one"],  1
["the","ancestors"],          1
["the","ancient"],            2

Woah that’s rad! Not only can we just take the return value to tell us which word is next in the chain, we’re pushing less than half the bits around, than we would if we tried to query the map directly and do the reduction in our application code. CouchDB optimizes all the housekeeping and indexing, so not only is it’s solution convenient, it is fast!

All the code and data necessary to generate Markov chains is in the CouchRest git repository (I’m having second thoughts about storing the full text of those book in the Git repo, so maybe I’ll write a little shell script to curl them in… but for now.)


git clone git://github.com/jchris/couchrest.git
cd couchrest
ruby examples/word_count/word_count.rb
script/couchview push word-count-example
examples/word_count/markov word

Enjoy!

Can you think of something else the Group Reduce functionality would be good for? Leave a comment!

2 comments on Markov Chains using CouchDB's Group Reduce

I want to thank you again for putting up these tutorials. I’m trying to wrap my head around CouchDB, and these tutorials are helping a lot.

I’d like to know where you found the group_level option and how you can change the value of combine as passed to the reduce function.

Unless I’m missing something it seems like CouchDB is awesome, but not yet very well documented (and rapidly changing!) These are features I’d like to know about and use.

Post a comment