The Daily Bayes
In the mid-90s, Nicholas Negroponte described a personalized newspaper that would learn what sorts of news the reader wanted to read, and compose itself from the relevant news of the day. He called it "The Daily Me."
Even given my own immoderate optimism about the possibilities of the Web at the time, such a thing seemed technically ambitious to me. It would also have required media outlets to supply their news in a format that a software system could work with, plus some sort of artificial intelligence.
Now, though, all of the necessary tools to make something like this work are in common currency. Increasingly, news outlets are providing RSS, RDF, or Atom feeds of their product. Media are being driven by more agile, narrowcast news sources (personal weblogs, meta-weblogs, technical news sources) to do this.
I primarily consume the Web through an aggregator. My use of a desktop aggregator has recently been supplanted by the ease, convenience, and—to my surprise—speed and power, of a web-based aggregator service.
Is this the Daily Me? Not just yet. It's closer to what I'm interested in, because I follow a lot of narrowcast sources that are mostly aligned with my interests, but the news items are only categorized by source, not by my potential interest in individual items. For example, I read Engadget and Gizmodo, though a high percentage of their news items are about cell phones. Not only am I not interested in cell phones, I have a deep antipathy toward them. I would, however, like to read about the latest advances in cell phone jamming technology.
But by selecting a few news sources with a high hit rate of "interesting" news for me, I may miss the occasional gem that comes out of a site that, on the whole, I wouldn't find very interesting, and that wouldn't be worth my time to configure in my news aggregator.
What's needed is a mechanism for classifying individual news items, one that can be trained to distinguish between things I want to see and things I don't really care about or find repugnant. Sound familiar? This is what your spam filter does.
Combine the naïve Bayesian classifier at the core of many spam filters with a pool of RSS news feeds and, with a little training, it would be able to decide which news items to present to you.
A newspaper, though, has sections. A business section, an arts and entertainment section, technology section, editorials, classifieds and funnies. The spam/ham classification can expose you to the "right" news stories, but in no particular grouping. Luckily, the Bayesian classifier is more versatile than this. In my early experiments, I found that the classifier on which I based my homebrew spam filter could reliably distinguish between emails from a certain group of friends, an individual friend, a group of ex-colleagues, or from one of several technical mailing lists (i.e. I had lots of ham categories).
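To make the multi-category idea concrete, here is a minimal sketch in Python of a naive Bayes classifier with more than two labels. It is not the code from my homebrew spam filter; the class name NaiveBayes, the tokenizer, the section labels, and the training headlines are all invented for illustration, and a real system would need far more training data than six headlines.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> training items seen
        self.vocabulary = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return a log-probability score per label (higher is better)."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # prior for this label
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[label] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Invented training headlines; "skip" plays the role of spam.
nb = NaiveBayes()
nb.train("New cell phone adds camera and ringtones", "skip")
nb.train("Carrier announces cell phone plan with more minutes", "skip")
nb.train("Open source aggregator adds Atom feed support", "technology")
nb.train("Python library parses RSS and Atom feeds", "technology")
nb.train("Stock markets rally as quarterly earnings beat forecasts", "business")
nb.train("Retailer reports quarterly earnings and raises forecasts", "business")

print(nb.classify("Slim cell phone ships with new ringtones"))
print(nb.classify("Aggregator service parses feeds faster"))
```

Nothing in the classifier knows that "skip" is special. It is just one more label, which is why the same machinery that separates spam from ham can also sort the ham into sections.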
Peter Merholz has an interesting article today on the classification of web-based data, which I discovered, conveniently enough, while I was forming this posting in my head. (Disclosure: The article came to my attention through an RSS feed of my del.icio.us subscriptions, specifically mathowie.) Peter talks about tag-based classification methods in systems like del.icio.us and Flickr, and how other people's ideas of how to tag things may not sync up with mine.
Clearly, such tagging systems are not a panacea; they present many potential drawbacks. With no one controlling the vocabulary, users develop multiple terms for identical concepts. For example, if you want to find all references to New York City on Del.icio.us, you’ll have to look through “nyc,” “newyork,” and “newyorkcity.”
You may also encounter the inverse problem — users employing the same term for disparate concepts. Flow, for instance, can either mean optimal creative experience, or the movement of a fluid.
But it's worse than that. People are very bad at maintaining metadata. They just don't do it. Or they don't think it through. When it comes to metadata, just as in politics, people are lazy and stupid. Even when they understand the value in accurate and rich metadata.
No set of categories is going to be suitable for everyone. General categories (Sports, World News, Weather) might be agreed upon by all users of a shared system, but an individual would want a more fine-grained set of classifications (Anna Kournikova, New Zealand aboriginal politics, Trailer parks destroyed by tornadoes) which would be untenable in a shared system. A compromise might work best: The shared system uses a coarse, fixed taxonomy (think dmoz.org or Yahoo), categorized by either humans or machines, and individuals can then further winnow their news down with locally trained Bayesian classifiers.
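The compromise can be sketched as a two-stage filter. This is only an illustration in Python under some loud assumptions: the shared system is assumed to hand over each item already tagged with a coarse category, and the names LocalClassifier and personal_edition, the fine-grained labels, and the sample headlines are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


class LocalClassifier:
    """A small naive Bayes classifier, trained privately by one reader."""

    def __init__(self):
        self.words = defaultdict(Counter)  # label -> word counts
        self.items = Counter()             # label -> number of training items

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        self.words[label].update(self.tokens(text))
        self.items[label] += 1

    def classify(self, text):
        words = self.tokens(text)
        vocab = {w for counts in self.words.values() for w in counts}
        total_items = sum(self.items.values())
        best_label, best_score = None, -math.inf
        for label, n in self.items.items():
            counts = self.words[label]
            denom = sum(counts.values()) + len(vocab)
            score = math.log(n / total_items)  # prior for this label
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def personal_edition(items, wanted_sections, local):
    """Stage 1: keep items whose shared, coarse category the reader wants.
    Stage 2: group the survivors under the reader's own fine-grained labels."""
    edition = defaultdict(list)
    for shared_category, headline in items:
        if shared_category not in wanted_sections:
            continue
        edition[local.classify(headline)].append(headline)
    return dict(edition)


# Hypothetical local training data; the labels belong to this reader only.
local = LocalClassifier()
local.train("Top seed wins tennis final in straight sets", "tennis")
local.train("Tennis star withdraws from tournament with injury", "tennis")
local.train("Tornado destroys trailer park", "tornadoes")
local.train("Trailer park flattened as tornado touches down", "tornadoes")

# Hypothetical items as (shared coarse category, headline) pairs.
items = [
    ("Sports", "Tennis final goes to five sets"),
    ("Weather", "Tornado warning issued near trailer park"),
    ("Business", "Tennis sponsor reports record earnings"),
]
edition = personal_edition(items, {"Sports", "Weather"}, local)
print(edition)
```

The Business item never reaches the local classifier, even though it mentions tennis. The coarse taxonomy does the cheap, shared work first, and the personal training only has to cope with what is left.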
