[Full title: Ontology is Overrated: Links, Tags, and Post-hoc Metadata]
There are many ways to organize data: labels, lists, categories, taxonomies, ontologies. Of these, ontology -- assertions about essence and relations among a group of items -- seems to be the highest-order method of organization. Indeed, the predicted value of the Semantic Web assumes that ontological successes such as the Library of Congress's classification scheme are easily replicable.
Those successes are not easily replicable. Ontology, far from being an ideal high-order tool, is a 300-year-old hack, now nearing the end of its useful life. The problem ontology solves is not how to organize ideas but how to organize things -- the Library of Congress's classification scheme exists not because concepts require consistent hierarchical placement, but because books do.
The LC scheme, when examined closely, is riddled with inconsistencies, bias, and gaps. Top-level geographic categories, for example, include "The Balkan Peninsula" and "Asia." The primary medical categories don't include oncology, defaulting to the older and now discredited notion that cancers were more related to specific organs than to common processes. And the list of such oddities goes on.
The LC scheme accumulates these errors faster than its maintainers can correct them because of the physical fact of the book, which makes a card catalog scheme necessary and constant re-shelving impossible. Likewise, the scheme enforces cookie-cutter categorization that doesn't reflect the polyphony of its contents -- there is a literature of creativity, for example, made up of books about art, science, engineering, and so on, and yet those books are not categorized (which is to say shelved) together, because the LC scheme doesn't recognize creativity as an organizing principle. For a reader interested in creativity, the LC ontology destroys value rather than creating it.
As we have learned from the Web, when data is decoupled from physical presence, it is fluid enough to be grouped differently by different readers, and on different days. The Web's main virtue, in handling data, is to transmute organization from an a priori, content-based judgment to one that can be ad hoc, context-based, socially embedded, and constantly altered. The Web frees us from needing to argue about whether The Book of Five Rings "is" a business book or a primer on war -- it is plainly both, and not only are we freed from making that judgment firmly or in advance, we are freed from needing to make it explicit at all.
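As a rough illustration of that last point, here is a minimal Python sketch of labels applied after the fact by different readers. The reader names, the tags, and the `group_by_tag` helper are all hypothetical, invented for this example; nothing here is drawn from a real tagging service.

```python
from collections import defaultdict

# Hypothetical (reader, item, tag) triples; names and tags are invented.
taggings = [
    ("alice", "The Book of Five Rings", "business"),
    ("bob", "The Book of Five Rings", "war"),
    ("bob", "The Art of War", "war"),
]

def group_by_tag(taggings):
    """Build tag -> set of items from whatever labels readers happened to apply."""
    groups = defaultdict(set)
    for _reader, item, tag in taggings:
        groups[tag].add(item)
    return dict(groups)

groups = group_by_tag(taggings)

# The same book sits in both groups; no one had to decide which it "is".
print(sorted(groups["business"]))
print(sorted(groups["war"]))
```

The groupings are computed from the labels rather than declared in advance, so adding or removing a tag regroups the items without moving anything.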
This talk begins by exploring the rise of ontological classification. In the period after the invention of the printing press but before the invention of the search engine, intellectual production was vested in books, objects that were numerous but opaque. When you have more than a few hundred books, categorization becomes a forced move, even if the categories are somewhat arbitrary, because without categories, you can no longer locate individual books.
It will relate this "opaque objects" problem to the more recent history of organizing pure data -- a hierarchical file-system; then the emergence of "symbolic" links, which undermined the hierarchy but left intact the idea that data "was" somewhere, and that all other pointers were second-class; to our current system, where the URL makes all links equally symbolic.
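The "second-class pointer" idea can be seen in a small Python sketch that uses only the standard library. The directory and file names are made up for illustration, and it assumes a platform where creating symbolic links is permitted.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # The file "is" in one place in the hierarchy (hypothetical names).
    real_dir = os.path.join(root, "art")
    os.mkdir(real_dir)
    real_path = os.path.join(real_dir, "creativity.txt")
    with open(real_path, "w") as f:
        f.write("notes")

    # A symbolic link lets it also appear under a second category.
    link_dir = os.path.join(root, "science")
    os.mkdir(link_dir)
    link_path = os.path.join(link_dir, "creativity.txt")
    os.symlink(real_path, link_path)

    is_link = os.path.islink(link_path)
    # The link still resolves to the one "true" location.
    resolves_to_real = os.path.realpath(link_path) == os.path.realpath(real_path)

    # Remove the real file: the link remains but points at nothing.
    os.remove(real_path)
    link_dangles = os.path.islink(link_path) and not os.path.exists(link_path)

print(is_link, resolves_to_real, link_dangles)
```

Deleting the original leaves the link dangling, which is the asymmetry the passage describes: one path holds the data and the others merely refer to it.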
The URL represents the inversion of the traditional scale, making the mere label and not the mighty ontology the key site of organizational value. The talk will go on to describe the tension between productive and extractive modes of metadata, and the effects of scale, heterogeneous user assumptions, naïve and flat classifications, lowered barriers to production and tagging, and long-lived classifications by individuals. These are all things that are inimical to ontology but predictive of extractive organizational value, in the manner of Google.
The talk ends by discussing key technologies in the spread of extractive value -- Google, del.icio.us, fotonotes, purple numbers, RDF -- and wrapping up with some predictions about where value might be encapsulated in user-tagged, semi-structured data in the future.
Clay Shirky teaches at NYU's graduate Interactive Telecommunications Program. He writes and consults on the social and economic effects of the Internet, concentrating particularly on the decentralization of applications (peer-to-peer architectures and programmatic interfaces) and on the current explosion in social software.
This presentation is one of a series from the O'Reilly Emerging Technology Conference held in San Diego, California, March 14-17, 2005.