The Semantic Web In Depth

Recently, the phrase Semantic Web has been popping up more and more. Unfortunately, there are no really good non-technical explanations of what it is. This is a first-draft attempt at breaking the Semantic Web down into its component parts and describing each in summary. Be forewarned that the explanation is detailed and covers many facets of the Semantic Web. Please let me know where the wording is unclear or incorrect. I really do want to improve this page!

Identifiers: Uniform Resource Identifier (URI)

Whenever you want to talk to someone about something, you use some sort of identifier: "The North Star," "The strange man at the grocery store," "Those really sour candies Bob always eats." However, when you want to be as exact as possible, you use a name: "Polaris," "Johnathan Roberts," "Mega Warheads."

On the Web, instead of giving something a name, we give it a URI. Anything that has a URI is said to be "on the Web" and we can give anything a URI: you, the book you bought last week, the fly that keeps buzzing in your ear and anything else you can think of -- they all can have a URI.

The URI is the foundation of the Web. While nearly every other part of the Web can be replaced, the URI holds the rest of the Web together. You're probably already familiar with one form of URI: the URL or Uniform Resource Locator. A URL is the address that lets you visit a webpage, like: http://www.w3.org/Addressing/. If you break it down you can see that a URL lets your computer locate a specific resource (in this case, the W3C's Addressing website). In addition to URLs, there are other forms of URIs. For example, mid: URIs identify email messages but they aren't able to locate a copy of the message for you.
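To make that breakdown concrete, here is a small sketch in Python (chosen only for illustration) that splits the URL above into its parts with the standard urllib.parse module. The mid: message ID is made up for the example.

```python
from urllib.parse import urlsplit

# Split the example URL from the text into its named components.
parts = urlsplit("http://www.w3.org/Addressing/")

print(parts.scheme)  # the protocol used to fetch the resource
print(parts.netloc)  # the server to ask
print(parts.path)    # the resource on that server

# A mid: URI can be parsed too, but it names no server to contact, so it
# identifies a message without saying where to find a copy.
# (This message ID is a made-up example.)
mid = urlsplit("mid:960830.1639@example.org")
print(mid.scheme, repr(mid.netloc))
```

The empty server part of the mid: URI is the point: it is still a perfectly good name, just not an address.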

URIs are decentralized -- no one person or organization controls who makes them or how they can be used. You don't need any authority or permission to make a URI for something. You can even make URIs for things you don't own, or abstract concepts that don't even physically exist. While this flexibility gives URIs a lot of power, it also brings with it a lot of problems. Since anyone can create a URI, we will inevitably end up with multiple URIs representing the same thing and we have no way to figure out whether two URIs are definitely the same. And we'll never be able to say with certainty exactly what a certain URI means. However, these are the tradeoffs that were made so that something the scale of the Semantic Web could be built.

Common practice for giving something a URI is to create a web page that describes the object and explains that the URL of that webpage represents it. For example, I created a URI for my copy of Weaving the Web. Now I have said that that specific URI no longer represents the web page you get back when you visit it, but instead the physical book that it describes.

This is an important fact to understand. URIs are not recipes describing to your computer how to get a specific file. Instead, they are names, which may or may not contain one way for your computer to get more information about them. Other ways to find out information about a URI are being developed, and ways to say things about URIs are an important part of the Semantic Web.

Documents: Extensible Markup Language (XML)

XML was designed to be a simple way to send documents across the Web. It allows anyone to design their own document format and then write a document in it. These document formats can include markup to enhance the meaning of the document. If we include enhanced meaning in our documents, they become much more useful. Instead of being usable by only one program (a web browser, for example), they can be used by many programs, each using only the markup it understands and ignoring the rest. Even better, each program is free to interpret the markup in the way that's best for it. For example, in a document where words are marked as "emphasized" a web browser might display them in bold. On the other hand, a voice browser (which reads web pages out loud) may read them with extra emphasis. Each program is free to do what it feels is appropriate.

Here's an example of a document in plain text:

I just got a new pet dog.

And marked-up in XML:

<sentence><person href="http://aaronsw.com/">I</person> just got a 
new pet <animal>dog</animal>.</sentence>

The items that I added to the XML version are called tags, because they are like attaching descriptive "tags" to certain portions of the document. A full set of tags (both an opening and closing tag) and their content is called an element and descriptions like href="http://aaronsw.com/" are called attributes.
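Here is a rough sketch of how a program might read that document, using Python's standard xml.etree.ElementTree module. It is only one way to do it; the point is that a program interested in animals can pick out the animal element and ignore everything else.

```python
import xml.etree.ElementTree as ET

doc = ('<sentence><person href="http://aaronsw.com/">I</person> just got a '
       'new pet <animal>dog</animal>.</sentence>')

root = ET.fromstring(doc)

# Walk every element: its tag name, its attributes, and its text content.
for element in root.iter():
    print(element.tag, element.attrib, repr(element.text))

# A program that only cares about animals can ignore all the other markup.
animals = [e.text for e in root.iter("animal")]
print(animals)

# Stripping the markup gives back the plain-text sentence.
plain_text = "".join(root.itertext())
print(plain_text)
```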

With XML Namespaces we give each element and attribute a URI. This way, anyone can create their own tags and mix them with tags made by others. Since everyone's tags have their own URIs, we don't have to worry about tag names conflicting. XML, of course, lets us abbreviate and set default URIs so we don't have to type them out each time:

<sentence
  xmlns="http://example.org/xml/documents/"
  xmlns:c="http://animals.example.net/xmlns/"
><c:person c:href="http://aaronsw.com/">I</c:person> just got a 
new pet <c:animal>dog</c:animal>.</sentence>
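A quick way to see what the abbreviations stand for is to parse the document and print the full names. This sketch uses Python's xml.etree.ElementTree, which happens to write each expanded name as the namespace URI in curly braces followed by the local name; other XML tools display it differently.

```python
import xml.etree.ElementTree as ET

doc = """<sentence
  xmlns="http://example.org/xml/documents/"
  xmlns:c="http://animals.example.net/xmlns/"
><c:person c:href="http://aaronsw.com/">I</c:person> just got a
new pet <c:animal>dog</c:animal>.</sentence>"""

root = ET.fromstring(doc)

# Every tag comes back as "{namespace URI}local-name".
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)

# Prefixed attributes are expanded the same way.
person = root.find("{http://animals.example.net/xmlns/}person")
print(person.attrib)
```

Two tags called "person" from different namespaces would print with different URIs, which is why the names can't collide.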

Statements: Resource Description Framework (RDF)

Now we start getting into the meat of the Semantic Web. It's wonderful that we can create URIs and talk about them with our web pages. However, it'd be even better if we could talk about them in a way that computers could begin to process. For example, it's one thing to say "I really like Weaving the Web." on a web discussion forum. However, no computer could process what you said.

RDF gives you a way to make statements that are machine-processable. Now the computer (of course) can't actually "understand" what you said, but it can deal with it in a way that seems like it does. For example, someone could search the Web for all book ratings and provide an average rating for each book. Then, they could put that information back on the Web. Another website could take that information (the list of book rating averages) and create a "Top Ten Most Popular Books" page.
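The book-rating scenario might look something like the sketch below. Everything in it is an assumption made for illustration: the reviewers, the second book, the 1-10 scores, and the idea of holding each rating as a simple (reviewer, book, score) tuple rather than real RDF.

```python
from collections import defaultdict

# Rating statements gathered from around the Web: (reviewer, book, score).
# The reviewers, the second book and the 1-10 scores are hypothetical.
WEAVING = "http://www.w3.org/People/Berners-Lee/Weaving/"
OTHER = "http://books.example.org/another-book"
ratings = [
    ("http://aaronsw.com/", WEAVING, 10),
    ("http://bob.example.org/", WEAVING, 8),
    ("http://bob.example.org/", OTHER, 4),
]

def average_ratings(statements):
    """Step 1: one site averages every rating it can find for each book."""
    scores = defaultdict(list)
    for _reviewer, book, score in statements:
        scores[book].append(score)
    return {book: sum(values) / len(values) for book, values in scores.items()}

def most_popular(averages, limit=10):
    """Step 2: another site reuses those averages to rank the books."""
    return sorted(averages, key=averages.get, reverse=True)[:limit]

averages = average_ratings(ratings)
print(averages)
print(most_popular(averages))
```

Neither function "understands" books or opinions. Each just processes statements whose shape it recognizes, and the second builds on the output of the first.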

RDF is really quite simple. An RDF statement is a lot like a simple sentence, except that almost all the words are URIs. Each RDF statement has three parts: a subject, a predicate and an object. Let's look at a simple RDF statement:

<http://aaronsw.com/> <http://love.example.org/terms/reallyLikes> <http://www.w3.org/People/Berners-Lee/Weaving/> .

As you might have guessed, that says that I really like Weaving the Web. You may notice that RDF statements can say practically anything, and that it doesn't matter who says them. There is no one official website that says everything about Weaving the Web, or about me. Instead, this information is spread across the Web. Two people can even say contradictory things -- Bob can say that Aaron loves Weaving the Web and John can say that Aaron hates it. This is the freedom that the Web provides.

The statement above is written in Notation3, a language that allows you to write simple RDF statements. However, the official RDF specification defines an XML representation of RDF:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:love="http://love.example.org/terms/"
>
    <rdf:Description rdf:about="http://aaronsw.com/">
        <love:reallyLikes rdf:resource="http://www.w3.org/People/Berners-Lee/Weaving/" />
    </rdf:Description>
</rdf:RDF>
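To show that the XML form carries the same subject, predicate and object, here is a minimal sketch that pulls the statement back out using Python's xml.etree.ElementTree. It handles only the one pattern used in this example (an rdf:Description with rdf:about, containing properties with rdf:resource) and is nowhere near a real RDF parser.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_xml = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:love="http://love.example.org/terms/"
>
    <rdf:Description rdf:about="http://aaronsw.com/">
        <love:reallyLikes rdf:resource="http://www.w3.org/People/Berners-Lee/Weaving/" />
    </rdf:Description>
</rdf:RDF>"""

def simple_triples(xml_text):
    """Extract (subject, predicate, object) from the one pattern shown above."""
    triples = []
    root = ET.fromstring(xml_text)
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for prop in description:
            # "{namespace}local" becomes the predicate URI namespace + local.
            namespace, local = prop.tag[1:].split("}")
            obj = prop.get(f"{{{RDF}}}resource")
            triples.append((subject, namespace + local, obj))
    return triples

triples = simple_triples(rdf_xml)
print(triples)
```

The three URIs that come out are the same three that appear in the one-line Notation3 version; only the wrapping differs.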

Now, to write RDF like that is not the easiest thing in the world, and it seems unlikely that everyone will start speaking this strange new language anytime soon. So where do we expect all this RDF information to come from? The most likely source is databases.

In the world there are thousands of databases, most containing interesting machine-processable information. Governments store arrest records in databases; companies store part and inventory information in a database; most computerized address books store people's names and phone numbers in ... you guessed it! ... a database. When information is stored in a database, it's very easy to ask the computer certain questions about the data: "Show me everyone that was arrested in the past 6 months." "Print a list of all parts we're running low on." "Get me the phone numbers of the people whose last name is Jones."

RDF is ideally suited for publishing these databases to the Web. And when we put them on the Web, we give everything in the database a URI, so that other people can talk about it too. Now, intelligent programs can begin to fit the data together. Using the available information, the computer can begin to connect the Bob Jones whose phone number is in your address book with the Bob Jones who was arrested last week and the Bob Jones who just ordered 100,000 widgets. Now, we can ask questions of all these databases at once: "Get me the phone number of everyone who ordered more than 1,000 widgets and was arrested in the last 6 months."
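As a toy illustration of that last query, the sketch below merges three tiny "databases" and asks one question across all of them. The URIs, predicate names and numbers are all hypothetical, and it assumes the easy case where every database already uses the same URI for the same person. Working out that three differently named Bob Joneses are one person is the harder problem.

```python
# Three independent "databases", each published as (subject, predicate, object)
# statements. All URIs, predicate names and values here are hypothetical.
BOB = "http://people.example.org/bob-jones"
SUE = "http://people.example.org/sue-smith"

address_book = [(BOB, "phone", "555-0100"), (SUE, "phone", "555-0199")]
arrest_records = [(BOB, "arrestedDaysAgo", 7)]
order_system = [(BOB, "widgetsOrdered", 100000), (SUE, "widgetsOrdered", 5000)]

# Because every database uses the same URI for the same person,
# merging them is just concatenation.
merged = address_book + arrest_records + order_system

def values(subject, predicate):
    """All objects stated for this subject and predicate, from any source."""
    return [o for s, p, o in merged if s == subject and p == predicate]

def big_buyers_recently_arrested(min_widgets=1000, max_days=183):
    """Phone numbers of people who ordered a lot and were arrested recently."""
    phones = []
    for person in sorted({s for s, _, _ in merged}):
        ordered = any(n > min_widgets for n in values(person, "widgetsOrdered"))
        arrested = any(d <= max_days for d in values(person, "arrestedDaysAgo"))
        if ordered and arrested:
            phones.extend(values(person, "phone"))
    return phones

print(big_buyers_recently_arrested())
```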

Schemas and Ontologies: RDF Schemas and DAML+OIL

All the work on databases assumes that the data is nearly perfect. Few (if any) database systems are ready for the messiness of the Web. Any system that is "hard-coded" to understand certain terms will likely go out of date, or at least have limited usefulness, as new terms are invented and defined. What if someone comes up with a new system that rates books on a scale of 1-10 instead of just saying that someone "reallyLikes" them? Programs built on the old system won't be able to process the new information.

Worse, there's no way for a computer or human to figure out what a specific term means, or how it should be used. All these URIs are useless if we never describe what they mean. This is where schemas and ontologies come in. A schema and an ontology are ways to describe the meaning and relationships of terms. This description (in RDF, of course) helps computer systems use terms more easily, and decide how to convert between them.

Two closely related systems, RDF Schemas and the DARPA Agent Markup Language with Ontology Inference Layer (DAML+OIL), have been developed to solve this problem. For example, a schema might state that:

    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # An author is a type of contributor:
    dc:author rdfs:subPropertyOf dc:contributor .

Let's say for example that you build a program to collect the authors and contributors to various documents. It uses this vocabulary to understand the information it finds. One day, a vast influx of newbies from AOL start creating RDF documents. None of them know about dc:author, so they make up their own term: ed:hasAuthor.

    # The old way:
    <http://aaronsw.com/> is dc:author of <http://logicerror.com/semanticWeb-long> .
   
    # The new way:
    <http://logicerror.com/semanticWeb-long> ed:hasAuthor <http://aaronsw.com/> .

Normally, your program would simply ignore these new statements, since it can't understand them. However, one kind soul was smart enough to bridge the gap between these two worlds, by providing information on how to convert between them:

    # [X dc:author Y] is the same as [Y ed:hasAuthor X]
    dc:author daml:inverseOf ed:hasAuthor .

Since your program understands DAML ontologies, it can now take this information and use it to process all of the hasAuthor statements it couldn't understand before.
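A program that "understands" an inverse rule might work roughly like the sketch below. It follows the rule exactly as the comment above words it: [X dc:author Y] is the same as [Y ed:hasAuthor X]. The full URI for ed:hasAuthor and the plain string standing in for the DAML inverse property are assumptions made for the example, and statements are simple tuples rather than real RDF.

```python
DC_AUTHOR = "http://purl.org/dc/elements/1.1/author"
ED_HAS_AUTHOR = "http://ed.example.org/terms/hasAuthor"  # hypothetical URI for ed:hasAuthor
INVERSE = "inverse"  # stand-in for the DAML inverse property

statements = {
    # A "new way" statement the program could not use before:
    ("http://logicerror.com/semanticWeb-long", ED_HAS_AUTHOR, "http://aaronsw.com/"),
    # The bridge provided by the kind soul:
    (DC_AUTHOR, INVERSE, ED_HAS_AUTHOR),
}

def apply_inverses(known):
    """Return the known statements plus everything the inverse rules imply."""
    inverses = {}
    for s, p, o in known:
        if p == INVERSE:
            inverses[s] = o  # the rule works in both directions
            inverses[o] = s
    derived = set(known)
    for s, p, o in known:
        if p in inverses:
            derived.add((o, inverses[p], s))  # swap subject and object
    return derived

expanded = apply_inverses(statements)
authored = sorted(s for s in expanded if s[1] == DC_AUTHOR)
print(authored)
```

The program never needed a special case for ed:hasAuthor. It only needed to know what an inverse is, and the bridge statement did the rest.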

Logic

From this point on, we're discussing parts of the Semantic Web that haven't been developed yet. Unlike the rest, I'm not discussing specific systems, but instead a general concept that could become (and is becoming) many different systems.

While it's nice to have systems that understand these basic concepts (subclass, inverse, etc.) it would be even better if we could state any logical principles that we wanted to. We make logical statements (rules) that allow the computer to make inferences and deductions.

Here's an example: Let's say one company decides that if someone sells more than 100 of its products, then they are a member of the Super Salesman club. A smart program can now follow this rule to make a simple deduction: "John has sold 102 things, therefore John is a member of the Super Salesman club."
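Written as code, the rule is tiny. This is just a sketch: John's figure comes from the example, Mary is a made-up second salesperson who falls exactly on the boundary, and a real logic system would state the rule as data rather than as a Python function.

```python
# Sales figures as plain data. John's total follows the example above;
# Mary is a hypothetical salesperson sitting exactly on the threshold.
units_sold = {"John": 102, "Mary": 100}

def super_salesmen(sales, threshold=100):
    """Rule: anyone who has sold MORE than `threshold` products is a member."""
    return {person for person, count in sales.items() if count > threshold}

members = super_salesmen(units_sold)
print(members)
```

Mary stays out because the rule says "more than 100", not "100 or more".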

Proof

Once we begin to build systems that follow logic, it makes sense to use them to prove things. Different people all around the world can write logic statements, then your machine can follow these semantic "links" to begin to prove facts.

Example: Corporate sales records show that John has sold 55 widgets and 47 sprockets. The inventory system states that widgets and sprockets are both different company products. The built-in math rules state that 55 + 47 = 102 and that 102 is more than 100. And, as we know, someone who sells more than 100 products is a member of the Super Salesman club. The computer puts all these logical rules together into a proof that John is a Super Salesman.

[Figure: A diagram of the Semantic Web bus, courtesy of Tim Berners-Lee]

While it's very difficult to create these proofs (it can require following thousands, or perhaps millions of the links in the Semantic Web), it's very easy to check them. In this way, we begin to build a Web of information processors. Some of them merely provide data for others to use. Others are smarter, and can use this data to build rules. The smartest are "heuristic engines" which follow all these rules and statements to draw conclusions, and kindly place their results back on the Web as proofs, as well as plain old data.
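Here is a toy version of the "easy to check" half, using the Super Salesman example. The proof format (a list of steps, each naming a rule, its premises and its conclusion) and the rule names are invented for this sketch. The checker never searches for anything; it only confirms that each step follows from the records or from earlier steps.

```python
# Data published by different systems (figures from the example above).
sales_records = {("John", "widgets"): 55, ("John", "sprockets"): 47}
company_products = {"widgets", "sprockets"}

# A proof is a list of steps: (rule name, premises, conclusion).
proof = [
    ("lookup", [], ("sold", "John", "widgets", 55)),
    ("lookup", [], ("sold", "John", "sprockets", 47)),
    ("add", [("sold", "John", "widgets", 55), ("sold", "John", "sprockets", 47)],
     ("total", "John", 102)),
    ("club-rule", [("total", "John", 102)], ("member", "John")),
]

def check(steps):
    """Accept the proof only if every step follows from what came before."""
    established = set()
    for rule, premises, conclusion in steps:
        if any(p not in established for p in premises):
            return False  # relies on something that was never shown
        if rule == "lookup":
            ok = (len(conclusion) == 4 and conclusion[0] == "sold"
                  and conclusion[2] in company_products
                  and sales_records.get(conclusion[1:3]) == conclusion[3])
        elif rule == "add":
            ok = (len(premises) > 0 and len(set(premises)) == len(premises)
                  and conclusion[0] == "total"
                  and all(p[0] == "sold" and p[1] == conclusion[1] for p in premises)
                  and sum(p[3] for p in premises) == conclusion[2])
        elif rule == "club-rule":
            ok = (len(premises) == 1 and premises[0][0] == "total"
                  and premises[0][2] > 100
                  and conclusion == ("member", premises[0][1]))
        else:
            ok = False  # unknown rules are never trusted
        if not ok:
            return False
        established.add(conclusion)
    return True

print(check(proof))
```

Checking takes one pass over four steps, however long it took some other machine to find them.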

Trust: Digital Signatures

Now you've probably been thinking that this whole plan is great, but rather useless if anyone can say anything. Who would trust anything from this system if anyone can say whatever they want? So you don't let me into your site? Ok, I just say I'm the King of the World and I have permission. Who's to stop me?

That's where Digital Signatures come in. Based on work in mathematics and cryptography, digital signatures provide proof that a certain person wrote (or agrees with) a document or statement. Aha! So I digitally sign all of my RDF statements. That way, you can be sure that I wrote them (or at least vouch for their authenticity). Now, you simply tell your program whose signatures to trust and whose not to. Each person can set their own level of trust (or paranoia), and the computer can decide how much of what it reads to believe.
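To give a feel for the mathematics, here is a deliberately tiny, insecure sketch of one classic signature scheme (textbook RSA over a hash of the statement). The key is built from two small well-known primes purely so the example runs with Python's standard library; it is not how any particular Semantic Web system signs things, and real software should use a vetted cryptography library with much larger keys.

```python
import hashlib

# Toy key built from two small, well-known primes. Far too small to be secure.
P, Q = 2**61 - 1, 2**31 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

def digest(statement: str) -> int:
    """Hash the statement to a number smaller than the modulus."""
    raw = hashlib.sha256(statement.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N

def sign(statement: str) -> int:
    """Only the holder of the private exponent D can compute this value."""
    return pow(digest(statement), D, N)

def verify(statement: str, signature: int) -> bool:
    """Anyone who knows the public values N and E can check a signature."""
    return pow(signature, E, N) == digest(statement)

statement = ("<http://aaronsw.com/> <http://love.example.org/terms/reallyLikes> "
             "<http://www.w3.org/People/Berners-Lee/Weaving/> .")
signature = sign(statement)

print(verify(statement, signature))
print(verify(statement.replace("reallyLikes", "hates"), signature))
```

The second check uses a statement I never signed, so the signature should not match it. That is what stops someone from putting words in my mouth.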

Now it's highly unlikely that you'll trust enough people to make use of most of the things on the Web. That's where the "Web of Trust" comes in. You tell your computer that you trust your best friend, Robert. Robert happens to be a rather popular guy on the Net, and trusts quite a number of people. And of course, all the people he trusts, trust another set of people. Each of these measures of trust is to a certain degree (Robert can trust Wendy a whole lot, but Sally only a little).

In addition to trust, levels of distrust can be factored in. If your computer discovers a document which no one explicitly trusts, but no one has said is totally false either, it will probably trust that information a little more than one which many people have said is false.

The computer takes all these factors into account when deciding how trustworthy a piece of information is. It can combine all this information into a simple display (thumbs-up / thumbs-down) or a more complex explanation (a description of all the various trust factors involved).
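One simple way a computer might combine these degrees of trust is to multiply them along the chain of people. That choice is an assumption made for this sketch (there are many other ways to do it), as are the numbers; the names come from the example above.

```python
# Direct trust ratings between 0.0 (none) and 1.0 (complete).
# The people come from the example above; the numbers are invented.
trust = {
    "you": {"Robert": 0.9},
    "Robert": {"Wendy": 0.9, "Sally": 0.2},
}

def derived_trust(source, target, seen=frozenset()):
    """Best trust over any chain of people, multiplying ratings along the chain."""
    if source == target:
        return 1.0
    best = 0.0
    for friend, rating in trust.get(source, {}).items():
        if friend not in seen:  # never walk in circles
            onward = derived_trust(friend, target, seen | {source})
            best = max(best, rating * onward)
    return best

def verdict(person, threshold=0.5):
    """Collapse the number into the simple thumbs-up / thumbs-down display."""
    return "thumbs-up" if derived_trust("you", person) >= threshold else "thumbs-down"

for person in ("Robert", "Wendy", "Sally"):
    print(person, round(derived_trust("you", person), 3), verdict(person))
```

You never rated Wendy or Sally yourself. Their scores come entirely from what Robert says, discounted by how much you trust Robert.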

Conclusion: The Grand Vision

One of the best things about the Web is that it's so many different things to so many different people. Everyone can see something useful to them in the Semantic Web. Perhaps it's the fact that now your PDA, laptop, desktop, server and car can all begin to talk to each other. Perhaps it's the fact that corporate decisions that used to be hand-processed can now be automated. Perhaps it's the fact that it will become easier than ever to find the answers to your questions on the Web. Perhaps it's the fact that you can now discover how trustworthy a document on the Web is.

Whatever the cause, almost everyone can find a reason to support this grand vision of the Semantic Web. Sure, it's a long way from here to there -- and there's no guarantee we'll make it -- but we've made quite a bit of progress so far. The possibilities are endless, and even if we don't ever achieve all of them, the journey will most certainly be its own reward.

Acknowledgements

Thanks to Sean B. Palmer for looking over a first draft of this article.
