dive into mark


Friday, May 28, 2004

How to make a good ID in Atom

Introduction

Every Atom entry must have a globally unique ID, in the <id> element. This helps aggregators and directories keep track of an entry, even if it gets updated. Some aggregators redisplay changed entries; some don’t; some track changes over time. But before you can do any of these things, you need to uniquely identify the entry, and that’s what <id> is for.

There are three requirements for an Atom ID:

  1. The ID must be a valid URI, as defined by RFC 2396.
  2. The ID must be globally unique, across all Atom feeds, everywhere, for all time. This part is actually easier than it sounds.
  3. The ID must never, ever change.

There are several ways to construct an unchanging, globally unique URI, but some are better than others.

Why you shouldn’t use your permalink as an Atom ID

It’s valid to use your permalink URL as your <id>, but I discourage it because it can create confusion about which element should be treated as the permalink. Developers who don’t read specs will look at your Atom feed, see two identical pieces of information, and pick one to use as the permalink; some of them will pick incorrectly. Then they go to another feed where the two elements are not identical, and they get confused.

In Atom, <link rel="alternate"> is always the permalink of the entry. <id> is always a unique identifier for the entry. Both are required, but they serve different purposes. An entry ID should never change, even if the permalink changes.
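To make the distinction concrete, here is a rough sketch that builds an entry carrying both elements. It is not a complete, valid entry (other required elements are left out). The namespace is an assumption (the Atom 1.0 one; the drafts current when this was written used a different one), the title is made up, and build_entry is a hypothetical helper. The ID value is the tag: URI this article arrives at further down.

```python
import xml.etree.ElementTree as ET

# Assumption: the Atom 1.0 namespace. Earlier drafts used a different URI.
ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)


def build_entry(atom_id: str, permalink: str, title: str) -> ET.Element:
    """Build a minimal entry with both an <id> and a permalink <link>."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # The permalink: where readers go. It may change (and can be redirected).
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="alternate", href=permalink)
    # The identifier: how software recognizes the entry. It never changes.
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = atom_id
    return entry


entry = build_entry(
    "tag:diveintomark.org,2004-05-27:/archives/2004/05/27/howto-atom-linkblog",
    "http://diveintomark.org/archives/2004/05/27/howto-atom-linkblog",
    "A made-up title",  # hypothetical; not taken from the real post
)
print(ET.tostring(entry, encoding="unicode"))
```

If the permalink later changes, only the href passed to build_entry should change; the ID argument stays exactly as it was.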

“Permalink changes”? Yes, permalinks are not as permanent as you might think. Here’s an example that happened to me. My permalink URLs were automatically generated from the title of my entry, but then I updated an entry and changed the title. Guess what, the “permanent” link just changed! If you’re clever, you can use an HTTP redirect to redirect visitors from the old permalink to the new one (and I did). But you can’t redirect an ID.

The ID of an Atom entry must never change! Ideally, you should generate the ID of an entry once, and store it somewhere. If you’re auto-generating it time after time from data that changes over time, then the entry’s ID will change, which defeats the purpose.

Why you shouldn’t use a URN as an Atom ID

RFC 2141 defines a syntax for URNs. URNs are specifically designed to be used as globally unique identifiers. They are valid URIs. They look sort of like URLs you might type in a browser, but URNs are not designed to be clickable. They’re just structured identifiers.

So why not use them? Well, the main reason is that they require registration (described in RFC 3406). You can’t just use the domain name you’ve already registered; URN namespace registration is a separate process.

If you have a registered URN namespace, you can use it to generate Atom IDs. But if you haven’t registered one, you can’t just make up a URN and publish it. URNs don’t work that way.

How to construct an Atom ID using tag: URIs

However, there is an emerging standard that allows anyone to construct globally unique identifiers without additional registration: tag URIs. To construct a tag URI, you only need a domain name or an email address. (A subdomain works too.) For the purposes of this tutorial, I’m going to assume that you have your own domain name or subdomain, and that you don’t wish to publish your email address for spammers to scrape.

Start with your permalink URL. I’ll use http://diveintomark.org/archives/2004/05/27/howto-atom-linkblog, a real example of a recent post. Your permalink may look different; it may not contain a date; it may just use a numeric ID; it may contain a fragment identifier (with a # mark). That’s OK, you can make a tag: URI out of any URL.

  1. Discard everything before the domain name.

    Progress so far: diveintomark.org/archives/2004/05/27/howto-atom-linkblog

  2. Change all # characters to /

    Progress so far: unchanged

  3. Immediately after the domain name, insert a comma, then the year-month-day that the article was published, then a colon. Be sure to use a four-digit year, two-digit month, and two-digit day. Don’t forget the colon.

    Progress so far: diveintomark.org,2004-05-27:/archives/2004/05/27/howto-atom-linkblog

  4. Add tag: at the beginning. (Don’t add slashes after it, which is a common mistake; it’s just “tag:”.)

    Progress so far: tag:diveintomark.org,2004-05-27:/archives/2004/05/27/howto-atom-linkblog

That’s it! There are other ways to create valid tag: URIs, but this procedure works for any URL.
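The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full tag: URI validator: make_tag_uri is a hypothetical name, it assumes the permalink is an absolute URL containing "://", and it does no checking of the domain name itself.

```python
from datetime import date


def make_tag_uri(permalink: str, published: date) -> str:
    """Turn a permalink URL and a publication date into a tag: URI."""
    # Step 1: discard everything before the domain name.
    _, separator, rest = permalink.partition("://")
    if not separator:
        raise ValueError(f"expected an absolute URL, got {permalink!r}")
    # Step 2: change all # characters to /
    rest = rest.replace("#", "/")
    # Step 3: after the domain name, insert a comma, the date as
    # YYYY-MM-DD (isoformat zero-pads month and day), then a colon.
    domain, slash, path = rest.partition("/")
    # Step 4: add "tag:" at the beginning, with no slashes after it.
    return f"tag:{domain},{published.isoformat()}:{slash}{path}"


tag_uri = make_tag_uri(
    "http://diveintomark.org/archives/2004/05/27/howto-atom-linkblog",
    date(2004, 5, 27),
)
print(tag_uri)
```

Run on the example permalink, this should print the same tag: URI as step 4 above.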

The only potential problem here is that if your permalinks can change over time (for example, if they are based on title and you modify titles after posting, or if you change your permalink URL scheme entirely), you must not recompute the tag: URI when the permalink changes. Ideally, you should build the Atom ID once and then store it with the rest of the entry data. If this is not feasible, and if you cannot guarantee that your permalinks will never change, there are some other ways to build valid tag: URIs that you might want to consider instead.
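Here is one way "build it once and store it" might look. It is a sketch under assumptions: entries are plain dicts, the "atom_id", "published", and "path" keys are invented for illustration, and example.org stands in for your own domain. A real publishing system would persist the ID in its database alongside the entry.

```python
from datetime import date


def mint_tag_uri(domain: str, published: date, path: str) -> str:
    """Build a tag: URI from the entry's data as it stands right now."""
    return f"tag:{domain},{published.isoformat()}:{path}"


def ensure_atom_id(entry: dict, domain: str) -> str:
    """Return the entry's stored Atom ID, minting it only on first use."""
    if "atom_id" not in entry:
        entry["atom_id"] = mint_tag_uri(domain, entry["published"], entry["path"])
    return entry["atom_id"]


# Hypothetical entry whose permalink path is derived from its title.
entry = {"published": date(2004, 5, 27), "path": "/archives/2004/05/27/old-title"}
first = ensure_atom_id(entry, "example.org")

# Later the title changes, and the permalink path changes with it.
entry["path"] = "/archives/2004/05/27/new-title"
second = ensure_atom_id(entry, "example.org")

print(first == second)
```

The second call never looks at the new path, because the stored ID wins. That is the whole point: the permalink moved, the ID did not.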

Other ways to build valid Atom IDs

What if the same entry appears in multiple feeds?

If the same entry appears in two different feeds, it must have the same ID in both places. This is not an exception to the “globally unique” rule; it’s an integral part of it. An entry’s ID is the key for that entry across all time and space. If the same entry appears in two places, it must have the same ID in both places — otherwise it’s not really the same entry.

How could this happen?

In the case of multiple feeds produced by the same site, you just need to make sure that the way you are constructing your Atom IDs will generate the same ID in both places. Make sure the ID is not based on the URL of the feed in which it appears, or the category name, or some other data that is different between the feeds in which the entry appears.

In the case of sites that aggregate content from multiple sites, the aggregator script should preserve the original <id> element from the entries of each feed.
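For an aggregating site, preserving IDs mostly means resisting the urge to mint new ones. A minimal sketch, assuming each source feed has already been parsed into a list of dicts with an "id" key (the dict layout, the "source" key, and merge_feeds are all hypothetical):

```python
def merge_feeds(feeds: dict) -> list:
    """Combine entries from several source feeds into one aggregate list.

    `feeds` maps a source feed URL to its list of entry dicts. Every entry
    keeps the "id" its publisher assigned. An entry that shows up in more
    than one source is emitted once (a design choice, not a requirement).
    """
    merged = {}
    for source_url, entries in feeds.items():
        for entry in entries:
            # Key on the publisher's ID; never generate a replacement here.
            merged.setdefault(entry["id"], {**entry, "source": source_url})
    return list(merged.values())


feeds = {
    "http://example.org/feed-a": [
        {"id": "tag:example.org,2004-05-27:/a", "title": "A"},
    ],
    "http://example.net/feed-b": [
        {"id": "tag:example.org,2004-05-27:/a", "title": "A"},
        {"id": "tag:example.net,2004-05-28:/b", "title": "B"},
    ],
}
combined = merge_feeds(feeds)
print([entry["id"] for entry in combined])
```

The republished entries carry exactly the IDs they arrived with, so anyone subscribed to both the original feed and the aggregate can tell they are the same entry.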

How aggregators deal with duplicate entries is entirely up to them. If an entry appears in two feeds, and you’re subscribed to both feeds, some aggregators may display it in both places, but mark it as “read” in both feeds once you read it once. The behavior of client-side software is entirely up to the client-side developer. The only thing <id> does is try to give developers the ability to make those decisions without complicated and error-prone heuristics.
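As an illustration of the kind of decision <id> makes easy, here is a sketch of read-state tracking keyed on the ID alone. ReadTracker and the entry dicts are hypothetical, and this is just one behavior a client might choose.

```python
class ReadTracker:
    """Remember which entries have been read, keyed by Atom ID."""

    def __init__(self):
        self._read_ids = set()

    def mark_read(self, entry: dict) -> None:
        self._read_ids.add(entry["id"])

    def is_read(self, entry: dict) -> bool:
        return entry["id"] in self._read_ids


# The same entry as delivered by two different (hypothetical) feeds.
from_feed_a = {"id": "tag:example.org,2004-05-27:/a", "feed": "main"}
from_feed_b = {"id": "tag:example.org,2004-05-27:/a", "feed": "category"}

tracker = ReadTracker()
tracker.mark_read(from_feed_a)
print(tracker.is_read(from_feed_b))
```

No title matching or content hashing is involved; the copy from the second feed counts as read because it carries the same ID.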

Summary

An Atom ID is an unchanging, globally unique URI. All parts of that are important. If an entry’s ID changes over time, that defeats the purpose. If you’re reusing IDs for different entries, that really defeats the purpose. There are several techniques for constructing unchanging, globally unique URIs, and you should use whichever one is easiest for you.

© 2001-8 Mark Pilgrim