The following is a proposed standard for making articles on the Web more semantic. In our efforts to present quality content without the surrounding clutter, we've seen that the Web is a pretty messy place. We hope that some simple guidelines will help publishers make their content more presentable in Readability while also making the Web a bit more semantic.

By and large, you'll find that our guidelines just follow other specifications. We lean heavily on the work of the hNews microformat as well as the new elements provided within HTML5. If anything is unclear, please refer to the hNews microformat specification and to the guide to semantic elements in HTML5 from Mark Pilgrim's Dive Into HTML5.


The hNews Microformat

Readability recommends and parses the hNews microformat for Articles. To quote the microformat specification, “hNews is a microformat for news content. hNews extends hAtom, introducing a number of fields that more completely describe a journalistic work.”

Below are a few explanations of the hNews microformat and how Readability uses it. For full information, view the hNews spec at microformats.org.

hentry

The hentry class marks an Entry: the wrapper element within which all of the article's content is found.

entry-title

The entry-title class denotes the title of the Article. This is intentionally distinct from the title tag, which often differs because it also carries the organization name or SEO content.

entry-content

The entry-content class denotes what part of the article is the body content. Readability will use this as the body, if found.

entry-summary

The entry-summary class denotes the lede, subhead or dek of the Article. If this exists, it should be content distinct from the title or content of the article that gives a brief summary—one or two sentences—of the article itself.

byline

The byline vcard denotes who wrote the article, typically a person. The fn class within it denotes the person's full name. See the hCard specification for more information.

source-org

The source of the article is the organization or group backing the article. If the article is solely the work of an individual, that individual will suffice as the source (and you may add source-org to the author vcard).

source-org also follows the hCard spec.
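
As an illustration of the classes described above, here is a minimal sketch in Python. The sample markup, the names in it, and the HNewsExtractor helper are all hypothetical; this is not Readability's own parser, only one way a parser might read hNews class names using the standard library.

```python
from html.parser import HTMLParser

# Hypothetical sample markup using the hNews class names described above.
SAMPLE = """
<div class="hentry">
  <h1 class="entry-title">City Council Approves New Park</h1>
  <p class="entry-summary">The vote ends a two-year debate.</p>
  <p class="byline vcard">By <span class="fn">Jane Doe</span></p>
  <p class="source-org vcard"><span class="fn org">Example Times</span></p>
  <div class="entry-content"><p>The council voted 5-2 on Tuesday.</p></div>
</div>
"""

FIELDS = ("entry-title", "entry-summary", "entry-content", "byline", "source-org")
# Void elements never get an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HNewsExtractor(HTMLParser):
    """Collects the text found inside each hNews-classed element."""

    def __init__(self):
        super().__init__()
        self.fields = {}   # field name -> list of text chunks
        self.open = []     # stack of (field name, depth where it opened)
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.depth += 1
        classes = (dict(attrs).get("class") or "").split()
        for name in FIELDS:
            if name in classes:
                self.open.append((name, self.depth))
                self.fields.setdefault(name, [])

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Close every field that was opened at the current depth.
        while self.open and self.open[-1][1] == self.depth:
            self.open.pop()
        self.depth -= 1

    def handle_data(self, data):
        for name, _ in self.open:
            self.fields[name].append(data)


def extract_hnews(html):
    """Return a dict of field name -> whitespace-normalized text."""
    parser = HNewsExtractor()
    parser.feed(html)
    parser.close()
    return {k: " ".join("".join(v).split()) for k, v in parser.fields.items()}


if __name__ == "__main__":
    for field, text in extract_hnews(SAMPLE).items():
        print(f"{field}: {text}")
```

The sketch assumes well-formed markup. Note that the byline text includes everything inside the byline element, so a stricter parser would read the nested fn element to get only the author's name.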


HTML5 Recommendations

The capabilities in HTML5 afford a great deal more semanticity, which is very helpful when trying to understand content. These are a few guidelines that will help your markup be easily understandable as an article.

<article>

Use the article tag to wrap an entry. It's semantic and easy for Readability to spot.

<time>

We’ll be looking for time elements with the pubdate attribute within articles we process. This will help us understand when the article was published. To quote the HTML5 Working Draft, the pubdate attribute “is a boolean attribute. If specified, it indicates that the date and time given by the element is the publication date and time of the nearest ancestor article element, or, if the element has no ancestor article element, of the document as a whole.”
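
To make the pubdate behavior concrete, here is a minimal sketch in Python. The sample markup and the PubdateFinder helper are hypothetical, and the sketch only covers time elements inside an article element; it is not how Readability itself is implemented.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical sample: only the first <time> carries the pubdate attribute.
SAMPLE = """
<article>
  <h1>City Council Approves New Park</h1>
  <time datetime="2011-03-15T09:30:00" pubdate>March 15, 2011</time>
  <p>Updated <time datetime="2011-03-16T08:00:00">a day later</time>.</p>
</article>
"""


class PubdateFinder(HTMLParser):
    """Collects publication dates from <time pubdate> inside <article>."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.pubdates = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "time" and self.article_depth > 0:
            attr_map = dict(attrs)
            # A boolean attribute such as pubdate is reported with value None.
            if "pubdate" in attr_map and attr_map.get("datetime"):
                try:
                    self.pubdates.append(
                        datetime.fromisoformat(attr_map["datetime"]))
                except ValueError:
                    pass  # Not a value this sketch knows how to parse.

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


def find_pubdates(html):
    finder = PubdateFinder()
    finder.feed(html)
    finder.close()
    return finder.pubdates


if __name__ == "__main__":
    print(find_pubdates(SAMPLE))
```

Putting a machine-readable value in the datetime attribute matters here: the visible text ("March 15, 2011") is free-form, so a parser can only rely on the attribute.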

<aside>, <header>, <nav> and <footer>

By using these tags, you give parsers a big head start in figuring out what is not the primary content of the page.
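
The following Python sketch shows why these tags help. It is a deliberately naive illustration, with hypothetical sample markup and a hypothetical PrimaryText helper: it simply drops any text found inside the four elements named above.

```python
from html.parser import HTMLParser

# Hypothetical page: only the <article> holds primary content.
SAMPLE = """
<header><h1>Example Times</h1></header>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<article><p>The council voted 5-2 on Tuesday.</p></article>
<aside><p>Most read stories</p></aside>
<footer><p>Contact us</p></footer>
"""

NON_CONTENT = {"aside", "header", "nav", "footer"}


class PrimaryText(HTMLParser):
    """Keeps text that is not inside aside, header, nav or footer."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many non-content elements are open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NON_CONTENT:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_CONTENT and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def primary_text(html):
    parser = PrimaryText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(primary_text(SAMPLE))
```

A real parser would need more care. For example, a header nested inside an article may hold the article's title, which this sketch would discard.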

<figure> and <figcaption>

These tags should be used for media related to an article. This allows us to pull media nicely into an article's flow. The media is most typically images, but other media is also allowed, as per the W3C spec: “The element can thus be used to annotate illustrations, diagrams, photos, code listings, etc.” Please note that Readability may strip content such as Flash and images, depending on user preference.
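
As a rough illustration, the Python sketch below pairs each figure's media with its caption. The sample markup and the FigureCollector helper are hypothetical, and the sketch assumes figures are not nested and that media appears as img, video or audio elements with a src attribute.

```python
from html.parser import HTMLParser

# Hypothetical article with one captioned image.
SAMPLE = """
<article>
  <p>The park opens next spring.</p>
  <figure>
    <img src="park.jpg" alt="Artist's rendering of the park">
    <figcaption>An artist's rendering of the planned park.</figcaption>
  </figure>
</article>
"""


class FigureCollector(HTMLParser):
    """Collects media sources and caption text for each <figure>."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self.current = None      # the figure being read, if any
        self.in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.current = {"media": [], "caption": ""}
        elif self.current is not None:
            if tag == "figcaption":
                self.in_caption = True
            elif tag in ("img", "video", "audio"):
                self.current["media"].append(dict(attrs).get("src"))

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self.in_caption = False
        elif tag == "figure" and self.current is not None:
            # Normalize whitespace in the caption before storing the figure.
            self.current["caption"] = " ".join(self.current["caption"].split())
            self.figures.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is not None and self.in_caption:
            self.current["caption"] += data


def collect_figures(html):
    collector = FigureCollector()
    collector.feed(html)
    collector.close()
    return collector.figures


if __name__ == "__main__":
    print(collect_figures(SAMPLE))
```

Because the caption lives inside the same figure element as the media, a parser can keep the two together when it moves or strips the media.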


Readability-Specific Directives

These are guidelines specifically created to help Readability or similar parsers with your content. They may not have much semantic value outside of a parser context.

.entry-unrelated

This is a special class that explicitly tells Readability (and other parsers) to ignore the content within it. It can be used on any element.

.entry-content-asset

This is a special class that explicitly tells Readability (and other parsers) that the content within it is related to the article's content. This is particularly useful in cases where you have content that should be an asset in a figure tag, but you can't yet switch to HTML5.

.comment

A comment class helps Readability filter reader comments out of the article text (or, in the future, display them).
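
The three classes above could be honored by a parser along the following lines. This Python sketch is an assumption about how such filtering might work, not a description of Readability's implementation; the sample markup and the DirectiveFilter helper are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical body markup using the three classes described above.
SAMPLE = """
<div class="entry-content">
  <p>The council voted 5-2 on Tuesday.</p>
  <div class="entry-unrelated"><p>Sign up for our newsletter!</p></div>
  <div class="entry-content-asset"><img src="vote.png" alt="Vote tally"></div>
  <div class="comment"><p>First!</p></div>
</div>
"""

IGNORED = {"entry-unrelated", "comment"}
# Void elements never get an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DirectiveFilter(HTMLParser):
    """Drops ignored subtrees and records images marked as assets."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.ignore_from = None  # depth where an ignored subtree began
        self.asset_from = None   # depth where an asset subtree began
        self.text = []
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        if tag not in VOID:
            self.depth += 1
            if self.ignore_from is None and classes & IGNORED:
                self.ignore_from = self.depth
            elif self.asset_from is None and "entry-content-asset" in classes:
                self.asset_from = self.depth
        # Record images that sit inside an asset and outside ignored content.
        if (tag == "img" and self.ignore_from is None
                and self.asset_from is not None):
            self.assets.append(attr_map.get("src"))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.ignore_from == self.depth:
            self.ignore_from = None
        if self.asset_from == self.depth:
            self.asset_from = None
        self.depth -= 1

    def handle_data(self, data):
        if self.ignore_from is None and data.strip():
            self.text.append(data.strip())


def filter_directives(html):
    """Return (kept text chunks, asset image sources)."""
    parser = DirectiveFilter()
    parser.feed(html)
    parser.close()
    return parser.text, parser.assets


if __name__ == "__main__":
    print(filter_directives(SAMPLE))
```

The sketch assumes well-formed markup with matching end tags. It treats the comment class the same way as entry-unrelated, which matches the filtering use described above but not the possible future display of comments.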