Authoring HTML: Handling Right-to-left Scripts

This document provides advice on using HTML markup and CSS style sheets to create pages for languages that use right-to-left scripts, such as Arabic, Hebrew, Persian, Thaana and Urdu. It explains how to create content in right-to-left scripts that builds on, but goes beyond, the Unicode bidirectional algorithm, and how to prepare content for localization into right-to-left scripts.

Introduction

Who should use this document?

All authors and producers of HTML and CSS who are working with text in a language that uses a right-to-left script, or whose content will be localized to a language that uses a right-to-left script.

This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant from the very start of development.

It is assumed that readers of this document are proficient in developing HTML and XHTML pages; this document limits itself to providing advice specifically related to internationalization.

How to use this document

If you don't know much about bidirectional text, you may find it useful to familiarise yourself with the concepts introduced in the tutorial Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts. That tutorial provides an overview of how to create pages in right-to-left scripts.

This document lists a number of do's and don'ts, which we will refer to as techniques, related to authoring pages in right-to-left scripts. Each technique is followed by a 'detail' link which provides further information. Where needed, you can get additional information and explanations by following the links to the appropriate section of the techniques index, listed alongside each section.

If a technique says 'consider', there are usually pros and cons involved in following the advice given, and you should follow the link to more detailed information to be sure you understand these. In some cases it may be that not all browsers support the features described. In other cases, it may be purely up to you to decide whether or not this is a good idea.

Important concepts

Bidirectional (bidi) text

'Bidirectional', or 'bidi', text typically refers to text written using a mixture of right-to-left and left-to-right scripts. For example, in Arabic and Hebrew text the content flows predominantly from right to left, but embedded numbers or text in other scripts (such as Latin script) still run left to right. Text in other languages, such as English, can also be bidirectional if it includes excerpts from languages such as Arabic and Hebrew.

Scripts such as Arabic and Hebrew, which are predominantly right-to-left in orientation, may be referred to as 'RTL' (right-to-left) scripts.

Several languages use the Arabic script, such as Urdu and Persian. Several other scripts run predominantly right-to-left: these include Thaana, N'ko, and Syriac, as well as other scripts no longer in common use, such as Cypriot, Phoenician and Kharoshthi.

Relationship between language and direction

Direction is a property of scripts, not language.

Be careful about assuming that information about directionality can be inferred from information about the language of the text, as this is not always true. There must be a one-to-one mapping between directionality and language for this to work, and there often isn't. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az is relevant for either.

In addition, when using directional markup inline, the markup and the values of that markup do not necessarily coincide with language declarations.

Also, markup used to indicate directionality has values that indicate that the normal directionality should be overridden; it is not possible to indicate that using language related values.

In the same way, attributes indicating text direction in HTML do not, and should not, provide information about the language of text.

Although it is theoretically possible to infer direction correctly much of the time from language information (no browser does so at the time of writing), it is much better to use directional markup.
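As a rough illustration of keeping the two declarations separate, the fragment below marks up two paragraphs with the same language code and different base directions. The content is placeholder text, not real Azerbaijani: uppercase stands in for text in a right-to-left script, lowercase for text in Latin script.

```html
<!-- Same language code (az) in both paragraphs; only dir states the base direction. -->
<p lang="az" dir="ltr">azerbaijani text in latin script</p>
<p lang="az" dir="rtl">AZERBAIJANI TEXT IN ARABIC SCRIPT</p>
```

Neither attribute is derived from the other: lang describes the language of the text, and dir describes its base direction.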

Problems with bidirectional source text in markup

There is currently a lack of good editing environments for creating HTML pages using right-to-left scripts. Because HTML markup and escapes contain punctuation and strongly-typed letters, you are always working with bidirectional source text. If the editing application is not aware that the markup is not ordinary text (as is usually the case), it can produce some odd effects and make coding difficult.

This section simply mentions some of those problems, so that you are forewarned. It doesn't propose a full solution, but it does offer some advice which may help with problematic editing environments.

Working with markup

Unless your editor recognizes markup in source text as not being normal text, the strongly typed letters and punctuation in the markup will appear in places you wouldn't expect, and sometimes interfere with the order of the content itself.

If you are creating a large amount of right-to-left text, it makes sense to set the base direction of the editing window in your editor to right-to-left. This helps ensure that the content is correctly ordered. Unfortunately, this tends to increase the likelihood that your markup looks strange in the source text.

The following example shows some simple markup in a left-to-right context.

The source contains a p tag followed by a class attribute, followed by a title attribute with some Arabic text (العربي) as its value. The content of the paragraph itself (مشس هخصث خهس تخت تخهثز) starts with Arabic text. The resulting order in a left-to-right environment (where Arabic text is indicated by text in square brackets) is shown below.

Markup being rearranged in LTR source code

<p class="myclass" title="العربي">مشس هخصث خهس تخت تخهثز.

<p class="myclass" title="[paragraph_content]<"[title_value].

As the next example shows, things are hardly better if the overall context for the source code is right-to-left. In this case, the resulting order for the same source text can be seen here.

Markup being rearranged in RTL source code

<p class="myclass" title="العربي">مشس هخصث خهس تخت تخهثز.

[paragraph_content]<"[title_value]"=p class="myclass" title>.

Note, however, that this source will display correctly in a user agent. This is just a problem for reading and maintaining the source text.

The title attribute with Arabic text makes the situation much worse than normal in the above examples. The problem arises because there is only 'punctuation' (i.e. the quote and angle bracket) between two runs of strongly-typed right-to-left text, so the Unicode bidirectional algorithm considers this to be a single run of text.

It helps a little, if you can do it, to ensure that an attribute whose value uses left-to-right script text (in the example below, the class attribute) appears last. This makes the text in a left-to-right context look as expected, and in a right-to-left context it prevents the markup from interacting with the content, as the example below shows. There are still some issues, however: things are still a little jumbled, and the quotation marks are not where you would expect.

Separating RTL text in the source

<p title="العربي" class="myclass">مشس هخصث خهس تخت تخهثز.

[paragraph_content]<"class="myclass "[title_value]"=p title>.

It can also help to start the content on a new line, as in the example below, although this doesn't always help with inline markup. Also, you should try to avoid including white space before the closing markup, as this can lead to other problems.

Starting content after a new line

<p class="myclass" title="العربي">
مشس هخصث خهس تخت تخهثز.

If you are dealing with content that is predominantly in a right-to-left script, the ideal solution would be a source editor that recognizes markup as a special construct, and protects it to produce a sensible order for the characters in the source text. Not only that, but if your markup includes a dir attribute to change the directional context of the content, your editor should recognize this and produce a corresponding change in the order of the source code.

For small edits, if they are unable to find a bidi-aware editor, some authors actually prefer to use an editor that knows nothing about bidi. This means that they have to read the right-to-left content backwards, but at least makes it easier to locate and change the items they are interested in.

Adding escapes to the content

If you use a Unicode control character such as the RIGHT-TO-LEFT MARK (RLM) or ZERO WIDTH NON-JOINER (ZWNJ), you will not usually be able to see it in the source text, since it is invisible. For this reason you may think that a useful way to represent these characters is with the predefined HTML character entities, &rlm; and &zwnj;, or their numeric equivalents, &#x200F; and &#x200C;.
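A minimal sketch of how such escapes might appear in markup is shown below. The sentence is placeholder text in which uppercase stands in for right-to-left text stored in memory order; both paragraphs are intended to produce the same result.

```html
<!-- Named entity: the RLM keeps the exclamation mark with the right-to-left phrase. -->
<p>the phrase 'INTERNATIONALIZATION ACTIVITY!&rlm;' was found in this page.</p>

<!-- Numeric character reference for the same character, U+200F. -->
<p>the phrase 'INTERNATIONALIZATION ACTIVITY!&#x200F;' was found in this page.</p>
```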

Unfortunately, such an approach typically has its problems, too. As described in the previous section related to markup in source text, the strongly-typed left-to-right characters and non-alphabetic characters in the escapes will normally cause the Unicode bidirectional algorithm to display very odd looking source text.

Very few editors currently recognize, for example, the sequence of characters in &#x200F; as a single unit representing a character with a strong right-to-left direction. They treat it as ordinary text containing punctuation, numbers and two strongly-typed left-to-right characters (x and F), and apply the Unicode bidirectional algorithm to it as they would to any normal text.

The example below shows a typical view of source text after adding an escape to bidirectional text in right-to-left ordered source text. Focus on the constituent parts of the character escape itself, rather than the order of the Arabic text. The sequence &#x200F; is displayed as ;x200F#& when embedded in right-to-left text. At the beginning or end of embedded English text the escape is broken into fragments, and appears as x200F;text in english#& or ;text in english&#x200F, respectively.

Note that the source will still display correctly in a user agent. This is just a problem for reading and maintaining the source text.

Escape sequences rearranged in RTL source code.

مشس&#x200F; هخصث خهس text in english تخت تخهثز.

مشس هخصث خهس &#x200F;text in english تخت تخهثز.

مشس هخصث خهس text in english&#x200F; تخت تخهثز.

Various approaches are possible, if you want to avoid using characters that are invisible in your source code:

use an editor that recognizes an escape as a single unit representing a RLM/LRM character and produces the expected effect on the surrounding source text
use an editor that provides a symbolic visual representation of the RLM/LRM character, so that you don't lose sight of it
break the source code line around the escape - works in some cases

Otherwise, you just have to learn to live with the undesirable reordering effects for escapes.

Example source text in Internationalization Activity articles

Given the discussion above, representing source text in examples can be quite difficult. Should we show source text in right-to-left order, or left-to-right? Should we assume that the editor recognizes and handles markup and escapes as separate entities from the content, and create source fragments that look like that – or should we show source as it really looks for many people who don't have such clever editors? And particularly, should we assume that the bidirectional algorithm is properly applied in the source editor, picking up cues from the markup, or not?

In most of our articles right-to-left text in code samples is represented by UPPERCASE TRANSLATIONS, and left-to-right text by lowercase. In this case, text in code samples reflects the direction of characters as stored in memory, rather than the displayed result. The original version of text in uppercase translations would be read from right-to-left.

An example of how source text examples are written.

In the following fragment of source code the upper case characters represent the text written with a right-to-left script. What you see is a translation of the text, written from left to right. All the other text in the example is in Latin script, and so is written in lower case. The punctuation is also arranged from left to right, reflecting the order of text in memory rather than the order when displayed.

the phrase 'INTERNATIONALIZATION ACTIVITY!' was found in this page.

In real life, the actual text in this example would look as follows, where the Hebrew text between the quotes is read right-to-left:

The phrase 'פעילות הבינאום!' was found on this page.

Setting up a right-to-left page

Only use bidi markup to set the base direction for the document as a whole, or where you need to change the base direction. detail

Add dir="rtl" to the html tag any time the overall document direction is right-to-left. detail

Don't add dir="rtl" to the body tag. detail

If you need to avoid the scroll bar moving on some browsers, put dir on the head element and a div just inside the body element. detail

Use logical order, not visual ordering for Hebrew, and choose an appropriate encoding. detail

Except for very rare circumstances you should always use the Unicode encoding UTF-8. detail

If you have to use an ISO encoding for a Hebrew page, declare the encoding as ISO-8859-8-i rather than ISO-8859-8. detail

Do not use CSS styling to control directionality in HTML. Use markup. detail
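Taken together, the techniques above might produce a page skeleton along the following lines. This is a sketch only: the language tag and the placeholder text (uppercase stands in for Arabic text stored in logical order) are assumptions for illustration.

```html
<!DOCTYPE html>
<!-- The base direction is declared once, on the html element, not on body or in CSS. -->
<html lang="ar" dir="rtl">
<head>
  <!-- UTF-8 encoding, with the text stored in logical order. -->
  <meta charset="utf-8">
  <title>PAGE TITLE</title>
</head>
<body>
  <!-- No dir here: the paragraph inherits the right-to-left base direction. -->
  <p>PARAGRAPH TEXT.</p>
</body>
</html>
```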

Setting direction on block elements

Add the dir attribute to a block element to change base direction. detail

Don't use CSS or Unicode control characters to control directionality in HTML. Use markup. detail

Only use bidi markup to set the base direction for the document as a whole, or where you need to change the base direction. detail
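For instance, a left-to-right quotation inside a right-to-left page might be marked up as in the sketch below. It assumes an enclosing html element with dir="rtl", and uses placeholder text in which uppercase stands in for right-to-left text.

```html
<!-- Assumes an enclosing <html dir="rtl">. This paragraph inherits that direction. -->
<p>THIS PARAGRAPH NEEDS NO DIR ATTRIBUTE.</p>

<!-- The base direction changes here, so dir is added to the block element. -->
<blockquote lang="en" dir="ltr">
  <p>an english quotation, laid out left to right.</p>
</blockquote>

<!-- Back to the inherited direction: again, no dir attribute is needed. -->
<p>ANOTHER RIGHT-TO-LEFT PARAGRAPH.</p>
```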

Managing text direction in form controls

Add dir="auto" to input tags to automatically align text to the correct side of an input field. detail

Add dir="auto" to textarea and pre tags to make paragraphs align to the left or right according to the initial strong character. detail

Consider using the dirname attribute to pass information to the server about the direction of text in a text or search form control. detail
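The three techniques above could be combined in a form such as the following sketch. The action URL and the field names are hypothetical.

```html
<!-- The action URL and field names are hypothetical. -->
<form action="/comments" method="post">
  <!-- dir="auto": the direction follows the first strong character typed.
       dirname: an extra field named comment.dir is submitted with the
       value ltr or rtl, alongside the comment field itself. -->
  <input type="text" name="comment" dir="auto" dirname="comment.dir">

  <!-- In a textarea, dir="auto" is applied to each paragraph separately. -->
  <textarea name="message" dir="auto"></textarea>

  <button type="submit">send</button>
</form>
```

On submission, the server would receive the comment text together with a comment.dir value it can store and reuse when the text is displayed again.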

Mixing text direction inline

If you know the phrase's direction, or can work it out for injected text, tightly wrap every opposite-direction phrase in markup. Add the CSS shim to your style sheet, and use the dir attribute on that markup. Be sure to nest markup to show the structure. detail

If you want to bullet-proof your code for browsers that don't support the CSS shim where tightly-wrapped text is followed inline by a number or a logically separate opposite-direction phrase, add &rlm; or &lrm; immediately after the phrase. detail
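The two techniques above can be sketched as follows. The style rules are an assumed form of the CSS shim (this section does not spell out the exact rules), and the content is placeholder text in which uppercase stands in for right-to-left text.

```html
<style>
  /* Assumed form of the shim: isolate every element that carries a dir attribute. */
  [dir] { unicode-bidi: isolate; }
  bdo[dir] { unicode-bidi: isolate-override; }
</style>

<!-- Nested markup: a right-to-left title that itself contains a left-to-right phrase. -->
<p>the title is <cite dir="rtl">AN INTRODUCTION TO <span dir="ltr">c++</span></cite> in arabic.</p>

<!-- Fallback for browsers without isolation: the context is left-to-right, so
     &lrm; follows the right-to-left phrase and keeps the number out of its run. -->
<ul>
  <li><span dir="rtl">PURPLE PIZZA</span>&lrm; - 3 reviews</li>
</ul>
```

The mark matches the direction of the surrounding context: &lrm; after a right-to-left phrase in left-to-right text, and &rlm; after a left-to-right phrase in right-to-left text.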

If you don't know the phrase's direction, i.e. unknown text that will be injected at run time, then either wrap the phrase in bdi (no dir attribute needed), or if the phrase is tightly wrapped by an element already, just add dir="auto" to that element. detail
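A minimal sketch of both options, assuming names that are injected at run time and may be in either direction (uppercase stands in for right-to-left text; the class name is hypothetical):

```html
<!-- Direction unknown until run time: bdi isolates the text and detects its direction. -->
<ul>
  <li><bdi>NAME INSERTED AT RUN TIME</bdi> - 3 reviews</li>
  <li><bdi>name inserted at run time</bdi> - 5 reviews</li>
</ul>

<!-- The text is already tightly wrapped in an element: add dir="auto" to that element. -->
<p>posted by <span class="author" dir="auto">NAME INSERTED AT RUN TIME</span> (2 comments)</p>
```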

Use Unicode control characters for bidirectional control only for attribute text or element text that allows no internal markup. detail

Consider using Unicode control characters to set the base direction around bidirectional text that will be displayed as tool tips, page titles, or on JavaScript dialog boxes. detail
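As a sketch of this technique, the fragment below wraps bidirectional text in a pair of control characters, written as numeric character references. The use of the embedding pair U+202B RIGHT-TO-LEFT EMBEDDING and U+202C POP DIRECTIONAL FORMATTING, rather than another pair of controls, is an assumption; uppercase stands in for right-to-left text.

```html
<!-- Page title: no markup is allowed inside, so U+202B opens a right-to-left
     embedding and U+202C closes it. -->
<title>&#x202B;PAGE TITLE WITH AN embedded PHRASE!&#x202C;</title>

<!-- Tool tip text in an attribute value, wrapped in the same way. -->
<p>hover over <span title="&#x202B;TOOL TIP WITH AN embedded PHRASE!&#x202C;">this text</span>.</p>
```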

Do not leave white space at the end of inline elements that mark a directional boundary. detail

Handling parentheses & other mirrored characters

Treat mirrored characters as if the word 'left' in the character name meant 'opening', and the word 'right' meant 'closing'. detail
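The following sketch illustrates the idea with parentheses in a right-to-left paragraph. Uppercase stands in for right-to-left text, shown in the order it is stored in memory.

```html
<!-- In memory, ( comes first and ) comes last, whatever the display direction. -->
<p dir="rtl">TEXT BEFORE (TEXT IN PARENTHESES) TEXT AFTER.</p>

<!-- The same order spelled out: U+0028 LEFT PARENTHESIS opens,
     U+0029 RIGHT PARENTHESIS closes. -->
<p dir="rtl">TEXT BEFORE &#x28;TEXT IN PARENTHESES&#x29; TEXT AFTER.</p>
```

When displayed right-to-left, the opening character appears at the right-hand end of the parenthetical, and its glyph is mirrored so that it still encloses the text.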

Overriding the Unicode bidirectional algorithm

Use the bdo element to force the directionality of a sequence of inline characters. detail
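One plausible use, sketched below, is showing the characters of a right-to-left word in the order they are stored in memory. Uppercase stands in for right-to-left text.

```html
<!-- bdo overrides the bidirectional algorithm: the characters are displayed
     strictly left to right, in their stored order. -->
<p>in memory, the word is stored as <bdo dir="ltr">HEBREW WORD</bdo>.</p>
```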
