Using Special Characters on Web Pages: Cross-Platform Considerations

by Bob Baumel

This article contains advice on using special characters in web pages so they can be seen correctly by anybody running a browser on a PC, Macintosh, or UNIX platform. This page concentrates on the ISO 8859-1 character set, also known as ISO Latin-1, which is still the primary character set for representing Western languages on the Internet. However, as Unicode has emerged as a universal character set for displaying nearly any language, I also have a Unicode Test Page which lets you see how various Unicode entities are displayed by your browser. On this page, I consider four types of problems:

  1. Microsoft Windows characters outside the Latin-1 set
  2. Latin-1 characters outside the Macintosh character set
  3. Confusion between degree sign and masculine ordinal
  4. Use of the non-breaking space character

The following graphic image from the World Wide Web Consortium shows the ISO Latin-1 characters along with their numeric codes (in decimal). (BTW, your browser is accessing this graphic directly from the W3C website, not from my site.)

[Image: Table of printable Latin-1 character codes]

Martin Ramsch's page about ISO 8859-1 is a nice reference on using this character set. Generally, any character can be inserted in three different ways: by embedding the 8-bit code in the HTML file, by using a numeric character reference, or by using a named entity. Here are a few examples:

Code       Numeric    Named
(decimal)  reference  entity    Description            Appearance
177        &#177;     &plusmn;  plus or minus          ±
181        &#181;     &micro;   micro sign             µ
227        &#227;     &atilde;  small a, tilde         ã
233        &#233;     &eacute;  small e, acute accent  é
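
That the three forms name the same character can be checked mechanically. The following is a minimal Python sketch, not part of the original article. It assumes only the standard library (`html.unescape` and the `latin-1` codec), and the variable names are illustrative.

```python
import html

# Three ways of writing the micro sign (Latin-1 code 181).
raw_byte = bytes([181])      # the 8-bit code as stored in the HTML file
numeric_ref = "&#181;"       # numeric character reference
named_entity = "&micro;"     # named entity

# Decode each form to the character a browser is expected to display.
from_byte = raw_byte.decode("latin-1")
from_numeric = html.unescape(numeric_ref)
from_named = html.unescape(named_entity)

# All three should be the same single character with code 181.
print(from_byte == from_numeric == from_named)
print(ord(from_byte))
```

A modern parser recognizes every Latin-1 entity name, so this sketch cannot reproduce the older-browser failures with named entities.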

For example, the micro sign can be entered by directly inserting the one-byte (8-bit) code whose value in decimal is 181, or by inserting the numeric reference &#181;, or the named entity &micro;. One observation: some older browsers (notably Netscape 2.x) do not recognize all the entity names; therefore, use of the numeric reference or 8-bit code tends to be more robust than the named entity for some characters. I will now discuss the four problems mentioned at the beginning of this article:

Microsoft Windows characters outside Latin-1

The character set used by the standard TrueType fonts in Microsoft Windows (Windows code page 1252) is a superset of ISO Latin-1. The following graphic image illustrates the Windows character set:

[Image: Microsoft Windows Character Set]

The above graphic is a screen capture from the Windows "Character Map" accessory, except that I've added numbers to the left showing the range of character codes (in decimal) corresponding to each row. Generally, each row of this Windows table corresponds to a column of the previous Latin-1 table, with one notable exception: the Windows set uses character codes 128-159 (at least, it uses 27 of these 32 codes), whereas the Latin-1 set regards these as "extended control codes" and doesn't assign characters to these codes.

Unfortunately, the Windows codes from 128 to 159 contain many characters people would like to use: typographic curly quotation marks and apostrophes, long dashes, the "TM" trademark symbol, etc. Many users of Windows-based PCs seem to assume that any character in the Windows set is legitimate to use in HTML, and indeed, they do display correctly in web browsers on Windows-based PCs. But results are less predictable on non-Windows computers. Here is what the codes from 128 to 159 look like in your browser:

Characters 128-159 displayed by YOUR browser
ƒ ˆŠŒ Ž ˜š œžŸ
ƒˆŠŒ Ž˜š œžŸ

Here the first row of 32 characters shows the result of using numeric references, while the second row shows the result of 8-bit characters inserted directly in the HTML file (the 8-bit codes are what you get when a PC user inserts the character using the Windows Character Map accessory).

If you are viewing this on a Windows-based PC, both rows display the same characters as shown for codes 128-159 in the Windows Character Set graphic (with the possible exception of the euro sign at code point 128, recently added to Windows-1252 in 1998). If you are viewing it with a recent Macintosh browser, you probably see most of the same characters as in the Windows Set, although with some differences. If you use a very old Macintosh browser, such as Netscape 1.1, you see entirely different characters.

If you are using a UNIX (X-Windows) system, you have a problem because X-Windows fonts contain strict implementations of the ISO 8859-1 character set; therefore, your fonts generally do not contain any of the Windows characters outside Latin-1. In fact, if you view this page using Netscape 3.x for UNIX, all you see in the above table are two rows of 32 blank spaces! Netscape 4.x for UNIX does attempt to render some of the Windows codes using available characters, but results are uneven and differ for the numeric references and 8-bit codes. For example, if a page contains a curly Windows apostrophe in the form of a numeric reference (&#146;), Netscape 4.x for UNIX does display an apostrophe. If a page contains a Windows apostrophe in the form of a raw 8-bit code (byte value 146), Netscape 4.x for UNIX displays a question mark.
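
Since the two forms behave differently, it helps to locate both in a file before publishing. Here is a rough Python sketch of such a check. The function name `find_windows_only_codes` and the sample bytes are hypothetical, and only decimal numeric references are matched.

```python
import re

# Decimal numeric references &#128; through &#159;.
C1_REFERENCE = re.compile(r"&#(12[89]|1[3-5][0-9]);")


def find_windows_only_codes(data: bytes):
    """Return sorted (offset, code, form) tuples for codes 128-159 in data."""
    hits = []
    # Raw 8-bit codes in the range 128-159.
    for offset, byte in enumerate(data):
        if 128 <= byte <= 159:
            hits.append((offset, byte, "8-bit"))
    # Decoding as Latin-1 maps each byte to one character, so the
    # character offsets below are also byte offsets.
    text = data.decode("latin-1")
    for match in C1_REFERENCE.finditer(text):
        hits.append((match.start(), int(match.group(1)), "numeric"))
    return sorted(hits)


# Hypothetical sample: one numeric reference and two raw 8-bit codes.
sample = b"It&#146;s a test \x97 don\x92t"
for offset, code, form in find_windows_only_codes(sample):
    print(offset, code, form)
```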

The moral is to avoid the Windows characters with codes 128-159 when composing web pages. Unfortunately, this is more easily said than done. Many word processing and page layout programs use "smart quotes" algorithms which automatically replace simple straight ASCII quotation marks and apostrophes with curly typographic versions, or replace doubled hyphens with long dashes. When documents prepared with such programs are converted to HTML, part of the process should be to "stupefy" the quotes: replace curly quotes and apostrophes with straight ASCII versions, replace long dashes with doubled hyphens, etc.
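
The "stupefy" step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a complete converter: the input is taken to be Windows code page 1252 bytes, the function name `stupefy` is hypothetical, and the replacements for the en dash, ellipsis, and trademark sign are my own choices (the text above specifies only quotes, apostrophes, and long dashes).

```python
# Windows-1252 codes in the 128-159 range with an obvious ASCII stand-in.
# Replacements other than quotes and the long dash are assumptions.
ASCII_REPLACEMENTS = {
    133: "...",   # horizontal ellipsis
    145: "'",     # left single quotation mark
    146: "'",     # right single quotation mark (curly apostrophe)
    147: '"',     # left double quotation mark
    148: '"',     # right double quotation mark
    150: "-",     # en dash
    151: "--",    # em dash (long dash)
    153: "(TM)",  # trade mark sign
}


def stupefy(data: bytes) -> bytes:
    """Replace Windows-only 8-bit codes with plain ASCII equivalents."""
    out = bytearray()
    for byte in data:
        if byte in ASCII_REPLACEMENTS:
            out += ASCII_REPLACEMENTS[byte].encode("ascii")
        else:
            out.append(byte)  # everything else passes through unchanged
    return bytes(out)


print(stupefy(b"\x93Don\x92t panic\x94 \x97 it\x92s fine"))
```

Codes in the 128-159 range that are not listed in the table pass through untouched, so a separate check for leftovers is still worthwhile.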

Latin-1 characters outside the Macintosh set

As described on a page by Alan Flavell, fourteen characters in the ISO Latin-1 character set have no corresponding characters in the Macintosh set. These are shown in the following table:

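Latin-1 characters outside Macintosh set

Numeric    Named                                          Actual   Actual  Actual
reference  entity    Description              Intended    numeric  8-bit   named
&#166;     &brvbar;  broken vertical bar      [image]     ¦        ¦       ¦
&#178;     &sup2;    superscript two          [image]     ²        ²       ²
&#179;     &sup3;    superscript three        [image]     ³        ³       ³
&#185;     &sup1;    superscript one          [image]     ¹        ¹       ¹
&#188;     &frac14;  fraction one-fourth      [image]     ¼        ¼       ¼
&#189;     &frac12;  fraction one-half        [image]     ½        ½       ½
&#190;     &frac34;  fraction three-fourths   [image]     ¾        ¾       ¾
&#208;     &ETH;     capital Eth, Icelandic   [image]     Ð        Ð       Ð
&#221;     &Yacute;  capital Y, acute accent  [image]     Ý        Ý       Ý
&#222;     &THORN;   capital THORN, Icelandic [image]     Þ        Þ       Þ
&#240;     &eth;     small eth, Icelandic     [image]     ð        ð       ð
&#253;     &yacute;  small y, acute accent    [image]     ý        ý       ý
&#254;     &thorn;   small thorn, Icelandic   [image]     þ        þ       þ
&#215;     &times;   multiply sign            [image]     ×        ×       ×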

The "Intended" column of the above table contains graphic images copied from the Latin-1 graphic at the beginning of this article, so it shows what each character is supposed to look like. The "Actual" columns show how your browser displays this character code using either the numeric reference, 8-bit character, or named entity. If you are viewing this page with a Windows or UNIX-based browser, the "Actual" columns show the same characters as the "Intended" column. But if you are viewing it on a Mac, none of the "Actual" characters match the "Intended" characters, and you'll see different results using different browsers.

Some Mac browsers try to substitute similar characters for some of these fourteen (e.g., non-broken vertical bar for the broken bar, letter "x" for the multiply sign, non-superscripted 1, 2 and 3 for the superscripted versions). Older Mac browsers showed other characters. Microsoft Internet Explorer for Mac is, at least, consistent in how it displays the named and numeric entities and 8-bit codes. Netscape is less consistent.

For the sake of Mac users, it is best to avoid all 14 of these characters, if possible. In some of these cases, there are obvious replacements; for example, you don't really need a fraction one-fourth in a single character, as you can always write 1/4. Of course, the missing Icelandic characters would be a problem if you need them.

For superscripts there's an easy solution: Use the HTML <sup> tag -- for example, the symbol for square metre should be written m<sup>2</sup>, which appears as "m" followed by a raised 2. Note that with the <sup> tag, you can represent any superscripts -- not just superscripted numerals 1, 2 and 3.
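
These substitutions can be applied automatically. The Python sketch below is only an illustration: the function name `make_mac_safe` is hypothetical, and the replacement strings (including the letter "x" for the multiply sign) are assumptions. The six Icelandic and accented letters have no obvious substitute, so they are left in place and reported instead.

```python
# The fourteen Latin-1 codes missing from the Macintosh set.
# None means this sketch knows no acceptable substitute.
MAC_UNSAFE = {
    166: "|",             # broken vertical bar
    178: "<sup>2</sup>",  # superscript two
    179: "<sup>3</sup>",  # superscript three
    185: "<sup>1</sup>",  # superscript one
    188: "1/4",           # fraction one-fourth
    189: "1/2",           # fraction one-half
    190: "3/4",           # fraction three-fourths
    215: "x",             # multiply sign
    208: None,            # capital Eth
    221: None,            # capital Y, acute accent
    222: None,            # capital THORN
    240: None,            # small eth
    253: None,            # small y, acute accent
    254: None,            # small thorn
}


def make_mac_safe(text: str):
    """Return (rewritten text, list of codes that could not be replaced)."""
    out = []
    unresolved = []
    for ch in text:
        replacement = MAC_UNSAFE.get(ord(ch), ch)
        if replacement is None:
            unresolved.append(ord(ch))  # report it, but keep the character
            out.append(ch)
        else:
            out.append(replacement)
    return "".join(out), unresolved


print(make_mac_safe("Area: 5 m\u00b2, \u00bd cup"))
```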

Confusion between degree sign and masculine ordinal

The degree sign and masculine ordinal are both good Latin-1 characters, supported on all platforms. Unfortunately, people often confuse them because in many fonts, especially on Microsoft Windows-based computers, these two characters look nearly identical. The following table contains more info on these two characters:

Numeric    Named
reference  entity  Description        Intended  Actual
&#176;     &deg;   degree sign        [image]   °
&#186;     &ordm;  masculine ordinal  [image]   º

As before, the "Intended" column contains graphic images copied from the Latin-1 graphic at the beginning of this article, while the "Actual" column shows how your browser displays the character. Ideally, the masculine ordinal should appear as a small raised "o" with an underline, as shown in the "Intended" column. Many of the standard fonts on both UNIX (X-Windows) and Macintosh systems display it correctly. However, on Microsoft Windows-based PCs, virtually all standard fonts (including Times New Roman, which is the usual default for web browsing) omit the underline, and the masculine ordinal is indistinguishable from the degree sign.

Thus, a Windows user who wishes to insert a degree sign into an HTML document may go to the Windows Character Map and, seeing two different characters that look equally similar to a degree sign, has a 50% chance of picking the wrong one (masculine ordinal instead of degree sign). The result will indeed look like a degree sign to most other Windows users, but will look rather different to most UNIX and Mac users.

The solution is to consult a reference on the ISO Latin-1 character set, such as Martin Ramsch's page, and be careful to pick the character you want (e.g., the degree sign is character code 176, not 186).
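
When the two characters look alike on screen, their codes can still be told apart programmatically. The Python sketch below prints the standard Unicode names for codes 176 and 186 and then applies a heuristic repair. The heuristic is my own assumption, not a rule from any standard: a masculine ordinal that follows a digit and precedes a lone "C" or "F" is treated as a mistyped degree sign. The function name `fix_degree_signs` is hypothetical.

```python
import re
import unicodedata

DEGREE = chr(176)   # Latin-1 code 176
ORDINAL = chr(186)  # Latin-1 code 186

# The character database names make the difference explicit.
print(176, unicodedata.name(DEGREE))
print(186, unicodedata.name(ORDINAL))

# Heuristic (assumption): ordinal after a digit, followed by an optional
# space and a standalone C or F, was probably meant as a degree sign.
SUSPECT = re.compile(r"(?<=\d)" + ORDINAL + r"(?=\s?[CF]\b)")


def fix_degree_signs(text: str) -> str:
    """Replace suspicious masculine ordinals with degree signs."""
    return SUSPECT.sub(DEGREE, text)


print(fix_degree_signs("Water boils at 100\u00baC; the 1\u00ba prize"))
```

Genuine ordinals such as the one in "1º prize" are left alone, because no temperature unit follows them.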

Use of the non-breaking space character

In HTML, any "white space" can be a potential line-break, depending on the viewer's current font size and window width. Sometimes you want to make sure that line breaks will not occur at the locations of certain spaces in your text. For example, consider the statement: "The speed of light in vacuum is exactly 299 792 458 m/s" (which is true according to the current definition of the metre). The rules of correct SI notation require that three-digit groupings in long numbers be separated by spaces (not commas or points), and also require a space between the number and unit symbol. However, it might be very confusing to readers if a line-break should occur at one of these spaces.

One way to inhibit line breaks is by using the Latin-1 non-breaking space character (&#160; or &nbsp;). However, as we'll see, the result looks terrible when viewed by most Macintosh users. A second way to inhibit line breaks is by using the Netscape <NOBR> tag (NOBR stands for NO BReak). Here are examples of both approaches (to see exactly what I've done, use your browser's "View Source" function):

Using &nbsp; character: The speed of light is 299 792 458 m/s
Using <NOBR> tag: The speed of light is 299 792 458 m/s

For Windows and UNIX users, both methods appear to produce the same spacing. Here is what the above examples look like to most Mac users (though this problem seems to have been fixed in Netscape 4 for Mac):

Note added 1998-08-29: Although Netscape 4 for Mac has fixed this problem when using the &nbsp; named entity, they haven't fixed it when using the raw 8-bit non-breaking space:

Using nbsp in 8-bit form: The speed of light is 299 792 458 m/s

Note that when &nbsp; characters are used, they produce extra-wide spaces, which are clearly not acceptable. Actually, this problem is not inherent in the Mac operating system, but derives from the particular font that is used: In the standard Times font supplied with Mac system software, the non-breaking space is twice as wide as an ordinary space. Mac users may, if sufficiently knowledgeable, work around this problem by changing their default browsing font, or by installing a version of Times different from Apple's (such as the Times font supplied with Adobe Type Manager). However, web authors cannot expect users to have made such changes.

This suggests that NOBR tags be used instead of non-breaking spaces in cases of this type. I say this with some reluctance because NOBR isn't entirely "standard" HTML; it's an extension introduced by Netscape that was never accepted in the HTML standards. Presumably, the people who develop such standards rejected NOBR because they assumed that &nbsp; could do the same job. For example, the current HTML 4.0 standard includes the statement:

"Sometimes authors may want to prevent a line break from occurring between two words. The &nbsp; entity (&#160; or &#xA0;) acts as a space where user agents should not cause a line break."

(In all likelihood, the author of this statement never viewed the result on a Macintosh!)

On balance, I consider NOBR preferable to &nbsp; in such cases, even though NOBR isn't totally standard HTML. The NOBR tag is, in fact, supported by all recent versions of both Netscape and Internet Explorer. If you ever do encounter a browser that doesn't recognize NOBR, the worst that can happen is a line break where you'd rather not have one. Ultimately, you must ask yourself which is the worse evil: the possibility of an occasional inappropriate line break, or the certainty (if you use &nbsp;) that most Mac users will see ugly, extra-wide spaces where the &nbsp; characters are used.

Please note: Although I recommend against using &nbsp; in this type of case (to inhibit a line break between words), there are nevertheless cases where &nbsp; is entirely appropriate (for example, to hold an empty cell in a table), where the extra width of the non-breaking space on the Macintosh doesn't cause any harm.
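
For authors who generate pages from a script, the two approaches to the speed-of-light example can be sketched as follows. This is an illustrative Python sketch. The helper names `no_break_nobr` and `no_break_nbsp` are hypothetical, and the exact markup used in the examples above may have differed.

```python
import html


def no_break_nobr(phrase: str) -> str:
    """Wrap the phrase in NOBR tags, keeping ordinary spaces inside."""
    return "<NOBR>" + html.escape(phrase) + "</NOBR>"


def no_break_nbsp(phrase: str) -> str:
    """Replace each ordinary space with the &nbsp; named entity."""
    return html.escape(phrase).replace(" ", "&nbsp;")


speed = "299 792 458 m/s"
print("The speed of light is " + no_break_nobr(speed))
print("The speed of light is " + no_break_nbsp(speed))
```

The first form keeps ordinary spaces, so their width does not depend on how a font draws the non-breaking space. The second is the form that appears extra-wide with the standard Macintosh Times font.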



This page by Bob Baumel. Last revised 2000-01-17.