
The Pango connection: Part 1

Framework makes text layout possible in all languages

Tony Graham (tkg@menteith.com), Senior consultant, Mulberry Technologies
Tony Graham is the author of Unicode: A Primer, the first and currently only book about the Unicode Standard, Version 3.0, and its uses. An Australian, Tony is a Specialist member of the Unicode Consortium. He can be reached at tkg@menteith.com.

Summary:  Pango is an open-source framework for the layout and rendering of internationalized text, and is being included in the next generation of GTK+ and GNOME. In the first of a two-part series, Tony Graham introduces Pango and describes how it handles text, as well as the text attributes that you can specify for formatted text. The article concludes with a summary of Pango's processing pipeline for formatting and rendering a simple text string and a list of its attributes.

Date:  01 Mar 2001
Level:  Introductory

Pango is an open-source framework for the layout and rendering of internationalized text, including right-to-left scripts and scripts such as Tamil where glyphs are context-sensitive. Not surprisingly, Pango uses Unicode characters internally (represented using UTF-8), and Pango's interfaces also use UTF-8. Other encodings can be supported by using a translation library such as GNU iconv to convert the text to UTF-8 before processing.
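The conversion step mentioned above can be sketched in a few lines. This is only an illustration: Python's built-in codecs stand in for a library such as GNU iconv, and the helper name to_utf8 is made up for this example.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# "Größe" in Latin-1 uses one byte per character.
latin1_bytes = "Größe".encode("latin-1")

# After conversion, the two non-ASCII letters take two bytes each.
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```

The text itself is unchanged by the conversion; only its byte representation differs, which is why the conversion can happen once, before the text is handed to a UTF-8 interface.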

Pango is designed as a modular, cross-platform, cross-toolkit, low-level library that can be used in multiple contexts. It is also intimately related to the GTK+ and GNOME projects; the Pango project started because of the need for high-quality internationalized text in GTK+ and GNOME. While Pango can be used separately, the current Pango (0.13) is being included in the development branch 1.3.x versions of GTK+ that are currently under heavy development; Pango will ultimately be incorporated into GTK+ 2.0.

The name "Pango" comes from the Greek "Pan" (Παν), meaning "All," and the Japanese "Go" (語), meaning "Language."

What are GTK+ and GNOME?

GTK stands for GIMP Toolkit, and GTK+ is a library of functions that, among other things, give an object-oriented flavor to the lower-level functions in the GIMP Drawing Kit or GDK. GDK is a library of functions that simplify programming the low-level X library.

GNOME is the name of both a desktop environment and a programming library. A GNOME application uses the objects and functions defined in the GNOME library to interface with the desktop widgets. The application may also mix in calls to GTK+ functions, or even to GDK, X, or lower-level glib or C functions.

GTK+ and GNOME are object-oriented even though they are written in C. While object orientation is not intrinsic to C, the libraries achieve object orientation by convention: The structs representing objects each reference the struct representing their superclass, and objects' properties are changed and new objects are created by calling the appropriate methods. All this requires restraint on the part of the programmer, but the resulting code is more portable because nearly every platform has a C compiler. In addition, GNOME and GTK+ interface bindings from other languages -- both object-oriented and otherwise -- have been defined.

Why UTF-8?

Strings in Pango's interfaces are UTF-8 for four reasons: its compatibility with existing 8-bit software, its pervasiveness on UNIX platforms, its ability to handle characters outside Plane 0 without extra effort, and its independence from byte-order concerns.

Offsets into UTF-8 strings are counted in bytes, not characters. The Pango documentation acknowledges that UTF-8's variable length makes it harder to count characters in a string, but the documentation also notes that, in Unicode, any non-spacing marks in the string break any correspondence between character positions and strings, even for fixed-width encodings.
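A small sketch may make the difference between byte offsets and character positions concrete. It uses plain Python strings rather than any Pango call, and the helper byte_offset is a hypothetical name introduced only for this example.

```python
import unicodedata


def byte_offset(text: str, char_index: int) -> int:
    """Return the UTF-8 byte offset of the character at char_index."""
    return len(text[:char_index].encode("utf-8"))


sample = "Παν 語"                            # 5 characters
total_bytes = len(sample.encode("utf-8"))    # 2 + 2 + 2 + 1 + 3 bytes
offset_of_go = byte_offset(sample, 4)        # bytes that precede "語"

# A base letter followed by a non-spacing mark: two code points,
# but a reader sees a single accented letter.
decomposed = "e\u0301"
marks = [c for c in decomposed if unicodedata.combining(c)]
```

The last two lines illustrate the documentation's second point: counting code points does not tell you how many characters a reader sees, whatever the encoding.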

The Pango documentation also acknowledges that UTF-8 has a 50% overhead for CJKV ideographs compared to UTF-16.
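The 50% figure is easy to check for ideographs in Plane 0, which take three bytes in UTF-8 and two in UTF-16. The sketch below is only an arithmetic check using Python's codecs; the sample string is arbitrary.

```python
ideographs = "日本語"

utf8_len = len(ideographs.encode("utf-8"))       # 3 bytes per ideograph
utf16_len = len(ideographs.encode("utf-16-le"))  # 2 bytes per ideograph, no byte-order mark

# Extra space used by UTF-8, as a fraction of the UTF-16 size.
overhead = utf8_len / utf16_len - 1
```

Ideographs outside Plane 0 take four bytes in both encodings, so the overhead applies only to Plane 0 text.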


Single characters as UCS-4

Single characters are represented with 32 bits for planned upward compatibility with any characters to be defined in ISO/IEC 10646. While the ISO working group has recently committed to using only the same million or so code points covered by UTF-16, even that reduced range requires 21 bits, and 32 bits is still the next highest standardized word size.
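The bit count quoted above can be verified directly. This sketch assumes only that the highest code point reachable through UTF-16 is U+10FFFF; the variable names are illustrative.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point reachable through UTF-16

# Number of bits needed to store any value up to MAX_CODE_POINT.
bits_needed = MAX_CODE_POINT.bit_length()

# A single character held as a 32-bit (UCS-4) value, big-endian.
ucs4 = "語".encode("utf-32-be")
code_point = int.from_bytes(ucs4, "big")
```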


BiDi library

Pango uses Dov Grobgeld's FriBidi implementation of the Unicode bidirectional algorithm (see Resources). When Pango is compiled with the --with-fribidi option, it will use a copy of FriBidi that you provide; otherwise the copy in the Pango source is used. The minimal version included with Pango 0.13 is an older version that supports Unicode 2.1.8, whereas the latest FriBidi version as of this writing supports Unicode 3.0.1.


Language and other attributes

In addition to handling right-to-left text, Pango supports language tagging, so, for example, it will attempt to use a Japanese font for text marked as Japanese. Language tagging, like all Pango text attribute tagging, is a Pango-specific scheme. Language tagging does not use Unicode's Plane 14 language tags, nor does it relate to the xml:lang and html:lang attributes defined by the W3C, but those and other language markup schemes could easily be translated into Pango language attributes.

The complete set of Pango text attributes is shown in the following list:

  • Language
  • Font family: name of a font family or a comma-separated list of families
  • Style: normal, oblique, or italic
  • Weight: six possible values from ultralight to heavy
  • Variant: normal or small caps
  • Stretch: nine possible values from ultracondensed to ultraexpanded
  • Size: font size in thousandths of a point
  • Font description: shorthand label for a particular font family, style, variant, weight, stretch, and size
  • Foreground color
  • Background color
  • Underline: whether the text is underlined with a single, double, or low line
  • Strikethrough: whether the text is struck through
  • Rise: vertical displacement
  • Shape: shape to impose on a glyph
  • Scale

The following two figures show examples of Pango in action. Note the use of German, Greek, Hebrew, Japanese, and Arabic text in the first figure and the additional use of French, Korean, and Russian in the second. Labels and text boxes containing German and French are admittedly easy to achieve on most English or European computer systems, but it is much less common for a computer system to be able to handle those languages and the other languages shown in the figures in combination.


[Figure: Styled, multilanguage, and bidirectional text]

[Figure: Multiple languages in widget labels]

Marking up text attributes

The different attributes for a sequence of characters, including the language, are maintained separately from the text as a list of structures, one structure for each span of each attribute type. Every structure indicates a single attribute class and the start and end of the character range to which the class applies. Particular attribute types extend this with additional information; for example, the color attributes also record the red, green, and blue components of the color to apply to the span.
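The shape of such an attribute list can be sketched as follows. This is not Pango's C data structure: the class names, the field names, the use of 0-255 color components, and the sample string are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    kind: str    # attribute class, for example "underline"
    start: int   # offset where the span begins (inclusive)
    end: int     # offset where the span ends (exclusive)

    def covers(self, offset: int) -> bool:
        return self.start <= offset < self.end


@dataclass(frozen=True)
class ColorAttribute(Attribute):
    # Extra information carried only by the color attribute types.
    red: int = 0
    green: int = 0
    blue: int = 0


text = "car is THE CAR in arabic"
attributes = [
    Attribute("underline", 0, 7),
    ColorAttribute("foreground", 4, 14, red=0, green=0, blue=255),
]


def kinds_at(offset: int) -> list:
    """List the attribute classes that apply at a given offset."""
    return [a.kind for a in attributes if a.covers(offset)]
```

Because the list is stored apart from the text, the string itself stays plain, and overlapping spans such as the underline and the color above need no special handling.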

You can create the separate attribute list for some text (for example, for a widget label), but it can be a painstaking task when there are a lot of attribute changes. Also, as the Pango documentation notes, the character ranges in each attribute structure will surely be invalid for any later translation of the original attributed text.

As a convenience measure for translators in particular, Pango supports a simple HTML-like markup language for embedding attribute changes in the text, and it provides the pango_parse_markup() function for converting marked-up text into a plain string and a separate attribute list. The root element is <markup>, but it can be omitted. (You can omit both the start tag and the end tag, but omitting just one causes an error.)

The most versatile element, and the one that will have the most common use, is <span>. Like the HTML element with the same name, this marks a span of text, and its start tag may have the following attributes whose values will be translated into Pango text attribute values:

  • font_desc: A shorthand font description, such as "Sans Italic 12" (any other span attributes override this description)
  • font_family: A font family name
  • face: Synonym for the font_family attribute
  • size: Font size in thousandths of a point, a predefined absolute size keyword such as xx-small or xx-large, or one of the relative sizes smaller or larger
  • style: One of normal, oblique, or italic, corresponding to the allowed values of the style text attribute
  • weight: One of six keywords such as ultralight, normal, or heavy -- or a numeric weight
  • variant: normal or smallcaps
  • stretch: One of nine keywords such as ultracondensed, normal, and ultraexpanded that correspond to the allowed values of the stretch text attribute
  • foreground: An RGB color specification such as #00FF00 or a color name such as red
  • background: An RGB color specification such as #00FF00 or a color name such as red
  • underline: One of single, double, low, none
  • rise: Vertical displacement, in ten thousandths of an em. Can be negative for subscript, positive for superscript
  • strikethrough: true or false, whether to strike through the text
  • lang: A language code (for example, fr)

The markup language also includes a handful of convenience elements that do not have attributes:

  • <b>: bold
  • <big>: equivalent to <span size="larger">
  • <i>: italic
  • <s>: strikethrough
  • <sub>: subscript
  • <sup>: superscript
  • <small>: equivalent to <span size="smaller">
  • <tt>: monospace font
  • <u>: underline

Successive steps of the absolute and relative sizes of the size attribute, and the size increase or decrease from the <big> or <small> elements, are in the ratio 1:1.2 (or 1.2:1); this is the same as the CSS scale factor between its text sizes.
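The 1.2 ratio can be worked through numerically. In the sketch below, the intermediate keyword names are borrowed from CSS, and the choice of 12 points for "medium" is an assumption made for the example; only xx-small, xx-large, and the ratio itself come from the text above.

```python
SCALE = 1.2

# Keyword ladder; the names between xx-small and xx-large follow CSS (assumed).
KEYWORDS = ["xx-small", "x-small", "small", "medium", "large", "x-large", "xx-large"]


def keyword_size(keyword: str, medium: int = 12000) -> int:
    """Return a size in thousandths of a point for a size keyword.

    `medium` is the size assigned to the "medium" keyword (assumed: 12 pt).
    """
    steps = KEYWORDS.index(keyword) - KEYWORDS.index("medium")
    return round(medium * SCALE ** steps)


large = keyword_size("large")        # one step up from medium
small = keyword_size("small")        # one step down from medium
xx_large = keyword_size("xx-large")  # three steps up from medium
```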

The markup language is case-sensitive, unlike HTML (but like XML), and the only tags that can be omitted are the pair of the <markup> start tag and end tag.
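To show what converting marked-up text into a plain string and a separate attribute list involves, here is a toy converter. It is not pango_parse_markup(): it handles only <span> and four of the convenience elements, relies on Python's XML parser, and the name toy_parse_markup and the tuple layout are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Convenience elements mapped to (attribute, value) pairs; a partial, assumed table.
CONVENIENCE = {
    "b": ("weight", "bold"),
    "i": ("style", "italic"),
    "u": ("underline", "single"),
    "s": ("strikethrough", "true"),
}


def toy_parse_markup(markup: str):
    """Return (plain_text, attributes); offsets are UTF-8 byte offsets."""
    if not markup.lstrip().startswith("<markup>"):
        markup = "<markup>" + markup + "</markup>"   # the root element may be omitted
    root = ET.fromstring(markup)
    chunks, attrs = [], []
    pos = 0

    def emit(s):
        nonlocal pos
        if s:
            chunks.append(s)
            pos += len(s.encode("utf-8"))

    def walk(el):
        start = pos
        emit(el.text)
        for child in el:
            walk(child)
            emit(child.tail)
        if el.tag == "span":
            for name, value in el.attrib.items():
                attrs.append((name, value, start, pos))
        elif el.tag in CONVENIENCE:
            name, value = CONVENIENCE[el.tag]
            attrs.append((name, value, start, pos))

    walk(root)
    return "".join(chunks), attrs


plain, attrs = toy_parse_markup('<b>Παν</b> + <span lang="ja" foreground="blue">語</span>')
```

Because XML is case-sensitive, a tag such as <B> would simply not match any entry here, which mirrors the case sensitivity described above.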


In the pipeline

Pango implements formatting and rendering in a staged pipeline.

The following example adds markup to an example used in both Chapter 3 of The Unicode Standard, Version 3.0 and UAX #9 (see Resources). The uppercase text in the example stands for right-to-left text such as Arabic or Hebrew. The markup makes some of the text underlined, some of it blue, and some of it both underlined and blue.

<u>car </u><span foreground="blue"><u>is </u>THE CAR</span> in arabic

The effect of the markup is shown in the following table.

<table border="1"> <tr> <td>String</td> <td><code>car </code></td> <td><code>is </code></td> <td><code>THE CAR</code></td> <td><code> in arabic</code></td> </tr> <tr> <td>Foreground</td> <td> </td> <td colspan="2" align="center"><span style="color: blue">Blue</span></td> <td> </td> </tr> <tr> <td>Underline</td> <td colspan="2" align="center"><u>True</u></td> <td> </td> <td> </td> </tr> </table>

Itemization

The first step when laying out the text is to break the string into portions with consistent attributes, including consistent language tag, bidirectional category, color, etc.

Markup for the attributes is just a convenience feature, and the pipeline really begins with text and a list of Pango attributes, so step 0, as it were, is to call pango_parse_markup() with the above example as input. This returns a single string containing the text and a list of four Pango attributes -- one for each change in the attributes. The table below shows the spans.

<table border="1"> <tr> <td>String</td> <td><code>car </code></td> <td><code>is </code></td> <td><code>THE CAR</code></td> <td><code> in arabic</code></td> </tr> <tr> <td>Foreground</td> <td> </td> <td align="center"><span style="color: blue">Blue</span></td> <td align="center"><span style="color: blue">Blue</span></td> <td> </td> </tr> <tr> <td>Underline</td> <td align="center"><u>True</u></td> <td align="center"><u>True</u></td> <td> </td> <td> </td> </tr> <tr> <td>Bidi Level</td> <td align="center">0</td> <td align="center">0</td> <td align="center">1</td> <td align="center">0</td> </tr> </table>
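The splitting shown in the table can be sketched as cutting the string at every offset where some attribute starts or ends. This is a simplification, not Pango's itemizer: the bidi level is entered by hand as if it were just another span, and the function name and tuple layout are hypothetical. The sample text is ASCII, so byte and character offsets coincide.

```python
def itemize(text, spans):
    """Split text into items whose attributes are constant throughout.

    spans: list of (name, value, start, end) with end exclusive.
    """
    boundaries = sorted({0, len(text)} | {o for _, _, s, e in spans for o in (s, e)})
    items = []
    for start, end in zip(boundaries, boundaries[1:]):
        active = {name: value for name, value, s, e in spans if s <= start and end <= e}
        items.append((text[start:end], active))
    return items


text = "car is THE CAR in arabic"
spans = [
    ("underline", "single", 0, 7),
    ("foreground", "blue", 4, 14),
    ("bidi_level", 1, 7, 14),   # assumed by hand; level 0 is the default
]
items = itemize(text, spans)
```

Each resulting item corresponds to one column of the table: the same underline, color, and direction apply from its first character to its last.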

Reordering

The items are then reordered into visual order, as the following table shows. Remember that for the purposes of this example, uppercase text stands for right-to-left text such as Arabic or Hebrew.

The "Bidi Level" in the table is the Unicode bidirectional embedding level of the spans, where even numbers (including 0) indicate left-to-right text and odd numbers indicate right-to-left text. Bidi level is not recorded in Pango attributes, but it is calculated by the FriBidi library.

<table border="1"> <tr> <td>String</td> <td><code>car </code></td> <td><code>is </code></td> <td><code>RAC EHT</code></td> <td><code> in arabic</code></td> </tr> <tr> <td>Foreground</td> <td> </td> <td align="center"><span style="color: blue">Blue</span></td> <td align="center"><span style="color: blue">Blue</span></td> <td> </td> </tr> <tr> <td>Underline</td> <td align="center"><u>True</u></td> <td align="center"><u>True</u></td> <td> </td> <td> </td> </tr> <tr> <td>Bidi Level</td> <td align="center">0</td> <td align="center">0</td> <td align="center">1</td> <td align="center">0</td> </tr> </table>
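The reordering step can be illustrated with the reversal rule from UAX #9: working from the highest embedding level down to the lowest odd level, reverse every contiguous run of characters at that level or higher. The sketch below is a minimal version of that rule only. It takes the levels as given rather than computing them, ignores mirroring and combining marks, and is not FriBidi's or Pango's code.

```python
def reorder_visual(text, levels):
    """Reorder characters from logical to visual order, given per-character levels."""
    chars = list(text)
    levels = list(levels)
    if not levels:
        return ""
    highest = max(levels)
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                # Reverse the run; the levels travel with their characters.
                chars[i:j] = reversed(chars[i:j])
                levels[i:j] = reversed(levels[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


text = "car is THE CAR in arabic"
levels = [0] * 7 + [1] * 7 + [0] * 10   # levels taken from the table above
visual = reorder_visual(text, levels)
```

Only the level 1 run is reversed, so the left-to-right text on either side keeps its order.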

Glyph selection

Pango then selects the appropriate glyphs for the characters in each item.

Pango supports script-specific layout engines so, for example, Tamil glyph selection is done by the Tamil engine and Thai glyph selection is done by the Thai engine. There doesn't have to be one engine per script however, and, in practice, characters from the Basic Latin, Latin-1 Supplement, Greek, Cyrillic, and several other blocks are all handled by the "basic" engine.
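The idea of routing characters to script-specific engines can be sketched as a lookup by code point range. The ranges below are the Unicode Tamil and Thai blocks; the engine names as strings, the table, and the rule that everything else falls through to "basic" are assumptions for illustration, not Pango's actual engine selection.

```python
# (first code point, last code point, engine name)
ENGINE_RANGES = [
    (0x0B80, 0x0BFF, "tamil"),   # Tamil block
    (0x0E00, 0x0E7F, "thai"),    # Thai block
]


def engine_for(ch: str) -> str:
    """Pick an engine name for a single character."""
    cp = ord(ch)
    for first, last, name in ENGINE_RANGES:
        if first <= cp <= last:
            return name
    return "basic"   # assumed fallback for Latin, Greek, Cyrillic, and the rest
```

A table of this kind is what lets one engine serve many blocks while a complex script gets an engine of its own.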

Justification

The glyph strings are justified, for example, to the right or to the left as shown by some of the labels in the previous figure.

Rendering

The glyphs are rendered onto an output device. Pango is not a rendering system, but it does include a rendering routine for X fonts. Other output devices will require other, external rendering routines. The following table shows how the example might look when rendered.

<table border="1"> <tr> <td><code><u>car <span style="color: blue">is </span></u><span style="color: blue">RAC EHT</span> in arabic</code></td> </tr> </table>


More Pango

In the second installment, I'll show the code for the example and discuss how Pango selects glyphs and renders text.


Resources
