DIFFUSE: Dissemination of InFormal and Formal Useful Specifications and Experiences to research, technology development & demonstration communities


Guide to Internationalization / Localization

Project funded under the European Commission's 5th Framework IST Programme

This guide discusses the aspects that affect the globalization of applications, particularly for a multi-platform environment. The following subjects are covered:

  • Globalization
  • Cultural Diversity
  • Language / Character Set Aspects
  • Notation and Presentation Aspects
  • Matching and Browsing Aspects
  • Electronic Commerce Aspects
  • Systems Approach
  • Standardization Organizations

Globalization

Globalization of applications, i.e. their general availability in and for all relevant markets, is of utmost importance for the competitiveness of application providers and users alike. Cultural diversity has traditionally been an effective roadblock to this globalization, since uniformity is seen neither as a valid goal nor as an acceptable by-product. This diversity is, however, beginning to be taken into proper consideration and supported adequately, rather than being brushed aside.

Note: Standardization organizations tend to use the abbreviations "I18N" for Internationalization and "L10N" for Localization, where 18 and 10 indicate the number of letters omitted between the first and last letter of each word.

In the past, localization was seen primarily as translation, and it was often implemented after the fact, i.e. based on a completed, language-specific version. This has proved to be an expensive approach, often with less than satisfactory results. Furthermore, the aspects of cultural diversity that need specific support in ICT systems are not limited to language, which for certain applications may in fact be of relatively low importance.

Cultural Diversity

From a layman's point of view, cultural diversity is what makes us so interested in other people's everyday lives, whether on holiday or when eating out at an ethnic restaurant. Quite often, however, we do not even realize when and how our own ways differ from those of others. We only notice the difference when we ourselves cannot behave in the way we are used to.

An introductory article on cultural diversity in ICT can be found at the CEN/ISSS web site of the Cultural Diversity Steering Group, which is itself discussed later in this guide.

A more extensive document, European Culturally Specific ICT Requirements (CWA 14094), is also available at the CEN/ISSS web site, free of charge as part of the eEurope program.

Language / Character Set Aspects

Many of the language-related aspects of cultural diversity concern the set of characters required to support the orthography of a given language in a given country. Furthermore, in any given country or cultural environment, parallel support for multiple languages or, at the very minimum, support for names or loan words of foreign origin is often a clear requirement. Until recently, the number of characters that could be processed at once was severely limited, because of the small number of bits used to encode the various character sets. These aspects are discussed more thoroughly in the Diffuse Guide to Character Sets.

There are several indications that character encoding problems will disappear from future applications. For example, the Microsoft Word and Excel programs have internally used the ISO/IEC 10646 Universal Character Set (UCS/Unicode) to encode data since their '97 versions, quite independently of the character set that is apparent to the user. This, however, highlights a key problem: entering the data on the keyboard in a convenient, user-friendly manner. The problem can only be solved by providing a variety of specific, i.e. localized, keyboards for users to choose from. In several instances, these keyboards will have to differ at the hardware level to meet the requirements for specific keycap engravings. Such differences may be needed even for the same active character repertoire, e.g. where a minority language must be functionally supported for users of the majority language in addition to being supported as the first choice for native users. Nevertheless, all keyboard drivers will produce the same UCS/Unicode encoding, or a similar UCS Sequence Identifier (USI) encoding, for any given character.
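
To illustrate the point about keyboard drivers, the following small Python fragment (Python is used here purely for illustration; the guide itself prescribes no language) shows that a character reaches the application as the same UCS/Unicode code point regardless of how it was entered:

    # However a character is typed (localized keyboard, input method,
    # or escape sequence), it arrives as the same UCS code point.
    for ch in ("ä", "\u00e4", "\N{LATIN SMALL LETTER A WITH DIAERESIS}"):
        print(ch, hex(ord(ch)))   # each line prints: ä 0xe4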

The character repertoire must also be defined for each font collection. For a variety of reasons, users often prefer a given font for a given use in their environment. Normally, well-tailored fonts can only be produced for limited repertoires. Typically, a font that supports the full UCS/Unicode code set can effectively only be used for recording or drafting purposes.

In the long run, the above language-related character set issues will no longer affect application code, provided that the application and its databases use UCS/Unicode. In the meantime, however, it is important that the character set associated with each data element, i.e. its encoding, be known to the system in order for the data to be properly understood and processed. A number of international standards have been developed for information interchange, and these may also be used internally within systems, particularly in the Unix family of products. Various mechanisms have been developed to inform the receiving system of the encoding used. Any mismatch between the external code and the internal code will lead to loss of data; this tends to occur frequently when sending data from systems with proprietary character set encoding schemes (e.g. the PC code pages used by Microsoft and IBM). Systems that use UCS/Unicode internally may, in principle, accept any and all input, but they may encounter problems in forwarding data to a less capable system.
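
As a minimal sketch of this mismatch problem, the following Python fragment shows how bytes written under one encoding and read under another are silently corrupted, and how forwarding data to a less capable system loses information outright:

    text = "Müller"

    # Reading UTF-8 bytes as if they were Latin-1 produces mojibake:
    garbled = text.encode("utf-8").decode("latin-1")
    print(garbled)                            # MÃ¼ller

    # Forwarding to a system that only handles ASCII loses data:
    lossy = text.encode("ascii", errors="replace")
    print(lossy)                              # b'M?ller'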

Even in the long run, keyboard and font issues remain to be solved, but that can be done at the system level, essentially transparently to both applications and users. From a functionality point of view, different vendors could each solve these issues in the manner most suitable to them, but common solutions are required for user-friendly system interchangeability.

Most cultural environments apply their own processing rules for ordering data in a meaningful fashion. Such local rules can be seen as a thin top layer in an overlay structure whose bottom layer covers the full UCS/Unicode. In Europe, a layer covering the rules for pan-European applications that deal with the Multilingual European Subsets of UCS/Unicode would be applied, with further layers above it as needed to meet national or regional needs.
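
The following Python sketch illustrates such ordering rules; the locale name "de_DE.UTF-8" is an assumption and must be installed on the system for the call to succeed:

    import locale

    words = ["zebra", "Äpfel", "apple", "Österreich"]

    # Raw code-point order sorts Ä and Ö after z, which is wrong
    # for most European users:
    print(sorted(words))

    # Locale-aware collation applies the local ordering rules:
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))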

Notation and Presentation Aspects

The best-known notational problems concern the formats used for dates and times. Every possible order of the numeric day, month and year is used somewhere in the world. In particular, the two-digit notations used in North America and Europe may look alike, yet denote entirely different dates. Although standard notations have been developed for the date (yyyy-mm-dd), the time and the related from/to periods, they have met persistent disapproval and have not replaced the more established national schemes.
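
A small Python example makes the ambiguity concrete; the same date is rendered in the standard notation and in the two conflicting two-digit notations:

    from datetime import date

    d = date(2002, 3, 4)

    print(d.isoformat())           # 2002-03-04: the standard yyyy-mm-dd form
    print(d.strftime("%m/%d/%y"))  # 03/04/02: 4 March in North America,
                                   # but easily read as 3 April in Europe
    print(d.strftime("%d.%m.%y"))  # 04.03.02: 4 March in much of Europe,
                                   # but easily read as 3 April in North America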

Another extremely important notation problem is caused by the comma and the point being used with exactly opposite meanings within numbers: one serves as the decimal separator and the other as the thousands separator, or vice versa. The problem is less severe when both appear in the same number, but how should 1.025 and 1,025 be interpreted? In addition to the separators, the correct positioning of the minus sign and of currency symbols or codes also has cultural dependencies.

In both of the above notation problems, a solution is only possible if the internal notation is known and the external representation is adjusted to match each recipient's preference. This, however, means that the application program may neither use the data directly from any source nor render it into text for viewing or printing. Instead, the program must always use a system facility that is aware of the current user's preferences.
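
In Python, for instance, the locale module is such a facility: formatting and parsing go through the current user's LC_NUMERIC settings rather than through hard-coded separators. The locale names below are assumptions and must be installed on the system:

    import locale

    value = 1025.5

    locale.setlocale(locale.LC_NUMERIC, "en_US.UTF-8")
    print(locale.format_string("%.1f", value, grouping=True))  # 1,025.5
    print(locale.atof("1,025.5"))                              # 1025.5

    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
    print(locale.format_string("%.1f", value, grouping=True))  # 1.025,5
    print(locale.atof("1.025,5"))                              # 1025.5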

The required facilities exist in virtually all system platforms, but they have not been fully utilized by all application programs or users.

Matching and Browsing Aspects

The emergence of the worldwide web has magnified matching and browsing problems. Even once character set problems have been resolved, other language- and character-related problems remain. For example, the correct letters may not even be recognizable to those who need to use them, say when dealing with Icelandic names. In such instances, the use of fallbacks becomes mandatory. Simple fallbacks are also often required for accented letters: the diacritics may be routinely dropped because they are unfamiliar to the users, in order to avoid a multitude of errors. Established transliteration schemes cause further problems, especially for multilingual searches, since, for example, the same Russian name may properly be written quite differently in the different languages of Europe, all using the same Latin alphabet.
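
A common simple fallback, sketched below in Python, is to decompose each character and drop the combining marks; note that genuinely distinct letters such as the Icelandic Þ and ð survive untouched and would need a separate transliteration rule:

    import unicodedata

    def strip_diacritics(s):
        # Decompose (NFD), then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(strip_diacritics("Björk"))    # Bjork
    print(strip_diacritics("Þórður"))   # Þorður: Þ and ð are base letters,
                                        # not accented ones, so they remain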

Some of these problems may have common solutions that work for a number of applications. Such solutions could relatively easily find all relevant fallbacks for any proper expression and possibly present them to the user for acceptance. Finding the proper expression for a fallback, and subsequently all other fallbacks for that expression, is a much trickier task, since it is difficult to positively identify a fallback as a fallback; even the user will probably not know. This category should also cover different spellings, including the most common misspellings within each user community.

This is clearly an area open to competition on the one hand and to co-operative efforts on the other, both aimed at improving the techniques that implement the required fuzzy logic for the "intelligence" of these systems. Applications may also have specific, often very special requirements that cannot be met by general solutions.

On the character encoding side, the W3C is working on a normalization scheme for the web. Flagging the language of content has also become quite important for understanding the semantics involved.
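
The problem that normalization addresses can be shown in a few lines of Python: the same visible string may arrive in precomposed or decomposed form, and a naive comparison fails until both sides are normalized:

    import unicodedata

    # "é" may arrive precomposed (U+00E9) or as "e" plus a combining
    # acute accent (U+0301); both render identically.
    precomposed = "caf\u00e9"
    decomposed = "cafe\u0301"

    print(precomposed == decomposed)            # False: naive match fails
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", decomposed))   # True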

Electronic Commerce Aspects

The growth of electronic commerce on the web requires that buyers understand both the characteristics of the products they are about to acquire and the conditions that apply to the transaction. For the latter, a certain degree of unification will clearly be imposed, particularly within the European single market. For the former, mapping mechanisms must be built that provide consumers in particular with, for example, the localized size information they need to avoid costly mistakes.
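
Such a mapping mechanism can be as simple as a lookup table keyed by product type and home-market size. The Python sketch below is purely illustrative, and its size equivalences are hypothetical placeholders, not an authoritative conversion chart:

    # Hypothetical size equivalences, for illustration only.
    SIZE_MAP = {
        ("shirt", "EU-40"): {"UK": "15.75", "US": "M"},
        ("shirt", "EU-42"): {"UK": "16.5", "US": "L"},
    }

    def localized_size(garment, size, market):
        # Fall back to the home-market size if no mapping is known.
        return SIZE_MAP.get((garment, size), {}).get(market, size)

    print(localized_size("shirt", "EU-40", "US"))   # M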

The Semantic Web and the dynamic relationships involved in Web Services set requirements at a level above those for electronic commerce, since lengthy exchanges may be conducted without any human intervention. Multilingual ontologies are seen as providing part of the required solutions.

Systems Approach

Any localization effort would be difficult, if not impossible, unless systems and applications are designed for it. This preparatory process is often called internationalization; a system so prepared is sometimes said to be enabled for localization.

Major IT vendors started to build up their "national language support" programs in the 1970s, at which time the solutions were distinctly proprietary. To counter this, a standardized "locale" structure with a fixed set of elements was developed for POSIX. This structure, however, was designed to cater primarily for the variations encountered in the Western world using the Latin alphabet. Currently, practically all IT vendors support this structure as part of their much wider user preference selection schemes.
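
The fixed set of elements is grouped into the familiar POSIX categories (LC_CTYPE, LC_COLLATE, LC_TIME, LC_NUMERIC, LC_MONETARY, LC_MESSAGES). The Python sketch below queries a few of them on a Unix-type system; the locale name "fr_FR.UTF-8" is an assumption, and locale.nl_langinfo is not available on all platforms:

    import locale

    locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")

    print(locale.nl_langinfo(locale.DAY_2))       # lundi   (LC_TIME)
    print(locale.localeconv()["decimal_point"])   # ,       (LC_NUMERIC)
    print(locale.nl_langinfo(locale.CODESET))     # UTF-8   (LC_CTYPE)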

The POSIX locale approach, and the related Cultural Registry, has not been well received outside Europe and the Americas, due to its inherent weakness in catering for existing cultural differences. This is the primary reason for the current split within both the industry and the user community. In order to truly support worldwide markets and the operation of the worldwide web, the industry is not in favour of extending this type of approach to new areas. Instead, new fora have been set up to define new mechanisms that will provide interoperability on a global scale.

Standardization Organizations

In Europe, the Information and Communication Technologies Standards Board (ICTSB) was set up to co-ordinate the relevant standardization work among the European standardization organizations CEN, CENELEC and ETSI. Its membership represents important stakeholders from industry and user communities.

Within CEN, the Information Society Standardization System (CEN/ISSS) is responsible for organizing IT-related standardization activities within Technical Committees and open Workshops. The workshop agreements dealing with cultural diversity are available free of charge from the CEN/ISSS web site as part of the eEurope program.

Within CEN/ISSS, the Cultural Diversity Steering Group (CDSG) has the key task of ensuring that all Europeans have equal access to today's information society and can express their cultural background in today's world of information and communication technologies (ICT). It does so not by undertaking standardization itself but by mapping and coordinating ongoing work in this field in Europe and beyond.

Within CEN/ISSS, TC 304 used to be the technical committee responsible for European localization requirements in ICT. TC 304, however, currently aims only to complete some of its existing work and is not starting any new work items, pending the follow-on activities to the CDSG report. Related issues have recently come up in a number of other technical committees as well.

Within CEN/ISSS, several workshops have dealt, and continue to deal, with related issues, including the Electronic Commerce Workshop (WS-EC) and the Metadata - Dublin Core Workshop (WS-MMI).

Within ISO/IEC JTC 1, the Cultural and Linguistic Adaptability and User Interfaces (CLAUI) Technical Direction is nominally the coordinating body for these activities. In practice, CLAUI has no organizational structure and very little cooperation exists between the various participating sub-committees and working groups, which are:

  • SC 2: Coded Character Sets, with its working groups WG 2: Multi-Octet Codes (responsible for ISO/IEC 10646, the Universal Character Set, UCS) and WG 3: 7-Bit and 8-Bit Codes and Their Extensions.
  • SC 35: User Interfaces, which has six working groups.
  • SC 22: Programming Languages, their Environments and Systems Software Interfaces, with its WG 20: Internationalization. (All other SC 22 working groups belong to the Programming Languages and Software Interfaces Technical Direction.)

In the World Wide Web Consortium (W3C), the W3C Internationalization Activity consists of the I18N WG (Internationalization Working Group) and the I18N IG (Internationalization Interest Group). The goal of the I18N WG is to propose and coordinate techniques, conventions, guidelines and activities, within the W3C and together with other organizations, that make it easy to use W3C technology worldwide with different languages, scripts and cultures. The goal of the I18N IG is to support the I18N WG with advice and opinions from a larger group of people with knowledge of different languages and cultures, as well as of different parts of the Web architecture. The I18N IG also provides a forum for W3C members to discuss issues related to the internationalization of the Web. In addition to these members-only groups, the W3C has set up a public mailing list with a goal similar to that of the I18N IG but open to non-members: technical discussions that are not member-confidential are carried out on the www-international@w3.org mailing list.

The core operations of the Internet have traditionally been performed through interfaces restricted to a very limited character repertoire, as specified by the Internet Engineering Task Force (IETF). The IETF Internationalized Domain Names (IDN) working group plans to specify requirements for internationalized access to domain names and to specify a standards-track protocol based on those requirements.
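
The approach pursued by the IDN work maps internationalized names onto an ASCII-compatible encoding that the existing DNS can carry. Python's built-in idna codec, which implements the encoding that eventually resulted from this work, can illustrate the idea:

    # An internationalized label and its ASCII-compatible DNS form.
    name = "münchen"
    ace = name.encode("idna")
    print(ace)                 # b'xn--mnchen-3ya'
    print(ace.decode("idna"))  # münchen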

The Localization Industry Standards Association (LISA) provides mechanisms, services and networking for professionals interested in sharing information on the development of globalization and localization processes.

File last updated:
September 2002 
The Diffuse Project is funded under the European Commission's Information Society Technologies programme. Diffuse publications are maintained by TIEKE (the Finnish IS Development Centre), IC Focus and The SGML Centre.