Collation in OpenOffice 2.0

 

This file contains work in progress. Its contents might change without warning.

 

 

Javier SOLA - www.khmeros.info - Last edited 06/02/2006

 

 


 

By default, collation (the alphabetic order of characters for sorting) in OpenOffice follows the rules of the Unicode Collation Algorithm.

 

If the localisers for a language or country do not consider the default collation correct, the collation sequence (the order of characters) can be tailored, that is, modified for that locale.

 

The Unicode Collation Algorithm does not use Unicode code points (the numbers assigned to each character) as a reference for sorting. Instead, it uses a very large list that includes all the characters currently in Unicode, which can be found at:

 

Latest version:
http://www.unicode.org/Public/UCA/latest/allkeys.txt (1 Mb)
The official 4.0 version:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt
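
As a small illustration of why code points are not a usable reference for sorting, the Python sketch below sorts a few Spanish words by raw code point. The word list is our own example and has nothing to do with the OpenOffice sources. Because “ñ” is U+00F1, it ends up after “z”:

```python
# Sorting by raw Unicode code points (Python's default string comparison).
words = ["zorro", "ñu", "oso", "nube"]  # illustrative sample words

by_code_point = sorted(words)
print(by_code_point)  # ['nube', 'oso', 'zorro', 'ñu']
```

A collation table avoids this by assigning each character sorting weights that are independent of its code point.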
 

This list is used as a reference. If you wish to modify the collation sequence for your language, you do not need to modify this list; you only have to create a separate file that states where your sequence differs from it (that is, from the default collation).

The file that you need to create contains one or more reordering sequences, with a syntax like:

&n < ñ <<< Ñ < o

(This particular sequence places the letter “ñ” in the correct position for traditional Spanish sorting, between the letter “n” and the letter “o”. It also indicates that if two words are otherwise identical, but one starts with a capital “Ñ” and the other with a lowercase “ñ”, the lowercase one should go first.)
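
To make the effect of this rule concrete, here is a minimal Python sketch that imitates it by hand. It is not ICU and not the code OpenOffice uses; the alphabet string and the function name sort_key are assumptions made for this illustration, and the sketch only handles the letters listed in it. “<” is modelled as a primary (letter) difference and “<<<” as a tertiary (case) difference that only breaks ties:

```python
# Hand-written imitation of the tailoring "&n < ñ <<< Ñ < o" (NOT ICU itself).
# Primary order: ñ is its own letter, ranked between n and o.
PRIMARY_ORDER = "abcdefghijklmnñopqrstuvwxyz"
PRIMARY = {ch: rank for rank, ch in enumerate(PRIMARY_ORDER)}


def sort_key(word):
    """Return a two-level key: letters first, case only as a tie-breaker."""
    # Primary level: letter identity, ignoring case.
    primary = tuple(PRIMARY[ch.lower()] for ch in word)
    # Tertiary level: lowercase (0) sorts before uppercase (1).
    tertiary = tuple(1 if ch.isupper() else 0 for ch in word)
    return (primary, tertiary)


words = ["oso", "Ñu", "nube", "ñu", "zorro"]  # illustrative sample words
tailored = sorted(words, key=sort_key)
print(tailored)  # ['nube', 'ñu', 'Ñu', 'oso', 'zorro']
```

The case level is compared only after the whole primary key, which is why “ñu” and “Ñu” stay together between “nube” and “oso”.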

To really understand how it works, you should read the following document, which explains the tailoring syntax and philosophy in detail:

http://icu.sourceforge.net/userguide/Collate_Customization.html

You can also see: http://icu.sourceforge.net/docs/papers/icu_collation_iuc26.ppt (a PowerPoint presentation).

You should also look at a simple existing file such as collator/data/km_charset.txt. Unless you have a Khmer font installed you will not be able to see the characters, but the file will give you an idea of what yours should look like.

Supposing that the ISO code used by OpenOffice for your language/country is “xx”, we recommend that you name this file xx_charset.txt (other names are possible, but a fixed name makes it easier to give instructions). Sometimes more than one sequence is defined for a language, in which case you have to use different file names, all of them starting with “xx_”.

The xx_charset.txt file should be encoded in UTF-8. Write the characters themselves; do not use escaped Unicode code-point numbers (such as \u00F1), as the ICU documentation mentioned above sometimes does.
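
Both requirements can be checked mechanically before you submit the file. The following Python sketch is one possible check, not an official OpenOffice tool; the function name check_rules and the escape pattern (a backslash, “u” or “U”, and four hexadecimal digits) are assumptions made for this example:

```python
import re

# Assumed pattern for an escaped code point such as \u00F1.
ESCAPE = re.compile(r"\\[uU][0-9A-Fa-f]{4}")


def check_rules(data):
    """Return a list of problems found in the raw bytes of a rules file."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        return ["not valid UTF-8: %s" % err]
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if ESCAPE.search(line):
            problems.append("line %d: escaped code point found" % number)
    return problems


good = "&n < ñ <<< Ñ < o".encode("utf-8")
print(check_rules(good))  # []
```

For a real file you would pass in its contents read in binary mode, for example open("xx_charset.txt", "rb").read().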

 

The first step is to add this file to the source tree at:

 

i18npool/source/collator/data/xx_charset.txt
 

 
The second step is to modify the file:
 
i18npool/source/collator/data/collator_data.map 
 
so that it handles your file. Add a line similar to:
 
   get_xx_charset;
where xx is your language code.
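
The naming convention used so far can be summarised in a few lines of Python. The helper name collation_names is hypothetical and exists only for illustration; it simply fills a language code into the path and the map entry given above:

```python
def collation_names(lang_code):
    """Build the rules-file path and map entry for a language code."""
    return {
        # Location of the tailoring file in the source tree.
        "rules_file": "i18npool/source/collator/data/%s_charset.txt" % lang_code,
        # Line to add to collator_data.map.
        "map_entry": "get_%s_charset;" % lang_code,
    }


names = collation_names("km")  # "km" is the Khmer code used in the example file
print(names["rules_file"])  # i18npool/source/collator/data/km_charset.txt
print(names["map_entry"])   # get_km_charset;
```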
 

 
 
The third and last step is to modify the locale file for your language/country, making sure that its LC_COLLATION section looks like:
 
<LC_COLLATION>
<Collator unoid="charset" default="true"/>
<CollationOptions>
<TransliterationModules>IGNORE_CASE</TransliterationModules>
</CollationOptions>
</LC_COLLATION>
 
All locale files are in:
 

i18npool/source/localedata/data/
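
If you want to confirm that your locale file declares the collator as shown above, a short Python sketch with the standard xml.etree.ElementTree module could do it. The function name is hypothetical, and the <Locale> wrapper element in the sample is an assumption about the rest of the file; only the LC_COLLATION section is taken from this document:

```python
import xml.etree.ElementTree as ET


def has_default_charset_collator(locale_xml):
    """True if an LC_COLLATION section has a default 'charset' collator."""
    root = ET.fromstring(locale_xml)
    # iter() searches the whole tree, including the root element itself.
    for section in root.iter("LC_COLLATION"):
        for collator in section.findall("Collator"):
            if (collator.get("unoid") == "charset"
                    and collator.get("default") == "true"):
                return True
    return False


# Minimal sample; the <Locale> wrapper is assumed for illustration.
sample = """<Locale>
<LC_COLLATION>
<Collator unoid="charset" default="true"/>
<CollationOptions>
<TransliterationModules>IGNORE_CASE</TransliterationModules>
</CollationOptions>
</LC_COLLATION>
</Locale>"""

print(has_default_charset_collator(sample))  # True
```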

 

 

If you want to establish more than one collation sequence for a language/region, look at the Chinese and Japanese locales and their settings in the files mentioned above, and see how the different sequences are placed in separate files and how they are then referenced from those files.

 

 

Once you have made all these modifications, check them one more time, prepare a patch with them and submit it as an ENHANCEMENT issue against the Localization (L10n) project. To submit an issue, first log in to the OpenOffice website, then click File Issue in the left-hand menu, proceed on the next page, and select the component l10n on the page after that; you are then ready to file it. Select version current, subcomponent code and type ENHANCEMENT, enter a summary such as “Collation sequence for language…”, and hit Submit. The system will ask whether you want to attach a file and of what type. Attach the patch and submit it, and you are done.