The Czech and Slovak Character Encoding Mess Explained

Version: $Id: cs-encodings-faq,v 1.11 1997/03/21 12:23:39 luki Exp $
The contents of this page are in the Public Domain and may be used without restriction. They are NOT warranted to be correct. Please send any corrections and additions to Lukas Petrlik <luki@kiv.zcu.cz>.


This text is an informal description of some of the character encodings currently used in the Czech Republic and Slovakia (the former Czechoslovakia). These are: Kamenicky, PC Latin 2, ISO Latin 2, KOI-8 CS2, cp1250 (MS Windows CS and EE), Macintosh CE and Cork. All of them except Cork are ASCII extensions.
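To make the differences concrete, here is a minimal Python sketch (not part of the original text) that encodes one Czech letter in the four encodings for which Python's standard library ships codecs. The codec names `cp852`, `iso8859_2`, `cp1250` and `mac_latin2` are Python's names, and `byte_values` is a hypothetical helper; Kamenicky, KOI-8 CS2 and Cork are not covered by this sketch.

```python
# Python codec name for each encoding. These names are Python's own and are
# an assumption of this sketch, not terms used by the standards.
CODECS = {
    "PC Latin 2": "cp852",
    "ISO Latin 2": "iso8859_2",
    "MS Windows cp1250": "cp1250",
    "Macintosh CE": "mac_latin2",
}


def byte_values(char: str) -> dict:
    """Return the byte value of a single character in each encoding."""
    return {name: char.encode(codec)[0] for name, codec in CODECS.items()}


if __name__ == "__main__":
    # Print the position of "s with caron" in every encoding.
    for name, value in byte_values("š").items():
        print(f"{name:18} 0x{value:02X}")
```

The same letter lands on four different byte values, which is why a text written under one encoding shows the wrong accented characters when displayed under another.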

There are other encodings as well, esp. several EBCDIC-based encodings for mainframes. These are not covered here.

Kamenicky

The Kamenicky encoding (aka KEYBCS2) is used on IBM-compatible PCs. It is defined by the behavior of the Public Domain "KEYBCS2" utility, written in 1986 by the Kamenicky brothers. Until recently it was the most popular encoding on PCs, because it preserved all the graphical symbols. Many printers can print Kamenicky-encoded texts, and the FidoNET people and many others still use it.

When IBM and Microsoft introduced PC Latin 2 (cp852), usage slowly began to shift towards it.

Some local software vendors use the name cp895 for the Kamenicky encoding (the first localized FoxPro did), but this code page is defined by neither IBM nor Microsoft (according to a message from Jan Toman <janto@microsoft.com>, there is no official cp895 specification). Some software comes in both cp852 (PC Latin 2) and cp895 (Kamenicky) versions.

See also the Kamenicky encoding table.
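As an illustration, the following Python sketch decodes Kamenicky-encoded bytes. The string `KAMENICKY_HIGH` is transcribed from positions 128-173 of the KEYBCS2 table in this text; `decode_kamenicky` is a hypothetical helper name, and treating every other position as code page 437 is an assumption of the sketch, not something the table guarantees.

```python
# Positions 0x80-0xAD, transcribed from the KEYBCS2 table in this text.
KAMENICKY_HIGH = (
    "ČüéďäĎŤčěĚĹÍľĺÄÁ"  # 0x80-0x8F
    "ÉžŽôöÓůÚýÖÜŠĽÝŘť"  # 0x90-0x9F
    "áíóúňŇŮÔšřŕŔ¼§"    # 0xA0-0xAD
)


def decode_kamenicky(data: bytes) -> str:
    """Decode Kamenicky-encoded bytes to a Unicode string (hypothetical helper)."""
    chars = []
    for byte in data:
        if 0x80 <= byte <= 0xAD:
            chars.append(KAMENICKY_HIGH[byte - 0x80])
        else:
            # Assumption: all remaining positions (ASCII, box-drawing and
            # other symbols) are decoded as in code page 437.
            chars.append(bytes([byte]).decode("cp437"))
    return "".join(chars)
```

For example, `decode_kamenicky(b"\x87e\xa8tina")` yields the word "čeština".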

PC Latin 2

PC Latin 2 (alias PC L2) is used on PCs. Most current DOS and OS/2 programs use it by default or have an option for using it, because IBM and Microsoft use it and the Czechoslovak standard CSN 36 9103 recommends its use on PCs. It has all of the ISO 8859-2 printable characters, but the accented letters have different positions. The encoding is defined by IBM as code page 852.

MS DOS manuals describe cp852 as the "Slavic (Latin II)" code page. Note that some of the languages covered by cp852, e.g. Hungarian, are not Slavonic languages.

Most Czech and Slovak users know it only under the name Latin 2 (used by IBM) and don't even know that PC Latin 2 is very different from ISO Latin 2.
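The relationship between the two "Latin 2" encodings can be checked with a short Python sketch. It relies on Python's `cp852` and `iso8859_2` codecs; the function names are hypothetical and chosen only for this illustration.

```python
def pclatin2_to_isolatin2(data: bytes) -> bytes:
    """Re-encode PC Latin 2 (cp852) bytes as ISO Latin 2 (ISO 8859-2)."""
    # Box-drawing symbols exist only in cp852, so they are replaced by "?".
    return data.decode("cp852").encode("iso8859_2", errors="replace")


def isolatin2_repertoire_in_pclatin2() -> bool:
    """Check that every ISO 8859-2 character in 0xA0-0xFF exists in cp852."""
    text = bytes(range(0xA0, 0x100)).decode("iso8859_2")
    try:
        text.encode("cp852")
    except UnicodeEncodeError:
        return False
    return True
```

The letter c with caron, for instance, moves from position 0x9F in PC Latin 2 to 0xE8 in ISO Latin 2, so the bytes must be converted rather than copied.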

See also the PC Latin 2 table.

ISO Latin 2

ISO Latin 2 is the ISO 8859-2 (1987) standard. It is recommended by ISO for use with modern Albanian, Croatian, Czech, English, German, Hungarian, Polish, Rumanian, Slovak and Slovene. It is used mostly on Unices and other Nice Systems. IBM code page 912 is the same as ISO 8859-2.

An almost ISO 8859-2 compliant character encoding is defined by CSN 36 9103 under the name KOI-8 L2 (see "The CSN 36 9103 Standard" below). The encoding is registered by ISO under the registration number 139.

See also the ISO Latin 2 table.

KOI-8 CS2

This encoding is defined by CSN 36 9103. It treats `ch' and `CH' as single letters (as in the Czech alphabet), and you can get the positions of the most used accented characters simply by setting the high (sign) bit. This encoding was used on old terminals, but now it seems to be dead. Some well-known software (e.g. the T602 text editor) still has options for using it.

See also the KOI-8 CS2 table.

MS Windows Encoding

The MS Windows (3.1, WfW, W95 and NT) CS and EE editions use cp1250, which has all the printable characters of ISO Latin 2, but 14 characters have different positions (8 of them are used in Czech/Slovak). It also uses the positions 128-159 for printable characters (this is the C1 area, which is used for control purposes in ISO Latin 2 and other ISO 2022 conforming codes).

Thus it is not true that cp1250 is a superset of ISO 8859-2.
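The moved characters can be listed with a small Python sketch that compares Python's `iso8859_2` and `cp1250` codecs. The function names are hypothetical. With these codecs the caron accent itself also changes position, so this sketch reads the figure of 14 as counting letters only; that reading is an assumption.

```python
import unicodedata


def moved_characters() -> dict:
    """Map each ISO 8859-2 character whose byte differs in cp1250
    to the pair (ISO 8859-2 byte, cp1250 byte)."""
    moved = {}
    for iso_byte in range(0xA0, 0x100):
        char = bytes([iso_byte]).decode("iso8859_2")
        win_byte = char.encode("cp1250")[0]
        if win_byte != iso_byte:
            moved[char] = (iso_byte, win_byte)
    return moved


def moved_letters() -> list:
    """Only the upper- and lower-case letters among the moved characters."""
    return [c for c in moved_characters()
            if unicodedata.category(c) in ("Lu", "Ll")]
```

Several of the moved letters end up in the range 128-159, for example s with caron at 0x9A, which is exactly the area ISO Latin 2 reserves for control codes.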

Code page 1250 is also used in the Hungarian and Polish editions of Windows.

See also the Windows cp1250 table.

MacOS CentralEurope

The MacOS CE (aka Macintosh CE) character set is intended for use with Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Slovak and Slovenian. The encoding is currently used in Czech, Polish and Hungarian MacOS localizations.

See also the MacOS CE table.

Cork

The Cork (aka T1) encoding is used by most European TUGs (national TeX Users Groups) for the TeX internal T1 font encoding. The encoding was defined in 1990 at the TUG meeting held in Cork. The TeX DC font family is T1-encoded.

This encoding is not an ASCII extension, because it contains printable characters in the lowest 32 positions (0 - 31), which are used for control purposes in ASCII.

See also the Cork encoding table.

The CSN 36 9103 Standard

The standard uses an obscure language and requires careful reading. If you cannot understand the following text, it is because I followed its "good" example. :)

The Czechoslovak standard CSN 36 9103 defines the following character encodings: KOI-8 K1, KOI-8 L2, KOI-8 CS2, DKOI K1, DKOI K2, DKOI L2 and DKOI CS2. The KOI-8 codes conform to ISO recommendations (they are ASCII-based), whereas the DKOI codes do not (they are EBCDIC-based).

The standard is a Czechoslovak extension of the SEV standard ST SEV 358-88. The new encodings (which aren't defined in the SEV standard) are KOI-8 L2, KOI-8 CS2, DKOI L2 and DKOI CS2. The remaining encodings are for the Cyrillic alphabet used for communication within SEV; these were never in regular use in our country. The definition of KOI-8 L2 is stated to conform to ISO 8859-2 (1987), except for the characters $, _ and the currency symbol (164), which have different graphic representations. RFC 1345 knows KOI-8 L2 as "charset CSN_369103", because it is the only one of these encodings registered with ISO (ISO IR 139).
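Going by the KOI-8 L2 and ISO Latin 2 tables in this text, the two differ only in that the dollar sign (36) and the currency symbol (164) trade places. A minimal Python sketch under that assumption follows; `decode_koi8_l2` is a hypothetical helper, and the difference in the underscore glyph is not modelled.

```python
# Swap positions 0x24 and 0xA4 (164), as the KOI-8 L2 table in this text
# shows them relative to ISO 8859-2.
_SWAP_DOLLAR_CURRENCY = bytes.maketrans(b"\x24\xa4", b"\xa4\x24")


def decode_koi8_l2(data: bytes) -> str:
    """Decode KOI-8 L2 bytes by reusing Python's ISO 8859-2 codec."""
    return data.translate(_SWAP_DOLLAR_CURRENCY).decode("iso8859_2")
```

All accented letters decode exactly as in ISO Latin 2; only the two swapped positions need special handling.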

Appendix 5, ``8-bit Codes for Personal Computers'', contains an informative description of the character encoding PC Latin 2 defined by IBM. This encoding is known as IBM Code Page 852, but the cp number is not mentioned in the standard.

The CSN 36 9103 standard had to be revised in 1996.

See also the KOI-8 L2 table.


Tables

Table of English and Czech Accent Names Used in ISO Latin 2 and KOI-8 L2

English name           Czech Name (CSN 36 9103)
----------------------------------------------------------------------
acute accent ......... carka nad pismenem, silny prizvuk
                       (c<a'rka nad pi'smenem, silny' pr<i'zvuk)
breve ................ breve
caron ................ hacek (ha'c<ek)
cedilla .............. hacek pod pismenem, cedilie
                       (ha'c<ek pod pi'smenem, cedilie)
circumflex accent .... vokan (voka'n<)
diaeresis ............ dve tecky nad pismenem, prehlaska
                       (dve< tec<ky nad pi'smenem, pr<ehla'ska)
dot above ............ tecka nad pismenem (tec<ka nad pi'smenem)
double acute accent .. dvojcarka (dvojc<a'rka)
ogonek ............... ocasek (oca'sek)
ring above ........... krouzek nad pismenem (krouz<ek nad pi'smenem)
stroke ............... preskrtnuti (pr<es<krtnuti')

[Missing: Slovak accent names]

TeX and RFC 1345 Accent Representations

English name           TeX                           RFC 1345
----------------------------------------------------------------------
acute accent ......... \'{x}, \'{\i} ............... x'  ''
breve ................ \u{x} ....................... x(  '(
caron ................ \v{x} ....................... x<  '<
cedilla .............. \c{x} ....................... x,  ',
circumflex accent .... \^{x} ....................... x>  '>
diaeresis ............ \"{x} ....................... x:  ':
dot above ............ \.{x} ....................... x.  '.
double acute accent .. \H{x} ....................... x"  '"
grave accent ......... \`{x} ....................... x!  '!
macron ............... \={x} ....................... x-  'm
ogonek ............... \k{x} (LaTeX) ............... x;  ';
ring above ........... \accent23x, \aa ............. x0 aa  '0
stroke ............... \l, \L, \o, \O .............. x/
tilde ................ \~{x} ....................... x?  '?

The Czech Alphabet

a a' b c c< d d< e e' e< f g h ch i i' j k l m n n< o o' p q r r< s s< t t< u u' u0 v w x y y' z z<

The digraph "ch" is treated as a single character.
The Czech characters "r<", "e<" and "u0" are not used in the Slovak language.
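Software that counts or sorts Czech letters therefore has to recognize the digraph first. A minimal Python sketch follows; `czech_letters` is a hypothetical helper that works on already decoded text and simply assumes that every "c" followed by "h" forms the digraph.

```python
import re

# "ch" must be tried before the single-character alternative.
_CZECH_LETTER = re.compile(r"ch|.", re.IGNORECASE | re.DOTALL)


def czech_letters(word: str) -> list:
    """Split a word into letters of the Czech alphabet, keeping "ch" whole."""
    return _CZECH_LETTER.findall(word)
```

With this helper, `czech_letters("chata")` returns four letters rather than five.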

The ISO 639 abbreviation for the Czech language is "cs". The two-letter ISO 3166 country code for the Czech Republic is "CZ".

Note that the ISO 639/ISO 3166 convention is that language names are written in lower case and country codes are written in upper case.

The Slovak Alphabet

a a' a: b c c< d d< dz dz< e e' f g h ch i i' j k l l' l< m n n< o o' o> p q r r' s s< t t< u u' v w x y y' z z<

The digraphs "ch", "dz" and "dz<" are treated as single characters in the Slovak language. The Slovak characters "a:", "o>", "r'", "l'" and "l<" are not used in the Czech language.

The ISO 639 abbreviation for the Slovak language is "sk". The two letter ISO 3166 country code for Slovakia is "SK".

Charset Tables

The format of the tables is described in RFC 1345 (see "Sources Used"). The following additional mnemonics are used for characters missing in RFC 1345:

 @CH            CAPITAL CZECH LETTER CH  (the digraph "CH")  [CSN]
 @ch            SMALL CZECH LETTER CH  (the digraph "ch")  [CSN]
 @I,            LATIN CAPITAL LETTER I WITH CEDILLA
 @i,            LATIN SMALL LETTER I WITH CEDILLA
 @j.            LATIN SMALL LETTER J DOTLESS
 @SS            LATIN CAPITAL LETTER SHARP S (German)  (the digraph "SS")
 @U,            LATIN CAPITAL LETTER U WITH CEDILLA
 @u,            LATIN SMALL LETTER U WITH CEDILLA

See also comments in the tables.


&charset ISO_8859-2:1987
&rem source: ECMA registry
&rem Extracted from RFC 1345
&alias iso-ir-101
&g1esc x2d42 &g2esc x2e42 &g3esc x2f42
&alias ISO_8859-2
&alias ISO-8859-2
&alias latin2
&alias l2
&rem Code page 912 is IBM's alias for ISO 8859-2:
&alias cp912
&alias 912
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
NS A; '( L/ Cu L< S' SE ': S< S, T< Z' -- Z< Z.
DG a; '; l/ '' l< s' '< ', s< s, t< z' '" z< z.
R' A' A> A( A: L' C' C, C< E' E; E: E< I' I> D<
D/ N' N< O' O> O" O: *X R< U0 U' U" U: Y' T, ss
r' a' a> a( a: l' c' c, c< e' e; e: e< i' i> d<
d/ n' n< o' o> o" o: -: r< u0 u' u" u: y' t, '.

&charset IBM852
&rem source: IBM NLS RM Vol2 SE09-8002-01, March 1990
&rem Extracted from RFC 1345 and corrected
&alias cp852
&alias 852
&rem From cp852_DOSLatin2 to Unicode table:
&alias cp852_DOSLatin2
&rem The following aliases are used by CSN 36 9103, but not by RFC 1345:
&alias pclatin2
&alias pcl2
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
C, u: e' a> a: u0 c' c, l/ e: O" o" i> Z' A: C'
E' L' l' o> o: L< l< S' s' O: U: T< t< L/ *X c<
a' i' o' u' A; a; Z< z< E; e; NO z' C< s, << >>
.S :S ?S vv vl A' A> E< S, VL VV LD UL Z. z. dl
ur uh dh vr hh vh A( a( UR DR UH DH VR HH VH Cu
d/ D/ D< E: d< N< I' I> e< ul dr FB LB T, U0 TB
O' ss O> N' n' n< S< s< R' U' r' U" y' Y' t, ''
-- '" '; '< '( SE -: ', DG ': '. u" R< r< fS NS

&charset KOI-8_L2
&rem The RFC 1345 name for this charset is CSN_369103, but this
&rem CSN standard defines 6 other encodings as well, so the name
&rem shouldn't be used as an alias for KOI-8_L2.
&rem source: ECMA registry
&rem Extracted from RFC 1345 and changed charset name
&alias koi8l2
&alias iso-ir-139
&g1esc x2d49 &g2esc x2e49 &g3esc x2f49
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb Cu % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
NS A; '( L/ DO L< S' SE ': S< S, T< Z' -- Z< Z.
DG a; '; l/ '' l< s' '< ', s< s, t< z' '" z< z.
R' A' A> A( A: L' C' C, C< E' E; E: E< I' I> D<
D/ N' N< O' O> O" O: *X R< U0 U' U" U: Y' T, ss
r' a' a> a( a: l' c' c, c< e' e; e: e< i' i> d<
d/ n' n< o' o> o" o: -: r< u0 u' u" u: y' t, '.

&charset KEYBCS2
&rem source: the Reality :)
&alias KAMENICKY
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
C< u: e' d< a: D< T< c< e< E< L' I' l< l' A: A'
E' z< Z< o> o: O' u0 U' y' O: U: S< L< Y' R< t<
a' i' o' u' n< N< U0 O> s< r< r' R' 14 SE << >>
.S :S ?S vv vl vL Vl Dl dL VL VV LD UL Ul uL dl
ur uh dh vr hh vh vR Vr UR DR UH DH VR HH VH uH
Uh dH Dh Ur uR dR Dr Vh vH ul dr FB LB lB RB TB
a* b* G* p* S* s* m* t* F* H* W* d* 00 /0 e* (U
=3 +- >= =< Iu Il -: ?2 DG .M '. RT nS 2S fS NS

&charset CORK
&rem source: DC font sources
&alias T1
&code 0
'! '' '> '? ': '" '0 '< '( 'm '. ', '; .9 <1 >1
"6 "9 :9 << >> -N -M ?? 0s i. @j. ff fi fl ffi ffl
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? -1
A( A; C' C< D< E< E; G( L' L< L/ N' N< NG O" R'
R< S' S< S, T< T, U" U0 Y: Z' Z< Z. IJ I. d/ SE
a( a; c' c< d< e< e; g( l' l< l/ n' n< ng o" r'
r< s' s< s, t< t, u" u0 y: z' z< z. ij !I ?I Pd
A! A' A> A? A: AA AE C, E! E' E> E: I! I' I> I:
D- N? O! O' O> O? O: OE O/ U! U' U> U: Y' TH @SS
a! a' a> a? a: aa ae c, e! e' e> e: i! i' i> i:
d- n? o! o' o> o? o: oe o/ u! u' u> u: y' th ss

&charset KOI-8_CS2
&rem source: CSN 36 9103
&alias koi8cs2
&alias koi8cs
&g1esc x2d49 &g2esc x2e49 &g3esc x2f49
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb Cu % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
NS ?? '' ?? '? ?? '( '. ': ?? '0 ', ?? '" '; '<
Co Rg dr dl ur ul hh -v W* SE a* g* e* m* p* w*
a! a' a< c< d< e< r' @ch u: i' u0 l' l< o: n< o'
o> a: r< s< t< u' e: e' u" y' z< ?? ?? o" e. ss
A! A' A< C< D< E< R' @CH U: I' U0 L' L< O: N< O'
O> A: R< S< T< U' E: E' U" Y' Z< ?? ?? O" E. ??

&charset windows-1250
&rem source: cp1250_WinLatin2 to Unicode table, table version 2.00
&alias cp1250_WinLatin2
&rem Unofficial aliases:
&alias cp1250
&alias 1250
&alias wincs
&alias winee
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
?? ?? .9 ?? :9 .3 /- /= ?? %0 S< <1 S' T< Z< Z'
?? '6 '9 "6 "9 Sb -N -M ?? TM s< >1 s' t< z< z'
NS '< '( L/ Cu A; BB SE ': Co S, << NO -- Rg Z.
DG +- '; l/ '' m* PI .M ', a; s, >> L< '" l< z.
R' A' A> A( A: L' C' C, C< E' E; E: E< I' I> D<
D/ N' N< O' O> O" O: *X R< U0 U' U" U: Y' T, ss
r' a' a> a( a: l' c' c, c< e' e; e: e< i' i> d<
d/ n' n< o' o> o" o: -: r< u0 u' u" u: y' t, '.

&charset MacOS_CentralEurope
&rem source: MacOS_CentralEurope to Unicode table, table version 0.2
&rem Unofficial aliases:
&alias macintosh_ce
&alias macce
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o
p q r s t u v w x y z (! !! !) '? DT
A: A- a- E' A; O: U: a' a; C< a: c< C' c' e' Z'
z' D< i' d< E- e- E. o' e. o> o: o? u' E< e< u:
/- DG E; Pd SE Sb PI ss Rg Co TM e; ': != g' @I,
@i, I- =< >= i- K, dP +Z l/ L, l, L< l< L' l' N,
n, N' NO RT n' N< DE << >> .3 NS n< O" O? o" O-
-N -M "6 "9 '6 '9 -: Db o- R' r' R< <1 >1 r< R,
r, S< .9 :9 s< S' s' A' T< t< I' Z< z< U- O' O>
u- U0 U' u0 U" u" @U, @u, Y' y' k, Z. L/ z. G, '<

To Do

Acknowledgments

Thanks are due to all the readers who contributed to the improvement of this FAQ.

Special thanks to Josef Tkadlec <tkadlec@math.feld.cvut.cz>, who reviewed the tables.

Sources Used

CSN 36 9103. Systemy zpracovani informaci: 8bitove kodovane soubory symbolu. (Information processing: 8-bit code for information interchange.) Vydavatelstvi norem Praha, 1989.

Gasparikova, Z. - Kamis, A.: Slovensko-cesky slovnik. SPN Praha 1987. (The Slovak-Czech Dictionary.)

IBM: IBM OS/2 Warp 4. Klavesnice a kodove stranky. (Keyboards and Code Pages.) IBM, 1996.

Knuth, D. E.: The TeXbook. Addison - Wesley, Reading, Massachusetts, 1986.

Lamport, L.: LaTeX. Addison - Wesley, Reading, Massachusetts, 1986.

List of IANA Registered Character Sets.

RFC 1345. Character Mnemonics & Character Sets. [Tables for ISO Latin 2 (ISO_8859-2:1987), PC Latin 2 (IBM852) and KOI-8 L2 (CSN_369103).]

The cp1250_WinLatin2 to Unicode table, 2.00.

The MacOS_CentralEurope to Unicode table, 0.2. [This table also contains a verbal description of the code.]