
Paper presented at the Fourth European Tcl/Tk Users Meeting, Nuremberg 2003, by Richard Suchenwirth

Abstract: Problems of working with the writing systems used around the world (internationalization, "i18n") are discussed, with emphasis on encodings, input, and output, and on the quite satisfying solutions the Tcl scripting/programming language offers.

1. Characters, glyphs, code points

"Everything is a string", the Tcl mantra goes. A string is a (finite-length) sequence of characters. Now, what is a character? A character is not the same as a glyph, the writing element that we see on screen or paper - that represents it, but the same glyph can stand for different characters, or the same character be represented with different glyphs (think e.g. of a font selector).

Also, a character is not the same as a byte, or sequence of bytes, in memory. Bytes may again represent a character, but not unequivocally, once we leave the safe haven of ASCII.

Let's try the following working definition: "A character is the abstract concept of a small writing unit". This often amounts to a letter, digit, or punctuation sign - but a character can be more or less than that. More: Ligatures, groups of two or more letters, can at times be treated as one character (arranged even in more than one line, as seen in Japanese U+337F ㍿ or Arabic U+FDFA ﷺ). Less: little marks (diacritics) added to a character, like the two dots on ü in Nürnberg (U+00FC), can turn that into a new "precomposed" character, as in German; or they may be treated as a separate, "composing character" (U+0308 in the example) which in rendering is added to the preceding glyph (u, U+0075) without advancing the rendering position - a more sensible treatment of the function of these two dots, "trema", in Spanish, Dutch, or even (older) English orthography: consider the spelling "coöperation" in use before c. 1950. Such composition is the software equivalent of "dead keys" on a typewriter.
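
The distinction between precomposed and composing characters is plainly visible in Tcl (whose \uXXXX notation is introduced below), where the two forms of the same text differ in character count:
 % string length "N\u00FCrnberg"  ;# precomposed u-umlaut: one character
 8
 % string length "Nu\u0308rnberg" ;# u plus combining diaeresis: two characters
 9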

Although an abstract concept, a character may of course have attributes, most importantly a name: a string, of course, which describes its function, usage, pronunciation etc. Various sets of names have been formalized in PostScript (/odieresis) or HTML (&ouml;). Very important in technical applications is of course the assignment of a number (typically a non-negative integer) to identify a character - this is the essence of encodings, where the numbers are more formally called code points. Other attributes may be predicates like "is upper", "is lower", "is digit".

The relations between the three domains are not too complicated: an encoding controls how a sequence of 1..n bytes is interpreted as a character, or vice versa; the act of rendering turns an abstract character into a glyph (typically as a pixel or vector pattern). Conversely, making sense of a set of pixels as a sequence of characters is the much more difficult art of OCR, which earns my daily bread but is not the topic of this talk.

2. Pre-Unicode encodings

Work on encodings, mapping characters to numbers (code points), has a longer history than electronic computing. Francis Bacon (1561-1626) is reported to have used, around 1580, a five-bit encoding of the English/Latin alphabet (without the letters j and u!) in which the bits were represented as "a" or "b" - long before Leibniz discussed binary arithmetic in 1679. An early encoding in practical use was the 5-bit Baudot/CCITT-2 teletype (punch tape) code standardized in 1932, which could represent digits and some punctuation marks by switching between two modes. I have worked on Univac machines that used six bits per "Fieldata" character, as hardware words were 36 bits long. While IBM used 8 bits in the EBCDIC code, the more famous American Standard Code for Information Interchange (ASCII) did basically the same job in 7 bits per character, which was sufficient for upper- and lowercase basic Latin (English) as well as digits and a number of punctuation and other "special" characters. As hardware settled on the 8-bit byte as its smallest memory unit, one bit was left over for parity checks or other purposes.

The most important purpose, outside the US, was of course to accommodate more letters required to represent the national writing system - Greek, Russian, or the mixed set of accented or "umlauted" characters used in virtually every country in Europe. Even England needed a code point for the Pound Sterling sign. The general solution was to use the 128 additional positions available when ASCII was implemented as 8-bit bytes, hex 80..FF. A whole flock of such encodings were defined and used:
 ISO standard encodings iso8859-.. (1-15)
 MS/DOS code pages cp...
 Macintosh code pages mac...
 Windows code pages cp1...

The East Asian countries China, Japan, Korea all use character sets of several thousand characters, so the "high ASCII" approach was not feasible there. Instead, the ASCII concept was extended to a 2x7-bit pattern, where the 94 printing ASCII characters indicate row and column in a 94x94 matrix. This way, all character codes were in practice two bytes wide, and thousands of Hanzi/Kanji/Hangul could be accommodated, plus hundreds of others (the ASCII, Greek, and Russian alphabets, many graphic characters). These national multibyte encodings are:
 JIS C-6226 (Japan 1978)
 GB 2312-80 (China 1980)
 KS C-5601 (Korea)

If the 2x7 pattern was directly implemented, files in such encodings could not be told apart from ASCII files, except for unreadability. In order to handle both types of strings transparently, the "high ASCII" approach was extended so that a byte in 00..7F was taken at ASCII face value, while bytes with high bit set (80..FF) were interpreted as halves of multibyte codes. For instance, the first Chinese character in GB2312, row 16 col 1 (decimally 1601 for short), gives the two bytes
 16 + 32 + 128 = 176 = 0xB0
  1 + 32 + 128 = 161 = 0xA1

This implementation became known as "Extended UNIX Code" (EUC) in the national flavors euc-cn (China), -jp (Japan), -kr (Korea). Microsoft did it differently and adopted the similar but distinct Shift-JIS (Japan) and "Big 5" (Taiwan, Hong Kong) encodings, adding to the "ideograph soup" confusion.
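
The row/column arithmetic above fits in a few lines of Tcl; a minimal sketch (the proc name is mine) that reproduces the example:
 proc gb2312-euc {row col} {
    # EUC bytes for the GB2312 character at 1-based row/column
    format "%02X %02X" [expr {$row + 32 + 128}] [expr {$col + 32 + 128}]
 }
 % gb2312-euc 16 1
 B0 A1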

3. Unicode as pivot for all other encodings

The Unicode standard is an attempt to unify all modern character encodings into one consistent 16-bit representation. Consider a page with a 16x16 table filled with EuroLatin-1 (ISO 8859-1), the lower half being the ASCII code points. Call that "page 00" and imagine a book of 256 or more such pages (with all kinds of other characters on them, in majority CJK), and you have a pretty clear concept of the Unicode standard, in which a character's code position is "U+" followed by the hex value of (page number*256 + cell number); for instance, U+20A4 (page 20, cell A4) is the lira sign. Initiated by the computer industry (www.unicode.org), Unicode has grown together with ISO 10646, a parallel standard providing an up-to-31-bit encoding (one bit left for parity?) with the same scope. Software must support Unicode strings to be fit for i18n. From Unicode version 3.1, the 16-bit limit was transcended for some rare writing systems, but also for the CJK Unified Ideographs Extension B - apparently, even 65536 code positions are not enough. The total count in Unicode 3.1 is 94,140 encoded characters, of which 70,207 are unified Han ideographs; the next biggest group are over 14,000 Korean Hangul syllables. And the number is growing.

Unicode 4.0.0 is the latest version (as of this writing), reported to contain 96,248 graphic characters, 134 format characters, 65 control characters, 137,468 private-use code points, 2,048 surrogates, and 66 noncharacters; 878,083 code points are reserved for what the future will bring. From www.unicode.org/versions/Unicode4.0.0 : "1,226 new character assignments were made to the Unicode Standard, Version 4.0 (over and above what was in Unicode 3.2). These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts; Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants)."

4. Unicode implementations: UTF-8, UCS-2

UTF-8 is made to cover 7-bit ASCII, Unicode, and ISO 10646. Characters are represented as sequences of 1..6 eight-bit bytes - termed octets in the character set business - (for ASCII: 1, for Unicode: 2..3) as follows:

  • ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.
  • Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of page#, bb.. are the bits of the second Unicode byte. These pages cover European/Extended Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic.
  • Unicode, pages 08..FE: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of Unicode, including Hangul, Kanji, and so on. This means that East Asian texts are 50% longer in UTF-8 than in pure 16-bit Unicode.
  • ISO 10646 codes beyond Unicode: 4..6 bytes. (Never seen one yet).

The general principle of UTF-8 is that the first byte either is a single-byte character (if below 0x80), or indicates the length of a multi-byte code by the number of 1-bits before the first 0, and is then filled up with data bits. All other bytes start with the bits 10 and are then filled up with 6 data bits. It follows that bytes in UTF-8 encoding fall into distinct ranges:
   00..7F - plain old ASCII
   80..BF - non-initial bytes of multibyte code
   C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
   FE, FF - never used (so, free for byte-order marks).

The distinction between initial and non-initial bytes helps in plausibility checks, or to re-synchronize after missing data. Besides, UTF-8 is independent of byte order (as opposed to UCS-2, see below). Tcl however shields these UTF-8 details from us: characters are just characters, no matter whether 7 bit, 16 bit, or (in the future) more.
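
Still, should you ever want the raw bytes, the bit rules above can be spelled out in a few lines; a sketch for code points up to U+FFFF (proc name mine - for real work, [encoding convertto utf-8], introduced below, does this and more):
 proc toUtf8 c {
    # return the UTF-8 bytes of code point c (0..0xFFFF) as hex values
    if {$c < 0x80} {
        return [format %02X $c]
    } elseif {$c < 0x800} {
        return [format "%02X %02X" [expr {0xC0 | ($c >> 6)}] \
                [expr {0x80 | ($c & 0x3F)}]]
    }
    format "%02X %02X %02X" [expr {0xE0 | ($c >> 12)}] \
            [expr {0x80 | (($c >> 6) & 0x3F)}] [expr {0x80 | ($c & 0x3F)}]
 }
 % toUtf8 0x20AC
 E2 82 AC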

I found out by chance that the byte sequence EF BB BF is the UTF-8 equivalent of \uFEFF, and the humble Notepad editor of Windows 2000 indeed switches to UTF-8 encoding when a file starts with these three bytes. I don't know how widely used this convention is, but I like it - my i18n-aware Tcl code will adopt it for reading and writing files, in addition to the FEFF treatment that I already do.
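
A minimal sketch of that convention for reading (proc name mine; falling back to the system encoding is my own policy, not a standard):
 proc openMaybeUtf8 path {
    # honor a leading EF BB BF (UTF-8 byte-order mark) if present
    set ch [open $path r]
    fconfigure $ch -translation binary
    if {[read $ch 3] eq "\xEF\xBB\xBF"} {
        fconfigure $ch -encoding utf-8 -translation auto
    } else {
        seek $ch 0
        fconfigure $ch -encoding [encoding system] -translation auto
    }
    return $ch
 }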

The UCS-2 representation (in Tcl just called the "unicode" encoding) is much easier to explain: each character code is written as a 16-bit "short" unsigned integer. The practical complication is that the two memory bytes making up a "short" can be arranged in "big-endian" (Motorola, Sparc) or "little-endian" (Intel) byte order. Hence, the following rules were defined for Unicode:

  • Code point U+FEFF, whose character name is "zero width no-break space", was designated as the Byte Order Mark (BOM);
  • code points U+FFFE (as well as U+FFFF) are not valid Unicode characters.

This way, a Unicode-reading application (even Notepad/W2k) can easily detect that something's wrong when it encounters the byte sequence FFFE, and swap the following byte pairs - a minimal and elegant way of dealing with varying byte orders. For XML, a similar self-identification is defined with the encoding attribute in the leading tag.

5. Tcl: encoding conversions; system encoding

From Tcl 8.1, i18n support was brought to string processing, and it was wisely decided to

  • use Unicode as general character set
  • use UTF-8 as standard internal encoding
  • provide conversion support for the many other encodings in use.

However, as unequal-length byte sequences make simple tasks like indexing into a string, or determining its length in characters, more complex, the internal representation is converted to fixed-length 16-bit UCS-2 in such cases. (This brings new problems with recent Unicode versions that cross the 16-bit barrier... When practical use justifies it, this will have to change to UCS-4, or 4 bytes per character.)

Therefore, not all i18n issues are automatically solved for the user. One still has to analyze seemingly simple tasks like uppercase conversion (the Turkish dotted/undotted I is an anomaly) or sorting ("collation order" is not necessarily the numeric order of the Unicodes, which lsort would apply by default), and write custom routines if a more correct behavior is required. Other locale-dependent i18n issues like number/currency formatting and date/time handling also belong to this group. I recommend starting from the defaults Tcl provides and, if necessary, customizing the appearance as desired. International data exchange is severely hampered if localized numeric data are exchanged, one side using the period, the other the comma as decimal point...
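
For example, a more correct collation can often be had by sorting on a mapped key; a sketch of one (simplified) German convention that treats umlauts like their base vowels (proc names mine):
 proc de-key s {string map {ä a ö o ü u Ä A Ö O Ü U ß ss} $s}
 proc de-cmp {a b} {string compare [de-key $a] [de-key $b]}
 % lsort -command de-cmp {Zug Äpfel Öl Ofen}
 Äpfel Ofen Öl Zug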

Strictly speaking, the Tcl implementation "violates the UTF-8 spec, which explicitly forbids non-canonical representation of characters and requires that malformed UTF-8 sequences in the input be errors. ... I think that to be an advantage. But the spec says 'MUST' so we're at least technically non-compliant." (Kevin B. Kenny in the Tcl chat, 2003-05-13)

If textual data are internal to your Tcl script, all you have to know is the \uxxxx notation, which is substituted with the character whose Unicode code point is U+xxxx (hexadecimal). This notation can be used wherever Tcl substitution takes place, even in braced regexps and string map pairlists; elsewhere you can force it by substing the string in question.

To demonstrate that for instance scan works transparently, here's a one-liner to format any Unicode character as HTML hex entity:
 proc c2html c {format "&#x%4.4x;" [scan $c %c]}

Conversely it takes a few lines more:
 proc html2u string {
    while {[regexp {&#[xX]([0-9A-Fa-f]+);} $string matched hex]} {
        regsub -all $matched $string [format %c 0x$hex] string
    }
    set string
 }
 % html2u "this is a &x20ac; sign"
 this is a € sign


For all other purposes, two commands basically provide all i18n support:
 fconfigure $ch -encoding $e

enables conversion from/to encoding e for an open channel (file or socket) if different from system encoding;
 encoding convertfrom/to $e $string

does what it says, the other encoding always being Tcl's internal Unicode.

For instance, I could easily decode the bytes EF BB BF from a hexdump with
 format %x [scan [encoding convertfrom utf-8 \xef\xbb\xbf] %c]

in an interactive tclsh, and found that it stood for the famous byte-order mark FEFF. Internally to Tcl, (almost) everything is a Unicode string. All communication with the operating system is done in the "system encoding", which you can query (but had best not change) with the [encoding system] command. Typical values are iso8859-1 or iso8859-15 on European Linuxes, and cp1252 on European Windowses.
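
For channel I/O, the typical pattern is to configure the encoding right after opening; a minimal sketch (file names hypothetical) that converts a Shift-JIS file to UTF-8:
 set in [open kanji.txt r]
 fconfigure $in -encoding shiftjis ;# bytes are decoded while reading
 set text [read $in]
 close $in
 set out [open kanji-utf8.txt w]
 fconfigure $out -encoding utf-8 ;# characters are encoded while writing
 puts -nonewline $out $text
 close $out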

Finally, the msgcat package supports localization ("l10n") of apps by allowing message catalogs for translation of strings, typically for GUI display, from a base language (typically English) to a target language selected by the current locale. For example, an app to be localized for France might contain a catalog file fr.msg with, for simplicity, only the line
 msgcat::mcset fr File Fichier

In the app itself, all you need is
 package require msgcat
 namespace import msgcat::mc
 msgcat::mclocale fr ;#(1)
 msgcat::mcload [file dirname [info script]] ;# load matching .msg catalogs
 #...
 pack [button .b -text [mc File]]

to have the button display the localized text for "File", namely "Fichier", as obtained from the message catalog. For other locales, only a new message catalog has to be produced by translating from the base language. Instead of an explicit setting as in (1), the locale information typically comes from an environment variable (LANG) or a registry setting, as sketched below.
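
On Unix, msgcat in fact picks up such variables itself; an explicit derivation could look like this sketch (assuming LANG values like "fr_FR.UTF-8"):
 if {[info exists ::env(LANG)]} {
    # "fr_FR.UTF-8" -> "fr_fr"; msgcat does something similar on startup
    msgcat::mclocale [string tolower [lindex [split $::env(LANG) .] 0]]
 }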

6. Tk: text rendering, fonts

Rendering international strings on displays or printers can pose the biggest problems. First, you need fonts that contain the characters in question. Fortunately, more and more fonts with international characters are available; a pioneer was Bitstream Cyberbit, which contains roughly 40,000 glyphs and was for some time offered as a free download on the Web. Microsoft's Tahoma font also added support for most alphabetic scripts, including Arabic. Arial Unicode MS, delivered with Windows 2000, contains just about all the characters in Unicode, so even humble Notepad can get truly international with that.

But having a good font is still not enough. While strings in memory are arranged in logical order, with addresses increasing from beginning to end of text, they may need to be rendered in other ways: with diacritics shifted to various positions of the preceding character, or, most evidently for the languages that are written from right to left ("r2l"), Arabic and Hebrew. (Tk still lacks automatic "bidi"rectional treatment, so r2l strings have to be stored "wrongly" in memory to appear right when rendered - see A simple Arabic renderer on the Wiki.)

Correct bidi treatment has consequences for cursor movement, line justification, and line wrapping as well. Vertical lines progressing from right to left are popular in Japan and Taiwan - and mandatory if you had to render Mongolian.

Indian scripts like Devanagari are alphabets with about 40 characters, but the sequence of consonants and vowels is partially reversed in rendering, and consonant clusters must be rendered as ligatures of the two or more characters involved - the pure single letters would look very ugly to an Indian reader. An Indian font for one writing system already contains several hundred glyphs. Unfortunately, Indian ligatures are not contained in Unicode (while Arabic ones are), so various vendor standards apply for encoding such ligatures.

7. Input methods in Tcl/Tk

To get outlandish characters not seen on the keyboard into the machine, they may at lowest level be specified as escape sequences, e.g. "\u2345". But most user input will come from keyboards, for which many layouts exist in different countries. In CJK countries, there is a separate coding level between keys and characters: keystrokes, which may stand for the pronunciation or geometric components of a character, are collected in a buffer and converted into the target code when enough context is available (often supported by on-screen menus to resolve ambiguities).

Finally, a "virtual keyboard" on screen, where characters are selected by mouse click, is especially helpful for non-frequent use of rarer characters, since the physical keyboard gives no hints which key is mapped to which other code. This can be implemented by a set of buttons, or minimally with a canvas that holds the provided characters as text items, and bindings to <1>, so clicking on a character inserts its code into the widget which has keyboard focus. See iKey: a tiny multilingual keyboard.

The term "input methods" is often used for operating-system-specific i18n support, but I have no experiences with this, doing i18n from a German Windows installation. So far I'm totally content with hand-crafted pure Tcl/Tk solutions - see taiku on the Wiki.

8. Transliterations: The Lish family

The Lish family is a set of transliterations, all designed to convert strings in lowly 7-bit ASCII to appropriate Unicode strings in some major non-Latin writing systems. The name comes from the common suffix "lish" as in English, which is actually the neutral element of the family, faithfully returning its input ;-) Some rules of thumb:

  • One *lish character should unambiguously map to one target character, wherever applicable
  • One target letter should be represented by one *lish letter ([A-Za-z]), wherever applicable. Special characters and digits should be avoided for coding letters
  • Mappings should be intuitive and/or follow established practices
  • In languages that distinguish case, the corresponding substitutes for upper- and lowercase letters should also correspond casewise in lower ASCII.

It all began with Greeklish, which is not my invention, but is used by Greeks on the Internet for writing Greek without Greek fonts or character set support. I just extended the practice I found with the convention of marking accented vowels with a trailing apostrophe (so it is no longer a strict 1:1 transliteration). The Tclers' Wiki http://wiki.tcl.tk/ has the members of the Lish family available for copy'n'paste. The ones I most frequently use are

  • Arblish, which does context glyph selection and right-to-left conversion;
  • Greeklish;
  • Hanglish for Korean Hangul, which computes Unicodes from initial-vowel-final letters;
  • Ruslish for Cyrillic.

Calling examples, which return the Unicode strings for the specified input:
   arblish   dby w Abw Zby
   greeklish Aqh'nai
   hanglish  se-qul
   heblish   irwsliM
   ruslish   Moskva i Leningrad
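
At their simplest, such transliterations are little more than a string map; a hypothetical, heavily abridged sketch (the real procs on the Wiki cover full alphabets, accents, and context rules):
 proc greeklish-mini s {
    string map {
        a \u03b1 b \u03b2 g \u03b3 d \u03b4 e \u03b5
        k \u03ba l \u03bb m \u03bc n \u03bd s \u03c3
    } $s
 }
 % greeklish-mini banana
 βανανα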

9. Conclusion

At university and work, I have dealt with i18n for over 25 years, with code points, fonts, and their rendering. It began very hard, when I had to take care of every single pixel, and gradually became easier, with encodings getting standardized, and exotic fonts available for download. But my real epiphany was when I installed an alpha version of Tcl 8.1, and interactively tried my first \uXXXX. Tk's automatic font finding struck me like magic - and with TrueType fonts on Windows, I could even resize them in real time...

Later I discovered the convenience of encoding conversions, which, apart from dealing with strange input files, also allow for the use of inherent orderings (e.g. Chinese, Japanese) for lean implementations of input methods. Given Tcl/Tk, and at least one font that has the characters I need, i18n really is almost as easy as child's play - and cross-platform, from Unix/Linux and Windows boxes down to the little PocketPC in my hand :-)

Programming languages are a matter of taste, but I can only testify that the Tcl language is best prepared for dealing with i18n issues: for instance, it took me four lines of code to "upgrade" my English eBook reader iRead so it could also render a Chinese eBook in Big-5 encoding, and it was similarly simple to extend the dictionary browser iDict to accept UTF-8 files. I recommend Tcl for working with many languages. Even if you don't want or need that, Unicode (plus suitable fonts) can offer surprising support, e.g. providing glyphs for chessmen or card suit symbols...