Linux and X11 internationalization: the dead keys controversy


Abstract: This page provides information on using international character sets under Linux, and particularly under X11R6. The basic problem with X11R6 is that it places the burden of handling dead keys on X clients rather than on the X server or the kernel. As a result, most X clients do not handle dead keys correctly. Two solutions are presented here: a pragmatic one that solves the problem cleanly, and the more idealistic, official one.

The Basic Problem

I have been using microcomputers to write academic texts and personal mail for nearly 20 years, and have always found it unfair that the basic ASCII character set does not include the accented characters used by most Latin languages, especially those used by French and Portuguese.

I mean these characters:

Of course, this is a physical limitation: a keyboard with all the characters above would have more than 140 keys and be very cumbersome. Furthermore, there are many more possible accented characters, produced by composing a letter with an accent or diacritical. By composing I mean that the accent or diacritical is first typed, but nothing appears immediately on the screen (this is why such keys are called dead keys); then a second key is typed, which results in the composed character.

Example: "~" (tilde) gets typed first, nothing appears on screen, the cursor does not even move; then "a" gets typed, and the resulting "ã" gets displayed. If one wants to get the diacritical by itself, it is enough to type it followed by a space (e.g. "~" followed by <space bar> resulting in "~" being displayed).
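
The composition mechanism amounts to a small two-state machine, which a minimal C sketch can make concrete. Everything below is illustrative: the names compose_state and compose_feed, the tiny table, and the use of plain ASCII accents as dead keys are assumptions made for this example, not code from any keyboard driver or from Xlib. Results are ISO-8859-1 codes. What happens when an accent and a letter do not combine varies between systems; this sketch simply emits the letter and drops the accent.

```c
#include <stddef.h>

/* Hypothetical compose table: dead key + base letter -> ISO-8859-1 code. */
struct compose_entry {
    char dead;
    char base;
    unsigned char result;
};

static const struct compose_entry compose_table[] = {
    { '~',  'a', 0xE3 },  /* a with tilde */
    { '~',  'o', 0xF5 },  /* o with tilde */
    { '~',  'n', 0xF1 },  /* n with tilde */
    { '\'', 'a', 0xE1 },  /* a with acute accent */
    { '\'', 'e', 0xE9 },  /* e with acute accent */
    { '`',  'a', 0xE0 },  /* a with grave accent */
    { '`',  'e', 0xE8 },  /* e with grave accent */
    { '^',  'e', 0xEA },  /* e with circumflex */
    { '^',  'o', 0xF4 },  /* o with circumflex */
    { '"',  'u', 0xFC },  /* u with diaeresis */
};

#define COMPOSE_TABLE_LEN (sizeof compose_table / sizeof compose_table[0])

/* Holds the dead key typed so far, or 0 when none is pending. */
struct compose_state {
    char pending;
};

/* A key counts as "dead" if it starts at least one table entry. */
static int is_dead_key(char key)
{
    size_t i;

    for (i = 0; i < COMPOSE_TABLE_LEN; i++)
        if (compose_table[i].dead == key)
            return 1;
    return 0;
}

/*
 * Feed one key press. Returns the character code to display,
 * or -1 when nothing should appear yet (a dead key was swallowed).
 */
int compose_feed(struct compose_state *st, char key)
{
    char dead = st->pending;
    size_t i;

    if (dead == 0) {
        if (is_dead_key(key)) {
            st->pending = key;          /* remember it, display nothing */
            return -1;
        }
        return (unsigned char)key;      /* ordinary key */
    }

    st->pending = 0;
    if (key == ' ')
        return (unsigned char)dead;     /* accent + space -> the accent itself */
    for (i = 0; i < COMPOSE_TABLE_LEN; i++)
        if (compose_table[i].dead == dead && compose_table[i].base == key)
            return compose_table[i].result;
    return (unsigned char)key;          /* no match: sketch keeps only the letter */
}
```

Starting from a zero-initialised struct compose_state, feeding '~' returns -1 and feeding 'a' next returns 0xE3, the ISO-8859-1 code for "ã"; feeding '~' and then a space returns '~' itself.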

This composition mechanism is the most natural way to input accented characters. It is also a de facto standard in the computer industry, since it is used by the Microsoft Windows (tm) family of operating systems. And it is the one that most closely resembles the old typewriter method of typing accented letters (accent + backspace + letter).

Linux and the ISO-8859-1 international character set

ISO, the international standards body, long ago formalized a standard set of characters for most Western European languages: the ISO-8859-1 (or Latin-1) character set. Whereas ASCII is a 7-bit character set, ISO-8859-1 is an 8-bit superset of ASCII that includes most accented characters. It is the standard character set used on the Internet.

Table 1: the ISO-8859-1 international character set.

ISO-8859-1 covers the following languages: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

Linux can be configured to produce and display the entire set of accented characters in ISO-8859-1 (plus a few others) on the Linux console, using the standard composition mechanism described above and either a standard IBM international keyboard or any national keyboard. For the details of Linux console configuration, see the national HOWTOs available on the Linux Documentation Project site and its mirrors (see the Links section below).

The X Window system

Linux, as mentioned above, is very flexible when it comes to international support. However, Linux, just like any UNIX (tm) compatible operating system (except NextStep), does not include a Graphical User Interface (GUI). The standard GUI for Linux is called X Window, or simply X, and is recognized by its famous logo.

X cannot be described as a complete GUI, because it was not originally designed as one and there was never any industry-wide drive to make it into one. Recently the X Consortium, the body responsible for setting X standards, was incorporated into The Open Group, which is also responsible for Motif and CDE, two other standards that, together with X, form a somewhat heterogeneous user interface under the name of Open Desktop.

Even though I believe X presents some very interesting concepts, I also believe it has fallen somewhat behind, technologically and aesthetically, in the very fast-moving computer industry. The situation is even worse if you consider the pace at which Linux is evolving.

This inertia on the part of the X Consortium (and I have no reason to believe that it will get any better under The Open Group's direction) has fueled a debate on the efficiency and ergonomic qualities of X. Many projects are appearing on the Internet to fill the gaps left open by X in its own definition of a user interface (for example, FVWM and its recent variants). The future of these alternative GUI efforts is unknown, as they do not seem to abide by any particular standards.

Also note that while X as implemented by XFree86 is freely available, Motif and CDE are not.

X Window in its latest incarnation is called X11R6.3 (X version 11, release 6.3).

X11R6 international support

When it comes to supporting international character sets and keyboards, X11R6 has a complex system of so-called "locales" that are supposed to cover most languages. Through the LC_CTYPE environment variable, one can select the appropriate locale, which defines a set of rules for text input.
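
How a program ends up with one locale rather than another can be sketched in a few lines of C. POSIX resolves the character-type category from the environment in a fixed order: a non-empty LC_ALL wins, then LC_CTYPE, then LANG. The helper names below (pick_ctype_locale, ctype_locale_from_env) are made up for illustration, and the final fallback to "C" is an assumption, since POSIX leaves the default implementation-defined. A real program would normally just call setlocale(LC_CTYPE, "") and let the C library do this work.

```c
#include <stdlib.h>

/* Unset and empty variables are treated alike. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/*
 * Pick the locale name governing character handling, given the values
 * of the three relevant environment variables (any may be NULL).
 */
const char *pick_ctype_locale(const char *lc_all,
                              const char *lc_ctype,
                              const char *lang)
{
    if (is_set(lc_all))
        return lc_all;      /* LC_ALL overrides every category */
    if (is_set(lc_ctype))
        return lc_ctype;    /* then the category-specific variable */
    if (is_set(lang))
        return lang;        /* then the general default */
    return "C";             /* assumed fallback for this sketch */
}

/* Convenience wrapper reading the real environment. */
const char *ctype_locale_from_env(void)
{
    return pick_ctype_locale(getenv("LC_ALL"),
                             getenv("LC_CTYPE"),
                             getenv("LANG"));
}
```

The locale names themselves (for example pt_BR.ISO8859-1) depend on what is installed on a given system, so the sketch only passes them through.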

This locale setting is supported by a very complex keyboard event handling function, XmbLookupString, which handles right-to-left keyboard input, dead keys and many other features needed to support diverse languages, Chinese and Hebrew among them. However, XmbLookupString has only been available since X11R5, and many programs do not use it at all. They just call the much simpler XLookupString function.

Now, even though XLookupString ignores locale settings, it is supposed to accept all ISO-8859-1 characters and handle them. This is a quote from the Xlib docs:

Well, it seems most X client programmers have chosen the "expedient" way to handle keyboard input, and so use XLookupString. The only problem here is that XLookupString does not know how to compose keys, and so entirely ignores dead keys :-(, accepting only those accented characters that have a corresponding key on the keyboard.

David Dawes from the XFree86 Group has proposed that by using the Mode_Switch functionality and programming 4 keysyms per key, this problem could be circumvented. Well, I am sorry, but this does not seem like a very logical or ergonomic solution to me.
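
To make the proposal concrete: in the core X keyboard mapping a key can carry up to four keysyms, the first pair forming group 1 (unshifted and shifted) and the second pair forming group 2, with Mode_switch selecting group 2. The C sketch below models one such key. The struct, the function name and the choice of "ã" and "Ã" for the second group are illustrative assumptions, and plain ISO-8859-1 codes stand in for keysyms.

```c
/* Hypothetical model of one key carrying four symbols. */
struct key_symbols {
    /* [0] plain, [1] Shift, [2] Mode_switch, [3] Mode_switch + Shift */
    unsigned char sym[4];
};

/* The "a" key, with a-tilde (0xE3) and A-tilde (0xC3) in the second group. */
const struct key_symbols key_a = { { 'a', 'A', 0xE3, 0xC3 } };

/* Select one of the four symbols from the state of the two modifiers. */
unsigned char key_lookup(const struct key_symbols *key,
                         int shift, int mode_switch)
{
    int index = (mode_switch ? 2 : 0) + (shift ? 1 : 0);

    return key->sym[index];
}
```

Each key gains only two extra slots this way, whereas ISO-8859-1 alone defines six accented forms of a lowercase "a" (à á â ã ä å), so the remaining ones would have to be placed on unrelated keys.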

Proposed change

A very simple and transparent change to Xlib would provide backward and forward compatibility:

A temporary solution

Thomas Quinot designed and implemented a patch to the standard Xlib, which provides dead keys functionality on all unmodified X clients. His modified libX11 does dead key composition for all X clients, new or old. It is 100% compatible with the standard library and causes absolutely no misbehaviour on any X client, existing or future.

The change proposed in the previous section could be implemented using Thomas Quinot's patch, which is very short (less than a hundred new lines of code) and very cleanly programmed.

The long-term solution

In the long term, all X clients will have been modified to take into account Input Methods as defined in the Xlib documentation. Of course, in the long term we might all be dead, which is why I was very happy to find Thomas Quinot's short-term solution :-)

According to the X Consortium, all it takes to modify an X client to correctly handle dead keys is the addition of a single line of code in the source. Quoting:

And as far as other programs go, for PD software compiled from source, I still maintain the right thing to do is fix the broken programs -- that's why you have source -- and feed the changes back to the author. Any author who won't add a one-liner call to XtSetLanguageProc() should be pilloried on all the Usenet groups that apply. :-) Commercial (binary only) software that's broken should be fixed. I'd demand a fix or a refund if commercial software was this badly broken. I've been telling people about XtSetLanguageProc for over three years now, there's no longer any excuse. :-)

What this single line of code is, I later found in version 1.5 of the Linux Danish-HOWTO by Niels Kristian Bech Jensen, which by the way is recommended reading for anybody wanting to set up an international keyboard on a Linux system:

If you are using e.g. the Xt toolkit and a widget set like Motif you need only add one line to your program. As your first call to Xt, use XtSetLanguageProc. Like this:

    int main (int argc, char** argv)
    {
        ...
        XtSetLanguageProc (NULL, NULL, NULL);
        top = XtAppInitialize ( ... );
        ...
    }

Now your program will automagically look up the LC_CTYPE variable and interpret dead keys etc. according to the Compose tables in /usr/lib/X11/locale/. This should work for all Western European keyboard layouts and is entirely portable. As XFree86 multilanguage support gets better your program will also be useful in Eastern Europe and the Middle East.

This method of input is supported by Xt, Xlib and Motif v1.2 (and higher). According to the information I have available it is not supported by Xaw. If you have further information on this subject I would like to hear from you.

I have not tried the above, so I cannot verify it, but from all I have read it should work.

I18nized X clients

Let us call X clients that have been modified as described in the preceding section i18nized X clients. There are only three that I know of: xterm, xjed and, I believe, xemacs.

At this rate (1 per year), I guess it might take a little while until I can give up on Thomas Quinot's patch...

Links

The following links will take you to various pages or documents on international support on Linux systems:


Copyright 1997 Andrew D. Balsa