Into the Mist: How Linux Console Fonts Work LG #91

...making Linux just a little more fun!

Into the Mist: How Linux Console Fonts Work
By En D Loozzr

THE CONSOLE DRIVER

As of Linux 2.4.x, the kernel includes a console driver sub-divided in a keyboard driver and a screen driver. The console driver is being entirely re-written for Linux 2.6.0 but at this stage, basically, the keyboard driver sends characters to an application, the application does its job and requests from the screen driver some output on the display. The console driver is complemented by the kbd package which is likely to reside either in /usr/share/kbd/ or in /usr/lib/kbd/.

In the path from the keyboard driver to the application and further to the screen driver, the characters are nothing but codes (hex numbers). And since in the end we want to see their little pictures (glyphs) on the screen there must be a way to associate the glyphs with those codes.

This article will focus on the screen driver only, taking for granted that something happens between keyboard and application. Some basic notions of fonts are required. Also keep the man page for the utility 'setfont' handy. The article is based on material from:

ftp://win.tue.nl/pub/linux-local/utils/kbd/
ftp://ftp.debian.org/debian/pool/main/c/console-tools/
http://qrczak.home.ml.org/programy/linux/fonty/

UNICODE

Traditionally, character encodings use 8 bits and are thus limited to 2^8=256 characters, which is not enough. Of course, once upon time printers and monitors knew nothing about diacriticals (accents, umlaut etc.) and further back in time they only had capitals and despised lower case. Those times are over and in the wake of i18n (internationalisation) 256 characters qualify as appetizers.

The UCS (Universal Character Set), also known as Unicode, was created to handle and mix all the world scripts, including the ideographs from China, Korea, Japan. It has more than 65000 characters for a start but it can go up to 2^31, figure it out.

UCS is a 32-bit/4-byte encoding. It is normalised by ISO as the 10646-1 standard. The most widely used characters from UCS are contained in its UCS-2 16-bit subset. This is the subset used now for the Linux console. The character set Linux uses by default for N and S America, W Europe and Africa is called latin1 or ISO 8859-1.

For convenience, an encoding called UTF-8 was designed for ASCII backward compatibility. All characters that have a UCS encoding can be expressed as a UTF-8 sequence, and vice-versa. Nonetheless, UTF-8 and Unicode are distinct encodings.

In UTF-8 mode, the console driver treats the ASCII range exactly as before, so old text viewers can continue to display ASCII. Characters above the ASCII range are converted to a variable length sequence of bytes (up to 6 bytes per character), UTF means indeed Unicode Transformation Format and UTF-8 covers the conversion of 8-bit characters - which was the range occupied by the traditional character sets.

Unicode is complex. Just keep in mind that it allows to assign an ID to any character. That ID has four bytes in its full form, and two bytes in UCS-2 subset, and here the unicode ID looks like e.g. 0x2502 also written as U+2502. If you know that ID, you can pick up the glyph (picture) for that character from a suitable font. Indeed, even the names of the glyphs are standardized and all capitals, e.g.:

FEMININE ORDINAL INDICATOR

All clear?

Problem 1: find out the official name for a given unicode
Problem 2: get the glyph for a given unicode

Problem 1 is not critical as far as the Linux console driver is concerned. The most common official names can be found in some *.trans files in kbd directory ../consoletrans or some *.uni files in the kbd directory ../unimaps. For more, refer to:

http://partners.adobe.com/asn/developer/typeforum/unicodegn.html

The real hassle is problem 2.

GLYPHS

Although we have already been speaking of glyphs and it is kind of intuitively clear what they are, here are some additional remarks.

Launch your winword or equivalent word processor and type the letter 'a' several times changing font and size every time. All those a's look similar while they do differ in shape and size. What they have in common is that they all represent one glyph, the glyph for 'a'.

The reference to a glyph is just an abstraction from the particular font you will necessarily be using in order to see something.

A font a is a collection of glyphs in a particular shape. While in graphic mode the typeface (shape) is emphasized, in the console we mostly bother about which glyphs are included or not included - and possibly about font size. A soft font for the console comes in a binary file with bit patterns for each glyph. And there is a hardware font in the ROM of the VGA adapter. This is the font you will see, if no soft fonts are loaded at boot time.

UNIMAP

The Screen Font Map gives, for each position in the console font, the list of characters it will render. Under Linux 2.4.x, the screen driver is based on the UCS-2 encoding.

The Screen Font Map is also called Unicode Map or Unimap or Console Map or Screen Map or psf table or whatever. The terminology varies a lot and does not contribute to easy understanding. Especially not as these terms had a different meaning before Unicode came up. And especially not when files that serve the same purpose and have the same format are named with different extension. Since it seems to be spreading and it sounds quite distinct, let us opt for unimap and its files *.uni. If you come across console utilities other than those from the kbd package, be wary of the terminology jungle.

There is always a unimap. It is included in the font or it is loaded from a distinct file or - as a last resort - it is the default straight-to-font or direct-to-font or trivial mapping or direct mapping or null mapping or idem mapping or identity mapping. Here again terminology has not settled and is hindering user empowerment. Idem mapping means that a request for character e.g. 0xB3 is received and the glyph at position 0xB3 in the font is directly picked up. To make the mess messier, the straight-to-font map is sometime not considered to be a unimap. We prefer to say that there is always a unimap even if setfont from the kbd package says otherwise. They use the option

setfont -u none

to enforce straight-to-font. mapscrn, now incorporated into setfont, used to call straight-to-font a special unimap. This is the more sensible choice, we'll stick to it.

One glyph can do for several different unicodes. How come? Well sometimes identical glyphs get multiple names. For instance, the capital letter 'A' is available in Russian and English with different names. But a font that covers both English and Russian does not need the glyph for 'A' twice. So two different unicodes give in this case the same visual result.

It can also happen that two glyphs are different but close to each other visually and only one of them is included in the font to save space and serves as surrogate for the other. This is analog to old habits from the era of the typewriter. For instance, opening and closing quotation marks were the same although in typography they are distinct.

Surrogates are formalised with the fallback entries. A fallback entry is a series of two or more UCS-2 codes, separated by whitespace. The first one is the unicode we want a glyph for. The following ones are those whose glyph we want to use when no glyph designed specially for the first code is available. The order of the codes defines a priority order (own glyph if available, then the second char's, then the third's, etc.)

Fallback entries are enabled if included in the unimap with a line like:

0x04a U+20AC U+004A

(That means: for character numbered 0x04a we want the Euro symbol. If not available, take the currency symbol.)

SCREEN MODES

There are two screen modes, single byte mode (until recently the widely used default) and UTF-8 mode. Switching the screen to and from UTF-8 mode is done with the escape sequences '\e%G' and '\e%@' at the prompt. By issuing:

unicode_start
unicode_stop

you switch both keyboard and console to and from UTF-8.

In UTF-8 mode, the bytes received from the application and to be written to the screen are interpreted as a UTF-8 sequence, turned into unicodes and looked up in the unimap to determine the glyph to use.

Single byte mode applies an additional intermediate map to the bytes sent by the application before using the unimap.

This intermediate map used to be called the Application Charset Map or Application Console Map (ACM or acm). Unfortunately, this is the terminology of the console-tools package that seems to have quietly passed away.

The kbd package does not give any special name to the map, it refers to it as a translation table and puts it in files with extension .trans. The man page for setfont calls it Unicode console map which is extremely odd since it evokes the Unicode map (unimap). As a way out of the impasse, let us call it cmap, an abbreviation that already occurs here and there.

Here is a simple diagram for the two modes:


    single byte mode:
        application ->      cmap ->         unimap -> screen
                     (bytes)      (UCS-2)

    UTF-8 mode:
        application ->                      unimap -> screen
                     (UTF-8 / UCS-2)

Memorize this diagram because it is the machete to cut through the documentation jungle. Make sure you can tell cmap from unimap: what does the cmap do?

WHAT DOES THE CMAP DO?

There are several formats for the cmap and only one that allows to understand what the map really does. As an example, have a look at the file cp437_to_iso01.trans in directory ../consoletrans of the kbd package. Code page 437 stems from the early DOS and is still the font in the ROM of any VGA adapter.

This file has two columns of hex numbers. The first column is an enumeration of the slots in the font, 256 positions maximum. Only 256 can be handled by the cmap.

The second column is the translation. The file under consideration makes it possible to use a cp437 font as if it were a latin1 font. The translation is not perfect but it works. Example:

0xA1 0xAD

The character 0xA1 in cp437 is an accented vowel which is not correct for this code in latin1. So cmap is informing the console driver to react as if the character request were for 0xAD. The console driver goes into the unimap (straight-to-font) and reads the unicode at position 0xAD. This happens to be U+00a1, the inverted exclamation mark. Next stop is the font where the glyph for U+00a1 has to be picked up. In the end, we had a request for 0xA1 but we did not get the character at that position in cp437, we got the inverted exclamation mark for the position 0xA1 in latin1. Our cp437 is behaving like a latin1 font thank to the cmap.

This example works flawlessly but since cp437 and latin1 differ a lot, in other cases you will get a miss, represented by a generic replacement character. Or you will get an approximation, a surrogate. For instance, you get a capital 'A' where you would need the same letter with a circumflex on top of it.

When using 256 char fonts, a cmap that really translates means surrogates. When no surrogates are needed, the cmap is straight-to-font: every character is translated into itself, only the unimap is relevant. This is the most natural and common case.

However, a font may be designed to cover more than one character set. This is evident for 512 char fonts but there are indeed 256 char fonts that can handle more than one character set (albeit only partially). If you are using such a font, the cmap allows you to select one of the character sets covered. One example (lat1-16.psfu) is discussed below.

G0/G1 LEGENDS

Although there is only one cmap active at a given time, the kernel knows four of them. Three of them are built-in and never change. They define the IBM code page 437 from early DOS versions with box draw characters, the DEC VT100 charset also with box draw characters, and the ISO latin1 charset. The fourth kernel charset is user-defined, is by default the straight-to-font mapping, and can only be changed loading a soft font.

The console driver has two slots labelled G0 and G1, each with a reference to one of the four kernel charsets. G0 and G1 can vary from console to console as long as they point to cp437, vt100, latin1. If you put a cmap different from those three in any slot G0 or G1 in any console, all other consoles will switch to that same user-defined charset. By default, G0 points to latin1, G1 points to vt100. G0 and G1 can be acted upon with escape sequences at the prompt. And although they are mentioned quite often, you better leave them alone. Why?

If you load a soft font and send escape sequences to switch between kernel charsets, you may well be applying to your soft font a translation that produces plenty of junk. The cmap you select must be suitable for your font and be a team player with the current unimap. The only guarantee you have in this respect is to rely on setfont and control both cmap and unimap. If you start mixing setfont commands with escape sequences to the console, also partly relying on defaults, you may (you will!) end up losing any sense of orientation. To keep cmap and unimap under control, use fonts that have a unimap built-in and use

setfont -m none this_beauty_of_font.psfu

when loading a 256 char soft font. This gives a good guarantee of no interference if you are not playing with keyboard tools at the same time since keyboard tools may affect the console font. For 512 char fonts, you must know what's inside, and you must know the names of the charsets covered (i.e. the corresponding files *.trans) otherwise you will not be able to switch between them.

And what about the user-defined character set? If you have loaded a soft font (and any run of setfont loads a soft font except when you are just saving from the current font to disk), the escape sequence to pick up the user-defined character set from the kernel will make that soft font active with the charset implicit to it as cmap and you will not be able to revert to the ROM font. If you look into setfont's source code, you will see that they are activating the soft font's character set anyway. Forget the user-defined character set, it's none of your business, leave it to setfont.

On the other hand, if you run the ROM font and have not loaded a soft font, requesting the user-defined charset will only reset to cp437, the reason being that the user-defined charset has the default value straight-to-font. For instance, assume that you have chosen vt100 which does not have lower case letters and will immediately display junk. Send the escape sequence for the user-defined charset (which has not been defined yet and so still has the default value): the junk disappears, you get the lower case letters again.

There is, however, a soft font which has been explicitly made to cope with the kernel charsets. This font is called

lat1-16.psfu

and is not a latin1 font as the name suggests, it is a mongrel. With the cmap set to cp437 it will deliver most of cp437 (all block and box draw elements), with the cmap set to latin1 it will deliver latin1. And it will also deliver vt100 should anybody care for it. Requesting the user-defined cmap unveils that the font uses the normally empty control ranges (0-31, 128-159) to pack together chars from cp437 and latin1.

Advice: if you are in a region where latin1 is not suitable, stick to the font provided by your distro (and kiss most probably good bye to the box draw elements). If latin1 is ok, use lat1-16.psfu. That will give you the latin1 characters plus box lines for your file manager.

DOCUMENTATION OR LACK THEREOF

The issues around Linux console fonts are poorly documented. The man pages are too dense, the terminology is windy, the HOWTO that comes with the kbd package is a despair, I wonder whether people who recommend it ever tried to read it.

The stuff presented in this article is elementary and still took quite an effort to grasp. Let us summarize it from a different angle, it will do no harm.

The following distinctions should be kept in mind:

(i) ROM font (always 256 characters) (ii) console soft font
(a) 256 characters maximum (b) 257-512 characters

The ROM font is the old DOS cp437. After changing from the ROM font to a soft font, there is no return to the ROM font except rebooting.
The cmap and unimap must fit the font or junk will abound. When the mappings fit, a font can be used with different personalities and cover multiple character sets. This is stark in evidence when using a 512 char font.
If you think a 512 char font is 'naturally' a sensible solution, think again - since a 512 char font will disable bold colours. Red Hat 8.0 has such a font as a default and Red Hat users are not pleased. Check, however, what is said below on Mandrake.
You can only display characters that are in the current console font. This is to say, you cannot (!) switch from font A to B on-the-fly, display characters that are in B but not in A, switch back to A. When you are back to A, those alien B characters will be overwritten at the next screen update. This is why your file manager, say Midnight Commander, will not display box lines if you use a pure latin1 font (which has no box elements).

Somebody is working on a new console driver for Linux 2.6.0. Can we place an order? A trick to use console fonts bigger than 512 characters; each console its own font; no interference of big fonts with colours. Thank you very much.

QUERIES & ANSWERS

How do I enforce the ROM font in the console?

There might be a utility for that somewhere but it is not in the kbd package. Without such a utility, the only way to enforce the ROM font is to boot into the ROM font. Check your init scripts and make sure no soft font is loaded. If you fail, rename the directory where the soft fonts reside so it cannot be found at boot time.

How do I save the ROM font to a file?

When using the ROM font, issue
echo -ne '\e(U'
setfont -o cp437-16.psf
at the prompt. The file cp437-16.psf contains the ROM font. This font has a height of 16 pixels.

How do I find out which font the console is currently using?

If you mean which name the font has, look in the boot scripts and/or the shell history to find out what soft font was loaded last (possibly none, so the ROM font is on). If you want to see the characters in the font according to their internal arrangement, issue
echo -ne '\e(K'
setfont -om current_font.trans
and look inside current_font.trans with an editor. This does not work 100% because certain character ranges (0-31 and 128-159) are not properly displayed although they may be storing glyphs. If the font has a unimap, the unimap will list all characters with their official names. That will often give an idea of the glyph.

I have created my own font based on latin1 but adding box draw elements in the unused range 128-159. It works but the horizontal lines have little gaps. How come?

The characters are 8 pixel wide but the VGA hardware adds a 9th column of blanks so as to display them at a small distance from the each other. That is very appropriate for most characters but not for horizontal line segments that should rather close up to each other. For this reason, the VGA hardware makes an exception for box draw elements: instead of inserting blanks, the 9th column repeats the 8th column of pixels. So far, so good. But how does the VGA adapter know where you put your box draw elements? It does not, either you put them in the same range as they were in cp437 or you will get gaps.

How can I use a 512 char font and save my bold colours?

You will have to boot into the framebuffer, for details see Framebuffer-HOWTO.html. Opinions about the framebuffer are divided, Mandrake boots into the framebuffer by default, SuSE advises against. Red Hat's official position is not known to me but they do not boot into the framebuffer although they use a 512 char console font that disables bold colours.

The lati1-16.psfu is a 256 char font and still covers more than one charset. How is it possible?

It is only possible because it covers charsets only partially or covers charsets that are smaller than 256 characters. cp437 is full house, it has exactly 256 characters so lat1-16.psfu covers it only partially. On the other hand, latin1 keeps the control range 0-31 and 128-159 empty so it has only 192 characters. vt100 is handled as 128 characters but complemented with latin1 in the 160-255 range. So what lat1-16.psfu does is essentially keeping box and block draw elements where they used to be in cp437 and moving latin1 characters elsewhere. This way everything fits within 256 characters. Well done.

Is the console font unique for all consoles or may it vary from console to console?

The console font is the same for all consoles, what can vary are the character sets (cmaps) used in the consoles.