Fonts
by Erik van der Poel <erik@netscape.com>
Last Modified: Feb. 19, 1999

Introduction

This document discusses the design and implementation of Mozilla's font subsystem. The particular focus is on Unicode and internationalization.

The Problem

Mozilla has chosen Unicode as the internal character encoding. This was decided in part because HTML is based on Unicode. Although HTML documents exist in a variety of character encodings, numeric character references are defined in terms of ISO 10646 (a superset of Unicode). Other reasons for choosing Unicode are that it is mostly fixed width and that it can represent most of the world's characters.

The problem to be discussed here, then, is how to draw Unicode on a number of devices, particularly screens and printers. These devices are accessible from computers running a number of different OSes, e.g. Windows, MacOS and Unix. The details of the proposed solution to this Unicode problem are highly system-dependent, and will be discussed here too.

Most fonts only offer glyphs for a subset of Unicode. Although there are fonts that contain a large subset of Unicode (e.g. Lucida Sans Unicode, Bitstream Cyberbit), these fonts do not always provide the stylistic properties that authors and users prefer. Hence, these fonts are often referred to as "last resort" fonts, to be used only when other, more desirable fonts are unavailable or do not contain the required glyphs.

A Unicode string may contain characters from a number of different parts of the world, or from a number of fields such as mathematics. It may be necessary to use a number of fonts to draw a particular Unicode string, switching from one font to another as we proceed. We will call this process "font switching".

CSS defines a property called font-family that contains an ordered list of fonts. These fonts are supposed to be tried in order, checking both for availability of the font itself and for availability of glyphs to draw the current text. Mozilla will have to implement these font lists in order to support CSS.
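
As a rough illustration (not actual Mozilla code), the C++ sketch below turns a CSS font-family value into the ordered list of family names that the font code would then try in turn. The function name and the string handling are invented for this sketch; the real parsing would belong to the style system rather than the font code.

    #include <iostream>
    #include <string>
    #include <vector>

    // Split a CSS font-family value such as "Verdana, 'MS Gothic', serif" into
    // {"Verdana", "MS Gothic", "serif"}, preserving the author's order and
    // stripping quotes and surrounding whitespace.
    std::vector<std::string> ParseFontFamily(const std::string& value)
    {
        std::vector<std::string> families;
        std::string current;
        bool inQuote = false;
        char quoteChar = 0;

        for (char c : value) {
            if (inQuote) {
                if (c == quoteChar)
                    inQuote = false;
                else
                    current += c;
            } else if (c == '\'' || c == '"') {
                inQuote = true;
                quoteChar = c;
            } else if (c == ',') {
                while (!current.empty() && current.back() == ' ')
                    current.pop_back();              // trim trailing spaces
                if (!current.empty())
                    families.push_back(current);
                current.clear();
            } else if (!(c == ' ' && current.empty())) {
                current += c;                        // skip leading spaces only
            }
        }
        while (!current.empty() && current.back() == ' ')
            current.pop_back();
        if (!current.empty())
            families.push_back(current);
        return families;
    }

    int main()
    {
        for (const std::string& f : ParseFontFamily("Verdana, 'MS Gothic', serif"))
            std::cout << f << "\n";                  // tried in this order
    }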

Prior to the advent of CSS, HTML documents were rendered using fonts that depended in part on the document character encoding (charset). Since both authors and users of such "old-style" documents have become accustomed to this behavior, Mozilla should preserve it as much as possible. When an HTML document is not accompanied by CSS font rules, we should use a specially tailored font list whose first font is based on the document's charset.

This means favoring whatever font the user has chosen for Japanese, when the document is in a Japanese charset such as Shift_JIS (and there are no font specifications such as CSS or HTML's FONT FACE). The old browser stored font choices in the preferences file, and the new Mozilla could use this as is, or migrate the user's old values to whatever new preference file format we come up with.
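
A minimal sketch of this charset-based fallback, assuming a hypothetical preference key scheme of the form font.name.<generic>.<langGroup> and a LookupPref() helper. The key names, the charset-to-language-group table and the font names are all invented here for illustration.

    #include <iostream>
    #include <map>
    #include <string>

    // Stub standing in for the real preference store; keys and values are examples.
    static std::string LookupPref(const std::string& key)
    {
        static const std::map<std::string, std::string> prefs = {
            { "font.name.serif.ja",        "MS Mincho" },
            { "font.name.serif.x-western", "Times New Roman" },
        };
        auto it = prefs.find(key);
        return it != prefs.end() ? it->second : std::string();
    }

    // Map a document charset to a language group, then to the user's chosen font.
    std::string DefaultFontForCharset(const std::string& charset,
                                      const std::string& generic /* e.g. "serif" */)
    {
        static const std::map<std::string, std::string> charsetToLangGroup = {
            { "Shift_JIS",  "ja" },
            { "EUC-JP",     "ja" },
            { "EUC-KR",     "ko" },
            { "Big5",       "zh-TW" },
            { "GB2312",     "zh-CN" },
            { "ISO-8859-1", "x-western" },
        };
        auto it = charsetToLangGroup.find(charset);
        std::string langGroup = (it != charsetToLangGroup.end()) ? it->second
                                                                 : "x-western";
        // e.g. "font.name.serif.ja" -> "MS Mincho", if that is what the user chose;
        // this font becomes the head of the font list for the document.
        return LookupPref("font.name." + generic + "." + langGroup);
    }

    int main()
    {
        // A Shift_JIS page with no CSS or FONT FACE gets the user's Japanese serif.
        std::cout << DefaultFontForCharset("Shift_JIS", "serif") << "\n";
    }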

Since CSS itself does not have the concept of assigning particular fonts to particular charsets, we are left with the dilemma of whether to base the new font preferences dialog on CSS's font-family lists or the old charset-based selection (or a combination of these). However, regardless of the eventual choice of UI, the GFX implementation will certainly need to support font switching, and so that is what this document will focus on initially.

Another problem is Unicode's Han unification. Unicode uses a single set of characters for Chinese, Japanese, Korean and the other languages that use Han ideographs. How do we know which font to use if the document is in Unicode? One way is to use HTML's LANG attribute. If the attribute for a particular span of text says "ja", then we can use a Japanese font for that span.
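
For illustration only, here is a sketch of how LANG information could settle the Han-unification question for a span; the language codes and font names are examples rather than a proposed mapping.

    #include <iostream>
    #include <string>

    // Pick a Han font for a span of Unicode text.  An explicit language on the
    // span wins; otherwise fall back to the user's per-language preference
    // (represented here by a single default).
    std::string ChooseHanFont(const std::string& langAttr,        // "ja", "zh-TW", "" if absent
                              const std::string& userDefaultFont) // from the prefs, as above
    {
        if (langAttr == "ja")                            return "MS Mincho"; // Japanese shapes
        if (langAttr == "ko")                            return "Batang";    // Korean
        if (langAttr == "zh-TW" || langAttr == "zh-HK")  return "MingLiU";   // Traditional Chinese
        if (langAttr == "zh-CN" || langAttr == "zh")     return "SimSun";    // Simplified Chinese
        return userDefaultFont;   // no LANG information: use the user's preference
    }

    int main()
    {
        // <span lang="ja">...</span> in a Unicode document gets Japanese glyph
        // shapes even though the code points are shared with Chinese and Korean.
        std::cout << ChooseHanFont("ja", "SimSun") << "\n";   // MS Mincho
        std::cout << ChooseHanFont("",   "SimSun") << "\n";   // SimSun
    }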

Proposed Solution

The CSS spec says that the implementation is supposed to process the font-family list for each character in the text. If the first font does not exist, or does not contain a glyph for the current character, then the next font should be checked, and so on, for each character.
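
A naive sketch of this per-character process, with an invented FontHandle type and a toy font table standing in for the platform-specific GFX interfaces.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    struct FontHandle {
        std::string        name;
        std::set<char16_t> coverage;     // stand-in for the font's real glyph table
        bool               exists = false;
        bool Exists() const              { return exists; }
        bool HasGlyph(char16_t ch) const { return coverage.count(ch) != 0; }
        void DrawChar(char16_t ch) const {
            std::cout << name << " draws U+" << std::hex << (unsigned)ch << "\n";
        }
    };

    // Pretend system font table: only two faces installed, with tiny repertoires.
    FontHandle LoadFont(const std::string& family)
    {
        static const std::map<std::string, std::set<char16_t>> installed = {
            { "Arial",     { u'A', u'B', u'C' } },
            { "MS Gothic", { u'A', u'\u65E5', u'\u672C' } },
        };
        FontHandle f;
        f.name = family;
        auto it = installed.find(family);
        if (it != installed.end()) { f.exists = true; f.coverage = it->second; }
        return f;
    }

    // For every character, walk the author's font-family list until we find a
    // font that both exists and has a glyph for the character.
    void DrawString(const std::vector<std::string>& fontFamilyList,
                    const std::u16string& text)
    {
        for (char16_t ch : text) {
            for (const std::string& family : fontFamilyList) {
                FontHandle font = LoadFont(family);
                if (font.Exists() && font.HasGlyph(ch)) {
                    font.DrawChar(ch);
                    break;   // font switching happens here, per character
                }
            }
            // A real implementation would fall back to a "last resort" font
            // (or a missing-glyph box) if no listed font covers ch.
        }
    }

    int main()
    {
        // The middle character forces a switch from Arial to MS Gothic and back.
        DrawString({ "Verdana", "Arial", "MS Gothic" }, u"A\u65E5B");
    }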

Since this process could be very slow if implemented badly, we will pay particular attention to finding an efficient algorithm. Checking for the availability of a particular font is not, in itself, a slow operation, and it could even be sped up by caching the list of fonts (if appropriate on the given platform).

However, checking for the availability of a glyph could be very expensive if we call the OS for every character. The proposal is to cache this information in a 64K bit array (one bit for each code point in double-byte Unicode). (The current proposal is not to support surrogates and combining marks in the first implementation.) The bit array would only be created for fonts that are actually loaded (and not for fonts further down in the font-family list). This means that bit arrays are only created for the fonts needed by the characters that actually appear in the document (if the font list has a reasonable order). For example, English speakers will not pay the time and memory costs of Japanese fonts if the first font in the list is an English font.

Instead of creating a separate 64K bit array for each font, we will create one bit array per set of glyphs. Many fonts have identical glyph repertoires (e.g. the Windows WGL4 set), so we can save memory by sharing a single bit array among such fonts.
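
A sketch of such a coverage cache, assuming a hypothetical GetCoverageFromOS() hook that asks the platform which code points a face supports; only the 64K bit array and the sharing idea come from the proposal above, and the class and function names are invented.

    #include <bitset>
    #include <map>
    #include <memory>
    #include <string>

    // One bit per 16-bit Unicode code point: 65,536 bits = 8 KB per repertoire.
    typedef std::bitset<65536> GlyphMap;

    // Stub standing in for the platform-specific query that asks the OS which
    // code points a face covers (here: pretend every face covers printable ASCII).
    std::shared_ptr<GlyphMap> GetCoverageFromOS(const std::string& /*faceName*/)
    {
        auto map = std::make_shared<GlyphMap>();
        for (char16_t ch = 0x20; ch < 0x7F; ++ch)
            map->set(ch);
        return map;
    }

    class GlyphMapCache {
    public:
        // Fonts with identical repertoires (e.g. the many WGL4 faces on Windows)
        // end up sharing a single GlyphMap instead of each holding its own 8 KB.
        std::shared_ptr<GlyphMap> GetFor(const std::string& faceName)
        {
            auto cached = mByFace.find(faceName);
            if (cached != mByFace.end())
                return cached->second;                 // already loaded

            std::shared_ptr<GlyphMap> fresh = GetCoverageFromOS(faceName);

            // Share with any identical repertoire that is already cached.
            // (A real implementation might hash the bits instead of comparing.)
            for (const auto& entry : mByFace) {
                if (*entry.second == *fresh) {
                    mByFace[faceName] = entry.second;
                    return entry.second;
                }
            }
            mByFace[faceName] = fresh;
            return fresh;
        }

        // O(1) coverage test, with no call into the OS.
        static bool HasGlyph(const GlyphMap& map, char16_t ch)
        {
            return map.test(ch);
        }

    private:
        std::map<std::string, std::shared_ptr<GlyphMap>> mByFace;
    };

    int main()
    {
        GlyphMapCache cache;
        auto arial   = cache.GetFor("Arial");
        auto verdana = cache.GetFor("Verdana");
        // Identical (stub) repertoires, so the two faces share one 8 KB bitmap.
        return (arial == verdana && GlyphMapCache::HasGlyph(*arial, u'A')) ? 0 : 1;
    }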

If we use these bit arrays on all platforms, we will certainly be able to share some code (making it XP, i.e. cross-platform). However, the code that checks which glyphs are available is going to be platform-specific. The details are not discussed here (currently).

Another performance consideration is the caching of glyph codes. Generally, the layout process first measures a piece of text, then breaks it across lines (if necessary), and finally draws the lines of text. The font subsystem will have to traverse font lists and load fonts during the measuring phase. Naturally, performance-conscious engineers will come up with the idea of caching this info for the subsequent operations, such as drawing.

However, for this first release of the new Mozilla code, we have decided not to have an elaborate API to cache glyph codes. Instead, we will allow the measuring code to return a single 32-bit integer that can be used in any way by the font engine to speed up the subsequent drawing operation. The layout engine will cache this integer for a given piece of text, maintaining it as long as the text and its stylistic properties do not change.

The current proposal is to use this integer to store an index into a (short) array in the font object. The first element in this array stores a pointer to a function that checks every character in the given text to see which font to use. Each subsequent array element points to a function that deals only with a particular subset of Unicode, thereby obviating the need to visit each character for font switching purposes. For example, if a string is entirely composed of ASCII characters, we will only need an ASCII font, so we don't need to do any font switching. In that case, we might as well go directly to such a font instead of checking each character.
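
The sketch below illustrates the dispatch-table idea with invented names; the parts taken from the proposal are that slot 0 is the general per-character font-switching path, later slots are fast paths for text that needs no switching, and the slot index is the 32-bit value handed back to the layout engine.

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    class FontGroup;
    typedef void (*DrawFunc)(FontGroup&, const std::u16string&);

    void DrawWithFontSwitching(FontGroup&, const std::u16string& text)
    {
        // Slot 0: inspect every character and switch fonts as needed (slow path).
        std::cout << "general path: " << text.size() << " chars examined\n";
    }

    void DrawAsciiOnly(FontGroup&, const std::u16string& text)
    {
        // Slot 1: the measuring pass proved the text is pure ASCII, so one font
        // suffices and no per-character checks are required.
        std::cout << "ASCII fast path: " << text.size() << " chars drawn\n";
    }

    class FontGroup {
    public:
        FontGroup() : mFuncs{ DrawWithFontSwitching, DrawAsciiOnly } {}

        // Called while measuring: returns the slot index layout should cache.
        uint32_t ChooseDrawFunc(const std::u16string& text) const
        {
            for (char16_t ch : text)
                if (ch > 0x7F)
                    return 0;          // needs the general font-switching path
            return 1;                  // pure ASCII
        }

        // Called while drawing: a stale or zero hint still gives correct output.
        void Draw(const std::u16string& text, uint32_t hint)
        {
            if (hint >= mFuncs.size())
                hint = 0;
            mFuncs[hint](*this, text);
        }

    private:
        std::vector<DrawFunc> mFuncs;
    };

    int main()
    {
        FontGroup group;
        std::u16string ascii = u"hello";
        std::u16string mixed = u"hello\u65E5";

        uint32_t hint1 = group.ChooseDrawFunc(ascii);   // cached by layout
        uint32_t hint2 = group.ChooseDrawFunc(mixed);

        group.Draw(ascii, hint1);   // takes the ASCII fast path
        group.Draw(mixed, hint2);   // takes the general path
    }

In this sketch a stale or out-of-range hint simply falls back to slot 0, so the cached integer is purely an optimization and never affects correctness.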

Another idea is to use the 32-bit integer as a key into a hash table that stores actual glyph-code caches for every piece of text in the document. However, this would require some process to free that memory (e.g. discarding the least recently used strings). Or we could require the layout engine to notify the font engine when a particular piece of text has changed, and when the document is being freed. Initially, we will not implement this; instead we will implement the simple array index described above.

To Do

  • How to pass document charset and HTML LANG attribute information down to the font engine (possibly via CSS2's :lang?)
  • Need to update the font API to pass text as (pointer, offset, length) for contextual languages (and maybe also for kerning)?



Copyright © 1998 The Mozilla Organization.