# STM publishing: tools, technologies and change

A WordPress site for STM Publishing

31 Oct 2011

## Unicode, Glyph IDs and OpenType: a brief introduction

As you read about OpenType fonts and Unicode you come across terms such as "Glyph IDs", Unicode characters/code points and suchlike. And this can be a bit puzzling: what's the relationship between them? In this post I'll try to give a brief introduction with the usual notice that I'm skipping vast amounts of detail in the interests of simplicity.

Just as a reminder, one extremely important concept to understand/appreciate is the difference between characters and glyphs. I've discussed this in a previous post but will summarise here (quoting from the Unicode standard):

• Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation.
• Glyphs represent the shapes that characters can have when they are rendered or displayed. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters.

I'll try to expand on this a little. Among the many things that the Unicode standard provides is a universal encoding of the world's character set: in essence, allocating a unique number to each of the characters covered by the standard. Unicode does not concern itself with the visual representation of those characters; that is the job of fonts: they provide glyphs.
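To make "a unique number per character" concrete, here is a minimal sketch using only Python's standard library. The helper name `describe` is my own invention for illustration; `ord` and `unicodedata.name` are standard Python. Note that nothing here says anything about what the characters look like.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the Unicode code point and official name of one character."""
    code_point = ord(char)  # the number Unicode allocates to this character
    return f"U+{code_point:04X} {unicodedata.name(char)}"


# A Latin letter, an accented Latin letter and an Arabic letter.
for ch in ["A", "\u00e9", "\u0628"]:
    print(describe(ch))
```

The output is a code point and a name for each character: no shape, size or style, because those belong to the font.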

Today, OpenType font technology is the dominant font standard and is supported by modern TeX engines such as LuaTeX and XeTeX. However, as you start to explore OpenType in more detail you start to see references to terms such as "Glyph ID" or "Glyph index" and may wonder how, or if, these relate to the Unicode character encoding (code points). The two key points to understand are:

• OpenType is concerned with glyphs.
• Unicode is concerned with characters.

For present purposes we can take the very simplistic view that an OpenType font is a container for a large collection of glyphs in the form of the lines and curves required to draw (render) them. Of course, OpenType fonts can provide a lot more than just the glyphs themselves. OpenType fonts can provide extensive support for high-quality typesetting via "features" and "lookups" which provide information that a typesetting or rendering engine can use to do its job (think of them as a set of "rules" for the typesetting/rendering engine to apply).

However, for now just think of an OpenType font as containing a set of glyphs where each glyph has a name and a numeric identifier called its Glyph ID. The Glyph ID is simply a number allocated to each glyph (e.g., by the font's creator) ranging from 0 to N-1 where N is the number of glyphs contained in the particular font. The point is that the Glyph ID has nothing to do with the Unicode encoding or code points: it's just a bookkeeping number used internally within the font.
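As a rough illustration of the 0 to N-1 numbering, the following sketch models a font's glyph list as a plain Python list. The glyph names are hypothetical and not taken from any real font; the one thing borrowed from the OpenType specification is that glyph 0 is the `.notdef` ("missing glyph") glyph.

```python
# Hypothetical glyph names, in the order a font's creator happened to store them.
glyph_order = [".notdef", "space", "A", "B", "a", "a.sc", "uni0628"]

# A glyph's ID is simply its position in that list: 0 .. N-1.
glyph_id = {name: gid for gid, name in enumerate(glyph_order)}

print("Number of glyphs (N):", len(glyph_order))
print("Glyph ID of 'A' in this toy font:", glyph_id["A"])
print("Unicode code point of the character A:", ord("A"))
```

In this toy font the glyph named `A` has Glyph ID 2, while the character A has the code point 65 (U+0041). A different font could put its `A` glyph at any other position.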

So, we have two sets of numbers: a universal standard for the world of Unicode characters (code points) and another arbitrary set of numbers (specific to each font) for the internal world of OpenType glyphs: the Glyph ID. So the question arises: how and where are these two universes joined together? The answer is that the magic glue is contained within the OpenType font itself: the so-called cmap table or, to give its full name, the Character To Glyph Index Mapping Table.

As the specification says:

"This table defines the mapping of character codes to the glyph index values used in the font."

Even a brief perusal of the OpenType specification will make it clear that it's a complex beast and certainly not a topic for detailed discussion here. However, the cmap table is the "secret sauce" within an OpenType font which glues together the Unicode world of characters with the OpenType world of glyphs.
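Conceptually, then, the cmap table behaves like a lookup from code point to Glyph ID. The sketch below models it as a Python dictionary for a hypothetical font; the real table is stored in compact binary formats, and the names `cmap` and `glyph_ids_for_text` here are my own. It assumes the usual convention that a character with no mapping falls back to glyph 0 (`.notdef`).

```python
# Toy font: Glyph ID is the index into this list (names are hypothetical).
glyph_order = [".notdef", "space", "A", "B", "uni0628"]

# Toy cmap: Unicode code point -> Glyph ID.
cmap = {
    0x0020: 1,  # SPACE
    0x0041: 2,  # LATIN CAPITAL LETTER A
    0x0042: 3,  # LATIN CAPITAL LETTER B
    0x0628: 4,  # ARABIC LETTER BEH
}


def glyph_ids_for_text(text: str) -> list[int]:
    """Map each character to a Glyph ID; unmapped characters get glyph 0."""
    return [cmap.get(ord(ch), 0) for ch in text]


gids = glyph_ids_for_text("AB Z")
print(gids)
print([glyph_order[g] for g in gids])
```

The letter Z is not in the toy cmap, so it maps to `.notdef`. A real typesetting engine would treat this list only as a starting point and then apply the font's features and lookups to it.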

Note: an OpenType font's cmap table can contain multiple subtables for different encodings, and a font may also contain a significant number of glyphs which are not covered by the cmap table. OpenType fonts may contain many different glyphs (representations) for a particular character and these visual variations fall outside the remit of the Unicode standard. For example, small caps, oldstyle numbers, swash characters, etc. differ only in visual design; they do not bring additional semantic meaning.
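One way to picture the glyphs that the cmap table does not cover is to compare a font's full glyph list with the set of glyphs the cmap points at. This is a small sketch with made-up glyph names (the `.sc` and `.oldstyle` suffixes are just a common naming habit, assumed here for illustration), and the cmap is keyed by glyph name for readability.

```python
# Hypothetical glyph list: some glyphs are alternate designs of one character.
glyph_order = [".notdef", "A", "a", "a.sc", "one", "one.oldstyle"]

# Toy cmap: code point -> glyph name. Only the default designs are mapped.
cmap = {
    0x0041: "A",    # LATIN CAPITAL LETTER A
    0x0061: "a",    # LATIN SMALL LETTER A
    0x0031: "one",  # DIGIT ONE
}

mapped = set(cmap.values())
unmapped = [name for name in glyph_order if name not in mapped]
print(unmapped)
```

The small-cap `a.sc` and the oldstyle `one.oldstyle` have no code point of their own: they are alternative shapes for characters that already have one, and are typically selected through the font's features rather than through the cmap table.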

One excellent Windows utility for inspecting cmap tables is the free SIL ViewGlyph — Font Viewing Program. The following screenshot displays the cmap table from arabtype.ttf shipped with Windows.

Open a font and choose Options --> View cmap.

The screenshot clearly shows Unicode character code points in the first column, with the second column displaying the Glyph ID mapped via the cmap table.

The following screenshot from FontLab Studio displays some glyphs in arabtype.ttf listed in order of Glyph ID (or "index" as FontLab Studio refers to it).

Whilst FontLab Studio is a very nice piece of software it is quite expensive. A free alternative to FontLab Studio is the excellent FontForge.

## Digging deeper

Another superb resource for exploring the low-level details of OpenType fonts is the Adobe Font Development Kit for OpenType which is a free download for Windows and Macintosh. One of the utilities it provides is an excellent command line tool called TTX which will generate an XML text file representation of an entire OpenType font file (or just those parts you are interested in).

One extremely useful TTX command line option is -s which will dump the "components" of an OpenType font to individual XML files. For example, the exquisite OpenType Arabic font shipped with Windows, Arabic Typesetting, by Mamoun Sakkal, Paul C. Nelson (sorry could not find a link!) and John Hudson can be exported to XML via

ttx -s arabtype.ttf

which will produce more than 20 XML files containing data from numerous tables within the font.

    Dumping "arabtype.ttf" to "arabtype.ttx"...
    Dumping 'GlyphOrder' table...
    Dumping 'head' table...
    Dumping 'hhea' table...
    Dumping 'maxp' table...
    Dumping 'OS/2' table...
    Dumping 'hmtx' table...
    Dumping 'LTSH' table...
    Dumping 'VDMX' table...
    Dumping 'hdmx' table...
    Dumping 'cmap' table...
    Dumping 'fpgm' table...
    Dumping 'prep' table...
    Dumping 'cvt ' table...
    Dumping 'loca' table...
    Dumping 'glyf' table...
    Dumping 'name' table...
    Dumping 'post' table...
    Dumping 'gasp' table...
    Dumping 'GDEF' table...
    Dumping 'GPOS' table...
    Dumping 'GSUB' table...
    Dumping 'DSIG' table...

The ones of interest here are the GlyphOrder table and the cmap table. The GlyphOrder table will show you the complete list of glyphs, including their names, ordered by Glyph ID, and the cmap table shows you the character to glyph mappings (using the glyph names).
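Because the cmap dump refers to glyphs by name while the GlyphOrder dump ties names to Glyph IDs, joining the two gives the code point to Glyph ID mapping. The sketch below does this with Python's standard `xml.etree.ElementTree`. The two XML snippets are tiny hand-written samples in the general shape of TTX output, not actual data from arabtype.ttf, and the glyph names in them are assumptions; a real dump is far larger and its cmap usually has several subtables, which this sketch simply merges.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of a TTX GlyphOrder dump (not real font data).
GLYPH_ORDER_XML = """
<ttFont>
  <GlyphOrder>
    <GlyphID id="0" name=".notdef"/>
    <GlyphID id="1" name="space"/>
    <GlyphID id="2" name="uni0628"/>
  </GlyphOrder>
</ttFont>
"""

# Hand-written sample in the shape of a TTX cmap dump (not real font data).
CMAP_XML = """
<ttFont>
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x20" name="space"/>
      <map code="0x628" name="uni0628"/>
    </cmap_format_4>
  </cmap>
</ttFont>
"""


def read_glyph_ids(xml_text: str) -> dict[str, int]:
    """Glyph name -> Glyph ID, read from GlyphOrder-style XML."""
    root = ET.fromstring(xml_text)
    return {g.get("name"): int(g.get("id")) for g in root.iter("GlyphID")}


def read_cmap(xml_text: str) -> dict[int, str]:
    """Code point -> glyph name, merged over every <map> element found."""
    root = ET.fromstring(xml_text)
    return {int(m.get("code"), 16): m.get("name") for m in root.iter("map")}


glyph_ids = read_glyph_ids(GLYPH_ORDER_XML)
code_to_name = read_cmap(CMAP_XML)

# Join the two tables on glyph name: code point -> Glyph ID.
code_to_gid = {cp: glyph_ids[name] for cp, name in code_to_name.items()}

for cp, gid in sorted(code_to_gid.items()):
    print(f"U+{cp:04X} -> glyph ID {gid} ({code_to_name[cp]})")
```

To try this on a real dump, you would read the corresponding XML files produced by TTX instead of the sample strings.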