22 Jul 2013

# Pre-processing Arabic text

Just a quick example of colouring Arabic glyphs via the XeTeX engine (XeLaTeX) using a pre-processor written in C (via FreeType and libotf). The glyph paths were obtained via FreeType and written out as TeX files that used XeTeX (i.e., xdvipdfmx) specials (\special{pdf: content...}) to draw the glyphs, taking care to close paths when FreeType's FT_Outline_Decompose(...) function emits a moveto. It is a relatively straightforward exercise to extend this to fully-shaped Arabic using a (home-grown) context-analysis/shaping engine. I'll post some examples and code.
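To give a flavour of the path-closing step, here is a minimal sketch in C. It is not the pre-processor's actual code: the `pdf_path` type and the `pdf_move_to`, `pdf_line_to` and `pdf_finish` functions are hypothetical names, and they take plain integer coordinates rather than FreeType's `FT_Vector` so that the example compiles without FreeType. In a real program the same logic would sit inside the `move_to` and `line_to` callbacks handed to `FT_Outline_Decompose(...)`, and the curve callbacks (omitted here) would need handling too.

```c
#include <stdio.h>
#include <string.h>

/* Accumulates PDF path operators for one glyph (hypothetical helper type). */
typedef struct {
    char   buf[512];   /* operator text, always NUL-terminated */
    size_t len;        /* number of characters currently in buf */
    int    path_open;  /* non-zero while a subpath is still open */
} pdf_path;

/* Append text to the buffer; the text is dropped if it would not fit. */
static void pdf_append(pdf_path *p, const char *s)
{
    size_t n = strlen(s);
    if (p->len + n < sizeof p->buf) {
        memcpy(p->buf + p->len, s, n + 1);
        p->len += n;
    }
}

/* A moveto starts a new contour, so close the previous one first ("h"). */
void pdf_move_to(pdf_path *p, long x, long y)
{
    char op[64];
    if (p->path_open)
        pdf_append(p, "h ");
    snprintf(op, sizeof op, "%ld %ld m ", x, y);
    pdf_append(p, op);
    p->path_open = 1;
}

/* A straight segment within the current contour. */
void pdf_line_to(pdf_path *p, long x, long y)
{
    char op[64];
    snprintf(op, sizeof op, "%ld %ld l ", x, y);
    pdf_append(p, op);
}

/* Close the last contour and fill the whole glyph ("f"). */
void pdf_finish(pdf_path *p)
{
    if (p->path_open)
        pdf_append(p, "h ");
    pdf_append(p, "f");
    p->path_open = 0;
}
```

Starting from a zero-initialised `pdf_path`, the text left in `buf` after `pdf_finish` is the kind of operator string that could be placed inside a `\special{pdf: content ...}`.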

Filed under: Arabic, Unicode

29 Apr 2013

# A very short post

I've been reading about SIL International's Graphite engine and it looks really interesting. I downloaded the code and ran the CMake-based build process through the CMake graphical interface. It didn't work. Eventually, I found some instructions to build it from the command line, so here's the way I did it.

1. Make sure cmake.exe is in your Windows PATH.
2. Download the Graphite source code and unpack it into a directory (e.g., called Graphite).
3. Change directory to the one containing the Graphite source code.
4. I use Microsoft Visual Studio 2008 so you'll need to adjust the -G parameter (below) to your build environment (cmake --help tells you the ones it supports).
5. Run the command (all on one line): cmake -G "Visual Studio 9 2008" -DCMAKE_BUILD_TYPE=Release -DGRAPHITE2_COMPARE_RENDERER:BOOL=OFF

If all goes well you should see something like the following, together with a generated Visual Studio Solution file graphite2.sln.

-- Build: Release
-- Segment Cache support: enabled
-- File Face support: enabled
-- Tracing support: enabled
CMake Warning at CMakeLists.txt:54 (message):
vm machine type direct can only be built using GCC

-- Using vm machine type: call
-- Configuring done
-- Generating done
-- Build files have been written to: E:/SILgraide/Graphite


Your Visual Studio Solution should look something like this:

15 Feb 2012

# Introduction

In a previous post I promised to write a short introduction to libotf; however, before discussing libotf I need to "set the scene" and write something about logical vs display (visual) order and "shaping engines". This post covers a lot of ground and I've tried to favour being brief rather than providing excessive detail: I hope I have not sacrificed accuracy in the process.

## Logical order, display order and "shaping engines"

### Logical order and display order

Imagine that you are tapping the screen of your iDevice, or sitting at your computer, writing an e-mail text in some language of your choice. Each time you enter a character it goes into the device's memory and, of course, ultimately ends up on the display or stored in a file. The order in which characters are entered and stored by your device, or written to a text file, is called the logical order. For simple text written in left-to-right languages (e.g., English) this is, of course, the same order in which they are displayed: the display order (also called the visual order). However, for right-to-left languages, such as Arabic, the order in which the characters are displayed or rendered for reading is reversed: the display (visual) order is not the same as the logical order.
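For a run of text that is purely right-to-left, the relationship between the two orders can be sketched in a few lines of C. This is only an illustration under that simplifying assumption: the function name `logical_to_visual_rtl` is made up for this example, and real text that mixes directions needs the full Unicode Bidirectional Algorithm rather than a simple reversal.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a purely right-to-left run from logical order (the order typed
   and stored) into visual order (left to right across the display).
   The first character typed ends up rightmost, so the run is reversed. */
void logical_to_visual_rtl(const uint32_t *logical, uint32_t *visual, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        visual[i] = logical[n - 1 - i];
}
```

Given the four characters SEEN, LAM, ALEF, MEEM (0633, 0644, 0627, 0645) in logical order, the visual array starts with MEEM and ends with SEEN.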

### Arabic Unicode ranges

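The Unicode 6.1 Standard allocates several ranges to Arabic (ignoring the latest Unicode 6.1 additions for Arabic maths). These are:

• Arabic (0600 to 06FF)
• Arabic Supplement (0750 to 077F)
• Arabic Extended-A (08A0 to 08FF)
• Arabic Presentation Forms-A (FB50 to FDFF)
• Arabic Presentation Forms-B (FE70 to FEFF)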

The important point here is that for text storage and transfer, Arabic should be encoded/saved using the "base" Arabic Unicode range of 0600 to 06FF. (Caveat: I believe this is true but, of course, am not 100% certain. I'd be interested to know if this is indeed not the case.) However, I'll assume this principle is broadly correct.

If you look at the charts you'll see that the range 0600 to 06FF contains the isolated versions of each character; i.e., none of the glyph variations used for fully joined Arabic. So, looking at this from, say, LuaTeX's viewpoint, how is it that the series of isolated forms of Arabic characters sitting in a TeX file in logical order can be transformed into typeset Arabic in proper visual display order? The answer is that the incoming stream of bytes representing Arabic text has to be "transformed" or "shaped" into the correct series of glyphs.

#### A little experiment

Just for the moment, assume that you are using the LuaTeX engine but with the plain TeX macro package: you have no other "packages" loaded but you have set up a font which has Arabic glyphs. All the font will do is allow you to display the characters that LuaTeX sees within its input stream. If you enter some Arabic text into a TeX file, what do you think will happen? Without any additional help or support to process the incoming Arabic text, LuaTeX will simply output what it gets: a series of isolated glyphs in logical (not display) order:

But it looked perfect in my text editor...:

### Shaping engines

What's happening is that the text editor itself is applying a shaping engine to the Arabic text in order to render it according to the rules of Arabic text processing and typography: you can think of this as a layer of software which sits between the underlying text and the screen display, transforming characters to glyphs. The key point is that the core LuaTeX engine does not provide a "shaping engine": i.e., it does not have a built-in set of functions to automatically apply the rules of Arabic typesetting and typography. The LuaTeX engine needs some help and those rules have to be provided or implemented through packages, plug-ins or Lua code, such as that provided by the ConTeXt package. Without this additional help LuaTeX will simply render the raw, logical-order text stream of isolated (non-joined) characters.

Incidentally, the text editor here is BabelPad which uses the Windows shaping engine called Uniscribe, which provides a set of "operating system services" for rendering complex scripts.

#### Contextual analysis

Because Arabic is a cursive script, with letters that change shape according to position in a word and adjacent characters (the "context"), part of the job of a "shaping engine" is to perform "contextual analysis": looking at each character and working out which glyph should be used, i.e., the form it should take: initial, medial, final or isolated. The Unicode standard (Chapter 8: Middle Eastern Scripts) explores this process in much more detail.

If you look at the Unicode code charts for Arabic Presentation Forms-B you'll see that this Unicode range contains the joining forms of Arabic characters; and one way to perform contextual analysis involves mapping input characters (i.e., in their isolated form) to the appropriate joining form version from within the Unicode range Arabic Presentation Forms-B. After the contextual analysis is complete the next step is to apply an additional layer of shaping to ensure that certain ligatures are produced (such as lam alef). In addition, you then apply more advanced typographic features defined within the particular (OpenType) font you are using: such as accurate vowel placement, cursive positioning and so forth. This latter stage is often referred to as OpenType shaping.
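The first two steps can be sketched in C for a handful of letters. This is a deliberately tiny illustration, not a real shaping engine: the table covers only five letters, the names `arabic_letter`, `shape_contextual` and `apply_lam_alef` are invented for the example, and it assumes the input contains no diacritics or other characters that a proper implementation must skip over when deciding how letters join. It relies on the layout of Arabic Presentation Forms-B, where each letter's forms are stored consecutively as isolated, final, initial, medial.

```c
#include <stddef.h>
#include <stdint.h>

/* Right-joining letters connect only to the preceding letter;
   dual-joining letters connect on both sides. */
typedef enum { JOIN_RIGHT, JOIN_DUAL } join_type;

typedef struct {
    uint32_t  base;      /* character in the 0600 to 06FF range */
    uint32_t  isolated;  /* its isolated form in Presentation Forms-B */
    join_type type;
} arabic_letter;

static const arabic_letter letters[] = {
    { 0x0627, 0xFE8D, JOIN_RIGHT }, /* ALEF */
    { 0x0628, 0xFE8F, JOIN_DUAL  }, /* BEH  */
    { 0x0633, 0xFEB1, JOIN_DUAL  }, /* SEEN */
    { 0x0644, 0xFEDD, JOIN_DUAL  }, /* LAM  */
    { 0x0645, 0xFEE1, JOIN_DUAL  }, /* MEEM */
};

static const arabic_letter *find_letter(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof letters / sizeof letters[0]; i++)
        if (letters[i].base == cp)
            return &letters[i];
    return NULL;
}

/* Contextual analysis: map each character (logical order) to its
   joining form. Offsets from the isolated form: +1 final, +2 initial,
   +3 medial. Characters not in the table are copied unchanged. */
void shape_contextual(const uint32_t *in, uint32_t *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        const arabic_letter *cur  = find_letter(in[i]);
        const arabic_letter *prev = (i > 0)     ? find_letter(in[i - 1]) : NULL;
        const arabic_letter *next = (i + 1 < n) ? find_letter(in[i + 1]) : NULL;
        int joins_prev, joins_next;

        if (cur == NULL) { out[i] = in[i]; continue; }
        joins_prev = prev != NULL && prev->type == JOIN_DUAL;
        joins_next = cur->type == JOIN_DUAL && next != NULL;

        if (joins_prev && joins_next) out[i] = cur->isolated + 3; /* medial  */
        else if (joins_prev)          out[i] = cur->isolated + 1; /* final   */
        else if (joins_next)          out[i] = cur->isolated + 2; /* initial */
        else                          out[i] = cur->isolated;
    }
}

/* Ligature step: replace LAM (initial or medial) followed by ALEF
   (final) with the lam alef ligature. Works in place and returns
   the new length. */
size_t apply_lam_alef(uint32_t *g, size_t n)
{
    size_t r = 0, w = 0;
    while (r < n) {
        if (r + 1 < n && g[r + 1] == 0xFE8E &&
            (g[r] == 0xFEDF || g[r] == 0xFEE0)) {
            g[w++] = (g[r] == 0xFEDF) ? 0xFEFB : 0xFEFC;
            r += 2;
        } else {
            g[w++] = g[r++];
        }
    }
    return w;
}
```

For the word SEEN, LAM, ALEF, MEEM (0633, 0644, 0627, 0645), `shape_contextual` selects initial SEEN, medial LAM, final ALEF and isolated MEEM, after which `apply_lam_alef` merges the LAM and ALEF into a single ligature. OpenType shaping with a real font would then take over from there.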

The key point is that OpenType font technology is designed to encapsulate advanced typographic rules within the font files themselves, using so-called OpenType tables: creating so-called "intelligent fonts". You can think of these tables as containing an extensive set of "rules" which can be applied to a series of input glyphs in order to achieve a specific typographic objective. This is actually a fairly large and complex topic, which I'll cover in a future post ("features" and "lookups").

Despite OpenType font technology supporting advanced typography, you should note that the creators of specific fonts choose which features they want to support: some fonts are packed with advanced features, others may contain just a few basic rules. In short, OpenType fonts vary enormously so you should never assume "all fonts are created equal": they are not.

### And finally: libotf

The service provided by libotf is to apply the rules contained in an OpenType font: it is an OpenType shaping library, especially useful with complex text/scripts. You pass it a Unicode string and call various functions to "drive" the application of the OpenType tables contained within the font. Of course, your font needs to support or provide the feature you are asking for but libotf has a way to "ask" the font "Do you support this feature?" I'll provide further information in a future post, together with some sample C code.

### And finally, some screenshots

We've covered a lot of ground so, hopefully, the following screenshots might help to clarify the ideas presented above. These screenshots are also from BabelPad which has the ability to switch off the shaping engine to let you see the raw characters in logical order before any contextual analysis or shaping is applied.

The first screenshot shows the raw text in logical order before any shaping has been applied. This is the text that would be entered at the keyboard and saved into a TeX file. It is the sequence of input characters that LuaTeX would see if it read some TeX input containing Arabic text.

The following screenshot is the result of applying the Windows system shaping engine called Uniscribe. Of course, different operating systems have their own system services to perform complex shaping but the tasks are essentially the same: process the input text to render the display version, through the application of rules contained in the OpenType font.

One more screenshot, this time from a great little tool that I found on this excellent site.

The top section of the screenshot shows the fully shaped (i.e., via Uniscribe) and rendered version of the Unicode text shown in the lower part of the screen. Note carefully that the text stream is in logical order (see the highlighted character) and that the text is stored using the Unicode range 0600 to 06FF.

# Summary

These posts are a lot of work and take quite some hours to write; so I hope that the above has provided a useful, albeit brief, overview of some important topics which are central to rendering/displaying or typesetting complex scripts.

11 Feb 2012

# Introduction

I've been thinking about the next article in this series and what it should address, and I've decided to skip ahead and give a summary of the documentation, tools and libraries which made it possible for me to experiment with typesetting Arabic. I'm listing these because it actually took a long time to assemble the reading materials and tools required, so it may just save somebody, somewhere, the many hours I spent hunting it all down. For sure, there's a ton of stuff I want to write about, in an attempt to piece together the various concepts and ideas involved in gaining a better understanding of Unicode, OpenType and Arabic text typesetting/display. However, I'm soon to start a new job, which means I'll have less time to devote to this blog so I'll try to post as much as I can over the next couple of weeks.

Just for completeness, I should say that, for sure, you can implement Arabic layout/typesetting for LuaTeX in pure Lua code, as the ConTeXt distribution has done, through the quite incredible work of Idris Hamid and Hans Hagen.

# Documentation

There is a lot to read. Here are some resources that are either essential or helpful.

## Unicode

Clearly, you'll need to read relevant parts of the Unicode Standard. Here's my suggested minimal reading list.

• Chapter 8: Middle Eastern Scripts . This gives an extremely useful description of cursive joining and a model for implementing contextual analysis.
• Unicode ranges for Arabic (see also these posts). You'll need the Unicode code charts for Arabic (PDFs downloadable and listed under Middle Eastern Scripts, here)
• Unicode Bidirectional Algorithm. Can't say that I've really read this properly, and certainly not yet implemented anything to handle mixed runs of text, but you certainly need it.

## OpenType

Whether you are interested in eBooks, conventional typesetting or the WOFF standard, these days a working knowledge of OpenType font technology is very useful. If you want to explore typesetting Arabic then it's simply essential.

# C libraries

It's always a good idea to leverage the work of true experts, especially if it is provided for free as an open source library! I spent a lot of time hunting for libraries, so here is my summary of what I found and what I eventually settled on using.

• IBM's ICU: Initially, I looked at using IBM's International Components for Unicode but, for my requirements, it was serious overkill. It is a truly vast and powerful open source library (for C/C++ and Java) if you need the wealth of features it provides.
• HarfBuzz: This is an interesting and ongoing development. The HarfBuzz OpenType text shaping engine looks like it will become extremely useful; although I had a mixed experience trying to build it on Windows, which is almost certainly due to my limitations, not those of the library. If you're Linux-based then no doubt it'll be fine for you. As it matures to a stable release I'll definitely take another look.
• GNU FriBidi: As mentioned above, essential for a full implementation of displaying (eBooks, browsers etc) or typesetting mixed left-to-right and right-to-left scripts is the Unicode Bidirectional Algorithm. Fortunately, there's a free and standard implementation of this available as a C library: GNU FriBidi. I've not yet reached the point of being able to use it but it's the one I'll choose.

### My libraries of choice

Eventually, I settled on FreeType and libotf. You need to use them together because libotf depends on FreeType. Both libraries are mature and easy to use and I simply cannot praise these libraries too highly. Clearly, this is my own personal bias and preference but ease of use rates extremely highly on my list of requirements. FreeType has superb documentation whereas libotf does not, although it has some detailed comments within the main #include file. I'll definitely post a short "getting started with libotf" because it is not difficult to use (when you've worked it out!).

#### libotf: words are not enough!

I'm mindful that I've not yet explained how all these libraries work together, or what they do, but I just have to say that libotf is utterly superb. libotf provides a set of functions which "drive" the features and lookups contained in an OpenType font, allowing you to pass in a Unicode string and apply OpenType tables to generate the corresponding sequence of glyphs which you can subsequently render. Of course, for Arabic you also need to perform contextual analysis to select the appropriate joining forms but once that is done then libotf lets you take full advantage of any advanced typesetting features present in the font.

#### UTF-8 encoding/decoding

To pass Unicode strings between your C code and LuaTeX you'll be using UTF-8 so you will need to encode and decode UTF-8 from within your C. Encoding is easy and has been covered elsewhere on this site. For decoding UTF-8 into code points I use The Flexible and Economical UTF-8 Decoder.
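As a reminder of how small the encoding side is, here is a minimal sketch of a UTF-8 encoder in C. The function name `utf8_encode` and its return convention (number of bytes written, or 0 for a value that is not a valid code point) are choices made for this example rather than anything taken from the earlier post.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns how many were written.
   Returns 0 for surrogates (D800 to DFFF) and values above 10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Characters in the base Arabic range 0600 to 06FF always take two bytes; for example ARABIC LETTER LAM (0644) becomes the bytes D9 84.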

# Desktop software

In piecing together my current understanding of Unicode and OpenType I found the following software to be indispensable. Some of these are Windows-only applications.

TIP: Microsoft VOLT and the Arabic Typesetting or Scheherazade fonts. I'll talk about VOLT in more detail later but Microsoft and SIL provide "VOLT versions" of their respective Arabic fonts. These are absolutely invaluable resources for understanding advanced OpenType concepts and if you are interested to learn more I strongly recommend taking a look at them.

• The VOLT version of the Arabic Typesetting font is shipped with the VOLT installer and is contained within a file called "VoltSupplementalFiles.exe", so just run that to extract the VOLT version.
• The VOLT version of Scheherazade is made available as a download from SIL.

I can only offer my humble thanks to the people who created these resources and made them available for free: a truly substantial amount of work is involved in creating them.

1 Nov 2011

## A nice UTF-8 decoder

#### Posted by Graham Douglas

If you want to explore passing UTF-8 string data between LuaTeX and your C code/library you may want to convert the UTF-8 data back into Unicode code points (reversing the UTF-8 encoding process discussed in this post). To do that you'll need a UTF-8 decoder: here is a nice implementation of a UTF-8 decoder in C. Examples, source code and explanations are available from The Flexible and Economical UTF-8 Decoder. Just to note that, irrespective of the decoder you use, make sure you read up on and are aware of UTF-8 security exploits.
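To show what a decoder has to do, here is a simple, branch-based sketch in C. To be clear, this is not The Flexible and Economical UTF-8 Decoder, which is table-driven and considerably more compact; the function name `utf8_decode` and its interface are invented for this illustration. It does show the kind of checks behind the security warning: overlong encodings, surrogates and out-of-range values are all rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the start of s (len bytes available).
   On success stores the code point in *cp and returns the number of
   bytes consumed (1 to 4). Returns 0 if the sequence is invalid. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { need = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; }
    else
        return 0;                    /* stray continuation or bad lead byte */

    if (len < need)
        return 0;                    /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_value[need])
        return 0;                    /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                    /* surrogate */
    if (c > 0x10FFFF)
        return 0;                    /* beyond the Unicode range */

    *cp = c;
    return need;
}
```

Calling it repeatedly, advancing by the returned byte count each time, walks a whole UTF-8 string; a return value of 0 means the input should be treated as malformed.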

Filed under: Unicode

31 Oct 2011

## Unicode, Glyph IDs and OpenType: a brief introduction

#### Posted by Graham Douglas

As you read about OpenType fonts and Unicode you come across terms such as "Glyph IDs", Unicode characters/code points and suchlike. And this can be a bit puzzling: what's the relationship between them? In this post I'll try to give a brief introduction with the usual notice that I'm skipping vast amounts of detail in the interests of simplicity.

Just as a reminder, one extremely important concept to understand/appreciate is the difference between characters and glyphs. I've discussed this in a previous post but will summarise here (quoting from the Unicode standard):

• Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation.
• Glyphs represent the shapes that characters can have when they are rendered or displayed. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters.

I'll try to expand on this a little. Among the many things that the Unicode standard provides is a universal encoding of the world's character set: in essence, allocating a unique number to the characters covered by the standard. Unicode does not concern itself with the visual representation of those characters; that is the job for fonts: they provide glyphs.

Today, OpenType font technology is the dominant font standard and is supported by modern TeX engines such as LuaTeX and XeTeX. However, as you start to explore OpenType in more detail you start to see references to terms such as "Glyph ID" or "Glyph index" and may wonder how, or if, these relate to the Unicode character encoding (code points). The two key points to understand are:

• OpenType is concerned with glyphs.
• Unicode is concerned with characters.

For present purposes we can take the very simplistic view that an OpenType font is a container for a large collection of glyphs in the form of the lines and curves required to draw (render) them. Of course, OpenType fonts can provide a lot more than just the glyphs themselves. OpenType fonts can provide extensive support for high quality typesetting via "features" and "lookups" which provide information that a typesetting or rendering engine can use to do its job (think of them as a set of "rules" for the typesetting/rendering engine to apply).

However, for now just think of an OpenType font as containing a set of glyphs where each glyph has a name and a numeric identifier called its Glyph ID. The Glyph ID is simply a number allocated to each glyph (e.g., by the font's creator) ranging from 0 to N-1 where N is the number of glyphs contained in the particular font. The point is that the Glyph ID has nothing to do with the Unicode encoding or code points: it's just a bookkeeping number used internally within the font.

So, we have two sets of numbers: a universal standard for the world of Unicode characters (code points) and another arbitrary set of numbers (specific to each font) for the internal world of OpenType glyphs: the Glyph ID. So the question arises: how and where are these two universes joined together? The answer is that the magic glue is contained within the OpenType font itself: the so-called cmap table or, to give its full name, the Character To Glyph Index Mapping Table.

As the specification says:

"This table defines the mapping of character codes to the glyph index values used in the font."

Even a brief perusal of the OpenType specification will make it clear that it's a complex beast and certainly not a topic for detailed discussion here. However, the cmap table is the "secret sauce" within an OpenType font which glues together the Unicode world of characters with the OpenType world of glyphs.
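Conceptually, then, a cmap lookup is just "code point in, Glyph ID out". The C sketch below illustrates the idea with a toy table. The Glyph IDs in it are made up for the example and do not come from any real font, the names `cmap_entry`, `toy_cmap` and `glyph_for_code_point` are hypothetical, and a real cmap subtable uses more compact binary formats than a flat sorted list. The one detail borrowed from real fonts is that Glyph ID 0 is the "missing glyph" (`.notdef`), returned when a character has no mapping.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t code_point;  /* Unicode character */
    uint16_t glyph_id;    /* font-internal glyph number */
} cmap_entry;

/* Toy mapping, sorted by code point. The Glyph IDs are invented. */
static const cmap_entry toy_cmap[] = {
    { 0x0627,  3 },  /* ALEF */
    { 0x0628,  7 },  /* BEH  */
    { 0x0644, 52 },  /* LAM  */
    { 0x0645, 56 },  /* MEEM */
};

/* bsearch comparator: key is a code point, elem is a table entry. */
static int compare_entries(const void *key, const void *elem)
{
    uint32_t cp = *(const uint32_t *)key;
    const cmap_entry *e = (const cmap_entry *)elem;
    return (cp > e->code_point) - (cp < e->code_point);
}

/* Return the Glyph ID for a character, or 0 (.notdef) if unmapped. */
uint16_t glyph_for_code_point(uint32_t cp)
{
    const cmap_entry *e = bsearch(&cp, toy_cmap,
                                  sizeof toy_cmap / sizeof toy_cmap[0],
                                  sizeof toy_cmap[0], compare_entries);
    return e ? e->glyph_id : 0;
}
```

With this table, `glyph_for_code_point(0x0644)` yields the toy Glyph ID 52, while a character the "font" does not cover, such as 0041, yields 0.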

Note: OpenType fonts can contain multiple cmap tables for different encodings and may also contain a significant number of glyphs which are not covered by the cmap table. OpenType fonts may contain many different glyphs (representations) for a particular character and these visual variations fall outside the remit of the Unicode standard. For example, small caps, oldstyle numbers, swash characters etc differ only in visual design; they do not bring additional semantic meaning.

One excellent Windows utility for inspecting cmap tables is the free SIL ViewGlyph — Font Viewing Program. The following screenshot displays the cmap table from arabtype.ttf shipped with Windows.

Open a font and choose Options --> View cmap.

The screenshot clearly shows Unicode character code points in the first column, with the second column displaying the Glyph ID mapped via the cmap table.

The following screenshot from FontLab Studio displays some glyphs in arabtype.ttf listed in order of Glyph ID (or "index" as FontLab Studio refers to it).

Whilst FontLab Studio is a very nice piece of software it is quite expensive. A free alternative to FontLab Studio is the excellent FontForge.

## Digging deeper

Another superb resource for exploring the low-level details of OpenType fonts is the Adobe Font Development Kit for OpenType which is a free download for Windows and Macintosh. One of the utilities it provides is an excellent command line tool called TTX which will generate an XML text file representation of an entire OpenType font file (or just those parts you are interested in).

One extremely useful TTX command line option is -s which will dump the "components" of an OpenType font to individual XML files. For example, the exquisite OpenType Arabic font shipped with Windows, Arabic Typesetting, by Mamoun Sakkal, Paul C. Nelson (sorry could not find a link!) and John Hudson can be exported to XML via

ttx -s arabtype.ttf

which will produce more than 20 XML files containing data from numerous tables within the font.

Dumping "arabtype.ttf" to "arabtype.ttx"...
Dumping 'GlyphOrder' table...
Dumping 'head' table...
Dumping 'hhea' table...
Dumping 'maxp' table...
Dumping 'OS/2' table...
Dumping 'hmtx' table...
Dumping 'LTSH' table...
Dumping 'VDMX' table...
Dumping 'hdmx' table...
Dumping 'cmap' table...
Dumping 'fpgm' table...
Dumping 'prep' table...
Dumping 'cvt ' table...
Dumping 'loca' table...
Dumping 'glyf' table...
Dumping 'name' table...
Dumping 'post' table...
Dumping 'gasp' table...
Dumping 'GDEF' table...
Dumping 'GPOS' table...
Dumping 'GSUB' table...
Dumping 'DSIG' table...

The ones of interest here are the GlyphOrder table and the cmap table. The GlyphOrder table will show you the complete list of glyphs, including their names, ordered by Glyph ID, and the cmap table shows you the character to glyph mappings (using the glyph names).