Introduction to logical vs display order, and shaping engines

Introduction

In a previous post I promised to write a short introduction to libotf; however, before discussing libotf I need to “set the scene” and write something about logical vs display (visual) order and “shaping engines”. This post covers a lot of ground and I’ve tried to favour being brief rather than providing excessive detail: I hope I have not sacrificed accuracy in the process.

Logical order, display order and “shaping engines”

Logical order and display order

Imagine that you are tapping the screen of your iDevice, or sitting at your computer, writing an e-mail in some language of your choice. Each time you enter a character it goes into the device’s memory and, of course, ultimately ends up on the display or is stored in a file. The order in which characters are entered and stored by your device, or written to a text file, is called the logical order. For simple text written in left-to-right languages (e.g., English) this is, of course, the same order in which they are displayed: the display order (also called the visual order). However, for right-to-left languages, such as Arabic, the order in which the characters are displayed or rendered for reading is reversed: the display (visual) order is not the same as the logical order.
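As a tiny, concrete illustration (my own toy example, not a real renderer), consider the Arabic word شمس (shams, “sun”). Its three code points are stored in logical order; for a single run of right-to-left text the left-to-right display order is simply the reverse. Mixed-direction text needs the full Unicode Bidirectional Algorithm, so treat the reversal below as a toy model only.

#include <stdio.h>

int main(void)
{
	/* sheen, meem, seen: the order in which the letters are typed and stored */
	unsigned int logical[] = { 0x0634, 0x0645, 0x0633 };
	int n = sizeof logical / sizeof logical[0];
	int i;

	printf("Logical order: ");
	for (i = 0; i < n; i++)
		printf("U+%04X ", logical[i]);

	/* a single RTL run: display order is the logical order reversed */
	printf("\nDisplay order: ");
	for (i = n - 1; i >= 0; i--)
		printf("U+%04X ", logical[i]);
	printf("\n");
	return 0;
}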

Arabic Unicode ranges

The Unicode 6.1 Standard allocates several ranges to Arabic (ignoring the latest additions for Arabic mathematics). These are:

  1. Arabic (0600 to 06FF)
  2. Arabic Supplement (0750 to 077F)
  3. Arabic Extended-A (08A0 to 08FF; new for Unicode 6.1)
  4. Arabic Presentation Forms-A (FB50 to FDFF)
  5. Arabic Presentation Forms-B (FE70 to FEFF)

The important point here is that for text storage and transfer, Arabic should be encoded/saved using the “base” Arabic Unicode range of 0600 to 06FF. (Caveat: I believe this is true but, of course, am not 100% certain; I’d be interested to hear if this is not the case.) However, I’ll assume this principle is broadly correct.

If you look at the charts you’ll see that the range 0600 to 06FF contains the isolated versions of each character; i.e., none of the glyph variations used for fully joined Arabic. So, looking at this from, say, LuaTeX’s viewpoint, how is it that the series of isolated forms of Arabic characters sitting in a TeX file in logical order can be transformed into typeset Arabic in proper visual display order? The answer is that the incoming stream of bytes representing Arabic text has to be “transformed” or “shaped” into the correct series of glyphs.

A little experiment

Just for the moment, assume that you are using the LuaTeX engine but with the plain TeX macro package: you have no other “packages” loaded, but you have set up a font which has Arabic glyphs. All the font will do is allow you to display the characters that LuaTeX sees within its input stream. If you enter some Arabic text into a TeX file, what do you think will happen? Without any additional help or support to process the incoming Arabic text, LuaTeX will simply output what it gets: a series of isolated glyphs in logical (not display) order:

But it looked perfect in my text editor…

Shaping engines

What’s happening is that the text editor itself is applying a shaping engine to the Arabic text in order to render it according to the rules of Arabic text processing and typography. You can think of a shaping engine as a layer of software which sits between the underlying text and the screen display, transforming characters to glyphs. The key point is that the core LuaTeX engine does not provide a shaping engine: it has no built-in set of functions to automatically apply the rules of Arabic typesetting and typography. The LuaTeX engine needs some help, and those rules have to be provided or implemented through packages, plug-ins or Lua code, such as the ConTeXt package provides. Without this additional help LuaTeX will simply render the raw text stream, in logical order, as isolated (non-joined) characters.

Incidentally, the text editor here is BabelPad which uses the Windows shaping engine called Uniscribe, which provides a set of “operating system services” for rendering complex scripts.

Contextual analysis

Because Arabic is a cursive script, with letters that change shape according to position in a word and adjacent characters (the “context”), part of the job of a “shaping engine” is to perform “contextual analysis”: looking at each character and working out which glyph should be used, i.e., the form it should take: initial, medial, final or isolated. The Unicode standard (Chapter 8: Middle Eastern Scripts) explores this process in much more detail.

If you look at the Unicode code charts for Arabic Presentation Forms-B you’ll see that this Unicode range contains the joining forms of Arabic characters; one way to perform contextual analysis involves mapping input characters (i.e., in their isolated form) to the appropriate joining form within the Arabic Presentation Forms-B range. After the contextual analysis is complete, the next step is to apply an additional layer of shaping to ensure that certain ligatures are produced (such as lam alef). You can then apply more advanced typographic features defined within the particular (OpenType) font you are using: accurate vowel placement, cursive positioning and so forth. This latter stage is often referred to as OpenType shaping.
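To make that concrete, here is a deliberately simplified sketch in C. This is my own toy illustration, not how a production shaping engine is written: it covers just two dual-joining letters, and a real implementation must also handle right-joining letters (alef, dal, …), non-joining characters and ligatures.

#include <stdio.h>

/* Map a "base" Arabic code point (0600-06FF) to one of its four
   Presentation Forms-B variants (FE70-FEFF), given whether it joins
   to the previous and/or the next letter. */
typedef struct {
	unsigned int base, isolated, final_form, initial, medial;
} JoiningForms;

static const JoiningForms forms[] = {
	{ 0x0628, 0xFE8F, 0xFE90, 0xFE91, 0xFE92 }, /* beh   */
	{ 0x0634, 0xFEB5, 0xFEB6, 0xFEB7, 0xFEB8 }, /* sheen */
};

unsigned int contextual_form(unsigned int cp, int joins_prev, int joins_next)
{
	size_t i;
	for (i = 0; i < sizeof forms / sizeof forms[0]; i++) {
		if (forms[i].base == cp) {
			if (joins_prev && joins_next) return forms[i].medial;
			if (joins_prev)               return forms[i].final_form;
			if (joins_next)               return forms[i].initial;
			return forms[i].isolated;
		}
	}
	return cp; /* unknown character: leave unchanged */
}

int main(void)
{
	/* sheen at the start of a word, joining the letter that follows it */
	printf("U+%04X -> U+%04X\n", 0x0634, contextual_form(0x0634, 0, 1)); /* U+FEB7, initial form */
	return 0;
}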

The key point is that OpenType font technology is designed to encapsulate advanced typographic rules within the font files themselves, using OpenType tables: creating so-called “intelligent fonts”. You can think of these tables as containing an extensive set of “rules” which can be applied to a series of input glyphs in order to achieve a specific typographic objective. This is actually a fairly large and complex topic, which I’ll cover in a future post (on “features” and “lookups”).

Although OpenType font technology supports advanced typography, the creators of individual fonts choose which features they want to implement: some fonts are packed with advanced features, others may contain just a few basic rules. In short, OpenType fonts vary enormously, so you should never assume that “all fonts are created equal”: they are not.

And finally: libotf

The service provided by libotf is to apply the rules contained in an OpenType font: it is an OpenType shaping library, especially useful with complex scripts. You pass it a Unicode string and call various functions to “drive” the application of the OpenType tables contained within the font. Of course, your font needs to provide the feature you are asking for, but libotf has a way to “ask” the font: “Do you support this feature?” I’ll provide further information in a future post, together with some sample C code.

And finally, some screenshots

We’ve covered a lot of ground so, hopefully, the following screenshots might help to clarify the ideas presented above. These screenshots are also from BabelPad which has the ability to switch off the shaping engine to let you see the raw characters in logical order before any contextual analysis or shaping is applied.

The first screenshot shows the raw text in logical order before any shaping has been applied. This is the text that would be entered at the keyboard and saved into a TeX file. It is the sequence of input characters that LuaTeX would see if it read some TeX input containing Arabic text.

The following screenshot is the result of applying the Windows system shaping engine called Uniscribe. Of course, different operating systems have their own system services to perform complex shaping but the tasks are essentially the same: process the input text to render the display version, through the application of rules contained in the OpenType font.

One more screenshot, this time from a great little tool that I found on this excellent site.

The top section of the screenshot shows the fully shaped (i.e., via Uniscribe) and rendered version of the Unicode text shown in the lower part of the screen. Note carefully that the text stream is in logical order (see the highlighted character) and that the text is stored using the Unicode range 0600 to 06FF.

Summary

These posts are a lot of work and take many hours to write, so I hope that the above has provided a useful, albeit brief, overview of some important topics which are central to rendering, displaying or typesetting complex scripts.

Typesetting Arabic with LuaTeX: Part 2 (documentation, tools and libraries)

Introduction

I’ve been thinking about the next article in this series and what it should address, so I’ve decided to skip ahead and give a summary of the documentation, tools and libraries which made it possible for me to experiment with typesetting Arabic. I’m listing these because it took a long time to assemble the reading materials and tools required, so this may just save somebody, somewhere, the many hours I spent hunting it all down. For sure, there’s a ton of stuff I want to write about, in an attempt to piece together the various concepts and ideas involved in gaining a better understanding of Unicode, OpenType and Arabic text typesetting/display. However, I’m soon to start a new job, which means I’ll have less time to devote to this blog, so I’ll try to post as much as I can over the next couple of weeks.

Just for completeness, I should say that, for sure, you can implement Arabic layout/typesetting for LuaTeX in pure Lua code, as the ConTeXt distribution has done, through the quite incredible work of Idris Hamid and Hans Hagen.

Documentation

There is a lot to read. Here are some resources that are either essential or helpful.

Unicode

Clearly, you’ll need to read relevant parts of the Unicode Standard. Here’s my suggested minimal reading list.

  • Chapter 8: Middle Eastern Scripts. This gives an extremely useful description of cursive joining and a model for implementing contextual analysis.
  • Unicode ranges for Arabic (see also these posts). You’ll need the Unicode code charts for Arabic (PDFs downloadable and listed under Middle Eastern Scripts, here)
  • Unicode Bidirectional Algorithm. I can’t say that I’ve read this properly yet, and I certainly haven’t implemented anything to handle mixed runs of text, but you will certainly need it.

OpenType

Whether you are interested in eBooks, conventional typesetting or the WOFF standard, these days a working knowledge of OpenType font technology is very useful. If you want to explore typesetting Arabic then it’s simply essential.

C libraries

It’s always a good idea to leverage the work of true experts, especially if it is provided for free as an open source library! I spent a lot of time hunting for libraries, so here is my summary of what I found and what I eventually settled on using.

  • IBM’s ICU: Initially, I looked at using IBM’s International Components for Unicode but, for my requirements, it was serious overkill. It is a truly vast and powerful open source library (for C/C++ and Java) if you need the wealth of features it provides.
  • HarfBuzz: This is an interesting and ongoing development. The HarfBuzz OpenType text shaping engine looks like it will become extremely useful; although I had a mixed experience trying to build it on Windows, which is almost certainly due to my limitations, not those of the library. If you’re Linux-based then no doubt it’ll be fine for you. As it matures to a stable release I’ll definitely take another look.
  • GNU FriBidi: As mentioned above, the Unicode Bidirectional Algorithm is essential for a full implementation of displaying (eBooks, browsers etc) or typesetting mixed left-to-right and right-to-left scripts. Fortunately, there’s a free and standard implementation available as a C library: GNU FriBidi. I’ve not yet reached the point of being able to use it, but it’s the one I’ll choose; a minimal sketch of how a call might look follows this list.
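By way of illustration only, here is a minimal sketch of a call to FriBidi’s fribidi_log2vis(), which reorders one paragraph of logical-order text into visual order. I’m writing this from memory, so treat every name as an assumption: the header path and the exact type names (e.g., FriBidiParType versus the older FriBidiCharType) vary between FriBidi versions, so check fribidi.h before relying on it.

#include <stdio.h>
#include <fribidi.h> /* may be <fribidi/fribidi.h> depending on your install */

int main(void)
{
	/* "shams" (sun) in logical order: sheen, meem, seen */
	FriBidiChar logical[3] = { 0x0634, 0x0645, 0x0633 };
	FriBidiChar visual[3];
	FriBidiStrIndex len = 3;
	FriBidiParType base = FRIBIDI_PAR_ON; /* let FriBidi detect the direction */
	int i;

	/* NULL for the optional position/level output arrays we do not need */
	if (fribidi_log2vis(logical, len, &base, visual, NULL, NULL, NULL)) {
		for (i = 0; i < len; i++)
			printf("U+%04X ", visual[i]);
		printf("\n");
	}
	return 0;
}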

My libraries of choice

Eventually, I settled on FreeType and libotf. You need to use them together because libotf depends on FreeType. Both libraries are mature and easy to use and I simply cannot praise these libraries too highly. Clearly, this is my own personal bias and preference but ease of use rates extremely highly on my list of requirements. FreeType has superb documentation whereas libotf does not, although it has some detailed comments within the main #include file. I’ll definitely post a short “getting started with libotf” because it is not difficult to use (when you’ve worked it out!).

libotf: words are not enough!

I’m mindful that I’ve not yet explained how all these libraries work together, or what they do, but I just have to say that libotf is utterly superb. libotf provides a set of functions which “drive” the features and lookups contained in an OpenType font, allowing you to pass in a Unicode string and apply OpenType tables to generate the corresponding sequence of glyphs, which you can subsequently render. Of course, for Arabic you also need to perform contextual analysis to select the appropriate joining forms, but once that is done libotf lets you take full advantage of any advanced typesetting features present in the font.
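Although the details will have to wait for the promised “getting started” post, the outline of a libotf call sequence, as I understand it, looks something like the sketch below. This is from memory, so treat the exact signatures as assumptions and check the declarations in otf.h (the main #include file mentioned above) before use.

#include <otf.h>

/* Sketch only: open the font, map characters to nominal glyphs via the
   cmap table, then drive GSUB and GPOS features for the Arabic script.
   "arab" is the OpenType script tag; init/medi/fina/liga and
   curs/mark/mkmk/kern are standard OpenType feature tags. */
void shape_sketch(const char *fontfile, OTF_GlyphString *gstring)
{
	OTF *otf = OTF_open(fontfile);
	if (!otf)
		return;

	OTF_drive_cmap(otf, gstring);                                      /* chars -> glyph IDs */
	OTF_drive_gsub(otf, gstring, "arab", NULL, "init,medi,fina,liga"); /* substitutions      */
	OTF_drive_gpos(otf, gstring, "arab", NULL, "curs,mark,mkmk,kern"); /* positioning        */

	OTF_close(otf);
}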

UTF-8 encoding/decoding

To pass Unicode strings between your C code and LuaTeX you’ll be using UTF-8, so you will need to encode and decode UTF-8 from within your C code. Encoding is easy and has been covered elsewhere on this site. For decoding UTF-8 into code points I use the Flexible and Economical UTF-8 Decoder.
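That decoder is what I’d recommend for real use, but purely to make the byte-level mechanics visible, here is a naive decoding sketch of my own. It assumes well-formed input: there is no validation of overlong forms, surrogates or truncated sequences.

#include <stdio.h>

/* Naive UTF-8 decoder: returns the code point starting at *s and
   advances *s past it. Assumes well-formed input. */
unsigned int decode_utf8(const unsigned char **s)
{
	const unsigned char *p = *s;
	unsigned int cp;

	if (p[0] < 0x80) {                  /* 0xxxxxxx */
		cp = p[0];
		*s += 1;
	} else if ((p[0] & 0xE0) == 0xC0) { /* 110yyyyy 10xxxxxx */
		cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
		*s += 2;
	} else if ((p[0] & 0xF0) == 0xE0) { /* 1110zzzz 10yyyyyy 10xxxxxx */
		cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
		*s += 3;
	} else {                            /* 11110www 10zzzzzz 10yyyyyy 10xxxxxx */
		cp = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
		   | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
		*s += 4;
	}
	return cp;
}

int main(void)
{
	const unsigned char sheen[] = { 0xD8, 0xB4, 0 }; /* U+0634 in UTF-8 */
	const unsigned char *p = sheen;
	printf("U+%04X\n", decode_utf8(&p)); /* prints U+0634 */
	return 0;
}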

Desktop software

In piecing together my current understanding of Unicode and OpenType I found the following software to be indispensable. Some of these are Windows-only applications.

  • VOLT: Microsoft’s excellent and free VOLT (Visual OpenType Layout Tool). I’ll certainly try to write an introduction to VOLT, but you can also download the free Volt Training Video.
  • Font editors: Fontlab Studio 5 (commercial) or FontForge (free).
  • Adobe FDK: The Adobe Font Development Kit contains some excellent utilities and I highly recommend it.
  • Character browser: To assist with learning/exploring Unicode I used the Unibook character browser.
  • BabelPad: Absolutely superb Windows-based Unicode text editor. Packed with features that can assist with understanding Unicode and the rendering of complex scripts: for example, the ability to toggle complex rendering so that you can edit Arabic text without any Uniscribe shaping being applied.
  • BabelMap: Unicode Character Map for Windows is another great tool from the author of BabelPad.
  • High quality Arabic fonts. By “high quality” I don’t just mean the design and hinting but also the number of OpenType features implemented in the font itself, such as cursive positioning, ligatures and vowel placement (mark to base, mark to ligature, mark to mark etc). My personal favourite is Arabic Typesetting (shipped with Windows), but SIL International also provide a free Arabic font called Scheherazade.

TIP: Microsoft VOLT and the Arabic Typesetting or Scheherazade fonts. I’ll talk about VOLT in more detail later, but Microsoft and SIL provide “VOLT versions” of their respective Arabic fonts. These are absolutely invaluable resources for understanding advanced OpenType concepts, and if you are interested in learning more I strongly recommend taking a look at them.

  • The VOLT version of the Arabic Typesetting font is shipped with the VOLT installer and is contained within a file called “VoltSupplementalFiles.exe”, so just run that to extract the VOLT version.
  • The VOLT version of Scheherazade is made available as a download from SIL.

I can only offer my humble thanks to the people who created these resources and made them available for free: a truly substantial amount of work is involved in creating them.

From typeset Arabic directly to SVG with LuaTeX

Just a brief post

With the explosive growth of interest in “eBooks” and the use of SVG in EPUB3, I thought it would be worth experimenting to see how “easy” it is to produce SVG directly from typeset Arabic using LuaTeX. It turns out to be quite possible, and an inline SVG example is shown below (OK, it should be displayed on the right-hand side, I know ;-)). This SVG was created using a point size of 100 for all calculations of the SVG “width” and “height” values. No hand editing was done at all: it is exactly as output. I still need to finish kerning, vowel placement and cursive positioning in the SVG export functions, but I think that should be OK.

It is very likely that full mathematical formulae could also be exported directly to SVG using LuaTeX’s node structures, but they are deeply nested and complex, so it could be tricky. Quite possibly, LuaTeX offers excellent potential for fully automated eBook production and, of course, print PDF production, from a single, suitably marked-up TeX source file.

Typesetting Arabic with LuaTeX [via a C plug-in] (Part 1)

Introduction

In this new series of posts I’m going to attempt an overview of the topics, concepts, ideas and technologies involved in typesetting Arabic with LuaTeX, via a DLL I’m writing in C. Actually, the C code is very substantially platform-independent so it should compile on non-Windows machines… one day, when it’s “finished”…

Up until two years ago I was teaching myself Arabic (see my Amazon book reviews) and had reached the point where I wanted to write up my notes and worked exercises: I needed to typeset Arabic and wanted to use a TeX-based solution. Having looked around, I stumbled upon some truly amazing video presentations of Arabic typesetting work being undertaken by Idris Hamid and Hans Hagen, using a tool called LuaTeX: something I’d never heard of. I was truly stunned by what I saw; the quality of their Arabic typesetting was (is) incredible, so I had to find out more. A few hours later I’d worked out that the typesetting was being achieved through Hans Hagen’s ConTeXt package, with LuaTeX as the underlying TeX engine. However, I’m personally not a user of ConTeXt, but the LuaTeX engine was just so interesting that I had to explore it. Well, two years later I’ve done no further learning of Arabic, having replaced that activity with plenty of explorations into LuaTeX and a host of other technologies, particularly OpenType and Unicode.

Coming up to the present day, I’ve finally reached the point where I have puzzled out enough detail of the “big picture” to attempt a home-grown Arabic typesetting solution for LuaTeX, but one where most of the “heavy lifting” is done in C, with Lua code to interface with and talk to LuaTeX. For sure, there are ready-made options such as XeTeX or the range of Arabic typesetting solutions created by the TeX community. However, my interest is in creating a solution that will just as easily output SVG or other non-PDF formats, plus allow the automated production of new and novel “typeset structures” and diagrams that will really help with learning Arabic: things I wish had been present in the many books I have bought and studied, but which may just be too time-consuming, or too difficult and expensive, to produce with “conventional” applications. These are big goals, but definitely achievable, albeit over a year or two of further work.

Sample

Just by way of an early example, see the following PDF, as usual, through the Google Docs viewer or download the PDF here. The trained eye will certainly spot a few issues that need fixing, but so far it’s not looking too bad :-). There is a long, long way to go yet, though. The font used is Microsoft’s “Arabic Typesetting” because it contains a substantial number of OpenType features, including cursive positioning, mark-to-base positioning and an enormous range of ligatures, which make it (in my opinion) an ideal font to work with. In the example (the made-up words) you can see the non-horizontal baseline achieved with cursive positioning, plus the ability to control vowel placement with great flexibility.

But it’s still far from perfect, I’ll readily admit. I hope I can finish this work, and find the time to complete these articles. I’ll certainly try!

Unicode for the impatient (Part 3: UTF-8 bits, bytes and C code)

I promised to finish the series on Unicode and UTF-8, so here is the final instalment: better late than never. Before reading this article I suggest that you read Part 1 and Part 2, which cover some important background. As usual, I’m trying to avoid simply repeating the huge wealth of information already published on this topic, but (hopefully) this article will provide a few additional details which may assist with understanding. Additionally, I’m leaving out a lot of detail and not taking a “rigorous” approach in my explanations, so I’d be grateful to hear whether or not readers find it useful.

Reminder on code points: The Unicode encoding scheme assigns each character a unique integer in the range 0 to 1,114,111; each integer is called a code point.

The “TF” in UTF-8 stands for Transformation Format so, in essence, you can think of UTF-8 as a “recipe” for converting (transforming) a Unicode code point value into a sequence of 1 to 4 byte-sized chunks. Converting the smallest code points (00 to 7F) to UTF-8 generates 1 byte whilst the higher code point values (10000 to 10FFFF) generate 4 bytes.

For example, the Arabic letter ش (“sheen”) is allocated the Unicode code point value 0634 (hex) and its representation in UTF-8 is the two byte sequence D8 B4 (hex). In the remainder of this article I will use examples from the Unicode encoding for Arabic, which is split into 4 blocks within the Basic Multilingual Plane.

Aside: refresher on hexadecimal: In technical literature discussing computer storage of numbers you will likely come across binary, octal and hexadecimal number systems. Consider the decimal number 251 which can be written as 251 = 2 x 10² + 5 x 10¹ + 1 x 10⁰ = 200 + 50 + 1. Here we are breaking 251 down into powers of 10: 10², 10¹ and 10⁰. We call 10 the base. However, we can use other bases including 2 (binary), 8 (octal) and 16 (hex). Note: x⁰ = 1 for any value of x not equal to 0.

Starting with binary (base 2) we can write 251 as

2⁷ 2⁶ 2⁵ 2⁴ 2³ 2² 2¹ 2⁰
1  1  1  1  1  0  1  1

If we use 8 as the base (called octal), 251 can be written as

8² 8¹ 8⁰
3  7  3

= 3 x 8² + 7 x 8¹ + 3 x 8⁰
= 3 x 64 + 7 x 8 + 3 x 1

If we use 16 as the base (called hexadecimal), 251 can be written as

16¹ 16⁰
15  11

Ah, but writing 251 as “1511” in hex (= 15 x 16¹ + 11 x 16⁰) is very confusing and problematic. Consequently, for numbers between 10 and 15 we choose to represent them in hex as follows

  • A=10
  • B=11
  • C=12
  • D=13
  • E=14
  • F=15

Consequently, 251 written in hex is represented as F x 16¹ + B x 16⁰, so that 251 = FB in hex. Each byte can be represented by a pair of hex digits.
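Incidentally, if you want to check such conversions quickly, C’s printf can produce octal and hex for you (a trivial sketch; note there is no standard printf conversion for binary):

#include <stdio.h>

int main(void)
{
	/* prints: 251 = 373 (octal) = FB (hex) */
	printf("%d = %o (octal) = %X (hex)\n", 251, 251, 251);
	return 0;
}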

So where do we start?

To convert code points into UTF-8 byte sequences, the code points are divided into the ranges shown in the following table, and the corresponding UTF-8 conversion pattern maps each code point value into a series of bytes.

Code point range    Code point binary sequence     UTF-8 bytes
00 to 7F            0xxxxxxx                       0xxxxxxx
0080 to 07FF        00000yyy yyxxxxxx              110yyyyy 10xxxxxx
0800 to FFFF        zzzzyyyy yyxxxxxx              1110zzzz 10yyyyyy 10xxxxxx
010000 to 10FFFF    000wwwzz zzzzyyyy yyxxxxxx     11110www 10zzzzzz 10yyyyyy 10xxxxxx

Source: Wikipedia

Just a small point, but you’ll note that the code points in the table have a number of leading zeros: for example, 0080. Recalling that a byte can be written as a pair of hex digits, the leading zeros help to indicate the number of bytes being used to represent the numbers. For example, 0080 is two bytes (00 and 80), and you’ll see that in the second column, where the code point is written out in its binary representation.

A note on storage of integers: An extremely important topic, but not one I’m going to address in detail, is the storage of different integer types on various computer platforms: issues include the lengths of integer storage units and endianness. The interested reader can start with these articles on Wikipedia:

  1. Integer (computer science)
  2. Short integer
  3. Endianness

For simplicity, I will assume that the code point range 0080 to 07FF is stored in a 16-bit storage unit called an unsigned short integer.

The code point range 010000 to 10FFFF contains code points that need a maximum of 21 bits of storage (100001111111111111111 for 10FFFF) but in practice they would usually be stored in a 32-bit unsigned integer.

Let’s walk through the process for the Arabic letter ش (“sheen”) which is allocated the Unicode code point of 0634 (in hex). Looking at our table, 0634 is in the range 0080 to 07FF so we need to transform 0634 into 2 UTF-8 bytes.

Tip for Windows users: The calculator utility shipped with Windows will generate bit patterns for you from decimal, hex and octal numbers.

Looking back at the table, we note that the UTF-8 bytes are constructed from ranges of bits contained in our code points. For example, referring to the code point range 0080 to 07FF, the first UTF-8 byte 110yyyyy contains the bit range yyyyy from our code point. Recalling our (simplifying) assumption that we are storing numbers 0080 to 07FF in 16-bit integers, the first step is to write 0634 (hex) as a pattern of bits, which is the 16-bit pattern 0000011000110100.

Our task is to “extract” the bit patterns yyyyy and xxxxxx so we place the appropriate bit pattern from the table next to our code point value:

0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0
0 0 0 0 0 y y y y y x x x x x x

By doing this we can quickly see that

yyyyy = 11000

xxxxxx = 110100

The UTF-8 conversion “template” for this code point value yields two separate bytes according to the pattern

110yyyyy 10xxxxxx

Hence we can write the UTF-8 bytes as 11011000 10110100 which, in hex notation, is D8 B4.

So, to transform the code point value 0634 into UTF-8 we have to generate 2 bytes by isolating the individual bit patterns of our code point value and using those bit patterns to construct two individual UTF-8 bytes. And the same general principle applies whether we need to create 2, 3 or 4 UTF-8 bytes for a particular code point: just follow the appropriate conversion pattern in the table. Of course, the conversion is trivial for 00 to 7F and is just the value of the code point itself.

How do we do this programmatically?

In C this is achieved by “bit masking” and “bit shifting”, which are fast, low-level operations. One simple algorithm could be:

  1. Apply a bit mask to the code point to isolate the bits of interest.
  2. If required, apply a right shift operator (>>) to “shuffle” the bit pattern to the right.
  3. Add the appropriate quantity to give the UTF-8 value.
  4. Store the result in a byte.

Bit masking

Bit masking uses the binary AND operator (&) which has the following properties:

1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0

We can use this property of the & operator to isolate individual bit patterns in a number by using a suitable bit mask which zeros out all but the bits we want to keep. From our table, code point values in the range 0080 to 07FF have a general 16-bit pattern represented as

00000yyyyyxxxxxx

We want to extract the two series of bit patterns: yyyyy and xxxxxx from our code point value so that we can use them to create two separate UTF-8 bytes:

UTF-8 byte 1 = 110yyyyy
UTF-8 byte 2 = 10xxxxxx

Isolating yyyyy

To isolate yyyyy we can use the following bit mask with the & operator

0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0

This masking value is 0000011111000000 = 0x07C0 (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 Bit mask
0 0 0 0 0 y y y y y 0 0 0 0 0 0 Result of operation

Note that the result of the masking operation for yyyyy leaves this bit pattern “stranded” in the middle of the number. So, we need to “shuffle” yyyyy along to the right by 6 places. To do this in C we use the right shift operator >>.

Isolating xxxxxx

To isolate xxxxxx we can use the following bit mask with the & operator:

0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1

The masking value is 0000000000111111 = 0x003F (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 Bit mask
0 0 0 0 0 0 0 0 0 0 x x x x x x Result of operation

The result of bit masking for xxxxxx leaves it at the right so we do not need to shuffle via the right shift operator >>.

Noting that
110yyyyy = 11000000 + 000yyyyy = 0xC0 + 000yyyyy

and that
10xxxxxx = 10000000 + 00xxxxxx = 0x80 + 00xxxxxx

we can summarize the process of transforming a code point between 0080 and 07FF into 2 bytes of UTF-8 data with a short snippet of C code.

unsigned char arabic_utf_byte1;
unsigned char arabic_utf_byte2;
unsigned short p; // our code point between 0080 and 07FF

arabic_utf_byte1 = (unsigned char)(((p & 0x07C0) >> 6) + 0xC0);
arabic_utf_byte2 = (unsigned char)((p & 0x003F) + 0x80);

Which takes a lot less space than the explanation!
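To check the snippet against the worked example, feed it the code point for ش (“sheen”); the printf line assumes <stdio.h> has been included:

p = 0x0634; // sheen
arabic_utf_byte1 = (unsigned char)(((p & 0x07C0) >> 6) + 0xC0);
arabic_utf_byte2 = (unsigned char)((p & 0x003F) + 0x80);
printf("%02X %02X\n", arabic_utf_byte1, arabic_utf_byte2); // prints D8 B4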

Other Arabic code point ranges

We have laboriously worked through the UTF-8 conversion process for code points which span the range 0080 to 07FF, a range which includes the “core” Arabic character code point range of 0600 to 06FF and the Arabic Supplement code point range of 0750 to 077F.

There are two further ranges we need to explore:

  • Arabic presentation forms A: FB50 to FDFF
  • Arabic presentation forms B: FE70 to FEFF

Looking back to our table, these two Arabic presentation form ranges fall within 0800 to FFFF, so we need to generate 3 bytes to encode each code point into UTF-8. The principles follow the reasoning above, so I will not repeat that here but simply offer some sample C code. Note that there is no error checking whatsoever in this code; it is simply meant to be an illustrative example and certainly needs to be improved for any form of production use.

You can download the C source and a file “arabic.txt” which contains a sample of the output from the code below. I hope it is useful.

#include <stdio.h>

void presentationforms(unsigned short min, unsigned short max, FILE* arabic);
void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic);

int main(void) {

	FILE *arabic = fopen("arabic.txt", "wb");

	coreandsupplement(0x600, 0x6FF, arabic);
	coreandsupplement(0x750, 0x77F, arabic);
	presentationforms(0xFB50, 0xFDFF, arabic);
	presentationforms(0xFE70, 0xFEFF, arabic);

	fclose(arabic);

	return 0;
}

void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic)
{

	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		arabic_utf_byte1 = (unsigned char)(((p & 0x07C0) >> 6) + 0xC0);
		arabic_utf_byte2 = (unsigned char)((p & 0x003F) + 0x80);
		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
	}
	
	return;

}


void presentationforms(unsigned short min, unsigned short max, FILE* arabic)
{
	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned char arabic_utf_byte3;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		arabic_utf_byte1 = (unsigned char)(((p & 0xF000) >> 12) + 0xE0);
		arabic_utf_byte2 = (unsigned char)(((p & 0x0FC0) >> 6) + 0x80);
		arabic_utf_byte3 = (unsigned char)((p & 0x003F)+ 0x80);

		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
		fwrite(&arabic_utf_byte3,1,1,arabic); 
	}

	return;

}