Unicode for the impatient (Part 3: UTF-8 bits, bytes and C code)

I promised to finish the series on Unicode and UTF-8 so here is the final instalment, better late than never. Before reading this article I suggest that you read Part 1 and Part 2 which cover some important background. As usual, I’m trying to avoid simply repeating the huge wealth of information already published on this topic, but (hopefully) it will provide a few additional details which may assist with understanding. Additionally, I’m missing out a lot of detail and not taking a “rigorous” approach in my explanations, so I’d be grateful to know if readers feel whether or not it is useful.

Reminder on code points: The Unicode encoding scheme assigns each character with a unique integer in the range 0 to 1,114,111; each integer is called a code point.

The “TF” in UTF-8 stands for Transformation Format so, in essence, you can think of UTF-8 as a “recipe” for converting (transforming) a Unicode code point value into a sequence of 1 to 4 byte-sized chunks. Converting the smallest code points (00 to 7F) to UTF-8 generates 1 byte whilst the higher code point values (10000 to 10FFFF) generate 4 bytes.

For example, the Arabic letter ش (“sheen”) is allocated the Unicode code point value 0634 (hex) and its representation in UTF-8 is the two byte sequence D8 B4 (hex). In the remainder of this article I will use examples from the Unicode encoding for Arabic, which is split into 4 blocks within the Basic Multilingual Plane.

Aside: refresher on hexadecimal: In technical literature discussing computer storage of numbers you will likely come across binary, octal and hexadecimal number systems.  Consider the decimal number 251 which can be written as 251 = 2 x 102 + 5 x 101 + 1 x 100 = 200 + 50 + 1. Here we are breaking 251 down into powers of 10: 102, 101 and 100. We call 10 the base. However, we can use other bases including 2 (binary), 8 (octal) and 16 (hex). Note: x0 = 1 for any value of x not equal to 0.

Starting with binary (base 2) we can write 251 as

27 26 25 24 23 22 21 20
1 1 1 1 1 0 1 1

If we use 8 as the base (called octal), 251 can be written as

82 81 80
3 7 3

= 3 x 82 + 7 x 81 + 3 x 80
= 3 x 64 + 7 x 8 + 3 x 1

If we use 16 as the base (called hexidecimal), 251 can be written as

161 160
15 11

Ah, but writing 251 as “1511” in hex (= 15 x 161 + 11 x 160) is very confusing and problematic. Consequently, for numbers between 10 and 15 we choose to represent them in hex as follows

  • A=10
  • B=11
  • C=12
  • D=13
  • E=14
  • F=15

Consequently, 251 written in hex, is represented as F x 161 + B x 160, so that 251 = FB in hex. Each byte can be represented by a pair of hex digits.

So where do we start?

To convert code points into UTF-8 byte sequences the code points are divided up into the following ranges and use the UTF-8 conversion pattern shown in the following table to map each code point value into a series of bytes.

Code point range Code point binary sequences UTF-8 bytes
00 to7F 0xxxxxxx 0xxxxxxx
0080 to 07FF 00000yyy yyxxxxxx 110yyyyy 10xxxxxx
0800 to  FFFF zzzzyyyy yyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx
010000 to 10FFFF 000wwwzz zzzzyyyy yyxxxxxx 11110www 10zzzzzz 10yyyyyy 10xxxxxx

Source: Wikipedia

Just a small point but you’ll note that the code points in the table have a number of leading zeros, for example 0080. Recalling that a byte is a pair of hex digits, the leading zeros help to indicate the number of bytes being used to represent the numbers. For example, 0080 is two bytes (00 and 80) and you’ll see that in the second column where the code point is written out in its binary representation.

A note on storage of integers: An extremely important topic, but not one I’m not going to address in detail, is the storage of different integer types on various computer platforms: issues include the lengths of integer storage units and endianness. The interested reader can start with these articles on Wikipedia:

  1. Integer (computer science)
  2. Short integer
  3. Endianness

For simplicity, I will assume that the code point range 0080 to 07FF is stored in a 16-bit storage unit called an unsigned short integer.

The code point range 010000 to 10FFFF contains code points that need a maximum of 21 bits of storage (100001111111111111111 for 10FFFF) but in practice they would usually be stored in a 32-bit unsigned integer.

Let’s walk through the process for the Arabic letter ش (“sheen”) which is allocated the Unicode code point of 0634 (in hex). Looking at our table, 0634 is in the range 0080 to 07FF so we need to transform 0634 into 2 UTF-8 bytes.

Tip for Windows users: The calculator utility shipped with Windows will generate bit patterns for you from decimal, hex and octal numbers.

Looking back at the table, we note that the UTF-8 bytes are constructed from ranges of bits contained in our code points. For example, referring to the code point range 0080 to 07FF, the first UTF-8 byte 110yyyyy contains the bit range yyyyy from our code point. Recalling our (simplifying) assumption that we are storing numbers 0080 to 07FF in 16-bit integers, the first step is to write 0634 (hex) as a pattern of bits, which is the 16-bit pattern 0000011000110100.

Our task is to “extract” the bit patterns yyyyy and xxxxxx so we place the appropriate bit pattern from the table next to our code point value:

0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0
0 0 0 0 0 y y y y y x x x x x x

By doing this we can quickly see that

yyyyy = 11000

xxxxxx= 110100

The UTF-8 conversion “template” for this code point value yields two separate bytes according to the pattern

110yyyyy 10xxxxxx

Hence we can write the UTF-8 bytes as 11011000 10110100 which, in hex notation, is D8 B4.

So, to transform the code point value 0634 into UTF-8 we have to generate 2 bytes by isolating the individual bit patterns of our code point value and using those bit patterns to construct two individual UTF-8 bytes. And the same general principle applies whether we need to create 2, 3 or 4 UTF-8 bytes for a particular code point: just follow the appropriate conversion pattern in the table. Of course, the conversion is trivial for 00 to 7F and is just the value of the code point itself.

How do we do this programmatically?

In C this is achieved by “bit masking” and “bit shifting”, which are fast, low-level operations. One simple algorithm could be:

  1. Apply a bit mask to the code point to isolate the bits of interest.
  2. If required, apply a right shift operator (>>) to “shuffle” the bit pattern to the right.
  3. Add the appropriate quantity to give the UTF-8 value.
  4. Store the result in a byte.

Bit masking

Bit masking uses the binary AND operator (&) which has the following properties:

1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0

We can use this property of the & operator to isolate individual bit patterns in a number by using a suitable bit mask which zeros out all but the bits we want to keep. From our table, code point values in the range 0080 to 07FF have a general 16-bit pattern represented as

00000yyyyyxxxxxx

We want to extract the two series of bit patterns: yyyyy and xxxxxx from our code point value so that we can use them to create two separate UTF-8 bytes:

UTF-8 byte 1 = 110yyyyy
UTF-8 byte 2 = 10xxxxxx

Isolating yyyyy

To isolate yyyyy we can use the following bit mask with the & operator

0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0

This masking value is 0000011111000000 = 0x07C0 (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 Bit mask
0 0 0 0 0 y y y y y 0 0 0 0 0 0 Result of operation

Note that the result of the masking operation for yyyyy leaves this bit pattern “stranded” in the middle of the number. So, we need to “shuffle” yyyyy along to the right by 6 places. To do this in C we use the right shift operator >>.

Isolating xxxxxx

To isolate xxxxxx we can use the following bit mask with the & operator:

0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1

The masking value is 0000000000111111 = 0x003F (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 Bit mask
0 0 0 0 0 0 0 0 0 0 x x x x x x Result of operation

The result of bit masking for xxxxxx leaves it at the right so we do not need to shuffle via the right shift operator >>.

Noting that
110yyyyy = 11000000 + 000yyyyy = 0xC0 + 000yyyyy

and that
10xxxxxx = 10000000 + 00xxxxxx = 0x80 + 00xxxxxx

we can summarize the process of transforming a code point between 0080 and 07FF into 2 bytes of UTF-8 data with a short snippet of C code.

unsigned char arabic_utf_byte1;
unsigned char arabic_utf_byte2;
unsigned short p; // our code point between 0080 and 07FF

arabic_utf_byte1= (unsigned char)(((p & 0x07c0) >> 6) + 0xC0);
arabic_utf_byte2= (unsigned char)((p & 0x003F) + 0x80);

Which takes a lot less space than the explanation!

Other Arabic code point ranges

We have laboriously worked through the UTF-8 conversion process for code points which span the range 0080 to 07FF, a range which includes the “core” Arabic character code point range of 0600 to 06FF and the Arabic Supplement code point range of 0750 to 077F.

There are two further ranges we need to explore:

  • Arabic presentation forms A: FB50 to FDFF
  • Arabic presentation forms B: FE70 to FEFF

Looking back to our table, these two Arabic presentation form ranges fall within 0800 to FFFF so we need to generate 3 bytes to encode them into UTF-8. The principles follow the reasoning above so I will not repeat that here but simply offer some sample C code. Note that there is no error checking whatsoever in this code, it is simply meant to be an illustrative example and certainly needs to be improved for any form of production use.

You can download the C source and a file “arabic.txt” which contains the a sample of output from the code below. I hope it is useful.

#include <stdio.h>

void presentationforms(unsigned short min, unsigned short max, FILE* arabic);
void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic);

void main() {

	FILE * arabic= fopen("arabic.txt", "wb");

	coreandsupplement(0x600, 0x6FF, arabic);
	coreandsupplement(0x750, 0x77F, arabic);
	presentationforms(0xFB50, 0xFDFF, arabic);
	presentationforms(0xFE70, 0xFEFF, arabic);
	
	fclose(arabic);

  }

void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic)
{

	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		arabic_utf_byte1=  (unsigned char)(((p & 0x07c0) >> 6) + 0xC0);
		arabic_utf_byte2= (unsigned char)((p & 0x003F) + 0x80);
		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
	}
	
	return;

}


void presentationforms(unsigned short min, unsigned short max, FILE* arabic)
{
	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned char arabic_utf_byte3;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		arabic_utf_byte1 = (unsigned char)(((p & 0xF000) >> 12) + 0xE0);
		arabic_utf_byte2 = (unsigned char)(((p & 0x0FC0) >> 6) + 0x80);
		arabic_utf_byte3 = (unsigned char)((p & 0x003F)+ 0x80);

		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
		fwrite(&arabic_utf_byte3,1,1,arabic); 
	}

	return;

}

Unicode for the impatient (Part 2: UTF-X, what is it?)

Introduction

As mentioned in the previous post, a vast amount of information on Unicode is widely available so I won’t waste your reading time by repeating it here. What I will try to add is some additional explanations to fill in a few gaps, and reflecting the parts which I found a bit tricky to “pull together”. I am deliberately keeping these discussions “simple”, with due apologies to any experts reading this material.

Unicode: characters!

A fundamental idea/concept you will need to feel comfortable with, and which is at the heart of Unicode, is the notion of “characters”, at least the Unicode way of thinking about them. “characters” is one of those words for which most of us have some form of “natural understanding”, as used in day-to-day conversations. But any formal specification has to be rather more precise and provide definitions which can sometimes lead to confusion, because it complicates something which we think of as quite natural and simple and whose meaning we take for granted. For example, it would be quite natural to think of a and a as different “characters”: bold ‘a’ and italic ‘a’. But not so in Unicode: these are merely different visual representations of the same character, which Unicode calls LATIN SMALL LETTER A.

Unicode defines character as

“The smallest component of written language that has semantic value…”

You can think of a character as the fundamental unit, or building block, of a language; what it actually looks like when displayed on some device such as a screen or printout using a particular font is not relevant to Unicode: only the meaning is of interest.

Characters: names, numbers and properties

Each character listed in the Unicode Standard is assigned a formal name and assigned a number. However, although it is tempting to think of Unicode as just a long list of character names with a number attached to each one, the Unicode Standard is far more than this. If giving a character name and number is all that Unicode Standard offered, it would be a nothing more than a very large encoding of the world’s list of characters. The Unicode Standard goes far beyond this simple model.

The characters of a lanuage don’t all perform the same role; for example, in English (Latin script), there are characters for punctuation, digits as well as letters. So, in addition to a name and a number, characters are assigned properties, which are fully described in references listed in the Unicode Character Database (UCD). These properties are of course essential for use in computerised text processing operations such as seaching, sorting, spell-checking, bidirectional text algorithms and so forth.

Not all characters are designed to be displayed

This may sound strange but it’s true. Within the Unicode Standard are characters whose job is “to control the interpretation or display of text” so they do not have a visual form: they are there to provide information, typically for text-processing operations. With Arabic, for example, you may wish to force or prevent characters from joining, so Unicode provides special control characters to do this: the so-called ZERO WIDTH JOINER and the ZERO WIDTH NON-JOINER.

Unicode: the numbers game

OK, let’s get closer to the goal of this post: UTF. As mentioned above, Unicode allocates each character a number and at present these numbers range from 0 through to 1,114,111 (0 to 10FFFF in hexadecimal), giving a maximum of 1,114,112 characters covered by the Standard. In Unicode-speak, each value between 0 through to 1,114,111 is called a code point. This range of 1,114,112 values (code points) is divided into 17 so-called planes, each plane containing 65536 code points (17 × 65536 = 1,114,112). Each plane is further divided into groups of code points called blocks. Full details are covered very nicely on Wikipedia.

For example, using the excellent UniBook™ Character Browser, you can see that the code point range 0000 to 00FF (hex) contains the blocks:

  • 0000..007F = Basic Latin
  • 0080..00FF = Latin-1 Supplement

A full list of the blocks can be downloaded from the Unicode web site.

UTF or Unicode Transformation Format

As the section title says, UTF stands for Unicode Transformation Format (or UCS Transformation Format). If you want to dive straight into the detail then Chapter 2 of the Unicode Standard General Structure (section 2.5) is the place for you. In this section I’ll try to condense this down into a simplified explanation.

As we’ve seen, Unicode assigns each character to a number in the range 0 through to 1,114,111. Now, one pretty obvious point, but I’ll mention it anyway, is that computers are in the business of storing and processing numbers. What we see as text is, ultimately, stored in memory or in a file as a sequence of numbers. Here, I’m going to simplify the discussion and only talk about whole numbers, called integers – actually just positive integers like 23, 675, not negative ones such as -56, -2323 etc. When a computer stores an integer it has a few choices, depending on how big that integer is: the smallest basic storage unit is the byte, but there are other storage types the computer can use. Again, simplifying… most common desktop computers can use storage units of 2 bytes, 4 bytes or more, but we’ll stick to storage units that are 1, 2 or 4 bytes in size. Here, when I refer to 2 bytes or 4 bytes I’m referring to indivisible units of 2 bytes or 4 bytes in size – not 2 or 4 consecutive 1-byte units stored end to end.

The point to note is that the maximum positive integer you can store in a 1-byte unit is 255, in a 2-byte unit it is 65535 and in a 4-byte unit it is 4,294,967,295. Hmmm, that’s interesting because the biggest number we need to store is 1,114,111, which clearly will not fit in 1 byte, and it is much bigger than 65536 so it won’t fit into 2 bytes. However, 1,114,111 is much smaller than 4,294,967,295, which is the next option, in 4 bytes. So, if we chose 4 bytes as our storage unit we certainly have more than enough room to store all the Unicode values, each one as a 4-byte integer. But this is very wasteful of space for a couple of reasons. Partly because the characters for many of the world’s languages and scripts are encoded in the first of the Unicode planes: Plane 0, or to give its proper name, the Basic Multilingual Plane or BMP, which needs just 2 bytes (see below) per character. To see a list of Unicode planes, check out this Wikipedia article. Storing Unicode code-points as 4-byte integers is also wasteful for all other planes because even the largest code points need a maximum of 21 bits, which means that 11 bits out of 32 would never be used.

As we said, each of the 17 Unicode planes, including the Basic Multilingual Plane, contains 65536 code points (Unicode character numbers) so the Basic Multilingual Plane, being the first one, contains the range 0 to 65535 or to use the conventional Unicode hexadecimal (base 16) notation: 0000 through to FFFF. Now, of course, a 2-byte storage unit is big enough to hold all those values and indeed you can work with that.

So, the real goal here is to find a way to store the Unicode code points (all 1,114,112 of them) such that you can save the Unicode text data values to a text file or process them in the computer’s memory. Clearly, you can store every code point as a 4-byte value. This is the simplest way to do it but wasteful of space. If you choose the method of 4-byte storage units (32 bits) then what you are doing is using a method called UTF-32. I’m not going to explore the 2-byte (16-bit) method called UTF-16 (as I have not studied it!) but skip straight to the next one down called UTF-8 because that is what LuaTeX uses.

Now, it may seem like there is a contradiction here. I just said that the biggest number you can store in 1 byte is 255, but the biggest Unicode code point is 1,114,112 and no way can you store that in 1 byte! However, through some clever thinking, you can still use a sequence of 1-byte storage units to represent numbers bigger than 255. And here’s where the Transformation in Unicode Transformation Format really becomes much more apparent: UTF-8.

From Unicode code points to Arabic text

In Unicode, the range (in hex) 0600 to 06FF is used for Arabic characters. Each value in the range 0600 to 06FF is referred to as a code point. In simple terms, just think of it as a number allocated to an Arabic character. For example, 0630 is allocated to ARABIC LETTER THAL (ذ). In advance of more detailed step-by-step tutorials I thought I would post a small C code program which will convert the 256 values 0600 to 06FF into UTF-8 encoding. The following C code will create a 512 byte UTF-8 encoded text file that you can open with BabelPad, for example. You can download the text file and C source here. I would not enter this code into a beauty contest but it is simple and works.

#include <stdio.h>
void main() {

	unsigned short unicode_min = 0x0600;
	unsigned short unicode_max = 0x06FF;
	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;

	FILE * arabic = fopen("arabic.txt", "wb");

	for(unsigned short p = unicode_min; p <= unicode_max; p++)
	{
		arabic_utf_byte1 = (unsigned char)(((p & 0x07c0) >>6) + 0xC0);
		arabic_utf_byte2 = (unsigned char)((p & 0x003F) + 0x80);
		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic);
	}
	fclose(arabic);
  }

If you open arabic.txt in BabelPad you should see something like the following. Note, from within BabelPad you need to switch off complex rendering otherwise Windows’ Uniscribe shaping engine will be activated. In BabelPad, choose Options --> Simple Rendering. What you see will depend on the font you choose in BabelPad, the following uses the OpenType font “Arabic Typesetting” (shipped with Windows Vista). Of course, some code points do not correspond to actual characters or the Arabic Typesetting font does not have the appropriate glyphs: these are shown by a question mark (?) in a box.

Unicode for the impatient (Part 1: updated)

Over the last few weeks I have been reading extensively about Unicode, OpenType, UTF-8 plus a whole set of technologies related to typesetting complex scripts (Arabic) with LuaTeX. It is, for sure, a pretty complex picture so I thought I would try to summarise some of what I have learnt through a series of “mini tutorials” to (hopefully) help others save some time. The most time-consuming part of the learning process is piecing together the “landscape”, gathering an awareness of the components you need, and understanding how they fit together – before you can start to explore any one of them in detail. Building an appreciation of the interdependent concepts is also the most frustrating part because you feel that you are not actually “getting anywhere” and it takes a lot of time.

There is already a vast wealth of information published on Unicode so I will try not to simply repeat that but attempt to “paint a picture” of the basic ideas and concepts in the order that makes sense to me, at least.

Great free tools for Windows users

To get started, here are a couple of free tools that will help to learn and explore Unicode. I’ll start the proper tutorials in Part 2 of this series.

BabelPad

This is far more than just an excellent free Unicode text editor for Windows. It provides a wide range of tools and utilities that will help you get to grips with Unicode.

The Unibook™ Character Browser

Invaluable resource to explore information about the characters defined in the Unicode Standard and the International Standard ISO/IEC 10646.

… and free tools for Mac users

Many thanks to Patrick Gundlach for the following update:

“For Mac users, I would like to recommend the unicode checker at http://earthlingsoft.net/UnicodeChecker/ – I use it very often to look up the code point of a character and to look behind the scenes (composition, alternatives, variants).”