{"id":1284,"date":"2011-10-17T18:05:51","date_gmt":"2011-10-17T18:05:51","guid":{"rendered":"http:\/\/www.readytext.co.uk\/?p=1284"},"modified":"2017-02-20T13:02:58","modified_gmt":"2017-02-20T13:02:58","slug":"unicode-for-the-impatient-part-3-utf-8-bits-bytes-and-c-code","status":"publish","type":"post","link":"https:\/\/www.readytext.co.uk\/?p=1284","title":{"rendered":"Unicode for the impatient (Part 3: UTF-8 bits, bytes and C code)"},"content":{"rendered":"<p>I promised to finish the series on Unicode and UTF-8 so here is the final instalment, better late than never. Before reading this article I suggest that you read <a href=\"https:\/\/www.readytext.co.uk\/?p=800\">Part 1<\/a> and <a href=\"https:\/\/www.readytext.co.uk\/?p=745\">Part 2<\/a> which cover some important background. As usual, I&#8217;m trying to avoid simply repeating the huge wealth of information already published on this topic, but (hopefully) this article will provide a few additional details which may assist with understanding. Additionally, I&#8217;m missing out a lot of detail and not taking a &#8220;rigorous&#8221; approach in my explanations, so I&#8217;d be grateful to know whether readers find it useful.<\/p>\n<blockquote><p><strong>Reminder on code points:<\/strong> The Unicode encoding scheme assigns each character a unique integer in the range 0 to 1,114,111; each integer is called a <em>code point<\/em>.<\/p><\/blockquote>\n<p>The &#8220;TF&#8221; in UTF-8 stands for <em>Transformation Format<\/em> so, in essence, you can think of UTF-8 as a &#8220;recipe&#8221; for converting (transforming) a Unicode code point value into a sequence of 1 to 4 byte-sized chunks. 
Converting the smallest code points (<code>00<\/code> to <code>7F<\/code>) to UTF-8 generates 1 byte whilst the higher code point values (<code>10000 <\/code>to <code>10FFFF<\/code>) generate 4 bytes.<\/p>\n<p>For example, the Arabic letter \u0634 (&#8220;sheen&#8221;) is allocated the Unicode code point value <code>0634<\/code> (hex) and its representation in UTF-8 is the two byte sequence <code>D8 B4<\/code> (hex). In the remainder of this article I will use examples from the Unicode encoding for Arabic, which is split into 4 blocks within the Basic Multilingual Plane.<\/p>\n<ul>\n<li>&#8220;core&#8221; Arabic:\u00a0 <a href=\"http:\/\/unicode.org\/charts\/PDF\/U0600.pdf\">0600 to 06FF<\/a><\/li>\n<li>Arabic Supplement:\u00a0 <a href=\"http:\/\/unicode.org\/charts\/PDF\/U0750.pdf\">0750 to 077F<\/a><\/li>\n<li>Arabic presentation forms A: <a href=\"http:\/\/unicode.org\/charts\/PDF\/UFB50.pdf\"> FB50 to FDFF<\/a><\/li>\n<li>Arabic presentation forms B: <a href=\"http:\/\/unicode.org\/charts\/PDF\/UFE70.pdf\">FE70 to FEFF<\/a><\/li>\n<\/ul>\n<blockquote><p><strong>Aside: refresher on hexadecimal:<\/strong> In technical literature discussing computer storage of numbers you will likely come across binary, octal and hexadecimal number systems.\u00a0 Consider the decimal number 251 which can be written as 251 = 2 x 10<sup>2<\/sup> + 5 x 10<sup>1<\/sup> + 1 x 10<sup>0<\/sup> = 200 + 50 + 1. Here we are breaking 251 down into powers of 10: 10<sup>2<\/sup>, 10<sup>1<\/sup> and 10<sup>0<\/sup>. We call 10 the <em>base<\/em>. However, we can use other bases including 2 (binary), 8 (octal) and 16 (hex). 
Note: x<sup>0<\/sup> = 1 for any value of x not equal to 0.<\/p>\n<p>Starting with binary (base 2) we can write 251 as<\/p>\n<table>\n<tbody>\n<tr>\n<td>2<sup>7<\/sup><\/td>\n<td>2<sup>6<\/sup><\/td>\n<td>2<sup>5<\/sup><\/td>\n<td>2<sup>4<\/sup><\/td>\n<td>2<sup>3<\/sup><\/td>\n<td>2<sup>2<\/sup><\/td>\n<td>2<sup>1<\/sup><\/td>\n<td>2<sup>0<\/sup><\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>If we use 8 as the base (called octal), 251 can be written as<\/p>\n<table>\n<tbody>\n<tr>\n<td>8<sup>2<\/sup><\/td>\n<td>8<sup>1<\/sup><\/td>\n<td>8<sup>0<\/sup><\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>7<\/td>\n<td>3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>= 3 x 8<sup>2<\/sup> + 7 x 8<sup>1<\/sup> + 3 x 8<sup>0<\/sup><br \/>\n= 3 x 64 + 7 x 8 + 3 x 1<br \/>\n= 192 + 56 + 3 = 251<\/p>\n<p>If we use 16 as the base (called hexadecimal), 251 can be written as<\/p>\n<table>\n<tbody>\n<tr>\n<td>16<sup>1<\/sup><\/td>\n<td>16<sup>0<\/sup><\/td>\n<\/tr>\n<tr>\n<td>15<\/td>\n<td>11<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Ah, but writing 251 as &#8220;1511&#8221; in hex (= 15 x 16<sup>1<\/sup> + 11 x 16<sup>0<\/sup>) is very confusing and problematic. So, for numbers between 10 and 15 we choose to represent them in hex as follows<\/p>\n<ul>\n<li>A=10<\/li>\n<li>B=11<\/li>\n<li>C=12<\/li>\n<li>D=13<\/li>\n<li>E=14<\/li>\n<li>F=15<\/li>\n<\/ul>\n<p>Consequently, 251 written in hex is represented as F x 16<sup>1<\/sup> + B x 16<sup>0<\/sup>, so that 251 = FB in hex. 
Each byte can be represented by a pair of hex digits.<\/p><\/blockquote>\n<h2>So where do we start?<\/h2>\n<p>To convert code points into UTF-8 byte sequences, the code points are divided into the ranges shown in the following table; each range has its own conversion pattern that maps a code point value into a series of bytes.<\/p>\n<table width=\"100%\">\n<colgroup>\n<col width=\"30%\" \/>\n<col width=\"30%\" \/>\n<col width=\"40%\" \/> <\/colgroup>\n<tbody>\n<tr>\n<th>Code point range<\/th>\n<th>Code point binary sequences<\/th>\n<th>UTF-8 bytes<\/th>\n<\/tr>\n<tr>\n<td><code>00<\/code> to <code>7F<\/code><\/td>\n<td><code>0<span style=\"color: red;\">xxxxxxx<\/span><\/code><\/td>\n<td><code>0<span style=\"color: red;\">xxxxxxx<\/span><\/code><\/td>\n<\/tr>\n<tr>\n<td><code>0080<\/code> to <code>07FF<\/code><\/td>\n<td><code>00000<span style=\"color: green;\">yyy<\/span> <span style=\"color: red;\">yyxxxxxx<\/span><\/code><\/td>\n<td><code>110<span style=\"color: green;\">yyy<\/span><span style=\"color: red;\">yy <\/span><\/code><code>10<span style=\"color: red;\">xxxxxx<\/span><\/code><\/td>\n<\/tr>\n<tr>\n<td><code>0800<\/code> to <code>FFFF<\/code><\/td>\n<td><code><span style=\"color: green;\">zzzzyyyy<\/span> <span style=\"color: red;\">yyxxxxxx<\/span><\/code><\/td>\n<td><code>1110<span style=\"color: green;\">zzzz<\/span><\/code><code> 10<span style=\"color: green;\">yyyy<\/span><span style=\"color: red;\">yy<\/span><\/code><code> 10<span style=\"color: red;\">xxxxxx<\/span><\/code><\/td>\n<\/tr>\n<tr>\n<td><code>010000<\/code> to <code>10FFFF<\/code><\/td>\n<td><code>000<span style=\"color: blue;\">wwwzz<\/span> <span style=\"color: green;\">zzzzyyyy<\/span> <span style=\"color: red;\">yyxxxxxx<\/span><\/code><\/td>\n<td><code>11110<span style=\"color: blue;\">www<\/span><\/code> <code>10<span style=\"color: blue;\">zz<\/span><span style=\"color: green;\">zzzz<\/span> 10<span style=\"color: green;\">yyyy<\/span><span style=\"color: 
red;\">yy<\/span><\/code><code> 10<span style=\"color: red;\">xxxxxx<\/span><\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Source: <a href=\"http:\/\/en.wikipedia.org\/wiki\/UTF-8\">Wikipedia<\/a><\/p>\n<p>Just a small point but you\u2019ll note that the code points in the table have a number of leading zeros, for example <code>0080<\/code>. Recalling that a byte is a pair of hex digits, the leading zeros help to indicate the number of bytes being used to represent the numbers. For example, <code>0080<\/code> is two bytes (<code>00<\/code> and <code>80<\/code>) and you\u2019ll see that in the second column where the code point is written out in its binary representation.<\/p>\n<blockquote><p><strong>A note on storage of integers:<\/strong> An extremely important topic, but one I&#8217;m not going to address in detail, is the storage of different integer types on various computer platforms: issues include the lengths of integer storage units and endianness. The interested reader can start with these articles on Wikipedia:<\/p>\n<ol>\n<li><a href=\"http:\/\/en.wikipedia.org\/wiki\/Integer_%28computer_science%29\">Integer (computer science)<\/a><\/li>\n<li><a href=\"http:\/\/en.wikipedia.org\/wiki\/Short_integer\">Short integer<\/a><\/li>\n<li><a href=\"http:\/\/en.wikipedia.org\/wiki\/Endianness\">Endianness<\/a><\/li>\n<\/ol>\n<p>For simplicity, I will assume that the code point range <code>0080<\/code> to <code>07FF<\/code> is stored in a 16-bit storage unit called an unsigned short integer.<\/p>\n<p>The code point range <code>010000<\/code> to <code>10FFFF<\/code> contains code points that need a maximum of 21 bits of storage (<code>100001111111111111111<\/code> for <code>10FFFF<\/code>) but in practice they would usually be stored in a 32-bit unsigned integer.<\/p><\/blockquote>\n<p>Let\u2019s walk through the process for the Arabic letter \u0634 (&#8220;sheen&#8221;) which is allocated the Unicode code point of <code>0634<\/code> (in hex). 
Looking at our table, <code>0634 <\/code>is in the range <code>0080<\/code> to <code>07FF<\/code> so we need to transform <code>0634<\/code> into 2 UTF-8 bytes.<\/p>\n<blockquote><p><strong>Tip for Windows users: <\/strong>The calculator utility shipped with Windows will generate bit patterns for you from decimal, hex and octal numbers.<\/p><\/blockquote>\n<p>Looking back at the table, we note that the UTF-8 bytes are constructed from ranges of bits contained in our code points. For example, referring to the code point range <code>0080<\/code> to <code>07FF<\/code>, the first UTF-8 byte <code>110<\/code><span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> contains the bit range <span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> from our code point. Recalling our (simplifying) assumption that we are storing numbers <code>0080<\/code> to <code>07FF<\/code> in 16-bit integers, the first step is to write <code>0634<\/code> (hex) as a pattern of bits, which is the 16-bit pattern <code>0000011000110100<\/code>.<\/p>\n<p>Our task is to &#8220;extract&#8221; the bit patterns <span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> and <span style=\"color: red;\"><code>xxxxxx<\/code><\/span> so we place the appropriate bit pattern from the table next to our code point value:<\/p>\n<table>\n<tbody>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: red;\">y<\/span><\/td>\n<td><span style=\"color: 
red;\">y<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>By doing this we can quickly see that<\/p>\n<p><span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> = <code>11000 <\/code><\/p>\n<p><span style=\"color: red;\"><code>xxxxxx<\/code><\/span>= <code>110100<\/code><\/p>\n<p>The UTF-8 conversion &#8220;template&#8221; for this code point value yields two separate bytes according to the pattern<\/p>\n<p><code>110<span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span><\/code> <code> 10<\/code><span style=\"color: red;\"><code>xxxxxx<\/code><\/span><\/p>\n<p>Hence we can write the UTF-8 bytes as <code>11011000 10110100<\/code> which, in hex notation, is <code>D8 B4<\/code>.<\/p>\n<p>So, to transform the code point value <code>0634<\/code> into UTF-8 we have to generate 2 bytes by isolating the individual bit patterns of our code point value and using those bit patterns to construct two individual UTF-8 bytes. And the same general principle applies whether we need to create 2, 3 or 4 UTF-8 bytes for a particular code point: just follow the appropriate conversion pattern in the table. Of course, the conversion is trivial for <code>00<\/code> to <code>7F<\/code> and is just the value of the code point itself.<\/p>\n<h1>How do we do this programmatically?<\/h1>\n<p>In C this is achieved by &#8220;bit masking&#8221; and &#8220;bit shifting&#8221;, which are fast, low-level operations. 
One simple algorithm could be:<\/p>\n<ol>\n<li>Apply a bit mask to the code point to isolate the bits of interest.<\/li>\n<li>If required, apply a right shift operator (<code>&gt;&gt;<\/code>) to &#8220;shuffle&#8221; the bit pattern to the right.<\/li>\n<li>Add the appropriate quantity to give the UTF-8 value.<\/li>\n<li>Store the result in a byte.<\/li>\n<\/ol>\n<h2>Bit masking<\/h2>\n<p>Bit masking uses the binary AND operator (<code>&amp;<\/code>) which has the following properties:<\/p>\n<p><code>1 &amp; 1 = 1<br \/>\n1 &amp; 0 = 0<br \/>\n0 &amp; 1 = 0<br \/>\n0 &amp; 0 = 0<br \/>\n<\/code><\/p>\n<p>We can use this property of the <code>&amp;<\/code> operator to isolate individual bit patterns in a number by using a suitable <em>bit mask<\/em> which zeros out all but the bits we want to keep. From our table, code point values in the range <code>0080<\/code> to <code>07FF<\/code> have a general 16-bit pattern represented as<\/p>\n<p><code>00000<span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yyxxxxxx<\/code><\/span><\/code><\/p>\n<p>We want to extract the two series of bit patterns: <span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> and <span style=\"color: red;\"><code>xxxxxx<\/code><\/span> from our code point value so that we can use them to create two separate UTF-8 bytes:<\/p>\n<p>UTF-8 byte 1 = <code>110<\/code><span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span><br \/>\nUTF-8 byte 2 = <code>10<\/code><span style=\"color: red;\"><code>xxxxxx<\/code><\/span><\/p>\n<h3>Isolating <span style=\"color: green;\"><code>yyy<\/code><span style=\"color: red;\"><code>yy<\/code><\/span> <\/span><\/h3>\n<p>To isolate <span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> we can use the following bit mask with the <code>&amp;<\/code> 
operator<\/p>\n<table>\n<tbody>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This masking value is <code>0000011111000000 = 0x07C0<\/code> (hex number in C notation).<\/p>\n<table>\n<tbody>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: red;\">y<\/span><\/td>\n<td><span style=\"color: red;\">y<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td>Generic bit pattern<\/td>\n<\/tr>\n<tr>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>Binary AND operator<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>Bit mask<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: red;\">y<\/span><\/td>\n<td><span style=\"color: 
red;\">y<\/span><\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>Result of operation<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note that the result of the masking operation for <span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> leaves this bit pattern &#8220;stranded&#8221; in the middle of the number. So, we need to &#8220;shuffle&#8221; <span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> along to the right by 6 places. To do this in C we use the right shift operator <code>&gt;&gt;<\/code>.<\/p>\n<h3>Isolating <span style=\"color: red;\"><code>xxxxxx<\/code><\/span><\/h3>\n<p>To isolate <span style=\"color: red;\"><code>xxxxxx<\/code><\/span> we can use the following bit mask with the <code>&amp;<\/code> operator:<\/p>\n<table>\n<tbody>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The masking value is <code>0000000000111111 = 0x003F<\/code> (hex number in C notation).<\/p>\n<table>\n<tbody>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: green;\">y<\/span><\/td>\n<td><span style=\"color: red;\">y<\/span><\/td>\n<td><span style=\"color: red;\">y<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td>Generic bit 
pattern<\/td>\n<\/tr>\n<tr>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>&amp;<\/td>\n<td>Binary AND operator<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<td>Bit mask<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td><span style=\"color: red;\">x<\/span><\/td>\n<td>Result of operation<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The result of bit masking for <span style=\"color: red;\"><code>xxxxxx<\/code><\/span> leaves it at the right so we do not need to shuffle via the right shift operator <code>&gt;&gt;.<\/code><\/p>\n<p>Noting that<br \/>\n<code>110<\/code><span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> = <code>11000000<\/code> + <code>000<\/code><span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span> = <code>0xC0<\/code> + <code>000<\/code><span style=\"color: green;\"><code>yyy<\/code><\/span><span style=\"color: red;\"><code>yy<\/code><\/span><\/p>\n<p>and that<br \/>\n<code>10<\/code><span style=\"color: red;\"><code>xxxxxx<\/code><\/span> = <code>10000000<\/code> + <code>00<\/code><span style=\"color: red;\"><code>xxxxxx<\/code><\/span> = <code>0x80<\/code> + <code>00<\/code><span style=\"color: 
red;\"><code>xxxxxx<\/code><\/span><\/p>\n<p>we can summarize the process of transforming a code point between\u00a0<code>0080<\/code> and\u00a0<code>07FF<\/code> into 2 bytes of UTF-8 data with a short snippet of C code.<\/p>\n<pre class=\"brush: cpp; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\nunsigned char arabic_utf_byte1;\r\nunsigned char arabic_utf_byte2;\r\nunsigned short p; \/\/ our code point between 0080 and 07FF\r\n\r\narabic_utf_byte1= (unsigned char)(((p &amp; 0x07C0) &gt;&gt; 6) + 0xC0);\r\narabic_utf_byte2= (unsigned char)((p &amp; 0x003F) + 0x80);\r\n<\/pre>\n<p>Which takes a lot less space than the explanation!<\/p>\n<h1>Other Arabic code point ranges<\/h1>\n<p>We have laboriously worked through the UTF-8 conversion process for code points which span the range <code>0080<\/code> to <code>07FF<\/code>, a range which includes the &#8220;core&#8221; Arabic character code point range of <code>0600<\/code> to <code>06FF<\/code> and the Arabic Supplement code point range of <code>0750<\/code> to <code>077F<\/code>.<\/p>\n<p>There are two further ranges we need to explore:<\/p>\n<ul>\n<li>Arabic presentation forms A: <code>FB50<\/code> to <code>FDFF<\/code><\/li>\n<li>Arabic presentation forms B: <code>FE70<\/code> to <code>FEFF<\/code><\/li>\n<\/ul>\n<p>Looking back to our table, these two Arabic presentation form ranges fall within <code>0800<\/code> to <code>FFFF<\/code> so we need to generate 3 bytes to encode them into UTF-8. The principles follow the reasoning above so I will not repeat that here but simply offer some sample C code. Note that there is no error checking whatsoever in this code; it is simply meant to be an illustrative example and certainly needs to be improved for any form of production use.<\/p>\n<p>You can <a href=\"http:\/\/readytext.co.uk\/files\/arabicunicode.zip\">download the C source and a file &#8220;arabic.txt&#8221;<\/a> which contains a sample of the output from the code below. 
I hope it is useful.<\/p>\n<pre class=\"brush: cpp; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n#include &lt;stdio.h&gt;\r\n\r\nvoid presentationforms(unsigned short min, unsigned short max, FILE* arabic);\r\nvoid coreandsupplement(unsigned short min, unsigned short max, FILE* arabic);\r\n\r\nint main(void) {\r\n\r\n\t\/\/ Binary mode so the bytes are written exactly as generated\r\n\tFILE * arabic= fopen(&quot;arabic.txt&quot;, &quot;wb&quot;);\r\n\r\n\tcoreandsupplement(0x600, 0x6FF, arabic);\r\n\tcoreandsupplement(0x750, 0x77F, arabic);\r\n\tpresentationforms(0xFB50, 0xFDFF, arabic);\r\n\tpresentationforms(0xFE70, 0xFEFF, arabic);\r\n\t\r\n\tfclose(arabic);\r\n\r\n\treturn 0;\r\n}\r\n\r\n\/\/ Code points in the range 0080 to 07FF: 2 UTF-8 bytes each\r\nvoid coreandsupplement(unsigned short min, unsigned short max, FILE* arabic)\r\n{\r\n\r\n\tunsigned char arabic_utf_byte1;\r\n\tunsigned char arabic_utf_byte2;\r\n\tunsigned short p;\r\n\r\n\tfor(p = min; p &lt;= max; p++)\r\n\t{\r\n\t\t\/\/ 110yyyyy: mask yyyyy, shift right by 6, add 0xC0\r\n\t\tarabic_utf_byte1=  (unsigned char)(((p &amp; 0x07C0) &gt;&gt; 6) + 0xC0);\r\n\t\t\/\/ 10xxxxxx: mask xxxxxx, add 0x80\r\n\t\tarabic_utf_byte2= (unsigned char)((p &amp; 0x003F) + 0x80);\r\n\t\tfwrite(&amp;arabic_utf_byte1,1,1,arabic);\r\n\t\tfwrite(&amp;arabic_utf_byte2,1,1,arabic); \r\n\t}\r\n\t\r\n\treturn;\r\n\r\n}\r\n\r\n\r\n\/\/ Code points in the range 0800 to FFFF: 3 UTF-8 bytes each\r\nvoid presentationforms(unsigned short min, unsigned short max, FILE* arabic)\r\n{\r\n\tunsigned char arabic_utf_byte1;\r\n\tunsigned char arabic_utf_byte2;\r\n\tunsigned char arabic_utf_byte3;\r\n\tunsigned short p;\r\n\r\n\tfor(p = min; p &lt;= max; p++)\r\n\t{\r\n\t\t\/\/ 1110zzzz, 10yyyyyy and 10xxxxxx respectively\r\n\t\tarabic_utf_byte1 = (unsigned char)(((p &amp; 0xF000) &gt;&gt; 12) + 0xE0);\r\n\t\tarabic_utf_byte2 = (unsigned char)(((p &amp; 0x0FC0) &gt;&gt; 6) + 0x80);\r\n\t\tarabic_utf_byte3 = (unsigned char)((p &amp; 0x003F)+ 0x80);\r\n\r\n\t\tfwrite(&amp;arabic_utf_byte1,1,1,arabic);\r\n\t\tfwrite(&amp;arabic_utf_byte2,1,1,arabic); \r\n\t\tfwrite(&amp;arabic_utf_byte3,1,1,arabic); \r\n\t}\r\n\r\n\treturn;\r\n\r\n}\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I promised to finish the series on Unicode and UTF-8 so here is the final instalment, better late 
than never. Before reading this article I suggest that you read Part 1 and Part 2 which cover some important background. As usual, I&#8217;m trying to avoid simply repeating the huge wealth of information already published on [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12,28,17,10],"tags":[],"class_list":["post-1284","post","type-post","status-publish","format-standard","hentry","category-arabic","category-c-programming-miscellaneous","category-unicode-arabic","category-unicode"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/1284","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1284"}],"version-history":[{"count":94,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/1284\/revisions"}],"predecessor-version":[{"id":3941,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/1284\/revisions\/3941"}],"wp:attachment":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1284"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1284"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1284"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}