One way to compile GNU Fribidi as a static library (.lib) using Visual Studio

Introduction and caveat reader

Yesterday I spent about half an hour seeing if I could get the GNU Fribidi C library (version 0.19.2) to build as a static library (.lib) under Windows, using Visual Studio. Well, I cheated a bit and used my MinGW/MSYS install (which I use to build LuaTeX) to create the config.h header. However, it built OK so I thought I’d share what I did; but do please be aware that I’ve not yet fully tested the .lib I built, so use these notes with care. I merely provide them as a starting point.

config.h

If you’ve ever used MinGW/MSYS or Linux build tools you’ll know that config.h is a header file created by the standard configure-based (autotools) build process. In essence, config.h sets a number of #defines based on your MinGW/MSYS build environment: you then need to copy the resulting config.h into your Visual Studio project and include it there. However, the point to note is that the config.h generated by the MinGW/MSYS build process may contain #defines which “switch on” headers and features that are not available to your Visual Studio setup. What I do is comment out a few of the config.h #defines to get a set that works. This is a bit kludgy, but to date it has usually worked out for me. If you don’t have MinGW/MSYS installed, you can download the config.h I generated and tweaked. Again, I make no guarantees it’ll work for you.
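To give a flavour of the kind of edit I mean (purely illustrative; the defines you need to touch will depend on your own generated config.h), MSVC does not ship headers such as unistd.h or strings.h, so their corresponding defines are typical candidates for commenting out:

/* Illustrative edits to config.h: comment out defines for headers
   that are not available under Visual Studio. Your config.h may differ. */
/* #define HAVE_UNISTD_H 1 */
/* #define HAVE_STRINGS_H 1 */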

An important Preprocessor Definition

Within the Preprocessor Definitions of your Visual Studio project you need to add one called HAVE_CONFIG_H, which enables the inclusion of config.h.
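This works because the sources follow the usual autoconf convention of guarding the include, along the lines of the sketch below (paraphrased rather than copied from the Fribidi sources):

#ifdef HAVE_CONFIG_H
#include <config.h>   /* only pulled in when HAVE_CONFIG_H is defined */
#endif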

Two minor changes to the source code

Because I’m building a static library (.lib) I made two tiny edits to the source code; again, there are better ways to do this. The change is to the definition of FRIBIDI_ENTRY. Within common.h and fribidi-common.h there are tests for WIN32 which end up setting:

#define FRIBIDI_ENTRY __declspec(dllexport)

For example, in common.h


...
#if (defined(WIN32)) || (defined(_WIN32_WCE))
#define FRIBIDI_ENTRY __declspec(dllexport)
#endif /* WIN32 */
...

I edited this to

#if (defined(WIN32)) || (defined(_WIN32_WCE))
#define FRIBIDI_ENTRY
#endif /* WIN32 */

i.e., I removed the __declspec(dllexport). The same change is needed in fribidi-common.h.
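If you prefer not to edit the headers every time you update the sources, a slightly tidier sketch (using FRIBIDI_STATIC_LIB, a macro name of my own invention that you would add to the static-library project’s Preprocessor Definitions) would be:

#if (defined(WIN32)) || (defined(_WIN32_WCE))
#ifdef FRIBIDI_STATIC_LIB   /* hypothetical macro, set only in the .lib project */
#define FRIBIDI_ENTRY
#else
#define FRIBIDI_ENTRY __declspec(dllexport)
#endif
#endif /* WIN32 */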

One more setting

Within fribidi-config.h I ensured that the FRIBIDI_CHARSETS was set to 1:

#define FRIBIDI_CHARSETS 1

And finally

You simply need to create a new static library project, make sure that all the relevant include paths are set correctly, and then try the edits and settings suggested above to see if they work for you. Here is a screenshot of my project showing the C code files I added; they come from the …\charset and …\lib folders of the C source distribution.

With the above steps the library built with just 2 level-4 compiler warnings (that is, after I had added the _CRT_SECURE_NO_WARNINGS preprocessor definition to silence deprecation warnings). I hope these notes are useful, but do please note that I have not thoroughly tested the resulting .lib file, so please be sure to perform your own due diligence.

Full getopt Port for Unicode and Multibyte Microsoft Visual C, C++, or MFC Projects

If you have ever tried to port Linux apps to Windows you may find this useful:

http://www.codeproject.com/Articles/157001/Full-getopt-Port-for-Unicode-and-Multibyte-Microso

From the codeproject site: “This software was written after hours of searching for a robust Microsoft C and C++ implementation of getopt, which led to devoid results. This software is a modification of the Free Software Foundation, Inc. getopt library for parsing command line arguments and its purpose is to provide a Microsoft Visual C friendly derivative.”

Former commercial Windows TeX distribution (C source code) released under GNU GPL

The C source code of the once-commercial Y&Y TeX distribution (for Windows) was donated to the TeX Users Group after Y&Y TeX Inc ceased trading. I bought a copy of the Y&Y TeX system in the late 1990s and certainly found it to be an excellent distribution. The C source code is free to download from http://code.google.com/p/yytex/. If you are interested in exploring the inner workings of TeX and want an easy-to-build Windows code base then this should be of real interest. Note that the DVI viewer, DVIWindo, makes use of Adobe Type Manager libraries which are not included in the download; in addition, binaries are not provided so you’ll need to compile them yourself.

Unicode for the impatient (Part 3: UTF-8 bits, bytes and C code)

I promised to finish the series on Unicode and UTF-8 so here is the final instalment, better late than never. Before reading this article I suggest that you read Part 1 and Part 2, which cover some important background. As usual, I’m trying to avoid simply repeating the huge wealth of information already published on this topic, but hopefully this article will provide a few additional details which may assist with understanding. Additionally, I’m leaving out a lot of detail and not taking a “rigorous” approach in my explanations, so I’d be grateful to hear whether readers find it useful.

Reminder on code points: The Unicode encoding scheme assigns each character a unique integer in the range 0 to 1,114,111; each such integer is called a code point.

The “TF” in UTF-8 stands for Transformation Format so, in essence, you can think of UTF-8 as a “recipe” for converting (transforming) a Unicode code point value into a sequence of 1 to 4 byte-sized chunks. Converting the smallest code points (00 to 7F) to UTF-8 generates 1 byte, whilst the highest code point values (10000 to 10FFFF) generate 4 bytes.

For example, the Arabic letter ش (“sheen”) is allocated the Unicode code point value 0634 (hex) and its representation in UTF-8 is the two byte sequence D8 B4 (hex). In the remainder of this article I will use examples from the Unicode encoding for Arabic, which is split into 4 blocks within the Basic Multilingual Plane.

Aside: refresher on hexadecimal: In technical literature discussing computer storage of numbers you will likely come across binary, octal and hexadecimal number systems. Consider the decimal number 251, which can be written as 251 = 2 x 10² + 5 x 10¹ + 1 x 10⁰ = 200 + 50 + 1. Here we are breaking 251 down into powers of 10: 10², 10¹ and 10⁰. We call 10 the base. However, we can use other bases including 2 (binary), 8 (octal) and 16 (hex). Note: x⁰ = 1 for any value of x not equal to 0.

Starting with binary (base 2) we can write 251 as

2⁷  2⁶  2⁵  2⁴  2³  2²  2¹  2⁰
1   1   1   1   1   0   1   1

If we use 8 as the base (called octal), 251 can be written as

8²  8¹  8⁰
3   7   3

= 3 x 8² + 7 x 8¹ + 3 x 8⁰
= 3 x 64 + 7 x 8 + 3 x 1

If we use 16 as the base (called hexadecimal), 251 can be written as

16¹  16⁰
15   11

Ah, but writing 251 as “1511” in hex (= 15 x 16¹ + 11 x 16⁰) is very confusing and problematic. Consequently, for numbers between 10 and 15 we choose to represent them in hex as follows:

  • A=10
  • B=11
  • C=12
  • D=13
  • E=14
  • F=15

Consequently, 251, written in hex, is represented as F x 16¹ + B x 16⁰, so that 251 = FB in hex. Each byte can be represented by a pair of hex digits.
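If you want to double-check such conversions yourself, a tiny C program will print the same value in all three bases (illustrative only):

#include <stdio.h>

int main(void)
{
    /* 251 in decimal, octal and hexadecimal: prints "251 373 FB" */
    printf("%d %o %X\n", 251, 251, 251);
    return 0;
}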

So where do we start?

To convert code points into UTF-8 byte sequences, the code points are divided into the following ranges, and each range uses the UTF-8 conversion pattern shown in the table below to map a code point value into a series of bytes.

Code point range      Code point binary sequence             UTF-8 bytes
00 to 7F              0xxxxxxx                               0xxxxxxx
0080 to 07FF          00000yyy yyxxxxxx                      110yyyyy 10xxxxxx
0800 to FFFF          zzzzyyyy yyxxxxxx                      1110zzzz 10yyyyyy 10xxxxxx
010000 to 10FFFF      000wwwzz zzzzyyyy yyxxxxxx             11110www 10zzzzzz 10yyyyyy 10xxxxxx

Source: Wikipedia

Just a small point, but you’ll note that the code points in the table are written with a number of leading zeros, for example 0080. Recalling that a byte can be written as a pair of hex digits, the leading zeros help to indicate the number of bytes being used to store the numbers. For example, 0080 is two bytes (00 and 80), and you can see that in the second column, where the code point is written out in its binary representation.

A note on storage of integers: An extremely important topic, but not one I’m going to address in detail, is the storage of different integer types on various computer platforms: issues include the lengths of integer storage units and endianness. The interested reader can start with these articles on Wikipedia:

  1. Integer (computer science)
  2. Short integer
  3. Endianness

For simplicity, I will assume that the code point range 0080 to 07FF is stored in a 16-bit storage unit called an unsigned short integer.

The code point range 010000 to 10FFFF contains code points that need a maximum of 21 bits of storage (100001111111111111111 for 10FFFF) but in practice they would usually be stored in a 32-bit unsigned integer.
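If you want to confirm these storage sizes on your own compiler (they are typical of Windows builds but not guaranteed by the C standard), a quick illustrative check is:

#include <stdio.h>

int main(void)
{
    /* On a typical 32/64-bit Windows build this prints 2 and 4 (bytes) */
    printf("unsigned short: %u bytes\n", (unsigned)sizeof(unsigned short));
    printf("unsigned int:   %u bytes\n", (unsigned)sizeof(unsigned int));
    return 0;
}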

Let’s walk through the process for the Arabic letter ش (“sheen”) which is allocated the Unicode code point of 0634 (in hex). Looking at our table, 0634 is in the range 0080 to 07FF so we need to transform 0634 into 2 UTF-8 bytes.

Tip for Windows users: The calculator utility shipped with Windows will generate bit patterns for you from decimal, hex and octal numbers.

Looking back at the table, we note that the UTF-8 bytes are constructed from ranges of bits contained in our code points. For example, referring to the code point range 0080 to 07FF, the first UTF-8 byte 110yyyyy contains the bit range yyyyy from our code point. Recalling our (simplifying) assumption that we are storing numbers 0080 to 07FF in 16-bit integers, the first step is to write 0634 (hex) as a pattern of bits, which is the 16-bit pattern 0000011000110100.

Our task is to “extract” the bit patterns yyyyy and xxxxxx so we place the appropriate bit pattern from the table next to our code point value:

0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0
0 0 0 0 0 y y y y y x x x x x x

By doing this we can quickly see that

yyyyy = 11000

xxxxxx = 110100

The UTF-8 conversion “template” for this code point value yields two separate bytes according to the pattern

110yyyyy 10xxxxxx

Hence we can write the UTF-8 bytes as 11011000 10110100 which, in hex notation, is D8 B4.

So, to transform the code point value 0634 into UTF-8 we have to generate 2 bytes by isolating the individual bit patterns of our code point value and using those bit patterns to construct two individual UTF-8 bytes. And the same general principle applies whether we need to create 2, 3 or 4 UTF-8 bytes for a particular code point: just follow the appropriate conversion pattern in the table. Of course, the conversion is trivial for 00 to 7F and is just the value of the code point itself.

How do we do this programmatically?

In C this is achieved by “bit masking” and “bit shifting”, which are fast, low-level operations. One simple algorithm could be:

  1. Apply a bit mask to the code point to isolate the bits of interest.
  2. If required, apply a right shift operator (>>) to “shuffle” the bit pattern to the right.
  3. Add the appropriate quantity to give the UTF-8 value.
  4. Store the result in a byte.

Bit masking

Bit masking uses the binary AND operator (&) which has the following properties:

1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0

We can use this property of the & operator to isolate individual bit patterns in a number by using a suitable bit mask which zeros out all but the bits we want to keep. From our table, code point values in the range 0080 to 07FF have a general 16-bit pattern represented as

00000yyyyyxxxxxx

We want to extract the two series of bit patterns: yyyyy and xxxxxx from our code point value so that we can use them to create two separate UTF-8 bytes:

UTF-8 byte 1 = 110yyyyy
UTF-8 byte 2 = 10xxxxxx

Isolating yyyyy

To isolate yyyyy we can use the following bit mask with the & operator

0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0

This masking value is 0000011111000000 = 0x07C0 (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 Bit mask
0 0 0 0 0 y y y y y 0 0 0 0 0 0 Result of operation

Note that the result of the masking operation for yyyyy leaves this bit pattern “stranded” in the middle of the number. So, we need to “shuffle” yyyyy along to the right by 6 places. To do this in C we use the right shift operator >>.
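Using our example code point 0634 (hex), the mask-then-shift step looks like this in C (a small illustrative fragment):

unsigned short p = 0x0634;            /* code point for sheen */
unsigned short masked = p & 0x07C0;   /* 0x0600: yyyyy stranded in the middle */
unsigned short yyyyy = masked >> 6;   /* 0x0018, i.e. binary 11000 */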

Isolating xxxxxx

To isolate xxxxxx we can use the following bit mask with the & operator:

0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1

The masking value is 0000000000111111 = 0x003F (hex number in C notation).

0 0 0 0 0 y y y y y x x x x x x Generic bit pattern
& & & & & & & & & & & & & & & & Binary AND operator
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 Bit mask
0 0 0 0 0 0 0 0 0 0 x x x x x x Result of operation

The result of bit masking for xxxxxx leaves it at the right so we do not need to shuffle via the right shift operator >>.

Noting that
110yyyyy = 11000000 + 000yyyyy = 0xC0 + 000yyyyy

and that
10xxxxxx = 10000000 + 00xxxxxx = 0x80 + 00xxxxxx

we can summarize the process of transforming a code point between 0080 and 07FF into 2 bytes of UTF-8 data with a short snippet of C code.

unsigned char arabic_utf_byte1;
unsigned char arabic_utf_byte2;
unsigned short p; // our code point between 0080 and 07FF

// isolate yyyyy, shift it right by 6 places and add 0xC0 to form 110yyyyy
arabic_utf_byte1 = (unsigned char)(((p & 0x07C0) >> 6) + 0xC0);
// isolate xxxxxx and add 0x80 to form 10xxxxxx
arabic_utf_byte2 = (unsigned char)((p & 0x003F) + 0x80);

Which takes a lot less space than the explanation!
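As a quick sanity check, feeding the sheen code point through that snippet reproduces the byte sequence we derived by hand (assuming stdio.h is included for printf):

unsigned short p = 0x0634; /* sheen */
unsigned char arabic_utf_byte1 = (unsigned char)(((p & 0x07C0) >> 6) + 0xC0); /* 0xD8 */
unsigned char arabic_utf_byte2 = (unsigned char)((p & 0x003F) + 0x80);        /* 0xB4 */
printf("%02X %02X\n", arabic_utf_byte1, arabic_utf_byte2); /* prints: D8 B4 */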

Other Arabic code point ranges

We have laboriously worked through the UTF-8 conversion process for code points which span the range 0080 to 07FF, a range which includes the “core” Arabic character code point range of 0600 to 06FF and the Arabic Supplement code point range of 0750 to 077F.

There are two further ranges we need to explore:

  • Arabic presentation forms A: FB50 to FDFF
  • Arabic presentation forms B: FE70 to FEFF

Looking back to our table, these two Arabic presentation form ranges fall within 0800 to FFFF, so we need to generate 3 bytes to encode each of their code points into UTF-8. The principles follow the reasoning above so I will not repeat them here but simply offer some sample C code. Note that there is no error checking whatsoever in this code; it is simply meant to be an illustrative example and certainly needs to be improved for any form of production use.

You can download the C source and a file “arabic.txt” which contains a sample of the output from the code below. I hope it is useful.

#include <stdio.h>

void presentationforms(unsigned short min, unsigned short max, FILE* arabic);
void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic);

int main(void) {

	FILE *arabic = fopen("arabic.txt", "wb");

	coreandsupplement(0x600, 0x6FF, arabic);    /* core Arabic block */
	coreandsupplement(0x750, 0x77F, arabic);    /* Arabic Supplement */
	presentationforms(0xFB50, 0xFDFF, arabic);  /* Presentation Forms-A */
	presentationforms(0xFE70, 0xFEFF, arabic);  /* Presentation Forms-B */

	fclose(arabic);

	return 0;
}

void coreandsupplement(unsigned short min, unsigned short max, FILE* arabic)
{

	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		arabic_utf_byte1=  (unsigned char)(((p & 0x07c0) >> 6) + 0xC0);
		arabic_utf_byte2= (unsigned char)((p & 0x003F) + 0x80);
		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
	}
	
	return;

}


void presentationforms(unsigned short min, unsigned short max, FILE* arabic)
{
	unsigned char arabic_utf_byte1;
	unsigned char arabic_utf_byte2;
	unsigned char arabic_utf_byte3;
	unsigned short p;

	for(p = min; p <= max; p++)
	{
		// 1110zzzz: top four bits (zzzz) shifted right by 12, plus 0xE0
		arabic_utf_byte1 = (unsigned char)(((p & 0xF000) >> 12) + 0xE0);
		// 10yyyyyy: middle six bits shifted right by 6, plus 0x80
		arabic_utf_byte2 = (unsigned char)(((p & 0x0FC0) >> 6) + 0x80);
		// 10xxxxxx: low six bits plus 0x80
		arabic_utf_byte3 = (unsigned char)((p & 0x003F) + 0x80);

		fwrite(&arabic_utf_byte1,1,1,arabic);
		fwrite(&arabic_utf_byte2,1,1,arabic); 
		fwrite(&arabic_utf_byte3,1,1,arabic); 
	}

	return;

}
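For completeness, here is a sketch of a general encoder covering all four ranges in the table, including the 4-byte case which the sample code above does not need. It is not part of the original code: encode_utf8 is a name of my own choosing and, as before, there is no validation of the input (it assumes cp is no larger than 10FFFF).

static int encode_utf8(unsigned int cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {         /* 110yyyyy 10xxxxxx */
        out[0] = (unsigned char)(((cp & 0x07C0) >> 6) + 0xC0);
        out[1] = (unsigned char)((cp & 0x003F) + 0x80);
        return 2;
    } else if (cp <= 0xFFFF) {        /* 1110zzzz 10yyyyyy 10xxxxxx */
        out[0] = (unsigned char)(((cp & 0xF000) >> 12) + 0xE0);
        out[1] = (unsigned char)(((cp & 0x0FC0) >> 6) + 0x80);
        out[2] = (unsigned char)((cp & 0x003F) + 0x80);
        return 3;
    } else {                          /* 11110www 10zzzzzz 10yyyyyy 10xxxxxx */
        out[0] = (unsigned char)(((cp & 0x1C0000) >> 18) + 0xF0);
        out[1] = (unsigned char)(((cp & 0x03F000) >> 12) + 0x80);
        out[2] = (unsigned char)(((cp & 0x000FC0) >> 6) + 0x80);
        out[3] = (unsigned char)((cp & 0x00003F) + 0x80);
        return 4;
    }
}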

Extending LuaTeX on Windows with plugins (DLLs)

About 6 months ago I came across an article and presentation by Luigi Scarso called “LuaTeX lunatic”, with the subtitle And Now for Something Completely Different. And different it was because, for me, it opened my eyes to some of the real power of LuaTeX: extending it via C/C++ libraries. Luigi’s truly excellent paper is Linux-centric but the general ideas hold true for any platform, including Windows.

The power of Lua’s require(...) function

The Lua language provides a function called require(...) which allows you to load and run libraries written either in pure Lua or in C using the Lua C API. Refer to the Libraries And Bindings page on lua-users.org for more details.

Using require(...) with LuaTeX: a primer

Once again, the secret ingredient is the LuaTeX command \directlua{...} which, as discussed in previous posts, lets you run Lua code from within documents you process with LuaTeX. Suppose you, or someone else, has written a DLL with a Lua binding and you want to use it with LuaTeX. How do you do it?

Firstly, within texmf.cnf you need to define a variable called CLUAINPUTS, which tells Kpathsea where to search for files with extension .dll and .so (shared object file, on Linux). For example, in my hand-rolled texmf.cnf the setting is

CLUAINPUTS=$TEXMF/dlls

The LuaTeX Reference Manual notes the default setting of

CLUAINPUTS=.:$SELFAUTOLOC/lib/{$progname,$engine,}/lua//

World’s most pointless DLL code?

Just for completeness, and by way of an ultra-minimal example, here is probably the world’s most pointless C code for a DLL that you can call from LuaTeX. To compile this you will, of course, need to ensure that you link to the Lua libraries (note that I use Microsoft’s Visual Studio for this).


#include <stdio.h>
#include <windows.h>
#include "lauxlib.h"
#include "lua.h"

#define LUA_LIB __declspec(dllexport) int

/* The function we will call from LuaTeX as helloluatex.greetings() */
static int helloluatex_greetings(lua_State *L) {
    printf("Hello to LuaTeX from the world's smallest DLL!");
    return 0;
}

/* Registration table mapping Lua names to C functions */
static const luaL_reg helloluatex[] = {
    {"greetings", helloluatex_greetings},
    {NULL, NULL}
};

/* Entry point: require("helloluatex") looks for luaopen_helloluatex */
LUA_LIB luaopen_helloluatex(lua_State *L) {
    luaL_register(L, "helloluatex", helloluatex);
    return 1;
}

You need to compile the above C code into a DLL called helloluatex.dll and copy it to the directory or path pointed to by CLUAINPUTS.
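For reference, building from the Visual Studio command prompt boils down to something like the following (the include and library paths, and the import library name lua51.lib, are placeholders for your particular Lua/LuaTeX setup):

cl /LD /I"C:\path\to\lua\include" helloluatex.c lua51.lib /link /LIBPATH:"C:\path\to\lua\lib"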

LuaTeX code to use our new DLL

Here is a minimal (LaTeX) file to load helloluatex.dll and call the greetings function we defined via the Lua C API. We'll call the file dlltest.tex.

\documentclass[11pt,twoside]{article}
\begin{document}
\pagestyle{empty}
\directlua{

	require("helloluatex")
	helloluatex.greetings()
}
\end{document}

Running this as luatex --fmt=lualatex dlltest.tex gives the output

This is LuaTeX, Version beta-0.65.0-2010122301
(c:/.../dlltest.tex
LaTeX2e <2009/09/24>
(c:/.../formats/pdflatex/base/article.cls
Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
(c:/.../formats/pdflatex/base/size11.clo))
No file dlltest.aux.
Hello to LuaTeX from the world's smallest DLL!(./dlltest.aux) )
 262 words of node memory still in use:
   2 hlist, 1 vlist, 1 rule, 2 glue, 39 glue_spec, 2 write nodes
   avail lists: 2:12,3:1,6:3,7:1,9:1
No pages of output.
Transcript written on dlltest.log.

Note that you see Hello to LuaTeX from the world's smallest DLL! printed out to the DOS window.

This is, of course, a rather simple example so I'll try to provide more useful examples over the coming weeks and months. I have integrated a number of other libraries into LuaTeX, including FreeType and Ghostscript, so I'll try to cover some of these wonderful C libraries as time permits. Stay tuned!