Adding a UTF-8-capable regular expression library to LuaTeX

Introduction

In this post I’m going to sketch out adding the free PCRE C library to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I’ve not tried this in a production environment. So, do please undertake all necessary testing and due diligence in your own code!

PCRE: Perl Compatible Regular Expressions

PCRE is a mature C library which provides a very powerful regular expression engine. It is also capable of working with UTF-8 encoded strings, which is, of course, very useful because LuaTeX uses UTF-8 input. I’m not going to cover the entire PCRE build process in this post because, frankly, it’ll take too long. But in outline…

Building PCRE as a static library (.lib)

  1. I used CMake to create a Visual Studio 2008 project via the PCRE-supplied CMakeLists.txt file. Using the CMake tool you can set the appropriate compile-time flags for UTF-8 support: PCRE_SUPPORT_UTF and PCRE_SUPPORT_UNICODE_PROPERTIES. The latter is very useful for searching UTF-8 strings based on their Unicode character properties. Full details are in the PCRE documentation.
  2. After you finish configuring the PCRE build, and have selected your build environment, press Generate and CMake will output a complete Visual Studio project that you can open and start working on. Wonderful!
  3. Getting PCRE to build as a static library was fine, but I did have a few hassles getting the library to link correctly against the DLL I was building. It took me a bit of time to figure out which additional PCRE preprocessor directives I needed to set in the DLL C code to ensure everything was #define'd properly.

Building a DLL for LuaTeX

I wrote a very brief overview of building DLLs for LuaTeX in this post so I won’t repeat the details here. Instead, I’ll give a summary indicating how you can get PCRE to call LuaTeX. One word of advice: PCRE comes with a lot of documentation and you’ll need to read through it very carefully! Asking PCRE to call LuaTeX sounds strange but you can indeed do it, because PCRE lets you register a callback function that it calls whenever matching reaches a callout point you have marked in the pattern. Perl has a similar ability to execute Perl code during matching. From the PCRE documentation:

“PCRE provides a feature called ‘callout’, which is a means of temporarily passing control to the caller of PCRE in the middle of pattern matching. The caller of PCRE provides an external function by putting its entry point in the global variable pcre_callout.”

Calling LuaTeX

OK, so how do we do that? There are two parts to this story: create a Lua function you want to call from C and create the C function which calls the Lua function.

  1. From within LuaTeX, use \directlua{...} to create a simple Lua function printy that we are going to call from PCRE. This Lua function takes a string and sends it to LuaTeX via tex.print(). In these examples I sent LuaTeX a simple text string "Yo! I was called!", which LuaTeX then typeset. Of course, you could also send LuaTeX the string that was matched by PCRE!
           \directlua{
                  function printy (str)
                  tex.print(str)
                  end
           }
    
  2. The next part is to create the C code to call a Lua function. This C function is the callout that PCRE will call when it matches a string.
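           int mycallout(pcre_callout_block *cb){
               /* callout_data carries the lua_State pointer set up before pcre_exec() */
               lua_State *L = (lua_State *) cb->callout_data;
               if (L){
                   /* push the value of the global printy onto the Lua stack */
                   lua_getglobal(cb->callout_data, "printy");
                   if (!lua_isfunction(L, -1)) {
                       lua_pop(L, 1);   /* not a function: discard it */
                       return 0;
                   }

                   lua_pushstring(L, "Yo! I was called!");   /* push 1st argument */
                   /* Now make the call to printy with 1 argument and 0 results */
                   if (lua_pcall(L, 1, 0, 0) != 0) {
                       /* report your error, then pop the error message */
                       lua_pop(L, 1);
                       return 0;
                   }
               }
               return 0;   /* zero tells PCRE to carry on matching as normal */
           }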
    

    A few points here are worth noting.

    • From the PCRE documentation:

      “The external callout function returns an integer to PCRE. If the value is zero, matching proceeds as normal. If the value is greater than zero, matching fails at the current point, but the testing of other matching possibilities goes ahead, just as if a lookahead assertion had failed. If the value is less than zero, the match is abandoned, the matching function returns the negative value”

    • The lua_State variable, *L, is passed in via a mechanism I’ll outline below.
    • The line lua_getglobal(cb->callout_data, "printy") does the main work of pushing the value of the global variable printy onto Lua’s stack. Of course, in effect this is a pointer to the function we defined in LuaTeX, and which we call through lua_pcall(...). Further details are in the Lua documentation.
    • The above code does near-zero error checking; it is purely to demonstrate the ideas!

Other PCRE bits and pieces

There are a few other points to consider, namely how do you set up the callout and how do you pass lua_State *L to the callout? I’m not going to explain in great detail how all these parts hang together in a full application; I’ll simply point out some key pieces.

  1. You have to set the PCRE global variable pcre_callout, a function pointer, to your callout function. Simply, pcre_callout = mycallout; Yes, it does work.
  2. Before you can start searching, you need to “compile” your regular expression pattern. Here, re represents our compiled regular expression pattern. Note that you must use the PCRE_UTF8 option if you are searching UTF-8 encoded text.
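                  /* on failure, err_msg receives an error message and
                     err the offset in pattern where compilation stopped */
                  re = pcre_compile(pattern,
                                    PCRE_UTF8|PCRE_UCP,
                                    &err_msg,
                                    &err,
                                    NULL);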
    
  3. Note that to use PCRE callouts you need to use the appropriate syntax in your regular expression; from the PCRE documentation, “Within a regular expression, (?C) indicates the points at which the external function is to be called.” For example, in the pattern \d+(?C)x the callout is invoked each time matching reaches the point between \d+ and x. Once you have compiled your search pattern, and done your error checking, you need to run the search engine using the compiled pattern and your target string (s in the code below).
  4. The next step is to create a pointer to something called a pcre_extra, which is a struct. This struct has a field called callout_data which is a pointer into which you can store whatever you want to pass into the mycallout function: here, I’m setting it to the lua_State variable, L. PCRE hands this pointer to the callout in the callout_data field of the pcre_callout_block, so each time PCRE reaches a callout point and calls the callout function, the lua_State variable, L, will be available for our use! Clearly, you’ll need to do this from within the appropriate function you call from LuaTeX. Once this is done you are ready to begin your searching using pcre_exec(...).

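                  pcre_extra *p;
                  p = (pcre_extra*) malloc(sizeof(pcre_extra));   /* check for NULL in real code */
                  memset(p, 0, sizeof(pcre_extra));
                  p->callout_data = L;                  /* reaches mycallout as cb->callout_data */
                  p->flags = PCRE_EXTRA_CALLOUT_DATA;   /* tell PCRE that callout_data is set */
                  res = pcre_exec(re,        /* compiled pattern */
                                  p,         /* extra data, including callout_data */
                                  s,         /* subject string */
                                  len,       /* length of s in bytes */
                                  0,         /* start offset */
                                  0,         /* options */
                                  offsets,   /* output vector for match offsets */
                                  OVECMAX);  /* number of ints in offsets */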
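
Earlier I mentioned that you could send LuaTeX the string that PCRE matched. When pcre_exec(...) reports a match, offsets[0] and offsets[1] hold the byte offsets of the start and end of the whole match within s. Here is a minimal sketch of copying that match out; the helper name copy_match is purely hypothetical (it is not part of PCRE) and the code has only the bare minimum of error checking.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: copy the overall match out of
 * subject using the first pair of entries in the offsets vector filled in
 * by pcre_exec(). Returns a malloc'd, NUL-terminated string that the
 * caller must free, or NULL if the pair is unset or allocation fails. */
char *copy_match(const char *subject, const int *offsets)
{
    int start = offsets[0];   /* byte offset of the first matched byte */
    int end   = offsets[1];   /* byte offset just past the last matched byte */
    size_t n;
    char *out;

    if (start < 0 || end < start)
        return NULL;

    n = (size_t) (end - start);
    out = malloc(n + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, subject + start, n);
    out[n] = '\0';
    return out;
}
```

Because the offsets are byte offsets, the copy is safe for UTF-8 subjects: a match found with PCRE_UTF8 starts and ends on character boundaries. The result could then be pushed onto the Lua stack with lua_pushstring(...) before being freed.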
    

Summary

PCRE is a marvellous and powerful C library – with copious documentation that you’ll need to read very carefully! The ability to provide LuaTeX with a UTF-8-enabled regex engine could open the way to some useful applications, particularly when combined with LuaTeX’s own callback mechanism. In particular, the process_input_buffer callback allows you to change the contents of the line input buffer just before LuaTeX actually starts looking at it. The mind boggles at the possibilities!

Typesetting Arabic with LuaTeX: Part 2 (documentation, tools and libraries)

Introduction

I’ve been thinking about the next article in this series and what it should address, and I’ve decided to skip ahead and give a summary of the documentation, tools and libraries which made it possible for me to experiment with typesetting Arabic. I’m listing these because it actually took a long time to assemble the reading materials and tools required, so it may just save somebody, somewhere, the many hours I spent hunting it all down. For sure, there’s a ton of stuff I want to write about, in an attempt to piece together the various concepts and ideas involved in gaining a better understanding of Unicode, OpenType and Arabic text typesetting/display. However, I’m soon to start a new job, which means I’ll have less time to devote to this blog so I’ll try to post as much as I can over the next couple of weeks.

Just for completeness, I should say that, for sure, you can implement Arabic layout/typesetting for LuaTeX in pure Lua code, as the ConTeXt distribution has done, through the quite incredible work of Idris Hamid and Hans Hagen.

Documentation

There is a lot to read. Here are some resources that are either essential or helpful.

Unicode

Clearly, you’ll need to read relevant parts of the Unicode Standard. Here’s my suggested minimal reading list.

  • Chapter 8: Middle Eastern Scripts. This gives an extremely useful description of cursive joining and a model for implementing contextual analysis.
  • Unicode ranges for Arabic (see also these posts). You’ll need the Unicode code charts for Arabic (PDFs downloadable and listed under Middle Eastern Scripts, here)
  • Unicode Bidirectional Algorithm. Can’t say that I’ve really read this properly, and certainly not yet implemented anything to handle mixed runs of text, but you certainly need it.

OpenType

Whether you are interested in eBooks, conventional typesetting or the WOFF standard, these days a working knowledge of OpenType font technology is very useful. If you want to explore typesetting Arabic then it’s simply essential.

C libraries

It’s always a good idea to leverage the work of true experts, especially if it is provided for free as an open source library! I spent a lot of time hunting for libraries, so here is my summary of what I found and what I eventually settled on using.

  • IBM’s ICU: Initially, I looked at using IBM’s International Components for Unicode but, for my requirements, it was serious overkill. It is a truly vast and powerful open source library (for C/C++ and Java) if you need the wealth of features it provides.
  • HarfBuzz: This is an interesting and ongoing development. The HarfBuzz OpenType text shaping engine looks like it will become extremely useful; although I had a mixed experience trying to build it on Windows, which is almost certainly due to my limitations, not those of the library. If you’re Linux-based then no doubt it’ll be fine for you. As it matures to a stable release I’ll definitely take another look.
  • GNU FriBidi: As mentioned above, the Unicode Bidirectional Algorithm is essential for a full implementation of displaying (eBooks, browsers etc) or typesetting mixed left-to-right and right-to-left scripts. Fortunately, there’s a free and standard implementation of this available as a C library: GNU FriBidi. I’ve not yet reached the point of being able to use it but it’s the one I’ll choose.

My libraries of choice

Eventually, I settled on FreeType and libotf. You need to use them together because libotf depends on FreeType. Both libraries are mature and easy to use and I simply cannot praise these libraries too highly. Clearly, this is my own personal bias and preference but ease of use rates extremely highly on my list of requirements. FreeType has superb documentation whereas libotf does not, although it has some detailed comments within the main #include file. I’ll definitely post a short “getting started with libotf” because it is not difficult to use (when you’ve worked it out!).

libotf: words are not enough!

I’m mindful that I’ve not yet explained how all these libraries work together, or what they do, but I just have to say that libotf is utterly superb. libotf provides a set of functions which “drive” the features and lookups contained in an OpenType font, allowing you to pass in a Unicode string and apply OpenType tables to generate the corresponding sequence of glyphs which you can subsequently render. Of course, for Arabic you also need to perform contextual analysis to select the appropriate joining forms but once that is done then libotf lets you take full advantage of any advanced typesetting features present in the font.
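
To give a flavour of what “contextual analysis” means, here is a heavily simplified sketch of choosing isolated, initial, medial or final forms from the joining behaviour of neighbouring letters. It is not the code from my DLL and not a full implementation of the model in the Unicode Standard: the names (JoinType, JoinForm, joining_type, joining_forms) are made up for this example, only the basic letters U+0621 to U+064A are classified, and transparent characters such as vowel marks are not skipped over as they should be.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative names only; none of this comes from libotf or FreeType. */
typedef enum { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL } JoinType;
typedef enum { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL } JoinForm;

/* Classify a code point. Right-joining letters (alef, dal, reh, waw, ...)
 * join only to the letter before them; dual-joining letters join on both
 * sides. Anything outside the basic letters is treated as non-joining,
 * which is a simplification. */
static JoinType joining_type(uint32_t cp)
{
    switch (cp) {
    case 0x0622: case 0x0623: case 0x0624: case 0x0625:
    case 0x0627: case 0x0629: case 0x062F: case 0x0630:
    case 0x0631: case 0x0632: case 0x0648:
        return JOIN_RIGHT;
    default:
        break;
    }
    /* U+0640 (tatweel) is join-causing; it is treated like a dual-joining
     * letter here. U+0621 (hamza) falls through to non-joining. */
    if ((cp >= 0x0626 && cp <= 0x063A) || (cp >= 0x0640 && cp <= 0x064A))
        return JOIN_DUAL;
    return JOIN_NONE;
}

/* Fill forms[i] with the joining form of text[i], where text holds n code
 * points in logical (reading) order. */
void joining_forms(const uint32_t *text, size_t n, JoinForm *forms)
{
    size_t i;
    for (i = 0; i < n; i++) {
        JoinType t = joining_type(text[i]);
        /* joins to the previous letter if that letter can join forwards */
        int joins_prev = t != JOIN_NONE && i > 0 &&
                         joining_type(text[i - 1]) == JOIN_DUAL;
        /* joins to the next letter if this one can join forwards */
        int joins_next = t == JOIN_DUAL && i + 1 < n &&
                         joining_type(text[i + 1]) != JOIN_NONE;

        if (joins_prev && joins_next)
            forms[i] = FORM_MEDIAL;
        else if (joins_prev)
            forms[i] = FORM_FINAL;
        else if (joins_next)
            forms[i] = FORM_INITIAL;
        else
            forms[i] = FORM_ISOLATED;
    }
}
```

For the three letters kaf, teh, beh (U+0643, U+062A, U+0628) this yields initial, medial, final. For beh, alef, beh (U+0628, U+0627, U+0628) it yields initial, final, isolated, because alef does not join to the letter that follows it.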

UTF-8 encoding/decoding

To pass Unicode strings between your C code and LuaTeX you’ll be using UTF-8 so you will need to encode and decode UTF-8 from within your C. Encoding is easy and has been covered elsewhere on this site. For decoding UTF-8 into codepoints I use The Flexible and Economical UTF-8 Decoder.
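
For reference, encoding a single codepoint as UTF-8 can be sketched as follows. This is a minimal illustration rather than the exact code used elsewhere on this site, and the function name utf8_encode is simply one chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then 3 x 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

For example, the Arabic letter beh (U+0628) encodes to the two bytes 0xD8 0xA8. The function does not NUL-terminate its output, so the caller appends the returned number of bytes to whatever buffer is being built.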

Desktop software

In piecing together my current understanding of Unicode and OpenType I found the following software to be indispensable. Some of these are Windows-only applications.

  • VOLT: Microsoft’s excellent and free VOLT (Visual OpenType Layout Tool). I’ll certainly try to write an introduction to VOLT but you can also download the free VOLT Training Video.
  • Font editors: Fontlab Studio 5 (commercial) or FontForge (free).
  • Adobe FDK: The Adobe Font Development Kit contains some excellent utilities and I highly recommend it.
  • Character browser: To assist with learning/exploring Unicode I used the Unibook character browser.
  • BabelPad: Absolutely superb Windows-based Unicode text editor. Packed with features that can assist with understanding Unicode and the rendering of complex scripts. For example, you can toggle complex rendering so that you can edit Arabic text without any Uniscribe shaping being applied.
  • BabelMap: Unicode Character Map for Windows is another great tool from the author of BabelPad.
  • High quality Arabic fonts. By “high quality” I don’t just mean the design and hinting but also the number of OpenType features implemented or contained in the font itself, such as cursive positioning, ligatures, vowel placement (mark to base, mark to ligature, mark to mark etc). My personal favourite is Arabic Typesetting (shipped with Windows) but SIL International also provide free Arabic fonts, including one called Scheherazade.

TIP: Microsoft VOLT and the Arabic Typesetting or Scheherazade fonts. I’ll talk about VOLT in more detail later but Microsoft and SIL provide “VOLT versions” of their respective Arabic fonts. These are absolutely invaluable resources for understanding advanced OpenType concepts and if you are interested to learn more I strongly recommend taking a look at them.

  • The VOLT version of the Arabic Typesetting font is shipped with the VOLT installer and is contained within a file called “VoltSupplementalFiles.exe”, so just run that to extract the VOLT version.
  • The VOLT version of Scheherazade is made available as a download from SIL.

I can only offer my humble thanks to the people who created these resources and made them available for free: a truly substantial amount of work is involved in creating them.

LuaCOM: connecting LuaTeX to Windows automation

Introduction

The Windows operating system provides a technology called COM, which stands for Component Object Model. In essence, it provides a way for software components and applications to “talk to each other”. That’s a gross oversimplification but it gives the general idea. It’s now an old technology but nevertheless it is still very powerful; over the years I’ve used it quite extensively for automating various publishing/production tasks. In those days it was with Perl using a module called Win32::OLE.

Of course, applications have to be written to support COM so you can think of COM-aware applications as offering a “set of services” that you can call — many applications provide the ability to call those services from scripting languages which have support for COM (via modules/plugins etc), such as Perl, Ruby and, of course, Lua via LuaCOM. A combination of COM-aware applications and scripting languages with COM support provides a very flexible way to “glue together” all sorts of different applications to create novel automated workflows/processes.

Using COM from within scripting languages is fairly straightforward but under the surface COM is, to me anyway, a complex beast indeed. The best low-level COM programming tutorials I have ever read are published on codeproject.com, written by Michael Dunn. Here’s one such tutorial: Introduction to COM – What It Is and How to Use It.

LuaCOM

LuaCOM lets you use COM in your Lua scripts, i.e., it is a binding to COM. I don’t know if there are freely available builds of the latest version (probably with Windows distributions of Lua), but you can download and compile the latest version from GitHub.

LuaCOM is a DLL (Dynamic Link Library) that you load using the standard “require” feature of Lua. For example, to start Microsoft Word from within your Lua code, using LuaCOM, you would do something like this:

com = require("luacom")
-- use CreateObject (not GetObject) to start a new instance of Word
Word = com.CreateObject("Word.Application")
Word.Visible = 1
doc = Word.Documents:Open("g:/x.docx")

Naturally, the Microsoft Office applications have very extensive support for COM and offer a huge number of functions that you can call should you wish to automate a workflow process via COM from within Lua. For example, you can access all the native equation objects within a Word document (read, write, create and convert equations…). If you have watched this video and wondered how I got LuaTeX and Word to talk to each other, now you know: LuaCOM provided the glue.

Typesetting Arabic with LuaTeX [via a C plug-in] (Part 1)

Introduction

In this new series of posts I’m going to attempt an overview of the topics, concepts, ideas and technologies involved in typesetting Arabic with LuaTeX, via a DLL I’m writing in C. Actually, the C code is very substantially platform-independent so it should compile on non-Windows machines… one day, when it’s “finished”…

Up until 2 years ago I was teaching myself Arabic (see my Amazon book reviews) and had reached the point where I wanted to write up my notes and worked exercises: I needed to typeset Arabic and wanted to use a TeX-based solution. Having looked around I stumbled upon some truly amazing video presentations of Arabic typesetting work being undertaken by Idris Hamid and Hans Hagen, using a tool called LuaTeX: something I’d never heard of. I was truly stunned by what I saw, the quality of their Arabic typesetting was (is) incredible, so I had to find out more. A few hours later I’d worked out that the typesetting was being achieved through Hans Hagen’s ConTeXt package, with LuaTeX as the underlying TeX engine. However, I’m personally not a user of ConTeXt, but the LuaTeX engine was just so interesting that I had to explore it. Well, two years later and I’ve not done any further learning of Arabic, having replaced that activity with plenty of explorations into LuaTeX and a host of other technologies, particularly OpenType and Unicode.

Coming up to the present day, I’ve finally reached the point where I have puzzled out enough detail of the “big picture” to attempt a home-grown Arabic typesetting solution for LuaTeX, but one where most of the “heavy lifting” is done in C, with Lua code to interface with and talk to LuaTeX. For sure, there are ready-made options such as XeTeX or the range of Arabic typesetting solutions created by the TeX community. However, my interest is creating a solution that will just as easily output SVG or other non-PDF formats, plus allow the automated production of new and novel “typeset structures” and diagrams that will really help with learning Arabic: things I wish had been present in the many books I have bought and studied but which may just be too time-consuming, or difficult/expensive, to produce by “conventional” applications. These are big goals, but definitely achievable, albeit over a year or two of further work.

Sample

Just by way of an early example, see the following PDF, as usual, through the Google Docs viewer or download PDF here. The trained eye will certainly spot a few issues that need fixing but so far it’s not looking too bad :-). But there is a long, long way to go yet. The font used is Microsoft’s “Arabic Typesetting” because it contains a substantial number of OpenType features including cursive positioning, mark-to-base positioning, an enormous range of ligatures plus many other features which make it an ideal choice of font to work with (in my opinion). In the example (the made-up words) you can see the non-horizontal baseline achieved with cursive positioning plus the ability to control vowel placement with great flexibility.

But it’s still far from perfect, I’ll readily admit. I hope I can finish this work, and find the time to complete these articles. I’ll certainly try!

Extending LuaTeX on Windows with plugins (DLLs)

About 6 months ago I came across an article and presentation by Luigi Scarso called “LuaTEX lunatic”, with a subtitle And Now for Something Completely Different. And different it was because, for me, it opened my eyes to some of the real power of LuaTeX: extending it via C/C++ libraries. Luigi’s truly excellent paper is Linux-centric but the general ideas hold true for any platform, including Windows.

The power of Lua’s require(...) function

The Lua language provides a function called require(...) which allows you to load and run libraries, which can be written in pure Lua or in C using the Lua C API. Refer to the Libraries And Bindings page on lua-users.org for more details.

Using require(...) with LuaTeX: a primer

Once again, the secret ingredient is the LuaTeX command \directlua{...} which, as discussed in previous posts, lets you run Lua code from within documents you process with LuaTeX. Suppose you have a DLL which you, or someone else, have written with a Lua binding and you want to use it with LuaTeX. How do you do it?

Firstly, within texmf.cnf you need to define a variable called CLUAINPUTS, which tells Kpathsea where to search for files with extension .dll and .so (shared object file, on Linux). For example, in my hand-rolled texmf.cnf the setting is

CLUAINPUTS=$TEXMF/dlls

The LuaTeX Reference Manual notes the default setting of

CLUAINPUTS=.:$SELFAUTOLOC/lib/{$progname,$engine,}/lua//

World’s most pointless DLL code?

Just for completeness, and by way of an ultra-minimal example, here is probably the world’s most pointless C code for a DLL that you can call from LuaTeX. To compile this you will, of course, need to ensure that you link to the Lua libraries (note that I use Microsoft’s Visual Studio for this).


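#include <windows.h>
#include <stdio.h>      /* printf */
#include "lua.h"
#include "lauxlib.h"

/* export the luaopen_ function from the DLL */
#define LUA_LIB   int __declspec(dllexport)

/* greetings(): prints a message to stdout and returns nothing to Lua */
static int helloluatex_greetings(lua_State *L){
	(void) L;   /* unused */
	printf("Hello to LuaTeX from the world's smallest DLL!");
	return 0;
}

/* functions exported by the module, terminated by a {NULL, NULL} sentinel */
static const luaL_reg helloluatex[] = {
	{"greetings", helloluatex_greetings},
	{NULL, NULL}
};

/* called by require("helloluatex"); registers the module table (Lua 5.1 API) */
LUA_LIB luaopen_helloluatex (lua_State *L) {
	luaL_register(L, "helloluatex", helloluatex);
	return 1;   /* the module table is left on the stack */
}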

You need to compile the above C code into a DLL called helloluatex.dll and copy it to the directory or path pointed to by CLUAINPUTS.

LuaTeX code to use our new DLL

Here is a minimal (LaTeX) file to load helloluatex.dll and call the greetings function we defined via the Lua C API. We'll call the file dlltest.tex.

\documentclass[11pt,twoside]{article}
\begin{document}
\pagestyle{empty}
\directlua{
	require("helloluatex")
	helloluatex.greetings()
}
\end{document}

Running this as luatex --fmt=lualatex dlltest.tex gives the output

This is LuaTeX, Version beta-0.65.0-2010122301
(c:/.../dlltest.tex
LaTeX2e <2009/09/24>
(c:/.../formats/pdflatex/base/article.cls
Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
(c:/.../formats/pdflatex/base/size11.clo))
No file dlltest.aux.
Hello to LuaTeX from the world's smallest DLL!(./dlltest.aux) )
 262 words of node memory still in use:
   2 hlist, 1 vlist, 1 rule, 2 glue, 39 glue_spec, 2 write nodes
   avail lists: 2:12,3:1,6:3,7:1,9:1
No pages of output.
Transcript written on dlltest.log.

Note that you see Hello to LuaTeX from the world's smallest DLL! printed out to the DOS window.

This is, of course, a rather simple example so I'll try to provide more useful examples over the coming weeks and months. I have integrated a number of libraries into LuaTeX, including FreeType, Ghostscript and many others, so I'll try to cover some of these wonderful C libraries as time permits. Stay tuned!