STM publishing: tools, technologies and change A WordPress site for STM Publishing

23Nov/12Off

Adding a UTF-8-capable regular expression library to LuaTeX

Posted by Graham Douglas

Introduction

In this post I'm going to sketch out adding the free PCRE C library to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I've not tried this in a production environment. So, do please undertake all necessary testing and due diligence in your own code!

PCRE: Perl Compatible Regular Expressions

PCRE is a mature C library which provides a very powerful regular expression engine. It is also capable of working with UTF-8 encoded strings, which is, of course, very useful because LuaTeX uses UTF-8 input. I'm not going to cover the entire PCRE build process in this post because, frankly, it'll take too long. But in outline...

Building PCRE as a static library (.lib)

  1. I used CMake to create a Visual Studio 2008 project via the PCRE-supplied CMakeLists.txt file. Using the CMake tool you can set the appropriate compile-time flags for UTF-8 support: PCRE_SUPPORT_UTF and PCRE_SUPPORT_UNICODE_PROPERTIES. The latter is very useful for searching UTF-8 strings based on their Unicode character properties. Full details are in the PCRE documentation.
  2. After you finish configuring the PCRE build, and have selected your build environment, press Generate and CMake will output a complete Visual Studio project that you can open and start working on. Wonderful!
  3. Getting PCRE to build as a static library was fine, but I did have a few hassles getting the library to link correctly against the DLL I was building. It took me a bit of time to figure out which additional PCRE preprocessor directives I needed to set in the DLL C code to ensure everything was #define'd properly.

Building a DLL for LuaTeX

I wrote a very brief overview of building DLLs for LuaTeX in this post so I won't repeat the details here. Instead, I'll give a summary indicating how you can get PCRE to call LuaTeX. One word of advice: PCRE comes with a lot of documentation and you'll need to read through it very carefully! Asking PCRE to call LuaTeX sounds strange, but you can indeed do it because PCRE lets you register a callback function that it will call each time it matches a string. Perl has a similar ability to execute Perl code on matching a string. From the PCRE documentation:

"PCRE provides a feature called 'callout', which is a means of temporarily passing control to the caller of PCRE in the middle of pattern matching. The caller of PCRE provides an external function by putting its entry point in the global variable pcre_callout."

Calling LuaTeX

OK, so how do we do that? There are two parts to this story: create a Lua function you want to call from C and create the C function which calls the Lua function.

  1. From within LuaTeX, use \directlua{...} to create a simple Lua function printy that we are going to call from PCRE. This Lua function takes a string and sends it to LuaTeX via tex.print(). In these examples I sent LuaTeX a simple text string "Yo! I was called!", which LuaTeX then typeset. Of course, you could also send LuaTeX the string that was matched by PCRE!
           \directlua{
                  function printy (str)
                         tex.print(str)
                  end
           }
    
  2. The next part is to create the C code to call a Lua function. This C function is the callout that PCRE will call when it matches a string.
           int mycallout(pcre_callout_block *cb){
               lua_State *L;
               /* callout_data carries whatever pointer we handed to PCRE */
               L = cb->callout_data;
               if (L){
                   lua_getglobal(cb->callout_data, "printy");
                   if (!lua_isfunction(L, -1)) {
                       lua_pop(L, 1);
                       return 0;
                   }

                   lua_pushstring(L, "Yo! I was called!");   /* push 1st argument */
                   /* Now make the call to printy with 1 argument and 0 results */
                   if (lua_pcall(L, 1, 0, 0) != 0) {
                       /* report your error, then pop the error message
                          that lua_pcall left on the stack */
                       lua_pop(L, 1);
                       return 0;
                   }
               }
               return 0;
           }
    

    A few points here are worth noting.

    • From the PCRE documentation:

      "The external callout function returns an integer to PCRE. If the value is zero, matching proceeds as normal. If the value is greater than zero, matching fails at the current point, but the testing of other matching possibilities goes ahead, just as if a lookahead assertion had failed. If the value is less than zero, the match is abandoned, the matching function returns the negative value"

    • The lua_State variable, *L, is passed in via a mechanism I'll outline below.
    • The line lua_getglobal(cb->callout_data, "printy") does the main work of pushing the value of the global variable printy onto Lua's stack. Of course, in effect this is a pointer to the function we defined in LuaTeX, and which we call through lua_pcall(...). Further details are in the Lua documentation.
    • The above code does near-zero error checking; it is purely to demonstrate the ideas!

Other PCRE bits and pieces

There are a few other points to consider, namely how do you set up the callout and how do you pass lua_State *L to the callout? I'm not going to explain in great detail how all these parts hang together in a full application, but simply point out some key pieces.

  1. You have to set the PCRE global variable pcre_callout, a function pointer, to your callout function. Simply, pcre_callout = mycallout; Yes, it does work.
  2. Before you can start searching, you need to "compile" your regular expression pattern. In the code below, re represents our compiled regular expression pattern. Note that you must use the PCRE_UTF8 option if you are searching UTF-8 encoded text.
                  re = pcre_compile(pattern,
                                    PCRE_UTF8|PCRE_UCP,
                                    &err_msg,
                                    &err,
                                    NULL);
    
  3. Note, to use PCRE callouts you need to use the appropriate syntax in your regular expression; from the PCRE documentation, "Within a regular expression, (?C) indicates the points at which the external function is to be called." Once you have compiled your search pattern, and done your error checking, you need to run the search engine using the compiled pattern and your target string (s) in the code below.
  4. The next step is to create a pointer to something called a pcre_extra block, which is a struct. This struct has a field called callout_data, which is a pointer in which you can store whatever you want to pass into the mycallout function: here, I'm setting it to the lua_State variable, L. PCRE hands that pointer on to the callout in the callout_data field of the pcre_callout_block, so each time PCRE matches a string and calls the callout function, the lua_State variable, L, will be available for our use! Clearly, you'll need to do this from within the appropriate function you call from LuaTeX. Once this is done you are ready to begin your searching using pcre_exec(...).

                  pcre_extra *p;
                  p = (pcre_extra*) malloc(sizeof(pcre_extra));
                  memset(p, 0, sizeof(pcre_extra));
                  p->callout_data = L;
                  p->flags = PCRE_EXTRA_CALLOUT_DATA;
                  res = pcre_exec(re,
                                  p,
                                  s,
                                  len,
                                  0,
                                  0,
                                  offsets,
                                  OVECMAX);
    

Summary

PCRE is a marvellous and powerful C library, with copious documentation that you'll need to read very carefully! The ability to provide LuaTeX with a UTF-8-enabled regex engine could open the way to some useful applications, particularly when combined with LuaTeX's own callback mechanism: in particular, the process_input_buffer callback, which allows you to change the contents of the line input buffer just before LuaTeX actually starts looking at it. The mind boggles at the possibilities!

11Feb/12Off

Typesetting Arabic with LuaTeX: Part 2 (documentation, tools and libraries)

Posted by Graham Douglas

Introduction

I've been thinking about the next article in this series and what it should address, so I've decided to skip ahead and give a summary of the documentation, tools and libraries which made it possible for me to experiment with typesetting Arabic. I'm listing these because it actually took a long time to assemble the reading materials and tools required, so it may just save somebody, somewhere, the many hours I spent hunting it all down. For sure, there's a ton of stuff I want to write about, in an attempt to piece together the various concepts and ideas involved in gaining a better understanding of Unicode, OpenType and Arabic text typesetting/display. However, I'm soon to start a new job, which means I'll have less time to devote to this blog so I'll try to post as much as I can over the next couple of weeks.

Just for completeness, I should say that, for sure, you can implement Arabic layout/typesetting for LuaTeX in pure Lua code, as the ConTeXt distribution has done, through the quite incredible work of Idris Hamid and Hans Hagen.

Documentation

There is a lot to read. Here are some resources that are either essential or helpful.

Unicode

Clearly, you'll need to read relevant parts of the Unicode Standard. Here's my suggested minimal reading list.

  • Chapter 8: Middle Eastern Scripts. This gives an extremely useful description of cursive joining and a model for implementing contextual analysis.
  • Unicode ranges for Arabic (see also these posts). You'll need the Unicode code charts for Arabic (PDFs downloadable and listed under Middle Eastern Scripts, here).
  • Unicode Bidirectional Algorithm. Can't say that I've really read this properly, and certainly not yet implemented anything to handle mixed runs of text, but you certainly need it.

OpenType

Whether you are interested in eBooks, conventional typesetting or the WOFF standard, these days a working knowledge of OpenType font technology is very useful. If you want to explore typesetting Arabic then it's simply essential.

C libraries

It's always a good idea to leverage the work of true experts, especially if it is provided for free as an open source library! I spent a lot of time hunting for libraries, so here is my summary of what I found and what I eventually settled on using.

  • IBM's ICU: Initially, I looked at using IBM's International Components for Unicode but, for my requirements, it was serious overkill. It is a truly vast and powerful open source library (for C/C++ and Java) if you need the wealth of features it provides.
  • HarfBuzz: This is an interesting and ongoing development. The HarfBuzz OpenType text shaping engine looks like it will become extremely useful; although I had a mixed experience trying to build it on Windows, which is almost certainly due to my limitations, not those of the library. If you're Linux-based then no doubt it'll be fine for you. As it matures to a stable release I'll definitely take another look.
  • GNU FriBidi: As mentioned above, the Unicode Bidirectional Algorithm is essential for a full implementation of displaying (eBooks, browsers etc) or typesetting mixed left-to-right and right-to-left scripts. Fortunately, there's a free and standard implementation of this available as a C library: GNU FriBidi. I've not yet reached the point of being able to use it but it's the one I'll choose.

My libraries of choice

Eventually, I settled on FreeType and libotf. You need to use them together because libotf depends on FreeType. Both libraries are mature and easy to use and I simply cannot praise these libraries too highly. Clearly, this is my own personal bias and preference but ease of use rates extremely highly on my list of requirements. FreeType has superb documentation whereas libotf does not, although it has some detailed comments within the main #include file. I'll definitely post a short "getting started with libotf" because it is not difficult to use (when you've worked it out!).

libotf: words are not enough!

I'm mindful that I've not yet explained how all these libraries work together, or what they do, but I just have to say that libotf is utterly superb. libotf provides a set of functions which "drive" the features and lookups contained in an OpenType font, allowing you to pass in a Unicode string and apply OpenType tables to generate the corresponding sequence of glyphs which you can subsequently render. Of course, for Arabic you also need to perform contextual analysis to select the appropriate joining forms, but once that is done libotf lets you take full advantage of any advanced typesetting features present in the font.
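To give a rough feel for what "contextual analysis" means here, the following is a minimal sketch in C of choosing a joining form for each letter from the joining behaviour of its neighbours. It is an illustration under heavy assumptions, not the method used in these posts and not anything provided by libotf: the names JoinType, JoinForm, join_type and select_forms are hypothetical, the lookup covers only a handful of letters, and it ignores combining marks, ZWJ/ZWNJ and ligatures, all of which a real implementation based on the Unicode joining data would have to handle.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for this sketch only. */
typedef enum { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL } JoinType;
typedef enum { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL } JoinForm;

/* Joining type for a few Arabic letters; everything else is treated as
   non-joining. A real implementation would use the full Unicode data. */
static JoinType join_type(uint32_t cp)
{
    switch (cp) {
    case 0x0627: /* ALEF */
    case 0x062F: /* DAL  */
    case 0x0631: /* REH  */
    case 0x0648: /* WAW  */
        return JOIN_RIGHT;   /* joins only to the preceding letter */
    case 0x0628: /* BEH  */
    case 0x062A: /* TEH  */
    case 0x0633: /* SEEN */
    case 0x0644: /* LAM  */
    case 0x0645: /* MEEM */
        return JOIN_DUAL;    /* joins to both neighbours */
    default:
        return JOIN_NONE;
    }
}

/* Fill forms[i] with the joining form of text[i] (logical order). */
void select_forms(const uint32_t *text, size_t len, JoinForm *forms)
{
    for (size_t i = 0; i < len; i++) {
        JoinType cur = join_type(text[i]);

        /* Joins backwards if this letter can join at all and the
           previous letter is able to join forwards (dual-joining). */
        int joins_prev = i > 0 && cur != JOIN_NONE &&
                         join_type(text[i - 1]) == JOIN_DUAL;

        /* Joins forwards if this letter is dual-joining and the next
           letter can join backwards (right- or dual-joining). */
        int joins_next = i + 1 < len && cur == JOIN_DUAL &&
                         join_type(text[i + 1]) != JOIN_NONE;

        if (joins_prev && joins_next)
            forms[i] = FORM_MEDIAL;
        else if (joins_prev)
            forms[i] = FORM_FINAL;
        else if (joins_next)
            forms[i] = FORM_INITIAL;
        else
            forms[i] = FORM_ISOLATED;
    }
}
```

For the four letters SEEN, LAM, ALEF, MEEM this sketch selects initial, medial, final and isolated forms respectively; the form names would then be mapped to the matching OpenType features (init, medi, fina, isol) before the font's lookups are applied.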

UTF-8 encoding/decoding

To pass Unicode strings between your C code and LuaTeX you'll be using UTF-8, so you will need to encode and decode UTF-8 from within your C. Encoding is easy and has been covered elsewhere on this site. For decoding UTF-8 into codepoints I use the Flexible and Economical UTF-8 Decoder.
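As a reminder of what the encoding half involves, here is a minimal sketch in C. It is not the code published elsewhere on this site; the function name utf8_encode and its interface are assumptions made purely for illustration. It follows the standard UTF-8 bit layout and rejects surrogates and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 (hypothetical helper).
   Writes up to 4 bytes into out and returns the number of bytes
   written, or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not valid */
        return 0;
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

For example, the Arabic letter BEH (U+0628) encodes to the two bytes 0xD8 0xA8, which is the form in which it would be passed to LuaTeX as part of a string.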

Desktop software

In piecing together my current understanding of Unicode and OpenType I found the following software to be indispensable. Some of these are Windows-only applications.

TIP: Microsoft VOLT and the Arabic Typesetting or Scheherazade fonts. I'll talk about VOLT in more detail later but Microsoft and SIL provide "VOLT versions" of their respective Arabic fonts. These are absolutely invaluable resources for understanding advanced OpenType concepts and if you are interested to learn more I strongly recommend taking a look at them.

  • The VOLT version of the Arabic Typesetting font is shipped with the VOLT installer and is contained within a file called "VoltSupplementalFiles.exe", so just run that to extract the VOLT version.
  • The VOLT version of Scheherazade is made available as a download from SIL.

I can only offer my humble thanks to the people who created these resources and made them available for free: a truly substantial amount of work is involved in creating them.

9Feb/120

LuaCOM: connecting LuaTeX to Windows automation

Posted by Graham Douglas

Introduction

The Windows operating system provides a technology called COM, which stands for Component Object Model. In essence, it provides a way for software components and applications to "talk to each other". That's a gross oversimplification but it gives the general idea. It's now an old technology but nevertheless it is still very powerful; over the years I've used it quite extensively for automating various publishing/production tasks. In those days it was with Perl using a module called Win32::OLE.

Of course, applications have to be written to support COM so you can think of COM-aware applications as offering a "set of services" that you can call --- many applications provide the ability to call those services from scripting languages which have support for COM (via modules/plugins etc), such as Perl, Ruby and, of course, Lua via LuaCOM. A combination of COM-aware applications and scripting languages with COM support provides a very flexible way to "glue together" all sorts of different applications to create novel automated workflows/processes.

Using COM from within scripting languages is fairly straightforward but under the surface COM is, to me anyway, a complex beast indeed. The best low-level COM programming tutorials I have ever read are published on codeproject.com, written by Michael Dunn. Here's one such tutorial: Introduction to COM - What It Is and How to Use It.

LuaCOM

LuaCOM lets you use COM in your Lua scripts, i.e., it is a binding to COM. I don't know if there are freely available builds of the latest version (probably with Windows distributions of Lua), but you can download and compile the latest version from GitHub.

LuaCOM is a DLL (Dynamic Link Library) that you load using the standard "require" feature of Lua. For example, to start Microsoft Word from within your Lua code, using LuaCOM, you would do something like this:

com = require("luacom")
-- should be CreateObject not GetObject!
Word = com.CreateObject("Word.Application")
Word.Visible = 1
doc = Word.Documents:Open("g:/x.docx")

Naturally, the Microsoft Office applications have very extensive support for COM and offer a huge number of functions that you can call should you wish to automate a workflow process via COM from within Lua. For example, you can access all the native equation objects within a Word document (read, write, create and convert equations...). If you have watched this video and wondered how I got LuaTeX and Word to talk to each other, now you know: LuaCOM provided the glue.