Adding a UTF-8-capable regular expression library to LuaTeX

Introduction

In this post I’m going to sketch out adding the free PCRE C library to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I’ve not tried this in a production environment. So, do please undertake all necessary testing and due diligence in your own code!

PCRE: Perl Compatible Regular Expressions

PCRE is a mature C library which provides a very powerful regular expression engine. It is also capable of working with UTF-8 encoded strings, which is, of course, very useful because LuaTeX uses UTF-8 input. I’m not going to cover the entire PCRE build process in this post because, frankly, it’ll take too long. But in outline…

Building PCRE as a static library (.lib)

  1. I used CMake to create a Visual Studio 2008 project via the PCRE-supplied CMakeLists.txt file. Using the CMake tool you can set the appropriate compile-time flags for UFT-8 support: PCRE_SUPPORT_UTF and PCRE_SUPPORT_UNICODE_PROPERTIES. The latter is very useful for seaching UTF-8 strings based on their Unicode character properties. Full details are in the PCRE documentation.
  2. After you finish configuring the PCRE build, and have selected your build environment, press Generate and CMake will output a complete Visual Studio project that you can open and start working on. Wonderful!
  3. However, getting PCRE to build as a static library was fine but I did have a few hassles getting the library to correctly link against the DLL I was building. It took me a bit of time to figure out which additional PCRE preprocessor directives I needed to set in the DLL C code to ensure everything was #define‘d properly.

Building a DLL for LuaTeX

I wrote a very brief overview of building DLLs for LuaTeX in this post so I won’t repeat the details here. Instead, I’ll give a summary indicating how you can get PCRE to call LuaTeX. One word of advice, PCRE comes with a lot of documentation and you’ll need to read through it very carefully! Asking PCRE to call LuaTeX sounds strange but indeed you can do it because PCRE provides the ability to register a callback function it will call each time it matches a string. Perl has a similar ability to execute Perl code on matching a string. From the PCRE documentation:

“PCRE provides a feature called ‘callout’, which is a means of temporarily passing control to the caller of PCRE in the middle of pattern matching. The caller of PCRE provides an external function by putting its entry point in the global variable pcre_callout.”

Calling LuaTeX

OK, so how do we do that? There are two parts to this story: create a Lua function you want to call from C and create the C function which calls the Lua function.

  1. From within LuaTeX, use \directlua{...} to create a simple Lua function printy that we are going to call from PCRE. This Lua function takes a string and sends it to LuaTeX via tex.print(). In these examples I sent LuaTeX a simple text string "Yo! I was called!", which LuaTeX then typeset. Of course, you could also send LuaTeX the string that was matched by PCRE!
           \directlua{
                  function printy (str)
                  tex.print(str)
                  end
           }
    
  2. The next part is to create the C code to call a Lua function. This C function is the callout that PCRE will call when it matches a string.
           int mycallout(pcre_callout_block *cb){
           lua_State *L;
           L = cb->callout_data;
           if (L){
                  lua_getglobal(cb->callout_data, "printy");
                  if(!lua_isfunction(L,-1)) {
                         lua_pop(L,1);
                         return 0;
                   }
    
                  lua_pushstring(L, "Yo! I was called!");   /* push 1st argument */
                  /* Now make the call to printy with 1 argument and 0 results*/
                  if (lua_pcall(L, 1, 0, 0) != 0) {
                  // report your error 
                   return 0;
                  }
        }
        return 0;
    }
    

    A few points here are worth noting.

    • From the PCRE documentation:

      “The external callout function returns an integer to PCRE. If the value is zero, matching proceeds as normal. If the value is greater than zero, matching fails at the current point, but the testing of other matching possibilities goes ahead, just as if a lookahead assertion had failed. If the value is less than zero, the match is abandoned, the matching function returns the negative value”

    • The lua_State variable, *L, is passed in via a mechanism I’ll outline below.
    • The line lua_getglobal(cb->callout_data, "printy") does the main work of pushing the value of the gloabal variable printy onto Lua’s stack. Of course, in effect this is a pointer to the function we defined in LuaTeX, and which we call through lua_pcall(...). Further details in the Lua documentation.
    • The above code does near-zero error checking, it is purely to demonstrate the ideas!

Other PCRE bits and pieces

There are a few other points to consider, namely how do you setup the callout and how do you pass lua_State *L to the callout? I’m not going to explain in great detail how all these parts hang together in a full application, simply point out some key pieces.

  1. You have to set the PCRE global variable pcre_callout, a function pointer, to your callout function. Simply, pcre_callout = mycallout; Yes, it does work. Here, re represents our compiled regular expression pattern. Note that you must use the PCRE_UTF8 option if you are searching UTF-8 encoded text.
  2. Before you can start searching, you need to “compile” your regular expression pattern.
                  re = pcre_compile(pattern,
    		      PCRE_UTF8|PCRE_UCP,
    		      &err_msg,
    		      &err,
    		      NULL);
    
  3. Note, to use PCRE callouts you need to use the appropriate syntax in your regular expression; from the PCRE documentation, “Within a regular expression, (?C) indicates the points at which the external function is to be called.” Once you have compiled your search pattern, and done your error checking, you need to run the search engine using the compiled pattern and your target string (s) in the code below.
  4. The next step is to create a pointer to something called a pcre_callout_block, which is a struct. This struct has a field called callout_data which is a pointer into which you can store whatever you want to pass into the mycallout function: here, I’m setting it to the lua_State variable, L. By doing this, each time PCRE matches a string and calls the callout funtion, the lua_State variable, L will be available for our use! Clearly, you’ll need to do this from within the appropriate function you call from LuaTeX. Once this is done you are ready to begin your searching using pcre_exec(...).

                  pcre_extra *p;
                  p = (pcre_extra*) malloc(sizeof(pcre_extra));
                  memset(p,0, sizeof(pcre_extra));
                  p->callout_data = L;
                  p->flags=PCRE_EXTRA_CALLOUT_DATA;
                         res = pcre_exec(re,
                                p,
                                s,
                                len,
                                0,
                                0,
                                offsets,
                         OVECMAX);
    

Summary

PCRE is a marvellous and powerful C library – with copious documentation that you’ll need to read very carefully! The ability to provide LuaTeX with a UTF-8-enabled regex engine could open the way to some useful applications, particularly when combined with LuaTeX’s own callback mechanism. In particular, the process_input_buffer callback which allows you to change the contents of the line input buffer just before LuaTeX actually starts looking at it. The mind boggles at the possibilities!