{"id":2428,"date":"2012-11-23T19:40:49","date_gmt":"2012-11-23T19:40:49","guid":{"rendered":"http:\/\/www.readytext.co.uk\/?p=2428"},"modified":"2013-03-24T09:11:28","modified_gmt":"2013-03-24T09:11:28","slug":"adding-a-utf-8-capable-regular-expression-library-to-luatex","status":"publish","type":"post","link":"https:\/\/www.readytext.co.uk\/?p=2428","title":{"rendered":"Adding a UTF-8-capable regular expression library to LuaTeX"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>In this post I&#8217;m going to sketch out adding the free <a href=\"http:\/\/www.pcre.org\/\">PCRE C library<\/a> to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I&#8217;ve not tried this in a production environment. So, do please undertake all necessary testing and due diligence in your own code!<\/p>\n<h2>PCRE: Perl Compatible Regular Expressions<\/h2>\n<p>PCRE is a mature C library which provides a very powerful regular expression engine. It is also capable of working with UTF-8 encoded strings, which is, of course, very useful because LuaTeX uses UTF-8 input. I&#8217;m not going to cover the entire PCRE build process in this post because, frankly, it&#8217;ll take too long. But in outline&#8230;<\/p>\n<h3>Building PCRE as a static library (.lib)<\/h3>\n<ol>\n<li>I used <a href=\"http:\/\/www.cmake.org\/\">CMake<\/a> to create a Visual Studio 2008 project via the PCRE-supplied CMakeLists.txt file. Using the CMake tool you can set the appropriate compile-time flags for UTF-8 support: PCRE_SUPPORT_UTF and PCRE_SUPPORT_UNICODE_PROPERTIES. The latter is very useful for searching UTF-8 strings based on their Unicode character properties. Full details are in the PCRE documentation. 
<\/li>\n<p><img decoding=\"async\" src=\"http:\/\/readytext.co.uk\/files\/cmake.png\" width=\"100%\"\/><\/p>\n<li>After you finish configuring the PCRE build, and have selected your build environment, press <code>Generate<\/code> and CMake will output a complete Visual Studio project that you can open and start working on. Wonderful!<\/li>\n<li>Building PCRE as a static library went fine, but I did have a few hassles getting the DLL I was building to link correctly against it. It took me a bit of time to figure out which additional PCRE preprocessor directives I needed to set in the DLL C code to ensure everything was <code>#define<\/code>&#8217;d properly. <\/li>\n<\/ol>\n<h1>Building a DLL for LuaTeX<\/h1>\n<p>I wrote a very brief overview of building DLLs for LuaTeX in <a href=\"https:\/\/www.readytext.co.uk\/?p=489\">this post<\/a> so I won&#8217;t repeat the details here. Instead, I&#8217;ll give a summary indicating how you can get PCRE to call LuaTeX. One word of advice: PCRE comes with <em>a lot<\/em> of documentation and you&#8217;ll need to read through it very carefully! Asking PCRE to call LuaTeX sounds strange, but you can indeed do it because PCRE lets you register a callback function which it calls whenever matching reaches a point you have marked in the pattern. Perl has a similar ability to execute Perl code on matching a string. From the PCRE documentation:<\/p>\n<blockquote><p>&#8220;PCRE provides a feature called &#8216;callout&#8217;, which is a means of temporarily passing control to the caller of PCRE in the middle of pattern matching. The caller of PCRE provides an external function by putting its entry point in the global variable <code>pcre_callout<\/code>.&#8221; <\/p><\/blockquote>\n<h2>Calling LuaTeX<\/h2>\n<p>OK, so how do we do that? 
There are two parts to this story: create a Lua function you want to call from C and create the C function which calls the Lua function.<\/p>\n<ol>\n<li>From within LuaTeX, use <code>\\directlua{...}<\/code> to create a simple Lua function <code>printy<\/code> that we are going to call from PCRE. This Lua function takes a string and sends it to LuaTeX via tex.print(). In these examples I sent LuaTeX a simple text string <code>\"Yo! I was called!\"<\/code>, which LuaTeX then typeset. Of course, you could also send LuaTeX the string that was matched by PCRE!\n<pre class=\"brush: cpp; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n       \\directlua{\r\n              function printy (str)\r\n              tex.print(str)\r\n              end\r\n       }\r\n<\/pre>\n<\/li>\n<li>The next part is to create the C code to call a Lua function. This C function is the callout that PCRE will call when it matches a string.\n<pre class=\"brush: cpp; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n       int mycallout(pcre_callout_block *cb){\r\n       lua_State *L;\r\n       L = cb-&gt;callout_data;\r\n       if (L){\r\n              lua_getglobal(cb-&gt;callout_data, &quot;printy&quot;);\r\n              if(!lua_isfunction(L,-1)) {\r\n                     lua_pop(L,1);\r\n                     return 0;\r\n               }\r\n\r\n              lua_pushstring(L, &quot;Yo! I was called!&quot;);   \/* push 1st argument *\/\r\n              \/* Now make the call to printy with 1 argument and 0 results*\/\r\n              if (lua_pcall(L, 1, 0, 0) != 0) {\r\n              \/\/ report your error \r\n               return 0;\r\n              }\r\n    }\r\n    return 0;\r\n}\r\n<\/pre>\n<blockquote><p>A few points here are worth noting. <\/p>\n<ul>\n<li>From the PCRE documentation:<br \/>\n<blockquote><p>&#8220;The external callout function returns an integer to PCRE. If the value is zero, matching proceeds as normal. 
If the value is greater than zero, matching fails at the current point, but the testing of other matching possibilities goes ahead, just as if a lookahead assertion had failed. If the value is less than zero, the match is abandoned, the matching function returns the negative value&#8221;<\/p><\/blockquote>\n<\/li>\n<li>The <code>lua_State<\/code> pointer, <code>L<\/code>, is passed in via a mechanism I&#8217;ll outline below.<\/li>\n<li>The line <code>lua_getglobal(cb-&gt;callout_data, \"printy\")<\/code> does the main work of pushing the value of the global variable <code>printy<\/code> onto Lua&#8217;s stack. In effect, this is a pointer to the function we defined in LuaTeX, which we then call through <code>lua_pcall(...)<\/code>. Further details are in the Lua documentation. <\/li>\n<li>The above code does near-zero error checking; it is purely to demonstrate the ideas!<\/li>\n<\/ul>\n<\/blockquote>\n<\/li>\n<\/ol>\n<h2>Other PCRE bits and pieces<\/h2>\n<p>There are a few other points to consider, namely how do you set up the callout and how do you pass <code>lua_State *L<\/code> to the callout? I&#8217;m not going to explain in great detail how all these parts hang together in a full application, but simply point out some key pieces.<\/p>\n<ol>\n<li>You have to set the PCRE global variable <code>pcre_callout<\/code>, a function pointer, to your callout function. Simply, <code>pcre_callout = mycallout;<\/code> Yes, it does work. In the next step, <code>re<\/code> represents our compiled regular expression pattern. 
Note that you must use the <code>PCRE_UTF8<\/code> option if you are searching UTF-8 encoded text.<\/li>\n<li>Before you can start searching, you need to &#8220;compile&#8221; your regular expression pattern.\n<pre class=\"brush: cpp; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n              re = pcre_compile(pattern,\r\n\t\t      PCRE_UTF8|PCRE_UCP,\r\n\t\t      &amp;err_msg,\r\n\t\t      &amp;err,\r\n\t\t      NULL);\r\n<\/pre>\n<\/li>\n<li>Note, to use PCRE callouts you need to use the appropriate syntax in your regular expression; from the PCRE documentation, &#8220;Within a regular expression, (?C) indicates the points at which the external function is to be called.&#8221; Once you have compiled your search pattern, and done your error checking, you need to run the search engine using the compiled pattern and your target string (<code>s<\/code> in the code below).\n<\/li>\n<li>\nThe next step is to create a pointer to a <code>pcre_extra<\/code> struct. This struct has a field called <code>callout_data<\/code>, a pointer in which you can store whatever you want to pass into the <code>mycallout<\/code> function: here, I&#8217;m setting it to the <code>lua_State<\/code> pointer, <code>L<\/code>. You also need to set the <code>PCRE_EXTRA_CALLOUT_DATA<\/code> bit in the struct&#8217;s <code>flags<\/code> field so that PCRE knows the field is in use. PCRE then hands the same pointer to the callout in the <code>callout_data<\/code> field of the <code>pcre_callout_block<\/code>, so each time PCRE calls the callout function, <code>L<\/code> will be available for our use! Clearly, you&#8217;ll need to do this from within the appropriate function you call from LuaTeX. Once this is done you are ready to begin your searching using <code>pcre_exec(...)<\/code>. 
\n<pre class=\"brush: cpp; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n              pcre_extra *p;\r\n              p = (pcre_extra*) malloc(sizeof(pcre_extra));\r\n              memset(p, 0, sizeof(pcre_extra));\r\n              p-&gt;callout_data = L;   \/* handed back to mycallout via cb-&gt;callout_data *\/\r\n              p-&gt;flags = PCRE_EXTRA_CALLOUT_DATA;\r\n              res = pcre_exec(re,\r\n                            p,\r\n                            s,\r\n                            len,\r\n                            0,   \/* start offset *\/\r\n                            0,   \/* options *\/\r\n                            offsets,\r\n                            OVECMAX);\r\n<\/pre>\n<\/li>\n<\/ol>\n<h1>Summary<\/h1>\n<p>PCRE is a marvellous and powerful C library &ndash; with copious documentation that you&#8217;ll need to read very carefully! The ability to provide LuaTeX with a UTF-8-enabled regex engine could open the way to some useful applications, particularly when combined with LuaTeX&#8217;s own callback mechanism. One example is the <code>process_input_buffer<\/code> callback, which allows you to change the contents of the line input buffer just before LuaTeX actually starts looking at it. The mind boggles at the possibilities!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In this post I&#8217;m going to sketch out adding the free PCRE C library to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I&#8217;ve not tried this in a production environment. 
So, do [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,3],"tags":[],"class_list":["post-2428","post","type-post","status-publish","format-standard","hentry","category-luatex-c-code-windows-dlls","category-luatex"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/2428","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2428"}],"version-history":[{"count":41,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/2428\/revisions"}],"predecessor-version":[{"id":2695,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/2428\/revisions\/2695"}],"wp:attachment":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2428"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2428"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2428"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}