STM publishing: tools, technologies and change A WordPress site for STM Publishing


Lua-scriptable PATGEN – i.e., PATGEN 2.4 with a Lua binding…

Posted by Graham Douglas

PATGEN: from WEB to C

I recently became curious about TeX's hyphenation patterns and started to read about how they are created – usually using PATGEN though, from what I've read, some brave souls do actually hand-craft hyphenation patterns! I decided to build PATGEN 2.4 from source code which, of course, means converting the PATGEN WEB source to C code via Web2C. Some time ago I went through the process of building my own Web2C executable for Windows (see this article for more details). I won't go into the specifics of doing the conversion but I was able to create patgen.c – the resulting C code is less than 2,000 lines long. I also spent some time re-formatting the C code simply because the Web2C process of machine-generated C does not aim for beauty, just functionality. I removed all dependencies on Kpathsea and generally tidied the code to create clean, stripped-down code that is easy to compile.

Understanding PATGEN: not so easy

PATGEN is, of course, a very highly specialized program and one that is designed for expert users who really need to use it. As a non-expert looking to understand just the basics I found that there was very little step-by-step "beginners" material – although a search on provided some useful "snippets" and the tutorial "A small tutorial on the multilingual features of PatGen2" by Yannis Haralambous was very helpful. There are, of course, a number of articles, by luminaries and experts, on specific uses of the PATGEN program; however, for me anyway, it was a case of piecing together the puzzle... reading the PATGEN documentation, source code plus some parts of Frank Liang's thesis Word Hy-phen-a-tion by Com-put-er which describes the hyphenation algorithms that PATGEN implements.

Running PATGEN

To run PATGEN you need to provide it with the names/paths of (up to) four files (some can be "nul" if you are not using them):

PATGEN dictionary_file starting_patterns translate_file output_patterns

TIP: I created a PDF file of PATGEN's documentation that you can download here. Some information on the files you provide to PATGEN is discussed in sections 1 to 6 in the first few pages of the documentation.

In very brief outline, the files you provide on the command line are

  • dictionary_file: A pre-prepared list of hyphenated words from which you want to generate hyphenation patterns for TeX to use.
  • starting_patterns: (can be "nul", i.e., it is not mandatory) Best to read the description(s) in the documentation (link above).
  • translate_file: (can be "nul", i.e., it is not mandatory) From the documentation: "The translate file may specify the values of \lefthyphenmin and \righthyphenmin as well as the external representation and collating sequence of the `letters' used by the language." It also specifies other information – see the documentation for further details (section 54).
  • output_patterns: the output from PATGEN – a file of hyphenation patterns for use with TeX.

PATGEN: questions, questions...

In order to work its magic, PATGEN makes multiple passes through the dictionary_file as it builds the list of hyphenation patterns. As it performs the processing PATGEN stops to ask you for input: it needs your help at various stages of the processing. Now, I'm not going to go into the details of those questions simply because I'm not sufficiently experienced with the program to be sure that I'd be giving sensible advice. Sorry :-( .

Answering questions via Lua

So, finally, to the main topic of this post. As noted, during processing PATGEN asks you to provide it with some information to guide the pattern-generation process: those details concern the hyphenation levels, pattern lengths plus some heuristics data that assist PATGEN to choose patterns. Ultimately, the answers you give to PATGEN are integer values that you enter at the command line. However, it's a bit frustrating to keep answering PATGEN's questions so I wondered if it would be possible to "automate" providing those answers and, in addition, create a Dynamic Link Library (DLL) that I could use with LuaTeX – perhaps something very basic to start with, like this:


local pgen=require("patgen")


In the above code, require("patgen") will load a DLL (patgen.dll) and return a table of functions that let you set various parameters for PATGEN and then run it to return the pattern list as a string that you can subsequently use with LuaTeX. Note, LuaTeX does NOT require INITEX mode to use hyphenation patterns.

Calling Lua code (functions) from patgen.dll

The above simple scenario does indeed work and it's quite easy to implement this. Firstly, within PATGEN's void mainbody(void) routine you can replace the code that stops to ask you questions – such as the request for the start/finish pattern lengths:

Fputs(output, "pat_start, pat_finish: ");
input2ints(&n1, &n2);

The above code uses a function input2ints (int *a, int *b) to request two integers:

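void input2ints (int *a, int *b)
{
  int ch;
  /* keep asking until scanf manages to read exactly two integers */
  while (scanf (SCAN2INT, a, b) != 2)
    {
      /* discard the rest of the offending input line */
      while ((ch = getchar ()) != EOF && ch != '\n');
      if (ch == EOF) return;
      fprintf (stderr, "Please enter two integers.\n");
    }

  /* discard anything left on the line after the two integers */
  while ((ch = getchar ()) != EOF && ch != '\n');
}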

You can replace this with your own function, say get_pattern_start_finish(&n1, &n2) which can, for example, call a function in your Lua script to work out the values you want to return for n1 and n2 (values for pat_start, pat_finish). Perhaps you might store those values in Lua as a table. At the time of writing I've not yet written that part but, at the moment, from the Lua/C module I just return some hardcoded answers. The next step simply requires making a call from the Lua C API to a named Lua script function that works out the values you want to provide. This gives the most flexibility because the logic is all contained in your Lua code which, of course, makes it very quick and easy to experiment with different settings to generate different patterns. This technique can also be used for other parameters that PATGEN asks for.
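As a rough sketch of that idea, the prompt can be hidden behind a function pointer so that the answers come from whatever "provider" is installed: hardcoded values for now, a Lua-backed callback later. Only get_pattern_start_finish is the name suggested above; answer_provider, set_answer_provider and fixed_answers are hypothetical names that are not part of PATGEN, and the values 2 and 4 are arbitrary placeholders.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: fills in two integers, returns 1 on success. */
typedef int (*answer_provider)(const char *question, int *a, int *b);

static answer_provider current_provider = NULL;

/* Install the function that will answer PATGEN's questions. */
void set_answer_provider(answer_provider p)
{
  current_provider = p;
}

/* Stand-in provider returning hardcoded (arbitrary) answers; a Lua-backed
   provider would instead call into the script and read back two integers. */
int fixed_answers(const char *question, int *a, int *b)
{
  (void) question;   /* a real provider could dispatch on the question */
  *a = 2;
  *b = 4;
  return 1;
}

/* Replacement for the prompt + input2ints(&n1, &n2) pair in mainbody(). */
int get_pattern_start_finish(int *n1, int *n2)
{
  assert(n1 != NULL && n2 != NULL);
  if (current_provider == NULL)
    return 0;        /* nothing installed: caller may fall back to stdin */
  return current_provider("pat_start, pat_finish", n1, n2);
}
```

With something like this in place, the module would call set_answer_provider(...) once at start-up, and mainbody() would call get_pattern_start_finish(&n1, &n2) wherever it previously stopped to prompt.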

Returning the generated pattern(s)

Within PATGEN, there is a function called zoutputpatterns(...) which generates the hyphenation patterns and writes them out to a file. I'm experimenting with a function which wraps this in another function that uses a C++ stringstream object to capture/save the pattern text, rather than writing it to a file. Doing this simply required modifying zoutputpatterns(...) so that the stringstream object is passed in and the patterns (character data) are output to the stringstream rather than written to a physical file. Once finished, you can then get access to the stringstream's stored data as a C-style string (containing the generated hyphenation patterns) which you can return to Lua, and thus to LuaTeX.

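#include <iostream>
#include <sstream>

using namespace std;

// zoutputpatterns(int, int, std::stringstream *) is the modified PATGEN routine described above
void do_output_patterns(int i, int j)
{
  std::stringstream ss;          // collects the pattern text in memory
  zoutputpatterns (i , j, &ss) ;
  std::cout << ss.str().c_str() << endl; //you can then pass the string of patterns back to Lua and thus LuaTeX
}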
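For builds that keep patgen.c as plain C, the same "capture instead of write" idea could be approximated with a small growable buffer. This is only a sketch under that assumption: pattern_buffer, pattern_buffer_append and pattern_buffer_free are hypothetical names, and this is not how the stringstream version above is implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable text buffer standing in for the output file. */
typedef struct {
  char  *data;   /* NUL-terminated contents, or NULL when empty */
  size_t len;    /* number of characters stored, excluding the NUL */
  size_t cap;    /* allocated size of data in bytes */
} pattern_buffer;

/* Append text to the buffer; returns 1 on success, 0 if out of memory. */
int pattern_buffer_append(pattern_buffer *pb, const char *text)
{
  size_t n;
  assert(pb != NULL && text != NULL);
  n = strlen(text);
  if (pb->len + n + 1 > pb->cap) {
    size_t newcap = pb->cap ? pb->cap : 64;
    char *p;
    while (newcap < pb->len + n + 1)
      newcap *= 2;
    p = (char *) realloc(pb->data, newcap);
    if (p == NULL)
      return 0;
    pb->data = p;
    pb->cap = newcap;
  }
  memcpy(pb->data + pb->len, text, n + 1);  /* copies the NUL as well */
  pb->len += n;
  return 1;
}

/* Release the buffer's memory and reset it to the empty state. */
void pattern_buffer_free(pattern_buffer *pb)
{
  free(pb->data);
  pb->data = NULL;
  pb->len = pb->cap = 0;
}
```

The output routine would then call pattern_buffer_append(...) wherever it currently writes to the patterns file, and the finished pb.data could be handed back to Lua as an ordinary C string.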

In conclusion

This is just a quick summary of a work-in-progress but it looks like it will provide a nice way for fast/rapid experimentation with PATGEN. It seems to offer dynamic generation of hyphenation patterns and provides a method to fully script PATGEN's activities and thus very quickly understand the effect of the parameters PATGEN asks you to provide. If there is any interest I might (eventually) release it once I'm happy that it's good enough.


Building LuaTeX 0.80 on Windows and debugging with Eclipse IDE

Posted by Graham Douglas

Long time, no posts!

It's been a very long time since my last post, some 8 months, so I thought it was about time I posted something new. I'm currently looking for new contract work (or employment opportunities) within STM publishing so, for a while, I have some time to devote to my blog.

A new LuaTeX beta (version 0.80) was released on 13 June 2015 and, as usual, I wanted to compile LuaTeX from source code. I grabbed a copy of LuaTeX's source from the subversion repository – on Windows I use the excellent, free, TortoiseSVN software to create my local repository. To create a local repository with TortoiseSVN you use the URL

Compilation failed: time for an update of MinGW/MSYS

At first I could not get a successful compilation of LuaTeX 0.80 even though the prior release ( compiled perfectly. Note that this failure to build LuaTeX 0.80 could simply be due to a problem with my local setup and others might not experience it: I'm merely documenting what I did to fix my own issues with the build. After some discussions with a member of the LuaTeX development team I decided it was time to do a fresh/updated install of the tools you need to compile LuaTeX – MinGW and MSYS, which provide the compiler, libraries, Bash shell and other tools/utilities.

Which ones did I use?

I decided to use the following versions of MinGW and MSYS:

Notes on installation

Installing mingw-w64 just requires running the .exe provided. To install MSYS you simply unpack the file. I chose to install mingw-w64 and MSYS on my E: drive in directories called MinGW64 and MSYS respectively. Once you've installed MSYS you need to run a small "post installation" batch file called pi.bat which is located in the postinstall subdirectory of your MSYS folder (e.g., e:\msys\postinstall\pi.bat). This batch file asks a couple of simple questions to "link up" your MinGW installation and your MSYS installation. After installing mingw-w64 and MSYS I was able to build LuaTeX 0.80 without any difficulties. Note that you will probably need to update your system's PATH environment variable to include the location of the directories which contain the numerous executables provided by mingw-w64 and MSYS.

Next step: Grab the LuaTeX code

As noted, you'll need an SVN client to check out your own local copy of the LuaTeX repository. I used TortoiseSVN and the aforementioned URL: Let's assume you successfully downloaded LuaTeX's source code into a repository directory called, say, e:\luatex\beta-0.80.0. The next step is to start the MSYS Bash shell by double-clicking on the batch file msys.bat located in the root of your MSYS folder. With the Bash shell running, change your current directory by issuing the command cd e:/luatex/beta-0.80.0

Running the build script:

Located within the e:\luatex\beta-0.80.0 directory is a file (Bash shell script) called which you execute to perform the compilation process (i.e., it calls configure and make). If you look inside you'll observe there are several command-line options you can give to the script but I'm not going to cover those here – apart from the --debug option which I'll discuss in a moment. To execute the script you just need to type ./ press return and, hopefully, the build will start. Depending on the options you give to the build script the build process can take quite a long time. On my Intel i7 (6 core) machine (with 16GB memory) it can take as long as 20 minutes for a full build.

Using --debug

As you might have guessed, running the script with the --debug option (./ --debug) creates a version of the luatex.exe executable that contains a wealth of additional information which provides GNU's debugger (gdb) with the information it needs in order to run the executable for debugging purposes. Just to note that, at the time of writing, the non-debug luatex.exe file (on Windows) is approximately 8MB, but the debug version explodes in size to something like 325MB! (again, on Windows). This has been reported to the LuaTeX team and is presently being investigated.

Now the fun stuff: debugging luatex.exe

LuaTeX is a large and very complex piece of software which makes use of many C/C++ libraries, including: FontForge, Cairo, MetaPost, GNU numerical libraries, libpng, zlib and others – all in addition to its own code base plus, of course, the Lua scripting language. It's really quite an amazing feat of programming to glue all these libraries together. If, like me, you are interested to see how LuaTeX works "under the hood", the only way to really achieve that is to create a debug version of LuaTeX (noted above) and run it using the GNU debugger (gdb) – the GNU debugger is installed as part of the mingw-w64 distribution.

I prefer a Visual Debugger

Of course, it's quite possible to use the GNU debugger via a command line but after years of using Microsoft Visual Studio I very much prefer using a graphical interface to set breakpoints, single-step through code, examine variables etc – all the things you do as part of a debug session. However, we've built the luatex.exe (debug version) through a script, using GNU compilers, and we don't have a nice Visual Studio project we can use: so how can we have the pleasures of a GUI-based debugging session? Well, there's some great news: you can! The Eclipse IDE (Integrated Development Environment) has a fantastic feature that lets you import an executable (debug version) and automatically creates a project that lets you use GNU's gdb within a nice GUI world – you can single-step through the original C/C++ code, and work just as you would in a typical Visual Studio world. It's really quite amazing and is possible because the debug version of luatex.exe is expanded to provide/include the additional information that lets you do this.

Installing Eclipse on Windows

Eclipse is built in Java so you'll first need to ensure you have Java (and the Java Development Kit) installed before you try to install Eclipse. I already had 32-bit Java installed but I decided to install the 64-bit version (keeping the 32-bit version) – all I did was to install the 64-bit Java version in a different directory. Again, you might need to update your system's PATH environment variable so that it can find Java executables.

Which Eclipse?

You need the Eclipse IDE for C/C++ Developers. The latest version is, at the time of writing, available here (again I opted for the 64-bit version). Now I must confess that I did encounter a few minor issues with trying to configure Eclipse and telling it to use the compiler setup provided by mingw-w64. Such issues can be very dependent on your local setup so I won't go into the details. However, if, like me, you do encounter difficulties trying to test the Eclipse install (compiling a simple test C program with the GNU compiler) then be patient and Google for help + tips because most issues are likely to have been noted/discussed somewhere on the web.

A TIP I can offer: .w source file extensions

One particular issue you might hit when trying to debug LuaTeX with the GNU debugger (gdb) is the strange source file extension used by some source files in the LuaTeX code base. Much of core LuaTeX is written in CWEB, which is the C-code version of Knuth's venerable WEB (structured documentation) format. CWEB code is a mixture of C program code and TeX documentation code. During the build process a program called CTANGLE processes the .w files to generate the C source for compilation. However, the debug executable contains references to these .w source files but Eclipse needs to be told that files with a .w extension are source files, otherwise Eclipse and the GNU debugger (gdb) get "confused" and claim they can't find the .w source files – meaning you can't step into the source code. All I did was (within Eclipse) to set up .w as a source file type under Window --> Preferences --> C/C++--> File Types as shown in the screenshot below.

And finally: opening luatex.exe with Eclipse

From the Eclipse menu, choose File --> Import and select C/C++ Executable. Click "Next" then "Browse" to locate the debug version of the LuaTeX executable you built earlier (Note: here I've named my executable file luatex080debug.exe). From then on, just continue with the import process – for now, during the import process I just accepted the default options offered by Eclipse (you can read up later). Once Eclipse is ready, click "Debug" in the final step and the debug version of the executable will be examined and parsed to extract all the information it contains and from that data Eclipse will build you a project for debugging LuaTeX, complete with access to all the source code for you to set breakpoints and step through at your leisure. How amazing is that!

The debug executable being parsed by gdb and Eclipse building your debugging project:

The Eclipse IDE debugging LuaTeX using GNU gdb debugger

Single-stepping through LuaTeX's source code in Eclipse :-)

The Eclipse IDE debugging LuaTeX using GNU gdb debugger


Testing embedding some Tweets

Posted by Graham Douglas


Looking inside TeX: strings and pool files

Posted by Graham Douglas


In this post we'll cover TeX's handling of strings and explain .pool files. Using Web2C to build (Knuthian) TeX from Knuth's TeX.WEB source code involves many steps as explained elsewhere on this site. One of the initial steps when building TeX is combining Knuth's master source file (TeX.WEB) with a "change file" (TeX.CH) to produce a modified WEB source file (let's call it TeXk.WEB) which can be processed via the Web2C process. The TeX.CH change file applies many modifications to the master TeX.WEB source code – i.e., in preparation for conversion to C code and adding support for the kpathsea file-searching library. After the change file has been applied, the next step is to process our modified TeX.WEB (i.e., TeXk.WEB) via the TANGLE program. If TANGLE successfully parses our TeXk.WEB source code it will output two files (download links are provided for the inquisitive):

  • TeXk.p: the source code of TeX (in Pascal).
  • TeXk.pool: a file containing the string constants defined in TeXk.WEB

Here's a small fragment of TeXk.pool as produced during my Web2C process:

09save size
15grouping levels
28Incompatible magnification (
36 the previous value will be retained
58I can handle only one magnification ratio per job. So I've
59reverted to the magnification you used earlier on this run.
46Illegal magnification has been changed to 1000
52The magnification ratio must be between 1 and 32768.

TeXk.pool consists of many lines of the format [string length][string text][end_of_line] and a final line containing *CHECKSUM, where CHECKSUM in the above example is 413816964. Once upon a time, .pool files had to be preserved as external files for use when building .fmt files via INITEX but in 2008 this was changed and the .pool file is now compiled into the TeX binaries – I'll explain this below. For example, the following note is contained in more recent texmf.cnf files:

As of 2008, pool files don't exist any more (the strings are compiled into the binaries), but just in case something expects to find these:
TEXPOOL = .;$TEXMF/web2c

As you can see from the above fragment, the TeXk.pool file contains string constants for TeX's primitive commands plus all the strings contained in help/error messages that TeX outputs to the terminal and/or log file.
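To make the line format concrete, here is a minimal sketch of a reader for a single .pool line. It assumes, as in the fragment above, that the length is always written as exactly two decimal digits; parse_pool_line is a hypothetical helper and is not part of TeX or Web2C.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Parse one line of a .pool file of the form [2-digit length][text].
   Copies the text into out (NUL-terminated) and returns its length.
   Returns -1 for the final "*CHECKSUM" line, a malformed line, or an
   output buffer that is too small. */
int parse_pool_line(const char *line, char *out, size_t outsize)
{
  int len;
  assert(line != NULL && out != NULL);
  if (!isdigit((unsigned char) line[0]) || !isdigit((unsigned char) line[1]))
    return -1;
  len = (line[0] - '0') * 10 + (line[1] - '0');
  if (strlen(line + 2) < (size_t) len || (size_t) len + 1 > outsize)
    return -1;
  memcpy(out, line + 2, (size_t) len);   /* ignores any trailing newline */
  out[len] = '\0';
  return len;
}
```

For instance, feeding it the line "09save size" yields the 9-character string "save size", while the closing checksum line is rejected because it starts with an asterisk rather than two digits.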

TeX's internal handling of strings

In addition to the string constants defined in TeXk.pool, TeX will, of course, encounter new strings – for example, when you define new macro names; consequently, TeX needs a way to store the string constants in TeXk.pool and the strings it encounters during its run-time processing of your TeX files. It should not be a surprise that TeX's internal handling of strings is achieved through methods designed to ensure portability.

From TeX.WEB: The TEX system does nearly all of its own memory allocation, so that it can readily be transported into environments that do not have automatic facilities for strings, garbage collection, etc., and so that it can be in control of what error messages the user receives... Control sequence names and diagnostic messages are variable-length strings of eight-bit characters. Since PASCAL does not have a well-developed string mechanism, TeX does all of its string processing by homegrown methods.

How does TeX use/store strings?

In vanilla C, a simple 8-bit string is an array of characters terminated by the null character ('\0'). TeX does not store its strings as individually named string variables but allocates a single large array and uses integer offsets into that array to identify strings (and calculate lengths). Here's how it works.

From TeX.WEB: The array |str_pool| contains all of the (eight-bit) ASCII codes in all of the strings, and the array |str_start| contains indices of the starting points of each string. Strings are referred to by integer numbers, so that string number |s| comprises the characters |str_pool[j]| for |str_start[s]<=j<str_start[s+1]|. Additional integer variables |pool_ptr| and |str_ptr| indicate the number of entries used so far in |str_pool| and |str_start|, respectively; locations |str_pool[pool_ptr]| and |str_start[str_ptr]| are ready for the next string to be allocated.

It is worth noting that when TANGLE produces Pascal code (from the WEB source) it strips out all underscores from variables defined in the WEB code. For example, the |str_pool| variable mentioned above is called strpool in the final C code produced from the Pascal.

After processing via Web2C, the WEB variables |str_pool|, |str_start|, |pool_ptr| and |str_ptr| are global variables declared as follows (near the start of TeX.C):

packedASCIIcode * strpool ;
poolpointer * strstart ;
poolpointer poolptr ;
strnumber strptr ;

The types packedASCIIcode and poolpointer are simply typedefs:

typedef unsigned char packedASCIIcode ;
typedef int integer;
typedef integer poolpointer ;

Stripping away all typedefs introduced by Web2C gives:

unsigned char* strpool ;
int* strstart ;
int poolptr ;
int strptr ;

To see what's going on, i.e., how TeX identifies a string, let's first look at the global variable strpool (practically all key variables are declared with global scope in TeX.C...!). During initialization (in INITEX mode, and when TeX is reading/unpacking a .fmt file to initialize a particular format (plain.fmt, latex.fmt etc)) the strpool and strstart variables are initialized as follows:

strpool = xmallocarray (packedASCIIcode , poolsize) ;
strstart = xmallocarray (poolpointer , maxstrings) ;

where xmallocarray is a #define:

/* Allocate an array of a given type. Add 1 to size to account for the fact that Pascal arrays are used from [1..size], unlike C arrays which use [0..size). */
#define xmallocarray(type,size) ((type*)xmalloc((size+1)*sizeof(type)))

and xmalloc(...) is a small utility function wrapped around the standard C function malloc(...).

A Pascal legacy: In many places within TeX.C you have to account for the fact that Pascal arrays start at index 1 but C arrays start at index 0. This is a consequence of Knuthian TeX being written in Pascal, not C.

The allocation of memory for strpool uses an integer variable called poolsize: the value of poolsize is calculated at run-time from the value of other variables – including some variables whose value can be defined by settings in texmf.cnf. So, in essence:

strpool = (unsigned char *) malloc(sizeof(unsigned char)*(poolsize +1));

– which looks very much like one huge C string. And, of course, it is. strpool stores all TeX's strings BUT within strpool all strings are contiguous (stored end-to-end) without any delimiter characters between them (such as NUL ('\0'), space, etc). Clearly, there needs to be a mechanism to define where each individual string starts and stops: i.e., to partition strpool into individual strings. That mechanism is the task of the integer array variable called strstart. Perhaps an example will make this clearer.

We can declare a variable fakestrpool as follows:

unsigned char fakestrpool[]="ThisismyfakeTeXstrpool";

Here, we have concatenated the 6 strings "This", "is", "my", "fake","TeX" and "strpool" into one long string. These 6 strings start at the following offsets in fakestrpool:

string 0 ("This"): offset 0
string 1 ("is"): offset 4
string 2 ("my"): offset 6
string 3 ("fake"): offset 8
string 4 ("TeX"): offset 12
string 5 ("strpool"): offset 15

So, if we define an array of integers, strstart, to record these offsets:

int strstart[6] = { 0, 4, 6, 8, 12, 15 } ; // start offsets of the 6 strings numbered 0 to 5


Then for some string identified by a number k (where 0 <= k <= 5), strstart[k] gives the offset into fakestrpool where the kth string starts. And this is exactly how TeX identifies strings: it identifies them using some integer value, k, say, where strstart[k] tells you where that string starts (in strpool) and allows the length (length(k), of string number k) to be easily calculated using

length(k) = strstart[k + 1] - strstart[k]

For example, let us use this method to calculate the length of the string with number 4 (k=4) ("TeX" in our test array fakestrpool).

length(4) = strstart[5] - strstart[4]
length(4) = 15 - 12 = 3

Of course there is one minor complication – calculating the length of string 5, but we have other variables (poolptr and strptr) to solve issues like this.
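The toy example can be written out in full as follows. This is only a sketch: to sidestep the complication with the last string it stores one extra offset (22, the total number of characters), which mimics the way |str_start[str_ptr]| marks the next free position in the quotation from TeX.WEB above. The names fakestrstart, fakestrptr, fake_length and fake_copy are hypothetical.

```c
#include <assert.h>

/* The six strings stored end-to-end, exactly as in the example above. */
static const unsigned char fakestrpool[] = "ThisismyfakeTeXstrpool";

/* Start offsets of strings 0..5, plus one extra entry marking the end of
   the last string; this plays the role of strstart[strptr] == poolptr. */
static const int fakestrstart[7] = { 0, 4, 6, 8, 12, 15, 22 };

/* Number of strings currently stored (the role played by strptr). */
static const int fakestrptr = 6;

/* Length of string number k, computed purely from the offsets. */
int fake_length(int k)
{
  assert(k >= 0 && k < fakestrptr);
  return fakestrstart[k + 1] - fakestrstart[k];
}

/* Copy string number k into out as a NUL-terminated C string. */
void fake_copy(int k, char *out)
{
  int j, n = 0;
  assert(k >= 0 && k < fakestrptr);
  for (j = fakestrstart[k]; j < fakestrstart[k + 1]; j++)
    out[n++] = (char) fakestrpool[j];
  out[n] = '\0';
}
```

Here fake_length(4) evaluates to 15 - 12 = 3 and, thanks to the extra entry, fake_length(5) evaluates to 22 - 15 = 7.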

Back to .pool files

We started this discussion by noting that running the TANGLE program on TeXk.WEB produces two output files:

  • TeXk.p: the source code of TeX (in Pascal).
  • TeXk.pool: a file containing the string constants defined in TeXk.WEB

The next stage in the discussion covers the mechanism for processing .pool files that was introduced circa 2008. Prior to that, you needed to keep .pool files available (as part of the TeX distribution) as separate files for use whenever you ran INITEX to generate a new .fmt file. As noted, the contents of the .pool files are string constants generated by TANGLE from string constants defined in the main WEB source code of TeX. Given that those strings don't change (they are constants), it makes more sense to build them into the TeX executable file rather than having to access them each time a new .fmt file is created by INITEX. Part of the Web2C process now involves using a small utility program called makecpool.exe (on Windows) – makecpool.C was written by Taco Hoekwater. The input to makecpool.exe is the TeXk.pool file and the output is another C file (called texpool.C or similar) which defines a function called loadpoolstrings(...):

int loadpoolstrings (int spare_size)


If you just want to see the inputs/outputs you can download the files I produced during my private build of Knuthian TeX:

  • TeXk.pool: The .pool file input for makecpool.exe
  • texpool.C: The C file output by makecpool.exe, defining the function loadpoolstrings(...).

Once you have generated texpool.C you no longer need the original TeXk.pool file because the contents of TeXk.pool are now stored within texpool.C as an array of strings:

static const char *poolfilearr[] = {
  "buffer size",
  "pool size",
  "number of strings",
  "" "?" "?" "?",
  "End of file on the terminal!",
  "! ",
  "(That makes 100 errors; please try again.)",
  "" "? ",
  "Type <return> to proceed, S to scroll future error messages,",
  "R to run without stopping, Q to run quietly,",
  "I to insert something, ",

Of course, when you build TeX you will need to compile TeXk.C and texpool.C so that the function loadpoolstrings(...) is made available. The function loadpoolstrings(...) is called from TeX.C when TeX is in INITEX mode (i.e., the --ini option is set on the command line). Specifically, the loadpoolstrings(...) function is called by the function getstringsstarted(...) just after it has initialized the first 256 strings in TeX's main string container: the strpool array discussed above.

Modifying loadpoolstrings (...) to see what it does

The function loadpoolstrings(...) depends on a few of TeX's internal global variables and on the function makestring() (we'll discuss that shortly); notably, we need to declare the following as extern in texpool.C:

extern int makestring ( void ) ;
extern unsigned char * strpool;
extern int poolptr;

Here is my slightly modified version of loadpoolstrings(...) which outputs a file called "datadump.txt" to list the strings and corresponding string numbers generated by makestring():

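#include <stdio.h>
#include <string.h>

int loadpoolstrings (int spare_size) {
  const char *s;
  int g=0;
  FILE* dumpvals;
  int i=0,j=0;
  dumpvals=fopen("datadump.txt", "wb");
  if (dumpvals == NULL) return 0;   /* could not create the dump file */

  while ((s = poolfilearr[j++])) {
    int l = strlen (s);
    fprintf(dumpvals, "//string \"%s\" = number ", s);
    i += l;
    if (i>=spare_size) {            /* not enough room left in strpool */
      fclose(dumpvals);
      return 0;
    }
    while (l-- > 0) strpool[poolptr++] = *s++;
    g = makestring();               /* string number of the string just copied */
    fprintf(dumpvals, "%d\n", g);   /* g is an int, so %d rather than %ld */
  }
  fclose(dumpvals);
  return g;
}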


Those who might be interested to see the contents of datadump.txt can download it here. In any case, here's a listing of the first few lines in datadump.txt:

//string "buffer size" = number 256
//string "pool size" = number 257
//string "number of strings" = number 258
//string "???" = number 259
//string "m2d5c2l5x2v5i" = number 260
//string "End of file on the terminal!" = number 261
//string "! " = number 262
//string "Using character substitution: " = number 1329

As you can see, the string number of the first string is 256 (i.e., the first string originally contained in TeXk.pool). Assuming that the string numbers start at 0 (they do), TeX has already initialized strings 0..255 before loading the strings from the TeXk.pool file. I hate to do this to you, dear reader, but can you guess what those 256 strings (0..255) might be?

The function makestring()

Here is TeX's makestring() function which returns a string number after checking for overflow – i.e., after confirming that TeX has enough space to record another string.

strnumber makestring (void)
{
  register strnumber Result; makestring_regmem
  if (strptr == maxstrings)
    overflow (258 , maxstrings - initstrptr) ;  /* 258 is the string "number of strings" */
  incr (strptr) ;
  strstart[strptr] = poolptr ;
  Result = strptr - 1 ;
  return Result ;
}
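To see how makestring() and the copy loop in loadpoolstrings(...) cooperate, here is a self-contained miniature of the same mechanism. It is a sketch only: the mini_ names and the tiny fixed sizes are hypothetical, numbering starts at 0 rather than 256, and a full table is reported by returning -1 where TeX would call overflow(...).

```c
#include <assert.h>
#include <string.h>

#define MINI_POOL_SIZE   256   /* capacity of the character array */
#define MINI_MAX_STRINGS 16    /* capacity of the offsets array */

static unsigned char mini_strpool[MINI_POOL_SIZE];
static int mini_strstart[MINI_MAX_STRINGS + 1];  /* mini_strstart[0] == 0 */
static int mini_poolptr = 0;   /* next free position in mini_strpool */
static int mini_strptr  = 0;   /* number that the next string will get */

/* Close off the characters appended since the last call and return the
   number of the string they form; returns -1 when the table is full. */
int mini_makestring(void)
{
  if (mini_strptr == MINI_MAX_STRINGS)
    return -1;
  mini_strptr++;
  mini_strstart[mini_strptr] = mini_poolptr;
  return mini_strptr - 1;
}

/* Append the characters of s to the pool and register them as a string. */
int mini_addstring(const char *s)
{
  size_t n;
  assert(s != NULL);
  n = strlen(s);
  if (mini_strptr == MINI_MAX_STRINGS ||
      (size_t) mini_poolptr + n > MINI_POOL_SIZE)
    return -1;
  memcpy(mini_strpool + mini_poolptr, s, n);   /* no terminating NUL stored */
  mini_poolptr += (int) n;
  return mini_makestring();
}
```

Adding "buffer size" and then "pool size" returns string numbers 0 and 1, leaving mini_strstart[1] at 11 and mini_poolptr at 20, so each length can be recovered from neighbouring offsets exactly as described earlier.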

Time to stop

Dear reader, writing this post has absorbed the greater part of my Sunday (14 September 2014) so you'll forgive me if I call it a day and leave it here – I'll fix any typos tomorrow :-) . I hope it is of use, or interest, to someone "out there", somewhere.


RegexBuddy and RegexMagic: Truly superb regular expression tools

Posted by Graham Douglas

Regular expressions are part of many programmers' toolkits but they can be quite fiddly to get right. At the moment, I'm trying to "sanitize" the C code generated for TeX (via Web2C) by post-processing the TeX.c file to make the C source code far more readable. To do that I'm using the original definitions in TeX.WEB to generate C #define statements that I can use in TeX.c. For example, in TeX.WEB you see the following "WEB macros" related to entries in TeX's "equivalence table":

@d eq_level_field(#)==#.hh.b1
@d eq_type_field(#)==#.hh.b0
@d equiv_field(#)==#.hh.rh
@d eq_level(#)==eq_level_field(eqtb[#]) {level of definition}
@d eq_type(#)==eq_type_field(eqtb[#]) {command code for equivalent}
@d equiv(#)==equiv_field(eqtb[#]) {equivalent value}

When WEB expressions using the above macros are processed by TANGLE and Web2C the resulting C code contains many statements that look like the following:

eqtb [curval ].hh.b1 = 1 ;
eqtb [curval ].hh.b0 = c ;
eqtb [curval ].hh .v.RH = o ;

Not very readable but, of course, it is machine-generated C code so what would you expect? Through regular expressions I'm (slowly/carefully) replacing many raw C statements using #defines, such as the following:

#define equivalence_level(a) eqtb[a].hh.b1
#define command_code_equivalence(a) eqtb[a].hh.b0
#define set_value_of_equivalent(a) eqtb[a].hh.v.RH
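As a quick illustration of the effect, the three raw assignments shown earlier can be rewritten with these #defines. The sketch below uses a hypothetical, much-simplified stand-in for the Web2C memory word (fake_memoryword) that reproduces only the field names appearing in those statements; the real type in TeX.c is considerably more involved, and define_entry is an invented wrapper.

```c
#include <assert.h>

/* Hypothetical, much-simplified stand-in for the Web2C memory word; only
   the field names used in the statements above are reproduced. */
typedef struct {
  struct {
    struct { int RH; } v;
    unsigned char b0;
    unsigned char b1;
  } hh;
} fake_memoryword;

static fake_memoryword eqtb[8];

#define equivalence_level(a)        eqtb[a].hh.b1
#define command_code_equivalence(a) eqtb[a].hh.b0
#define set_value_of_equivalent(a)  eqtb[a].hh.v.RH

/* The three raw assignments rewritten with the macros. */
void define_entry(int curval, unsigned char c, int o)
{
  assert(curval >= 0 && curval < 8);
  equivalence_level(curval) = 1;
  command_code_equivalence(curval) = c;
  set_value_of_equivalent(curval) = o;
}
```

Because each macro expands to a plain member access, it can appear on either side of an assignment, so the generated code behaves identically while reading more like the WEB source.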

As part of this work, I use two very useful tools for building and testing regular expressions: RegexBuddy and RegexMagic (the tools are compared/explained here). They help you build, test/develop regular expressions and support the syntax and options of many regular expression engines. Once you have a working regex, RegexBuddy and RegexMagic will generate code that allows you to use the regex in a language of your choice (many languages are supported), including C code to use the regex with PCRE – which is my favourite regex library. Again, this is not an advert for these tools, just some notes from someone who has found them to be extremely useful – they have saved me considerable amounts of time in building, testing/using powerful regular expressions with PCRE.

Screenshot: RegexBuddy

Processing INITEX's primitive(...) function code with RegexBuddy to extract data for preparing C #defines.


Tip: PCRE, how to fix the stack overflow problem (Windows)

Posted by Graham Douglas

I kept hitting a stack overflow problem when using the PCRE regular expression C library under Windows. To fix it, put the following line in PCRE's config.h to tell it not to use recursion but to use the heap instead.

#define NO_RECURSE 1

Worked for me.