STM publishing: tools, technologies and change A WordPress site for STM Publishing


RegexBuddy and RegexMagic: Truly superb regular expression tools

Posted by Graham Douglas

Regular expressions are part of many programmer's toolkit but they can be quite fiddly to get right. At the moment, I'm trying to "sanitize" the C code generated for TeX (via Web2C) by post-processing the TeX.c file to make the C source code far more readable. To do that I'm using the original definitions in TeX.WEB to generate C #define statements that I can use in TeX.c. For example, in TeX.WEB you see the following "WEB macros" related to entries in TeX's "equivalence table":

@d eq_level_field(#)==#.hh.b1
@d eq_type_field(#)==#.hh.b0
@d equiv_field(#)==#.hh.rh
@d eq_level(#)==eq_level_field(eqtb[#]) {level of definition}
@d eq_type(#)==eq_type_field(eqtb[#]) {command code for equivalent}
@d equiv(#)==equiv_field(eqtb[#]) {equivalent value}

When WEB expressions using the above macros are processed by TANGLE and Web2C the resulting C code contains many statements that look like the following:

eqtb [curval ].hh.b1 = 1 ;
eqtb [curval ].hh.b0 = c ;
eqtb [curval ].hh .v.RH = o ;

Not very readable but, of course, it is machine-generated C code so what would you expect. Through regular expressions I'm (slowly/carefully) replacing many raw C statements using #defines, such as the following:

#define equivalence_level(a) eqtb[a].hh.b1
#define command_code_equivalence(a) eqtb[a].hh.b0
#define set_value_of_equivalent(a) eqtb[a].hh.v.RH

As part of this work, I use two very useful tools for building and testing regular expressions: RegexBuddy and RegexMagic (the tools are compared/explained here). They help you build, test/develop regular expressions and support the syntax and options of many regular expression engines. Once you have a working regex, RegexBuddy and RegexMagic will generate code that allows you to use the regex in a language of your choice (many languages are supported), including C code to use the regex with PCRE – which is my favourite regex library. Again, this is not an advert for these tools, just some notes from someone who has found them to be extremely useful – and have saved me considerable amounts of time in building, testing/using powerful regular expressions with PCRE.

Screenshot: RegexBuddy

Processing INITEX's primitive(...) function code with RegexBuddy to extract data for preparing C #defines.


TeX’s “badness” function in C

Posted by Graham Douglas

TeX uses the concept of "badness" as a measure of how much the glue in a box has to stretch or shrink. In the following C function, t is the difference between the total of the natural sizes (N) of the components in the box and the desired size of the box (d). So, t = N-d. If the total amount of glue available for stretching or shrinking is s, then the badness, according to the TeXbook, is $100(t/s)^3$ – note that t/s is also known as the glue-set-ratio (often denoted r). In reality, TeX uses an approximation to this calculation, as shown below – the C code is from the C output by Web2C.

typedef int scaled  ;
typedef int halfword  ;

halfword badness ( scaled t , scaled s )
  halfword Result;
  integer r  ;
  if ( t == 0 )
  Result = 0 ;
  else if ( s <= 0 )
  Result = 10000 ;
  else {
    if ( t <= 7230584L )
    r = ( t * 297 ) / s ;
    else if ( s >= 1663497L )
    r = t / ( s / 297 ) ;
    else r = t ;
    if ( r > 1290 )
    Result = 10000 ;
    else Result = ( r * r * r + 131072L ) / 262144L ;
  return Result ;

What is TeX’s memoryword structure in C?

Posted by Graham Douglas

Just a brief post, partly to record this for my own use. If you read the source code of TeX you will see references to a data structure called a memoryword. It is very carefully defined in the source file texmfmem.h, using various #ifdef blocks to account for endian-type and the "flavour" of TeX you are compiling. So, here is the memoryword, stripped to the very basics for my Windows-only build of TeX. On my machine, sizeof(memoryword) = 8 bytes – glueratio is defined as the type double (8 bytes) – TeX does use the type double for glue calculations. From section 109 of the TeX source code:

When TEX "packages" a list into a box, it needs to calculate the proportionality ratio by which the glue inside the box should stretch or shrink. This calculation does not affect TEX's decision making, so the precise details of rounding, etc., in the glue calculation are not of critical importance for the consistency of results on different computers.

#define glueratio double
typedef unsigned char quarterword  ;
typedef int integer;
typedef integer halfword  ;

typedef union
   struct  {
     int LH, RH;
  } v;

   struct {
     short B1, B0;
  } u;

} twohalves;

typedef struct
   struct {
     quarterword B3, B2, B1, B0;
  } u;

} fourquarters;

typedef union {

   glueratio gr;
   twohalves hh;

   struct {
     halfword junk;
     integer CINT;
   } u;

  struct {
     halfword junk;
     fourquarters QQQQ;
  } v;

} memoryword;

typedef union {

  struct {
    integer CINT;
  } u;

    fourquarters QQQQ;
  } v;

} fmemoryword;

Tip: PCRE, how to fix the stack overflow problem (Windows)

Posted by Graham Douglas

I kept hitting a stack overflow problem when using the PCRE regular expression C library under windows. To fix it, put the following line in PCRE's config.h to tell it not to use recursion but to use the heap instead.

#define NO_RECURSE 1

Worked for me.


Minimal FreeType program to dump PostScript font names (with file globbing)

Posted by Graham Douglas


I needed to create an updated font map for some work with dvipng/dvips and had to update to contain the mapping between tfm/pfb files and the corresponding PostScript name for each font. To do that I wrote a tiny C program (a simple throw-away utility using FreeType) to extract the PostScript font name from the .pfb files. To save time I used "file globbing" so that the utility's command line could use wildcards – e,g.,[path]\*.pfb to list all Type 1 fonts in [path]. To use file globbing with Windows you need to link your code with an object file called setargv.obj which takes care of the messy details and expands the wildcards on the command line. I use the now-ancient Visual Studio 2008 IDE (good enough for me!) and needed to add setargv.obj as an additional project dependency under "Additional Dependencies" in the project settings for the linker. With that in place, the following ultra-simple program (no error checking!!) prints the font's PostScript name and the full path name of the corresponding font file.

#include <stdio.h>
#include <ft2build.h>
#include FT_FREETYPE_H
#include FT_GLYPH_H
#include FT_OUTLINE_H

int main(int argc, char ** argv)

	FT_Library libfreetype;
	FT_Face     ftface;
	int i;

	FT_Init_FreeType( &libfreetype );

    for (i=1; i<argc; i++){

		FT_New_Face( libfreetype, argv[i], 0, &ftface );
		printf("%s %s\n", FT_Get_Postscript_Name(ftface), argv[i]);

    return 0;


Random test of a neat widget

Posted by Graham Douglas

Large Visitor Globe