25Oct/14Off

14Oct/14Off

MetaPost MPlib

I'm currently implementing a project built around the MetaPost library MPlib. I managed to build MPlib as a Windows .lib (library) file without "too much" difficulty... In order to understand the workings of the powerful, but complex, MPlib library I found it was very helpful to read parts of Hobby's original code – mainly in relation to generating output from the low-level MPlib/MetaPost edge structures. I also benefitted enormously from reading the C code of the Lua binding so a huge thank you to Taco Hoekwater for his utterly brilliant work on the MPlib/lmplib source code.

I tracked down the MetaPost 0.64 source code (the .web code) and ran TIE and WEAVE to generate the TeX documentation. After a few tiny fixes (for fonts I don't have) I produced a PDF file which I thought others might find useful. You can download it here. The MPlib API documentation (again by Taco) was also very helpful – documentation for version 1.800 of the MPlib API is available here.

6Oct/14Off

Just a random test

Just trying out conversion of fully-vowelled typeset Arabic into SVG: using three different fonts. Seems to work.

5Oct/14Off

A real gotcha! (well, it got me)

I recently built Cairo 1.12.16 as a Windows .lib file using Visual Studio. Somewhat of a painful process but it seems to work fine. One detail that caught me out (and took hours to track down) was that I did not set a critically important preprocessor setting: HAVE_FT_LOAD_SFNT_TABLE. This is important if you are using FreeType: Without setting HAVE_FT_LOAD_SFNT_TABLE Cairo uses a "fallback" process for embedding fonts, which is not ideal.

The C source files for a Windows build

Through trial-and-error I eventually reduced the C files I needed to the list below. So, the .lib file I built is a slightly cut-down build of Cairo but so far it seems to work OK, at least for what I need. You will also need to manually create a header file called cairo-features.h.
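A hand-written cairo-features.h could look roughly like the sketch below. The CAIRO_HAS_* names are Cairo's own feature macros, but the particular selection is an assumption: it is inferred from the backends that appear in the source file list further down (PDF, PostScript, SVG, script, PNG, FreeType and Win32) and is not the header actually used for this build.

```c
/* cairo-features.h: a hand-written sketch for a static, cut-down Windows build.
   ASSUMPTION: the set of features is inferred from the list of C files in this
   post; adjust it to match the files you actually compile. */
#ifndef CAIRO_FEATURES_H
#define CAIRO_FEATURES_H

/* Core pieces used by every build */
#define CAIRO_HAS_IMAGE_SURFACE 1
#define CAIRO_HAS_RECORDING_SURFACE 1
#define CAIRO_HAS_OBSERVER_SURFACE 1
#define CAIRO_HAS_MIME_SURFACE 1
#define CAIRO_HAS_USER_FONT 1

/* Windows backends (cairo-win32-*.c) */
#define CAIRO_HAS_WIN32_SURFACE 1
#define CAIRO_HAS_WIN32_FONT 1

/* FreeType font backend (cairo-ft-font.c) */
#define CAIRO_HAS_FT_FONT 1

/* Output backends (cairo-png.c, cairo-pdf-surface.c, cairo-ps-surface.c,
   cairo-svg-surface.c, cairo-script-surface.c) */
#define CAIRO_HAS_PNG_FUNCTIONS 1
#define CAIRO_HAS_PDF_SURFACE 1
#define CAIRO_HAS_PS_SURFACE 1
#define CAIRO_HAS_SVG_SURFACE 1
#define CAIRO_HAS_SCRIPT_SURFACE 1

#endif /* CAIRO_FEATURES_H */
```

Note that HAVE_FT_LOAD_SFNT_TABLE is not one of these feature macros, so it still has to be set separately (for example in the project's preprocessor definitions).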

Preprocessor definitions

I used the following:

WIN32
_DEBUG
_LIB
CAIRO_WIN32_STATIC_BUILD
CAIRO_HAS_FT_FONT

FreeType
Pixman
libpng
ZLib

List of C source files required (for a cut-down build)

cairo-analysis-surface.c
cairo-arc.c
cairo-array.c
cairo-atomic.c
cairo-base64-stream.c
cairo-base85-stream.c
cairo-bentley-ottmann-rectangular.c
cairo-bentley-ottmann-rectilinear.c
cairo-bentley-ottmann.c
cairo-botor-scan-converter.c
cairo-boxes-intersect.c
cairo-boxes.c
cairo-cache.c
cairo-cff-subset.c
cairo-clip-boxes.c
cairo-clip-polygon.c
cairo-clip-region.c
cairo-clip-surface.c
cairo-clip-tor-scan-converter.c
cairo-clip.c
cairo-color.c
cairo-composite-rectangles.c
cairo-compositor.c
cairo-contour.c
cairo-damage.c
cairo-debug.c
cairo-default-context.c
cairo-deflate-stream.c
cairo-device.c
cairo-error.c
cairo-fallback-compositor.c
cairo-fixed.c
cairo-font-face-twin-data.c
cairo-font-face-twin.c
cairo-font-face.c
cairo-font-options.c
cairo-freed-pool.c
cairo-freelist.c
cairo-ft-font.c
cairo-gstate.c
cairo-hash.c
cairo-hull.c
cairo-image-compositor.c
cairo-image-info.c
cairo-image-source.c
cairo-image-surface.c
cairo-lzw.c
cairo-matrix.c
cairo-mempool.c
cairo-mesh-pattern-rasterizer.c
cairo-misc.c
cairo-mono-scan-converter.c
cairo-mutex.c
cairo-no-compositor.c
cairo-observer.c
cairo-output-stream.c
cairo-paginated-surface.c
cairo-path-bounds.c
cairo-path-fill.c
cairo-path-fixed.c
cairo-path-in-fill.c
cairo-path-stroke-boxes.c
cairo-path-stroke-polygon.c
cairo-path-stroke-traps.c
cairo-path-stroke-tristrip.c
cairo-path-stroke.c
cairo-path.c
cairo-pattern.c
cairo-pdf-operators.c
cairo-pdf-surface.c
cairo-pen.c
cairo-png.c
cairo-polygon-intersect.c
cairo-polygon-reduce.c
cairo-polygon.c
cairo-ps-surface.c
cairo-raster-source-pattern.c
cairo-recording-surface.c
cairo-rectangle.c
cairo-rectangular-scan-converter.c
cairo-region.c
cairo-rtree.c
cairo-scaled-font-subsets.c
cairo-scaled-font.c
cairo-script-surface.c
cairo-slope.c
cairo-spans-compositor.c
cairo-spans.c
cairo-spline.c
cairo-stroke-dash.c
cairo-stroke-style.c
cairo-surface-clipper.c
cairo-surface-fallback.c
cairo-surface-observer.c
cairo-surface-offset.c
cairo-surface-snapshot.c
cairo-surface-subsurface.c
cairo-surface-wrapper.c
cairo-surface.c
cairo-svg-surface.c
cairo-time.c
cairo-tor-scan-converter.c
cairo-tor22-scan-converter.c
cairo-toy-font-face.c
cairo-traps-compositor.c
cairo-traps.c
cairo-tristrip.c
cairo-truetype-subset.c
cairo-type1-fallback.c
cairo-type1-glyph-names.c
cairo-type1-subset.c
cairo-type3-glyph-surface.c
cairo-unicode.c
cairo-user-font.c
cairo-version.c
cairo-wideint.c
cairo-win32-debug.c
cairo-win32-device.c
cairo-win32-display-surface.c
cairo-win32-font.c
cairo-win32-gdi-compositor.c
cairo-win32-printing-surface.c
cairo-win32-surface.c
cairo-win32-system.c
cairo.c

14Sep/14Off

Introduction

In this post we'll cover TeX's handling of strings and explain .pool files. Using Web2C to build (Knuthian) TeX from Knuth's TeX.WEB source code involves many steps as explained elsewhere on this site. One of the initial steps when building TeX is combining Knuth's master source file (TeX.WEB) with a "change file" (TeX.CH) to produce a modified WEB source file (let's call it TeXk.WEB) which can be processed via the Web2C process. The TeX.CH change file applies many modifications to the master TeX.WEB source code – i.e., in preparation for conversion to C code and adding support for the kpathsea file-searching library. After the change file has been applied, the next step is to process our modified TeX.WEB (i.e., TeXk.WEB) via the TANGLE program. If TANGLE successfully parses our TeXk.WEB source code it will output two files (download links are provided for the inquisitive):

• TeXk.p: the source code of TeX (in Pascal).
• TeXk.pool: a file containing the string constants defined in TeXk.WEB

Here's a small fragment of TeXk.pool as produced during my Web2C process:

....
11expandafter
04font
09fontdimen
06halign
05hrule
12ignorespaces
10mathaccent
08mathchar
10mathchoice
08multiply
07noalign
10noboundary
08noexpand
04omit
07penalty
08prevgraf
05relax
06setbox
03the
06valign
07vcenter
05vrule
09save size
15grouping levels
08curlevel
09retaining
09restoring
05SAVE(
28Incompatible magnification (
02);
36 the previous value will be retained
58I can handle only one magnification ratio per job. So I've
59reverted to the magnification you used earlier on this run.
46Illegal magnification has been changed to 1000
52The magnification ratio must be between 1 and 32768.
...
*413816964

TeXk.pool consists of many lines of the format [string length][string text][end_of_line] and a final line containing *CHECKSUM, where CHECKSUM in the above example is 413816964. Once upon a time, .pool files had to be preserved as an external file for use when building .fmt files via INITEX but in 2008 this was changed and the .pool file is now compiled into the TeX binaries – I'll explain this below. For example, the following note is contained in more recent texmf.cnf files:

As of 2008, pool files don't exist any more (the strings are compiled into the binaries), but just in case something expects to find these:
TEXPOOL = .;$TEXMF/web2c
MFPOOL = ${TEXPOOL}
MPPOOL = ${TEXPOOL}

As you can see from the above fragment, the TeXk.pool file contains string constants for TeX's primitive commands plus all the strings contained in help/error messages that TeX outputs to the terminal and/or log file.
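The line format described above can be checked with a few lines of C. The following is only a sketch: the function names are hypothetical (they are not part of TANGLE or Web2C), it assumes the length prefix is always exactly two decimal digits, as in the fragment shown, and it assumes the end-of-line character has already been stripped from each line.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return the length declared by the two-digit prefix of a .pool line,
   or -1 if the line does not start with two digits. */
int pool_declared_length(const char *line)
{
    if (!isdigit((unsigned char)line[0]) || !isdigit((unsigned char)line[1]))
        return -1;
    return (line[0] - '0') * 10 + (line[1] - '0');
}

/* Return a pointer to the string text that follows the prefix, or NULL if
   the prefix is missing or disagrees with the actual length of the text. */
const char *pool_line_text(const char *line)
{
    int declared = pool_declared_length(line);
    if (declared < 0 || strlen(line + 2) != (size_t)declared)
        return NULL;
    return line + 2;
}

/* Return the checksum from the final "*CHECKSUM" line,
   or -1 if the line is not a checksum line. */
long pool_checksum(const char *line)
{
    char *end;
    long value;
    if (line[0] != '*' || !isdigit((unsigned char)line[1]))
        return -1;
    value = strtol(line + 1, &end, 10);
    return (*end == '\0') ? value : -1;
}
```

For example, pool_line_text("11expandafter") points at "expandafter", and pool_checksum("*413816964") returns 413816964.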

TeX's internal handling of strings

In addition to the string constants defined in TeXk.pool, TeX will, of course, encounter new strings – for example, when you define new macro names; consequently, TeX needs a way to store the string constants in TeXk.pool and the strings it encounters during its run-time processing of your TeX files. It should not be a surprise that TeX's internal handling of strings is achieved through methods designed to ensure portability.

From TeX.WEB: The TEX system does nearly all of its own memory allocation, so that it can readily be transported into environments that do not have automatic facilities for strings, garbage collection, etc., and so that it can be in control of what error messages the user receives... Control sequence names and diagnostic messages are variable-length strings of eight-bit characters. Since PASCAL does not have a well-developed string mechanism, TeX does all of its string processing by homegrown methods.

How does TeX use/store strings?

In vanilla C, a simple 8-bit string is an array of characters terminated by the null character ('\0'). TeX does not store its strings as individually named string variables but allocates a single large array and uses integer offsets into that array to identify strings (and calculate lengths). Here's how it works.

From TeX.WEB: The array |str_pool| contains all of the (eight-bit) ASCII codes in all of the strings, and the array |str_start| contains indices of the starting points of each string. Strings are referred to by integer numbers, so that string number |s| comprises the characters |str_pool[j]| for |str_start[s]<=j<str_start[s+1]|. Additional integer variables |pool_ptr| and |str_ptr| indicate the number of entries used so far in |str_pool| and |str_start|, respectively; locations |str_pool[pool_ptr]| and |str_start[str_ptr]| are ready for the next string to be allocated.

It is worth noting that when TANGLE produces Pascal code (from the WEB source) it strips out all underscores from variables defined in the WEB code. For example, the |str_pool| variable mentioned above is called strpool in the final C code produced from the Pascal.

After processing via Web2C, the WEB variables |str_pool|, |str_start|, |pool_ptr| and |str_ptr| are global variables declared as follows (near the start of TeX.C):

packedASCIIcode * strpool ;
poolpointer * strstart ;
poolpointer poolptr ;
strnumber strptr ;

The types packedASCIIcode and poolpointer are simply typedefs:

typedef unsigned char packedASCIIcode ;
typedef int integer;
typedef integer poolpointer ;

Stripping away all typedefs introduced by Web2C gives:

unsigned char* strpool ;
int* strstart ;
int poolptr ;
int strptr ;

To see what's going on, i.e., how TeX identifies a string, let's first look at the global variable strpool (practically all key variables are declared with global scope in TeX.C...!). During initialization (in INITEX mode, and when TeX is reading/unpacking a .fmt file to initialize a particular format (plain.fmt, latex.fmt etc)) the strpool and strstart variables are initialized as follows:

strpool = xmallocarray (packedASCIIcode , poolsize) ;
strstart = xmallocarray (poolpointer , maxstrings) ;

where xmallocarray is a #define:

/* Allocate an array of a given type. Add 1 to size to account for the fact that Pascal arrays are used from [1..size], unlike C arrays which use [0..size). */
#define xmallocarray(type,size) ((type*)xmalloc((size+1)*sizeof(type)))

and xmalloc(...) is a small utility function wrapped around the standard C function malloc(...).

A Pascal legacy: In many places within TeX.C you have to account for the fact that Pascal arrays start at index 1 but C arrays start at index 0. This is a consequence of the fact that Knuthian TeX is written in Pascal, not C.
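To make the "+1" concrete, here is a small sketch of the same idea. The names sketch_xmalloc and sketch_xmallocarray are hypothetical stand-ins, not the real Web2C/kpathsea code, and treating an allocation failure as fatal is an assumption about how such a wrapper typically behaves.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for xmalloc: malloc that never returns NULL.
   ASSUMPTION: running out of memory is treated as a fatal error. */
void *sketch_xmalloc(size_t size)
{
    void *p = malloc(size ? size : 1);
    if (p == NULL) {
        fprintf(stderr, "fatal: memory exhausted\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Same shape as xmallocarray: one extra element so that indices
   1..size can be used Pascal-style (index 0 simply goes unused). */
#define sketch_xmallocarray(type, size) \
    ((type *)sketch_xmalloc(((size_t)(size) + 1) * sizeof(type)))

/* Fill a[1..n] with 1..n, Pascal-style, and return the sum. */
int sum_one_based(int n)
{
    int *a;
    int i, total = 0;
    assert(n >= 0);
    a = sketch_xmallocarray(int, n);
    for (i = 1; i <= n; i++)
        a[i] = i;              /* a[n] is valid only because of the +1 */
    for (i = 1; i <= n; i++)
        total += a[i];
    free(a);
    return total;
}
```

Without the extra element, the write to a[n] in sum_one_based would land one past the end of the allocation.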

The allocation of memory for strpool uses an integer variable called poolsize: the value of poolsize is calculated at run-time from the value of other variables – including some variables whose value can be defined by settings in texmf.cnf. So, in essence:

strpool = (unsigned char *) malloc(sizeof(unsigned char)*(poolsize +1));

– which looks very much like one huge C string. And, of course, it is. strpool stores all TeX's strings BUT within strpool all strings are contiguous (stored end-to-end) without any delimiter characters between them (such as NULL, ('\0'), space, etc). Clearly, there needs to be a mechanism to define where each individual string starts and stops: i.e., to partition strpool into individual strings. That mechanism is the task of the integer array variable called strstart. Perhaps an example will make this clearer.

We can declare a variable fakestrpool as follows:

unsigned char fakestrpool[]="ThisismyfakeTeXstrpool";

Here, we have concatenated the 6 strings "This", "is", "my", "fake","TeX" and "strpool" into one long string. These 6 strings start at the following offsets in fakestrpool:

string 0 ("This"): offset 0
string 1 ("is"): offset 4
string 2 ("my"): offset 6
string 3 ("fake"): offset 8
string 4 ("TeX"): offset 12
string 5 ("strpool"): offset 15

So, if we define an array of integers, strstart, to record these offsets:

int strstart[6] ; // for 6 strings numbered 0 to 5

strstart[0]=0;
strstart[1]=4;
strstart[2]=6;
strstart[3]=8;
strstart[4]=12;
strstart[5]=15;

Then for some string identified by a number k (where 0 <= k <= 5), strstart[k] gives the offset into fakestrpool where the kth string starts. And this is exactly how TeX identifies strings: it identifies them using some integer value, k, say, where strstart[k] tells you where that string starts (in strpool) and allows the length of string number k, length(k), to be easily calculated using

length(k) = strstart[k + 1] - strstart[k]

For example, let us use this method to calculate the length of the string with number 4 (k=4) ("TeX" in our test array fakestrpool).

length(4) = strstart[5] - strstart[4]
length(4) = 15 - 12 = 3

Of course there is one minor complication – calculating the length of string 5, but we have other variables (poolptr and strptr) to solve issues like this.
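The worked example above can be written out as compilable C. This is a sketch rather than TeX's own code: the names fake_strstart, fake_strptr, fake_length and fake_equals are invented for the illustration. To deal with the last string, the offsets array is given one extra entry, fake_strstart[6] = 22, which records where the pool ends. That is the role poolptr plays in TeX, where strstart[strptr] always equals poolptr.

```c
#include <assert.h>
#include <string.h>

/* Six strings stored end-to-end with no delimiters between them. */
const unsigned char fakestrpool[] = "ThisismyfakeTeXstrpool";

/* Start offset of each string, plus one final entry marking the end of the
   pool (22 characters in total), so that length(5) can also be computed. */
const int fake_strstart[7] = { 0, 4, 6, 8, 12, 15, 22 };

/* Number of strings stored so far (the analogue of TeX's strptr). */
const int fake_strptr = 6;

/* length(k) = strstart[k + 1] - strstart[k] */
int fake_length(int k)
{
    assert(k >= 0 && k < fake_strptr);
    return fake_strstart[k + 1] - fake_strstart[k];
}

/* Return 1 if string number k has exactly the same characters as the
   ordinary NUL-terminated C string s, and 0 otherwise. */
int fake_equals(int k, const char *s)
{
    size_t len = (size_t)fake_length(k);
    return strlen(s) == len
        && memcmp(fakestrpool + fake_strstart[k], s, len) == 0;
}
```

With these definitions fake_length(4) is 3 and fake_length(5) is 7, and a string is compared by offset and length rather than by looking for a terminating '\0'.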

Back to .pool files

We started this discussion by noting that running the TANGLE program on TeXk.WEB produces two output files:

• TeXk.p: the source code of TeX (in Pascal).
• TeXk.pool: a file containing the string constants defined in TeXk.WEB

The next stage in the discussion covers the mechanism for processing .pool files that was introduced circa 2008. Prior to (circa) 2008, you needed to keep .pool files available (as part of the TeX distribution) as separate files for use whenever you ran INITEX to generate a new .fmt file. As noted, the contents of the .pool files are string constants generated by TANGLE from string constants defined in the main WEB source code of TeX. Given that those strings don't change (they are constants), it makes more sense to build them into the TeX executable file rather than having to access them each time a new .fmt file is created by INITEX. Part of the Web2C process now involves using a small utility program called makecpool.exe (on Windows) – makecpool.C was written by Taco Hoekwater. The input to makecpool.exe is the TeXk.pool file and the output is another C file (called texpool.C or similar) which defines a function called loadpoolstrings(...).

If you just want to see the inputs/outputs you can download the files I produced during my private build of Knuthian TeX:

• TeXk.pool: The .pool file input for makecpool.exe
• texpool.C: The C file output by makecpool.exe, defining the function loadpoolstrings(...).

Once you have generated texpool.C you no longer need the original TeXk.pool file because the contents of TeXk.pool are now stored within texpool.C, as an array of strings:

static const char *poolfilearr[] = {
"buffer size",
"pool size",
"number of strings",
"" "?" "?" "?",
"m2d5c2l5x2v5i",
"End of file on the terminal!",
"! ",
"(That makes 100 errors; please try again.)",
"" "? ",
"Type <return> to proceed, S to scroll future error messages,",
"R to run without stopping, Q to run quietly,",
"I to insert something, ",
...
...
...
NULL };

Of course, when you build TeX you will need to compile TeXk.C and texpool.C so that the function loadpoolstrings(...) is made available. The function loadpoolstrings(...) is called from TeX.C when TeX is in INITEX mode (i.e., the --ini option is set on the command line). Specifically, the loadpoolstrings(...) function is called by the function getstringsstarted(...) just after it has initialized the first 256 strings in TeX's main string container: the strpool array discussed above.

Modifying loadpoolstrings (...) to see what it does

The function loadpoolstrings(...) depends on a few of TeX's internal global variables and on the function makestring() (we'll discuss that shortly). Notably, we need to declare the following as extern in texpool.C:

extern int makestring ( void ) ;
extern unsigned char * strpool;
extern int poolptr;

Here is my slightly modified version of loadpoolstrings(...) which outputs a file called "datadump.txt" to list the strings and corresponding string numbers generated by makestring():

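/* Requires <stdio.h> (fopen, fprintf, fclose) and <string.h> (strlen). */
/* spare_size limits the total number of characters that may be loaded;
   it is written as a plain int here, in line with the stripped-down typedefs above. */
int loadpoolstrings (int spare_size) {
  const char *s;
  int g=0;
  FILE* dumpvals;
  int i=0,j=0;

  /* Open the dump file; give up (returning 0) if it cannot be created */
  dumpvals = fopen("datadump.txt", "w");
  if (dumpvals == NULL) return 0;

  while ((s = poolfilearr[j++])) {
    int l = strlen (s);
    fprintf(dumpvals, "//string \"%s\" = number ", s);
    i += l;
    /* Too many characters: close the dump file and report failure */
    if (i>=spare_size) { fclose(dumpvals); return 0; }
    /* Copy the characters into strpool, then let makestring() assign the string number */
    while (l-- > 0) strpool[poolptr++] = *s++;
    g = makestring();
    fprintf(dumpvals, "%d\n", g);
  }
  fclose(dumpvals);
  return g;
}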

Those who might be interested to see the contents of datadump.txt can download it here. In any case, here's a listing of the first few lines in datadump.txt:

//string "buffer size" = number 256
//string "pool size" = number 257
//string "number of strings" = number 258
//string "???" = number 259
//string "m2d5c2l5x2v5i" = number 260
//string "End of file on the terminal!" = number 261
//string "! " = number 262
...
...
//string "Using character substitution: " = number 1329

As you can see, the string number of the first string is 256 (i.e., the first string originally contained in TeXk.pool). Assuming that the string numbers start at 0 (they do), TeX has already initialized strings 0..255 before loading the strings from the TeXk.pool file. I hate to do this to you, dear reader, but can you guess what those 256 strings (0..255) might be?

The function makestring()

Here is TeX's makestring() function, which returns a string number after checking for overflow – i.e., after checking that TeX has enough space to store another string.

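strnumber makestring (void)
{
  register strnumber Result; makestring_regmem
  /* No free slot left in strstart: report an overflow. String number 258
     is "number of strings" (see the datadump.txt listing above). */
  if (strptr == maxstrings)
    overflow (258 , maxstrings - initstrptr) ;
  incr (strptr) ;
  /* The next string will start where the pool currently ends */
  strstart[strptr] = poolptr ;
  /* The string just completed has the previous number */
  Result = strptr - 1 ;
  return Result ;
}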
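To see how copying characters into strpool and then calling makestring() fit together, here is a self-contained miniature. It is a sketch under stated assumptions: all the sk_ names and the tiny sizes are invented for the illustration, and where TeX calls overflow(...) this version just returns -1.

```c
#include <assert.h>
#include <string.h>

#define SK_POOL_SIZE   64   /* hypothetical: deliberately tiny */
#define SK_MAX_STRINGS 8    /* hypothetical: deliberately tiny */

unsigned char sk_strpool[SK_POOL_SIZE];
int sk_strstart[SK_MAX_STRINGS + 1];   /* sk_strstart[0] is 0 */
int sk_poolptr = 0;                    /* first unused position in sk_strpool */
int sk_strptr = 0;                     /* number of strings made so far */

/* Same logic as makestring(): close off the characters appended since the
   last call and return their string number (or -1 where TeX would overflow). */
int sk_makestring(void)
{
    if (sk_strptr == SK_MAX_STRINGS)
        return -1;
    sk_strptr++;
    sk_strstart[sk_strptr] = sk_poolptr;
    return sk_strptr - 1;
}

/* Append the characters of s to the pool, then turn them into a string.
   Returns the new string number, or -1 if there is no room. */
int sk_add_string(const char *s)
{
    size_t len = strlen(s);
    int saved_poolptr = sk_poolptr;
    int k;
    if (len > (size_t)(SK_POOL_SIZE - sk_poolptr))
        return -1;
    memcpy(sk_strpool + sk_poolptr, s, len);
    sk_poolptr += (int)len;
    k = sk_makestring();
    if (k < 0)
        sk_poolptr = saved_poolptr;    /* undo the append if no slot was free */
    return k;
}

/* length(k) = strstart[k + 1] - strstart[k] */
int sk_length(int k)
{
    assert(k >= 0 && k < sk_strptr);
    return sk_strstart[k + 1] - sk_strstart[k];
}
```

Because sk_strstart[sk_strptr] is always set to the current sk_poolptr, the length formula works for every string that has been made, including the most recent one.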

Time to stop

Dear reader, writing this post has absorbed the greater part of my Sunday (14 September 2014) so you'll forgive me if I call it a day and leave it here – I'll fix any typos tomorrow :-). I hope it is of use, or interest, to someone "out there", somewhere.

2Sep/14Off

Introduction: From WEB to C, a bit of history/background

For some time I'd wanted to build TeX (the original Knuth version) from the WEB source code, but the relatively complex process to generate C from WEB meant it was one of those "tasks" I kept putting off. Well, back in early 2013 I finally decided to have a go and, eventually, I managed to create a Windows port/build of the Web2C executable and associated tools. Using those tools I was finally able to generate TeX.C from TeX.WEB and compile a working TeX executable. As part of that exercise I decided to remove the kpathsea path-searching library from my build of TeX, replacing it with a simple recursive directory search – based, at the moment, on compile-time options (which I plan to make fully configurable – probably with a Lua-based config file).

Why am I doing this... ?

I ask myself this on many occasions... Having "ported" LuaTeX to a native Windows build, I already have a TeX-based system to explore via Visual Studio (and LuaTeX is written in clean C, no need of Web2C). I guess it's mainly curiosity but there is also the fact I can "tweak + explore" some parts of Knuthian TeX and rapidly and easily re-compile it – the C code base of Knuthian TeX is a tiny fraction of the size of LuaTeX's and is thus far, far quicker to compile. I also don't want to risk doing something dumb and somehow wrecking my port/build of LuaTeX.

Poking around inside TeX.C

Although I have quite a collection of books on TeX, I've always found it really, really hard to understand how TeX – the language and program – actually works. So, for me, I find it much more instructive to watch how some bits of TeX actually work by stepping through the C code as TeX is executing – single-stepping via the Visual Studio interface. However, before attempting to do that I spent some time using regular expressions to "tidy up" the machine-generated C code produced by Web2C – the raw C code (produced by Web2C) is almost impossible to read/follow. At present, the "tidied C code" is still far from "easily legible code", but it's gradually improving, especially as I copy/paste explanatory text from TeX.WEB into TeX.C. Many parts of TeX (algorithms) are truly fiendishly complex (line-breaking, hyphenation, math typesetting, etc...) so I doubt I'll spend too much time probing those inner depths. Whilst being in awe at the sophistication and complexity of the algorithms inside TeX, I do confess that the C code is, in places, somewhat spaghetti-like. For example, there is a significant number of global variables and some individual globals are used for more than one purpose. Additionally, there is extensive use of "goto" statements, causing the code to jump all over the place.

Some confusion starts to ease

Despite the difficulty in following the execution of TeX.C, it is nevertheless fascinating to watch TeX actually run: Parsing the input file, acting on catcode values, creating tokens, defining macros, building boxes, running the page-builder and shipping out pages. Although I'm only just starting to explore TeX via C code, it has, for me, started to lift some of the confusion surrounding the TeX language – even if I have barely scratched the surface of this truly extraordinary program.

A new series of posts...?

My plan is to write a series of short, but fairly frequent, posts based on some aspects of TeX's internals: To relate/use those internals to explain, with examples, some parts of the TeX language semantics. At least, in areas that I found tricky to understand and ones that, I hope, might be instructive/useful for others.