2 Sep 2014

# Introduction: From WEB to C, a bit of history/background

For some time I'd wanted to build TeX (the original Knuth version) from the WEB source code, but the relatively complex process of generating C from WEB meant it was one of those tasks I kept putting off. Back in early 2013 I finally decided to have a go and, eventually, I managed to create a Windows port/build of the Web2C executable and its associated tools. Using those tools I was able to generate TeX.C from TeX.WEB and compile a working TeX executable. As part of that exercise I decided to remove the kpathsea path-searching library from my build of TeX, replacing it with a simple recursive directory search. At the moment the search is controlled by compile-time options, which I plan to make fully configurable, probably via a Lua-based config file.
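
To give a flavour of what a "simple recursive directory search" can look like, here is a minimal sketch. It is not the code from my build: the function name `find_file` is made up for illustration, and the sketch uses the POSIX directory API (`opendir`/`readdir`/`lstat`) to stay short, whereas a native Windows build would use the Win32 equivalents. It returns the first match in whatever order the directory is read, and it does not follow symbolic links.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: search `dir` and all of its subdirectories for a
 * file called `name`. On success, copy the full path into `out` (capacity
 * `cap` bytes) and return 1; return 0 if no matching file was found. */
int find_file(const char *dir, const char *name, char *out, size_t cap)
{
    assert(dir != NULL && name != NULL && out != NULL && cap > 0);

    DIR *d = opendir(dir);
    if (d == NULL)
        return 0;                      /* missing or unreadable: skip it */

    int found = 0;
    struct dirent *entry;
    while (!found && (entry = readdir(d)) != NULL) {
        /* Never recurse into the directory itself or its parent. */
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        char path[4096];
        int n = snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;                  /* path too long for the buffer */

        struct stat st;
        if (lstat(path, &st) != 0)     /* lstat: do not follow symlinks */
            continue;

        if (S_ISDIR(st.st_mode)) {
            found = find_file(path, name, out, cap);   /* descend */
        } else if (strcmp(entry->d_name, name) == 0 && (size_t)n < cap) {
            memcpy(out, path, (size_t)n + 1);          /* include the NUL */
            found = 1;
        }
    }
    closedir(d);
    return found;
}
```

A caller would pass a root directory chosen at compile time, for example `find_file(TEXMF_ROOT, "plain.tex", buf, sizeof buf)`, where `TEXMF_ROOT` is a hypothetical macro standing in for one of the compile-time options.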

## Why am I doing this...?

I ask myself this on many occasions... Having "ported" LuaTeX to a native Windows build, I already have a TeX-based system to explore via Visual Studio (and LuaTeX is written in clean C, so it has no need of Web2C). I guess it's mainly curiosity, but there is also the fact that I can "tweak + explore" some parts of Knuthian TeX and quickly re-compile it: the C code base of Knuthian TeX is a tiny fraction of the size of LuaTeX's and is thus far, far quicker to compile. I also don't want to risk doing something dumb and somehow wrecking my port/build of LuaTeX.

# Poking around inside TeX.C

Although I have quite a collection of books on TeX, I've always found it really, really hard to understand how TeX (the language and the program) actually works. For me, it is much more instructive to watch how some bits of TeX work by single-stepping through the C code in Visual Studio as TeX is executing. However, before attempting that I spent some time using regular expressions to "tidy up" the machine-generated C code: the raw output of Web2C is almost impossible to read or follow. At present the tidied C code is still far from easily legible, but it's gradually improving, especially as I copy/paste explanatory text from TeX.WEB into TeX.C.

Many of TeX's algorithms are truly fiendishly complex (line-breaking, hyphenation, math typesetting, etc.), so I doubt I'll spend too much time probing those inner depths. Whilst I'm in awe of the sophistication and complexity of the algorithms inside TeX, I do confess that the C code is, in places, somewhat spaghetti-like. For example, there are a significant number of global variables, and some individual globals are used for more than one purpose. Additionally, there is extensive use of "goto" statements, causing the code to jump all over the place.

## Some confusion starts to ease

Despite the difficulty in following the execution of TeX.C, it is nevertheless fascinating to watch TeX actually run: parsing the input file, acting on catcode values, creating tokens, defining macros, building boxes, running the page-builder and shipping out pages. Although I'm only just starting to explore TeX via its C code, doing so has started to lift some of my confusion surrounding the TeX language, even if I have barely scratched the surface of this truly extraordinary program.
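
As a taste of the "acting on catcode values, creating tokens" step, here is a heavily simplified sketch of catcode-driven tokenizing. It is my own illustration, not code taken from TeX.C: the names `struct token`, `catcode_of` and `tokenize` are invented, the catcode table is hard-wired to a few plain TeX-like defaults, and the sketch ignores much of what the real scanner does (end-of-line handling, comments, collapsing runs of spaces, the `^^` notation, and so on). The numeric catcode values themselves are TeX's own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A subset of TeX's category codes (the numeric values are TeX's own). */
enum catcode {
    CAT_ESCAPE = 0, CAT_BEGIN_GROUP = 1, CAT_END_GROUP = 2,
    CAT_SPACE = 10, CAT_LETTER = 11, CAT_OTHER = 12
};

/* Hypothetical token record, used here only for readability. */
struct token {
    int  is_control_sequence;  /* 1 for \name, 0 for a character token */
    int  cat;                  /* catcode of a character token, else -1 */
    char text[32];             /* control sequence name, or the character */
};

/* Hard-wired catcodes for the few characters this sketch cares about. */
static int catcode_of(unsigned char c)
{
    if (c == '\\') return CAT_ESCAPE;
    if (c == '{')  return CAT_BEGIN_GROUP;
    if (c == '}')  return CAT_END_GROUP;
    if (c == ' ')  return CAT_SPACE;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return CAT_LETTER;
    return CAT_OTHER;
}

/* Split `input` into at most `max` tokens; return the number produced. */
size_t tokenize(const char *input, struct token *out, size_t max)
{
    assert(input != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)input;
    size_t n = 0;

    while (*p != '\0' && n < max) {
        struct token *t = &out[n++];
        size_t len = 0;

        if (catcode_of(*p) == CAT_ESCAPE) {
            p++;                                   /* skip the escape char */
            t->is_control_sequence = 1;
            t->cat = -1;
            if (catcode_of(*p) == CAT_LETTER) {
                /* Control word: the longest possible run of letters. */
                while (catcode_of(*p) == CAT_LETTER) {
                    if (len < sizeof t->text - 1)
                        t->text[len++] = (char)*p;
                    p++;
                }
                /* Spaces after a control word are skipped. */
                while (catcode_of(*p) == CAT_SPACE)
                    p++;
            } else if (*p != '\0') {
                t->text[len++] = (char)*p++;       /* control symbol */
            }
            t->text[len] = '\0';
        } else {
            /* Any other character becomes a (character, catcode) token. */
            t->is_control_sequence = 0;
            t->cat = catcode_of(*p);
            t->text[0] = (char)*p++;
            t->text[1] = '\0';
        }
    }
    return n;
}
```

Feeding this the input `\def\x{a1}` produces six tokens: the control sequences `def` and `x`, followed by `{` (catcode 1), `a` (catcode 11), `1` (catcode 12) and `}` (catcode 2). The point of the sketch is simply that the same character can become a different kind of token depending on its catcode, which is what makes watching the real scanner in a debugger so instructive.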

## A new series of posts...?

My plan is to write a series of short, but fairly frequent, posts based on some aspects of TeX's internals, using those internals to explain, with examples, some parts of the TeX language's semantics. I'll focus on areas that I found tricky to understand and that, I hope, might be instructive or useful for others.