This post is, once again, an aide-mémoire to record a work-in-progress: porting the tools that convert Knuth's original Pascal-based WEB source to C – to create a native build of
fixwrites.exe and other tools using Microsoft's Visual Studio (and not using pipes). My apologies if this post is a little unstructured but the whole task is somewhat convoluted, which may be reflected in my writing style for this post! However, I'd like to record it whilst it is fresh in my memory.
Why would anyone want to do this when there are ready-made, reliable, TeX distributions freely available? Good question. Well, for me, it's nothing more than pure curiosity – and the fact that most British TV programs are now such mind-numbing drivel that I might as well do something productive in the evenings!
Join TUG: Just as an aside, I'm a member of the TeX User Group, TUG, so if you too would like to support TeX why not consider joining?
Another reason for writing this post is that I could not find much documentation on how to build
Web2C.exe from source code – apart from these notes by Timothy Murphy, detailing the process for a Macintosh-based port. Even though they were written in 1992 they were extremely helpful in filling in some of the details, so a belated thank you to Timothy Murphy – much of this post draws inspiration from that document. Piecing together the Web2C build process has been something of a "programming jigsaw" – there are still gaps in my understanding but, I think, I can see the big picture even if it's still a little hazy in some areas.
The Big Picture
The source files for TeX, and other TeX-related programs and utilities, are written using Professor Donald Knuth's literate programming methodology. In essence, the program code (in Pascal) and documentation of the source code (in TeX) are contained within a single file, with extension
.web. For example, Professor Knuth's source code of the latest version of TeX is contained in a file called
tex.web. Similarly, within the TeXLive repository (see a previous post) or on CTAN, you can find the WEB source code for the latest versions of other programs; for example:
- bibtex.web: the source code/documentation of
BiBTeX, for formatting and producing reference lists, as widely used within academic journal papers.
- mf.web: the source code/documentation of MetaFont.
- patgen.web: the source code/documentation of
patgen which "... takes list of hyphenated words and generates a set of patterns that can be used by the TeX 82 hyphenation algorithm."
- tangle.web: the source code/documentation of
tangle, which converts a WEB file to a Pascal file (i.e., extracts the source code in Pascal, not in C – that's why Web2C exists).
- weave.web: the source code/documentation of
weave, which converts a WEB file to TeX (i.e., extracts the documentation of the program's Pascal source code).
and other programs/utilities such as
tftopl.web and so forth.
What's in a name: tangle, web and weave? I've not researched to find out, but I cannot help thinking that Professor Knuth drew inspiration from Sir Walter Scott when naming these programs. Scott's poem Marmion contains the lines "O, what a tangled web we weave, when first we practise to deceive". Maybe these programs are as literary as they are literate?
TeXLive as the source of the files for building
The files I reference throughout this post can be downloaded via SVN from the TeXLive repository. If you want to browse the TeXLive repository, using the TortoiseSVN program on Windows, this post may be of help. The TeXLive folders you'll need to access for acquiring the various files I mention in this post are:
- svn://tug.org/texlive/trunk/Build/source/texk/web2c: this folder contains, for example,
tangleboot.pin (see below) and all the
*.web files listed above, plus many other essential files.
- svn://tug.org/texlive/trunk/Build/source/texk/web2c/web2c: this folder contains the source files needed to build the actual
Web2C.exe program. Note carefully it does not contain a file called
Web2C.c, more on that below.
TeXLive has an advanced build-process for compiling/building all the tools and software it contains and I, for one, am in awe of the skills and expertise of its maintainers. In describing my explorations of building
Web2C.exe as a Windows-based executable you need to realize that I am taking the source code files of
Web2C.exe out of their "natural build environment". What do I mean by this? Building the Web2C executable program is usually part of the much bigger TeXLive build/compilation process so you should be prepared for a little extra complexity to create
Web2C.exe as a "standalone" Windows program. Note that "standalone" is in quotes because converting WEB-generated Pascal into C code requires other tools in addition to
Web2C.exe: it is not fully accomplished by
A note about Kpathsea
The Kpathsea (path-searching) C library is an integral part of most TeX-related software and the Web2C C source files
#include a number of Kpathsea headers. However, for my own purposes/experiments I've decided to decouple my build of the
Web2C.exe executable from the need to include Kpathsea's headers – the resulting C files generated by
Web2C.exe will, of course, still depend on Kpathsea. If you grab the Web2C source files (see below) then "out of the box" you'll need to checkout the Kpathsea library from:
I've simply not got the time to document everything I had to do to decouple Kpathsea when building
Web2C.exe. It mainly involved commenting out various
#include lines that pulled in Kpathsea headers and placing a few
#define statements into my local version of
web2c.h – plus creating some typedefs and adding a few macros. If you're an experienced C programmer it is unlikely to present difficulties. As mentioned, this post describes a work-in-progress to satisfy my own curiosity and is meant to share a few of the things I've learnt, should they be useful to anyone as a starting point for their own work.
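To give a flavour of the kind of stand-ins this involves, here is a minimal sketch, not the actual changes I made. It assumes that the Web2C sources expect string-like typedefs and "never fail" wrappers around malloc and fopen; the names shim_fatal, shim_xmalloc and shim_xfopen are hypothetical and are not the Kpathsea API.

```c
/* Hypothetical stand-ins for things the Web2C sources would normally
   pick up from Kpathsea headers. All shim_* names are illustrative. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef char *string;              /* assumed convenience typedefs */
typedef const char *const_string;

/* Report a fatal error and stop, instead of returning to the caller. */
void shim_fatal(const char *what, const char *detail)
{
    fprintf(stderr, "web2c: %s: %s\n", what, detail);
    exit(EXIT_FAILURE);
}

/* malloc wrapper that exits rather than returning NULL. */
void *shim_xmalloc(size_t size)
{
    void *p = malloc(size ? size : 1);
    if (p == NULL)
        shim_fatal("out of memory", "malloc failed");
    return p;
}

/* fopen wrapper that exits rather than returning NULL. */
FILE *shim_xfopen(const_string name, const_string mode)
{
    FILE *f;
    assert(name != NULL && mode != NULL);
    f = fopen(name, mode);
    if (f == NULL)
        shim_fatal("cannot open file", name);
    return f;
}
```

The point of wrappers like these is that calling code never has to check for NULL, which keeps a local header small when the real library is not linked in.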
Web2C: so what is it?
Let me be clear that when I refer to Web2C I am referring to the executable program which undertakes the first (main) step in converting Pascal code into C. So let's now take a look at the details, starting with a summary of "Where are we?"
Where are we?
The starting point for generating C code is to extract the Pascal code from WEB source files and that is accomplished using the
tangle program. However, where do we get a working
tangle program from to start with – do we have a chicken and egg problem?
tangle is itself distributed in WEB source code (
tangle.web), so if I need
tangle to extract tangle's source code from
tangle.web, how do I create a working tangle program? Well, of course, this is solved by the distribution of tangle's Pascal code in a file called
tangleboot.pin within the Web2C directory of the TeXLive repository (see above). In essence,
tangleboot.pin lets you "bootstrap" the whole Web2C process by creating a working
tangle.exe which you can use to generate the Pascal from WEB source files. Hence the name tangleboot.pin.
So, how do I go from
tangleboot.pin to a working tangle.exe? You need to build
Web2C.exe and some associated utility programs (e.g.,
Where are the
Web2C.exe source files?
As mentioned above, the TeXLive folder containing the source files needed to build
The C source files you need to compile/build
Some notes on these files
These C files
#include a number of header files from the TeXLive distribution, notably from the Kpathsea library, so you should definitely look through them to determine any additional files you need.
web2c-lexer.c are worthy of some explanation because they are the core files which drive the Pascal --> C conversion process. However, these two C source files are not hand-coded but are generated from two further source files with similar names. If you look among the source files you will also notice these two additional files:
What are these files with similar names? As you may infer from their names, these files are the specifications for a lexical analyser and a parser, and they require additional tools to process them:
web2c-lexer.l --> web2c-lexer.c using a tool called flex.
web2c-parser.y --> web2c-parser.c + web2c-parser.h using a tool called bison.
Are bison/flex available for Windows?
Fortunately they are and, at the time of writing (February 2013), you can download Windows ports of
bison 2.7 and
flex 2.5.37 from http://sourceforge.net/projects/winflexbison/. The executables are called
win_flex.exe respectively. The
win_flex.exe port of flex adds an extra command-line switch (
--wincompat) so that the C code it generates uses the standard Windows header
io.h instead of
unistd.h (which is used on Linux). You can also download older versions of bison and flex for Windows from the GnuWin32 project.
I have not yet tried to use the code generated by
win_bison.exe but to the best of my (current) knowledge the command-line options you need are:
win_bison -y -d web2c-parser.y to generate the parser (you'll get different file names on output:
win_flex --wincompat web2c-lexer.l to generate the lexical analyser (you'll get a different file name on output:
You need more than just
Assuming that you successfully build
Web2C.exe, it is still not the end of the story. Although
Web2C.exe does the bulk of the work in converting the Pascal to C, some initial pre-processing of the Pascal source file is needed before you can run it through
Web2C.exe, and some further post-processing of the C code output by
Web2C.exe is also needed. The details of how these pre- and post-processing steps actually work are contained within an important BASH shell script called
convert (it has no extension) –
convert is located within the TeXLive folder containing the Web2C source files. I readily confess that I know very little about Linux shell scripting so if you are well-versed in shell scripts no doubt you can easily understand what is going on in the
convert file. However, here are pointers to get you started.
Pre-processing: adding the
*.defines files to the Pascal file
Before you can actually run
Web2C.exe on the Pascal file generated from WEB sources you need to concatenate the Pascal source file with some files having the extension "
.defines": you add these files to the start of the Pascal file before running
Web2C.exe. There are several
.defines files contained in the Web2C source directory including:
The convert script checks which program (TeX, MetaFont, BiBTeX etc.), and which of its options, is being built and concatenates the appropriate
*.defines file(s) to the start of the corresponding Pascal file. At this time, I don't quite fully understand how/why these files are needed, but for the full details you need to read
convert. By way of an example, when processing
tangleboot.pin I added the file
common.defines to the beginning of
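Since I prefer physical files to pipes, the concatenation step can be done with a few lines of C. What follows is a minimal sketch of that idea rather than anything taken from the convert script; the function names append_file and prepend_defines are hypothetical.

```c
/* Sketch: write a .defines file followed by a Pascal file into one
   combined file on disc, ready to be read by the converter. */
#include <assert.h>
#include <stdio.h>

/* Append the whole of src_path to the open stream dst.
   Returns 0 on success, -1 on an open, read or write failure. */
int append_file(FILE *dst, const char *src_path)
{
    char buf[4096];
    size_t n;
    FILE *src;

    assert(dst != NULL && src_path != NULL);
    src = fopen(src_path, "rb");
    if (src == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        if (fwrite(buf, 1, n, dst) != n) {
            fclose(src);
            return -1;
        }
    }
    if (ferror(src)) {
        fclose(src);
        return -1;
    }
    fclose(src);
    return 0;
}

/* Create out_path containing defines_path first, then pascal_path. */
int prepend_defines(const char *defines_path, const char *pascal_path,
                    const char *out_path)
{
    int status;
    FILE *out = fopen(out_path, "wb");

    if (out == NULL)
        return -1;
    status = append_file(out, defines_path);
    if (status == 0)
        status = append_file(out, pascal_path);
    if (fclose(out) != 0)
        status = -1;
    return status;
}
```

The files are opened in binary mode so that Windows does not translate line endings while copying.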
Web2C.exe's output is not quite pure C source code – it may still contain some fragments of Pascal which need a specialist post-processing step to fully convert them to C: enter
Web2C.exe's C output to "...convert Pascal write/writeln's into fprintf's or putc's" (see
Upon reading the
convert script, and when I first ran
Web2C.exe, it became readily apparent that the whole Pascal --> C tool chain (driven by
convert) communicates using pipes with stdout/stderr. The output of one program is "piped" into the input to another, rather than writing the data out to a physical disc file and then reading it back in. My personal preference, certainly whilst learning, is to output data to a file so that I can capture what's going on.
Without going into too much detail, I needed to make a number of changes in
main.c so that the lexical analyzer
web2c-lexer.c was set to read its data from a disc file rather than through pipes/stdin. The
FILE* variable you need to set/define is called
yyin. For example, within
main.c there is a function called
initialize () which can be used to set
yyin. For example:
void initialize (void)
{
  register int i;
  for (i = 0; i < hash_prime; hash_list[i++] = -1)
    ;
  yyin = xfopen("your_path_to\\tangleboot.p", "r");
  ...
  ...
}
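Hard-coding the path is fine for experiments, but a slightly more flexible variant keeps the original stdin behaviour when no file is given. The following is only a sketch of that idea: scanner_input is a hypothetical stand-in for the scanner's yyin, and select_input is a name I have made up for illustration.

```c
/* Sketch: read from a disc file when a path is supplied, else stdin. */
#include <assert.h>
#include <stdio.h>

FILE *scanner_input = NULL;   /* hypothetical stand-in for yyin */

/* Returns 0 on success, -1 if the file cannot be opened (in which
   case the current input is left unchanged). */
int select_input(const char *path)
{
    FILE *f;

    if (path == NULL) {           /* no file given: keep the pipe-friendly default */
        scanner_input = stdin;
        return 0;
    }
    assert(path[0] != '\0');      /* an empty path is a caller error */
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    scanner_input = f;
    return 0;
}
```

With something like this, the same executable can be run inside a piped tool chain or pointed at a file for debugging.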
In addition, within
main.c there's a small function called
normal () which does the following:
void normal (void)
{
  out = stdout;
}
The normal () function is called from within
web2c-parser.c to set the output file (
FILE *out) to stdout. At present, I'm not sure precisely why this is done, but I guess it is part of the piping between programs as driven by the
convert process. For example, code within
convert uses sed (the stream editor).
Other output redirections happen in
web2c-parser.c and you can search for these by looking for
out = 0. Tracking down and locating these output redirections certainly helped me to better understand the flow of the programs.
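To capture the output on disc instead of stdout, the same idea of a switchable FILE pointer can be extended. This is a minimal sketch under my own assumptions, not code from main.c or web2c-parser.c: out_stream is a hypothetical counterpart of the out variable, and the three function names are invented for illustration.

```c
/* Sketch: one output pointer that can target stdout or a disc file. */
#include <assert.h>
#include <stdio.h>

FILE *out_stream = NULL;      /* hypothetical counterpart of `out` */

/* Send subsequent output to stdout. */
void use_stdout(void)
{
    out_stream = stdout;
}

/* Send subsequent output to a file. Returns 0 on success, -1 on failure. */
int use_file(const char *path)
{
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    out_stream = f;
    return 0;
}

/* Close a file target (never stdout) and fall back to stdout. */
void finish_output(void)
{
    if (out_stream != NULL && out_stream != stdout)
        fclose(out_stream);
    out_stream = stdout;
}
```

Every place that writes through the pointer stays the same; only the code that assigns it needs to change.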
This post is a little disjointed in places and light on detail in a number of areas, reflecting my own (currently) incomplete understanding of the relatively complex processes involved in converting WEB/Pascal to C. Nevertheless, I hope that it is of some use to someone, at some point. As my understanding develops I'll try to fill in the gaps with future posts.