Technique to make Web2C/tangle put comments in the C code generated for TeX

Introduction

Over the last couple of evenings I’ve been looking at the C code for TeX generated by the tangle and Web2C conversion process. By default, the Web2C conversion process generates C source code that is almost completely devoid of comments, and symbol strings are converted to numbers (etc.), making the C source nearly impossible to read. However, by making a small change to the flex-generated source code (web2c-lexer.c), together with some careful use of regular expressions on the WEB sources (and/or some manual editing), you can get a lot of comments put into the C source. Note: I’ve not yet explored whether it is possible to use the changefile method to achieve the same (or similar) results. Here’s an outline of my experimental technique.

Outline of the technique

Naturally, via the literate programming methodology, the WEB source files for TeX contain a full description of the TeX program. The Pascal code in the WEB source files is full of comments or short descriptions (enclosed in braces {....}) which, if preserved in the Web2C-generated C code, would make it much more readable. However, those Pascal comments are stripped out by tangle (but are used by weave); consequently, the Pascal generated by tangle, and fed into Web2C.exe, does not contain any useful comments. I say “useful” because the Pascal does contain some line-number comments, but these are not really that helpful and they are removed by the comment-handling code in web2c-lexer.c.

Comments in WEB files

So, what are we looking to do? In essence, we need a way to convince tangle to output comments into the Pascal code it generates and find a way to ensure that those comments are processed and passed into the C code by Web2C.

Web2C and Pascal comments

Caveat: I am not a Pascal programmer and have no desire to become one! However, all you need to know is that within the Pascal generated by tangle the comments are simply enclosed in braces, like this: {This is a comment in Pascal}. These comments are filtered out by web2c-lexer.c. Another caveat: be extremely careful when making any changes whatsoever to the lexer C code (or the .l sources); you can break things very badly (hmmm, wonder how I found that out…). I cannot stress enough the importance of being very, very careful in making any changes to web2c-lexer.l/web2c-parser.y or web2c-lexer.c/web2c-parser.c: these lexical-analyser and parser sources are critical to the C-generation process. OK, I think I made the point. The following description probably deserves nomination for an “Ugly Hack Award” and, no doubt, a flex/bison expert (which I’m definitely not) could design an elegant solution to incorporate comment-handling in proper context within the parsing process. OK, enough self-flagellation, let’s move on.

If you look in web2c-lexer.l, the code which handles comments is simply:

"{" { while (webinput() != '}'); }

After running flex on web2c-lexer.l this becomes (in web2c-lexer.c):


case 2:
YY_RULE_SETUP
#line 53 "web2c-lexer.l"
{ while (webinput() != '}'); }
YY_BREAK

Basically, the lexical analyser is stripping out things like {This is a comment from Pascal}. To get comments into the C code generated for TeX you’ll need to modify the lexer code so that, instead of skipping comments, it processes them to generate C comments such as /* This is a comment from Pascal */. There are a few points to consider here. Firstly, you’ll need to experiment to see exactly where the comments end up in your C code. Due to the “Ugly Hack” approach, we’re not paying any real attention to the “context” of where we are in the parsing process when outputting our comments; again, a proper flex/bison implementation is required. For example, by the time your comment is seen by the lexer a newline (\n) may already have been output, so your comments may end up on a new line. That is easily fixed by some manual tidy-up of the C code (or by running regular-expression tools on the Web2C-generated C source code). So, expect to do some trial and error to see what happens.
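As a rough illustration of the lexer change described above, here is a minimal sketch in C. It is not the actual Web2C code: the function name pascal_comment_to_c and its buffer-based interface are invented purely for illustration, and it works on an in-memory string rather than pulling characters from webinput(). It also assumes the convention, used in the example later in this post, that the comments worth keeping are wrapped in single quotes; everything else is dropped, as the original lexer rule does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Web2C). `src` points just past the
   opening '{' of a Pascal comment. If the comment has the form {'text'},
   the text is written to `dst` wrapped in C comment delimiters and 1 is
   returned. Any other comment (e.g. tangle's line-number comments)
   leaves `dst` empty and returns 0, so it is dropped as before.
   `*end` is set just past the closing '}', or to the end of the string
   if the comment is unterminated. */
int pascal_comment_to_c(const char *src, const char **end,
                        char *dst, size_t dstsize)
{
    const char *close;
    size_t len, i;

    assert(src != NULL && end != NULL && dst != NULL && dstsize > 0);
    dst[0] = '\0';

    close = strchr(src, '}');
    if (close == NULL) {              /* unterminated: consume the rest */
        *end = src + strlen(src);
        return 0;
    }
    *end = close + 1;
    len = (size_t)(close - src);

    /* Only comments wrapped in single quotes are the ones we inserted. */
    if (len < 2 || src[0] != '\'' || src[len - 1] != '\'')
        return 0;
    src += 1;                         /* drop the two quote characters */
    len -= 2;

    /* Room is needed for both comment delimiters plus the final NUL. */
    if (len + 5 > dstsize)
        return 0;

    memcpy(dst, "/*", 2);
    memcpy(dst + 2, src, len);
    /* A star followed by a slash inside the text would end the C comment
       early, so blank out the slash. */
    for (i = 0; i + 1 < len; i++)
        if (dst[2 + i] == '*' && dst[2 + i + 1] == '/')
            dst[2 + i + 1] = ' ';
    memcpy(dst + 2 + len, "*/", 3);   /* closing delimiter plus the NUL */
    return 1;
}
```

In a real lexer action the body of the comment would first have to be collected from webinput() up to the closing brace. Where the resulting text then gets written is exactly the “context” problem mentioned above, so this part still needs experimentation.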

Getting comments into WEB and surviving tangle

As noted, tangle strips out comments in the WEB sources and they don’t even reach the Pascal code it outputs. So, can we coerce tangle to preserve comments in WEB sources and put them in the Pascal for Web2C.exe to process/output? A quick reading of The WEB User Manual implies that there are two ways to get text output to the Pascal source code produced by tangle:

  • use “control-text” such as @=your comment text here@> which causes the text to be output verbatim into the Pascal code, or
  • use “meta-comments” such as @{your comment text here @} which, in the Pascal, results in a standard comment such as {your comment text here}.

Robby the Robot says Danger!

Sorry for the reference to Robby the Robot, indulge me… Seriously, though, if you make edits to the WEB source to put in “control-text” or “meta-comments” you can very easily foul up tangle’s parser and break tangle’s conversion process pretty badly. As yet, I’m not able to give precise rules on where it is safe to add “control-text” or “meta-comments” (I’m still experimenting), so I suggest you read The WEB User Manual to understand a little more about WEB syntax before attempting it.

Mind the pool file: Be careful inserting/using text with double quotes "..." because it can trigger tangle’s parser to output that text in the tex.pool file, which you don’t want. I used single quotes '...' and that seems to be safe(r). I can’t recall exactly what I did that caused this to happen, but just be sure to check that the .pool file does not become polluted with any of the text you insert into the WEB sources.

Getting to the point

So far we’ve seen that to get comments into the C source code we need to:

  1. modify the behaviour of web2c-lexer.c and tell it (selectively) not to skip every Pascal comment {...} (see use of '...', below).
  2. coerce tangle to preserve comments and output them into the Pascal source so that Web2C.exe sees them and the code in web2c-lexer.c can process them.

An example

Within the TeX WEB source code is a function which initializes TeX’s “primitives”. Here’s a small extract of the raw WEB source code:


@ The symbolic names for glue parameters are put into \TeX's hash table
by using the routine called |primitive|, defined below. Let us enter them
now, so that we don't have to list all those parameter names anywhere else.

@<Put each of \TeX's primitives into the hash table@>=
primitive("lineskip",assign_glue,glue_base+line_skip_code);@/
@!@:line_skip_}{\.{\\lineskip} primitive@>
primitive("baselineskip",assign_glue,glue_base+baseline_skip_code);@/
...

When this is translated to C the result looks something like this:


...
primitive ( 381 , 75 , 24527 ) ;
primitive ( 382 , 75 , 24528 ) ;
...

Not a string or comment in sight. tangle has also converted everything into integers: "lineskip" becomes 381 … single-stepping through this C code with a debugger is not my idea of fun. So, what to do?

If you look at the form of code like

primitive("lineskip",assign_glue,glue_base+line_skip_code);

it is very amenable to processing with regular expressions. What you can do, for example, is pre-process the WEB source with your favourite regex tool to add “meta-comments” that will reach the Pascal and (with your modified lexer) make it into the C code. For example (should all be on one line):

primitive("lineskip",assign_glue,glue_base+line_skip_code); @{'lineskip,assign_glue,glue_base+line_skip_code'@};@/

Here we added the “meta-comment”

@{'lineskip,assign_glue,glue_base+line_skip_code'@}

just after the original Pascal code. Note that I have used single quotes '...' to delimit the text simply because I want the modified lexer to be able to tell the comments I introduced apart from any others it scans. To cut a long story short, through this technique you end up with C code that looks like this:


primitive ( 381 , 75 , 24527 ) ; /*lineskip,assign_glue,glue_base+line_skip_code*/
primitive ( 382 , 75 , 24528 ) ; /*baselineskip,assign_glue,glue_base+baseline_skip_code*/

Maybe not beautiful, but at least you now know what (some of) those tangle-generated numbers represent.
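For what it’s worth, the pre-processing step on the WEB source could be sketched along the following lines. This is only an illustration under stated assumptions: annotate_primitive is a made-up name, it uses plain string searching instead of a regex tool, and it only recognises lines that begin with primitive(" exactly as in the extract above. It reproduces the annotated line shown earlier, including the extra semicolon after the meta-comment, and it deliberately refuses to annotate anything containing quotes, braces or @ signs, since those are the characters most likely to upset tangle (that restriction is my own cautious guess, not a rule from The WEB User Manual).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy one WEB source line to `out`, appending a
   quoted meta-comment if the line starts with a primitive("...") call.
   Returns 1 if annotated, 0 if copied unchanged, -1 if `out` is too
   small (in which case `out` holds a truncated copy). */
int annotate_primitive(const char *line, char *out, size_t outsize)
{
    static const char prefix[] = "primitive(\"";
    const size_t plen = sizeof prefix - 1;
    const char *name = NULL, *name_end = NULL, *args_end = NULL, *p;
    int n, ok;

    assert(line != NULL && out != NULL && outsize > 0);

    ok = strncmp(line, prefix, plen) == 0;
    if (ok) {
        name = line + plen;
        name_end = strchr(name, '"');        /* closing quote of the name */
        args_end = name_end ? strstr(name_end, ");") : NULL;
        ok = args_end != NULL;
    }
    /* Be conservative: skip lines whose arguments contain characters
       that could confuse tangle or the single-quote convention. */
    for (p = name; ok && p < args_end; p++)
        if (*p == '\'' || *p == '{' || *p == '}' || *p == '@' ||
            (*p == '"' && p != name_end))
            ok = 0;

    if (!ok) {
        n = snprintf(out, outsize, "%s", line);
        return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
    }
    n = snprintf(out, outsize, "%.*s; @{'%.*s%.*s'@};%s",
                 (int)(args_end + 1 - line), line,      /* primitive(...)   */
                 (int)(name_end - name), name,          /* name, unquoted   */
                 (int)(args_end - name_end - 1), name_end + 1, /* ,cmd,chr  */
                 args_end + 2);                         /* tail, e.g. @/    */
    return (n >= 0 && (size_t)n < outsize) ? 1 : -1;
}
```

Run over each line of a copy of the WEB file, this leaves everything except the matching primitive(...) lines untouched. After tangling, it is still worth checking the .pool file and the generated Pascal to confirm nothing unexpected crept in.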

In conclusion

This technique is not “pretty” but, if used with care, you can get tangle to output a lot of useful comments, either through regular expressions and pre-processing of the WEB code, or by hand-editing the WEB to write summaries of the descriptions of the source code. I must stress that you can’t put “meta-comments” just anywhere in the WEB source because you risk breaking tangle’s parsing process: you’ll need to experiment and proceed carefully with (say) small/minor manual edits to make sure tangle or Web2C don’t “choke” on your changes.