STM publishing: tools, technologies and change A WordPress site for STM Publishing

1 December 2016

The flavours of TeX—a guide for publishing staff: LaTeX, pdfTeX, pdfLaTeX, XeTeX, XeLaTeX, LuaTeX et al

Posted by Graham Douglas

Summary

This is not a technical article on using TeX (i.e., TeX installation or programming). Instead, it offers some background information for people who work in STM (scientific, technical and medical) publishing and aims to provide an easy-to-follow explanation by addressing the question “what is TeX?”—and, hopefully, to demystify some confusing terminology. My objective is, quite simply, to offer an introduction to TeX-based software for new, or early-career, STM publishing staff—especially those working in production (print or digital). Just by way of a very brief bio, as in “am I qualified to write this”: I’m writing this piece based on my 20+ years of experience in STM publishing, having worked in senior editorial positions through to technical production and programming roles. In addition, over the last few years I have spent a great deal of time building and compiling practically every TeX engine from its original source code, together with creating my own custom TeX installation to explore the potential of production automation through modern TeX-based software.

Introduction

If you work in STM (scientific, technical and medical) publishing, especially within mathematics and physics, chances are that you’ve heard of something called “TeX” (usually pronounced “tech”)—you might also have encountered, or read about, authors using tools called LaTeX, pdfTeX, pdfLaTeX, XeTeX, XeLaTeX, LuaTeX, LuaLaTeX etc. Unless you are a TeX user, or familiar with the peculiarities of the TeX ecosystem, you may be forgiven for feeling somewhat confused as to what those terms actually mean. If you are considering working in STM publishing and have never heard of TeX, then I should just note that it is software which excels at typesetting advanced mathematics and is widely used by mathematicians, physicists and computer scientists to write and prepare their journal articles, books, PhD theses and so forth. TeX’s roots date back to the late 1970s, but over the intervening decades new versions have evolved to provide considerable enhancements and additional functionality. Those new to STM publishing, or considering it as a career, may be surprised to learn that a piece of software dating back to the late 1970s is still in widespread use by technical authors—and publishing workflows.

NOTE: TeX is not just for mathematics. It is a common misconception that the use of TeX is restricted to scientific and technical disciplines—i.e., to the typesetting of complex mathematics. Whilst it finds most of its users in those domains, TeX is widely used for the production of non-mathematical content. In addition to typesetting mathematics, modern TeX engines (XeTeX and LuaTeX) provide exquisite handling of typeset text, support for OpenType font technologies, Unicode support, OpenType math fonts (as pioneered by Microsoft for Word), multilingual typesetting (including Arabic and other complex scripts) and output directly to PDF. LuaTeX, in particular, is incredibly powerful because it also has the Lua scripting language built into its typesetting engine, offering (for example) almost unlimited scope for the automated production/typesetting of highly complex or bespoke documentation, books and so forth. LuaTeX also provides you with the ability to write plugins to extend its capabilities. Those plugins are usually written in C/C++ to perform specialist tasks—for example: graphics processing, parsing XML, specialist text manipulation, on-the-fly database queries or, indeed, pretty much anything you might need to do as part of your document production processes. If you don’t want the complexities of writing plugins, chances are you can simply use the Lua scripting language to perform many of your more complex processing tasks.

Irrespective of the tools used by authors to write/prepare their work, the lingua franca of today’s digital publishing workflows—especially journals—is XML, which is generated from the collection of text and graphics files submitted by authors. Most publishers now outsource the generation of XML to offshore companies, usually based in countries such as India, China or the Philippines. Most production staff do not usually need to worry (too much) about the messy details of conversion—provided the XML passes quality-control procedures and is a correct and faithful representation of the authors’ work. The future is, of course, online authorship platforms which remove the need for this expensive conversion of authors’ work into XML—but we’re still some way from that being standard practice: old habits die hard, so Microsoft Word and TeX will be around for some time, as will the need for conversion into XML.

And so to TeX: A brief history in time

My all-time favourite quote comes from the American historian Daniel J. Boorstin who once noted that:

“Trying to plan for the future without a sense of the past is like trying to plant cut flowers.”

In keeping with the ethos of that quote I’ll start with a very brief history of TeX.

On 30 March 1977 the diary of Professor Donald Knuth, a computer scientist at Stanford University, recorded the following note:

“Galley proofs for vol. 2 finally arrive, they look awful (typographically)… I decide I have to solve the problem myself”.

That small entry in Professor Knuth’s diary was the catalyst for a programming journey which lasted several years, and the outcome of that epic project was a piece of typesetting software capable of producing exquisitely typeset mathematics and, of course, text: that program was called TeX. Along the way, Knuth and his colleagues designed new and sophisticated algorithms to solve some very complex typesetting problems, including automatic line breaking, hyphenation and, of course, mathematical typesetting. As part of the development, Knuth needed fonts to use with his typesetting software, so he also developed his own font technology called METAFONT, although we won’t discuss that in any detail here.

To cut a very long story short, TeX proved to be a huge success—in no small part because Knuth took the decision to make TeX’s source code (i.e., program code) freely available, meaning that it could be built/ported, for free, to work on a wide range of computer systems. TeX enabled mathematicians, physicists, computer scientists and authors from many other technical disciplines to have exquisite control over typesetting their own work, producing beautifully typeset material containing highly complex mathematical content. Authors could use TeX to write and prepare their books and papers, and submit their “TeX code” to publishers—usually with a far greater degree of certainty that their final proofs would not suffer the same fate as Knuth’s.

TeX: Knuth maintains his version, but others have evolved

Even today, nearly four decades after that fateful genesis of TeX, Professor Knuth continues to make periodic bug fixes to the master source code of his version of TeX—which is archived at ftp://ftp.cs.stanford.edu/tex/ and available from other sources, such as CTAN (the Comprehensive TeX Archive Network). Those updates take place every few years, with the latest being “The TeX tuneup of 2014”, as reported in the journal TUGboat 35:1, 2014. During those “tuneups” Knuth does not add any new features to TeX; they really are just bug fixes. In the 1980s Knuth decided that, in the interest of achieving long-term stability, he would freeze the development of TeX; i.e., no new features would be added to his version of TeX. I specifically mention “his version of TeX” because Knuth did not exclude or prevent others from using his code to create “new versions of TeX” which have additional features and functionality. Those “new versions” are usually given names to indicate that, whilst they are based on Knuth’s original, they have additional functionality—hence the addition of prefixes to give program names such as pdfTeX, XeTeX and LuaTeX.

Huh—what about LaTeX? At this point you might be wondering why I have not mentioned LaTeX, and it is a good question. Just to jump ahead slightly, the reason I am not mentioning LaTeX (at this point) is that LaTeX is not a version of the executable TeX typesetting program—it is a collection of TeX macros, a topic which I will discuss in more detail below.

At this point, I’ll just use the term “TeX” (in quotes) to refer to Knuth’s original version and all its later descendants (pdfTeX, XeTeX, LuaTeX).

So, what does “TeX” actually do?

As noted, “TeX” is a typesetting program—but if you have formed a mental image of a graphical user interface (GUI), such as Adobe InDesign, then think again. At the time of TeX’s genesis, in the late 1970s, today’s sophisticated graphical interfaces and operating systems were still some way into the future, and TeX’s modus operandi still reflects that heritage—even in the new, modern variants of TeX. Those accustomed to using modern page-layout applications, such as Adobe InDesign, may be surprised to see how TeX works. Suppose someone gives you a copy of a “TeX” executable program and you want to use it to do something: how do you do that? “TeX” uses a so-called command-line interface: it has no fancy graphical screen into which you type the text to be typeset, or point, click and tap to set options or configurations. If you run the “TeX” program you see a simple screen with a blinking cursor. Just by way of example, here’s the screen I see when I run LuaTeX (luatex.exe on Windows):

[Screenshot: luatex.exe running in a Windows command prompt]

Clearly, if you want a piece of software to typeset something, you will need to provide some form of input (material to typeset) in order to get some form of output (your typeset material). Your input to the typesetting program will not only need to contain the material to be typeset but will also require some instructions to tell the typesetting program which fonts to use, the page size and a myriad of other details controlling the appearance of the typeset results. To typeset anything with “TeX” you provide it with an input text file containing your to-be-typeset material interspersed with “typesetting instructions” telling “TeX” how to format/typeset the material you have provided: i.e., what you want it to achieve. And here is where “TeX” achieves its legendary power and flexibility. The “typesetting instructions” that control “TeX’s” typesetting process are written using a very powerful programming language—one that Professor Knuth designed specifically to provide users with enormous flexibility and detailed control of “TeX’s” typesetting capabilities. So we can now start to see that “TeX” is, in fact, a piece of typesetting software that users can direct and control by providing it with instructions written in a programming language. You should think of “TeX” as an executable program (a “typesetting engine”) which understands the TeX typesetting language.

A tiny example

Just to make it clear, here is a tiny example of some input to “TeX”—please do not worry about the meaning of the strange-looking markup (“TeX” commands that start with a “\”). The purpose here is simply to show you what input to “TeX” looks like:

$$\left| 4 x^3 + \left( x + {42 \over 1+x^4} \right) \right|.$$

And here is the output (as displayed in this WordPress blog using the MathJax-LaTeX plugin):

\[\left| 4 x^3 + \left( x + {42 \over 1+x^4} \right) \right|.\]

So, in order to produce your magnum opus you would write and prepare a text file containing your material interspersed with “TeX” commands, save that to a file called, say, myopus.tex, and then tell your “TeX” engine to process that file. If all goes well, and there are no bugs in your “TeX” code (i.e., your “TeX” programming instructions), then you should get an output file, myopus.pdf, containing a beautifully typeset version of your work. I have, of course, omitted quite a lot of detail here because, as I said at the start, this is not an article about running/using “TeX”.
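
For example, with the LuaTeX engine shown earlier (which writes PDF by default), processing the file from a Windows command prompt would be something along these lines:

luatex myopus.tex

If processing succeeds you should find myopus.pdf, together with a myopus.log transcript of the run, alongside your input file.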

“TeX” the program (typesetting “engine”) and “TeX” the typesetting language

So, the word “TeX” refers both to an executable program (the “TeX” typesetting engine) and the set of typesetting instructions that the engine can process: instructions written in the “TeX” language. Understanding that the executable “TeX” engine is programmable is central to truly appreciating the differences between LaTeX, pdfTeX, pdfLaTeX, XeTeX, LuaTeX and so forth.

Each “TeX” engine (program) understands hundreds of so-called primitive commands. Primitive in this sense does not mean “simple” or “unsophisticated”; it means that they are the fundamental building blocks of the TeX language. A simple, though not wholly accurate, analogy is the alphabet of a particular language: the individual characters of the alphabet cannot be reduced to simpler entities; they are the fundamental building blocks from which words, sentences and so on are constructed.

And finally: from TeX to pdfTeX, XeTeX and LuaTeX

Just to recap: when Knuth wrote the original version of “TeX” he defined it to have the features and capabilities that he thought were sufficient to meet the needs of sophisticated text and mathematical typesetting—based, of course, on the technology environment of that time, including the processing power and memory of available computers, font technologies and output devices. Knuth’s specification of “TeX” included its internal/programming design (“TeX’s” typesetting algorithms) and, of course, defining the “TeX” language that people can use to “mark up” the material to be typeset. What I mean by “defining the TeX language” is defining the set of several hundred primitive commands that the “TeX” engine can understand, and the action taken by the “TeX” engine whenever it encounters one of those primitives during the processing of your input text.

Naturally, technology environments evolve: computers become faster and have more storage/memory, new font technologies are released (Type 1, TrueType, OpenType), file output formats evolve (e.g., the move from PostScript to PDF) and Unicode has become the dominant way to encode text. Naturally, “TeX” users wanted those new technologies to be supported by “TeX”—in addition to incorporating ideas for, and improvements to, the existing features and capabilities of Knuth’s original TeX program. As noted earlier, in the 1980s Knuth decided to freeze his development of TeX: no more new features in his version—bug fixes only. With the genuine need to update/modernize Knuth’s original software, TeX programming experts have taken Knuth’s original source code and enhanced it to add new features and provide support for modern typesetting technologies. The four-decade history of TeX’s evolution is quite complex but if you really want the full story then read this article by Frank Mittelbach: TUGboat, Volume 34 (2013), No. 1.

These new versions of TeX not only provide additional features (e.g., outputting directly to PDF, supporting OpenType fonts) but also extend and adapt the TeX language itself: by adding new primitives to Knuth’s original set they provide users with greater programming power and flexibility to control the actions of the typesetting engine. Each new TeX engine is given its own name to distinguish it from Knuth’s original software: hence you now have pdfTeX, XeTeX and LuaTeX. These three TeX engines are not 100% compatible with each other and it is quite easy to prepare input that can be processed with one TeX engine but fails to work with the others—simply because a particular TeX engine may support primitive commands that the others do not. But all is not lost: enter the world of TeX macros!

Primitives are not the whole story: macros and LaTeX

I have mentioned that each TeX engine supports a particular set of low-level commands called primitives—but this is not the full story. Of course, many of the same primitives are supported by all engines, but some are specific to a particular engine. “TeX” achieves its true power and sophistication through so-called TeX macros. The primitive commands of an engine’s TeX language can be combined to define new commands, or macros, built from low-level primitive instructions—and/or other macros. TeX macros allow you to define new commands that are capable of performing complex typesetting operations, saving a great deal of time and typing and reducing programming errors. In addition, TeX engines provide primitives that macro code can use to detect which TeX engine is being used to typeset a document—so that the macros can, on the fly, adapt their behaviour depending on which primitives the engine in use actually supports. If a particular primitive is not supported directly but can be “mimicked” (using combinations of other primitives) then all is usually well—but if the chosen TeX engine really cannot cope with a particular primitive then typesetting will fail and an error will be reported.
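
Just to give a flavour of what this looks like in practice, here is a minimal plain TeX sketch (it assumes the e-TeX \ifdefined primitive, which all the modern engines provide, and tests for engine-specific primitives such as LuaTeX’s \directlua):

% define a new command (macro) built from existing commands
\def\hello{Hello from the world of TeX macros.}
% adapt to the engine in use by testing for engine-specific primitives
\ifdefined\directlua
  \message{Running on LuaTeX}
\else\ifdefined\XeTeXrevision
  \message{Running on XeTeX}
\else
  \message{Probably pdfTeX (or Knuth's original TeX)}
\fi\fi
\hello
\bye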

The TeX language is, after all, a programming language—albeit one designed to solve typesetting problems; but as a programming language TeX is extremely arcane and works very differently to most programming languages you are likely to encounter today.

So, finally, what is LaTeX?

We’ve talked about various versions of the TeX engine—from Knuth’s original TeX to its descendants pdfTeX, XeTeX and LuaTeX—and briefly discussed TeX as a typesetting language: primitives, programming and the ability to write macros. Finally, we are in a position to discuss LaTeX. The logical extension to writing individual TeX macros for some specific task you want to solve, as an individual, is to prepare a collection of macros that others can also use—a package of macros that collectively provide some useful tools and commands that others can benefit from. And that is precisely what LaTeX is: a very large collection of complex and sophisticated macros designed to help you typeset books, journal papers and so forth. It provides a wealth of features to control things like page layout, fonts and a myriad of other typesetting details. Not only that, but LaTeX was designed to be extensible: you can plug in additional, more specialist, macro packages written to solve specific typesetting problems—e.g., producing nicely typeset tables or typesetting particularly complex forms of mathematics. If you visit the Comprehensive TeX Archive Network you can choose from hundreds, if not thousands, of macro packages that have been written and contributed by users worldwide.
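
Just to give a flavour, here is a minimal sketch of a LaTeX input file (the article document class and the amsmath package are simply illustrative choices):

\documentclass{article}
\usepackage{amsmath}
\begin{document}
Here is a displayed equation typeset via LaTeX:
\[
  \left| 4x^3 + \left( x + \frac{42}{1+x^4} \right) \right|.
\]
\end{document}

Commands such as \documentclass and \begin{document} are not engine primitives: they are macros provided by LaTeX, ultimately built up from primitives.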

So, if someone says they are typesetting their work with LaTeX then they are only telling you part of the story. What they really mean is that they are using the LaTeX macro package with a particular TeX engine—usually pdfTeX, but maybe XeTeX (for multilingual work) or LuaTeX (perhaps for advanced customized document production). Sometimes you will see terms such as pdfLaTeX, XeLaTeX or even LuaLaTeX: these are not actually the names of TeX engines; all they signify is which TeX engine is being used to run LaTeX. For example, if someone says “I am using pdfLaTeX”, what that really means is “I am preparing my typeset documents using the LaTeX macro package and processing them with the pdfTeX engine”. Equally, if anyone says to you that they are “using TeX” then, I hope, you now see that statement does not actually tell you the whole story.
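
On a typical TeX installation (TeX Live or MiKTeX, for example) this pairing of macro package and engine is reflected in the commands you run; the following is just a sketch and the exact command names can vary between distributions:

pdflatex myopus.tex    (the LaTeX macro package running on the pdfTeX engine)
xelatex myopus.tex     (the LaTeX macro package running on the XeTeX engine)
lualatex myopus.tex    (the LaTeX macro package running on the LuaTeX engine)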

Filed under: TeX (general)
14 December 2015

A note on a “gotcha” when building TeX Live from source (on Windows) [updated]

Posted by Graham Douglas

Post-publication update: GNU gawk

Since publication of the article below, subsequent investigations with a member of the TeX Live team have identified the exact cause of the problem: an outdated version of GNU's gawk command-line tool (used during compilation). I had been using version 3.1.7 of GNU's gawk (supplied with the MSYS distribution I was using) but after updating it to version 4.0.2 the line-ending problem no longer arises. If you are using MSYS on Windows, and want to compile TeX Live..., check the version of gawk installed on your machine. As I say, you live and (re)learn...
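
If you want to check which version of gawk your MSYS shell picks up, something along these lines (run from the MSYS prompt) will report it:

gawk --version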

Original article

Just a short note to share the solution to a problem I experienced when trying to compile TeX Live from the C/C++ source distribution... on Windows. I have a bit of relevant experience because I regularly compile LuaTeX from source and have built other TeX engines – including Knuthian TeX from raw WEB code and some versions of XeTeX.

So, with that experience, I decided to have a go at building TeX Live from the source file distribution–it's useful to be able to build and use the latest versions of TeX-related software. Using SVN (via the Tortoise SVN client) I checked out the TeX Live source directory and tried to build it using MinGW64/MSYS64. I read through the notes in README.2building (supplied with the TeX Live source) and followed the example to build dvipdfm-x. Running the Build/configure scripts (using the --disable-all-pkgs option) worked fine but, sadly, compilation failed with a cascade of errors... so I wanted to find out why.

Unquestionably, TeX Live is a truly impressive work of considerable complexity and, of course, it should build OK on Windows–so I figured that the problem must be a relatively minor one to do with my setup. However, tracking it down initially felt like "looking for a needle in a haystack", to quote a well-known English figure of speech. Well, after a couple of days I found the problem... line endings in some key text files! When I checked out the source via SVN some key files (config.h.in and similar *.in files) had been saved with Windows line endings (CR+LF) rather than Linux endings of LF only. Running the top-level TeX Live Build/configure scripts generates a config.status shell script file for each component/sub-system that has to be compiled. As the config.status scripts execute, they create a number of temporary files which are processed and deleted on-the-fly. To stop these temporary files being deleted (to assist my bug hunt) I used a simple trick of adding the line alias rm='echo' at the start of one of the config.status shell scripts (which are generated by configure).

I discovered that the config.status scripts generate a temporary file called defines.awk–which is a script designed to be executed by the AWK program. The purpose of defines.awk is to process "template" configuration files (called config.h.in (and similar)) to generate various config.h files that contain important settings (#defines) detected during the configuration process (i.e., during the execution of configure). These config.h files vary for each program you are building and are essential for successful compilation. Well, it turned out that the defines.awk script was failing to correctly parse the config.h.in files (and similar) simply because the Windows line endings were causing a vital regular expression (in defines.awk) to fail. This resulted in the config.h files being a copy of config.h.in because none of the text replacements had worked due to failure of the AWK regular expression. Not surprisingly, erroneous config.h files caused the spectacular failure of compilation I experienced on my first attempt. Re-saving the config.h.in files (and some other *.in files) with Linux line endings seems to have solved the problems.
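
If you run into the same issue, one quick way to strip the carriage returns from an affected file is something like the following (a sketch, assuming GNU sed is available in your MSYS shell; dos2unix, if you have it installed, does the same job):

sed -i 's/\r$//' config.h.in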

And yes, so far all the TeX-related programs I have tried to build have compiled successfully. This is not the first time I have been "bitten" by problems caused by Linux/Windows line endings... so I guess you always live and (re)learn.



2 December 2015

TeX’s DVI file preamble: deriving the values of num = 25400000 and den = 473628672

Posted by Graham Douglas

Introduction

If you are at all interested in the innards of TeX's DVI files you might find the following article of some help – a quick post, in the form of a PDF, deriving the values of num = 25400000 and den = 473628672.

Download PDF
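
For anyone who doesn't want to open the PDF, the gist of the derivation (following Knuth's description of the DVI format, in which the num/den fraction expresses the size of one DVI unit in multiples of \(10^{-7}\) metres) is roughly this: TeX's DVI unit is the scaled point, \(1\,\mathrm{sp} = 2^{-16}\,\mathrm{pt}\), and \(72.27\,\mathrm{pt} = 1\,\mathrm{in} = 2.54\,\mathrm{cm}\), so

\[ 1\,\mathrm{sp} = \frac{2.54}{72.27 \times 2^{16}}\,\mathrm{cm} = \frac{2.54\times 10^{7}}{7227 \times 2^{16}}\times 10^{-7}\,\mathrm{m} = \frac{25400000}{473628672}\times 10^{-7}\,\mathrm{m}, \]

which is exactly where num = 25400000 and den = 473628672 come from.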

25 July 2015

Lua-scriptable PATGEN – i.e., PATGEN 2.4 with a Lua binding…

Posted by Graham Douglas

PATGEN: from WEB to C

I recently became curious about TeX's hyphenation patterns and started to read about how they are created – usually using PATGEN, though, from what I've read, some brave souls do actually hand-craft hyphenation patterns! I decided to build PATGEN 2.4 from source code which, of course, means converting the PATGEN WEB source to C code via Web2C. Some time ago I went through the process of building my own Web2C executable for Windows (see this article for more details). I won't go into the specifics of doing the conversion, but I was able to create patgen.c – the resulting C code is less than 2,000 lines long. I also spent some time re-formatting the C code, simply because the machine-generated C produced by the Web2C process does not aim for beauty, just functionality. I removed all dependencies on Kpathsea and generally tidied the code to create clean, stripped-down code that is easy to compile.

Understanding PATGEN: not so easy

PATGEN is, of course, a very highly specialized program and one that is designed for expert users who really need to use it. As a non-expert looking to understand just the basics I found that there was very little step-by-step "beginners" material – although a search on tex.stackexchange.com provided some useful "snippets" and the tutorial "A small tutorial on the multilingual features of PatGen2" by Yannis Haralambous was very helpful. There are, of course, a number of articles, by luminaries and experts, on specific uses of the PATGEN program; however, for me anyway, it was a case of piecing together the puzzle... reading the PATGEN documentation, source code plus some parts of Frank Liang's thesis Word Hy-phen-a-tion by Com-put-er which describes the hyphenation algorithms that PATGEN implements.

Running PATGEN

To run PATGEN you need to provide it with the names/paths of (up to) four files (some can be "nul" if you are not using them):

PATGEN dictionary_file starting_patterns translate_file output_patterns

TIP: I created a PDF file of PATGEN's documentation that you can download here. Some information on the files you provide to PATGEN is discussed in sections 1 to 6 in the first few pages of the documentation.

In very brief outline, the files you provide on the command line are:

  • dictionary_file: A pre-prepared list of hyphenated words from which you want to generate hyphenation patterns for TeX to use.
  • starting_patterns: (can be "nul", i.e., it is not mandatory) Best to read the description(s) in the documentation (link above).
  • translate_file: (can be "nul", i.e., it is not mandatory) From the documentation: "The translate file may specify the values of \lefthyphenmin and \righthyphenmin as well as the external representation and collating sequence of the `letters' used by the language." It also specifies other information – see the documentation for further details (section 54).
  • output_patterns: the output from PATGEN – a file of hyphenation patterns for use with TeX.

PATGEN: questions, questions...

In order to work its magic, PATGEN makes multiple passes through the dictionary_file as it builds the list of hyphenation patterns. As it performs the processing PATGEN stops to ask you for input: it needs your help at various stages of the processing. Now, I'm not going to go into the details of those questions simply because I'm not sufficiently experienced with the program to be sure that I'd be giving sensible advice. Sorry :-(.

Answering questions via Lua

So, finally, to the main topic of this post. As noted, during processing PATGEN asks you to provide it with some information to guide the pattern-generation process: those details concern the hyphenation levels, pattern lengths plus some heuristics data that assist PATGEN to choose patterns. Ultimately, the answers you give to PATGEN are integer values that you enter at the command line. However, it's a bit frustrating to keep answering PATGEN's questions so I wondered if it would be possible to "automate" providing those answers and, in addition, create a Dynamic Link Library (DLL) that I could use with LuaTeX – perhaps something very basic to start with, like this:


\directlua{

local pgen=require("patgen")
pgen.setparams(...)
local patterns=pgen.run()

}

In the above code, require("patgen") will load a DLL (patgen.dll) and return a table of functions that let you set various parameters for PATGEN and then run it to return the pattern list as a string that you can subsequently use with LuaTeX. Note, LuaTeX does NOT require INITEX mode to use hyphenation patterns.
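
For anyone curious about the C side of this, the entry point of such a DLL is just a standard Lua C module opener. The following is only a rough sketch of its shape (the function bodies are stubs here, and you need to make sure luaopen_patgen is exported from the DLL):

#include "lua.h"
#include "lauxlib.h"

/* Stubs: the real versions store PATGEN's parameters and run the
   pattern-generation code, returning the patterns as a string. */
static int l_setparams(lua_State *L) { return 0; }
static int l_run(lua_State *L) { lua_pushstring(L, ""); return 1; }

static const luaL_Reg patgen_funcs[] = {
    {"setparams", l_setparams},
    {"run",       l_run},
    {NULL, NULL}
};

/* require("patgen") looks for this function when it loads patgen.dll */
int luaopen_patgen(lua_State *L)
{
    luaL_newlib(L, patgen_funcs);  /* Lua 5.2+; use luaL_register on Lua 5.1 */
    return 1;
}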

Calling Lua code (functions) from patgen.dll

The above simple scenario does indeed work and it's quite easy to implement this. Firstly, within PATGEN's void mainbody(void) routine you can replace the code that stops to ask you questions – such as the request for the start/finish pattern lengths:


Fputs(output, "pat_start, pat_finish: ");
input2ints(&n1, &n2);

The above code uses a function input2ints (int *a, int *b) to request two integers:


void input2ints (int *a, int *b)
{
    int ch;
    /* keep reading until scanf successfully parses two integers */
    while (scanf (SCAN2INT, a, b) != 2)
    {
        /* discard the remainder of the offending input line */
        while ((ch = getchar ()) != EOF && ch != '\n');
        if (ch == EOF) return;
        fprintf (stderr, "Please enter two integers.\n");
    }

    /* discard anything left on the line after the two integers */
    while ((ch = getchar ()) != EOF && ch != '\n');
}

You can replace this with your own function, say get_pattern_start_finish(&n1, &n2), which can, for example, call a function in your Lua script to work out the values you want to return for n1 and n2 (the values for pat_start and pat_finish). Perhaps you might store those values in Lua as a table. At the time of writing I've not yet written that part and, for now, the Lua/C module just returns some hardcoded answers. The next step simply requires making a call, via the Lua C API, to a named Lua script function that works out the values you want to provide. This gives the most flexibility because the logic is all contained in your Lua code which, of course, makes it very quick and easy to experiment with different settings to generate different patterns. This technique can also be used for the other parameters that PATGEN asks for.
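
Just to illustrate the idea, here is a rough sketch of what such a replacement might look like (get_pat_lengths is a purely hypothetical function assumed to be defined in the user's Lua script, and L is the lua_State held by the module):

#include <stdio.h>
#include "lua.h"

/* Sketch: obtain pat_start and pat_finish from a Lua function instead of
   prompting the user at the console. */
static void get_pattern_start_finish(lua_State *L, int *n1, int *n2)
{
    lua_getglobal(L, "get_pat_lengths");   /* push the (hypothetical) Lua function */
    if (lua_pcall(L, 0, 2, 0) != 0)        /* call it: no arguments, two results */
    {
        fprintf(stderr, "patgen: %s\n", lua_tostring(L, -1));
        lua_pop(L, 1);
        *n1 = 1; *n2 = 2;                  /* fall back to something harmless */
        return;
    }
    *n1 = (int)lua_tointeger(L, -2);       /* pat_start */
    *n2 = (int)lua_tointeger(L, -1);       /* pat_finish */
    lua_pop(L, 2);
}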

Returning the generated pattern(s)

Within PATGEN, there is a function called zoutputpatterns(...) which generates the hyphenation patterns and writes them out to a file. I'm experimenting with a function which wraps this inside another function that uses a C++ stringstream object to capture/save the pattern text – rather than writing it to a file. Doing this simply required modifying zoutputpatterns(...) to pass in the stringstream object and write the patterns (character data) to the stringstream rather than to a physical file. Once finished, you can then get access to the stringstream's stored data as a C-style string (containing the generated hyphenation patterns) which you can return to Lua, and thus to LuaTeX.


using namespace std;
void do_output_patterns(int i, int j)
{
    std::stringstream *ss;
    ss = new std::stringstream();
    zoutputpatterns(i, j, ss);
    // you can then pass the string of patterns back to Lua and thus to LuaTeX
    std::cout << ss->str().c_str() << endl;
    delete ss;
}

In conclusion

This is just a quick summary of a work-in-progress but it looks like it will provide a nice way for fast/rapid experimentation with PATGEN. It seems to offer dynamic generation of hyphenation patterns and provides a method to fully script PATGEN's activities and thus very quickly understand the effect of the parameters PATGEN asks you to provide. If there is any interest I might (eventually) release it once I'm happy that it's good enough.

22 June 2015

Building LuaTeX 0.80 on Windows and debugging with Eclipse IDE

Posted by Graham Douglas

Long time, no posts!

It's been a very long time since my last post, some 8 months, so I thought it was about time I posted something new. At the moment I'm looking for new contract work (or employment opportunities) within STM publishing so, for a while, I have some time to devote to my blog.

A new LuaTeX beta (version 0.80) was released on 13 June 2015 and, as usual, I wanted to compile LuaTeX from source code. I grabbed a copy of LuaTeX's source from the subversion repository – on Windows I use the excellent, free, TortoiseSVN software to create my local repository. To create a local repository with TortoiseSVN you use the URL https://foundry.supelec.fr/svn/luatex/tags/beta-0.80.0
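
If you prefer a command-line Subversion client to TortoiseSVN, the equivalent checkout would be something along these lines (placing the source in the directory assumed later in this post):

svn checkout https://foundry.supelec.fr/svn/luatex/tags/beta-0.80.0 e:/luatex/beta-0.80.0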

Compilation failed: time for an update of MinGW/MSYS

At first I could not get a successful compilation of LuaTeX 0.80 even though the prior release (0.79.3.1) compiled perfectly. Note that this failure to build LuaTeX 0.80 could simply be due to a problem with my local setup and others might not experience it: I'm merely documenting what I did to fix my own issues with the build. After some discussions with a member of the LuaTeX development team I decided it was time to do a fresh/updated install of the tools you need to compile LuaTeX – MinGW and MSYS, which provide the compiler, libraries, Bash shell and other tools/utilities.

Which ones did I use?

I decided to use the following: the mingw-w64 distribution of MinGW, together with the MSYS distribution supplied as MSYS-20111123.zip (both covered in the notes below).

Notes on installation

Installing mingw-w64 just requires running the .exe provided. To install MSYS you simply unpack the file MSYS-20111123.zip. I chose to install mingw-w64 and MSYS on my E: drive, in directories called MinGW64 and MSYS respectively. Once you've installed MSYS you need to run a small "post installation" batch file called pi.bat which is located in the postinstall subdirectory of your MSYS folder (e.g., e:\msys\postinstall\pi.bat). This batch file asks a couple of simple questions to "link up" your MinGW installation and your MSYS installation. After installing mingw-w64 and MSYS I was able to build LuaTeX 0.80 without any difficulties. Note that you will probably need to update your system's PATH environment variable to include the location of the directories which contain the numerous executables provided by mingw-w64 and MSYS.

Next step: Grab the LuaTeX code

As noted, you'll need an SVN client to check out your own local copy of the LuaTeX repository. I used TortoiseSVN and the aforementioned URL: https://foundry.supelec.fr/svn/luatex/tags/beta-0.80.0 Let's assume you successfully downloaded LuaTeX's source code into a repository directory called, say, e:\luatex\beta-0.80.0. The next step is to start the MSYS Bash shell by double-clicking on the batch file msys.bat located in the root of your MSYS folder. With the Bash shell running, change your current directory by issuing the command cd e:/luatex/beta-0.80.0

Running the build script: build.sh

Located within the e:\luatex\beta-0.80.0 directory is a file (a Bash shell script) called build.sh which you execute to perform the compilation process (i.e., it calls configure and make). If you look inside build.sh you'll observe there are several command-line options you can give to the script, but I'm not going to cover those here – apart from the --debug option, which I'll discuss in a moment. To execute the script you just need to type ./build.sh, press return and, hopefully, the build will start. Depending on the options you give to the build script the build process can take quite a long time. On my Intel i7 (6 core) machine (with 16GB memory) it can take as long as 20 minutes for a full build.

Using build.sh --debug

As you might have guessed, running the build.sh script with the --debug option (./build.sh --debug) creates a version of the luatex.exe executable that contains a wealth of additional information which provides GNU's debugger (gdb) with the information it needs in order to run the executable for debugging purposes. Just to note that, at the time of writing, the non-debug luatex.exe file (on Windows) is approximately 8MB, but the debug version explodes in size to something like 325MB! (again, on Windows). This has been reported to the LuaTeX team and is presently being investigated.

Now the fun stuff: debugging luatex.exe

LuaTeX is a large and very complex piece of software which makes use of many C/C++ libraries, including: FontForge, Cairo, MetaPost, GNU numerical libraries, libpng, zlib and others – all in addition to its own code base plus, of course, the Lua scripting language. It's really quite an amazing feat of programming to glue all these libraries together. If, like me, you are interested to see how LuaTeX works "under the hood", the only way to really achieve that is to create a debug version of LuaTeX (noted above) and run it using the GNU debugger (gdb) – the GNU debugger is installed as part of the mingw-w64 distribution.
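
If you are happy at the command line, a bare-bones gdb session looks something like this (the executable and input file names are just placeholders):

gdb ./luatex080debug.exe
(gdb) break main
(gdb) run myfile.tex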

I prefer a Visual Debugger

Of course, it's quite possible to use the GNU debugger via the command line but after years of using Microsoft Visual Studio I very much prefer using a graphical interface to set breakpoints, single-step through code, examine variables etc – all the things you do as part of a debug session. However, we've built the luatex.exe (debug version) through a script, using GNU compilers, and we don't have a nice Visual Studio project we can use: so how can we have the pleasures of a GUI-based debugging session? Well, there's some great news: you can! The Eclipse IDE (Integrated Development Environment) has a fantastic feature that lets you import an executable (debug version) and automatically creates a project that lets you use GNU's gdb within a nice GUI world – you can single-step through the original C/C++ code and work just as you would in a typical Visual Studio world. It's really quite amazing, and is possible because the debug version of luatex.exe is expanded to provide/include the additional information that lets you do this.

Installing Eclipse on Windows

Eclipse is built in Java so you'll first need to ensure you have Java (and the Java Development Kit) installed before you try to install Eclipse. I already had 32-bit Java installed but I decided to install the 64-bit version (keeping the 32-bit version) – all I did was to install the 64-bit Java version in a different directory. Again, you might need to update your system's PATH environment variable so that it can find Java executables.

Which Eclipse?

You need the Eclipse IDE for C/C++ Developers. The latest version is, at the time of writing, available here (again I opted for the 64-bit version). Now I must confess that I did encounter a few minor issues when trying to configure Eclipse and tell it to use the compiler setup provided by mingw-w64. Such issues can be very dependent on your local setup so I won't go into the details. However, if, like me, you do encounter difficulties trying to test the Eclipse install (compiling a simple test C program with the GNU compiler) then be patient and Google for help + tips, because most issues are likely to have been noted/discussed somewhere on the web.

A TIP I can offer: .w source file extensions

One particular issue you might hit when trying to debug LuaTeX with the GNU debugger (gdb) is the unusual source file extension used by some source files in the LuaTeX code base. Much of core LuaTeX is written in CWEB, which is the C-code version of Knuth's venerable WEB (structured documentation) format. CWEB code is a mixture of C program code and TeX documentation code. During the build process a program called CTANGLE processes the .w files to generate the C source for compilation. However, the debug executable contains references to these .w source files, and Eclipse needs to be told that files with a .w extension are source files; otherwise Eclipse and the GNU debugger (gdb) get "confused" and claim they can't find the .w source files – meaning you can't step into the source code. All I did was (within Eclipse) to set up .w as a source file type under Window --> Preferences --> C/C++ --> File Types, as shown in the screenshot below.

And finally: opening luatex.exe with Eclipse

From the Eclipse menu, choose File --> Import and select C/C++ Executable. Click "Next" then "Browse" to locate the debug version of the LuaTeX executable you built earlier (note: here I've named my executable file luatex080debug.exe). From then on, just continue with the import process – for now, I simply accepted the default options offered by Eclipse (you can read up later). Once Eclipse is ready, click "Debug" in the final step and the debug version of the executable will be examined and parsed to extract all the information it contains; from that data Eclipse will build you a project for debugging LuaTeX, complete with access to all the source code for you to set breakpoints and step through at your leisure. How amazing is that!

[Screenshot: the debug executable being parsed by gdb while Eclipse builds the debugging project]

[Screenshot: single-stepping through LuaTeX's source code in the Eclipse IDE, using the GNU gdb debugger]

25 October 2014

MetaPost: Direct to PDF via MPlib

Posted by Graham Douglas

Introduction

Using the Cairo graphics library (under Windows/Visual Studio) I have, with some caveats, been able to create a direct-to-PDF backend for MetaPost via the brilliant MPlib C library. Of course, Cairo does not support the CMYK colour space which is a real shame, despite there being a lot of discussion on the need for that. I might look at using LibHaru or possibly PoDoFo, both of which I've managed to build on Windows – although I found PoDoFo somewhat difficult to build as a native Windows library. In addition, I have not yet added support for including text in the MetaPost graphics which is, of course, a pretty big omission! That's on the "TODO" list. An example PDF is included in this post, based on the MetaPost code available on this site. If you look at the example PDF you will see it is created with Cairo 1.12.16, the latest release available at the time I wrote this post (25 October 2014).

Download PDF

Quick overview of the process

At the moment, the PDF backend seems to work well, at least with the MetaPost code I've tried it with (minus text, of course!). The lack of CMYK support in Cairo is a nuisance and at the moment I do a very simple, and wholly inadequate, "conversion" of CMYK to RGB, which really makes me cringe. Perhaps I might put in a "callback" feature to use other PDF libraries at the appropriate points in my C code.

MPlib itself is a superb C library and the API documentation (version 1.800) that's currently available was a helpful start, but as a very non-expert MetaPost user I did need to resort to John Hobby's original work in order to understand just a little more about some MetaPost internals. In writing the PDF backend I pretty much had to go through the PostScript backend and replace the PostScript output with the appropriate Cairo API calls. The trickiest part, at least for me, was implementing management of the graphics state (as MetaPost sees it). In the end, I chose to use MPlib's ability to register a userdata pointer (void*) with the MetaPost interpreter. In the PostScript backend the graphics state is managed internally by the MetaPost interpreter (MPlib). I can't quite recall why I chose to externalise the graphics state code but I think it was to give me a bit more flexibility; either way, so far it basically works well.

I chose to build MPlib as a static Windows .lib file – no particular reason, just that's what I prefer to do – although building a DLL is no more difficult. Much of MPlib is released as a set of CWEB files so you will need to extract the C code via CTANGLE.EXE. I use Windows and Visual Studio so, not surprisingly, I found that the MPlib C code would not compile immediately "out of the box", but a few minor (pretty trivial) adjustments to the header files (and some manual #defines) soon resolved the problems and it compiled fine after that.

A little deeper

Assuming you have a working compilation of MPlib, how do you actually use it? I won't repeat the information available in the MPlib API documentation but will give a brief summary of additional considerations that might be helpful to others. Firstly, in my implementation I instantiate an instance of the MP interpreter like this:

	MP mp = init_metapost((void*)create_mp_graphics_state());
	if ( ! mp ) exit ( EXIT_FAILURE ) ;

where (void*)create_mp_graphics_state() is a function to create a new graphics state and register it as the userdata item stored in the MPlib instance – see the code for init_metapost(void* userdata) below (note: this is a work-in-progress and the error checking is very minimal!!! :-)). Provided the initialization succeeds you will get a new MetaPost interpreter instance returned to you. As part of the initialization you have to provide a callback that tells MetaPost how to find input files – my callback is called file_finder and uses recursive directory searching: no kpathsea involved at all. One very important setting in MP_options is math_mode, which affects how MetaPost performs its internal calculations: later versions of MPlib (after 1.800) support all four of the possible options. As part of the initialization I also preload the plain.mp macro collection.

MP init_metapost(void* userdata)
{

	MP mp;
	MP_options * opt = mp_options () ;
	opt -> command_line = NULL;
	opt -> noninteractive = 1 ;
	opt->find_file = file_finder;
	opt->print_found_names = 1;
	opt->userdata = userdata;

	/*
	typedef enum{
	mp_math_scaled_mode= 0,
	mp_math_double_mode= 1,
	mp_math_binary_mode= 2,
	mp_math_decimal_mode= 3
	}mp_math_mode;
	*/

	opt->math_mode = mp_math_scaled_mode;
	opt->ini_version = 1;
	mp = mp_initialize ( opt ) ;
	if ( ! mp ) 
		//exit ( EXIT_FAILURE )
		return NULL;
	else
	{
		char * input= "let dump = endinput ; input plain; ";
		mp_execute(mp, input, strlen(input));
		mp_run_data * res = mp_rundata(mp);
		
		if(mp->history > 0)
		{
			printf("Error text (%s\n)", res->term_out.data);
			return NULL;
		}
		else{
		
			return mp;
		}
	}

}

Got a working instance, now what?

If you get a working MP instance, the next task is, of course, to feed it with some MetaPost code (using mp_execute(mp, your_code, strlen(your_code))) and to check whether MetaPost successfully interpreted your_code. Now I'm not going to give full details of the checks you need to perform as this is pretty routine and the API documentation contains enough help already. In essence, if MPlib was able to run your MetaPost code successfully, it stores the individual graphics (produced from your_code) as a linked list of so-called edge structures (mp_edge_objects). Each edge structure (mp_edge_object) is a graphic that you want to output and, in essence, each edge structure results from the successful execution of the code contained in each beginfig(x) ... endfig; pair. In turn, each edge structure (individual graphic to output) is itself made up from smaller building blocks: eight types of fundamental graphics object (mp_graphic_object). Each mp_graphic_object has a type to tell you what sort of graphic object it is so you can call the appropriate function to render it – as the equivalent PostScript, PDF, PNG, SVG etc.

In summary

If your MetaPost interpreter instance is called, say, mp, then to gain access to the linked list of edge structures you do something like this:

 
mp_run_data * res = mp_rundata(mp);
mp_edge_object* graphics = res->edges;

Note that the edge structures form a simple linked list but the list of components within each individual edge structure (the mp_graphic_object objects) form a circularly-linked list, so you have to be careful to check when you get to the end of the circular list of the mp_graphic_object objects: see the API docs for an example. In closing, here's the loop from my code to process an individual edge structure into PDF – not including all the additional functions to process the various types of the mp_graphic_object objects.

int draw_mp_graphic_on_pdf(mp_edge_object* single_graphic, cairo_t *cr)
{

 		mp_graphic_object*p;
 		MP mp = single_graphic->parent;

		 p=single_graphic->body;
		
		// Inherited this weirdness from core MP engine...
		init_graphics_state(mp, 0);
		// Here we are looping over all the objects in a single graphics
		// resulting from a beginfig(x) ... endfig pair
		 while (p != NULL) 
		 {
			mp_gr_fix_graphics_state(mp,p,cr);
 			switch (gr_type(p)) 
			 {
				 case mp_fill_code:

				 {
				
				 if(gr_pen_p((mp_fill_object*)p)==NULL)
					 {
						//mp_dump_solved_path(gr_path_p((mp_fill_object*)p));
						cairo_gr_pdf_fill_out(mp,gr_path_p((mp_fill_object*)p),cr); 
					}
					 else if(pen_is_elliptical(gr_pen_p((mp_fill_object*)p)))
					{
						//mp_dump_solved_path(gr_path_p((mp_fill_object*)p));
						cairo_gr_stroke_ellipse(mp,p,true,cr);
					 }else{
						//mp_dump_solved_path(gr_path_p((mp_stroked_object*)p));
						cairo_gr_pdf_fill_out(mp,gr_path_p((mp_fill_object*)p),cr);
						cairo_gr_pdf_fill_out(mp,gr_htap(p),cr); 
					 }
	
					 if(   ((mp_fill_object*)p)->post_script != NULL)
					 {
					        // just something I'm experimenting with
						//ondraw(cr, ((mp_fill_object*)p)->post_script);
					}
				}
				break;

				 case mp_stroked_code:
				 {
					
					 mp_dump_solved_path(gr_path_p((mp_stroked_object*)p));	
					if(pen_is_elliptical(gr_pen_p((mp_stroked_object*)p)))
						cairo_gr_stroke_ellipse(mp, p, false, cr);
 					else
 					{
						//mp_dump_solved_path(gr_path_p((mp_stroked_object*)p));
						cairo_gr_pdf_fill_out(mp,gr_path_p((mp_stroked_object*)p),cr);
 					}

					 if(((mp_stroked_object*)p)->post_script != NULL)
					{
						ondraw(cr, ((mp_stroked_object*)p)->post_script);
 					}
 				}
 				break;

  			         case mp_text_code: // not yet implemented
				 {
					 mp_text_object* to;
					to = (mp_text_object*)p;
					char * po = to->post_script;
					char * ps = to->pre_script;
				}
				break;

				case mp_start_clip_code:
					cairo_save(cr);
					cairo_gr_pdf_path_out(mp,gr_path_p((mp_clip_object*)p),cr);
					cairo_clip(cr);
				break;
				
				case mp_stop_clip_code:
					cairo_restore(cr);
				break;		
				
				case mp_start_bounds_code: // ignored
					//mp_bounds_object *sbo;
					//sbo = (mp_bounds_object *)gr;
				break;

				case mp_stop_bounds_code: //ignored
					//mp_special_object
				break;
				case mp_special_code: //just more experimenting, ignore
					
					mp_special_object *speco;
					speco = (mp_special_object *)p;
					printf("%s", speco->pre_script);
					ondraw(cr, speco->pre_script);
				break;
			}
				p= gr_link(p);
		}
		return 0;
	}

Conclusion

I wish I could switch on the commenting feature but, sadly, spammers make this impossible. So, I just hope the above is a useful starting point for anyone wanting to explore the marvellous MPlib C library.