PATGEN: from WEB to C
I recently became curious about TeX’s hyphenation patterns and started to read about how they are created: usually with PATGEN although, from what I’ve read, some brave souls do actually hand-craft hyphenation patterns! I decided to build PATGEN 2.4 from source which, of course, means converting the PATGEN WEB source to C code via Web2C. Some time ago I went through the process of building my own Web2C executable for Windows (see this article for more details). I won’t go into the specifics of the conversion but I was able to create patgen.c; the resulting C code is less than 2,000 lines long. I also spent some time re-formatting the C code, simply because Web2C’s machine-generated C aims for functionality, not beauty. I removed all dependencies on Kpathsea and generally tidied the code to produce a clean, stripped-down source file that is easy to compile.
Understanding PATGEN: not so easy
PATGEN is, of course, a very highly specialized program and one that is designed for expert users who really need to use it. As a non-expert looking to understand just the basics I found that there was very little step-by-step “beginners” material – although a search on tex.stackexchange.com provided some useful “snippets” and the tutorial “A small tutorial on the multilingual features of PatGen2” by Yannis Haralambous was very helpful. There are, of course, a number of articles, by luminaries and experts, on specific uses of the PATGEN program; however, for me anyway, it was a case of piecing together the puzzle… reading the PATGEN documentation, source code plus some parts of Frank Liang’s thesis Word Hy-phen-a-tion by Com-put-er which describes the hyphenation algorithms that PATGEN implements.
To run PATGEN you need to provide it with the names/paths of (up to) four files (some can be “nul” if you are not using them):
PATGEN dictionary_file starting_patterns translate_file output_patterns
TIP: I created a PDF file of PATGEN’s documentation that you can download here. Some information on the files you provide to PATGEN is given in sections 1 to 6, in the first few pages of the documentation.
In very brief outline, the files you provide on the command line are:
- dictionary_file: A pre-prepared list of hyphenated words from which you want to generate hyphenation patterns for TeX to use.
- starting_patterns: (can be “nul”, i.e., it is not mandatory) Best to read the description(s) in the documentation (link above).
- translate_file: (can be “nul”, i.e., it is not mandatory) From the documentation: “The translate file may specify the values of \lefthyphenmin and \righthyphenmin as well as the external representation and collating sequence of the `letters’ used by the language.” It also specifies other information; see the documentation for further details (section 54).
- output_patterns: the output from PATGEN – a file of hyphenation patterns for use with TeX.
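To make the idea of a “list of hyphenated words” a little more concrete, here is a minimal C++ sketch that splits one dictionary entry into its letters and its hyphen positions. It assumes that a plain '-' marks each permitted hyphen; the names DictionaryEntry and parse_entry are hypothetical and are not taken from PATGEN, whose own dictionary reader handles more than this.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Result of splitting one dictionary entry such as "hy-phen-ation".
struct DictionaryEntry {
    std::string letters;              // the word with the hyphen marks removed
    std::vector<std::size_t> breaks;  // number of letters before each hyphen
};

// Assumption: '-' marks a permitted hyphen and every other character is a
// letter. This is an illustration only, not PATGEN's own parsing code.
DictionaryEntry parse_entry(const std::string &entry) {
    DictionaryEntry result;
    for (char c : entry) {
        if (c == '-')
            result.breaks.push_back(result.letters.size());
        else
            result.letters.push_back(c);
    }
    return result;
}
```

For example, parse_entry("hy-phen-ation") gives the letters “hyphenation” with breaks after the 2nd and 6th letters.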
PATGEN: questions, questions…
In order to work its magic, PATGEN makes multiple passes through the dictionary_file as it builds the list of hyphenation patterns. As it performs the processing PATGEN stops to ask you for input: it needs your help at various stages of the processing. Now, I’m not going to go into the details of those questions simply because I’m not sufficiently experienced with the program to be sure that I’d be giving sensible advice. Sorry :-(.
Answering questions via Lua
So, finally, to the main topic of this post. As noted, during processing PATGEN asks you to provide it with some information to guide the pattern-generation process: those details concern the hyphenation levels, pattern lengths plus some heuristic data that helps PATGEN choose patterns. Ultimately, the answers you give to PATGEN are integer values that you enter at the command line. However, it’s a bit frustrating to keep answering PATGEN’s questions, so I wondered if it would be possible to “automate” providing those answers and, in addition, create a Dynamic Link Library (DLL) that I could use with LuaTeX – perhaps something very basic to start with, like this:
In the above code, require("patgen") will load a DLL (patgen.dll) and return a table of functions that let you set various parameters for PATGEN and then run it to return the pattern list as a string that you can subsequently use with LuaTeX. Note, LuaTeX does NOT require INITEX mode to use hyphenation patterns.
Calling Lua code (functions) from PATGEN
The above simple scenario does indeed work and is quite easy to implement. Firstly, within PATGEN’s void mainbody(void) routine you can replace the code that stops to ask you questions, such as the request for the start/finish pattern lengths:
Fputs(output, "pat_start, pat_finish: ");
input2ints(&n1, &n2);
The above code uses a function input2ints (int *a, int *b) to request two integers:
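void input2ints (int *a, int *b)
{
    int ch;
    while (scanf (SCAN2INT, a, b) != 2) {
        /* Bad input: discard the rest of the line, then ask again. */
        while ((ch = getchar ()) != EOF && ch != '\n');
        if (ch == EOF) return;
        fprintf (stderr, "Please enter two integers.\n");
    }
    /* Discard anything left on the line after the two integers. */
    while ((ch = getchar ()) != EOF && ch != '\n');
}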
You can replace this with your own function, say get_pattern_start_finish(&n1, &n2), which can, for example, call a function in your Lua script to work out the values you want to return for n1 and n2 (the values for pat_start, pat_finish). Perhaps you might store those values in Lua as a table. At the time of writing I’ve not yet written that part; for the moment, the Lua/C module just returns some hardcoded answers. The next step simply requires making a call through the Lua C API to a named Lua function that works out the values you want to provide. This gives the most flexibility because the logic is all contained in your Lua code which, of course, makes it very quick and easy to experiment with different settings to generate different patterns. This technique can also be used for the other parameters that PATGEN asks for.
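As a rough illustration of the idea, here is a minimal C++ sketch in which the replacement function takes its answers from a queue of prepared values instead of reading stdin. The class ScriptedAnswers and the global g_answers are hypothetical names invented for this sketch; in the real module the queue would be replaced by a call into a Lua function through the Lua C API. Unlike input2ints, this version also returns a bool so the caller can tell when no answer is available, which is my own assumption about how you might want it to behave.

```cpp
#include <deque>
#include <utility>

// Hypothetical stand-in for the Lua side: a queue of (first, second) answers
// prepared in advance. In the real module these values would come from a
// Lua function called through the Lua C API.
class ScriptedAnswers {
public:
    void push(int first, int second) { answers_.emplace_back(first, second); }

    // Pops the next prepared pair; returns false once the queue is empty.
    bool next(int *a, int *b) {
        if (answers_.empty()) return false;
        *a = answers_.front().first;
        *b = answers_.front().second;
        answers_.pop_front();
        return true;
    }

private:
    std::deque<std::pair<int, int>> answers_;
};

static ScriptedAnswers g_answers;  // assumed to be filled before PATGEN runs

// Replacement for the input2ints(&n1, &n2) call: no prompt and no blocking
// read from stdin. Returns false if nothing has been queued.
bool get_pattern_start_finish(int *n1, int *n2) {
    return g_answers.next(n1, n2);
}
```

If you queue an answer with g_answers.push(2, 4) before running (example values only), the next call to get_pattern_start_finish(&n1, &n2) sets n1 to 2 and n2 to 4 without stopping to ask.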
Returning the generated pattern(s)
Within PATGEN, there is a function called zoutputpatterns(...) which generates the hyphenation patterns and writes them out to a file. I’m experimenting with a function that wraps this in another function which uses a C++ stringstream object to capture the pattern text rather than writing it to a file. Doing this simply required modifying zoutputpatterns(...) to take the stringstream object and output the patterns (character data) to it rather than to a physical file. Once finished, you can access the stringstream’s stored data as a C-style string (containing the generated hyphenation patterns) which you can return to Lua, and thus to LuaTeX.
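#include <iostream>
#include <sstream>
using namespace std;
void do_output_patterns(int i, int j)
{
    zoutputpatterns(i, j, ss);   // ss: the stringstream* (declared elsewhere) that collects the pattern text
    cout << ss->str() << endl;   // you can then pass the string of patterns back to Lua and thus LuaTeX
}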
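The same capture idea can be shown in a self-contained form. The sketch below is not PATGEN’s code: output_patterns is a hypothetical stand-in for the modified zoutputpatterns, and it takes the patterns from a plain vector rather than from PATGEN’s internal data structures. The point is only that a routine written against std::ostream can send its output to a file or to an in-memory string stream without any other change.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the modified zoutputpatterns: writes one pattern
// per line to whatever stream it is given (a file stream or a string stream).
void output_patterns(const std::vector<std::string> &patterns, std::ostream &out) {
    for (const std::string &p : patterns)
        out << p << '\n';
}

// Captures the pattern text in memory and returns it as a single string,
// ready to be handed back to Lua (for example with lua_pushstring).
std::string patterns_as_string(const std::vector<std::string> &patterns) {
    std::ostringstream ss;
    output_patterns(patterns, ss);
    return ss.str();
}
```

Calling patterns_as_string({"hy3ph", "1tio"}) returns the two example patterns as one newline-separated string; passing a std::ofstream to output_patterns instead would give the original write-to-file behaviour.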
This is just a quick summary of a work-in-progress but it looks like it will provide a nice way to experiment rapidly with PATGEN. It seems to offer dynamic generation of hyphenation patterns and a way to fully script PATGEN’s activities, and thus to understand very quickly the effect of the parameters PATGEN asks you to provide. If there is any interest I might (eventually) release it once I’m happy that it’s good enough.