{"id":3590,"date":"2014-09-14T21:04:58","date_gmt":"2014-09-14T21:04:58","guid":{"rendered":"http:\/\/www.readytext.co.uk\/?p=3590"},"modified":"2014-09-14T21:04:58","modified_gmt":"2014-09-14T21:04:58","slug":"looking-inside-tex-strings-and-pool-files","status":"publish","type":"post","link":"https:\/\/www.readytext.co.uk\/?p=3590","title":{"rendered":"Looking inside TeX: strings and pool files"},"content":{"rendered":"<p><H1>Introduction<\/H1><\/p>\n<p>In this post we&#8217;ll cover TeX&#8217;s handling of strings and explain <code>.pool<\/code> files. Using Web2C to build (Knuthian) TeX from Knuth&#8217;s <code>TeX.WEB<\/code> source code involves many steps, as <a href=\"https:\/\/www.readytext.co.uk\/?p=2529\">explained elsewhere on this site<\/a>. One of the initial steps when building TeX is combining Knuth&#8217;s master source file (<code>TeX.WEB<\/code>) with a &#8220;change file&#8221; (<code>TeX.CH<\/code>) to produce a modified WEB source file (let&#8217;s call it <code>TeXk.WEB<\/code>) which can be processed via the Web2C process. The <code>TeX.CH<\/code> change file applies many modifications to the master <code>TeX.WEB<\/code> source code &ndash; in particular, to prepare it for conversion to C code and to add support for the kpathsea file-searching library. After the change file has been applied, the next step is to process our modified <code>TeX.WEB<\/code> (i.e., <code>TeXk.WEB<\/code>) via the TANGLE program. 
If TANGLE successfully parses our <code>TeXk.WEB<\/code> source code, it will output two files (download links are provided for the inquisitive):<\/p>\n<ul>\n<li><a href=\"https:\/\/www.readytext.co.uk\/files\/TeXk.p\"><code>TeXk.p<\/code><\/a>: the source code of TeX (in Pascal).<\/li>\n<li><a href=\"https:\/\/www.readytext.co.uk\/files\/TeXk.pool\"><code>TeXk.pool<\/code><\/a>: a file containing the string constants defined in <code>TeXk.WEB<\/code><\/li>\n<\/ul>\n<p>Here&#8217;s a small fragment of <code>TeXk.pool<\/code> as produced during my Web2C process:<\/p>\n<pre class=\"brush: plain; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n....\r\n11expandafter\r\n04font\r\n09fontdimen\r\n06halign\r\n05hrule\r\n12ignorespaces\r\n10mathaccent\r\n08mathchar\r\n10mathchoice\r\n08multiply\r\n07noalign\r\n10noboundary\r\n08noexpand\r\n04omit\r\n07penalty\r\n08prevgraf\r\n07radical\r\n04read\r\n05relax\r\n06setbox\r\n03the\r\n06valign\r\n07vcenter\r\n05vrule\r\n09save size\r\n15grouping levels\r\n08curlevel\r\n09retaining\r\n09restoring\r\n05SAVE(\r\n28Incompatible magnification (\r\n02);\r\n36 the previous value will be retained\r\n58I can handle only one magnification ratio per job. So I've\r\n59reverted to the magnification you used earlier on this run.\r\n46Illegal magnification has been changed to 1000\r\n52The magnification ratio must be between 1 and 32768.\r\n...\r\n*413816964\r\n<\/pre>\n<p><code>TeXk.pool<\/code> consists of many lines of the format <code>[string length][string text][end_of_line]<\/code>, where the string length is written as two decimal digits, followed by a final line containing <code>*CHECKSUM<\/code>, where <code>CHECKSUM<\/code> in the above example is <code>413816964<\/code>. Once upon a time, <code>.pool<\/code> files had to be preserved as external files for use when building <code>.fmt<\/code> files via INITEX, but in 2008 this was changed and the <code>.pool<\/code> file is now compiled into the TeX binaries &ndash; I&#8217;ll explain this below. 
For example, the following note is contained in more recent <code>texmf.cnf<\/code> files:<\/p>\n<p><code>As of 2008, pool files don't exist any more (the strings are compiled into the binaries), but just in case something expects to find these:<br \/>\nTEXPOOL = .;$TEXMF\/web2c<br \/>\nMFPOOL = ${TEXPOOL}<br \/>\nMPPOOL = ${TEXPOOL}<br \/>\n<\/code><\/p>\n<p>As you can see from the above fragment, the <code>TeXk.pool<\/code> file contains string constants for TeX&#8217;s primitive commands plus all the strings contained in help\/error messages that TeX outputs to the terminal and\/or log file. <\/p>\n<p><H1>TeX&#8217;s internal handling of strings<\/H1><\/p>\n<p>In addition to the string constants defined in <code>TeXk.pool<\/code>, TeX will, of course, encounter new strings &ndash; for example, when you define new macro names; consequently, TeX needs a way to store the string constants in <code>TeXk.pool<\/code> and the strings it encounters during its run-time processing of your TeX files. It should not be a surprise that TeX&#8217;s internal handling of strings is achieved through methods designed to ensure portability.<\/p>\n<blockquote><p><strong>From TeX.WEB<\/strong>: <code>The TEX system does nearly all of its own memory allocation, so that it can readily be transported into environments that do not have automatic facilities for strings, garbage collection, etc., and so that it can be in control of what error messages the user receives... Control sequence names and diagnostic messages are variable-length strings of eight-bit characters. Since PASCAL does not have a well-developed string mechanism, TeX does all of its string processing by homegrown methods.<\/code><\/p><\/blockquote>\n<p><H2>How does TeX use\/store strings?<\/H2><\/p>\n<p>In vanilla C, a simple 8-bit string is an array of characters terminated by the null character (<code>'\\0'<\/code>). 
TeX does not store its strings as individually named string variables but allocates a single large array and uses integer offsets into that array to identify strings (and calculate lengths). Here&#8217;s how it works.<\/p>\n<blockquote><p><strong>From TeX.WEB<\/strong>:<code> The array |str_pool| contains all of the (eight-bit) ASCII codes in all of the strings, and the array |str_start| contains indices of the starting points of each string. Strings are referred to by integer numbers, so that string number |s| comprises the characters |str_pool[j]| for |str_start[s]&lt;=j&lt;str_start[s+1]|. Additional integer variables |pool_ptr| and |str_ptr| indicate the number of entries used so far in |str_pool| and |str_start|, respectively; locations |str_pool[pool_ptr]| and |str_start[str_ptr]| are ready for the next string to be allocated.<\/code><\/p><\/blockquote>\n<p>It is worth noting that when TANGLE produces Pascal code (from the WEB source) it strips out all underscores from the names of variables defined in the WEB code. 
For example, the <code>|str_pool|<\/code> variable mentioned above is called <code>strpool<\/code> in the final C code produced from the Pascal.<\/p>\n<p>After processing via Web2C, the WEB variables <code>|str_pool|<\/code>, <code>|str_start|<\/code>, <code>|pool_ptr|<\/code> and <code>|str_ptr|<\/code> are global variables declared as follows (near the start of <code>TeX.C<\/code>): <\/p>\n<p><code><br \/>\npackedASCIIcode * strpool  ;<br \/>\npoolpointer * strstart  ;<br \/>\npoolpointer poolptr  ;<br \/>\nstrnumber strptr  ;<br \/>\n<\/code><\/p>\n<p>The types <code>packedASCIIcode<\/code> and <code>poolpointer<\/code> are simply typedefs:<\/p>\n<p><code><br \/>\ntypedef unsigned char packedASCIIcode  ;<br \/>\ntypedef int integer;<br \/>\ntypedef integer poolpointer  ;<br \/>\n<\/code><\/p>\n<p>Stripping away all <code>typedef<\/code>s introduced by Web2C gives:<\/p>\n<p><code><br \/>\nunsigned char* strpool  ;<br \/>\nint* strstart  ;<br \/>\nint poolptr  ;<br \/>\nint strptr  ;<br \/>\n<\/code><\/p>\n<p>To see what&#8217;s going on, i.e., how TeX identifies a string, let&#8217;s first look at the global variable <code>strpool<\/code> (practically all key variables are declared with global scope in <code>TeX.C<\/code>&#8230;!). During initialization (in INITEX mode, and when TeX is reading\/unpacking a <code>.fmt<\/code> file to initialize a particular format (<code>plain.fmt<\/code>, <code>latex.fmt<\/code> etc)) the <code>strpool<\/code> and <code>strstart<\/code> variables are initialized as follows:<\/p>\n<p><code>strpool = xmallocarray (packedASCIIcode , poolsize) ;<\/code><br \/>\n<code>strstart = xmallocarray (poolpointer , maxstrings) ;<\/code><\/p>\n<p>where <code>xmallocarray<\/code> is a <code>#define<\/code>:<\/p>\n<p><code><br \/>\n\/* Allocate an array of a given type. Add 1 to size to account for the fact that Pascal arrays are used from [1..size], unlike C arrays which use [0..size]. 
*\/<br \/>\n#define xmallocarray(type,size) ((type*)xmalloc((size+1)*sizeof(type)))<br \/>\n<\/code><\/p>\n<p>and <code>xmalloc(...)<\/code> is a small utility function wrapped around the standard C function <code>malloc(...)<\/code>.<\/p>\n<blockquote><p><strong>A Pascal legacy<\/strong>: In many places within <code>TeX.C<\/code> you have to account for the fact that Pascal arrays start at index 1 but C arrays start at index 0. This is a consequence of Knuthian TeX being written in Pascal, not C. <\/p><\/blockquote>\n<p>The allocation of memory for <code>strpool<\/code> uses an integer variable called <code>poolsize<\/code>: the value of <code>poolsize<\/code> is calculated at run-time from the values of other variables &ndash; including some variables whose values can be defined by settings in <code>texmf.cnf<\/code>. So, in essence:<\/p>\n<p><code>strpool = (unsigned char *) malloc(sizeof(unsigned char)*(poolsize +1));<\/code><\/p>\n<p>&ndash; which looks very much like one huge C string. And, of course, it is. <code>strpool<\/code> stores all TeX&#8217;s strings BUT within <code>strpool<\/code> all strings are contiguous (stored end-to-end) without any delimiter characters between them (such as the null character (<code>'\\0'<\/code>), a space, etc). Clearly, there needs to be a mechanism to define where each individual string starts and stops: i.e., to partition <code>strpool<\/code> into individual strings. That mechanism is the task of the integer array variable called <code>strstart<\/code>. Perhaps an example will make this clearer.<\/p>\n<p>We can declare a variable <code>fakestrpool<\/code> as follows:<\/p>\n<p><code>unsigned char fakestrpool[]=\"ThisismyfakeTeXstrpool\";<\/code><\/p>\n<p>Here, we have concatenated the 6 strings <code>\"This\"<\/code>, <code>\"is\"<\/code>, <code>\"my\"<\/code>, <code>\"fake\"<\/code>, <code>\"TeX\"<\/code> and <code>\"strpool\"<\/code> into one long string. 
These 6 strings start at the following offsets in <code>fakestrpool<\/code>:<\/p>\n<p><code><br \/>\nstring 0 (\"This\"): offset 0<br \/>\nstring 1 (\"is\"): offset 4<br \/>\nstring 2 (\"my\"): offset 6<br \/>\nstring 3 (\"fake\"): offset 8<br \/>\nstring 4 (\"TeX\"): offset 12<br \/>\nstring 5 (\"strpool\"): offset 15<br \/>\n<\/code><\/p>\n<p>So, if we define an array of integers, <code>strstart<\/code>, to record these offsets:<\/p>\n<p><code>int strstart[6] ; \/\/ for 6 strings numbered 0 to 5<\/code><\/p>\n<p><code><br \/>\nstrstart[0]=0<br \/>\nstrstart[1]=4<br \/>\nstrstart[2]=6<br \/>\nstrstart[3]=8<br \/>\nstrstart[4]=12<br \/>\nstrstart[5]=15<br \/>\n<\/code><\/p>\n<p>Then for some string identified by a number <code>k<\/code> (where <code>0 &lt;= k &lt;= 5<\/code>), <code>strstart[k]<\/code> gives the offset into <code>fakestrpool<\/code> where the kth string starts. And this is exactly how TeX identifies strings: it identifies them using some integer value, <code>k<\/code>, say, where <code>strstart[k]<\/code> tells you where that string starts (in <code>strpool<\/code>) and allows the length (<code>length(k)<\/code>, of string number <code>k<\/code>) to be easily calculated using<\/p>\n<p><code>length(k) = strstart[k + 1] - strstart[k]<\/code><\/p>\n<p>For example, let us use this method to calculate the length of the string with number 4 (<code>k=4<\/code>) (<code>\"TeX\"<\/code> in our test array <code>fakestrpool<\/code>).<\/p>\n<p><code><br \/>\nlength(4) = strstart[5] - strstart[4]<br \/>\nlength(4) = 15 - 12 = 3<br \/>\n<\/code><\/p>\n<p>Of course there is one minor complication &ndash; calculating the length of the last string, string 5, because our small array has no <code>strstart[6]<\/code> &ndash; but TeX has other variables (<code>poolptr<\/code> and <code>strptr<\/code>) to solve issues like this.<\/p>\n<p><H1>Back to <code>.pool<\/code> files<\/H1><\/p>\n<p>We started this discussion by noting that running the TANGLE program on <code>TeXk.WEB<\/code> produces two output files:<\/p>\n<ul>\n<li><a 
href=\"http:\/\/readytext.co.uk\/files\/TeXk.p\"><code>TeXk.p<\/code><\/a>: the source code of TeX (in Pascal).<\/li>\n<li><a href=\"http:\/\/readytext.co.uk\/files\/TeXk.pool\"><code>TeXk.pool<\/code><\/a>: a file containing the string constants defined in <code>TeXk.WEB<\/code><\/li>\n<\/ul>\n<p>The next stage in the discussion covers the mechanisms for processing <code>.pool<\/code> files that were introduced circa 2008. Prior to (circa) 2008, you needed to keep <code>.pool<\/code> files available (as part of the TeX distribution) as separate files for use whenever you ran INITEX to generate a new <code>.fmt<\/code> file. As noted, the contents of the <code>.pool<\/code> files are string constants generated by TANGLE from string constants defined in the main WEB source code of TeX. Given that those strings don&#8217;t change (they are constants), it makes more sense to build them into the TeX executable file rather than having to access them each time a new <code>.fmt<\/code> file is created by INITEX. Part of the Web2C process now involves using a small utility program called <code>makecpool.exe<\/code> (on Windows) &ndash; <code>makecpool.C<\/code> was written by Taco Hoekwater. 
The input to <code>makecpool.exe<\/code> is the <code>TeXk.pool<\/code> file and the output is another C file (called <code>texpool.C<\/code> or similar) which defines a function called <code>loadpoolstrings(...)<\/code>:<\/p>\n<p><code>int loadpoolstrings (int spare_size)<\/code><\/p>\n<p><H2>Downloads<\/H2><\/p>\n<p>If you just want to see the inputs\/outputs you can download the files I produced during my private build of Knuthian TeX:<\/p>\n<ul>\n<li><a href=\"http:\/\/readytext.co.uk\/files\/TeXk.pool\"><code>TeXk.pool<\/code><\/a>: The .pool file input for <code>makecpool.exe<\/code><\/li>\n<li><a href=\"http:\/\/readytext.co.uk\/files\/texpool.c\"><code>texpool.C<\/code><\/a>: The C file output by <code>makecpool.exe<\/code>, defining the function <code>loadpoolstrings(...)<\/code>.<\/li>\n<\/ul>\n<p>Once you have generated <code>texpool.C<\/code> you no longer need the original <code>TeXk.pool<\/code> file because the contents of <code>TeXk.pool<\/code> are now stored within <code>texpool.C<\/code> as an array of strings:<\/p>\n<pre class=\"brush: plain; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\nstatic const char *poolfilearr&#x5B;] = {\r\n  &quot;buffer size&quot;,\r\n  &quot;pool size&quot;,\r\n  &quot;number of strings&quot;,\r\n  &quot;&quot; &quot;?&quot; &quot;?&quot; &quot;?&quot;,\r\n  &quot;m2d5c2l5x2v5i&quot;,\r\n  &quot;End of file on the terminal!&quot;,\r\n  &quot;! &quot;,\r\n  &quot;(That makes 100 errors; please try again.)&quot;,\r\n  &quot;&quot; &quot;? &quot;,\r\n  &quot;Type &lt;return&gt; to proceed, S to scroll future error messages,&quot;,\r\n  &quot;R to run without stopping, Q to run quietly,&quot;,\r\n  &quot;I to insert something, &quot;,\r\n...\r\n...\r\n...\r\nNULL };\r\n<\/pre>\n<p>Of course, when you build TeX you will need to compile both <code>TeXk.C<\/code> and <code>texpool.C<\/code> so that the function <code>loadpoolstrings(...)<\/code> is made available. 
The function <code>loadpoolstrings(...)<\/code> is called from <code>TeX.C<\/code> when TeX is in INITEX mode (i.e., the <code>--ini<\/code> option is set on the command line). Specifically, <code>loadpoolstrings(...)<\/code> is called by the function <code>getstringsstarted(...)<\/code> just after it has initialized the first 256 strings in TeX&#8217;s main string container: the <code>strpool<\/code> array discussed above.<\/p>\n<p><H2>Modifying loadpoolstrings (&#8230;) to see what it does<\/H2><\/p>\n<p>The function loadpoolstrings(&#8230;) depends on a few of TeX&#8217;s internal global variables and the function makestring() (we&#8217;ll discuss that shortly); notably, we need to declare the following as extern in <code>texpool.C<\/code>:<\/p>\n<p><code><br \/>\nextern int makestring ( void ) ;<br \/>\nextern unsigned char * strpool;<br \/>\nextern int poolptr;<br \/>\n<\/code><\/p>\n<p>Here is my slightly modified version of <code>loadpoolstrings(...)<\/code> which outputs a file called <code>\"datadump.txt\"<\/code> to list the strings and corresponding string numbers generated by <code>makestring()<\/code>:<\/p>\n<pre class=\"brush: plain; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\nint loadpoolstrings (int spare_size) {\r\n  const char *s;\r\n  int g=0;\r\n  FILE* dumpvals;\r\n  int i=0,j=0;\r\n  dumpvals=fopen(&quot;datadump.txt&quot;, &quot;wb&quot;);\r\n\r\n  while ((s = poolfilearr&#x5B;j++])) {\r\n    int l = strlen (s);\r\n    \/* log the string text before it is copied into strpool *\/\r\n    fprintf(dumpvals, &quot;\/\/string \\&quot;%s\\&quot; = number &quot;, s);\r\n    i += l;\r\n    \/* out of pool space: close the dump file before giving up *\/\r\n    if (i&gt;=spare_size) { fclose(dumpvals); return 0; }\r\n    while (l-- &gt; 0) strpool&#x5B;poolptr++] = *s++;\r\n    g = makestring();\r\n    \/* g is an int, so print it with %d *\/\r\n    fprintf(dumpvals, &quot;%d\\n&quot;, g);\r\n  }\r\n  fclose(dumpvals);\r\n  return g;\r\n}\r\n<\/pre>\n<p><H2><code>datadump.txt<\/code><\/H2><\/p>\n<p>Those who might be interested to see the contents of <code>datadump.txt<\/code> can <a 
href=\"https:\/\/www.readytext.co.uk\/files\/datadump.txt\">download it here<\/a>. In any case, here&#8217;s a listing of the first few lines in <code>datadump.txt<\/code>:<\/p>\n<pre class=\"brush: plain; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n\/\/string &quot;buffer size&quot; = number 256\r\n\/\/string &quot;pool size&quot; = number 257\r\n\/\/string &quot;number of strings&quot; = number 258\r\n\/\/string &quot;???&quot; = number 259\r\n\/\/string &quot;m2d5c2l5x2v5i&quot; = number 260\r\n\/\/string &quot;End of file on the terminal!&quot; = number 261\r\n\/\/string &quot;! &quot; = number 262\r\n...\r\n...\r\n\/\/string &quot;Using character substitution: &quot; = number 1329\r\n<\/pre>\n<p>As you can see, the string number of the first string is 256 (i.e., the first string originally contained in <code>TeXk.pool<\/code>). Assuming that the string numbers start at 0 (they do), TeX has already initialized strings 0..255 before loading the strings from the <code>TeXk.pool<\/code> file. 
I hate to do this to you, dear reader, but can you guess what those 256 strings (0..255) might be?<\/p>\n<p><H2>The function <code>makestring()<\/code><\/H2><\/p>\n<p>Here is TeX&#8217;s <code>makestring()<\/code> function, which returns a string number after checking for overflow &ndash; i.e., after checking that TeX still has room in <code>strstart<\/code> to record another string.<\/p>\n<pre class=\"brush: plain; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\nstrnumber makestring (void) \r\n{\r\n  register strnumber Result; makestring_regmem\r\n  if (strptr == maxstrings) \r\n  overflow (258 , maxstrings - initstrptr) ;\r\n  incr (strptr) ;\r\n  strstart&#x5B;strptr] = poolptr ;\r\n  Result = strptr - 1 ;\r\n  return Result ;\r\n}\r\n<\/pre>\n<p><H1>Time to stop<\/H1><\/p>\n<p>Dear reader, writing this post has absorbed the greater part of my Sunday (14 September 2014) so you&#8217;ll forgive me if I call it a day and leave it here &ndash; I&#8217;ll fix any typos tomorrow :-). I hope it is of use, or interest, to someone &#8220;out there&#8221;, somewhere.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In this post we&#8217;ll cover TeX&#8217;s handling of strings and explain .pool files. Using Web2C to build (Knuthian) TeX from Knuth&#8217;s TeX.WEB source code involves many steps as explained elsewhere on this site. 
One of the initial steps when building TeX is combining Knuth&#8217;s master source file (TeX.WEB) with a &#8220;change file&#8221; (TeX.CH) to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3590","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/3590","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3590"}],"version-history":[{"count":56,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/3590\/revisions"}],"predecessor-version":[{"id":4045,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/3590\/revisions\/4045"}],"wp:attachment":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3590"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3590"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3590"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}