<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>STM publishing: tools, technologies and change</title>
	<atom:link href="http://www.readytext.co.uk/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.readytext.co.uk</link>
	<description>A WordPress site for STM Publishing</description>
	<lastBuildDate>Tue, 30 Apr 2013 13:18:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Building SIL&#8217;s Graphite2 on Windows</title>
		<link>http://www.readytext.co.uk/?p=2903&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=building-sils-graphite2-on-windows</link>
		<comments>http://www.readytext.co.uk/?p=2903#comments</comments>
		<pubDate>Mon, 29 Apr 2013 18:26:18 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2903</guid>
		<description><![CDATA[A very short post
I've been reading about SIL International's Graphite engine and it looks really interesting. I downloaded the code and ran the CMake-based build process through the CMake graphical interface. It didn't work. Eventually, I found some instructions to build it from the command line, so here's the way I did it. 

Make sure [...]]]></description>
			<content:encoded><![CDATA[<h1>A very short post</h1>
<p>I've been reading about SIL International's <a href="http://scripts.sil.org/cms/scripts/page.php?site_id=projects&#038;item_id=graphite_home">Graphite engine</a> and it looks really interesting. I downloaded the code and ran the CMake-based build process through the <a href="http://www.cmake.org/">CMake</a> graphical interface. It didn't work. Eventually, I found some instructions to build it from the command line, so here's the way I did it. </p>
<ol>
<li>Make sure the <code>cmake.exe</code> is in your Windows <code>PATH</code>.</li>
<li>Download the <a href="http://projects.palaso.org/projects/graphitedev/files">Graphite source code</a> and unpack it into a directory (e.g., one called <code>Graphite</code>).</li>
<li>Change directory to the one containing the Graphite source code.</li>
<li>I use Microsoft Visual Studio 2008 so you'll need to adjust the <code>-G</code> parameter (below) to your build environment (<code>cmake --help</code> tells you the ones it supports).</li>
<li>Run the command (all on one line): <code>cmake -G "Visual Studio 9 2008" -DCMAKE_BUILD_TYPE=Release -DGRAPHITE2_COMPARE_RENDERER:BOOL=OFF</code></li>
</ol>
<p>If all goes well you should see something like the following, together with a generated Visual Studio Solution file <code>graphite2.sln</code>.</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
-- Build: Release
-- Segment Cache support: enabled
-- File Face support: enabled
-- Tracing support: enabled
CMake Warning at CMakeLists.txt:54 (message):
  vm machine type direct can only be built using GCC

-- Using vm machine type: call
-- Configuring done
-- Generating done
-- Build files have been written to: E:/SILgraide/Graphite
</pre>
<p>Your Visual Studio Solution should look something like this:</p>
<p><img width="100%" src="http://www.readytext.co.uk/files/graphite2.png"/></p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2903</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Glyph chart for ScheherazadeRegOT</title>
		<link>http://www.readytext.co.uk/?p=2830&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=glyph-chart-for-scheherazaderegot</link>
		<comments>http://www.readytext.co.uk/?p=2830#comments</comments>
		<pubDate>Sun, 07 Apr 2013 14:01:29 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2830</guid>
		<description><![CDATA[I'm in the middle of writing the first of a new series of articles on using the libotf C library to typeset fully vowelled Arabic. I hope to get the first article finished in the next week or two. In the meantime, here's a glyph chart for the free OpenType Arabic typeface called Scheherazade, produced [...]]]></description>
			<content:encoded><![CDATA[<p>I'm in the middle of writing the first of a new series of articles on using the <a href="http://savannah.nongnu.org/files/?group=m17n">libotf C library</a> to typeset fully vowelled Arabic. I hope to get the first article finished in the next week or two. In the meantime, here's a glyph chart for the free OpenType Arabic typeface called Scheherazade, produced by, and <a href="http://scripts.sil.org/cms/scripts/page.php?item_id=ArabicFonts_Download#ofl">available for download</a> from, SIL International. Many thanks to them for providing this typeface, together with the <a href="http://www.microsoft.com/typography/volt.mspx">Microsoft VOLT</a> project files (contained in the <a href="http://scripts.sil.org/cms/scripts/render_download.php?format=file&#038;media_id=scheherazade_OT_1_005_dev&#038;filename=ScheherazadeRegOT-1.005-developer.zip">developer package</a>).</p>
<p><a href="http://readytext.co.uk/files/ScheherazadeRegOTchart.pdf">Download PDF</a></p>
<p><iframe src="https://docs.google.com/gview?url=http://readytext.co.uk/files/ScheherazadeRegOTchart.pdf&#038;embedded=true" style="width:100%; height:600px;" frameborder="0"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2830</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Type 42 PostScript fonts with DVIPS: FreeType, LCDF Typetools and re-encoding</title>
		<link>http://www.readytext.co.uk/?p=2693&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=type-42-postscript-fonts-with-dvips-freetype-lcdf-typetools-and-re-encoding</link>
		<comments>http://www.readytext.co.uk/?p=2693#comments</comments>
		<pubDate>Sat, 30 Mar 2013 15:03:03 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[C programming (miscellaneous)]]></category>
		<category><![CDATA[OpenType]]></category>
		<category><![CDATA[Post about about fonts, glyphs and characters]]></category>
		<category><![CDATA[TeX (general)]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2693</guid>
		<description><![CDATA[Summary
This is a lengthy post which covers numerous topics on using fonts with TeX and DVIPS. It was fun to write and program but it certainly absorbed many hours of my evenings and weekends. In some areas I've had to omit some finer details because it would make the article way too long and I'd [...]]]></description>
			<content:encoded><![CDATA[<h1>Summary</h1>
<p>This is a lengthy post which covers numerous topics on using fonts with TeX and DVIPS. It was fun to write and program but it certainly absorbed many hours of my evenings and weekends. In some areas I've had to omit some finer details because it would make the article way too long and I'd probably run out of steam and never finish it: think of it as a "getting started" tutorial. I hope it is useful and interesting. Now to get on with some of those household tasks I've put off whilst writing this &ndash; and thanks to my partner, Alison Tovey, who has waited patiently (well, almost <img src='http://www.readytext.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> ) whilst I was glued to WordPress!</p>
<h1>Introduction</h1>
<p>Modern TeX(-based) engines, such as XeTeX and LuaTeX, provide direct access to using OpenType fonts, albeit using different philosophies/methods. This post looks at just one way to use TrueType-flavoured OpenType fonts with the traditional TeX&ndash;DVIPS&ndash;PostScript&ndash;PDF workflow which is usually associated with the 8-bit world of Type 1 PostScript fonts. The idea is that we'll convert TrueType-flavoured OpenType fonts to Type 42 PostScript fonts and include the Type 42 font data into DVIPS's PostScript output stream using the <code>DVIPS -h filename</code> mechanism. In addition, we'll look at using font encoding and the creation of TeX Font Metrics to enable access to the rich set of glyphs in a modern TrueType-flavour OpenType font.</p>
<p>Many TrueType-flavoured OpenType fonts (and thus the resulting Type 42 PostScript font) contain hundreds, if not <em>thousands</em>, of glyphs &ndash; making the 8-bit world of the traditional PostScript Encoding Vector little more than a tiny window into the rich array of available glyphs. By re-encoding the base Type 42 font we can generate a range of 256-character fonts for TeX and DVIPS to exploit the full range of glyphs in the original TrueType font &ndash; such as a true small caps font if the TrueType font has them.</p>
<p>We will also need to create the TeX Font Metrics (TFMs) so that TeX can access the metric data describing our fonts &ndash; the width, height, depth plus any kerning and ligatures we care to add. Of course, the virtual font mechanism is also a valid approach &ndash; see <a href="http://tug.org/TUGboat/tb11-1/tb27knut.pdf">Virtual Fonts: More Fun for Grand Wizards</a> for more details. Much of what we're doing here uses a number of freely available software tools to extract key data from the actual OpenType font files for onward processing into a form suitable for TeX.</p>
<h2>Context of these experiments</h2>
<p>Over the past few weeks I've spent some evenings and weekends building TeX and friends from WEB source code, using Microsoft's Visual Studio. For now, this all resides in a large Visual Studio project containing all the various applications and is a little "<a href="http://en.wikipedia.org/wiki/W._Heath_Robinson">Heath Robinson</a>", although it does work. Within each of my builds of TeX and friends I've replaced the venerable Kpathsea path/file-searching library with one of my own creation &ndash; which does a direct search using recursive directory traversal. I'm also toying with using a database-lookup approach, hence the appearance of SQLite in the list of C libraries within the screenshot.</p>
<p><img src="http://readytext.co.uk/files/web2c.png" width="100%"/></p>
<p>Turning to Eddie Kohler's marvellous <a href="http://www.lcdf.org/type/">LCDF Typetools collection</a>, I used MinGW/MSYS to build this. LCDF Typetools contains some incredibly useful tools for working with fonts via TeX/DVIPS &ndash; including <code>ttftotype42</code> which can generate a <a href="http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5012.Type42_Spec.pdf">Type 42 PostScript font</a> from TrueType-flavoured OpenType fonts. You can think of a Type 42 font as a PostScript "wrapper" around the native TrueType font data, allowing you to insert TrueType fonts into PostScript code.</p>
<h1>Characters, glyphs, glyph names, encodings and glyph IDs</h1>
<p>Firstly, we need to review several interrelated topics: characters, glyphs, glyph names, encodings and glyph IDs (contained in OpenType fonts). Let's begin by thinking about characters. A <em>character</em> can be considered as the fundamental building block of a language: it is, if you like, an "atomic unit of communication" (spoken or not) which has a defined role and purpose: the character's <em>meaning</em> (semantics). Most characters need some form of visual representation; however, that visual representation may not be fixed: most characters of a human spoken/written language can be represented in different forms. For example, the character 'capital H' (H) can take on different visual appearances depending on the font you use to display it. Fonts come in different designs and each design of our 'capital H' is called a <em>glyph</em>: a specific visual design which is particular to the font used to represent the 'capital H'. Each character that a font is capable of displaying will have a glyph designed to represent it &ndash; not only that but you may have a fancy font that contains multiple representations for a particular character: small caps, italic, bold and so forth. Each of these variants uses a different glyph to represent the same character: they still represent the same fundamental "unit of meaning" (a character) just using different visual forms of expression (glyphs).</p>
<p>If we look around us we see, of course, that there are hundreds of languages in our world and if we break these languages down into their core units of expression/meaning we soon find that many thousands of characters are needed to "define" or encompass these languages. So, how do we go about listing these characters and, more to the point, communicating in these languages through e-mails, text files, printed documents and so forth? As humans we refer to characters by a name (e.g., 'capital H') but computers, obviously, deal with numbers. To communicate our characters by computer we need a way to allocate an agreed set of numbers to those characters so that we can store or transmit them electronically. And that's called the <em>encoding</em>. An encoding is simply an agreed set of numbers assigned to an agreed set of characters &ndash; so that we can store those numbers and know that our software will eventually display the correct glyphs to provide visual expression of our characters. To communicate using numbers to represent characters both sides have to agree on the encoding (mapping of numbers to characters) being used. If I save my text file (a bunch of numbers) and you open it up then your software must interpret those numbers in the same way I did when I wrote the text. Clearly, it's essential for encoding standards to exist and perhaps the most well known is, of course, the Unicode standard which allocates a unique number to well over 100,000 characters (at present), with new characters being added from time to time as the Unicode standard is updated.</p>
<p>Let's take a closer look at fonts. We've seen that the job of a font is to provide the glyphs which represent a certain set of characters. Naturally, any particular font will only contain glyphs to represent a small subset of the world's characters: there are just too many for any single font to contain them all. We've also said that some fonts may contain multiple glyphs to represent the same character. Considering OpenType fonts for the moment, within each font the individual glyphs (designs representing a specific character) are each given a <em>name</em> and a numeric identifier, called the <em>glyph identifier</em> (also called the index or glyph ID). Each glyph is thus described by a (name, glyph ID) pair. It's really important to realise that the glyph ID has <em><strong>nothing</strong> to do with encoding of characters</em>: it is just an internal bookkeeping number used within the font and assigned to each glyph by the font's creator. The numeric IDs assigned to a particular glyph are not defined by a global standard. Furthermore, the names given to glyphs also show a great deal of variation, although there are some attempts at standardizing them: see the <a href="http://en.wikipedia.org/wiki/Adobe_Glyph_List">Adobe Glyph List</a> which aims to provide a standard naming convention.</p>
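<p>As a small illustration of why both sides must agree on the encoding, here is a minimal Python sketch (not part of the TeX/DVIPS toolchain discussed in this post). It stores two characters as bytes using the Latin-1 encoding and then reads those same bytes back under two different encodings; the choice of Latin-1 and code page 437 is purely an assumption made for the example.</p>
<pre class="brush: python; light: false; title: ; toolbar: true; notranslate">
# The character 'capital H' and the number Unicode assigns to it.
code_point = ord("H")                    # 72 (hexadecimal 48)

# Storing text means storing numbers: 'H' followed by e-acute, as Latin-1 bytes.
stored = "H\u00e9".encode("latin-1")

# The reader must apply the same encoding to recover the same characters.
same_encoding = stored.decode("latin-1")
other_encoding = stored.decode("cp437")

print(code_point, list(stored))
print(same_encoding == "H\u00e9")        # both sides agreed on the mapping
print(other_encoding == "H\u00e9")       # same numbers, different agreement
</pre>
<p>The stored numbers never change; only the agreed mapping from numbers to characters does, which is why the second comparison fails.</p>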
<p>Let's recap. We've seen that the fundamental "unit of communication" is the character and that characters are <em>encoded</em> by assigning each one to a number. We've also seen that fonts contain the designs, called <em>glyphs</em>, which represent the characters supported by the font. Internally, each (OpenType) font assigns every glyph an <em>identifier</em> (glyph ID) and a <em>glyph name</em> which may, or may not, be "standard".</p>
<p>So, the next question we need to think about is: given a text file containing characters represented (stored) according to a specific encoding (a set of numbers), how does any font actually know how to map from a certain character in the text file to the correct glyph to represent it? After all, the encoding in the text file is usually based on a standard but the data in our font, glyph IDs and glyph names, are not standard. Well, not surprisingly there is indeed some extra bit of data inside the font which provides the glue and this is called the <em>Encoding Vector</em> (in older PostScript fonts) or character map (CMAP) table within the modern world of Unicode and OpenType fonts. The job of the Encoding Vector (or character map (CMAP)) is to provide the link between the standard world of encoded characters and the (relatively) non-standard inner font world of glyph IDs and glyph names.</p>
<h1>A sneak peek at GentiumPlus-R: 5586 glyphs in a single font</h1>
<p>For the remainder of this post I'll use the free <a href="http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_download#801ab246">Gentium OpenType font</a> (GentiumPlus-R) as an example because I do not want to inadvertently infringe any commercial licence conditions in the work below. To help solidify the ideas described above I generated a table of all the glyphs (plus glyph ID and glyph name) contained within the GentiumPlus-R TrueType-flavour OpenType font.</p>
<h2>GentiumPlus-R glyph chart</h2>
<blockquote><p><strong>Technical details:</strong> To generate these glyph tables I wrote a command-line utility (in C) which used the FreeType library to extract the low-level data from inside the OpenType font. This data was written out as a PostScript program which loops over all the glyphs: drawing each glyph together with its glyph ID and name. This PostScript program was combined with the GentiumPlus (TrueType) font after converting it to a Type 42 PostScript font using <code>ttftotype42</code> compiled from the source code distributed as part of the wonderful <a href="http://www.lcdf.org/type/">LCDF Typetools collection</a>.
</p></blockquote>
<p><a href="http://readytext.co.uk/files/gentium.pdf">Download PDF</a></p>
<p><iframe src="https://docs.google.com/gview?url=http://readytext.co.uk/files/gentium.pdf&#038;embedded=true" style="width:100%; height:600px;" frameborder="0"></iframe></p>
<h1>PostScript Encoding Vectors</h1>
<p>Let's recap on our objectives. We've explored the idea of glyphs, characters and encodings and seen that OpenType fonts can contain many thousands of glyphs to display thousands of characters. However, OpenType fonts can't easily be used within the <em>traditional</em> TeX&ndash;DVIPS&ndash;PostScript&ndash;PDF workflow: most traditional TeX workflows use 8-bit characters and Type 1 PostScript fonts. As yet, we've still not explained exactly how a character code is "mapped" to a specific glyph in a font. So, it's time to look at this, focussing on Type 1 and Type 42 PostScript fonts, ignoring OpenType fonts. The "magic glue" we need to explore is the so-called <em>Encoding Vector</em> present in Type 1 and Type 42 fonts. The job of the Encoding Vector is to map from character codes in the input to glyphs contained in the font. Let's look at an example to make this clearer. I'll assume that you have access to the <code>ttftotype42</code> utility from the <a href="http://www.lcdf.org/type/">LCDF Typetools collection</a>. If you don't have it, or can't compile it, contact me and I'll e-mail my compiled version to you.</p>
<h2>Using <code>ttftotype42</code></h2>
<p>If you run <code>ttftotype42</code> on a TrueType-flavour OpenType font it will generate a fairly large plain text file which you can inspect with any text editor, so let's do that. In these examples I'll use the free <a href="http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_download#801ab246">Gentium OpenType font</a>.</p>
<p>If you download GentiumPlus and place the GentiumPlus-R.ttf file in the same directory as <code>ttftotype42</code> and run</p>
<p><code>ttftotype42 GentiumPlus-R.ttf GentiumPlus.t42</code></p>
<p>you should generate a file <code>GentiumPlus.t42</code> which is a little over 2MB in size &ndash;  remember, the GentiumPlus font contains over 5,500 glyphs! Loosely speaking you can think of the Type 42 font generated by <code>ttftotype42</code> as being made up from the following sections:</p>
<ul>
<li>PostScript header</li>
<li>Encoding Vector</li>
<li>/sfnts glyph data array</li>
<li>/CharStrings dictionary</li>
<li>PostScript trailer</li>
</ul>
<blockquote><p><strong>Download <code>GentiumPlus.t42</code></strong>: I uploaded the Type 42 font file <code>GentiumPlus.t42</code> created by <code>ttftotype42</code> onto this site: you can <a href="http://www.readytext.co.uk/files/GentiumPlus.zip">download it here</a>. </p>
</blockquote>
<p>Here's an extract from the Type 42 font version of GentiumPlus-R.ttf with vast amounts of data snipped out for brevity:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
%!PS-TrueTypeFont-65536-98828-1
%%VMusage: 0 0
11 dict begin
/FontName /GentiumPlus def
/FontType 42 def
/FontMatrix [1 0 0 1 0 0] def
/FontBBox [-0.676758 -0.463867 1.49951 1.26953] readonly def
/PaintType 0 def
/XUID [42 16#30C4BB 16#E5CA1A 16#75CC0A 16#BE5D07 16#47E1FB 16#4C] def
/FontInfo 10 dict dup begin
/version (Version 1.510) readonly def
/Notice (Gentium is a trademark of SIL International.) readonly def
/Copyright (Copyright \(c\) 2003-2012, SIL International \(http://scripts.sil.org/\)) readonly def
/FullName (Gentium Plus) readonly def
/FamilyName (Gentium Plus) readonly def
/Weight (Regular) readonly def
/isFixedPitch false def
/ItalicAngle 0 def
/UnderlinePosition -0.146484 def
/UnderlineThickness 0.0488281 def
end readonly def
/Encoding 256 array
0 1 255{1 index exch/.notdef put}for
dup 13 /nonmarkingreturn put
dup 32 /space put
dup 33 /exclam put
dup 34 /quotedbl put
dup 35 /numbersign put
dup 36 /dollar put
dup 37 /percent put
dup 38 /ampersand put
...
...
-- snipped lots of lines of the encoding vector --
...
...
dup 254 /thorn put
dup 255 /ydieresis put
readonly def
/sfnts[
&lt;00010000.......
...
...
-- snipped vast amounts of glyph data --
...
...
] def
/CharStrings 5586 dict dup begin
/.notdef 0 def
/.null 1 def
/nonmarkingreturn 2 def
/space 3 def
/exclam 4 def
/quotedbl 5 def
/numbersign 6 def
...
...
-- snipped vast amounts of CharStrings data --
...
...
end readonly def
FontName currentdict end definefont pop
</pre>
<p>The section of interest here is the Encoding Vector which is reproduced in full:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
/Encoding 256 array
0 1 255{1 index exch/.notdef put}for
dup 13 /nonmarkingreturn put
dup 32 /space put
dup 33 /exclam put
dup 34 /quotedbl put
dup 35 /numbersign put
dup 36 /dollar put
dup 37 /percent put
dup 38 /ampersand put
dup 39 /quotesingle put
dup 40 /parenleft put
dup 41 /parenright put
dup 42 /asterisk put
dup 43 /plus put
dup 44 /comma put
dup 45 /hyphen put
dup 46 /period put
dup 47 /slash put
dup 48 /zero put
dup 49 /one put
dup 50 /two put
dup 51 /three put
dup 52 /four put
dup 53 /five put
dup 54 /six put
dup 55 /seven put
dup 56 /eight put
dup 57 /nine put
dup 58 /colon put
dup 59 /semicolon put
dup 60 /less put
dup 61 /equal put
dup 62 /greater put
dup 63 /question put
dup 64 /at put
dup 65 /A put
dup 66 /B put
dup 67 /C put
dup 68 /D put
dup 69 /E put
dup 70 /F put
dup 71 /G put
dup 72 /H put
dup 73 /I put
dup 74 /J put
dup 75 /K put
dup 76 /L put
dup 77 /M put
dup 78 /N put
dup 79 /O put
dup 80 /P put
dup 81 /Q put
dup 82 /R put
dup 83 /S put
dup 84 /T put
dup 85 /U put
dup 86 /V put
dup 87 /W put
dup 88 /X put
dup 89 /Y put
dup 90 /Z put
dup 91 /bracketleft put
dup 92 /backslash put
dup 93 /bracketright put
dup 94 /asciicircum put
dup 95 /underscore put
dup 96 /grave put
dup 97 /a put
dup 98 /b put
dup 99 /c put
dup 100 /d put
dup 101 /e put
dup 102 /f put
dup 103 /g put
dup 104 /h put
dup 105 /i put
dup 106 /j put
dup 107 /k put
dup 108 /l put
dup 109 /m put
dup 110 /n put
dup 111 /o put
dup 112 /p put
dup 113 /q put
dup 114 /r put
dup 115 /s put
dup 116 /t put
dup 117 /u put
dup 118 /v put
dup 119 /w put
dup 120 /x put
dup 121 /y put
dup 122 /z put
dup 123 /braceleft put
dup 124 /bar put
dup 125 /braceright put
dup 126 /asciitilde put
dup 160 /uni00A0 put
dup 161 /exclamdown put
dup 162 /cent put
dup 163 /sterling put
dup 164 /currency put
dup 165 /yen put
dup 166 /brokenbar put
dup 167 /section put
dup 168 /dieresis put
dup 169 /copyright put
dup 170 /ordfeminine put
dup 171 /guillemotleft put
dup 172 /logicalnot put
dup 173 /uni00AD put
dup 174 /registered put
dup 175 /macron put
dup 176 /degree put
dup 177 /plusminus put
dup 178 /twosuperior put
dup 179 /threesuperior put
dup 180 /acute put
dup 181 /mu put
dup 182 /paragraph put
dup 183 /periodcentered put
dup 184 /cedilla put
dup 185 /onesuperior put
dup 186 /ordmasculine put
dup 187 /guillemotright put
dup 188 /onequarter put
dup 189 /onehalf put
dup 190 /threequarters put
dup 191 /questiondown put
dup 192 /Agrave put
dup 193 /Aacute put
dup 194 /Acircumflex put
dup 195 /Atilde put
dup 196 /Adieresis put
dup 197 /Aring put
dup 198 /AE put
dup 199 /Ccedilla put
dup 200 /Egrave put
dup 201 /Eacute put
dup 202 /Ecircumflex put
dup 203 /Edieresis put
dup 204 /Igrave put
dup 205 /Iacute put
dup 206 /Icircumflex put
dup 207 /Idieresis put
dup 208 /Eth put
dup 209 /Ntilde put
dup 210 /Ograve put
dup 211 /Oacute put
dup 212 /Ocircumflex put
dup 213 /Otilde put
dup 214 /Odieresis put
dup 215 /multiply put
dup 216 /Oslash put
dup 217 /Ugrave put
dup 218 /Uacute put
dup 219 /Ucircumflex put
dup 220 /Udieresis put
dup 221 /Yacute put
dup 222 /Thorn put
dup 223 /germandbls put
dup 224 /agrave put
dup 225 /aacute put
dup 226 /acircumflex put
dup 227 /atilde put
dup 228 /adieresis put
dup 229 /aring put
dup 230 /ae put
dup 231 /ccedilla put
dup 232 /egrave put
dup 233 /eacute put
dup 234 /ecircumflex put
dup 235 /edieresis put
dup 236 /igrave put
dup 237 /iacute put
dup 238 /icircumflex put
dup 239 /idieresis put
dup 240 /eth put
dup 241 /ntilde put
dup 242 /ograve put
dup 243 /oacute put
dup 244 /ocircumflex put
dup 245 /otilde put
dup 246 /odieresis put
dup 247 /divide put
dup 248 /oslash put
dup 249 /ugrave put
dup 250 /uacute put
dup 251 /ucircumflex put
dup 252 /udieresis put
dup 253 /yacute put
dup 254 /thorn put
dup 255 /ydieresis put
readonly def
</pre>
<p>The Encoding Vector is an array indexed by a number which runs from 0 to 255 and the value stored at each index position is the <em>name</em> of a glyph contained in the font. You have probably guessed that the index (0 to 255) is the numeric value of an input character. So, via the Encoding Vector with 256 potential character values as input, we can reach up to 256 individual glyphs contained in the font. (Note: I'm ignoring the PostScript <code>glyphshow</code> operator which allows access to any glyph if you know its name). </p>
<blockquote><p><strong>The full story (quoting from the <a href="http://www.adobe.com/devnet/font/pdfs/5012.Type42_Spec.pdf">Type 42 font specification</a>)</strong> "The PostScript interpreter uses the /Encoding array to look up the character name, which is then used to access the /Charstrings entry with that name. The value of that entry is the glyph index, which is then used to retrieve the glyph description."</p></blockquote>
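<p>To make that two-step lookup concrete, here is a minimal Python sketch that models it. It is only an illustration of the data flow, not how a PostScript interpreter is implemented: the function names are my own, and the input is just the handful of <code>/Encoding</code> and <code>/CharStrings</code> lines shown in the GentiumPlus.t42 extract above.</p>
<pre class="brush: python; light: false; title: ; toolbar: true; notranslate">
import re

# Lines copied from the GentiumPlus.t42 extract shown earlier in this post.
T42_EXTRACT = """
dup 13 /nonmarkingreturn put
dup 32 /space put
dup 33 /exclam put
dup 34 /quotedbl put
dup 35 /numbersign put
/.notdef 0 def
/.null 1 def
/nonmarkingreturn 2 def
/space 3 def
/exclam 4 def
/quotedbl 5 def
/numbersign 6 def
"""

def parse_encoding(text):
    """Build the 256-entry Encoding array; unset slots hold '.notdef'."""
    encoding = [".notdef"] * 256
    for code, name in re.findall(r"^dup (\d+) /(\S+) put$", text, re.M):
        encoding[int(code)] = name
    return encoding

def parse_charstrings(text):
    """Build the CharStrings dictionary: glyph name to glyph index."""
    charstrings = {}
    for name, gid in re.findall(r"^/(\S+) (\d+) def$", text, re.M):
        charstrings[name] = int(gid)
    return charstrings

def glyph_for(code, encoding, charstrings):
    """Character code to glyph name (Encoding), then name to index (CharStrings)."""
    name = encoding[code]
    return name, charstrings[name]

encoding = parse_encoding(T42_EXTRACT)
charstrings = parse_charstrings(T42_EXTRACT)

print(glyph_for(33, encoding, charstrings))   # character code 33: exclam
print(glyph_for(65, encoding, charstrings))   # 'A' is missing from this short extract
</pre>
<p>With only the short extract as input, character code 65 falls through to <code>.notdef</code>; fed the complete Encoding Vector and CharStrings dictionary, the same lookup would reach the glyph for A.</p>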
<p>However, there are 5586 glyphs in GentiumPlus so does this mean the remaining 5330 glyphs are wasted and unreachable? Of course that's not true but we can only reach 256 glyphs via each individual Encoding Vector: the trick we need is <em>font re-encoding</em>. The Encoding Vector is not a fixed entity: you can amend it or replace it entirely with a new one to map character codes 0 to 255 to different glyphs within the font. I won't give the full details here, although it's quite simple to understand. What you do, in effect, is a bit of PostScript programming to "clone" some of the font data structures, give this "clone" a new PostScript font name and a new Encoding Vector which maps the 256 character codes to totally different glyphs. For some excellent tutorials on PostScript programming, including font re-encoding, I highly recommend reading the truly excellent <a href="http://www.acumentraining.com/acumenjournal.html">Acumen Training Journal</a> which is completely free. Specifically, <a href="http://www.acumentraining.com/Acumen_Journal/AcumenJournal_Nov2001.zip">November 2001</a> and <a href="http://www.acumentraining.com/Acumen_Journal/AcumenJournal_Dec2001.zip">December 2001</a> issues.</p>
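<p>The "clone and re-encode" idea can be modelled in a few lines of Python. This is a sketch of the concept only: real re-encoding is done in PostScript, as the Acumen Journal articles explain. The toy font dictionary, the small-caps glyph names (<code>a.sc</code>, <code>b.sc</code>) and all the glyph IDs are invented for the illustration and are not taken from GentiumPlus.</p>
<pre class="brush: python; light: false; title: ; toolbar: true; notranslate">
import copy

# A toy stand-in for a font dictionary. The small-caps glyph names and all
# glyph IDs below are made up for illustration; they are not GentiumPlus data.
base_font = {
    "FontName": "ToyFont",
    "Encoding": [".notdef"] * 256,
    "CharStrings": {".notdef": 0, "a": 1, "b": 2, "a.sc": 3, "b.sc": 4},
}
base_font["Encoding"][97] = "a"
base_font["Encoding"][98] = "b"

def reencode(font, new_name, changes):
    """Clone the font, rename it and install an amended Encoding Vector."""
    clone = copy.copy(font)                     # shallow copy: glyph data is shared
    clone["FontName"] = new_name
    clone["Encoding"] = list(font["Encoding"])  # private copy of the 256 slots
    for code, glyph_name in changes.items():
        if glyph_name not in font["CharStrings"]:
            raise KeyError(glyph_name)          # cannot encode a glyph the font lacks
        clone["Encoding"][code] = glyph_name
    return clone

small_caps = reencode(base_font, "ToyFont-SC", {97: "a.sc", 98: "b.sc"})

print(base_font["Encoding"][97], small_caps["Encoding"][97])
</pre>
<p>Both fonts share one set of glyph data; only the 256-entry window onto that data differs, so character code 97 selects a different glyph depending on which font name is in use.</p>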
<p>If you want a simple example to explore the ideas behind Encoding Vectors you can download <a href="http://www.readytext.co.uk/files/encoding.zip">this code example (with PDF)</a> to see the results of re-encoding Times-Roman.</p>
<h1>Hooking this up to TeX and DVIPS</h1>
<p>Having discussed fonts, encoding and glyphs at some length we now move to the next task: how do we use these ideas with TeX and DVIPS? Let's start with TeX. Here, I'm referring to the <em>traditional</em> TeX workflows that use TeX Font Metric (TFM) files. So what is a TFM? To do its typesetting work TeX's algorithms need only some basic information about the font you want to use: it needs the <em>metrics</em>. TeX does not care about the actual glyphs in your font or what they look like, it needs a set of data that describes how big each glyph is: to TeX your glyphs are boxes with a certain width, depth and height. That's not the whole story, of course, because TeX also needs some additional data called <code>fontdimen</code>s which are a set of additional parameters that describe some overall characteristics of the font. For pure text fonts there are 7 of these <code>fontdimen</code>s, for math fonts there are 13 or 22 depending on the type/role of the math font. These <code>fontdimen</code>s are, of course, built into the TFM file.</p>
<h2>Looking inside TFMs</h2>
<p>TFM files are a highly compact binary file format and quite unsuitable for viewing or editing. However, you can convert a TFM file to a readable/editable text representation using a program called <code>tftopl</code>, which is part of most TeX distributions. The editable text version of a TFM is referred to as a <em>property list</em> file. Near the start of the property list for a text font (e.g., <code>cmr10.tfm</code>) you should see the 7 <code>fontdimen</code>s displayed like this:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
(FONTDIMEN
   (SLANT R 0.0)
   (SPACE R 0.333334)
   (STRETCH R 0.166667)
   (SHRINK R 0.111112)
   (XHEIGHT R 0.430555)
   (QUAD R 1.000003)
   (EXTRASPACE R 0.111112)
   )
</pre>
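<p>Because a property list is plain text, it is easy to pull these values into a script. The following Python sketch reads the <code>cmr10</code> values shown above and converts the dimensions to points. It assumes the <code>(DESIGNSIZE R 10.0)</code> line that also appears in the <code>cmr10</code> property list, and the helper name <code>read_reals</code> is my own; it is a quick regular-expression reader, not a full property list parser.</p>
<pre class="brush: python; light: false; title: ; toolbar: true; notranslate">
import re

# The cmr10 FONTDIMEN block from above, plus the DESIGNSIZE line that
# tftopl prints in the same property list.
PL_TEXT = """
(DESIGNSIZE R 10.0)
(FONTDIMEN
   (SLANT R 0.0)
   (SPACE R 0.333334)
   (STRETCH R 0.166667)
   (SHRINK R 0.111112)
   (XHEIGHT R 0.430555)
   (QUAD R 1.000003)
   (EXTRASPACE R 0.111112)
   )
"""

def read_reals(text):
    """Collect every '(NAME R value)' property into a dictionary."""
    return {name: float(value)
            for name, value in re.findall(r"\((\w+) R ([-\d.]+)\)", text)}

props = read_reals(PL_TEXT)
design_size = props.pop("DESIGNSIZE")        # in points

# Dimensions are multiples of the design size; SLANT is a pure ratio.
in_points = {name: value * design_size
             for name, value in props.items() if name != "SLANT"}

for name, value in in_points.items():
    print(name, round(value, 5))
</pre>
<p>So at its design size the normal interword space in <code>cmr10</code> is roughly a third of the 10pt quad, with the stretch and shrink values telling TeX how far that space may grow or contract.</p>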
<p>If you run <code>tftopl</code> on <code>cmex10.tfm</code> (math font with extensible symbols) you see 13 <code>fontdimen</code>s displayed like this:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
(FONTDIMEN
   (SLANT R 0.0)
   (SPACE R 0.0)
   (STRETCH R 0.0)
   (SHRINK R 0.0)
   (XHEIGHT R 0.430556)
   (QUAD R 1.0)
   (EXTRASPACE R 0.0)
   (DEFAULTRULETHICKNESS R 0.04)
   (BIGOPSPACING1 R 0.111111)
   (BIGOPSPACING2 R 0.166667)
   (BIGOPSPACING3 R 0.2)
   (BIGOPSPACING4 R 0.6)
   (BIGOPSPACING5 R 0.1)
   )
</pre>
<p>If you run <code>tftopl</code> on <code>cmsy10.tfm</code> (math symbol font) you see 22 <code>fontdimen</code>s displayed like this: </p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
(FONTDIMEN
   (SLANT R 0.25)
   (SPACE R 0.0)
   (STRETCH R 0.0)
   (SHRINK R 0.0)
   (XHEIGHT R 0.430555)
   (QUAD R 1.000003)
   (EXTRASPACE R 0.0)
   (NUM1 R 0.676508)
   (NUM2 R 0.393732)
   (NUM3 R 0.443731)
   (DENOM1 R 0.685951)
   (DENOM2 R 0.344841)
   (SUP1 R 0.412892)
   (SUP2 R 0.362892)
   (SUP3 R 0.288889)
   (SUB1 R 0.15)
   (SUB2 R 0.247217)
   (SUPDROP R 0.386108)
   (SUBDROP R 0.05)
   (DELIM1 R 2.389999)
   (DELIM2 R 1.01)
   (AXISHEIGHT R 0.25)
   )
</pre>
<p>The role of these <code>fontdimen</code>s within math fonts is extremely complex. If you want to read about this in depth you can find a list of excellent articles <a href="http://www.readytext.co.uk/?p=2319">in this post</a>. In addition to the glyph metrics (height, width, depth) and <code>fontdimen</code>s TFM files contain constructs for kerning and ligatures. There's a lot of information already available on the inner details of TFMs so there's no point repeating it here.</p>
<p>The bulk of a TFM file is concerned with providing the height, width and depth of the characters encoded into the TFM. And that brings up a very important point: individual TFM files are tied to a particular encoding. For example, right at the start of the property list for <code>cmr10.tfm</code> you should see something like this:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
(FAMILY CMR)
(FACE O 352)
(CODINGSCHEME TEX TEXT)
(DESIGNSIZE R 10.0)
(COMMENT DESIGNSIZE IS IN POINTS)
(COMMENT OTHER SIZES ARE MULTIPLES OF DESIGNSIZE)
(CHECKSUM O 11374260171)
</pre>
<p>It contains the line <code>(CODINGSCHEME TEX TEXT)</code> telling you that the TFM is encoded using the TeX Text encoding scheme. Let's examine this. Referring back to our discussion of PostScript Encoding Vectors, let's take a look at the first few lines of the Encoding Vector sitting inside the Type 1 font file for cmr10 &ndash; i.e., <code>cmr10.pfb</code>. The first 11 positions (0&ndash;10) are encoded like this:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
dup 0 /Gamma put
dup 1 /Delta put
dup 2 /Theta put
dup 3 /Lambda put
dup 4 /Xi put
dup 5 /Pi put
dup 6 /Sigma put
dup 7 /Upsilon put
dup 8 /Phi put
dup 9 /Psi put
dup 10 /Omega put
</pre>
<p>And this is the key point: the character encoding in your TFM file has to match the encoding of your PostScript font (or a re-encoded version of it). If we look at the metric data for the corresponding characters encoded in the <code>cmr10.tfm</code> file we find: </p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
(CHARACTER O 0
   (CHARWD R 0.625002)
   (CHARHT R 0.683332)
   )
(CHARACTER O 1
   (CHARWD R 0.833336)
   (CHARHT R 0.683332)
   )
(CHARACTER O 2
   (CHARWD R 0.777781)
   (CHARHT R 0.683332)
   )
(CHARACTER O 3
   (CHARWD R 0.694446)
   (CHARHT R 0.683332)
   )
(CHARACTER O 4
   (CHARWD R 0.666669)
   (CHARHT R 0.683332)
   )
(CHARACTER O 5
   (CHARWD R 0.750002)
   (CHARHT R 0.683332)
   )
(CHARACTER O 6
   (CHARWD R 0.722224)
   (CHARHT R 0.683332)
   )
(CHARACTER O 7
   (CHARWD R 0.777781)
   (CHARHT R 0.683332)
   )
(CHARACTER O 10
   (CHARWD R 0.722224)
   (CHARHT R 0.683332)
   )
(CHARACTER O 11
   (CHARWD R 0.777781)
   (CHARHT R 0.683332)
   )
(CHARACTER O 12
   (CHARWD R 0.722224)
   (CHARHT R 0.683332)
   )
</pre>
<p>Statements such as <code>CHARACTER O 0</code> describe the metrics (just width and height in these examples) for the character with octal value 0; <code>CHARACTER O 12</code> describes the character with octal value 12 (i.e., 10 in decimal). Note that the values are relative to the <code>(DESIGNSIZE R 10.0)</code> which means, for example, that <code>CHARACTER O 12</code> has a width of 0.722224 &times; 10 = 7.22224 points &ndash; because the DESIGNSIZE is 10 points. So, it is clearly vital that the encoding of your TFM matches the encoding of your PostScript font; otherwise you'll get the wrong glyphs on output and the wrong widths, heights and depths used by TeX's typesetting calculations!</p>
<h2>Using FreeType to generate raw metric data</h2>
<p><a href="http://www.freetype.org/">FreeType</a> is a superb C library which provides a rich set of functions to access many internals of a font, together, of course, with functions to rasterize fonts for screen display. Just to note, FreeType does not provide an OpenType shaping engine; for that you'll need to use the equally superb <a href="http://savannah.nongnu.org/files/?group=m17n">libotf C library</a> (which also uses FreeType). However, I digress. Using FreeType you can create some extremely useful and simple utilities to extract a wide range of data from font files to generate raw data for creating the TFM files and Encoding Vectors you'll need to hook up a Type 42 font to DVIPS and TeX. Let's look at this in a little detail. The task at hand is: given an OpenType (TrueType) font, how do you obtain details of the glyphs it contains: the names and metrics (width, height, depth)?</p>
<h3>FreeType's view of glyph metrics</h3>
<p>The FreeType API provides access to the glyph metrics shown in the <a href="http://www.freetype.org/freetype2/docs/glyphs/glyphs-3.html">FreeType Glyph Conventions</a> documentation. You should read this together with Adobe's <a href="http://partners.adobe.com/public/developer/en/font/T1_SPEC.PDF">Type 1 Font Format Specification</a> (Chapter 3) to make sure you understand what is meant by a glyph's width.</p>
<h2>Simple examples of using the FreeType API</h2>
<p>Here are some ultra-basic examples, with only minimal error checking, to show how you might use FreeType. You start by initializing the FreeType library (<code>FT_Init_FreeType(...)</code>), then create a new face object (<code>FT_New_Face(...)</code>) and use this to access the font and glyph details you need. The first example writes metric data to STDOUT; the second example processes the font data to create an Encoding Vector and a skeleton <em>property list</em> file for creating a TFM. Note that this is a "bare bones" TFM and does not generate any ligatures or kerning data. To generate a binary TFM from a property list file you need another utility called <code>pltotf</code> which is also part of most TeX distributions.</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
#include &lt;stdio.h&gt;
#include &lt;ft2build.h&gt;
#include FT_FREETYPE_H
#include FT_GLYPH_H
#include FT_OUTLINE_H
#include FT_BBOX_H

int main (int  ac,  char** av)
{
    FT_Library font_library;
    FT_Face    font_face;
    FT_BBox    bbox;

    FT_UInt glyph_index;
    FT_UInt glyph_count;
    char char_name[256];
    const char* fontfilepath = &quot;PUT THE PATH TO YOUR FONT HERE&quot;;

    if ( FT_Init_FreeType( &amp;font_library ) )
    {
        // Failed to init library
        return 1;
    }
    if ( FT_New_Face( font_library, fontfilepath, 0, &amp;font_face ) )
    {
        // Managed to open library but failed to open the font
        FT_Done_FreeType(font_library);
        return 1;
    }

    // library and font opened OK
    // find out the number of glyphs and process each glyph
    glyph_count = (FT_UInt)font_face-&gt;num_glyphs;
    for ( glyph_index = 0; glyph_index &lt; glyph_count; glyph_index++ )
    {
        // NOTE: FT_Get_Glyph_Name can FAIL for some TrueType-flavour
        // OpenType fonts so you *really* do need to check its return value!!
        if ( FT_Get_Glyph_Name(font_face, glyph_index, char_name, sizeof char_name) )
        {
            fprintf(stderr, &quot;no glyph name for glyph %u\n&quot;, glyph_index);
            continue;
        }
        // load the glyph with no scaling etc to get raw data
        if ( FT_Load_Glyph(font_face, glyph_index, FT_LOAD_NO_SCALE) )
        {
            fprintf(stderr, &quot;cannot load glyph %u\n&quot;, glyph_index);
            continue;
        }
        // get the bounding box of the raw glyph data
        FT_Outline_Get_BBox(&amp;(font_face-&gt;glyph-&gt;outline), &amp;bbox);
        printf(&quot;/%s %u def &quot;, char_name, glyph_index);
        printf(&quot;width=%ld &quot;, font_face-&gt;glyph-&gt;metrics.width);
        printf(&quot;height=%ld &quot;, font_face-&gt;glyph-&gt;metrics.height);
        printf(&quot;horiAdvance=%ld &quot;, font_face-&gt;glyph-&gt;metrics.horiAdvance);
        printf(&quot;horiBearingX=%ld &quot;, font_face-&gt;glyph-&gt;metrics.horiBearingX);
        printf(&quot;horiBearingY=%ld &quot;, font_face-&gt;glyph-&gt;metrics.horiBearingY);
        printf(&quot;vertAdvance=%ld &quot;,  font_face-&gt;glyph-&gt;metrics.vertAdvance);
        printf(&quot;vertBearingX=%ld &quot;, font_face-&gt;glyph-&gt;metrics.vertBearingX);
        printf(&quot;vertBearingY=%ld &quot;, font_face-&gt;glyph-&gt;metrics.vertBearingY);
        printf(&quot;xMax=%ld &quot;, bbox.xMax);
        printf(&quot;yMax=%ld &quot;, bbox.yMax);
        printf(&quot;yMin=%ld &quot;, bbox.yMin);
        printf(&quot;xMin=%ld \n&quot;, bbox.xMin);
    }

    FT_Done_Face(font_face);
    FT_Done_FreeType(font_library);
    return 0;
}
</pre>
<h3>Creating an Encoding Vector and property list file</h3>
<p>The following simple-minded function shows how you might use FreeType to generate an Encoding Vector and property list file. Reflecting the unusual glyphs we're using, the output files are called <code>weirdo.pl</code> and <code>weirdo.enc</code>.</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
void makeweirdo(FT_Face font_face, char *name, int len)
{
FILE * vec;
FILE * plist;
int i;
FT_BBox bbox;
int k=32; // only encode positions 32--255

char *fname=&quot;your_path_here\\weirdo.enc&quot;;
char *pname=&quot;your_path_here\\weirdo.pl&quot;;
char * header= &quot;(COMMENT Created by Graham Douglas)\r\n\
(FAMILY WEIRDO)\r\n\
(CODINGSCHEME WEIRDO)\r\n\
(DESIGNSIZE R 10.0)\r\n\
(FONTDIMEN\r\n\
   (SLANT R 0.0)\r\n\
   (SPACE R 0.333334)\r\n\
   (STRETCH R 0.166667)\r\n\
   (SHRINK R 0.111112)\r\n\
   (XHEIGHT R 0.430555)\r\n\
   (QUAD R 1.000003)\r\n\
   (EXTRASPACE R 0.111112)\r\n\
   )\r\n&quot;;
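	vec = fopen(fname, &quot;wb&quot;);
	plist = fopen(pname, &quot;wb&quot;);
	if (vec == NULL || plist == NULL)
	{
		// could not create one (or both) of the output files
		if (vec) fclose(vec);
		if (plist) fclose(plist);
		return;
	}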

	fprintf(vec,&quot;%s&quot;, &quot;/veccy 256 array 0 1 255 {1 index exch /.notdef put} for\r\n&quot;);
	fprintf(plist,&quot;%s&quot;, header);

	// Here we are looping over GentiumPlus glyph IDs whose value is 5000 to 5223
	for (i=5000; i&lt;5224; i++)
	{
		FT_Get_Glyph_Name(font_face, i, name, len);
		FT_Load_Glyph(font_face,  i,  FT_LOAD_NO_SCALE);
		FT_Outline_Get_BBox(&amp;(font_face-&gt;glyph-&gt;outline), &amp;bbox);
		fprintf(plist,&quot;(CHARACTER O %o (COMMENT Glyph name is %s)\r\n&quot;, k, name);
		fprintf(plist,&quot;    (CHARWD R %.5f)\r\n&quot;, font_face-&gt;glyph-&gt;metrics.horiAdvance/2048.0);
		fprintf(plist,&quot;    (CHARHT R %.5f)\r\n&quot;, bbox.yMax/2048.0);
                // FreeType's depth values are negative, TeX Font Metrics are not
                // If bbox.yMin not negative then we don't output anything and TeX assumes zero depth
		if(bbox.yMin &lt; 0)
		{
			fprintf(plist,&quot;    (CHARDP R %.5f)\r\n&quot;, -1*bbox.yMin/2048.0);
		}
		fprintf(plist, &quot;%s&quot;,&quot;    )\r\n&quot;); 

		// k is an int, so print it with %d (not %ld)
		fprintf(vec, &quot;dup %d /%s put\r\n&quot;, k, name);
		k++;
	}
	fprintf(vec, &quot;%s&quot;, &quot; def\r\n&quot;);
	fclose(vec);
	fclose(plist);
}
</pre>
<p>Here is a small extract from <code>weirdo.vec</code> and <code>weirdo.pl</code> &ndash; if you wish to explore the output you can download them (and <code>weirdo.tfm</code>) <a href="http://www.readytext.co.uk/files/weirdo.zip">in this zip file</a>. (In the data below I followed the neat idea from LCDF Typetools and put the glyph name in as a comment).</p>
<h4>weirdo.vec</h4>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
/veccy
256 array 0 1 255 {1 index exch /.notdef put} for
dup 32 /uni1D9C.Dep50 put
dup 33 /uni1D9C.Dep41 put
dup 34 /uni023C.Dep51 put
dup 35 /uni023C.Dep50 put
....
dup 251 /uni024C.Dep51 put
dup 252 /uni024C.Dep50 put
dup 253 /uni2C64.Dep51 put
dup 254 /uni2C64.Dep50 put
dup 255 /uni1DB3.Dep51 put
def
</pre>
<h4>weirdo.pl</h4>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
(COMMENT Created by Graham Douglas)
(FAMILY WEIRDO)
(CODINGSCHEME WEIRDO)
(DESIGNSIZE R 10.0)
(FONTDIMEN
   (SLANT R 0.0)
   (SPACE R 0.333334)
   (STRETCH R 0.166667)
   (SHRINK R 0.111112)
   (XHEIGHT R 0.430555)
   (QUAD R 1.000003)
   (EXTRASPACE R 0.111112)
   )
(CHARACTER O 40 (COMMENT Glyph name is uni1D9C.Dep50)
    (CHARWD R 0.30566)
    (CHARHT R 0.59863)
    )
(CHARACTER O 41 (COMMENT Glyph name is uni1D9C.Dep41)
    (CHARWD R 0.30566)
    (CHARHT R 0.59863)
    )
(CHARACTER O 42 (COMMENT Glyph name is uni023C.Dep51)
    (CHARWD R 0.43701)
    (CHARHT R 0.55811)
    (CHARDP R 0.09033)
    )
(CHARACTER O 43 (COMMENT Glyph name is uni023C.Dep50)
    (CHARWD R 0.43701)
    (CHARHT R 0.55811)
    (CHARDP R 0.09033)
    )
(CHARACTER O 44 (COMMENT Glyph name is uni023C.Dep41)
    (CHARWD R 0.43701)
    (CHARHT R 0.55811)
    (CHARDP R 0.09033)
    )
(CHARACTER O 45 (COMMENT Glyph name is uni1D9D.Dep51)
    (CHARWD R 0.30566)
    (CHARHT R 0.59863)
    )
(CHARACTER O 46 (COMMENT Glyph name is uni1D9D.Dep50)
    (CHARWD R 0.30566)
    (CHARHT R 0.59863)
    )

...
...
...

(CHARACTER O 375 (COMMENT Glyph name is uni2C64.Dep51)
    (CHARWD R 0.56104)
    (CHARHT R 0.64453)
    (CHARDP R 0.20020)
    )
(CHARACTER O 376 (COMMENT Glyph name is uni2C64.Dep50)
    (CHARWD R 0.56104)
    (CHARHT R 0.64453)
    (CHARDP R 0.20020)
    )
(CHARACTER O 377 (COMMENT Glyph name is uni1DB3.Dep51)
    (CHARWD R 0.27002)
    (CHARHT R 0.59863)
    )
</pre>
<p>To generate the binary TFM file <code>weirdo.tfm</code> from <code>weirdo.pl</code> run <code>pltotf</code>:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
pltotf weirdo.pl weirdo.tfm
I had to round some heights by 0.0002451 units.
</pre>
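<p>I got a warning from <code>pltotf</code>, but it is not serious: a TFM file can store only 16 distinct height values (and 16 distinct depths), so when a font has more than that <code>pltotf</code> rounds nearby values together and reports the largest adjustment it had to make. To use the TFM you'll need to put it in a suitable location within your TEXMF tree.</p>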
<h1>Hooking up to DVIPS</h1>
<p>We've covered a huge range of topics so it is time to recap. So far, we've generated an Encoding Vector (<code>weirdo.vec</code>) based on the names of glyphs (in the Gentium-Plus font) whose glyph IDs span the range 5000&ndash;5223. Within our Encoding Vector we mapped those glyph names to the character codes 32&ndash;255. We have also created a property list file, based on the same encoding, which simply contains the width, height and depth of the Gentium-Plus glyphs in the range 5000&ndash;5223. Our next task is to pull together the following items and convince DVIPS to use them.</p>
<ol>
<li><strong>Re-encode <code>GentiumPlus.t42</code></strong>: We need to create a re-encoded font that uses our new Encoding Vector (<code>weirdo.vec</code>).</li>
<li><strong>Update <code>config.ps</code></strong>: We need to tell DVIPS how to use our new font by creating a <code>.map</code> file and making sure DVIPS can find that map file.</li>
<li><strong>Command-line switches</strong>: We'll need to use some command-line switches to give DVIPS the info it needs to do its job.</li>
<li><strong>Our Type 42 font</strong>: <code>GentiumPlus.t42</code>: We must tell DVIPS to embed that font into its PostScript output.</li>
</ol>
<p>Our goal is to tell TeX to load a font (TFM) called <code>weirdo</code> and for DVIPS to know how to use and find the <code>weirdo</code> font data to generate the correct PostScript code to render our glyphs. We'll use our strange new <code>weirdo</code> font like this (in plain TeX):</p>
<p><code>\font\weird=weirdo {\weird HELLO}</code></p>
<p>Note that the displayed output will <strong>not</strong> be the English word "HELLO" because we've chosen some rather strange glyphs from Gentium-Plus. The key observation is that the input character codes are the ASCII values of the string <code>HELLO</code>; i.e. (in base 10):</p>
<p><code><br />
H = 72<br />
E = 69<br />
L = 76<br />
L = 76<br />
O = 79<br />
</code>	</p>
<p>and our <code>weirdo.enc</code> Encoding Vector maps these character codes to the following glyphs:</p>
<p><code><br />
72 = uni1D94.Dep51<br />
69 = uni1D9F.Dep51<br />
76 = uni0511.Dep50<br />
79 = uni0510.Dep51<br />
</code></p>
<p>So, we can expect some strange output in the final PostScript or PDF file!</p>
<h2>How do we do the re-encoding?</h2>
<p>The basic idea is that we tell DVIPS to embed the <code>GentiumPlus.t42</code> PostScript Type 42 font data into its PostScript output stream. We will then write some short PostScript headers that will do the re-encoding to generate our newly re-encoded font, which we're calling <code>weirdo</code>. By using the DVIPS <code>-h</code> command-line switch we can get DVIPS to embed <code>GentiumPlus.t42</code> and the header PostScript file to perform the re-encoding. For example:</p>
<p><code>DVIPS -h GentiumPlus.t42 -h weirdo.ps sometexfile.dvi</code></p>
<p>The actual re-encoding, and "creation", of our <code>weirdo</code> font will be taken care of by the file <code>weirdo.ps</code>, which will also need to contain the <code>weirdo.enc</code> data. If you wish, you can download <code><a href="http://www.readytext.co.uk/files/weirdo.ps">weirdo.ps</a></code>. Here is the tiny fragment of PostScript required within <code>weirdo.ps</code> to "create" the weirdo font by re-encoding our Type 42 font whose PostScript name is GentiumPlus.</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
/otfreencode
{
	findfont dup length dict copy
	dup 3 2 roll /Encoding exch put
	definefont
	pop
} bind def

/weirdo veccy /GentiumPlus otfreencode
</pre>
<p>Note, of course, you could create a header PostScript file to generate multiple new fonts each with their own unique Encoding Vectors containing a range of glyphs from the Type 42 font.</p>
<h2>Telling DVIPS how to use our new font</h2>
<p>So far we've built the TFM file for TeX so now we need to tell DVIPS how to use it &ndash; so that it can process the <code>weirdo</code> font name as it parses the DVI file. DVIPS uses <code>.map</code> files to associate TFM file names with PostScript font names, together with the actions DVIPS needs to take in order to process the font files and get the right PostScript font data into its output. These actions include processing/parsing Type 1 font files (<code>.pfb</code>, <code>.pfa</code>) and re-encoding Type 1 fonts. For our <code>weirdo</code> font the <code>.map</code> file is very simple: all we need to do is create a file called <code>weirdo.map</code> with a single line:</p>
<p><code>weirdo weirdo</code></p>
<p>This super-simple <code>.map</code> file says that the TeX font name (TFM file) <code>weirdo</code> is mapped to a PostScript font called <code>weirdo</code> (as defined by the code in <code>weirdo.ps</code>). It also tells DVIPS that no other actions are needed because we're not doing the re-encoding here, nor are we asking DVIPS to process a Type 1 font file (<code>.pfb</code>) associated with <code>weirdo</code> &ndash; because there isn't one! After you have created <code>weirdo.map</code> you'll need to edit the DVIPS configuration file <code>config.ps</code> to tell DVIPS to use <code>weirdo.map</code>. Again, this is easy and all you need to do is add the following instruction to <code>config.ps</code>:</p>
<p><code>p +weirdo.map</code></p>
<h1>Does it work?</h1>
<p>Well, I'd have wasted many hours if it didn't <img src='http://www.readytext.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . I used the following simple plain TeX example (test.tex) which I processed using my personal build of TeX for Windows (which does not use Kpathsea).</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
\hsize=300pt
\vsize=300pt
\font\smallweird=weirdo at 12pt
Dear \TeX\ I would like to say HELLO in weirdo so {\smallweird HELLO}. I would also like to see
a lot of strange glyphs so I'll input a text file containing  some of them: {\smallweird \input weirdchars }.
\bye
</pre>
<p><a href="http://readytext.co.uk/files/weirdchars.tex">Download weirdchars.tex</a></p>
<p>The resulting DVI file was processed to PostScript using a standard build of DVIPS with the following command line:</p>
<p><code>DVIPS -h GentiumPlus.t42 -h weirdo.ps  test.dvi</code></p>
<p>The resulting PostScript file is large because the <code>GentiumPlus.t42</code> file is over 2MB. However, the PDF file produced by Acrobat Distiller was about 35KB because the Type 42 font (<code>GentiumPlus.t42</code>) was subsetted.</p>
<p><a href="http://readytext.co.uk/files/weirdotest.pdf">Download PDF</a></p>
<p><iframe src="https://docs.google.com/gview?url=http://readytext.co.uk/files/weirdotest.pdf&#038;embedded=true" style="width:100%; height:300px;" frameborder="0"></iframe></p>
<h1>Concluding thoughts</h1>
<p>"Alison, I'm ready to do the gardening. What?, it's too late. That's a shame." <img src='http://www.readytext.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2693</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Compiling LCDF Typetools under Windows using MinGW</title>
		<link>http://www.readytext.co.uk/?p=2680&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=compiling-lcdf-typetools-under-windows-using-mingw</link>
		<comments>http://www.readytext.co.uk/?p=2680#comments</comments>
		<pubDate>Thu, 14 Mar 2013 01:59:32 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[C programming (miscellaneous)]]></category>
		<category><![CDATA[Post about about fonts, glyphs and characters]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2680</guid>
		<description><![CDATA[Introduction
This is just a short post to share a workaround to a problem I ran into when building Eddie Kohler's superb LCDF Typetools under Windows using MinGW. After running ./configure to create the make files I hit a problem during compilation with lots of error messages referring to undefined reference to `ntohl@4'

../typetools/libefont/otf.cc:863: undefined reference to [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>This is just a short post to share a workaround to a problem I ran into when building Eddie Kohler's superb <a href="http://www.lcdf.org/type/">LCDF Typetools</a> under Windows using MinGW. After running <code>./configure</code> to create the make files I hit a problem during compilation with lots of error messages referring to <code>undefined reference to `ntohl@4'</code></p>
<p><code><br />
../typetools/libefont/otf.cc:863: undefined reference to `ntohl@4'<br />
../typetools/libefont/otf.cc:861: undefined reference to `ntohs@4'<br />
../typetools/libefont/otf.cc:861: undefined reference to `ntohs@4'<br />
</code></p>
<h1>One solution</h1>
<p>The cause of the error is failure to link to the library <code>libwsock32.a</code> (contained in the <code>c:\MinGW\lib\</code> directory on my PC). The following workaround solves the problem but I'm sure there are better ways of doing it. Several tools within the Typetools collection need <code>libwsock32.a</code> in order to link successfully. They are:</p>
<ul>
<li>otfinfo</li>
<li>otftotfm</li>
<li>cfftot1</li>
</ul>
<p>To build these programs you need to make a small edit to the generated makefiles.</p>
<ol>
<li>Create a directory called (say) <code>libs</code> within the Typetools directory tree.</li>
<li>Copy <code>libwsock32.a</code> into that directory.</li>
<li>For each application listed above, open the <code>makefile</code> in the appropriate application directory and look for a line starting with <code>XXXXX_LDADD</code>, where XXXXX is <code><strong>otfinfo</strong></code>, <code><strong>otftotfm</strong></code> or <code><strong>cfftot1</strong></code>.</li>
<li>Edit that line to include <code>libwsock32.a</code></li>
<li>Example: <code>cfftot1_LDADD = ../libefont/libefont.a ../libs/libwsock32.a ../liblcdf/liblcdf.a</code></li>
</ol>
<p>You should now be able to run <code>make</code> and achieve a successful compilation. It worked for me; I hope it works for you.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2680</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Technique to make Web2C/tangle put comments in the C code generated for TeX</title>
		<link>http://www.readytext.co.uk/?p=2619&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=technique-to-make-web2ctangle-put-comments-in-the-c-code-generated-for-tex</link>
		<comments>http://www.readytext.co.uk/?p=2619#comments</comments>
		<pubDate>Wed, 13 Feb 2013 11:20:40 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[Web2C, tangle, weave, WEB]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2619</guid>
		<description><![CDATA[Introduction
Over the last couple of evenings I've been looking at the C code for TeX generated by the tangle and Web2C conversion process. By default, the Web2C conversion process generates C source code which is almost completely devoid of comments and symbol strings are converted to numbers (etc), making the C source nearly impossible to [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>Over the last couple of evenings I've been looking at the C code for TeX generated by the tangle and Web2C conversion process. By default, the Web2C conversion process generates C source code which is almost completely devoid of comments and symbol strings are converted to numbers (etc), making the C source nearly impossible to read. However, by making a small change to the flex-generated source code (<code>web2c-lexer.c</code>) together with some careful use of regular expressions on the WEB sources (and/or some manual editing) you can get a lot of comments put into the C source. Note: I've not yet explored whether it is possible to use the changefile method to achieve the same (or similar) results. Here's an outline of my experimental technique.</p>
<h1>Outline of the technique</h1>
<p>Naturally, via the literate programming methodology, the WEB source files for TeX contain a full description of the TeX program. The Pascal code in the WEB source files is full of comments or short descriptions (enclosed in braces <code>{....}</code>) which, if preserved in the Web2C-generated C code, would make it much more readable. However, those Pascal comments are stripped out by tangle (but are used by weave); consequently, the Pascal generated by tangle, and fed into Web2C.exe, does not contain any <em>useful</em> comments. I say "useful" because the Pascal does contain some line-number comments but these are not really that helpful and they are removed by the comment-handling code in <code>web2c-lexer.c</code>.</p>
<h2>Comments in WEB files</h2>
<p>So, what are we looking to do? In essence, we need a way to convince tangle to output comments into the Pascal code it generates and find a way to ensure that those comments are processed and passed into the C code by Web2C.</p>
<h3>Web2C and Pascal comments</h3>
<p>Caveat: I am not a Pascal programmer and have no desire to become one! However, all you need to know is that within the Pascal generated by tangle the comments are simply enclosed in braces, like this: <code>{This is a comment in Pascal}</code>. These comments are filtered out by <code>web2c-lexer.c</code>. Another caveat: be <em>extremely</em> careful when making any changes whatsoever to the lexer C code (or the .l sources); you can break things very badly (hmmm, wonder how I found that out...). I cannot stress enough the importance of being very, very careful in making any changes to <code>web2c-lexer.l/web2c-parser.y</code> or <code>web2c-lexer.c/web2c-parser.c</code>: these lexical analyser and parser-generator sources are critical to the C-generation process. OK, I think I made the point. The following description probably deserves nomination for an "Ugly Hack Award" and, no doubt, a flex/bison expert (which I'm definitely <em>not</em>) could design an elegant solution to incorporate comment-handling in proper context within the parsing process. OK, enough self-flagellation, let's move on. </p>
<p>If you look in <code>web2c-lexer.l</code> the code which handles comments is simply:</p>
<p><code>"{"		{ while (webinput() != '}'); }</code></p>
<p>After running flex on <code>web2c-lexer.l</code> this becomes (in <code>web2c-lexer.c</code>) </p>
<p><code><br />
case 2:<br />
YY_RULE_SETUP<br />
#line 53 "web2c-lexer.l"<br />
{ while (webinput() != '}'); }<br />
	YY_BREAK<br />
</code></p>
<p>Basically, the lexical analyser is stripping out things like <code>{This is a comment from Pascal}</code>. To get comments into the C code generated for TeX you'll need to modify the lexer code to stop it skipping comments and process them to generate C comments such as <code>/* This is a comment from Pascal */</code>. There are a few points to consider here: firstly, you'll need to experiment to see exactly where the comments end up in your C code. Due to the "Ugly Hack" approach, we're not paying any real attention to the "context" of where we are in the parsing process when outputting our comments; again, a proper flex/bison implementation is required. For example, by the time your comment is seen by the lexer a newline (<code>\n</code>) may already have been output so your comments may end up on a new line &ndash; easily fixed by some manual tidy-up of the C code (or via the use of running regular-expression tools on the Web2C-generated C source code). So, just to note that you'll need to do some trial and error to see what happens.</p>
<h3>Getting comments into WEB and surviving tangle</h3>
<p>As noted, tangle strips out comments in the WEB sources and they don't even reach the Pascal code it outputs. So, can we coerce tangle to preserve comments in WEB sources and put them in the Pascal for Web2C.exe to process/output? A quick reading of <a href="http://www.readytext.co.uk/files/webman.pdf">The WEB User Manual</a> implies that there are two ways to get text output to the Pascal source code produced by tangle:</p>
<ul>
<li>use "control-text" such as <code>@=your comment text here@></code> which causes the text to be output verbatim into the Pascal code, or</li>
<li>use "meta-comments" such as <code>@{your comment text here @}</code> which, in the Pascal, results in a standard comment such as <code>{your comment text here}</code>.</li>
</ul>
<h3><a href="http://en.wikipedia.org/wiki/Robby_the_Robot">Robby the Robot</a> says Danger!</h3>
<p>Sorry for the reference to Robby the Robot, indulge me..... Seriously, though, if you make edits to the WEB source to put in "control-text" or "meta-comments" you can <em>very</em> easily foul-up tangle's parser and break tangle's conversion process pretty badly. As yet, I'm not able to give precise rules on where it is safe to add "control-text" or "meta-comments" (I'm still experimenting) so I suggest you read <a href="http://www.readytext.co.uk/files/webman.pdf">The WEB User Manual</a> to understand a little more about WEB syntax before attempting it.</p>
<blockquote><p><strong>Mind the pool file</strong>: Be careful inserting/using text with double quotes <code>"..."</code> because it can trigger tangle's parser to output that text in the <code>tex.pool</code> file which you don't want to do. I used single quotes <code>'...'</code> and that seems to be safe(er). I can't recall exactly what I did that caused this to happen, but just be sure to check that the <code>.pool</code> file does not become polluted with any of the text you insert into the WEB sources.</p></blockquote>
<h2>Getting to the point</h2>
<p>So far we've seen that to get comments into the C source code we need to:</p>
<ol>
<li>modify the behaviour of <code>web2c-lexer.c</code> and tell it (selectively) not to skip every Pascal comment construct <code>{...}</code> (see use of <code>'...'</code>, below).</li>
<li>coerce tangle to preserve comments and output them into the Pascal source so that Web2C.exe sees them and the code in <code>web2c-lexer.c</code> can process them. </li>
</ol>
<h3>An example</h3>
<p>Within the TeX WEB source code is a function which initializes TeX's "primitives". Here's a small extract of the raw WEB source code</p>
<p><code><br />
@ The symbolic names for glue parameters are put into \TeX's hash table<br />
by using the routine called |primitive|, defined below. Let us enter them<br />
now, so that we don't have to list all those parameter names anywhere else.</p>
<p>@<Put each of \TeX's primitives into the hash table@>=<br />
primitive("lineskip",assign_glue,glue_base+line_skip_code);@/<br />
@!@:line_skip_}{\.{\\lineskip} primitive@><br />
primitive("baselineskip",assign_glue,glue_base+baseline_skip_code);@/<br />
...<br />
</code></p>
<p>When this is translated to C the result looks something like this:</p>
<p><code><br />
  ...<br />
  primitive ( 381 , 75 , 24527 ) ;<br />
  primitive ( 382 , 75 , 24528 ) ;<br />
  ...<br />
</code></p>
<p>Not a string or comment in sight. tangle has also converted everything into integers: <code>"lineskip"</code> becomes <code>381</code> ... single-stepping through this C code with a debugger is not my idea of fun. So, what to do?</p>
<p>If you look at the form of code like</p>
<p><code>primitive("lineskip",assign_glue,glue_base+line_skip_code);</code></p>
<p>it is very amenable to processing with regular expressions. What you can do, for example, is pre-process the WEB source with your favourite regex tool to add "meta-comments" that will reach the Pascal and (with your modified lexer) make it into the C code. For example (should all be on one line):</p>
<p><code>primitive("lineskip",assign_glue,glue_base+line_skip_code); @{'lineskip,assign_glue,glue_base+line_skip_code'@};@/ </code></p>
<p>Here we added the "meta-comment"</p>
<p><code>@{'lineskip,assign_glue,glue_base+line_skip_code'@}</code></p>
<p>just after the original Pascal code. Note that I have used single quotes <code>'...'</code> to delimit the text simply because I want to be able to detect the comments I introduced when the modified lexer is scanning my comments. To cut a long story short, through this technique you end up with C code that looks like this:</p>
<p><code><br />
  primitive ( 381 , 75 , 24527 ) ; /*lineskip,assign_glue,glue_base+line_skip_code*/<br />
  primitive ( 382 , 75 , 24528 ) ; /*baselineskip,assign_glue,glue_base+baseline_skip_code*/<br />
 </code></p>
<p>Maybe not beautiful, but at least you now know what (some) of those tangle-generated numbers represent.</p>
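<p>A minimal sketch of that pre-processing step, written in plain C rather than with a regex tool. The function name <code>annotate_primitive</code> is hypothetical, and this is an illustration of the idea rather than the tool actually used above. It assumes each <code>primitive(...)</code> call starts at the beginning of a line and closes with <code>);</code>. It deliberately refuses to touch any line whose arguments contain characters (<code>'</code>, <code>@</code>, <code>{</code>, <code>}</code>) that might upset tangle.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy one line of WEB source into `out`, adding a
 * meta-comment after the closing ");" of a primitive(...) call.
 * Returns 1 if the line was annotated, 0 if it was copied unchanged. */
int annotate_primitive(const char *line, char *out, size_t outsz)
{
    static const char prefix[] = "primitive(";
    const char *args, *end;
    char comment[256];
    size_t i, n = 0;
    int written;

    assert(line != NULL && out != NULL && outsz > 0);

    /* Default result: the line copied through unchanged. */
    snprintf(out, outsz, "%s", line);

    if (strncmp(line, prefix, sizeof prefix - 1) != 0)
        return 0;
    args = line + (sizeof prefix - 1);
    end = strstr(args, ");");
    if (end == NULL)
        return 0;

    /* Collect the argument text, dropping double quotes so that the
     * comment cannot end up in the .pool file. */
    for (i = 0; args + i < end; i++) {
        char c = args[i];
        if (c == '"')
            continue;
        if (c == '\'' || c == '@' || c == '{' || c == '}')
            return 0;               /* too risky: leave the line alone */
        if (n + 1 >= sizeof comment)
            return 0;               /* argument text too long */
        comment[n++] = c;
    }
    comment[n] = '\0';

    /* Original call, then the meta-comment, then the rest of the line. */
    written = snprintf(out, outsz, "%.*s @{'%s'@};%s",
                       (int)(end + 2 - line), line, comment, end + 2);
    if (written < 0 || (size_t)written >= outsz) {
        snprintf(out, outsz, "%s", line);
        return 0;
    }
    return 1;
}
```

<p>You would call this once per line while copying <code>tex.web</code> to a working file, and then, as always, check that tangle still runs cleanly and that the <code>.pool</code> file is unchanged.</p>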
<h1>In conclusion</h1>
<p>This technique is not "pretty" but, if used with care, you can get tangle to output a lot of useful comments, either through regular-expressions and pre-processing of the WEB code, or hand-editing the WEB to write summaries of the descriptions of the source code. I must stress that you can't put "meta-comments" just anywhere in the WEB source because you risk breaking tangle's parsing process: you'll need to experiment and proceed carefully with (say) small/minor manual edits to make sure tangle or Web2C don't "choke" on your changes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2619</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Porting and building Web2C.exe for Windows</title>
		<link>http://www.readytext.co.uk/?p=2529&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=porting-and-building-web2c-exe-for-windows</link>
		<comments>http://www.readytext.co.uk/?p=2529#comments</comments>
		<pubDate>Sat, 02 Feb 2013 16:00:57 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[C programming (miscellaneous)]]></category>
		<category><![CDATA[Web2C, tangle, weave, WEB]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2529</guid>
		<description><![CDATA[Introduction
This post is, once again, an aide-mémoire to record a work-in-progress: porting the tools that convert Knuth's original Pascal-based WEB source to C &#8211; to create a native build of Web2C.exe, fixwrites.exe and other tools using Microsoft's Visual Studio (and not using pipes). My apologies if this post is a little unstructured but the whole [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>This post is, once again, an aide-mémoire to record a work-in-progress: porting the tools that convert Knuth's original Pascal-based WEB source to C &ndash; to create a native build of <code>Web2C.exe</code>, <code>fixwrites.exe</code> and other tools using Microsoft's Visual Studio (and not using <a href="http://www.linfo.org/pipe.html">pipes</a>). My apologies if this post is a little unstructured but the whole task is somewhat convoluted, which may be reflected in my writing style for this post! However, I'd like to record it whilst it is fresh in my memory.</p>
<p>Why would anyone want to do this when there are ready-made, reliable, TeX distributions freely available? Good question. Well, for me, it's nothing more than pure curiosity &ndash; and the fact that most British TV programs are now such mind-numbing drivel that I might as well do something productive in the evenings!</p>
<blockquote><p><strong>Join TUG</strong>: Just as an aside, I'm a member of the TeX User Group, TUG, so if you too would like to support TeX why not <a href="http://tug.org/join.html">consider joining</a>?</p></blockquote>
<p>Another reason for writing this post is that I could not find much documentation on <em>how</em> to build <code>Web2C.exe</code> from source code &ndash; apart from these <a href="http://www.readytext.co.uk/files/ctowebstuff.pdf"> notes by Timothy Murphy</a>, detailing the process for a Macintosh-based port. Even though they were written in 1992 they were extremely helpful in filling in some of the details, so a belated thank you to Timothy Murphy &ndash; much of this post draws inspiration from that document. Piecing together the Web2c build process has been something of a "programming jigsaw"  &ndash; there are still gaps in my understanding but, I think, I can see the big picture even if it's still a little hazy in some areas.
</p>
<h1>The Big Picture</h1>
<p>The source files for TeX, and other TeX-related programs and utilities, are written using Professor Donald Knuth's <a href="http://en.wikipedia.org/wiki/Literate_programming">literate programming</a> methodology. In essence, the program code (in Pascal) and documentation of the source code (in TeX) are contained within a single file, with extension <code>.web</code>. For example, Professor Knuth's source code of the latest version of TeX is contained in a file called <code>tex.web</code>. Similarly, within the TeXLive repository (<a href="http://www.readytext.co.uk/?p=365">see a previous post</a>) or on CTAN, you can find the WEB source code for the latest versions of other programs; for example:  </p>
<ul>
<li><strong>bibtex.web</strong>: the source code/documentation of <code>BiBTeX</code>, for formatting and producing reference lists, as widely used within academic journal papers.</li>
<li><strong>mf.web</strong>: the source code/documentation of MetaFont.</li>
<li><strong>patgen.web</strong>: the source code/documentation of <code>patgen</code> which "... takes list of hyphenated words and generates a set of patterns that can be used by the TeX 82 hyphenation algorithm." </li>
<li><strong>tangle.web</strong>: the source code/documentation of <code>tangle</code>, which converts a WEB file to a Pascal file (i.e., extracts the source code in Pascal, not in C &ndash; that's why Web2C exists).</li>
<li><strong>weave.web</strong>: the source code/documentation of <code>weave</code>, which converts a WEB file to TeX (i.e., extracts the documentation of the program's Pascal source code).</li>
</ul>
<p>and other programs/utilities such as <code>dvicopy.web</code>, <code>pltotf.web</code>, <code>tftopl.web</code> and so forth.</p>
<blockquote><p><strong>What's in a name: tangle, web and weave?</strong> I've not researched to find out, but I cannot help thinking that Professor Knuth drew inspiration from Sir Walter Scott when naming these programs. Scott's poem <a href="http://en.wikipedia.org/wiki/Marmion_%28poem%29">Marmion</a> contains the line(s) "O, what a tangled web we weave, when first we practise to deceive". Maybe these programs are as literary as they are literate?</p></blockquote>
<h1>TeXLive as the source of the files for building <code>Web2C.exe</code></h1>
<p>The files I reference throughout this post can be downloaded via SVN from the TeXLive repository. If you want to browse the TeXLive repository, using the TortoiseSVN program on Windows, <a href="http://www.readytext.co.uk/?p=365">this post</a> may be of help. The following screenshots show the TeXLive folders you'll need to access for acquiring the various files I mention in this post.</p>
<ul>
<li><strong>svn://tug.org/texlive/trunk/Build/source/texk/web2c</strong>: this folder contains, for example, <code>tangleboot.pin</code> (see below) and all the <code>*.web</code> files listed above, plus many other essential files.
<p><img src="http://readytext.co.uk/files/texlive1.png" width="100%"/></p>
</li>
<li><strong>svn://tug.org/texlive/trunk/Build/source/texk/web2c/web2c</strong>: this folder contains the source files needed to build the actual <code>Web2C.exe</code> program. Note carefully it <strong>does not</strong> contain a file called <code> Web2C.c</code>, more on that below.
<p><img src="http://readytext.co.uk/files/texlive2.png" width="100%"/></p>
</li>
</ul>
<p>TeXLive has an advanced build-process for compiling/building all the tools and software it contains and I, for one, am in awe of the skills and expertise of its maintainers. In describing my explorations of building <code>Web2C.exe</code> as a Windows-based executable you need to realize that I am taking the source code files of <code>Web2C.exe</code> out of their "natural build environment". What do I mean by this? Building the Web2C executable program is usually part of the much bigger TeXLive build/compilation process so you should be prepared for a little extra complexity to create <code>Web2C.exe</code> as a "standalone" Windows program. Note that "standalone" is in quotes because converting WEB-generated Pascal into C code requires other tools in addition to <code>Web2C.exe</code>: it is not fully accomplished by <code>Web2C.exe</code> alone.</p>
<h1>A note about Kpathsea</h1>
<p>The <a href="http://tug.org/kpathsea/">Kpathsea (path-searching) C library</a> is an integral part of most TeX-related software and the Web2C C source files <code>#include</code> a number of Kpathsea headers. However, for my own purposes/experiments I've decided to decouple my build of the <code>Web2C.exe</code> executable from the need to include Kpathsea's headers &ndash; the resulting C files generated by <code>Web2C.exe</code> will, of course, still depend on Kpathsea. If you grab the Web2C source files (see below) then "out of the box" you'll need to checkout the Kpathsea library from:</p>
<p>svn://tug.org/texlive/trunk/Build/source/texk/kpathsea</p>
<p>I've simply not got the time to document everything I had to do to decouple Kpathsea when building <code>Web2C.exe</code>. It mainly involved commenting out various <code>#include</code> lines that pulled in Kpathsea headers and placing a few <code>#define</code> statements into my local version of <code>web2c.h</code> &ndash; plus creating some typedefs and adding a few macros. If you're an experienced C programmer it is unlikely to present difficulties. As mentioned, this post describes a work-in-progress to satisfy my own curiosity and is meant to share a few of the things I've learnt, should they be useful to anyone as a starting point for their own work.</p>
<h1>Web2C: so <em>what is it</em>?</h1>
<p>Let me be clear that when I refer to Web2C I am referring to the executable program which undertakes the first (main) step in converting Pascal code into C. So let's now take a look at the details, starting with a summary of "Where are we?"</p>
<h2>Where are we?</h2>
<p>The starting point for generating C code is to extract the Pascal code from WEB source files and that is accomplished using the <code>tangle</code> program. However, where do we get a working <code>tangle</code> program from to start with &ndash; do we have a <a href="http://en.wikipedia.org/wiki/Chicken_or_the_egg">chicken and egg</a> problem? <code>tangle</code> is itself distributed in WEB source code (<code>tangle.web</code>), so if I need <code>tangle</code> to extract tangle's source code from <code>tangle.web</code>, how do I create a working tangle program? Well, of course, this is solved by the distribution of tangle's Pascal code in a file called <code>tangleboot.pin</code> within the Web2C directory of the TeXLive repository (see above). In essence, <code>tangleboot.pin</code> lets you "bootstrap" the whole Web2C process by creating a working <code>tangle.exe</code> which you can use to generate the Pascal from WEB source files. Hence the name tangle<strong>boot</strong>.pin.</p>
<p>So, how do I go from <code>tangleboot.pin</code> to a working tangle.exe? You need to build <code>Web2C.exe</code> and some associated utility programs (e.g., <code>fixwrites.exe</code>).</p>
<h2>Where are the <code>Web2C.exe</code> source files?</h2>
<p>As mentioned above, the TeXLive folder containing the source files needed to build <code>Web2C.exe</code> is</p>
<ul>
<li>svn://tug.org/texlive/trunk/Build/source/texk/web2c/web2c</li>
</ul>
<p>The C source files you need to compile/build <code>Web2C.exe</code> are:</p>
<ul>
<li>kps.c</li>
<li>main.c</li>
<li>web2c-lexer.c</li>
<li>web2c-parser.c</li>
</ul>
<h2>Some notes on these files</h2>
<p>These C files <code>#include</code> a number of header files from the TeXLive distribution, notably from the Kpathsea library, so you should definitely look through them to determine any additional files you need.</p>
<p>The files <code>web2c-parser.c</code> and <code>web2c-lexer.c</code> are worthy of some explanation because they are the core files which drive the Pascal --> C conversion process. However, these two C source files are not hand-coded but are <em>generated</em> from two further source files with similar names. If you look among the source files you will also notice these two additional files:</p>
<ul>
<li><code>web2c-lexer.l</code></li>
<li><code>web2c-parser.y</code></li>
</ul>
<p>What are these files with similar names? As you may infer from their extensions, they are the specifications from which the lexical analyser and the parser are generated, and they require additional tools to process them:</p>
<ul>
<li><code>web2c-lexer.l --> web2c-lexer.c </code> using a tool called <a href="http://flex.sourceforge.net/">flex</a>.</li>
<li><code>web2c-parser.y --> web2c-parser.c + web2c-parser.h</code> using a tool called <a href="http://www.gnu.org/software/bison/">bison</a>.</li>
</ul>
<h3>Are bison/flex available for Windows?</h3>
<p>Fortunately they are and, at the time of writing (February 2013), you can download Windows ports of <code>bison 2.7</code> and <code>flex 2.5.37</code> from <a href="http://sourceforge.net/projects/winflexbison/ ">http://sourceforge.net/projects/winflexbison/</a>. The executables are called <code>win_bison.exe</code> and <code>win_flex.exe</code> respectively. The <code>win_flex.exe</code> port of flex adds an extra command-line switch (<code>--wincompat</code>) so that the C code it generates uses the standard Windows header <code>io.h</code> instead of <code>unistd.h</code> (which is used on Linux). You can also download older versions of bison and flex for Windows from the <a href="http://gnuwin32.sourceforge.net/packages.html">GnuWin32 project</a>.</p>
<p>I have not yet tried to use the code generated by <code>win_flex.exe</code> and <code>win_bison.exe</code> but to the best of my (current) knowledge the command-line options you need are:</p>
<ul>
<li><code>win_bison -y -d web2c-parser.y</code> to generate the parser (you'll get different file names on output: <code>y.tab.c</code> and <code>y.tab.h</code>)</li>
<li><code>win_flex --wincompat web2c-lexer.l</code> to generate the lexical analyser (you'll get a different file name on output: <code>lex.yy.c</code>)</li>
</ul>
<h1>You need more than just <code>Web2c.exe</code>...</h1>
<p>Assuming that you successfully build <code>Web2c.exe</code>, it is still not the end of the story. Although <code>Web2c.exe</code> does the <em>bulk</em> of the work in converting the Pascal to C, some initial pre-processing of the Pascal source file is needed before you can run it through <code>Web2C.exe</code>, and some further post-processing of the C code output by <code>Web2C.exe</code> is also needed. The details of how these pre- and post-processing steps actually work are contained within an important BASH shell script called <code>convert</code> (it has no extension) &ndash; <code>convert</code> is located within the TeXLive folder containing the Web2C source files. I readily confess that I know very little about Linux shell scripting so if you are well-versed in shell scripts no doubt you can easily understand what is going on in the <code>convert</code> file. However, here are pointers to get you started.</p>
<h3>Pre-processing: adding the <code>*.defines</code> files to the Pascal file</h3>
<p>Before you can actually run <code>Web2C.exe</code> on the Pascal file generated from WEB sources you need to concatenate the Pascal source file with some files having the extension "<code>.defines</code>": you add these files to the <strong>start</strong> of the Pascal file before running <code>Web2C.exe</code>. There are several <code>.defines</code> contained in the Web2C source directory including:</p>
<ul>
<li><code>common.defines</code></li>
<li><code>mfmp.defines</code></li>
<li><code>texmf.defines</code></li>
</ul>
<p>The <code>convert</code> script checks which program (TeX, MetaFont, BiBTeX etc.) is being built, and with which options, and concatenates the appropriate <code>*.defines</code> file(s) to the start of the corresponding Pascal file. At this time, I don't fully understand how/why these files are needed, but for the full details you need to read <code>convert</code>. By way of an example, when processing <code>tangleboot.pin</code> I added the file <code>common.defines</code> to the beginning of <code>tangleboot.pin</code>.</p>
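<p>If you would rather not rely on shell tools for this concatenation step, it can be done with a few lines of C. The sketch below is only an illustration: the names <code>append_file</code> and <code>prepend_defines</code> are hypothetical and are not part of the Web2C sources, and it handles a single <code>.defines</code> file, whereas <code>convert</code> may add several.</p>

```c
#include <assert.h>
#include <stdio.h>

/* Append the contents of the file `src_path` to the open stream `dst`.
 * Returns 0 on success, -1 on failure. */
static int append_file(FILE *dst, const char *src_path)
{
    char buf[4096];
    size_t n;
    int status = 0;
    FILE *src = fopen(src_path, "rb");

    if (src == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        if (fwrite(buf, 1, n, dst) != n) {
            status = -1;
            break;
        }
    }
    if (ferror(src))
        status = -1;
    fclose(src);
    return status;
}

/* Hypothetical helper: write `defines_path` followed by `pascal_path`
 * into `out_path`, so the .defines text sits at the START of the
 * Pascal file. Returns 0 on success, -1 on failure. */
int prepend_defines(const char *defines_path, const char *pascal_path,
                    const char *out_path)
{
    FILE *out;
    int ok;

    assert(defines_path != NULL && pascal_path != NULL && out_path != NULL);
    out = fopen(out_path, "wb");
    if (out == NULL)
        return -1;
    ok = append_file(out, defines_path) == 0
      && append_file(out, pascal_path) == 0;
    if (fclose(out) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

<p>For the example above, a call such as <code>prepend_defines("common.defines", "tangleboot.pin", "tangleboot.p")</code> would produce the file to feed to <code>Web2C.exe</code>; the output file name here is simply my choice for the illustration.</p>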
<h3>Post-processing: <code>fixwrites.exe</code></h3>
<p><code>Web2C.exe</code>'s output is not quite pure C source code &ndash; it may still contain some fragments of Pascal which need a specialist post-processing step to fully convert them to C: enter <code>fixwrites.exe</code>. <code>fixwrites.exe</code> post-processes <code>Web2C.exe</code>'s C output to "...convert Pascal write/writeln's into fprintf's or putc's" (see <code>fixwrites.c</code>). </p>
<h3>Notes on <code>web2c-parser.c</code>, <code>web2c-lexer.c</code> and <code>main.c</code></h3>
<p>Upon reading the <code>convert</code> script, and when I first ran <code>Web2C.exe</code>, it became readily apparent that the whole Pascal --> C tool chain (driven by <code>convert</code>) communicates using <a href="http://www.linfo.org/pipe.html">pipes</a> with stdout/stderr. The output of one program is "piped" into the input to another, rather than writing the data out to a physical disc file and then reading it back in. My personal preference, certainly whilst learning, is to output data to a file so that I can capture what's going on. </p>
<h4><code>main.c</code> and <code>yyin</code></h4>
<p>Without going into too much detail, I needed to make a number of changes in <code>main.c</code> so that the lexical analyzer <code>web2c-lexer.c</code> was set to read its data from a disc file rather than through pipes/stdin. The <code>FILE*</code> variable you need to set/define is called <code>yyin</code>. Within <code>main.c</code> there is a function called <code>initialize ()</code> which can be used to set <code>yyin</code>. For example:</p>
<p><code>void initialize (void)<br />
{<br />
  register int i;<br />
  for (i = 0; i < hash_prime; hash_list[i++] = -1)<br />
    ;</p>
<p> yyin = xfopen("your_path_to\\tangleboot.p","r");<br />
 ...<br />
 ...<br />
}</code></p>
<p>In addition, within <code>main.c</code> there's a small function called <code>normal ()</code> which does the following:</p>
<p><code><br />
void normal (void)<br />
{<br />
  out = stdout;<br />
}</code></p>
<p>The <code>normal ()</code> function is called from within <code>web2c-parser.c</code> to set the output file (<code>FILE *out</code>) to stdout. At present, I'm not sure precisely why this is done, but I guess it is part of the piping between programs as driven by the <code>convert</code> process. For example, code within <code>convert</code> uses <a href="http://www.gnu.org/software/sed/manual/sed.html">sed</a> (the stream editor).  </p>
<p>Other output redirections happen in <code>web2c-parser.c</code> and you can search for these by looking for <code>out = 0</code>. Tracking down and locating these output redirections certainly helped me to better understand the flow of the programs.</p>
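<p>One way to keep both behaviours available (pipes by default, disc files when you want to capture what's going on) is a small helper that opens a named file if one is given and otherwise falls back to the standard stream. This is only a sketch: <code>open_or_default</code> is a hypothetical name, and it uses plain <code>fopen</code> to stay self-contained, whereas the code shown earlier uses <code>xfopen</code>.</p>

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open `path` in `mode`, or return `fallback`
 * (typically stdin or stdout) when no path is supplied.
 * Returns NULL if a path was supplied but could not be opened. */
FILE *open_or_default(const char *path, const char *mode, FILE *fallback)
{
    assert(mode != NULL && fallback != NULL);
    if (path == NULL || path[0] == '\0')
        return fallback;            /* keep the original piped behaviour */
    return fopen(path, mode);       /* read from / write to a disc file */
}
```

<p>Inside <code>initialize ()</code> you could then write something like <code>yyin = open_or_default(in_path, "r", stdin);</code> and inside <code>normal ()</code> something like <code>out = open_or_default(out_path, "w", stdout);</code>, where <code>in_path</code> and <code>out_path</code> are hypothetical variables that you would have to set yourself, for example from the command line.</p>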
<h1>In conclusion</h1>
<p>This post is a little disjointed in places and light on detail in a number of areas, reflecting my own (currently) incomplete understanding of the relatively complex processes involved in converting WEB/Pascal to C. Nevertheless, I hope that it is of some use to someone, at some point. As my understanding develops I'll try to fill in the gaps with future posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2529</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building and using CTIE, CWEAVE and CTANGLE on Windows</title>
		<link>http://www.readytext.co.uk/?p=2475&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=building-and-using-ctie-cweave-and-ctangle-on-windows</link>
		<comments>http://www.readytext.co.uk/?p=2475#comments</comments>
		<pubDate>Mon, 28 Jan 2013 20:23:42 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[Web2C, tangle, weave, WEB]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2475</guid>
		<description><![CDATA[Introduction
Before continuing, I should say that this post is a sort of aide-mémoire for myself but I hope it is useful to others as well. Anyone who has looked into building TeX from the WEB source code soon finds that the process is somewhat "less than straightforward". Life can get a bit more complicated if, [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>Before continuing, I should say that this post is a sort of aide-mémoire for myself but I hope it is useful to others as well. Anyone who has looked into building TeX from the WEB source code soon finds that the process is somewhat "less than straightforward". Life can get a bit more complicated if, like me, you prefer to use Microsoft's Visual Studio rather than MSYS/MinGW &ndash; which gives you a cut down "Linux-like" build environment. I use MSYS/MinGW for building LuaTeX and it works really well, but I confess to being seduced by the nice IDE of Microsoft's Visual Studio. Having used Visual Studio to build a couple of C-based TeX distributions (<a href="http://code.google.com/p/yytex/">Y&amp;Y TeX</a>, now open sourced), together with <a href="http://www.metatex.org/cxtex/">CXTeX</a>, part of <a href="http://www.metatex.org/">MetaTeX</a> (by Taco Hoekwater), I have decided to "bite the bullet" and create a Visual Studio build for LuaTeX. I'm sure this will take quite some time but, you know, sometimes you get one of those itches you just have to scratch! And I've been meaning to attempt this for a long time, purely as an exercise.</p>
<h2>CWEB</h2>
<p>A lot of LuaTeX source code (apart from the libraries it uses) is written in <a href="http://sunburn.stanford.edu/~knuth/cweb.html">CWEB</a>, a dialect of <a href="http://en.wikipedia.org/wiki/Literate_programming">literate programming</a> by Silvio Levy and Donald Knuth. The original WEB sources of TeX use Pascal as the programming language, but for CWEB it is C &ndash; which, thankfully, saves you the painful process of converting Pascal to C via Web2C. Anyone who has looked at the C code generated by Web2C will, I'm sure, be dismayed because it's almost impossible to read. This is not so surprising given that it's machine generated. The joy of LuaTeX's C code, derived from CWEB, is that it is much, much more readable than the C code derived via WEB --> Web2C --> C. Not surprising, of course, because the C in a CWEB file is written by hand rather than translated mechanically from Pascal.</p>
<h2>So, how do you process CWEB code?</h2>
<p>Enter CWEAVE and CTANGLE. What are these, you may well ask. Their task is to process a file written using CWEB (typically, with an extension "<code>.w</code>") and output the C source code (using CTANGLE) or the program's documentation in TeX (using CWEAVE). From <a href="http://sunburn.stanford.edu/~knuth/cweb.html">http://sunburn.stanford.edu/~knuth/cweb.html</a>:</p>
<ul>
<li><strong>CTANGLE</strong>: converts a source file foo.w to a compilable program file foo.c; </li>
<li><strong>CWEAVE</strong>: converts a source file foo.w to a prettily-printable and cross-indexed document file foo.tex. </li>
</ul>
<p>The LuaTeX build process (with MSYS/MinGW) also generates the executable <code>ctangle.exe</code>, so my first thought was "Great, I'll just use that to generate C source from the CWEB <code>*.w</code> files in the LuaTeX distribution". Running <code>ctangle</code> (from the LuaTeX build) using the command line (under the Windows cmd shell, or the MSYS BASH shell):</p>
<p><code>ctangle --help</code></p>
<p>you get the following output:</p>
<p><code>$ ctangle --help<br />
Usage: ctangle [OPTIONS] WEBFILE[.w] [{CHANGEFILE[.ch]|-} [OUTFILE[.c]]]<br />
  Tangle WEBFILE with CHANGEFILE into a C/C++ program.<br />
  Default CHANGEFILE is /dev/null;<br />
  C output goes to the basename of WEBFILE extended with `.c'<br />
  unless otherwise specified by OUTFILE; in this case, '-' specifies<br />
  a null CHANGEFILE.</code></p>
<h2>But first, change files and CTIE</h2>
<p>From the output of <code>ctangle --help</code> you can see that its command line includes reference to <code>CHANGEFILE.ch</code>. So, what is that? Suppose that you have some program <code>foo.w </code> written in CWEB and you want to make some platform-specific modifications to <code>foo.w</code>. Rather than amending <code>foo.w</code> itself (e.g., to keep it platform-independent) you "merge" <code>foo.w</code> with a change file which, for example, may contain Windows-specific code. You would put your Windows CWEB code into, say, <code>win32.ch</code> and merge this code with <code>foo.w</code>. So how do you do this merge? There are two main ways: </p>
<ol>
<li>you can combine <code>foo.w</code> and <code>win32.ch</code> using CWEAVE or CTANGLE, or</li>
<li>you can use another program called CTIE</li>
</ol>
<h3>What is CTIE?</h3>
<p>The idea behind CTIE is that it lets you merge a master CWEB file with <em>multiple </em> change files, whereas CWEAVE or CTANGLE support only 1 change file on their command line. The source code of CTIE is also part of the LuaTeX distribution, in the directory <code>..\source\texk\web2c\ctiedir\</code>. In there you'll find a file called <code>ctie.c</code> which compiles easily using Visual Studio to give you <code>ctie.exe</code>. If you want to read more about CTIE I have processed the documentation which you can <a href="http://www.readytext.co.uk/files/ctiedoc.pdf">download as a PDF</a>.</p>
<h2>Default CHANGEFILE is /dev/null</h2>
<p>On reading the <code>Usage</code> information output by <code>ctangle --help</code> you should note that the <code>Usage</code> instructions state: <code>Default CHANGEFILE is /dev/null</code>. The explanation of <code>/dev/null</code> on <a href="http://en.wikipedia.org/wiki/Dev-null">http://en.wikipedia.org/wiki/Dev-null</a> states that:</p>
<blockquote><p>"In Unix-like operating systems, /dev/null or the null device is a special file that discards all data written to it."</p></blockquote>
<p>The Usage instructions are a little cryptic but what it is saying is that if you want to run CWEAVE or CTANGLE without using a change file, you would run it like this:</p>
<p><code>ctangle foo.w - </code></p>
<p>where the hyphen (<code>-</code>) in effect says "don't use a changefile". Let's take an example CWEB file from the LuaTeX distribution, <code>align.w</code>, and try to generate the C source code using the version of ctangle built during the LuaTeX compilation process using MSYS/MinGW. Here we don't want to apply a change file so we'll use the hyphen option in place of a change file (the C output file will default to <code>align.c</code>):</p>
<p><code>ctangle align.w -</code></p>
<p>The resulting output is:</p>
<p><code>This is CTANGLE, Version 3.64 (TeX Live 2011)<br />
! Cannot open change file NUL. (l. 0)</p>
<p>(That was a fatal error, my friend.)</code></p>
<p>That's a bit annoying, but the fix is very simple and there are a couple of ways to do it.</p>
<h2>How to fix this?</h2>
<p>To build CTANGLE you need two files: <code>ctangle.c</code> and <code>common.c</code>, both of which are located in the source directory of LuaTeX (<code>..\source\texk\web2c\cwebdir\</code>). The "offending" code which causes the fatal error is located in <code>common.c</code> (and, of course,  <code>common.w</code>). </p>
<p>In <code>common.c</code> (or <code>common.w</code>) you'll find the line:</p>
<p><code>if (found_change<=0) strcpy(change_file_name,"/dev/null");</code></p>
<p>and that needs changing for Windows. Fortunately, in the LuaTeX distribution (<code>..\source\texk\web2c\cwebdir\</code>) there is a change file, <code>comm-w32.ch</code>, taking care of this (written by Fabrice Popineau, in February 2002). In <code>comm-w32.ch</code> you'll find the above line replaced with:</p>
<p><code>if (found_change<=0) strcpy(change_file_name,"NUL");</code></p>
<p>Of course, the proper way to fix this is to apply a change file (such as <code>comm-w32.ch</code>) to the CWEB source of <code>common.w</code> and re-generate <code>common.c</code> with the above fix. You can fix <code>common.c</code> in at least two ways:</p>
<ol>
<li>manually edit <code>common.c</code> to replace "/dev/null" with "NUL" in the line above, or</li>
<li>use the version of <code>ctangle</code> created by the LuaTeX build, but with the <code>comm-w32.ch</code> change file &ndash; this works because it is only the <em><strong>absence</strong></em> of a change file that triggers the error.</li>
</ol>
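<p>A further possibility, which is <em>not</em> what the LuaTeX sources do, is to let the compiler choose the name of the null device so that one source file works on both platforms. The sketch below assumes the <code>_WIN32</code> macro (predefined by Visual Studio and MinGW when targeting Windows); the function names <code>null_device_name</code> and <code>default_change_file</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Name of the "discard everything" device on the platform we are
 * compiling for. */
const char *null_device_name(void)
{
#ifdef _WIN32
    return "NUL";
#else
    return "/dev/null";
#endif
}

/* Hypothetical counterpart of the line in common.c: when no change file
 * was given on the command line, fall back to the null device. */
void default_change_file(char *change_file_name, size_t size, int found_change)
{
    assert(change_file_name != NULL);
    assert(size > strlen(null_device_name()));
    if (found_change <= 0)
        strcpy(change_file_name, null_device_name());
}
```

<p>If you did adopt this, the cleaner route would still be to put it in a change file rather than editing <code>common.c</code> by hand, for the reasons given above.</p>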
<blockquote><p><strong>Note:</strong> If you are experimenting with these CWEB tools I strongly suggest you make a copy of all your <code>*.w</code> files into a working directory in case you make an error and accidentally overwrite any files. </p></blockquote>
<p>Copy <code>ctangle.exe</code>, <code>common.w</code> and <code>comm-w32.ch</code> to a working directory away from your main source code, CD into that directory (make it the current directory), and run the following command line (it works under DOS and the MSYS BASH shell). The "<code>./</code>" simply tells ctangle to look in the current directory. </p>
<p><code>$ ctangle ./common.w ./comm-w32.ch ./mycommon.c</code></p>
<p>If successful, the output should be:</p>
<p><code>ctangle ./common.w ./comm-w32.ch ./mycommon.c<br />
This is CTANGLE, Version 3.64 (TeX Live 2011)<br />
*1*5*7*27*56*67*77*81<br />
Writing the output file (./mycommon.c):.....500.....1000<br />
Done.<br />
(No errors were found.)<br />
</code></p>
<p>giving you a new version of <code>common.c</code> (which I called <code>mycommon.c</code>) with the fix applied by <code>comm-w32.ch</code>. If you look at the last lines of the <code>mycommon.c</code> file you just generated you should see something like this:</p>
<p><code>#line 78 "./comm-w32.ch"<br />
if(found_change<=0)strcpy(change_file_name,"NUL");<br />
#line 1283 "./common.w"</code></p>
<p>You can see that line 78 of <code>comm-w32.ch</code> has been applied. Now, with the fixed file (<code>mycommon.c</code>) you can proceed to build CTANGLE using Visual Studio to generate an executable that accepts the "NUL" change file. We'll see that in a moment.</p>
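<p>If the <code>#line</code> directives in the generated file look unfamiliar: CTANGLE writes them so that the compiler reports errors against the CWEB source (or change file) rather than against the generated C. Here is a small sketch of the effect in plain C; the two function names are hypothetical and exist only for this illustration.</p>

```c
/* After a #line directive the compiler renumbers the following lines
   and, if a file name is given, changes what __FILE__ expands to. */
int line_seen_by_compiler(void)
{
#line 78 "comm-w32.ch"
    return __LINE__;  /* the compiler treats this as line 78 of comm-w32.ch */
}

const char *file_seen_by_compiler(void)
{
    return __FILE__;  /* still "comm-w32.ch" until the next #line directive */
}
```

<p>This is why a compiler error in <code>mycommon.c</code> would point you at the relevant line of <code>common.w</code> or <code>comm-w32.ch</code>.</p>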
<p>Let's use a different approach: using CTIE to merge <code>common.w</code> and <code>comm-w32.ch</code> into, say, <code>mycommon.w</code>. From <code>mycommon.w</code> we'll use our newly compiled CTANGLE to output <code>mycommon.c</code>. The following CTIE command line does the trick: </p>
<p><code>ctie -m mycommon.w common.w comm-w32.ch</code></p>
<p>The -m option is documented <a href="http://www.readytext.co.uk/files/ctiedoc.pdf">here</a>. If successful, you should see something like this:</p>
<p><code>This is CTIE, Version 1.1<br />
Copyright 2002,2003 Julian Gilbey.  All rights reserved.  There is no warranty.<br />
Run with the --version option for other important information.<br />
(common.w)<br />
(comm-w32.ch)<br />
....500....1000....<br />
(No errors were found.)</code></p>
<p>However, you can of course simply edit the file <code>common.c</code> directly to make the change. Once you fix <code>common.c</code>, both CWEAVE and CTANGLE compile nicely with Visual Studio and work perfectly when you use the "<code>-</code>" option to indicate no change file. With a working CTANGLE you can generate the C source of <code>newcommon.w</code> like this:</p>
<p><code>G:\CWEB\cwebtools\Debug>ctangle newcommon.w -<br />
This is CTANGLE (Version 3.64)<br />
*1*5*7*27*56*67*77*81<br />
Writing the output file (newcommon.c):.....500.....1000<br />
Done.<br />
(No errors were found.)</code></p>
<p>With CTANGLE in place you can now run it on the CWEB <code>*.w</code> sources in LuaTeX to generate the C code. Clearly, for Visual Studio one way to proceed is to incorporate CWEB sources into your project and have a "Custom Build Step" for <code>.w</code> files, processing them with CTANGLE.</p>
<p>Happy TeXing, or should I say ctieing, ctangling and cweaving!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2475</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Adding a UTF-8-capable regular expression library to LuaTeX</title>
		<link>http://www.readytext.co.uk/?p=2428&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=adding-a-utf-8-capable-regular-expression-library-to-luatex</link>
		<comments>http://www.readytext.co.uk/?p=2428#comments</comments>
		<pubDate>Fri, 23 Nov 2012 19:40:49 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[Extending with C and DLLs]]></category>
		<category><![CDATA[LuaTeX]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2428</guid>
		<description><![CDATA[Introduction
In this post I'm going to sketch out adding the free PCRE C library to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I've not tried this in a production environment. So, do please [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>In this post I'm going to sketch out adding the free <a href="http://www.pcre.org/">PCRE C library</a> to LuaTeX through a DLL and outline how you can get PCRE to call LuaTeX! The following is just an outline of an experiment, not a tutorial on PCRE, and I've not tried this in a production environment. So, do please undertake all necessary testing and due diligence in your own code!</p>
<h2>PCRE: Perl Compatible Regular Expressions</h2>
<p>PCRE is a mature C library which provides a very powerful regular expression engine. It is also capable of working with UTF-8 encoded strings, which is, of course, very useful because LuaTeX uses UTF-8 input. I'm not going to cover the entire PCRE build process in this post because, frankly, it'll take too long. But in outline...</p>
<h3>Building PCRE as a static library (.lib)</h3>
<ol>
<li>I used <a href="http://www.cmake.org/">CMake</a> to create a Visual Studio 2008 project via the PCRE-supplied CMakeLists.txt file. Using the CMake tool you can set the appropriate compile-time flags for UTF-8 support: PCRE_SUPPORT_UTF and PCRE_SUPPORT_UNICODE_PROPERTIES. The latter is very useful for searching UTF-8 strings based on their Unicode character properties. Full details are in the PCRE documentation. </li>
<p><img src="http://readytext.co.uk/files/cmake.png" width="100%"/></p>
<li>After you finish configuring the PCRE build, and have selected your build environment, press <code>Generate</code> and CMake will output a complete Visual Studio project that you can open and start working on. Wonderful!</li>
<li>Getting PCRE to build as a static library was fine, but I did have a few hassles getting the DLL I was building to link correctly against the library. It took me a bit of time to figure out which additional PCRE preprocessor directives I needed to set in the DLL C code to ensure everything was <code>#define</code>'d properly. </li>
</ol>
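<p>As an aside on why UTF-8 support matters: a byte-oriented search treats a character such as "&eacute;" as two separate units. The following sketch, in plain C and independent of PCRE, counts code points rather than bytes. The function name <code>utf8_codepoints</code> is my own, and the code assumes the input is already valid UTF-8.</p>

```c
#include <stddef.h>

/* Count Unicode code points in a UTF-8 string by skipping continuation
   bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;  /* a lead byte or an ASCII byte starts a new code point */
    }
    return n;
}
```

<p>For the string "caf&eacute;" encoded as UTF-8, <code>strlen</code> reports 5 bytes while this function reports 4 code points; a UTF-8-aware regex engine works in terms of the latter.</p>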
<h1>Building a DLL for LuaTeX</h1>
<p>I wrote a very brief overview of building DLLs for LuaTeX in <a href="http://www.readytext.co.uk/?p=489">this post</a> so I won't repeat the details here. Instead, I'll give a summary indicating how you can get PCRE to call LuaTeX. One word of advice: PCRE comes with <em>a lot</em> of documentation and you'll need to read through it very carefully! Asking PCRE to call LuaTeX sounds strange, but you can indeed do it because PCRE provides the ability to register a callback function that it will call each time it matches a string. Perl has a similar ability to execute Perl code on matching a string. From the PCRE documentation:</p>
<blockquote><p>"PCRE provides a feature called 'callout', which is a means of temporarily passing control to the caller of PCRE in the middle of pattern matching. The caller of PCRE provides an external function by putting its entry point in the global variable <code>pcre_callout</code>." </p></blockquote>
<h2>Calling LuaTeX</h2>
<p>OK, so how do we do that? There are two parts to this story: create a Lua function you want to call from C and create the C function which calls the Lua function.</p>
<ol>
<li>From within LuaTeX, use <code>\directlua{...}</code> to create a simple Lua function <code>printy</code> that we are going to call from PCRE. This Lua function takes a string and sends it to LuaTeX via tex.print(). In these examples I sent LuaTeX a simple text string <code>"Yo! I was called!"</code>, which LuaTeX then typeset. Of course, you could also send LuaTeX the string that was matched by PCRE!
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
       \directlua{
              function printy (str)
              tex.print(str)
              end
       }
</pre>
</li>
<li>The next part is to create the C code to call a Lua function. This C function is the callout that PCRE will call when it matches a string.
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
       int mycallout(pcre_callout_block *cb){
              lua_State *L = cb-&gt;callout_data;
              if (L){
                     /* push the global function printy onto the Lua stack */
                     lua_getglobal(cb-&gt;callout_data, &quot;printy&quot;);
                     if (!lua_isfunction(L,-1)) {
                            lua_pop(L,1);
                            return 0;
                     }

                     lua_pushstring(L, &quot;Yo! I was called!&quot;);   /* push 1st argument */
                     /* Now make the call to printy with 1 argument and 0 results */
                     if (lua_pcall(L, 1, 0, 0) != 0) {
                            /* report your error, then pop the error message */
                            lua_pop(L,1);
                            return 0;
                     }
              }
              return 0;
       }
</pre>
<blockquote><p>A few points here are worth noting. </p>
<ul>
<li>From the PCRE documentation:<br />
<blockquote><p>"The external callout function returns an integer to PCRE. If the value is zero, matching proceeds as normal. If the value is greater than zero, matching fails at the current point, but the testing of other matching possibilities goes ahead, just as if a lookahead assertion had failed. If the value is less than zero, the match is abandoned, the matching function returns the negative value"</p></blockquote>
</li>
<li>The <code>lua_State</code> variable, <code>*L</code>, is passed in via a mechanism I'll outline below.</li>
<li>The line <code>lua_getglobal(cb->callout_data, "printy")</code> does the main work of pushing the value of the global variable <code>printy</code> onto Lua's stack. Of course, in effect this is a pointer to the function we defined in LuaTeX, and which we call through <code>lua_pcall(...)</code>. Further details are in the Lua documentation. </li>
<li>The above code does near-zero error checking; it is purely to demonstrate the ideas!</li>
</ul>
</blockquote>
</li>
</ol>
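<p>The three return-value cases quoted from the PCRE documentation above can be summarised in a few lines of plain C. This is only a sketch of the rule as quoted: the enum and the function <code>classify_callout_result</code> are hypothetical names of mine and are not part of the PCRE API.</p>

```c
/* Hypothetical names, not part of PCRE: what the matcher does with the
   integer returned by a callout function. */
typedef enum {
    CALLOUT_CONTINUE,   /* 0: matching proceeds as normal              */
    CALLOUT_FAIL_HERE,  /* > 0: fail at this point, try alternatives   */
    CALLOUT_ABANDON     /* < 0: abandon the match, return the value    */
} callout_action;

callout_action classify_callout_result(int rc)
{
    if (rc == 0)
        return CALLOUT_CONTINUE;
    if (rc > 0)
        return CALLOUT_FAIL_HERE;
    return CALLOUT_ABANDON;
}
```

<p>The <code>mycallout</code> function above always returns 0, so it never interferes with matching; it simply gets a chance to run some Lua code along the way.</p>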
<h2>Other PCRE bits and pieces</h2>
<p>There are a few other points to consider, namely how do you set up the callout and how do you pass <code>lua_State *L</code> to the callout? I'm not going to explain in great detail how all these parts hang together in a full application, simply point out some key pieces.</p>
<ol>
<li>You have to set the PCRE global variable <code>pcre_callout</code>, a function pointer, to your callout function. Simply, <code>pcre_callout = mycallout;</code> Yes, it does work.</li>
<li>Before you can start searching, you need to "compile" your regular expression pattern. In the code below, <code>re</code> represents our compiled regular expression pattern. Note that you must use the <code>PCRE_UTF8</code> option if you are searching UTF-8 encoded text.
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
              re = pcre_compile(pattern,
		      PCRE_UTF8|PCRE_UCP,
		      &amp;err_msg,
		      &amp;err,
		      NULL);
</pre>
</li>
<li>Note that to use PCRE callouts you need to use the appropriate syntax in your regular expression; from the PCRE documentation, "Within a regular expression, (?C) indicates the points at which the external function is to be called." Once you have compiled your search pattern, and done your error checking, you need to run the search engine using the compiled pattern and your target string (<code>s</code> in the code below).
</li>
<li>
The next step is to create a pointer to a <code>pcre_extra</code> struct. This struct has a field called <code>callout_data</code>, a pointer in which you can store whatever you want to pass into the <code>mycallout</code> function: here, I'm setting it to the <code>lua_State</code> variable, <code>L</code>. PCRE hands this pointer to the callout in the <code>callout_data</code> field of the <code>pcre_callout_block</code>, so each time PCRE calls the callout function, the <code>lua_State</code> variable, <code>L</code>, will be available for our use! Clearly, you'll need to do this from within the appropriate function you call from LuaTeX. Once this is done you are ready to begin your searching using <code>pcre_exec(...)</code>.
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
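              pcre_extra *p;
              p = (pcre_extra*) malloc(sizeof(pcre_extra));
              memset(p, 0, sizeof(pcre_extra));
              p-&gt;callout_data = L;                  /* passed through to mycallout */
              p-&gt;flags = PCRE_EXTRA_CALLOUT_DATA;   /* tell PCRE that callout_data is set */
              res = pcre_exec(re,        /* compiled pattern */
                              p,         /* extra data, including callout_data */
                              s,         /* subject string */
                              len,       /* length of subject string */
                              0,         /* start offset */
                              0,         /* options */
                              offsets,   /* output vector for match offsets */
                              OVECMAX);  /* size of output vector */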
</pre>
</li>
</ol>
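<p>If the <code>callout_data</code> mechanism seems a little abstract, here is a self-contained sketch of the same pattern in plain C, with no PCRE or Lua involved. Everything named <code>demo_*</code> is hypothetical and merely stands in for the real PCRE types: a toy "engine" calls a user-supplied function at every occurrence of a substring and hands it an opaque <code>void *</code> supplied by the caller, much as PCRE hands <code>L</code> to <code>mycallout</code>.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pcre_callout_block. */
typedef struct {
    size_t position;     /* offset in the subject where the match starts */
    void *callout_data;  /* opaque pointer supplied by the caller */
} demo_callout_block;

typedef int (*demo_callout_fn)(demo_callout_block *cb);

/* Toy "engine": calls the callout at each occurrence of needle in subject.
   Returns the number of accepted matches, or the callout's negative
   return value if the callout asked to abandon the search. */
int demo_search(const char *subject, const char *needle,
                demo_callout_fn callout, void *callout_data)
{
    int matches = 0;
    size_t step = strlen(needle);
    const char *p = subject;

    if (step == 0)
        return 0;
    while ((p = strstr(p, needle)) != NULL) {
        demo_callout_block cb;
        int rc;

        cb.position = (size_t)(p - subject);
        cb.callout_data = callout_data;
        rc = callout(&cb);
        if (rc < 0)
            return rc;      /* abandon */
        if (rc == 0)
            ++matches;      /* accept this match */
        p += step;
    }
    return matches;
}

/* The caller's own state; in the LuaTeX case this would be the lua_State. */
typedef struct {
    int calls;
    size_t last_position;
} demo_context;

/* Example callout: recover the caller's state from the opaque pointer. */
int demo_callout(demo_callout_block *cb)
{
    demo_context *ctx = cb->callout_data;
    if (ctx != NULL) {
        ctx->calls++;
        ctx->last_position = cb->position;
    }
    return 0;
}
```

<p>Calling <code>demo_search("abcabcab", "ab", demo_callout, &amp;ctx)</code> with a zero-initialised <code>demo_context ctx</code> invokes the callout three times. The opaque pointer avoids the need for a global variable to hold the state, which is the same reason for storing <code>L</code> in <code>callout_data</code>.</p>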
<h1>Summary</h1>
<p>PCRE is a marvellous and powerful C library &ndash; with copious documentation that you'll need to read very carefully! The ability to provide LuaTeX with a UTF-8-enabled regex engine could open the way to some useful applications, particularly when combined with LuaTeX's own callback mechanism. In particular, the <code>process_input_buffer</code> callback allows you to change the contents of the line input buffer just before LuaTeX actually starts looking at it. The mind boggles at the possibilities!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2428</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Browsing LuaTeX source with NetBeans</title>
		<link>http://www.readytext.co.uk/?p=2416&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=browsing-luatex-source-with-netbeans</link>
		<comments>http://www.readytext.co.uk/?p=2416#comments</comments>
		<pubDate>Mon, 19 Nov 2012 22:17:09 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[LuaTeX]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2416</guid>
		<description><![CDATA[Introduction
It's been a long time since I posted anything on this blog, mainly because my job has been keeping me very busy. As time permits I've been reading parts of the LuaTeX source code in an attempt to better understand how it all works: cross-referencing the source code to explanations in the LuaTeX Reference. A [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>It's been a long time since I posted anything on this blog, mainly because my job has been keeping me very busy. As time permits I've been reading parts of the LuaTeX source code in an attempt to better understand how it all works: cross-referencing the source code to explanations in the LuaTeX Reference. A couple of days ago I stumbled on the NetBeans IDE &ndash; a free Integrated Development Environment. I was interested to see that NetBeans has a Subversion Checkout Wizard (i.e., built-in SVN capabilities), so you can check out a copy of the LuaTeX code repository and import it directly into NetBeans as a new project. So, I downloaded <a href="http://netbeans.org/features/cpp/index.html">NetBeans (with C/C++ support)</a> and checked out a copy of the LuaTeX code base, directly from within NetBeans. After completing the download, NetBeans automatically imported the LuaTeX code to create a new project. Very nice!</p>
<p>I have not yet tried to build LuaTeX using NetBeans (because I need to understand more about the build process), but I have found that it provides excellent tools to search and browse the source code, allowing you to very quickly explore and probe some of the deeper mysteries of TeX.</p>
<h2>Tip: tell NetBeans about .w files</h2>
<p>Much of the LuaTeX code base is written in <a href="http://en.wikipedia.org/wiki/CWEB">CWEB</a> (integrated C source code and documentation); consequently, many of the source files have a .w extension. You'll need to configure NetBeans to tell it about .w files: see Tools --> Options --> Miscellaneous.</p>
<p>Here's a screenshot showing a search for the <code>build_page()</code> function, part of TeX's page-building machinery, showing you where and when TeX exercises the page builder.</p>
<p><img src="http://readytext.co.uk/files/netbeans.png" width="100%"/></p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2416</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>C, the Enduring Legacy of Dennis Ritchie</title>
		<link>http://www.readytext.co.uk/?p=2412&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=c-the-enduring-legacy-of-dennis-ritchie</link>
		<comments>http://www.readytext.co.uk/?p=2412#comments</comments>
		<pubDate>Wed, 19 Sep 2012 21:47:47 +0000</pubDate>
		<dc:creator>Graham Douglas</dc:creator>
				<category><![CDATA[C programming (miscellaneous)]]></category>

		<guid isPermaLink="false">http://www.readytext.co.uk/?p=2412</guid>
		<description><![CDATA[This is an interesting and moving read. A tribute to the late Dennis Ritchie delivered at Dennis Ritchie Day at Bell Labs, Murray Hill, NJ, September 7, 2012 
]]></description>
			<content:encoded><![CDATA[<p>This is an interesting and moving read. <a href="http://www.cs.columbia.edu/~aho/Talks/12-09-07_DMR.pdf">A tribute to the late Dennis Ritchie delivered at Dennis Ritchie Day at Bell Labs, Murray Hill, NJ, September 7, 2012 </a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.readytext.co.uk/?feed=rss2&#038;p=2412</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
