2023-11-10 How to add an index to your PDF

I mean, it's great if you want to write a bunch of Perl scripts. But really… do you? Or indeed: Would you rather learn Perl or LaTeX? 🥶 ❓ 🤷‍♂️ ❓ 🥵

I like my Markdown → HTML + CSS → PDF pipeline. For new PDFs of mine, I try to use this setup rather than relying on LaTeX. It's true, LaTeX documents probably look better in the end. But I don't write enough LaTeX at the end of the day. Everything is always tricky to find. Packages are hard to pick. I always end up on some StackExchange site and nothing is ever simple.

Now, the Markdown → HTML + CSS → PDF pipeline isn't simple, either. But it uses HTML and CSS and I use those two more often. I can look at the temporary HTML file using my browser. When I have questions, I end up on the Mozilla Developer Network (MDN) and it's not too bad. It's the kind of bad that I'm used to.

I'm not sure I'm doing a great job selling this. Remember how, many years ago, I tried to be objective about it all and concluded that LibreOffice would be the most efficient tool? You can go back and read the blog post from 2010. But I guess I got burned back in the last millennium when Word 5.1 was new and liked to crash, and OpenOffice was not great either, and AbiWord was too limited.

I learned to love Emacs and LaTeX and I don't want to go back to those graphical user interfaces. Somehow they make it hard to use styles correctly and consistently. So text-based it is!

What follows is a short summary of how the Markdown → HTML + CSS → PDF pipeline works.

There are a number of things you need:

a bunch of Markdown files, which you write using your favourite text editor
a program to turn Markdown into HTML; I use Python with the Markdown module but any other command line tool would do
a CSS file to format the HTML generated
a program to turn HTML and CSS into PDF; the one I use is called `weasyprint`

Weasyprint is also written in Python, which shouldn't matter too much – except that I have Debian installed and the weasyprint it comes with doesn't know how to hyphenate my text, which is bad news when you're writing a German text with long words. And what German text doesn't have long words? We love smashing words together!

This leads me to an immediate problem that LaTeX solves but that weasyprint does not: In German, you can't have ligatures connecting parts of a word that are themselves smashed-together words. For example, the word Auffahrt (up-drive, also known as Ascension Day) consists of the prefix "auf" and the word "fahrt" so you can't use the ﬀ ligature. There's a LaTeX package for that, selnolig.

A while ago I wrote a Perl script that takes this file and does the right thing for HTML: it inserts ZERO WIDTH NON-JOINER characters in all those places. This Perl script is called keine-ligaturen, no ligatures.

So I need that.
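
To make the idea concrete, here is a minimal Python sketch of the technique. It is not the `keine-ligaturen` script: the two-entry word list `BREAKS` and the function name `break_ligatures` are made up for this example, whereas the real script works from the selnolig rules.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Hypothetical mini rule set: word -> the same word with "|" marking the
# boundary between its parts, where no ligature may be formed.
BREAKS = {
    "Auffahrt": "Auf|fahrt",
    "Kaufleute": "Kauf|leute",
}

def break_ligatures(text: str) -> str:
    """Insert a ZWNJ at the marked boundary of every known word."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        # Unknown words are returned unchanged.
        return BREAKS.get(word, word).replace("|", ZWNJ)
    return re.sub(r"\w+", fix, text)

print(ascii(break_ligatures("Die Auffahrt ist steil.")))
# 'Die Auf\u200cfahrt ist steil.'
```

A word like "Schiff" is not in the list and keeps its ligature, which is the point: only the joints between smashed-together parts get the invisible character.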

Now, Python's Markdown module doesn't generate a stand-alone HTML file. I need to provide my own prefix and suffix.

`prefix` is where I define the language to use for hyphenation and the CSS file to use for the formatting.

<!doctype html>
<html lang=de>
  <head>
    <meta charset="utf-8"/>
    <link type="text/css" rel="stylesheet" href="Horte.css"/>
  </head>
  <body>

`suffix` is the file where I close the `html` and `body` tags I opened in the `prefix` file.

</body>
</html>
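
Gluing the pieces together is plain concatenation. As a small Python sketch of that step (the function name `wrap_fragment` and the sample fragment are made up for this example; the two strings are the `prefix` and `suffix` files shown above):

```python
PREFIX = """<!doctype html>
<html lang=de>
  <head>
    <meta charset="utf-8"/>
    <link type="text/css" rel="stylesheet" href="Horte.css"/>
  </head>
  <body>
"""

SUFFIX = """</body>
</html>
"""

def wrap_fragment(fragment: str, prefix: str = PREFIX, suffix: str = SUFFIX) -> str:
    """Turn the HTML fragment from the Markdown step into a full page."""
    if not fragment.endswith("\n"):
        fragment += "\n"  # keep the closing tags on their own lines
    return prefix + fragment + suffix

page = wrap_fragment("<p>Hallo</p>")
print(page)
```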

I tie all of this together in my `Makefile`. Here's how it might work.

SHELL=/bin/bash
KAPITEL=$(sort $(wildcard [0-9A-F]-*.md))

all: Horte.pdf

%.pdf: %.html %.css
	weasyprint $< $@

%.html: %.html.tmp prefix suffix append-index
	cat prefix $< suffix | perl append-index > $@

%.html.tmp: %.md
	python3 -m markdown \
		--extension=markdown.extensions.attr_list \
		--extension=markdown.extensions.tables \
		--extension markdown.extensions.smarty \
		$< \
	| keine-ligaturen > $@

Horte.md: Titelblatt.md Lizenz.md $(KAPITEL)
	date '+<p class="timestamp">%F</p>' > timestamp
	cat Titelblatt.md timestamp $(KAPITEL) Lizenz.md > $@

clean:
	rm -f Horte.md Horte.html Horte.html.tmp \
	      Horte.pdf

So this is what happens:

all the Markdown files are concatenated into `Horte.md` (and a timestamp is added after the title page)
the Markdown file is turned into a temporary HTML file called `Horte.html.tmp` using the Python Markdown module and the German ligature breaks are added by the `keine-ligaturen` Perl script
the temporary HTML file is concatenated with the `prefix` and `suffix` files to form the real HTML file, called `Horte.html`, and here yet another Perl script is used: `append-index` (more about that below)
finally, the HTML file is turned into PDF using `weasyprint` resulting in `Horte.pdf`

Now we're finally getting to the script I wanted to talk about this entire time. What's the role of `append-index`? It parses the HTML file and determines which terms should be in the index using XPath. And then it adds an index at the end.

There are two parts to the script. In the first part, the terms to index are collected.

the HTML is parsed and an XPath expression is used to search for strings to index (in this case: H3 headings and bold text in blockquotes if they have the `ref` class)
if the term has no `id`, an `id` is computed based on the term to index; a number is appended if the `id` computed turns out to be a duplicate
every term is associated with a list of ids (in case a term appears multiple times)
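
The `id` rule from the second item can be tried out in isolation. This is a Python sketch, not the Perl used in the script; `make_ids` and the sample terms are made up for the illustration.

```python
import re
from collections import Counter

def make_ids(terms):
    """Return one id per term: lowercase ASCII letters only, numbered on repeats."""
    seen = Counter()
    ids = []
    for term in terms:
        # Lowercase, then drop everything that is not a-z.
        base = re.sub(r"[^a-z]", "", term.lower())
        seen[base] += 1
        # The first occurrence keeps the bare id, later ones get 2, 3, ...
        ids.append(base if seen[base] == 1 else f"{base}{seen[base]}")
    return ids

print(make_ids(["Puma", "Puma", "Schwarzer Panther"]))
# ['puma', 'puma2', 'schwarzerpanther']
```

Note that umlauts and other non-ASCII letters are simply dropped, so two different terms can end up with the same base and are then told apart only by the number.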

In the second part, the HTML for the Index page is assembled. Every term is printed, followed by a link for every `id` recorded above. The text of the link is a ZERO WIDTH SPACE so that it doesn't look weird in the PDF. The actual page number to use is still unknown at this point because only the PDF generator knows about the pages! See below for more.
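
As a rough illustration of this second part, here is a Python sketch (again, not the Perl script itself). The function name `render_index` and the sample mapping are assumptions made for the example.

```python
from html import escape

ZWSP = "\u200b"  # ZERO WIDTH SPACE, used as the invisible link text

def render_index(terms):
    """Build the index <div> from a mapping of term -> list of ids."""
    lines = ['<div id="index">', "<h2>Index</h2>"]
    for term in sorted(terms):
        # One link per recorded id; the page number is added later by CSS.
        links = "".join(
            f'<a class="ref" href="#{escape(i, quote=True)}">{ZWSP}</a>'
            for i in terms[term]
        )
        lines.append(f"<p>{escape(term)}{links}</p>")
    lines.append("</div>")
    return "\n".join(lines)

index_html = render_index({"Raubkatzen": ["katze"], "Puma": ["puma", "puma2"]})
print(index_html)
```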

use Modern::Perl '2018';
use XML::LibXML;

undef $/;
my $doc = XML::LibXML->load_html(string => <STDIN>);

my @nodes = $doc->findnodes('//h3 | //blockquote/p/strong[@class="ref"]');

my %terms;
my %n;
for my $node (@nodes) {
  my $content = $node->getAttribute('ref') || $node->textContent;
  next unless length($content) > 1;
  my $id = $node->getAttribute('id');
  if (not $id) {
    $id = lc($content);
    $id =~ tr/A-Za-z//cd;
    $n{$id}++;
    $id .= $n{$id} if $n{$id} > 1;
    $node->setAttribute('id', $id);
  }
  $terms{$content} //= [];
  push(@{$terms{$content}}, $id);
  $node->setAttribute('class', 'indexed');
}
my @body = $doc->findnodes('//body');
$body[0]->appendTextNode("\n");
my $div = XML::LibXML::Element->new('div');
$div->setAttribute('id', 'index');
$body[0]->appendChild($div);
$div->appendTextNode("\n");
my $h2 = XML::LibXML::Element->new('h2');
$h2->appendTextNode('Index');
$div->appendChild($h2);
$div->appendTextNode("\n");
for my $term (sort keys %terms) {
  my $p = XML::LibXML::Element->new('p');
  $div->appendChild($p);
  $p->appendTextNode($term);
  for my $id (@{$terms{$term}}) {
    my $an = XML::LibXML::Element->new('a');
    $an->setAttribute('class', 'ref');
    $an->setAttribute('href', '#' . $id);
    $an->appendTextNode("\x{200B}"); # zero-width space to prevent minimizing
    $p->appendChild($an);
  }
  $div->appendTextNode("\n");
}

print $doc;

Here is some example Markdown illustrating the functionality:

Elfen verwenden gerne grosse [Raubkatzen](#katze) zum Schutz ihrer Lager.

> **Pumas**{: .ref ref="Puma"} (1W6-3) TW 3 RK 14 1W6 RW +1 BW 24 ML 7
> EP 300

### Raubkatzen {: #katze}

> **Puma**{: .ref} TW 3 RK 14 1W6 RW +1 BW 24 ML 7 EP 300

The XPath expression finds the following terms:

"Pumas" because of `class="ref"` and determines that the term to use is "Puma" (singular) because of `ref="Puma"`
"Raubkatzen" because it's an H3 heading
"Puma" because of `class="ref"` and this time the term is the text content
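
The selection can be reproduced in a few lines of Python. This is only a sketch: it walks the tree with the standard library's `xml.etree.ElementTree` instead of evaluating the XPath expression, the helper names are made up, and `SNIPPET` is a hand-written approximation of the HTML the example Markdown turns into.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<body>
<p>Elfen verwenden gerne grosse <a href="#katze">Raubkatzen</a> zum Schutz ihrer Lager.</p>
<blockquote><p><strong class="ref" ref="Puma">Pumas</strong> (1W6-3) TW 3</p></blockquote>
<h3 id="katze">Raubkatzen</h3>
<blockquote><p><strong class="ref">Puma</strong> TW 3</p></blockquote>
</body>"""

def in_blockquote_p(node, parent):
    """True if node sits directly in a <p> that sits directly in a <blockquote>."""
    p = parent.get(node)
    bq = parent.get(p) if p is not None else None
    return p is not None and p.tag == "p" and bq is not None and bq.tag == "blockquote"

def index_terms(root):
    """Collect the terms to index, in document order."""
    parent = {child: p for p in root.iter() for child in p}
    found = []
    for node in root.iter():
        is_heading = node.tag == "h3"
        is_ref = (node.tag == "strong" and node.get("class") == "ref"
                  and in_blockquote_p(node, parent))
        if is_heading or is_ref:
            # The ref attribute wins over the text content.
            term = node.get("ref") or "".join(node.itertext())
            if len(term) > 1:
                found.append(term)
    return found

print(index_terms(ET.fromstring(SNIPPET)))
# ['Puma', 'Raubkatzen', 'Puma']
```

The link in the first paragraph is not picked up: it is neither an H3 heading nor bold text with the `ref` class.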

HTML generated for the Index:

<div id="index">
<h2>Index</h2>
…
<p>Puma<a class="ref" href="#puma"></a><a class="ref" href="#puma2"></a></p>
<p>Raubkatzen<a class="ref" href="#katze"></a></p>
…
</div>

The magic for page numbers is in the CSS, namely in the last rule where it says that the content of index links is a space and `target-counter(attr(href), page)`.

/* index */
#index {
    columns: 3;
    column-gap: 2ex;
    font-size: 11pt;
    line-height: 13pt;
    text-align: left;
}
#index h2 {
    column-span: all;
}
#index p {
    margin: 0;
    padding-left: 1em;
    text-indent: -1em;
}
#index a {
    color: inherit;
    text-decoration: none;
}
#index a::after {
    content: ' ' target-counter(attr(href), page);
}

See for yourself:

a three-column index page

#Markdown #Perl #Programming

**2024-09-03**. In case you are wondering why I don't use Org Mode, since I like Emacs so much: Markup is useful outside of Emacs. Back when Org Mode started, I gave it a try, but Org Mode → PDF relies on LaTeX templates, which eventually required me to know both Org Mode and LaTeX, so there was no benefit at all. Markdown on its own, however, is useful on sites like GitHub and other software forges, and it is a common format used by Markdown editors elsewhere.

On a wiki, for example, that means Markdown is understood by many people. Org Mode is limited to Emacs users, most of the time.