I mean, it's great if you want to write a bunch of Perl scripts. But
really… do you? Or indeed: would you rather learn Perl or
LaTeX? 🥶 - 🤷 - 🥵
I like my Markdown → HTML + CSS → PDF pipeline. For new PDFs of mine, I try to use this setup rather than relying on LaTeX. It's true, LaTeX documents probably look better in the end. But I don't write enough LaTeX at the end of the day. Everything is always tricky to find. Packages are hard to pick. I always end up on some StackExchange site and nothing is ever simple.
Now, the Markdown → HTML + CSS → PDF pipeline isn't simple, either. But it uses HTML and CSS and I use those two more often. I can look at the temporary HTML file using my browser. When I have questions, I end up on the Mozilla Developer Network (MDN) and it's not too bad. It's the kind of bad that I'm used to.
I'm not sure I'm doing a great job selling this. Remember how, many years ago, I tried to be objective about it all and concluded that Libre Office would be the most efficient tool? You can go back and read the blog post from 2010. But I guess I got burned back in the last millennium when Word 5.1 was new and liked to crash, and Open Office was not great either, and Abi Word was too limited.
I learned to love Emacs and LaTeX and I don't want to go back to those graphical user interfaces. Somehow they make it hard to use styles correctly and consistently. So text-based it is!
What follows is a short summary of how the Markdown → HTML + CSS → PDF pipeline works.
There are a number of things you need:
Weasyprint is also written in Python, which shouldn't matter too much, except that I have Debian installed and the weasyprint it comes with doesn't know how to hyphenate my text, which is bad news when you're writing a German text with long words. And what German text doesn't have long words? We love smashing words together!
This leads me to an immediate problem that LaTeX solves but that weasyprint does not: In German, you can't have ligatures connecting parts of a word that are themselves smashed-together words. For example, the word Auffahrt (up-drive, also known as Ascension Day) consists of the prefix "auf" and the word "fahrt" so you can't use the ﬀ ligature. There's a LaTeX package for that, selnolig.
A while ago I wrote a Perl script that takes this file and does the right thing for HTML: it inserts ZERO WIDTH NON-JOINER characters in all those places. This Perl script is called keine-ligaturen, no ligatures.
So I need that.
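Here's a rough Python sketch of the idea, just to make it concrete. It is not the keine-ligaturen script (that one is Perl and works from a long list of patterns). The two-entry `BREAKS` table and the function name `break_ligatures` are made up for illustration, and the sketch works on plain text, ignoring the question of HTML tags and attributes.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Hypothetical rules for illustration only: lowercase word -> offsets of the
# morpheme boundaries where a ligature must not form.
BREAKS = {
    "auffahrt": [3],   # auf|fahrt: no ff ligature
    "kaufleute": [4],  # kauf|leute: no fl ligature
}


def break_ligatures(text: str) -> str:
    """Insert a ZWNJ at every known morpheme boundary in `text`."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        offsets = BREAKS.get(word.lower())
        if not offsets:
            return word
        # Work from the right so earlier offsets stay valid.
        for offset in sorted(offsets, reverse=True):
            word = word[:offset] + ZWNJ + word[offset:]
        return word

    # [^\W\d_]+ matches runs of letters, including umlauts.
    return re.sub(r"[^\W\d_]+", fix, text)
```

Words that are not in the table, such as Schiff, pass through untouched, so their ligatures remain.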
Now, Python's Markdown module doesn't generate a stand-alone HTML file. I need to provide my own prefix and suffix.
`prefix` is where I define the language to use for hyphenation and the CSS file to use for the formatting.
    <!doctype html>
    <html lang=de>
    <head>
    <meta charset="utf-8"/>
    <link type="text/css" rel="stylesheet" href="Horte.css"/>
    </head>
    <body>
`suffix` is the file where I close the `html` and `body` tags I opened in the `prefix` file.
    </body>
    </html>
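If you'd rather not keep two separate files around, the same wrapping can be sketched in a few lines of Python. This is only an illustration of the prefix and suffix idea, not what my setup does. The names `make_prefix`, `SUFFIX` and `wrap` are invented here, and the language and stylesheet defaults are simply the values from the `prefix` file.

```python
from html import escape


def make_prefix(lang: str, stylesheet: str) -> str:
    """Build the part of the page that goes before the Markdown output."""
    return (
        "<!doctype html>\n"
        f'<html lang="{escape(lang)}">\n'  # the language drives hyphenation
        "<head>\n"
        '<meta charset="utf-8"/>\n'
        f'<link type="text/css" rel="stylesheet" href="{escape(stylesheet)}"/>\n'
        "</head>\n"
        "<body>\n"
    )


# Closes the tags opened by make_prefix().
SUFFIX = "</body>\n</html>\n"


def wrap(fragment: str, lang: str = "de", stylesheet: str = "Horte.css") -> str:
    """Turn an HTML fragment into a stand-alone HTML document."""
    return make_prefix(lang, stylesheet) + fragment.rstrip("\n") + "\n" + SUFFIX
```

Calling `wrap("<p>Hallo</p>")` gives a complete page that a browser or a PDF generator can open directly.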
I tie all of this together in my `Makefile`. Here's how it might work.
    SHELL=/bin/bash
    KAPITEL=$(sort $(wildcard [0-9A-F]-*.md))

    all: Horte.pdf

    %.pdf: %.html %.css
    	weasyprint $< $@

    %.html: %.html.tmp prefix suffix append-index
    	cat prefix $< suffix | perl append-index > $@

    %.html.tmp: %.md
    	python3 -m markdown \
    		--extension=markdown.extensions.attr_list \
    		--extension=markdown.extensions.tables \
    		--extension markdown.extensions.smarty \
    		$< \
    	| keine-ligaturen > $@

    Horte.md: Titelblatt.md Lizenz.md $(KAPITEL)
    	date '+<p class="timestamp">%F</p>' > timestamp
    	cat Titelblatt.md timestamp $(KAPITEL) Lizenz.md > $@

    clean:
    	rm -f Horte.md Horte.html Horte.html.tmp \
    		Horte.pdf
So this is what happens:
Now we're finally getting to the script I wanted to talk about this entire time. What's the role of `append-index`? It parses the HTML file and determines which terms should be in the index using XPath. And then it adds an index at the end.
There are two parts to the script. In the first part, the terms to index are collected.
In the second part, the HTML for the Index page is assembled. Every term is printed, followed by a link for every `id` recorded above. The text of the link is a ZERO WIDTH SPACE so that it doesn't look weird in the PDF. The actual page number to use is still unknown at this point because only the PDF generator knows about the pages! See below for more.
    use Modern::Perl '2018';
    use XML::LibXML;

    # Slurp the whole HTML document from STDIN.
    undef $/;
    my $doc = XML::LibXML->load_html(string => <STDIN>);

    # Part 1: collect the terms to index and make sure each node has an id.
    my @nodes = $doc->findnodes('//h3 | //blockquote/p/strong[@class="ref"]');
    my %terms;
    my %n;
    for my $node (@nodes) {
      my $content = $node->getAttribute('ref') || $node->textContent;
      next unless length($content) > 1;
      my $id = $node->getAttribute('id');
      if (not $id) {
        # Derive an id from the term: lowercase ASCII letters only,
        # with a number appended from the second occurrence onwards.
        $id = lc($content);
        $id =~ tr/A-Za-z//cd;
        $n{$id}++;
        $id .= $n{$id} if $n{$id} > 1;
        $node->setAttribute('id', $id);
      }
      $terms{$content} //= [];
      push(@{$terms{$content}}, $id);
      $node->setAttribute('class', 'indexed');
    }

    # Part 2: append the index to the end of the body.
    my @body = $doc->findnodes('//body');
    $body[0]->appendTextNode("\n");
    my $div = XML::LibXML::Element->new('div');
    $div->setAttribute('id', 'index');
    $body[0]->appendChild($div);
    $div->appendTextNode("\n");
    my $h2 = XML::LibXML::Element->new('h2');
    $h2->appendTextNode('Index');
    $div->appendChild($h2);
    $div->appendTextNode("\n");
    for my $term (sort keys %terms) {
      my $p = XML::LibXML::Element->new('p');
      $div->appendChild($p);
      $p->appendTextNode($term);
      for my $id (@{$terms{$term}}) {
        my $an = XML::LibXML::Element->new('a');
        $an->setAttribute('class', 'ref');
        $an->setAttribute('href', '#' . $id);
        $an->appendTextNode("\x{200B}"); # zero-width space to prevent minimizing
        $p->appendChild($an);
      }
      $div->appendTextNode("\n");
    }
    print $doc;
Here is some example Markdown illustrating the functionality:
    Elfen verwenden gerne grosse [Raubkatzen](#katze) zum Schutz ihrer Lager.

    > **Pumas**{: .ref ref="Puma"} (1W6-3) TW 3 RK 14 1W6 RW +1 BW 24 ML 7
    > EP 300

    ### Raubkatzen {: #katze}

    > **Puma**{: .ref} TW 3 RK 14 1W6 RW +1 BW 24 ML 7 EP 300
The XPath expression finds the following terms:
HTML generated for the Index:
    <div id="index">
    <h2>Index</h2>
    …
    <p>Puma<a class="ref" href="#puma">&#x200B;</a><a class="ref" href="#puma2">&#x200B;</a></p>
    <p>Raubkatzen<a class="ref" href="#katze">&#x200B;</a></p>
    …
    </div>
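If Perl isn't your thing, here's roughly how the ids `puma`, `puma2` and `katze` come about, sketched in Python. This is a simplified model of the script's logic, not a port. There is no HTML parsing, the nodes are plain dictionaries I made up to stand in for the elements the XPath expression finds, and `collect_terms` and `index_html` are hypothetical names. The zero-width space is written as `&#x200B;` so that you can see it.

```python
import re
from collections import defaultdict
from html import escape


def collect_terms(nodes):
    """Map each index term to the ids of the nodes it appears on.

    `nodes` is a list of dicts in document order with a 'text' key and
    optional 'ref' and 'id' keys. Missing ids are generated and stored.
    """
    terms = defaultdict(list)
    seen = defaultdict(int)
    for node in nodes:
        content = node.get("ref") or node["text"]
        if len(content) <= 1:
            continue
        if not node.get("id"):
            # Lowercase ASCII letters only; number the repeats.
            base = re.sub(r"[^A-Za-z]", "", content.lower())
            seen[base] += 1
            node["id"] = base if seen[base] == 1 else f"{base}{seen[base]}"
        terms[content].append(node["id"])
    return dict(terms)


def index_html(terms):
    """Render one paragraph per term, with one empty-looking link per id."""
    lines = ['<div id="index">', "<h2>Index</h2>"]
    for term in sorted(terms):
        links = "".join(
            f'<a class="ref" href="#{ref_id}">&#x200B;</a>' for ref_id in terms[term]
        )
        lines.append(f"<p>{escape(term)}{links}</p>")
    lines.append("</div>")
    return "\n".join(lines)


# The three nodes from the example Markdown, in document order.
nodes = [
    {"text": "Pumas", "ref": "Puma"},       # **Pumas**{: .ref ref="Puma"}
    {"text": "Raubkatzen", "id": "katze"},  # ### Raubkatzen {: #katze}
    {"text": "Puma"},                       # **Puma**{: .ref}
]
terms = collect_terms(nodes)
print(index_html(terms))
```

The `ref` attribute is what lets "Pumas" and "Puma" end up under the same index entry, and the explicit `#katze` id on the heading is kept instead of a generated one.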
The magic for page numbers is in the CSS, namely in the last rule where it says that the content of index links is a space and `target-counter(attr(href), page)`.
    /* index */
    #index {
      columns: 3;
      column-gap: 2ex;
      font-size: 11pt;
      line-height: 13pt;
      text-align: left;
    }
    #index h2 {
      column-span: all;
    }
    #index p {
      margin: 0;
      padding-left: 1em;
      text-indent: -1em;
    }
    #index a {
      color: inherit;
      text-decoration: none;
    }
    #index a::after {
      content: ' ' target-counter(attr(href), page);
    }
See for yourself:
On a wiki, for example, that means Markdown is understood by many people. Org Mode is limited to Emacs users, most of the time.