2024-11-01 Gzip them all!

I keep an archive of the One Page Dungeon Contest. It's pretty big.

+------+------+
| Size | Year |
+------+------+
| 48M  | 2009 |
| 193M | 2010 |
| 121M | 2011 |
| 199M | 2012 |
| 149M | 2013 |
| 278M | 2014 |
| 364M | 2015 |
| 233M | 2016 |
| 250M | 2017 |
| 493M | 2018 |
| 345M | 2019 |
| 472M | 2020 |
| 217M | 2021 |
| 353M | 2022 |
| 492M | 2023 |
| 486M | 2024 |
+------+------+

So I decided to gzip ("pre-compress") the PDF files. I found the answer I was looking for in Serving pre-compressed files using Apache by François Marier. Sometimes searching for stuff is hard just because you don't know what it's called. 😅

AddEncoding gzip gz
Options +Multiviews
SetEnv force-no-vary
Header set Cache-Control "private"
<FilesMatch "\.pdf\.gz$">
ForceType application/pdf
</FilesMatch>

OK, time to gzip them all!

for d in 2*; cd /home/alex/campaignwiki.org/1pdc/$d; echo $d; gzip *.pdf; end

Aaaaand … the gains are abysmal! 😓

+------+------+
| Size | Year |
+------+------+
| 46M  | 2009 |
| 173M | 2010 |
| 110M | 2011 |
| 190M | 2012 |
| 126M | 2013 |
| 261M | 2014 |
| 351M | 2015 |
| 226M | 2016 |
| 225M | 2017 |
| 471M | 2018 |
| 325M | 2019 |
| 448M | 2020 |
| 206M | 2021 |
| 339M | 2022 |
| 472M | 2023 |
| 471M | 2024 |
+------+------+

The PDFs really are that big! 🤨
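A likely reason for the small gains: the images and content streams inside a PDF are usually stored compressed already, and gzip can do very little with data that has been compressed once. Here's a minimal Python sketch of that effect. It uses synthetic data as a stand-in, not actual contest PDFs, and the word list and the `gzip_ratio` helper are made up for the illustration.

```python
import gzip
import random
import zlib

# Synthetic, text-like data: random words from a tiny vocabulary.
rng = random.Random(1)
words = ["dungeon", "map", "trap", "goblin", "treasure", "door", "stairs", "key"]
plain = " ".join(rng.choice(words) for _ in range(50_000)).encode()

# Stand-in for what sits inside most PDFs: data that was already
# compressed once (zlib uses the same DEFLATE algorithm as gzip).
already_compressed = zlib.compress(plain)


def gzip_ratio(data: bytes) -> float:
    """Return gzipped size divided by original size (lower is better)."""
    return len(gzip.compress(data)) / len(data)


print(f"plain text:         {gzip_ratio(plain):.2f}")
print(f"already compressed: {gzip_ratio(already_compressed):.2f}")
```

The first ratio should come out well below 1, the second very close to 1: a second round of compression has nothing left to squeeze.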

Somebody should put a size limit on submissions!

The whole collection is still 4.4G. 😞
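To put a number on "abysmal", the two tables can be added up. This is just arithmetic on the rounded per-year sizes listed above, so the totals are approximate.

```python
# Sizes in M as listed in the two tables above, 2009 through 2024.
before = [48, 193, 121, 199, 149, 278, 364, 233,
          250, 493, 345, 472, 217, 353, 492, 486]
after = [46, 173, 110, 190, 126, 261, 351, 226,
         225, 471, 325, 448, 206, 339, 472, 471]

saved = sum(before) - sum(after)
print(f"before: {sum(before)}M, after: {sum(after)}M")
print(f"saved:  {saved}M ({saved / sum(before):.1%})")
```

That works out to 4693M before and 4440M after: about 253M saved, or 5.4%.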

​#RPG ​#1PDC ​#Administration

I started reading Optimizing PDFs on the Ghostscript blog and my head started smoking.

I ended up writing the following:

markdown-links

zip-original

shrink-pdfs

pdf-shrink

zip-dir

upload-dir

To this, @mxp@mastodon.acm.org replied:

My invocation is less elaborate (w/o the threshold, filter settings, etc.), but similar in that I also downsample images to 150 dpi. In addition, I have `-dSubsetFonts=true -dCompressFonts=true`, but since I use this for my own LaTeX-generated documents, I guess I could drop this.

I didn't look into fonts because I don't mind people using weird fonts; for the moment images are a bigger problem than fonts.
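For readers who want to try something along the lines of the reply, here's a rough Python sketch that builds a Ghostscript command line with 150 dpi downsampling and the two font options quoted above. This is neither my `pdf-shrink` script nor the exact invocation from the reply; the function names and the selection of the remaining options are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def gs_shrink_command(src: Path, dst: Path, dpi: int = 150) -> list[str]:
    """Build a Ghostscript command line that downsamples images to `dpi`."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        # Downsample colour and grayscale images to the target resolution.
        "-dDownsampleColorImages=true",
        f"-dColorImageResolution={dpi}",
        "-dDownsampleGrayImages=true",
        f"-dGrayImageResolution={dpi}",
        # The font options quoted in the reply.
        "-dSubsetFonts=true",
        "-dCompressFonts=true",
        f"-sOutputFile={dst}",
        str(src),
    ]


def shrink(src: Path, dst: Path, dpi: int = 150) -> None:
    """Run Ghostscript; requires the `gs` binary on PATH."""
    subprocess.run(gs_shrink_command(src, dst, dpi), check=True)


# Only print the command here; call shrink() to actually run it.
print(" ".join(gs_shrink_command(Path("in.pdf"), Path("out.pdf"))))
```

Keeping the command builder separate from the call to `subprocess.run` makes it easy to inspect the options before letting Ghostscript loose on a whole directory.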

Then I went through my local directories and called `pdf-shrink` on them all, regenerated the zip file containing the year's entries and gzipped the individual files.
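The last two steps (regenerate the zip file, then gzip the individual files) could be sketched in Python as follows. This is not the `zip-dir` script listed above; the function name, the default archive name `entries.zip` and the choice of `ZIP_DEFLATED` are assumptions for the sake of the example.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def zip_and_gzip(year_dir: Path, zip_name: str = "entries.zip") -> Path:
    """Zip all PDFs in year_dir, then replace each PDF with a .pdf.gz file."""
    pdfs = sorted(year_dir.glob("*.pdf"))
    zip_path = year_dir / zip_name

    # Bundle the (already shrunk) PDFs into one archive for the year.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for pdf in pdfs:
            archive.write(pdf, arcname=pdf.name)

    # Pre-compress each PDF for the web server.
    for pdf in pdfs:
        gz_path = pdf.with_name(pdf.name + ".gz")
        with pdf.open("rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        pdf.unlink()  # like the gzip command, keep only the .gz file

    return zip_path
```

The zip file has to be written first, since the second loop removes the plain `.pdf` files just like `gzip *.pdf` does.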

As I was going through the files for 2024, I noticed that some filenames betray different names (from email senders, I presume), leaking privacy-related information. I wanted to make sure that the filenames reflected the authors of the works, and that made me realize two things:

1. not every entry has the license URL clearly visible

2. not every entry has the author names clearly visible

Then again, anonymous works are OK, but it would have saved me some time if it said "anonymous" somewhere. 😏

In any case, if you publish PDF files somewhere, consider doing what I'm planning to do from here on out:

1. add copyright information and the license (if any), i.e. a date or at least a year and the names of all the people that have rights to the work (text, art, layout, editing, maps, and so on)

2. add the work's title and the names of all the people to the PDF's metadata ("properties")

3. put the work's title, date or version and the name of the main author (your name?) into the filename
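The third point is easy to automate. Here's a small Python sketch of one possible naming scheme; the helper names, the order of the parts and the lowercase-with-hyphens style are my assumptions, not a rule of the contest.

```python
import re
import unicodedata


def slug(text: str) -> str:
    """Reduce text to lowercase ASCII letters and digits joined by hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def pdf_filename(title: str, date_or_version: str, author: str) -> str:
    """Combine title, date or version, and main author into one filename."""
    parts = (title, date_or_version, author)
    return "-".join(slug(part) for part in parts) + ".pdf"


# Hypothetical entry, for illustration only.
print(pdf_filename("The Lost Mine", "2024", "Zoë Example"))
```

A filename built this way comes from the work itself and not from whatever the sender's email client made of it.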