💾 Archived View for idiomdrottning.org › comic-snarfer captured on 2024-09-29 at 02:13:03. Gemini links have been rewritten to link to archived content

Comic Snarfer

comic-snarfer --start-page=[URL] --image-path=[XPATH] --next-path=[XPATH] --for-real

So I’m trying to release more of the stuff I write even if it’s somewhat, uh, “bespoke” stuff. (“Bespoke” is the best backhanded compliment!)

This is a snarfer that trawls through a series of web pages. It saves images from them and then finds the link to the next page and recurses from there. It rips web comics, pretty much. It could also snarf other media (including just normal HTML pages) because it uses xpath to decide what to save, not file extensions.

It assumes you’re making an implicit “dry run” until you supply the argument --for-real. My advice is to hold off on that until the output for the first page looks right to you.

It’ll download directly to your current working directory, so make sure you are in a good clean empty place that you can fill with images.

I usually snarf to a directory, back it up, clean the names up with Perl’s rename script, do zip ../some-name.cbz *, and remove the image directory and its backup. The backup step is only there because I have messed up the Perl expression too many times…
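For anyone who would rather script the zipping step than type it, here is a minimal Python sketch of what zip ../some-name.cbz * roughly does. The function name make_cbz and its parameters are made up for this illustration, and it is not part of the snarfer. It assumes the archive is written outside the image directory, like the ../ in the command above.

```python
import zipfile
from pathlib import Path


def make_cbz(image_dir: Path, cbz_path: Path) -> list[str]:
    """Pack every regular file in image_dir into cbz_path, sorted by name.

    A .cbz is just a zip archive; readers show the pages in name order,
    which is why the numeric prefix on the file names matters.
    Assumption: cbz_path lives outside image_dir.
    """
    names = []
    # Images are already compressed, so store them without recompressing.
    with zipfile.ZipFile(cbz_path, "w", zipfile.ZIP_STORED) as archive:
        for file in sorted(image_dir.iterdir()):
            if file.is_file():
                archive.write(file, arcname=file.name)
                names.append(file.name)
    return names
```

Called as make_cbz(Path("images"), Path("some-name.cbz")), it returns the list of file names it packed, which is handy for eyeballing the page order before deleting anything.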

mcomix is the reader I like. With it, the renaming and zipping are optional, since it can handle directories.

There are three required options.

--start-page=URL : Just a plain URL for whatever page you want to start at.

--image-path=XPATH : An xpath pointing to the main snarfable content of the page. If this matches multiple things (multiple images for example), they will all be saved.

--next-path=XPATH : An xpath pointing to the next page. If this matches multiple things, the snarfer follows the first one. If it doesn’t match anything, the snarfer terminates.
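To make the matching rules above concrete, here is a rough Python sketch of the same walk. It is an illustration only, not the snarfer’s actual code, and Python is not necessarily what the snarfer is written in. Assumptions: fetch is a hypothetical callable that returns well-formed XHTML for a URL, the paths use the small xpath subset that Python’s xml.etree.ElementTree understands (so they start with .// and select elements, and the sketch reads src and href from them), and the seen set is an extra safety net of this sketch against a last page that links to itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Tiny fake two-page site so the sketch can run without a network.
SITE = {
    "https://example.org/1": '<html><body><img id="comic" src="/img/a.png"/>'
                             '<a rel="next" href="/2">next</a></body></html>',
    "https://example.org/2": '<html><body><img id="comic" src="/img/b.png"/>'
                             '</body></html>',
}


def snarf(start_page, image_path, next_path, fetch, for_real=False):
    """Return the image URLs that were (or, in a dry run, would be) saved."""
    saved = []
    seen = set()  # sketch-only guard against link loops
    url = start_page
    while url is not None and url not in seen:
        seen.add(url)
        page = ET.fromstring(fetch(url))
        # Every match of image_path is kept, not just the first one.
        for img in page.findall(image_path):
            src = img.get("src")
            if src:
                saved.append(urljoin(url, src))
        if not for_real:
            break  # dry run: first page only, no links followed
        # Only the first match of next_path is followed; no match ends the walk.
        nxt = page.find(next_path)
        href = nxt.get("href") if nxt is not None else None
        url = urljoin(url, href) if href else None
    return saved


if __name__ == "__main__":
    print(snarf("https://example.org/1", ".//img[@id='comic']",
                ".//a[@rel='next']", SITE.__getitem__, for_real=True))
```

Without for_real=True the call only reports the image from the first page; with it, the walk continues to the second page and stops there because nothing matches the next path.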

There are two non-required options.

--start-issue=NUMBER : The files are renamed to include their domain and their paths, because some webcomic sites just call it “comic.jpg” and depend on the paths to disambiguate. The snarfer also prefixes them with a number; this number is internal to the snarfer and just increments for everything it saves. It starts at the --start-issue number, but defaults to zero, which is what you want most of the time. The point of this option is in case the snarfer crashed or was terminated and you want to resume with the same numbering.

--for-real : The snarfer assumes a dry-run unless you supply this. I.e. without this flag, it only shows you what it would have saved from the first page, and it doesn’t follow any links.
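The renaming and numbering described under --start-issue can be pictured with a small Python sketch. The exact file name layout the snarfer produces is not spelled out here, so the zero padding, the dash separators and the function names numbered_name and name_all are all assumptions made for illustration.

```python
from itertools import count
from urllib.parse import urlsplit


def numbered_name(number, image_url, width=4):
    """Build a name from a running number, the domain and the URL path.

    Assumed layout: NNNN-domain-path-with-dashes. Folding the path into the
    name keeps two different "comic.jpg" files from colliding.
    """
    parts = urlsplit(image_url)
    path = parts.path.strip("/").replace("/", "-")
    return f"{number:0{width}d}-{parts.netloc}-{path}"


def name_all(image_urls, start_issue=0):
    """Number the URLs in order, starting from start_issue (default zero)."""
    counter = count(start_issue)
    return [numbered_name(next(counter), url) for url in image_urls]


if __name__ == "__main__":
    # Resuming after an interrupted run that had already saved seven files.
    for name in name_all(["https://a.example/x/comic.jpg",
                          "https://a.example/y/comic.jpg"], start_issue=7):
        print(name)
```

Under these assumptions, resuming with start_issue=7 gives names beginning with 0007 and 0008, so the new files sort after the ones saved before the interruption.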

Source code

git clone https://idiomdrottning.org/comic-snarfer