yt-dlp is a video downloader that supports far more sites than just YouTube.
scoop update
scoop install yt-dlp
Update at any time with
scoop update *
Now type:
yt-dlp <link to your video or profile>
yt-dlp will detect existing files and skip them.
Filenames include the video ID by default, so already-downloaded videos are recognised reliably.
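If you want more control, here is a sketch of a typical invocation (the --download-archive flag and the output template are standard yt-dlp options; the archive filename and folder layout are just placeholders):
yt-dlp --download-archive archive.txt -o "%(uploader)s/%(title)s [%(id)s].%(ext)s" <link to your video or profile>
--download-archive records every finished video's ID in archive.txt, so interrupted runs and re-runs skip straight past what you already have.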
gallery-dl is the image equivalent of youtube-dl
scoop update
scoop install gallery-dl
Update at any time with
scoop update *
Cmder recommended (better interface for cmd/powershell)
https://scoop.sh/#/apps?q=cmder&s=0&d=1&o=true
Create a config.json file in your appdata folder
C:\Users\yourname\AppData\Roaming\gallery-dl\config.json
MUST BE config.json ! DO NOT CONFUSE IT WITH gallery-dl.conf !
Barebones example config (for experienced JSON users)
Full completed config reference (Do not use)
On Windows, replace the backslashes (\) in directory paths with \\ or /.
Example:
"base-directory": "E:/Home/Pictures/gallery-dl/", "cookies": "E:/Home/Pictures/gallery-dl/cookies.txt", "archive": "C:/Users/USERNAME/AppData/Roaming/gallery-dl/{category}.sqlite3",
Get a major speed boost and future-proof your collection by recording every saved file in a database in your config folder; gallery-dl then skips anything already in the archive, even if the files themselves are later moved or renamed.
https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#extractor-archive
Example:
"archive": "C:/Users/USERNAME/AppData/Roaming/gallery-dl/{category}.sqlite3",
This will create a file for each site, such as pixiv.sqlite3, twitter.sqlite3, etc.
It's not necessary, but the .sqlite3 files can be opened with sqlitebrowser (available on scoop).
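If you're curious what gets stored, you can also peek at an archive with the sqlite3 command-line tool (a sketch, assuming the table and column names gallery-dl uses, "archive" and "entry"; the filename is a placeholder):
sqlite3 twitter.sqlite3 "SELECT entry FROM archive LIMIT 5;"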
Make sure you can see the source code of, and trust, any Chrome extensions you use (e.g. for exporting cookies.txt), or disable them immediately after use.
Use cookies, or put your username/password in the config file.
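As a sketch, credentials go under the site's name inside the "extractor" object (the site and values here are placeholders):
"extractor": {
    "twitter": {
        "username": "your-username",
        "password": "your-password"
    }
}
For pixiv and DeviantArt, use OAuth instead: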
gallery-dl oauth:pixiv
Now follow the instructions in the terminal
gallery-dl oauth:deviantart
Now when you use the command:
gallery-dl.exe <URL of user profile, gallery, folder>
it will download into your base-directory/<website>/<username>/, no matter which directory your terminal is in.
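For example, grabbing a Twitter profile might produce a layout like this (a sketch; the exact folder and file naming depends on the site and your config):
E:/Home/Pictures/gallery-dl/
    twitter/
        username/
            1234567890123456789_1.jpg
            1234567890123456789_2.jpg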
Create a .sh file (for example gallery.sh)
Edit it as if it were a text file and start the file with "gallery-dl" followed by the URLs you would like to fetch. Separate them with spaces. Example:
gallery-dl.exe https://twitter.com/username https://www.deviantart.com/username https://www.pixiv.net/en/users/221515
Now in the terminal you can type
bash.exe E:\Home\Pictures\gallery-dl\gallery.sh (substitute your own path)
and it will download every URL in the file, skipping existing files if they are in the database file. Some text editors (e.g. Notepad++) highlight duplicate entries when you select text, which is useful for checking whether you've already added someone's profile to the script.
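Since you have bash anyway, you can also list duplicate URLs directly (a sketch; the tr step splits the space-separated command line into one URL per line first):
tr ' ' '\n' < gallery.sh | sort | uniq -d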
You can make a shortcut to run your .sh file with Cmder's alias command:
alias dlg=bash.exe E:\Home\Pictures\gallery-dl\gallery.sh
Now, just type "dlg" in the terminal to start mass downloading.
Tip: in File Explorer, you can search for "." in the base directory to view all images across subfolders at once, while keeping everything tidy in folders.
Files from twitter are named with their tweet IDs.
The browser's built-in save-as stores a page as an .html file plus a folder of assets. You can use an addon or wget to save everything to a single .html file instead.
Chromium will not support the SingleFile extension in the future due to internal addon changes, so stick with Firefox.
SingleFile by default embeds a timestamp into the saved page. It can be turned off in its settings.
wget example.com
If the URL doesn't end in a filename, wget may save the page without a filetype. You can open it in your browser or add .html to the name.
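You can also just name the output file yourself with wget's -O flag:
wget -O page.html example.com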
scoop update
scoop install wget
Update at any time with
scoop update *
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
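A breakdown of what those flags do:
--random-wait : randomize the delay between requests (most useful together with --wait)
-r : recursive download, follow links
-p : also fetch page requisites (images, CSS) needed to render each page
-e robots=off : ignore robots.txt (use responsibly)
-U mozilla : send a browser-like User-Agent string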