yt-dlp is a video downloader which supports more than just YouTube
scoop update
scoop install yt-dlp
Update at any time with
scoop update *
Now type:
yt-dlp <link to your video or profile>
yt-dlp will detect existing files and skip re-downloading them.
Filenames include the video ID (the last part of the URL) by default
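If you want more control, both of these are standard yt-dlp options (the archive filename here is just a placeholder): -o sets the output template, and --download-archive keeps a text file of downloaded IDs so they are skipped on later runs.
yt-dlp -o "%(title)s [%(id)s].%(ext)s" <link to your video or profile>
yt-dlp --download-archive archive.txt <link to your video or profile>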
gallery-dl is the image equivalent of youtube-dl
scoop update
scoop install gallery-dl
Update at any time with
scoop update *
Cmder recommended (better interface for cmd/powershell)
https://scoop.sh/#/apps?q=cmder&s=0&d=1&o=true
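If you already use Scoop, Cmder can probably be installed the same way (it may live in the extras bucket):
scoop bucket add extras
scoop install cmder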
Create a config.json file in your appdata folder
C:\Users\yourname\AppData\Roaming\gallery-dl\config.json
It MUST BE config.json! DO NOT CONFUSE IT WITH gallery-dl.conf!
Barebones example config (for experienced JSON users)
Full completed config reference (for reference only, do not copy it as-is)
On Windows, replace the \ in directory paths with \\ or /
Example:
"base-directory": "E:/Home/Pictures/gallery-dl/",
"cookies": "E:/Home/Pictures/gallery-dl/cookies.txt",
"archive": "C:/Users/USERNAME/AppData/Roaming/gallery-dl/{category}.sqlite3",
Get a major speed boost and future-proof your collection by having gallery-dl record every downloaded file in an archive database in your config folder.
https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#extractor-archive
Example:
"archive": "C:/Users/USERNAME/AppData/Roaming/gallery-dl/{category}.sqlite3",
This will create a file for each site, such as pixiv.sqlite3, twitter.sqlite3, etc.
It's not necessary, but sqlite files can be opened with sqlitebrowser (available on scoop)
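You can also peek at them from the command line, assuming the sqlite3 tool is installed and that gallery-dl keeps its entries in a table named archive (true for current versions as far as I can tell). Run it from the folder where the .sqlite3 files live:
sqlite3 twitter.sqlite3 "SELECT COUNT(*) FROM archive;"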
If you use Chrome extensions to export your cookies, make sure you can see their source code and trust them, or disable them immediately after use.
Use cookies or the username/password in the config file
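For sites that accept a plain login, the credentials go under that site's name in the "extractor" object (which sites accept this varies, so check the gallery-dl docs for the site you need; twitter here is only an example):
{
    "extractor": {
        "twitter": {
            "username": "your_username",
            "password": "your_password"
        }
    }
}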
gallery-dl oauth:pixiv
Now follow the instructions in the terminal
gallery-dl oauth:deviantart
Now when you use the command:
gallery-dl.exe <URL of user profile, gallery, folder>
it will download into your base-directory/<website>/<username>/, no matter which directory your terminal is in.
Create a .sh file (for example gallery.sh)
Edit it as if it were a text file and start the file with "gallery-dl" followed by the URLs you would like to fetch. Separate them with spaces. Example:
gallery-dl.exe https://twitter.com/username https://www.deviantart.com/username https://www.pixiv.net/en/users/221515
Now in the terminal you can type
bash.exe E:\Home\Pictures\gallery-dl\gallery.sh (your own location)
and it will download every URL in the file, skipping existing files if they are in the database file. Some text editors (e.g. Notepad++) can highlight duplicate entries when you select text; this is useful for checking whether you've already added someone's profile to the script.
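If you'd rather skip the shell script, gallery-dl can also read URLs straight from a plain text file, one per line, with its -i/--input-file option:
gallery-dl -i E:\Home\Pictures\gallery-dl\urls.txt (your own location)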
You can make a shortcut to download your .sh file with
alias dlg="bash.exe E:\Home\Pictures\gallery-dl\gallery.sh"
Now, just type "dlg" in the terminal to start mass downloading.
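To make the alias survive new terminal sessions (assuming the bash that ships with Cmder/Git for Windows, which reads ~/.bashrc), append it there:
echo 'alias dlg="bash.exe E:\Home\Pictures\gallery-dl\gallery.sh"' >> ~/.bashrc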
Tip: you can search for "." in all subfolders to view all images at once, while keeping everything tidy in folders.
Files from Twitter are named after their tweet IDs
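If you want a different scheme, each extractor's filename format can be overridden in config.json. The keywords below ({tweet_id}, {num}, {extension}) are what the twitter extractor exposes as far as I know; run gallery-dl -K <URL> to list the real ones before relying on them:
{
    "extractor": {
        "twitter": {
            "filename": "{tweet_id}_{num}.{extension}"
        }
    }
}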
The browser's built-in "Save as" will save an extra folder alongside the page. You can use an add-on or wget to save everything to a single .html file.
Chromium-based browsers will not support this extension (SingleFile) in the future due to internal add-on changes, so stick with Firefox.
SingleFile by default embeds a timestamp onto the saved page. It can be turned off in its settings.
wget example.com
Depending on the URL, wget may not give the downloaded file an .html extension. You can open it in your browser or add .html to the name yourself.
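You can also pick the output filename up front with wget's -O option:
wget -O example.html example.com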
scoop update
scoop install wget
Update at any time with
scoop update *
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
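For reference, these are all standard wget options: -r recurses into linked pages, -p also downloads page requisites like images and CSS, -e robots=off ignores robots.txt, -U mozilla sends a browser-like User-Agent, and --random-wait varies the delay between requests.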