Archived View for 7irb.tk > misc > 211025.gmi captured on 2021-12-17 at 13:26:06. Gemini links have been rewritten to link to archived content
There is a great place on the Internet called the Internet Archive (archive.org), which I will call "the Web Archive" here. Many people know it as a place to view old web pages, but it also hosts many other kinds of digital collections, such as digitized books.
It's all very wonderful, especially for people like me who don't have much money to buy books. When I have a book in mind that I may want to read, one of the first things I do is try my luck on the Web Archive. Chances are that the book will be there.
However, recently I noticed that many books available on the Web Archive can only be "borrowed" for 1 hour, which is totally ridiculous.
Simply put, the concept of "borrowing" is not compatible with this digital world. It was invented by the Web Archive for the sole purpose of imposing restrictions on its users. In the age of physical books, the concept of borrowing goes without saying. I borrow a book from a library. As a result, the library no longer has that book, and others cannot borrow it until I return it. But this is no longer the case with digitized books, as those can be distributed to any number of people simultaneously at virtually no cost. The thing is not "borrowed", it is simply copied. It is already absurd to call this borrowing, and even more laughable to set a time limit of 1 hour. What's the point? Users can renew the loan without limit, so even the copyright argument doesn't stand.
Although I don't understand the logic behind all of this, I know the trend all too well. The open and shared Internet is no longer. With the rise of tech giants, the Internet is becoming more closed by the day, and this is only one of its innumerable symptoms. What can we do?
I had to come up with a way to save the book within that 1-hour period so that I can import it to my tablet and read it whenever I want. Here I outline my attempt using Python and AppleScript (since I am, somewhat ironically, on macOS). The basic idea is to use `selenium`, a web browser automation package for Python, to control a Firefox browser and make it download the book page by page for us, with some additional help from AppleScript.
In the following I will explain my code in 6 steps. AppleScript only comes in at step 6 to "click the download button", which might be implemented differently on other systems. Steps 1 to 5 are system independent. The disadvantage of this method is that you cannot use your computer for other things while the download is in progress, as the browser must stay in the foreground the whole time.
=== FIRST STEP: setting up the browser ===
Use `webdriver_manager` to install the web driver if it is missing. A web driver can be thought of as an interface through which we control the web browser using code. The browser we are controlling is Firefox, which needs to be pre-installed as a regular app on the system.
from webdriver_manager.firefox import GeckoDriverManager
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
myProxy = 'xxx.xxx.xxx.xxx:xxxx'
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': myProxy,
    'ftpProxy': myProxy,
    'sslProxy': myProxy,
    'noProxy': ''
})
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), proxy=proxy)
Note that I am using a proxy. If you want a direct connection, delete the proxy-related lines and the `proxy=proxy` argument.
=== SECOND STEP: login ===
import time
url1 = 'https://archive.org/account/login'
username = 'username@mail.com'
password = 'passwordInPlainText'
driver.get(url1)
inputElement = driver.find_element_by_name('username')
inputElement.send_keys(username)
inputElement = driver.find_element_by_name('password')
inputElement.send_keys(password)
submitButton = driver.find_element_by_name('submit-to-login')
submitButton.click()
time.sleep(5)
The code is mostly self-explanatory. It does require the password in plain text but hey, it's just a Web Archive account, what is there to lose? (Always use different passwords for different accounts.) The lines after `driver.get(url1)` fill in the login credentials automatically and click the login button for you. The script then waits 5 seconds for the login to complete.
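If keeping the password in the script itself feels uncomfortable, one alternative is to read it from the environment. This is only a minimal sketch: the variable names `IA_USERNAME` and `IA_PASSWORD` and the helper `load_credentials` are made up for this example and are not anything archive.org or selenium defines.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from environment variables.

    IA_USERNAME and IA_PASSWORD are hypothetical names chosen for this
    sketch; export them in the shell before starting the script.
    """
    try:
        return env["IA_USERNAME"], env["IA_PASSWORD"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"environment variable {missing} is not set") from None
```

With this in place, `username, password = load_credentials()` would stand in for the two hard-coded assignments above.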
=== THIRD STEP: find the damn book ===
import selenium.webdriver.support.ui as ui
book_url = 'https://archive.org/path/to/the/book'
driver.get(book_url)
wait = ui.WebDriverWait(driver, 10)
borrowButton = wait.until(lambda driver: driver.find_element_by_class_name('lending-primary-action'))
borrowButton.click()
We go to the book's URL and wait for the page to load. The `wait = ui.WebDriverWait(...)` object allows a waiting period of up to 10 seconds. The `wait.until` call evaluates the lambda expression repeatedly and returns its result as soon as the element is found; if the element is not found within the waiting period, it raises a `TimeoutException`. In the lambda expression we look for the big blue "borrow for 1 hour" button, which we then click.
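To make the waiting behaviour concrete, here is a rough stand-in for what `wait.until` does, written without selenium. It is a simplified sketch and not selenium's implementation: the name `wait_until` is my own, it swallows every exception (as far as I know, the real `WebDriverWait` only ignores `NoSuchElementException` by default), and it raises Python's built-in `TimeoutError` rather than selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Exceptions raised by condition() count as "not yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the page yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

The point of polling instead of a fixed `time.sleep` is that the script continues as soon as the button appears, and fails loudly if it never does.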
=== FOURTH STEP: prepare for download ===
wait = ui.WebDriverWait(driver, 10)
onepageButton = wait.until(lambda driver: driver.find_element_by_class_name('onepg'))
onepageButton.click()
time.sleep(3)
wait = ui.WebDriverWait(driver, 10)
magnifyButton = wait.until(lambda driver: driver.find_element_by_class_name('zoom_in'))
for ii in range(5):
    magnifyButton.click()
To prepare for download, we set the display mode to "one page" and then magnify the page until it is large enough so that we can get high quality images.
=== FIFTH STEP: download the page images in jpg format ===
The following code downloads only one page. I have omitted the outer loop that downloads the whole book; it is not hard to implement.
pageImageURLs = []  # URLs already saved; create this once, before the outer loop
pageImages = driver.find_elements_by_class_name('BRpageimage')
for pageImage in pageImages:
    pageImageURL = pageImage.get_attribute('src')
    if pageImageURL in pageImageURLs:
        continue
    else:
        pageImageURLs.append(pageImageURL)
        # open the image in a new tab
        driver.execute_script('window.open("' + pageImageURL + '")')
        wait = ui.WebDriverWait(driver, 10)
        driver.switch_to.window(driver.window_handles[1])
        time.sleep(0.5)
        wait.until(lambda driver: driver.find_element_by_tag_name('img'))
        ## DOWNLOAD THE IMAGE ##
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        break
# flip to the next page
nextPageButton = driver.find_element_by_class_name('book_flip_next')
nextPageButton.click()
Note that even if there is only one page in display, there can be multiple objects of the class 'BRpageimage' in the DOM. Among these, we want to get the first one that we have not yet downloaded. So we keep a list of downloaded URLs in the `pageImageURLs` list. We go to the next page by clicking the `nextPageButton` every time after we have downloaded one image.
The whole thing inside the `else` branch does the following: (1) opens a new tab and loads the image by executing a line of JavaScript, (2) switches to that tab and looks for the image element, (3) downloads the image with the help of AppleScript, which I will explain later, (4) closes the current tab, and (5) switches back to the previous tab.
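For completeness, here is one way the omitted outer loop might look. It is a sketch, not the code I actually ran: `visible_urls`, `save_image` and `next_page` are hypothetical callables that would wrap the selenium calls above, and stopping as soon as no unseen URL shows up is an assumption (a page that loads slowly would end the loop too early, so a wait or retry may be needed in practice).

```python
def download_all_pages(visible_urls, save_image, next_page, max_pages=2000):
    """Repeat the single-page step until no unseen image URL turns up.

    visible_urls() -> list of image URLs currently in the DOM
    save_image(url) -> saves one image (new tab, AppleScript, close tab)
    next_page()     -> clicks the "next page" button
    All three are hypothetical wrappers around the code shown earlier.
    """
    seen = []
    for _ in range(max_pages):  # hard cap so the loop always terminates
        new_urls = [u for u in visible_urls() if u and u not in seen]
        if not new_urls:
            break  # assumption: nothing new means the last page was reached
        url = new_urls[0]  # the first image not yet downloaded
        seen.append(url)
        save_image(url)
        next_page()
    return seen
```

Passing the three actions in as functions keeps the loop logic separate from the browser, so it can be tried out with fake pages before touching a real loan.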
=== SIXTH STEP: the AppleScript ===
To download the image we simply press "command + S" and click "save". These two actions are automated with the help of AppleScript.
Python code for ## DOWNLOAD THE IMAGE ## part:
import os  # belongs with the other imports at the top of the script
cmd = """osascript ./scpt1.scpt"""
os.system(cmd)
time.sleep(0.5)
cmd = """osascript ./scpt2.scpt"""
os.system(cmd)
time.sleep(0.5)
Two AppleScripts are invoked here: `scpt1.scpt` and `scpt2.scpt`. Their contents follow.
Contents of scpt1.scpt:
tell application "System Events"
    tell process "Firefox" to keystroke "s" using command down
end tell
This is a simple script to simulate a "command + S" key combination, which will fire up the "Save As" window.
Contents of scpt2.scpt:
tell application "System Events"
    repeat while exists (window 2 of process "Firefox")
        click at {xxx, yyy}
        delay 1
    end repeat
end tell
This is somewhat more complicated. It tests whether Firefox has two windows (the second window being the "Save As" window). As long as the second window is there, it clicks the save button repeatedly. {xxx, yyy} should be replaced by the on-screen coordinates of the save button, which can be measured with the aid of the screenshot tool (invoked by "command + shift + 4").
Keep in mind that these .scpt files should be created and saved using the "Script Editor" app that comes with macOS, as they are not stored in plain text format. I have added `delay` and `sleep` everywhere in this code to limit the speed of the process, letting it run in a more orderly manner.
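If you would rather avoid compiled .scpt files altogether, `osascript` also accepts script text on the command line through its `-e` option, one line of script per `-e`. The following is a sketch of how the first script could be run that way from Python; the helper names `osascript_argv` and `run_applescript` are my own, and I have only used this pattern for short scripts.

```python
import subprocess

# The same three lines as scpt1.scpt, kept as plain text in Python.
SAVE_SHORTCUT = [
    'tell application "System Events"',
    'tell process "Firefox" to keystroke "s" using command down',
    'end tell',
]


def osascript_argv(lines):
    """Build the argument list for osascript, one -e option per script line."""
    argv = ["osascript"]
    for line in lines:
        argv += ["-e", line]
    return argv


def run_applescript(lines):
    """Run the script with osascript (macOS only); raises if it exits non-zero."""
    subprocess.run(osascript_argv(lines), check=True)
```

Calling `run_applescript(SAVE_SHORTCUT)` would then take the place of the first `os.system(cmd)` call, with the advantage that a failing script raises an exception instead of being silently ignored.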
The files are saved to the default download directory. They can then be compressed into a single archive and imported into an ebook reader. Fortunately the filenames of the saved image files are already in good order.
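The compression step can also be scripted. Below is a minimal sketch using Python's `zipfile` module. It assumes the images end in `.jpg`, that sorting the filenames alphabetically gives the page order, and that your reader accepts a plain zip archive (many comic readers expect such an archive with a `.cbz` extension); the function name and any paths passed to it are placeholders.

```python
import zipfile
from pathlib import Path


def pack_images(image_dir, archive_path):
    """Zip every .jpg in image_dir, in filename order, into archive_path."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    # JPEGs are already compressed, so store them as-is.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for image in images:
            archive.write(image, arcname=image.name)
    return [image.name for image in images]
```

For example, `pack_images("/path/to/downloads", "book.cbz")` would collect the page images into one file, leaving the originals untouched.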