Lupa is a Gemini crawler. Starting from a few given URLs, it retrieves them, analyzes the links in gemtext (Gemini format) files and adds them to the database of URLs to crawl. It is not a search engine: it does not store the content of the resources, just the metadata, for research and statistics.
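For reference, a link line in gemtext starts with "=>", followed by the target URL and an optional label. A minimal sketch of extracting the URLs from a page (the file name is just an example, and this is not Lupa's actual code):
```
# Keep only link lines ("=> URL optional label") and print the URL.
# Assumes whitespace after "=>", which gemtext does not strictly require.
grep '^=>' page.gmi | awk '{print $2}'
```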
The instance of the crawler that I manage currently operates from `2001:41d0:302:2200::180` (and `193.70.85.11` on the old networks).
Logbook of the production crawler
If you want the list of capsules known to Lupa:
If you notice a missing capsule, write me (address at the end of this
page).
If you want the entire content of the database, you'll have to write
me (address at the end of this page) and explain why. I tend to be
liberal with such requests since, after all, it is public data and
anyone could gather it.
Lupa is written in Python 3.
There is no real installation procedure: get the sources, put them where you want, and set up PYTHONPATH and PATH. Prerequisites (all of them on PyPI): psycopg2, pyopenssl, scfg, public_suffix_list and agunua.
(On a Debian machine, the packaged prerequisites are python3-pip, python3-psycopg2 and python3-openssl; agunua, public_suffix_list and scfg have to be installed with pip or manually.)
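For instance, on a Debian machine, installing the prerequisites could look like this (a sketch, using the package names above):
```
# Packaged prerequisites
sudo apt install python3-pip python3-psycopg2 python3-openssl
# The rest comes from PyPI
pip3 install agunua public_suffix_list scfg
```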
Usage requires a PostgreSQL database to store the URLs and the result of crawling. Once you've created the database, prepare it with the `create.sql` file:
```
createdb lupa
psql -f ./admin-scripts/create.sql lupa
export PYTHONPATH=$(pwd)
./admin-scripts/lupa-insert-url gemini://start.url.example/
./admin-scripts/lupa-insert-url gemini://second-start.url.example/
```
At the present time, you need a separate script to retrieve robots.txt
exclusion files. It is *not* done by the crawler. This script must be
run from time to time, for instance from cron, every two hours:
```
./admin-scripts/lupa-add-robotstxt
```
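For instance, with a crontab entry like this (the installation path is hypothetical):
```
PYTHONPATH=/where/you/put/lupa
# Every two hours, refresh the robots.txt exclusion files
0 */2 * * * cd /where/you/put/lupa && ./admin-scripts/lupa-add-robotstxt
```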
You run the crawler with `./lupa-crawler`. The crawler does not run
forever, you need to start it from cron. Locking is done by the
database, so it is not an issue if two instances run at the same time.
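A possible crontab entry (the path is hypothetical; since locking is done by the database, overlapping runs are harmless):
```
PYTHONPATH=/where/you/put/lupa
# Start a crawl every hour
0 * * * * cd /where/you/put/lupa && ./lupa-crawler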
You can get a list of options with `--help` but, at this time, you need to read the source to understand them. Some interesting options:
* The maximum number of URLs retrieved per run. The default is deliberately small, to allow testing, so you may want to set it to a more reasonable value such as 1000.
* The number of URLs selected from the database, picked at random. You typically set it to the size of the database, but it can be smaller.
* If the crawler is too aggressive, you can slow it down with this parameter: between two URLs, the crawler will sleep for a time randomly chosen between 0 and this number of seconds.
* The maximum age of data: a URL is crawled only if it was never retrieved, or retrieved more than this number of days ago. Default is 14 days.
* A global timeout for the whole run, so the crawler is not blocked forever if there is a blocking operation. It is one hour by default.
Also, you can use a configuration file, in the scfg syntax. An example is in the sources, `sample-lupa.conf`.
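scfg is a simple line-based format: one directive per line, followed by its parameters. A sketch of what such a file could look like (the directive names here are made up, for the syntax only; see `sample-lupa.conf` for the real ones):
```
# Hypothetical directives, illustrating the scfg syntax
max-urls 1000
sleep 2
```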
A log file is created in `/var/tmp/Lupa.log`. It is up to you to ensure it is rotated from time to time.
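For instance, with a logrotate snippet such as this one (a sketch, adjust to taste):
```
/var/tmp/Lupa.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
```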
Lupa means she-wolf in Latin. It refers to the wolf who took care of the twins Romulus and Remus. (Many Gemini programs have names related to twins, "gemini" in Latin.)
Stéphane Bortzmeyer stephane+gemini@bortzmeyer.org