💾 Archived View for radia.bortzmeyer.org › software › lupa › logbook.gmi captured on 2023-12-28 at 16:01:24. Gemini links have been rewritten to link to archived content


Logbook of the running Lupa crawler at gemini.bortzmeyer.org

12 january 2022

Completely new handling of the exclusion files "robots.txt". We now use the code in the Python standard library instead of custom code. It should work better with some complicated robots.txt files (those using both the Allow and Disallow directives, for instance).

The issue

The Python package

There is currently no proper standard for robots.txt. The Internet-Draft is still under evaluation.

State of the Internet-Draft

Note that many robots.txt files in the wild are wrong (for instance, having several user agents on one line) and so will be ignored.

Example of a broken file
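As an illustration only (this is not Lupa's actual code), here is a minimal sketch of how a robots.txt can be evaluated with urllib.robotparser from the Python standard library. The helper name can_crawl, the agent name "example-bot" and both sample files are made up for the example. Note that this parser applies the first matching rule, so the order of Allow and Disallow matters.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical file using both Allow and Disallow.
MIXED = """\
User-agent: *
Allow: /public/
Disallow: /
"""

# Hypothetical broken file: several user agents on one line.
BROKEN = """\
User-agent: researcher, indexer
Disallow: /
"""

print(can_crawl(MIXED, "example-bot", "gemini://example.org/public/page.gmi"))   # True
print(can_crawl(MIXED, "example-bot", "gemini://example.org/private/page.gmi"))  # False
# The malformed group matches no agent, so its Disallow has no effect.
print(can_crawl(BROKEN, "researcher", "gemini://example.org/page.gmi"))          # True
```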

14 december 2021

One year of Lupa! We now have 334,000 working URLs, 1,500 working capsules (in 1,000 registered domains), using 1,000 different IP addresses.

28 november 2021

We no longer record and display the fact that there was no proper TLS shutdown (close_notify). This is because Agunua does not seem to return reliable information about it.

The Agunua issue

10 october 2021

We now have more than one thousand (1,000) registered domains (the capsules foo.flounder.online and bar.flounder.online are in the same registered domain, so it is two capsules but one registered domain).
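To make the difference between capsules and registered domains concrete, here is a small sketch. It assumes a toy suffix set standing in for the real Public Suffix List, and the function name registered_domain is invented for the example; a real implementation would load the full list and handle its wildcard and exception rules.

```python
from collections import defaultdict

# Toy stand-in for the Public Suffix List (assumption: the real list is much longer).
TOY_SUFFIXES = frozenset({"online", "org", "co.uk"})


def registered_domain(host: str, suffixes=TOY_SUFFIXES) -> str:
    """Return the public suffix plus one label, e.g. flounder.online."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates go from longest to shortest, so the longest suffix wins.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[max(i - 1, 0):])
    # Unknown suffix: fall back to the last two labels.
    return ".".join(labels[-2:])


capsules = ["foo.flounder.online", "bar.flounder.online", "gemini.bortzmeyer.org"]
by_domain = defaultdict(list)
for capsule in capsules:
    by_domain[registered_domain(capsule)].append(capsule)

print(len(capsules), "capsules in", len(by_domain), "registered domains")
```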

19 may 2021

We now have more than one thousand (1,000) working capsules.

(This is partly because we now keep the capsules whose robots.txt prevented any crawling; before that, they were regarded as non-working.)

The bug report

8 may 2021

The list of known capsules is now published

As a text file

As a gemtext, with links

31 march 2021

URLs whose status code is 31 ("Permanent redirect") are now purged.

The issue

29 march 2021

Lupa now displays separately the language statistics for the language only and for the full language tag.

Remember: tag wisely
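A rough sketch of the two counts, assuming the language tags have already been extracted from the "lang" parameter of the responses. The function name and the sample tags are invented for the example. Tags are lowercased because language tags are case-insensitive.

```python
from collections import Counter


def language_stats(tags):
    """Count tags by primary language subtag and by full language tag."""
    full = Counter(tag.lower() for tag in tags)
    primary = Counter(tag.lower().split("-")[0] for tag in tags)
    return primary, full


# Hypothetical sample of tags seen by a crawler.
primary, full = language_stats(["en", "en-GB", "fr", "fr-CA", "fr"])
print(primary)  # language only: en and fr
print(full)     # full tags: en, en-gb, fr, fr-ca
```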

26 march 2021

Lupa now connects to .onion capsules (capsules reachable only through the Tor network). Currently, there are only two.

The Tor project

This capsule, on .onion, to see if your Gemini browser can do it

How to set up a .onion capsule

24 march 2021

The number of URLs decreased because Lupa automatically deleted URLs that returned an error for too long. Remember that the "geminispace" is small so just one big capsule changing its content/policies can seriously impact the figures.

14 march 2021

We now have 800 working capsules. And 180,000 working URLs, although I believe this number is less important (any capsule can generate a lot of dynamic URLs).

10 march 2021

We now display the TLS versions used by capsules. (A majority uses TLS 1.3.) We also display the percentage of capsules that use an expired certificate (more than 2 %). And we also report the URLs without a proper TLS shutdown.
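The expired-certificate percentage could be computed along these lines. This is a sketch under assumptions, not the crawler's code: it supposes that the expiry dates are available in the textual "notAfter" format used by Python's ssl module, and the function names and sample dates are invented.

```python
import ssl
from datetime import datetime, timezone


def is_expired(not_after: str, now: datetime) -> bool:
    """True if the certificate's notAfter date is before `now`."""
    return ssl.cert_time_to_seconds(not_after) < now.timestamp()


def expired_percentage(not_after_values, now: datetime) -> float:
    """Percentage of certificates that are expired at `now`."""
    values = list(not_after_values)
    if not values:
        return 0.0
    expired = sum(1 for value in values if is_expired(value, now))
    return 100.0 * expired / len(values)


# Hypothetical expiry dates, one per capsule.
dates = [
    "Jan  5 09:34:43 2018 GMT",
    "Dec 31 23:59:59 2030 GMT",
    "Dec 31 23:59:59 2030 GMT",
    "Dec 31 23:59:59 2030 GMT",
]
now = datetime(2021, 3, 10, tzinfo=timezone.utc)
print(expired_percentage(dates, now))
```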

9 march 2021

We now display the maximum and average number of links pointing to URLs in our database. We do not display a list of the URLs with the most links towards them, to avoid popularity contests.

The issue
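A possible way to compute these two figures, as a sketch only. It assumes links are stored as (source, target) pairs and that URLs with no inbound link count as zero in the average; both choices, like the function name, are assumptions of the example.

```python
from collections import Counter


def inbound_link_stats(urls, links):
    """Return (maximum, average) number of inbound links per known URL."""
    known = set(urls)
    if not known:
        return 0, 0.0
    # Only count links whose target is a URL in the database.
    inbound = Counter(target for _source, target in links if target in known)
    counts = [inbound[url] for url in known]
    return max(counts), sum(counts) / len(counts)


# Hypothetical data.
urls = ["gemini://a.example/", "gemini://b.example/", "gemini://c.example/"]
links = [
    ("gemini://a.example/", "gemini://b.example/"),
    ("gemini://c.example/", "gemini://b.example/"),
    ("gemini://a.example/", "gemini://c.example/"),
    ("gemini://b.example/", "gemini://elsewhere.example/"),
]
print(inbound_link_stats(urls, links))
```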

12 february 2021

We now display TLDs (Top-Level Domains) also per number of registered domains, not just per number of capsules. We use Mozilla's Public Suffix List (not perfect but there is no better resource).

The Public Suffix List
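The two ways of counting can be sketched as follows, assuming each capsule has already been paired with its registered domain (as derived from the Public Suffix List). The function name and the sample data are hypothetical.

```python
from collections import Counter


def tld_counts(capsules):
    """Count TLDs per capsule and per distinct registered domain.

    `capsules` is an iterable of (capsule_name, registered_domain) pairs.
    """
    per_capsule = Counter()
    per_domain = Counter()
    seen_domains = set()
    for capsule, domain in capsules:
        tld = capsule.rstrip(".").rsplit(".", 1)[-1].lower()
        per_capsule[tld] += 1
        # Each registered domain is counted only once for its TLD.
        if domain not in seen_domains:
            seen_domains.add(domain)
            per_domain[tld] += 1
    return per_capsule, per_domain


per_capsule, per_domain = tld_counts([
    ("foo.flounder.online", "flounder.online"),
    ("bar.flounder.online", "flounder.online"),
    ("gemini.bortzmeyer.org", "bortzmeyer.org"),
])
print(per_capsule, per_domain)
```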

26 january 2021

We have started to purge old and stale data from the database. Therefore, several numbers will decrease.

The original issue

20 january 2021

A bug in the counting of Let's Encrypt certificates has been fixed. Therefore, the percentage of Let's Encrypt will increase.

The patch

19 january 2021

The statistics page is now much stricter about the freshness of the data. We ignore, for instance, capsules that were not contacted recently (currently 31 days). As a result, several numbers decreased.

The stats

The ticket #12
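The freshness rule can be pictured with a sketch like this one. Only the 31-day value comes from the entry above; the data layout (a mapping from capsule name to the time of last successful contact) and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Freshness window; 31 days is the value currently used.
FRESHNESS = timedelta(days=31)


def fresh_capsules(last_contact, now, max_age=FRESHNESS):
    """Return the names of capsules contacted within `max_age` of `now`."""
    return {name for name, when in last_contact.items() if now - when <= max_age}


# Hypothetical data.
now = datetime(2021, 1, 19, tzinfo=timezone.utc)
last_contact = {
    "recent.example": datetime(2021, 1, 10, tzinfo=timezone.utc),
    "stale.example": datetime(2020, 12, 1, tzinfo=timezone.utc),
}
print(fresh_capsules(last_contact, now))
```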

4 january 2021

A bug prevented robots.txt from being retrieved from capsules with an invalid certificate. Now that it is fixed, it will probably lead to a decrease in the number of retrieved URLs.

The bug

21 december 2020

The crawler now uses the Agunua library instead of its own internal Gemini library.

Agunua

16 december 2020

The database now contains 31 145 URIs (16 273 successfully retrieved) and 484 capsules (270 successfully contacted).

16 december 2020

Stupid bug when updating the state of the capsules after a successful connect.

The bug

14 december 2020

The crawler has entered production.

All about the crawler