💾 Archived View for kennedy.gemi.dev › docs › crawling.gmi captured on 2023-03-20 at 17:38:25. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

🔭 Notes on Crawling and Indexing

Kennedy creates its search index by crawling content only within Geminispace. It does not crawl or index content served over other protocols, such as Gopher or HTTP.
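
As an illustration of this scope rule, here is a minimal sketch in Python. The helper name `is_crawlable` is hypothetical and is not taken from Kennedy's own code; it only shows the idea of following a link when, and only when, its scheme is gemini.

```python
from urllib.parse import urlsplit


def is_crawlable(url):
    """Hypothetical helper: accept only gemini:// URLs that name a host."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme, so "GEMINI://" is handled too.
    return parts.scheme == "gemini" and bool(parts.hostname)
```

Links with any other scheme, such as gopher:// or https://, would simply be skipped by a crawler built this way.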

Crawler details

Kennedy crawls Geminispace using the following IP addresses:

IPv4: 64.149.155.184
IPv6: 2600:1700:1731:d0f:8ce4:29f1:a378:af4c
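
Capsule operators who want to recognize the crawler in their logs could compare client addresses against the two listed above. The following is a small sketch using Python's standard `ipaddress` module; the names `KENNEDY_ADDRESSES` and `is_kennedy` are made up for this example.

```python
import ipaddress

# Crawler addresses as listed above.
KENNEDY_ADDRESSES = {
    ipaddress.ip_address("64.149.155.184"),
    ipaddress.ip_address("2600:1700:1731:d0f:8ce4:29f1:a378:af4c"),
}


def is_kennedy(remote_addr):
    """Hypothetical helper: True if remote_addr is one of the listed addresses."""
    try:
        return ipaddress.ip_address(remote_addr) in KENNEDY_ADDRESSES
    except ValueError:
        # Not a valid IP address string at all.
        return False
```

Parsing the address first, rather than comparing strings, means that differently formatted spellings of the same IPv6 address still match.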

Crawler speed

Kennedy throttles itself, waiting 1.5 seconds between requests to the same IP address. As a result, it takes longer to crawl multiple capsules hosted on the same IP address, such as those on Flounder.online.
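
A per-address throttle of this kind can be sketched as follows. This is an illustration under assumptions, not Kennedy's implementation: the class name `PerHostThrottle` and its parameters are invented here, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class PerHostThrottle:
    """Hypothetical sketch: space out requests to the same IP address."""

    def __init__(self, delay=1.5, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # IP address -> time of the most recent request

    def wait(self, ip):
        """Pause until `delay` seconds have passed since the last request
        to this IP address. Returns the number of seconds slept."""
        now = self._clock()
        last = self._last_request.get(ip)
        pause = 0.0
        if last is not None:
            pause = max(0.0, last + self.delay - now)
            if pause > 0:
                self._sleep(pause)
        self._last_request[ip] = now + pause
        return pause
```

Because the table is keyed by IP address rather than by hostname, every capsule on a shared host draws from the same 1.5 second budget, which is why such hosts take longer to crawl.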

Robots.txt Support

Kennedy respects robots.txt files that follow the simplified robots.txt protocol defined for Gemini.

Robots.txt subset for Gemini

Specifically, Kennedy will follow the Deny rules defined for the following user-agents:

*
indexer

Note: There are a number of robots.txt files in Geminispace which use rules outside of the simplified standard above. These include:

Allow Rules
Deny Rules with wildcard characters in the middle
Crawl-Delay directives

Kennedy does not currently respect these rules.
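
The behavior described in this section could look roughly like the sketch below. Several things here are assumptions rather than facts about Kennedy: the function names are invented, "Deny rules" are taken to mean `Disallow:` lines, matching is a plain path-prefix comparison, and any rule containing a `*` wildcard is skipped entirely.

```python
HONORED_AGENTS = {"*", "indexer"}


def parse_disallow_prefixes(robots_txt):
    """Collect Disallow path prefixes from groups addressed to '*' or 'indexer'.

    Allow and Crawl-delay lines, and Disallow values containing '*',
    are ignored (assumption: that is what "not respected" means here).
    """
    prefixes = []
    agents = set()    # user-agents named by the current group
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = set(), False
            agents.add(value.lower())
        else:
            in_rules = True
            if (field == "disallow" and value and "*" not in value
                    and agents & HONORED_AGENTS):
                prefixes.append(value)
    return prefixes


def is_allowed(path, prefixes):
    """True if no collected prefix matches the start of the path."""
    return not any(path.startswith(prefix) for prefix in prefixes)
```

One consequence worth noting for capsule authors: under this reading, an `Allow:` line cannot carve an exception out of a broader Deny rule, so a page under a denied prefix stays out of the index.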

Crawler Limits

Kennedy has the following limits:

Will not download responses larger than 10 MB.
Will not fully download non-text resources such as images. You may notice Kennedy closing its connection to a capsule once it has received the MIME type for the response.
Will close the connection if a URL takes more than 45 seconds to fully respond.
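
The three limits above can be sketched as a single read loop. This is an illustration only: the function name `read_response` and the constants are invented for the example, "10 MB" is read as 10 MiB, and the stream is any binary file-like object standing in for a network connection.

```python
import io
import time

MAX_BODY_BYTES = 10 * 1024 * 1024  # "10 MB" read as 10 MiB (assumption)
MAX_SECONDS = 45.0


def read_response(stream, clock=time.monotonic):
    """Read one Gemini response from a binary stream, applying the limits.

    Returns (status, meta, body). body is None when the download was
    skipped or abandoned.
    """
    deadline = clock() + MAX_SECONDS
    # Header: two status digits, a space, up to 1024 bytes of meta, CRLF.
    header = stream.readline(1029)
    text = header.decode("utf-8", "replace").rstrip("\r\n")
    status, _, meta = text.partition(" ")
    if not status.startswith("2"):
        return status, meta, None      # only success responses carry a body
    mime = meta.split(";", 1)[0].strip().lower()
    if not mime.startswith("text/"):
        return status, meta, None      # non-text: stop after the MIME type
    body = bytearray()
    while True:
        if clock() > deadline:
            return status, meta, None  # took longer than 45 seconds
        chunk = stream.read(64 * 1024)
        if not chunk:
            return status, meta, bytes(body)
        body.extend(chunk)
        if len(body) > MAX_BODY_BYTES:
            return status, meta, None  # larger than the size limit


# Usage with an in-memory stream standing in for a connection:
status, meta, body = read_response(io.BytesIO(b"20 text/gemini\r\n# Hello\n"))
```

In this sketch the deadline is only checked between reads, so a real crawler would presumably also need a socket-level timeout to catch a server that stalls in the middle of a read.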