archived 2022-09-06
This document describes an adaptation of the web's de facto standard robots.txt mechanism for controlling access to Gemini resources by automated clients (hereafter "bots").
Gemini server admins may use robots.txt to convey their desired bot policy in a machine-readable format.
Authors of automated Gemini clients (e.g. search engine crawlers, web proxies, etc.) are strongly encouraged to check for such policies and to comply with them when found.
Server admins should understand that it is impossible to *enforce* a robots.txt policy and must be prepared to use e.g. firewall rules to block access by misbehaving bots. This is equally true of Gemini and the web.
Gemini server admins may serve a robot policy for their server at the URL with path /robots.txt, i.e. the server example.net should serve its policy at gemini://example.net/robots.txt.
The robots.txt file should be served with a MIME media type of text/plain.
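As an illustration, below is a minimal sketch (in Python, using only the standard library) of how a bot might fetch and validate such a policy. The function name and the lax certificate handling are illustrative choices, not part of this document; a real bot should perform proper TOFU-style certificate checking rather than disabling verification.

```
import socket
import ssl

def fetch_robots_txt(host, port=1965):
    """Fetch gemini://<host>/robots.txt and return its body, or None if no
    text/plain policy was served. Illustrative sketch only."""
    url = f"gemini://{host}/robots.txt"
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE  # many capsules use self-signed certs
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall((url + "\r\n").encode("utf-8"))
            response = b""
            while True:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                response += chunk
    header, _, body = response.partition(b"\r\n")
    status, _, meta = header.decode("utf-8", errors="replace").partition(" ")
    # A 2x status with a text/plain META means a policy was served.
    if status.startswith("2") and meta.strip().startswith("text/plain"):
        return body.decode("utf-8", errors="replace")
    return None  # no policy found; the bot may crawl, subject to politeness
```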
The format of the robots.txt file is as per the original robots.txt specification for the web: records consist of one or more "User-agent:" lines naming the bots they apply to, followed by one or more "Disallow:" lines giving URL path prefixes those bots should not request, with records separated by blank lines and "#" introducing comments.
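For example, a hypothetical policy that keeps all bots out of a /cgi-bin/ directory while leaving the rest of the capsule open might look like this (the path is purely illustrative):

```
# Keep all bots out of dynamically generated content
User-agent: *
Disallow: /cgi-bin/
```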
The only non-trivial difference between robots.txt on the web and on Gemini is that, because Gemini clients do not send a user agent and server admins therefore cannot easily learn which bots are accessing their site and why, Gemini bots are encouraged to obey directives for "virtual user agents" according to their purpose or function. These are described below.
Despite this difference, Gemini bots should still respect robots.txt directives aimed at a User-agent of *, and may also respect directives aimed at their own individual User-agent, which they should prominently advertise, e.g. on the Gemini page of any public service they provide.
Below are definitions of various "virtual user agents", each of which corresponds to a common category of bot. Gemini bots should respect directives aimed at any virtual user agent which matches their activity. Obviously, it is impossible to define these user agents so precisely that every bot can be categorised unambiguously. Bot authors are encouraged to err on the side of caution and attempt to follow the "spirit" of this system, rather than the "letter". If a bot meets the definition of multiple virtual user agents and is not able to adapt its behaviour in a fine-grained manner, it should obey the most restrictive set of directives arising from the combination of all applicable virtual user agents.
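The following sketch (Python, with a deliberately simplified parser that understands only User-agent and Disallow lines and treats blank lines as record separators) illustrates this "most restrictive combination" rule: the bot collects the Disallow prefixes from every record matching * or any of its applicable virtual user agents and obeys their union. All function and variable names are illustrative, not part of this document.

```
def disallowed_prefixes(robots_txt, my_agents):
    """Collect Disallow prefixes from every record whose User-agent line
    matches "*" or any of the bot's applicable (virtual) user agents."""
    agents = {a.lower() for a in my_agents} | {"*"}
    prefixes = set()
    applies = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            applies = False                  # a blank line ends the record
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if value.lower() in agents:
                applies = True
        elif field == "disallow" and applies and value:
            prefixes.add(value)
    return prefixes

# e.g. a bot that both indexes and archives obeys the union of both rule sets:
# rules = disallowed_prefixes(policy_text, ["indexer", "archiver", "mybot"])
# may_fetch = not any(path.startswith(prefix) for prefix in rules)
```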
Gemini bots which fetch content in order to build public long-term archives of Geminispace, which will serve old Gemini content even after the original has changed or disappeared (analogous to archive.org's "Wayback Machine"), should respect robots.txt directives aimed at a User-agent of "archiver".
Gemini bots which fetch content in order to build searchable indexes of Geminispace should respect robots.txt directives aimed at a User-agent of "indexer".
Gemini bots which fetch content in order to study large-scale statistical properties of Geminispace (e.g. number of domains/pages, distribution of MIME media types, response sizes, TLS versions, frequency of broken links, etc.), without rehosting, linking to, or allowing search of any fetched content, should respect robots.txt directives aimed at a User-agent of "researcher".
Gemini bots which fetch content in order to translate said content into HTML and publicly serve the result over HTTP(S) (in order to make Geminispace accessible from within a standard web browser) should respect robots.txt directives aimed at a User-agent of "webproxy".
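Putting the virtual user agents together, a hypothetical capsule that permits search indexing outside its logs, refuses long-term archiving entirely, and hides a private section from web proxies might publish a policy like the following (all paths are illustrative):

```
# Applies to all bots, including those matching no virtual user agent
User-agent: *
Disallow: /logs/

# No long-term public archiving of any content
User-agent: archiver
Disallow: /

# Web proxies may mirror everything except the private section
User-agent: webproxy
Disallow: /private/
```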