2019-05-24 Robots and Gopher

Here’s the proposal that was recently discussed on the Gopher mailing list.


Motivation

We love gopher apps and we love seeing them, but it is very hard for robots crawling gopher-space to automatically recognize them, requiring lots of manual work to pull stuff out of the index that should never have been there in the first place. Please use a `robots.txt` selector to keep spiders out of these areas.

Location

A robot MUST check `robots.txt`. A robot MAY check `0/robots.txt` if `robots.txt` is not found.

There are two selectors because almost every server interprets the selector `robots.txt` as a file in its root, while UMN and UMN-alike gopherds like to have the item type repeated at the start of the selector, hence `0/robots.txt`. The first takes precedence.

Note that this doesn’t include a leading slash!

How to test? The following should return the contents of the site’s `robots.txt`.

echo robots.txt | nc alexschroeder.ch 70
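For a robot, the same lookup with the fallback could be sketched in Python as follows. This is only a sketch: the names `fetch_selector`, `looks_missing` and `fetch_robots` are made up for illustration, and the proposal does not say how to recognise “not found”, so the check below (an empty reply, or a first line that looks like a Gopher error item of type `3`) is an assumption.

```python
import socket

def fetch_selector(host, selector, port=70, timeout=10):
    """Send one selector to a Gopher server and return the raw reply as bytes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(selector.encode("utf-8") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks)

def looks_missing(reply):
    # Assumption: an empty reply, or a first line shaped like a Gopher
    # error item (item type "3" plus tab-separated fields), means "not found".
    if not reply.strip():
        return True
    first_line = reply.split(b"\n", 1)[0]
    return first_line.startswith(b"3") and b"\t" in first_line

def fetch_robots(host, port=70, fetch=fetch_selector):
    """Try robots.txt first, then 0/robots.txt; return the text or None."""
    for selector in ("robots.txt", "0/robots.txt"):
        reply = fetch(host, selector, port)
        if not looks_missing(reply):
            return reply.decode("utf-8", errors="replace")
    return None
```

The `fetch` parameter exists so the fallback logic can be tried without a network connection.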

Caching

A robot SHOULD cache the `robots.txt` file for 24h.
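One way a robot might do this is a small in-memory cache keyed by host. This is a sketch, not part of the proposal: `RobotsCache` is a made-up name, and the `fetch` callable (host in, `robots.txt` text or `None` out) is assumed to exist elsewhere in the robot.

```python
import time

CACHE_SECONDS = 24 * 60 * 60  # the 24h from the proposal

class RobotsCache:
    def __init__(self, fetch, clock=time.time):
        self.fetch = fetch    # hypothetical callable: host -> text or None
        self.clock = clock    # replaceable so expiry can be tested
        self.entries = {}     # host -> (fetched_at, text)

    def get(self, host):
        now = self.clock()
        entry = self.entries.get(host)
        if entry is not None and now - entry[0] < CACHE_SECONDS:
            return entry[1]   # still fresh, no request needed
        text = self.fetch(host)
        self.entries[host] = (now, text)
        return text
```

Note that a missing file (`None`) is cached as well, so a site without `robots.txt` is also asked only once a day.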

Lines

A `robots.txt` file consists of lines separated by a newline (`\n`) or a carriage return and a newline (`\r\n`).

Disallow

A robot MUST consider all lines starting with `Disallow:`. Each such line specifies a pattern indicating that all selectors matching the pattern are to be ignored by robots.

1. Whitespace after `Disallow:` MUST be ignored

2. Patterns match from the beginning of the selector

Example:

The following line disallows robots from indexing any selectors starting with a slash:

Disallow: /

Note that the selector `robots.txt` does not start with a slash.

In terms of regular expressions, this means that every pattern implicitly starts with `^`.
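The rules so far (line splitting, `Disallow:` lines, ignored whitespace, matching from the beginning) could be sketched like this. The function names are made up for illustration. The proposal does not say what an empty pattern means, so skipping it here is an assumption. Wildcards are not handled in this sketch.

```python
def parse_disallow(text):
    """Return the patterns from all lines starting with 'Disallow:'."""
    patterns = []
    for line in text.split("\n"):
        line = line.rstrip("\r")  # accept \r\n as well as \n
        if not line.startswith("Disallow:"):
            continue
        # Whitespace after "Disallow:" is ignored.
        pattern = line[len("Disallow:"):].lstrip()
        if pattern:  # assumption: an empty pattern disallows nothing
            patterns.append(pattern)
    return patterns

def is_disallowed(selector, patterns):
    # Patterns match from the beginning of the selector (plain prefixes only).
    return any(selector.startswith(pattern) for pattern in patterns)
```

With the example above, `is_disallowed("/foo", ["/"])` is true and `is_disallowed("robots.txt", ["/"])` is false.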

Globbing

Patterns MAY contain one or more asterisks (`*`). These are wildcards matching zero or more characters.

Example:

The following line disallows robots from indexing any selectors containing a slash:

Disallow: */

Note that there is no way to specify that a pattern must match up to the end of the selector.

In terms of regular expressions, this means that there is no way to specify `$` in a pattern. Every `*` in a pattern is the equivalent of `.*` in a regular expression, assuming that `/s` is in effect so that `.` matches any character whatsoever, even a newline.
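Following that regular expression reading, a pattern could be translated roughly like this in Python. The names `pattern_to_regex` and `matches` are made up for illustration. Everything except `*` is escaped and treated literally, `re.DOTALL` stands in for `/s`, and `re.match` only anchors at the beginning, so there is no implicit `$`.

```python
import re

def pattern_to_regex(pattern):
    """Translate a Disallow pattern into a compiled regular expression."""
    # Escape the literal pieces and join them with ".*" for each asterisk.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts), re.DOTALL)

def matches(pattern, selector):
    # match() anchors at the start of the selector, like an implicit ^,
    # and does not require the pattern to reach the end of the selector.
    return pattern_to_regex(pattern).match(selector) is not None
```

So `matches("*/", "foo/bar")` is true, while `matches("*/", "robots.txt")` and `matches("/", "foo/")` are false.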

Comments

Authors SHOULD use the `#` character to indicate a comment up to the end of the line.
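A robot could drop comments before looking at a line, perhaps like this. The helper name `strip_comment` is made up. The proposal does not mention any way to escape `#`, so this sketch assumes a pattern cannot contain one, and it also assumes that whitespace before the comment can be trimmed.

```python
def strip_comment(line):
    """Drop everything from the first '#' to the end of the line."""
    # Assumption: '#' always starts a comment; trailing whitespace is trimmed.
    return line.split("#", 1)[0].rstrip()
```

For example, `strip_comment("Disallow: /cgi # scripts")` gives `"Disallow: /cgi"`.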

Other Keywords

There is currently no support for other keywords we know from the web’s `robots.txt` standard. [1]

References

[1] https://en.wikipedia.org/wiki/Robots_exclusion_standard


​#Gopher