Crawlers on Gemini and best practices

Stephane Bortzmeyer <stephane (a) sources.org>

I just developed a simple crawler for Gemini. Its goal is not to build
another search engine but to perform some surveys of the
geminispace. A typical result will be something like (real data, but
limited in size):

gemini://gemini.bortzmeyer.org/software/crawler/

I have not yet let it loose on the Internet, because there are
some questions I have.

Is it "good practice" to follow robots.txt? There is no mention of it
in the specification but it could work for Gemini as well as for the
Web and I notice that some programs query this name on my server.

Since Gemini (and rightly so) has no User-Agent, how can a bot
advertise its policy and a point of contact?

Link to individual message.

Solene Rapenne <solene (a) perso.pw>

On Tue, 8 Dec 2020 14:36:56 +0100
Stephane Bortzmeyer <stephane at sources.org>:

> I just developed a simple crawler for Gemini. Its goal is not to build
> another search engine but to perform some surveys of the
> geminispace. A typical result will be something like (real data, but
> limited in size):
> 
> gemini://gemini.bortzmeyer.org/software/crawler/
> 
> I have not yet let it loose on the Internet, because there are
> some questions I have.
> 
> Is it "good practice" to follow robots.txt? There is no mention of it
> in the specification but it could work for Gemini as well as for the
> Web and I notice that some programs query this name on my server.
> 
> Since Gemini (and rightly so) has no User-Agent, how can a bot
> advertise its policy and a point of contact?

depending on what you try, you may add your contact info
in the query.

On first contact with a new server, before you start crawling, you
could request gemini://hostname/CRAWLER_FROM_SOMEONE_AT_HOST_DOT_COM

This is what I do for a gopher connectivity check.

I have to admit, it's a really poor solution, but I didn't
find a better way.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 8, 2020, at 14:58, Solene Rapenne <solene at perso.pw> wrote:
> 
> depending on what you try, you may add your contact info
> in the query.

Like so?

gemini://example.com/robots.txt?user-agent=DetailsOfTheRobot ?

Previously on the list:

https://lists.orbitalfox.eu/archives/gemini/2020/000511.html

Link to individual message.

Natalie Pendragon <natpen (a) natpen.net>

Yes, you should respect robots.txt in my opinion. It's not compulsory,
but it's currently the best way we have to respect servers' wishes and
bandwidth constraints. There is even a companion spec for doing so,
which accompanies the main Gemini spec.

gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Read the companion spec for more detail, but you're indeed correct
that bots don't advertise who they are since there's no user-agent.
Instead, we have some agreed-upon crawler categories, like
`researcher`, `indexer`, `archiver`. It sounds like you may want to
respect `researcher` and call it a day :)
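
For illustration, a robots.txt using those virtual user-agents might look
like this (hypothetical paths; the exact semantics are in the companion spec):

```
# Block archivers entirely, keep indexers out of a CGI area,
# and let every other crawler category fetch everything.
User-agent: archiver
Disallow: /

User-agent: indexer
Disallow: /cgi-bin/

User-agent: *
Disallow:
```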

Nat

On Tue, Dec 08, 2020 at 02:36:56PM +0100, Stephane Bortzmeyer wrote:
> I just developed a simple crawler for Gemini. Its goal is not to build
> another search engine but to perform some surveys of the
> geminispace. A typical result will be something like (real data, but
> limited in size):
>
> gemini://gemini.bortzmeyer.org/software/crawler/
>
> I have not yet let it loose on the Internet, because there are
> some questions I have.
>
> Is it "good practice" to follow robots.txt? There is no mention of it
> in the specification but it could work for Gemini as well as for the
> Web and I notice that some programs query this name on my server.
>
> Since Gemini (and rightly so) has no User-Agent, how can a bot
> advertise its policy and a point of contact?

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Tue, Dec 08, 2020 at 09:47:57AM -0500,
 Natalie Pendragon <natpen at natpen.net> wrote 
 a message of 32 lines which said:

> Yes, you should respect robots.txt in my opinion. It's not compulsory,
> but it's currently the best way we have to respect servers' wishes and
> bandwidth constraints.

It is interesting to note that some robots.txt files are quite broken (see
gemini://houston.coder.town/robots.txt).

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Tue, Dec 08, 2020 at 09:47:57AM -0500,
 Natalie Pendragon <natpen at natpen.net> wrote 
 a message of 32 lines which said:

> Yes, you should respect robots.txt in my opinion. It's not compulsory,
> but it's currently the best way we have to respect servers' wishes and
> bandwidth constraints. There is even a companion spec for doing so,
> which accompanies the main Gemini spec.
> 
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

The spec is quite vague about the *order* of directives. For instance,
<gemini://gempaper.strangled.net/robots.txt> is:

User-agent: *
Disallow: /credentials.txt
User-agent: archiver
Disallow: /

The intended semantics is probably to disallow archivers, but my parser
regarded the site as available because it stopped at the first match,
the star. Who is right?

<gemini://gemini.circumlunar.space/docs/companion/robots.gmi> and
<http://www.robotstxt.org/robotstxt.html> are unclear.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 14:43, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> The spec is quite vague about the *order* of directives.

Perhaps of interest:

While by standard implementation the first matching robots.txt pattern 
always wins, Google's implementation differs in that Allow patterns with 
equal or more characters in the directive path win over a matching 
Disallow pattern. Bing uses the Allow or Disallow directive which is the most specific.

In order to be compatible to all robots, if one wants to allow single 
files inside an otherwise disallowed directory, it is necessary to place 
the Allow directive(s) first, followed by the Disallow.

http://en.wikipedia.org/wiki/Robots_exclusion_standard

Also:

https://developers.google.com/search/reference/robots_txt
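
As an illustration of that compatibility advice, an exception for a single
file inside an otherwise blocked directory would be written with the Allow
line first (hypothetical paths):

```
User-agent: *
Allow: /private/index.gmi
Disallow: /private/
```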

Link to individual message.

Côme Chilliet <come (a) chilliet.eu>

I don't see anything in the spec saying to stop at first match. I think
you should read the whole response and apply all lines that match your
virtual user agent.
So in this case, for an archiver, all lines.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 03:00:28PM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 24 lines which said:

> Perhaps of interest:

Not exactly the same thing since my email was about order of
User-Agent (when there is both "*" and "archiver") but, yes, Robot
Exclusion Standard is a mess.

> In order to be compatible to all robots, if one wants to allow
> single files inside an otherwise disallowed directory, it is
> necessary to place the Allow directive(s) first, followed by the
> Disallow.

Note that Allow is not even
standard. <http://www.robotstxt.org/robotstxt.html>: 

there is no "Allow" field.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 03:07:50PM +0100,
 Côme Chilliet <come at chilliet.eu> wrote 
 a message of 4 lines which said:

> I don't see anything in the spec saying to stop at first match. I think
> you should read the whole response and apply all lines that match your
> virtual user agent.
> So in this case, for an archiver, all lines.

Then this example in <http://www.robotstxt.org/robotstxt.html> would
not work:

User-agent: Google
Disallow:
User-agent: *
Disallow: /

Because with your algorithm, Google would be disallowed (while the
comment in the page says "To allow a single robot [Google]").

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 15:09, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> Not exactly the same thing since my email was about order of
> User-Agent (when there is both "*" and "archiver") but, yes, Robot
> Exclusion Standard is a mess.

Right. Perhaps best to look at how things are actually implemented in the wild :)

Given your example -and a user agent of archiver- robots.txt Validator and 
Testing Tool at https://technicalseo.com/tools/robots-txt/ says Disallow.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 15:15, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> Because with your algorithm, Google would be disallowed

An empty Disallow means allow all, no?

Link to individual message.

Philip Linde <linde.philip (a) gmail.com>

On Thu, 10 Dec 2020 14:43:11 +0100
Stephane Bortzmeyer <stephane at sources.org> wrote:

> The spec is quite vague about the *order* of directives. For instance,
> <gemini://gempaper.strangled.net/robots.txt> is:
> 
> User-agent: *
> Disallow: /credentials.txt
> User-agent: archiver
> Disallow: /
> 
> The intended semantics is probably to disallow archivers, but my parser
> regarded the site as available because it stopped at the first match,
> the star. Who is right?

According to the spec, lines beginning with "User-agent:" indicate a
user agent to which subsequent lines apply.

My interpretation is that your example expresses:

For any user agent:
disallow access to /credentials.txt
For archiver user agents:
disallow access to /

It is unclear from that specification alone whether the user agent
applies to all Disallow lines after it, or only until the next
User-agent line. The spec refers to the web standard for robot
exclusion. In the web standard, you can think of 1+ User-agent lines
followed by 1+ Allow/Disallow lines as a single record which specifies
that all the user agents should follow the Allow/Disallow rules
that follow them. For example:

User-agent: archiver
User-agent: search engine
Disallow: /articles
Disallow: /uploads
User-agent: something
User-agent: someother
Disallow: /whatever

This expresses:

For user agent archiver or search engine:
disallow access to /articles,
disallow access to /uploads
For user agent something or someother:
disallow access to /whatever

Refer to: http://www.robotstxt.org/norobots-rfc.txt

The way "robots.txt for Gemini" specifies it is rather confusing. It's
not indicated exactly how it differs from the web robot exclusion
standard and taken alone it is ambiguous.
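
A minimal Python sketch of that record grouping, assuming the robots.txt has
already been fetched as a string (this only implements the grouping described
above, not the precedence question being debated):

```
def parse_records(robots_txt):
    """Group 1+ User-agent lines with the Disallow lines that follow them,
    following the record model of http://www.robotstxt.org/norobots-rfc.txt."""
    records = []                      # list of (agents, disallow_prefixes)
    agents, disallows = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()        # drop comments and padding
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if disallows:                           # a new record starts here
                records.append((agents, disallows))
                agents, disallows = [], []
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records
```

Run on the example above, this yields two records:
(['archiver', 'search engine'], ['/articles', '/uploads']) and
(['something', 'someother'], ['/whatever']); which record a given bot should
then obey is exactly the ambiguity discussed in this thread.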

-- 
Philip

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 03:20:18PM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 8 lines which said:

> > On Dec 10, 2020, at 15:15, Stephane Bortzmeyer <stephane at sources.org> wrote:
> > 
> > Because with your algorithm, Google would be disallowed
> 
> An empty Disallow means allow all, no?

Yes, but the next Disallow adds a restriction to everything.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 02:43:11PM +0100,
 Stephane Bortzmeyer <stephane at sources.org> wrote 
 a message of 26 lines which said:

> The spec is quite vague about the *order* of directives.

Another example of the fact that you cannot rely on robots.txt:
regexps. The official site <http://www.robotstxt.org/robotstxt.html>
is crystal-clear: "Note also that globbing and regular expression are
not supported in either the User-agent or Disallow lines".

But in the wild you find things like
<gemini://drewdevault.com/robots.txt>:

User-Agent: gus
Disallow: /cgi-bin/web.sh?*

Opinion: maybe we should specify a syntax for Gemini's robots.txt,
not relying on the broken Web one?

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 17:24, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> Yes, but the next Disallow add a restriction to everything.

User-agent: Google
Disallow:
User-agent: *
Disallow: /

mwahahaha. a work of beauty indeed. this would seem to read as disallow 
everything, for everyone, but  google. go figure.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 03:18:31PM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 16 lines which said:

> Perhaps best to look at how things are actually implemented in the
> wild :)

I have a big disagreement with this approach. As a matter of principle
(this approach allows big actors to set de facto standards and forces
the others into a race, which is precisely what made the Web the
bloated horror it is) and also because you cannot check every possible
implementation, and, anyway, they disagree among themselves.

So, no, I want a clear specification of what a crawler is supposed to
do.

> Given your example -and a user agent of archiver- robots.txt
> Validator and Testing Tool at
> https://technicalseo.com/tools/robots-txt/ says Disallow.

I was not able to make it work. It keeps telling me that
http://t.example/foo is "Invalid URL" and I find no way to enter an
arbitrary User-Agent. And, anyway, it will not be an official test,
just one implementation with some proprietary extensions.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 17:42, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> So, no, I want a clear specification of what a crawler is supposed to do.

Admirable :D

This is perhaps why academics should not write software. Practically speaking :P

Let's face it... there is none... it's all a gigantic sausage machine... 
haphazardly scotch-taped together... barely holding up, if at all...

That said, circa 2004, this is what Mark Pilgrim had to say about the matter:

Why Specs Matter
https://pwn.ersh.io/notes/why_specs_matter/



Here is a copy for your enjoyment. It never gets old:



Most developers are morons, and the rest are assholes. I have at various 
times counted myself in both groups, so I can say this with the utmost confidence.

Assholes

Assholes read specs with a fine-toothed comb, looking for loopholes, 
oversights, or simple typos. Then they write code that is meticulously 
spec-compliant, but useless. If someone yells at them for writing useless 
software, they smugly point to the sentence in the spec that clearly 
spells out how their horribly broken software is technically correct, and 
then they crow about it on their blogs.

There is a faction of assholes that write test cases. These people are 
good to have around while writing a spec, because they can occasionally be 
managed into channeling their infinite time and energy into finding 
loopholes before the spec is final. Unfortunately, managing assholes is 
even harder and more time-consuming than it sounds. This is why writing 
good specs takes so long: most of the time is frittered away on asshole management.

Morons

Morons, on the other hand, don't read specs until someone yells at them. 
Instead, they take a few examples that they find "in the wild" and write 
code that seems to work based on their limited sample. Soon after they 
ship, they inevitably get yelled at because their product is nowhere near 
conforming to the part of the spec that someone else happens to be using. 
Someone points them to the sentence in the spec that clearly spells out 
how horribly broken their software is, and they fix it.

Besides the run-of-the-mill morons, there are two factions of morons that 
are worth special mention. The first work from examples, and ship code, 
and get yelled at, just like all the other morons. But then when they 
finally bother to read the spec, they magically turn into assholes and 
argue that the spec is ambiguous, or misleading in some way, or ignoreable 
because nobody else implements it, or simply wrong. These people are 
called sociopaths. They will never write conformant code regardless of how 
good the spec is, so they can safely be ignored.

The second faction of morons work from examples, ship code, and get yelled 
at. But when they get around to reading the spec, they magically turn into 
advocates and write up tutorials on what they learned from their mistakes. 
These people are called experts. Virtually every useful tutorial in the 
world was written by a moron-turned-expert.

Angels

Some people would argue that not all developers are morons or assholes, 
but they are mistaken. For example, some people posit the existence of 
what I will call the "angel" developer. "Angels" read specs closely, write 
code, and then thoroughly test it against the accompanying test suite 
before shipping their product. Angels do not actually exist, but they are 
a useful fiction to make spec writers feel better about themselves.

Why specs matter

If your spec isn't good enough, morons have no chance of ever getting 
things right. For everyone who complains that their software is broken, 
there will be two assholes who claim that it's not. The spec, whose 
primary purpose is to arbitrate disputes between morons and assholes, will 
fail to resolve anything, and the arguments will smolder for years.

If your spec is good enough, morons have a fighting chance of getting 
things right the second time around, without being besieged by assholes. 
Meanwhile, the assholes who have nothing better to do than look for 
loopholes won't find any, and they'll eventually get bored and wander off 
in search of someone else to harass.


Link to individual message.

Sudipto Mallick <smallick.dev (a) gmail.com>

On 12/10/20, Stephane Bortzmeyer <stephane at sources.org> wrote:
> Opinion: maybe we should specify a syntax for Gemini's robots.txt,
> not relying on the broken Web one?
Here it is:

'bots.txt' for gemini bots and crawlers.

- know who you are: archiver, indexer, feed-reader, researcher etc.
- ask for /bots.txt
- if 20 text/plain then
-- allowed = set()
-- denied = set()
-- split response by newlines, for each line
--- split by spaces and tabs into fields
---- paths = fields[0] split by ','
---- if fields[2] is "allowed" and you in fields[1] split by ',' then
allowed = allowed union paths
----- if fields[3] is "but" and fields[5] is "denied" and you in
fields[4] split by ',' then denied = denied union paths
---- if fields[2] is "denied" and you in fields[1] split by ',' then
denied = denied union paths
you always match all, never match none
union of paths is special:
    { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }

when you request for path, find the longest match from allowed and
denied; if it is in allowed you're allowed, otherwise not;; when a
tie: undefined behaviour, do what you want.

examples:
default, effectively:
    / all allowed
or
    / none denied
complex example:
    /priv1,/priv2,/login all denied
    /cgi-bin indexer allowed but archiver denied
    /priv1/pub researcher allowed but blabla,meh,heh,duh denied

what do you think?
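
A rough Python sketch of the longest-match rule, assuming the allowed and
denied prefix sets have already been built for the bot as described above
(the names and the tie-breaking choice are illustrative only):

```
def longest_match(path, prefixes):
    """Length of the longest prefix in `prefixes` matching `path`, or -1 if none."""
    return max((len(p) for p in prefixes if path.startswith(p)), default=-1)

def may_fetch(path, allowed, denied):
    """Allowed wins on the strictly longest match; the proposal leaves a tie
    undefined, so this sketch (arbitrarily) errs on the side of fetching."""
    return longest_match(path, allowed) >= longest_match(path, denied)

# With the "complex example" above, an archiver would end up with roughly:
#   allowed = {"/"}                                       (implicit default)
#   denied  = {"/priv1", "/priv2", "/login", "/cgi-bin"}
# so may_fetch("/cgi-bin/search", allowed, denied) is False.
```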

Link to individual message.

Robert khuxkm Miles <khuxkm (a) tilde.team>

December 10, 2020 8:43 AM, "Stephane Bortzmeyer" <stephane at sources.org> wrote:

> - snip -
> 
> The spec is quite vague about the *order* of directives. For instance,
> <gemini://gempaper.strangled.net/robots.txt> is:
> 
> User-agent: *
> Disallow: /credentials.txt
> User-agent: archiver
> Disallow: /
> 
> The intended semantics is probably to disallow archivers, but my parser
> regarded the site as available because it stopped at the first match,
> the star. Who is right?

Not you. The idea is that you start with the most direct User-Agent that 
applies to you (in this case, `archiver`), and then if that doesn't say 
you can't access the file, go up a level (in this case, `*`), and so
on. If you run out of levels and haven't been told you can't access a file 
(or a parent directory of the file) then go ahead and access the file. In 
this case, a web proxy has no specific rules, so it would follow the `*` 
rule and not serve credentials.txt. However, an archiver *does* have 
specific rules: don't access anything, and so the archiver doesn't access anything.

An example in "A Standard For Robot Exclusion" (as archived by 
robotstxt.org[1]) uses an ordering like this and insinuates that the 
intent is to (in this case) allow `cybermapper` to access everything but 
block anyone else from accessing the cyberworld application:

 ```
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
 ```
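
For what it's worth, the robots.txt RFC draft hosted at robotstxt.org resolves
the fallback by saying a robot must obey the first record whose User-agent
matches its name, and fall back to the "*" record only when no named record
matches; that reading gives the intended result both for this example and for
the Google one quoted earlier. A rough Python sketch, assuming records have
already been grouped into (user-agents, disallow-prefixes) pairs:

```
def pick_record(records, agent):
    """First record naming this agent wins; otherwise the first '*' record;
    otherwise None, meaning no restrictions apply."""
    agent = agent.lower()
    for agents, disallows in records:
        if any(agent in a.lower() for a in agents):   # substring, case-insensitive
            return disallows
    for agents, disallows in records:
        if "*" in agents:
            return disallows
    return None

def can_fetch(path, records, agent):
    disallows = pick_record(records, agent)
    if disallows is None:
        return True
    # An empty Disallow value means "allow everything" and is skipped here.
    return not any(d and path.startswith(d) for d in disallows)
```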

Just my 2 cents,
Robert "khuxkm" Miles

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Tue, Dec 08, 2020 at 03:05:42PM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 14 lines which said:

> Like so?
> 
> gemini://example.com/robots.txt?user-agent=DetailsOfTheRobot ?

Good idea, this is what I do now.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 11:37:34PM +0530,
 Sudipto Mallick <smallick.dev at gmail.com> wrote 
 a message of 40 lines which said:

> 'bots.txt' for gemini bots and crawlers.

Interesting. The good thing is that it moves away from robots.txt
(underspecified, full of variants, impossible to know what a good bot
should do).

> - know who you are: archiver, indexer, feed-reader, researcher etc.
> - ask for /bots.txt
> - if 20 text/plain then
> -- allowed = set()
> -- denied = set()
> -- split response by newlines, for each line
> --- split by spaces and tabs into fields
> ---- paths = fields[0] split by ','
> ---- if fields[2] is "allowed" and you in fields[1] split by ',' then
> allowed = allowed union paths
> ----- if fields[3] is "but" and fields[5] is "denied" and you in
> fields[4] split by ',' then denied = denied union paths
> ---- if fields[2] is "denied" and you in fields[1] split by ',' then
> denied = denied union paths
> you always match all, never match none
> union of paths is special:
>     { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
> 
> when you request for path, find the longest match from allowed and
> denied; if it is in allowed you're allowed, otherwise not;; when a
> tie: undefined behaviour, do what you want.

It seems perfect.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 09:44:50PM +0000,
 Robert "khuxkm" Miles <khuxkm at tilde.team> wrote 
 a message of 24 lines which said:

> Not you. The idea is that you start with the most direct User-Agent
> that applies to you (in this case, `archiver`), and then if that
> doesn't say you can't access the file, go up a level (in this case,
> `*`),

Reasonable interpretation (more-specific to less-specific). Too bad
the "standard" is so vague.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 11:37:34PM +0530,
 Sudipto Mallick <smallick.dev at gmail.com> wrote 
 a message of 40 lines which said:

> - ask for /bots.txt

Speaking of this, I suggest it could be better to have a /.well-known
directory (or equivalent) in which to put all these "meta" files. The Web
does it (RFC 5785) and it's cool since it avoids colliding with "real"
resources. (Also, crawling the geminispace shows strange robots.txt files
which are probably "wildcards" or "catchalls", created by a program
which replies for every possible path. Having a /.well-known would
make it possible to define an exception.)

It requires no change in clients (except bots) or servers, it is just
a convention.

=> gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5785.txt RFC 5785 
"Defining Well-Known URIs"

Meta-remark: is there a place with all the "Gemini good practices" or
"Gemini conventions", which do not change the protocol or the format
but are useful?

Link to individual message.

Côme Chilliet <come (a) chilliet.eu>

On Friday 11 December 2020 at 09:26:54 CET, Stephane Bortzmeyer wrote:
> > - know who you are: archiver, indexer, feed-reader, researcher etc.
> > - ask for /bots.txt
> > - if 20 text/plain then
> > -- allowed = set()
> > -- denied = set()
> > -- split response by newlines, for each line
> > --- split by spaces and tabs into fields
> > ---- paths = fields[0] split by ','
> > ---- if fields[2] is "allowed" and you in fields[1] split by ',' then
> > allowed = allowed union paths
> > ----- if fields[3] is "but" and fields[5] is "denied" and you in
> > fields[4] split by ',' then denied = denied union paths
> > ---- if fields[2] is "denied" and you in fields[1] split by ',' then
> > denied = denied union paths
> > you always match all, never match none
> > union of paths is special:
> >     { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
> > 
> > when you request for path, find the longest match from allowed and
> > denied; if it is in allowed you're allowed, otherwise not;; when a
> > tie: undefined behaviour, do what you want.
> 
> It seems perfect.

I guess I'm not the only one needing some examples to fully understand how 
this would work?

If I get it, it's something like this:
path1,path2 archiver,crawler allowed but path3 denied
path4 * denied

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 11, 2020, at 11:16, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
>  I suggest it could be better to have a /.well-known

+1 for the /.well-known convention.

This was mentioned several times previously, but inertia is strong with 
that one. Go figure.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Fri, Dec 11, 2020 at 11:18:00AM +0100,
 Côme Chilliet <come at chilliet.eu> wrote 
 a message of 33 lines which said:

> I guess I'm not the only one needing some examples to fully understand
> how this would work?

There were examples at the end of the original message.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Fri, Dec 11, 2020 at 11:25:31AM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 11 lines which said:

> This was mentioned several time previously, but inertia is strong
> with that one. Go figure.

This is one of the interesting things with Gemini: it is a social
experience, more than a technical one. I love observing how Gemini
governance works (or fails).

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 11, 2020, at 11:31, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> This is one of the interesting things with Gemini: it is a social
> experience, more than a technical one. I love observing how Gemini
> governance works (or fails).

Huis clos meets Lord of the Flies? :D

Link to individual message.

Côme Chilliet <come (a) chilliet.eu>

On Friday 11 December 2020 at 11:16:05 CET, Stephane Bortzmeyer wrote:
> Meta-remark: is there a place with all the "Gemini good practices" or
> "Gemini conventions", which do not change the protocol or the format
> but are useful?

Some are listed in gemini://gemini.circumlunar.space/docs/

So I'd expect this page to gather all validated specifications and 
best/common practices/conventions.

I also started gemini://gemlog.lanterne.chilliet.eu/specifications.en.gmi 
to list actual specifications and proposals without mixing them with best 
practice documents.

Côme

Link to individual message.

Sudipto Mallick <smallick.dev (a) gmail.com>

What I wrote was a rough algorithm; here is a human-readable
description for bots.txt:

every line has the following format:

    <paths> <bots> ("allowed" | "denied")
OR
    <paths> <bots> "allowed" "but" <bots> "denied"
<paths> is a comma-separated list of paths to be allowed or denied
<bots> is a comma-separated list of bot ''descriptors'' (think of a better word
for this) matching [A-Za-z][A-Za-z0-9_-]*

is the case


an ideal bot creates a set of allowed and denied paths for itself from its
real and virtual user agent and the "all" group.
before requesting a path, this ideal bot finds the longest match
from both the allowed and denied path sets. if the longest match is
from the allowed set, it proceeds to request that path. if both sets
have an equally long match, then follow the most specific match of the
"descriptor" (name of bot > virtual agent > "all")
for example:

    /a,/p all denied
    /a/b,/p/q indexer,researcher allowed
    /a/b/c researcher denied
    /a/b/c heh allowed

now the researcher 'heh' may access /p/q/* and /a/b/c, and it may
not access /a/b/{X} when {X} != 'c'
other researchers may only access /p/q and /a/b/{Z} when {Z} != 'c' so
they may not access /a/b/c
indexers may access /a/b and /p/q


Q. do we need to support comments in bots.txt?

Link to individual message.

Sudipto Mallick <smallick.dev (a) gmail.com>

> for example:
>
>     /a,/p all denied
>     /a/b,/p/q indexer,researcher allowed
>     /a/b/c researcher denied
>     /a/b/c heh allowed
>
now the researcher 'heh' may access /p/q/* and /a/b/*
> and it may not access /a/b/{X} when {X} != 'c'
err. sorry. that should be: may not access /a/{X} when {X} != 'b' and
/p/{Y} when {Y} != 'q' (for all indexers and researchers, hmm.)
everyone other than researchers and indexers may not access /a/* and /p/*

> other researchers may only access /p/q and /a/b/{Z} when {Z} != 'c' so
> they may not access /a/b/c
indexers may access /a/b/* and /p/q/*
ah.

Link to individual message.

colecmac@protonmail.com <colecmac (a) protonmail.com>

(Sorry if this is the wrong place to reply.)

Why are we defining new standards and filenames? bots.txt, .well-known, etc.
We don't need this.

Gemini is based around the idea of radical familiarity. Creating a new robots
standard breaks that, and makes things more complicated. There are existing
complete robots.txt standards, are there not? I admit I'm not well-versed in
this, but let's just pick a standard that works and make it official.

After doing some quick research, I found that Google has submitted a draft
spec for robots.txt to the IETF. The original draft was submitted on July 07,
2019, and the most recent draft was submitted ~3 days ago, on the 8th.

https://developers.google.com/search/reference/robots_txt
https://tools.ietf.org/html/draft-koster-rep-04

I am no big fan of Google, but they are the kings of crawling and it makes sense
to go with them here.

The spec makes many example references to HTTP, but note that it is fully
protocol-agnostic, so it works fine for Gemini.


makeworld

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 11, 2020, at 21:38, colecmac at protonmail.com wrote:
> 
> We don't need this.

hmmm? who are you again? rhetorical.

Link to individual message.

Leo <list (a) gkbrk.com>

> Why are we defining new standards and filenames? bots.txt, .well-known,
> etc.

Just want to point out that the .well-known path for machine-readable
data about an origin is a proposed IETF standard that has relatively
widespread use today. It is filed under RFC8615, and is definitely not a
standard that was invented in this thread.

The first paragraph of the introduction even references the robots file.

While I don't necessarily agree with the naming of bots.txt, I see no
problem with putting these files under a .well-known directory.

> We don't need this.

Thanks for making this mailing list a lot more efficient and talking
about what the Gemini community needs in a 4-word sentence.

Even if the original path of /robots.txt is kept, I think it makes sense
to clarify the algorithm in unambiguous steps in order to get rid of
the disagreements in edge cases.

> let's just pick a standard that works and make it official.

The point is that the standard works for simple cases, but leaves a lot
to be desired when it comes to clarifying more complex cases. This
results in a lot of robots.txt implementations disagreeing about what is
allowed and not allowed.

Additionally by crawling the Web, you can see that people tend to extend
robots.txt in non-standard ways and this only gets incorporated into
Google's crawlers if the website is important enough.

> I am no big fan of Google, but they are the kings of crawling and it
> makes sense to go with them here.

The kings of crawling deemed HTTP and AMP the most suitable protocol and
markup format for content to be crawled, why don't we stop inventing
standards like Gemini and Gemtext and go with them here.

> The spec makes many example references to HTTP, but note that it is
> fully protocol-agnostic, so it works fine for Gemini.

Gemtext spec makes references to Gemini, but it is fully
protocol-agnostic, so it works fine with HTTP. Similarly, Gemini makes
many references to Gemtext, but it is content-type agnostic so it works
fine with HTML. But we thought we could be better off shedding
historical baggage and reinvented not one, but two main concepts of the
traditional Web.

--
Leo

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Fri, Dec 11, 2020 at 09:20:25AM +0100,
 Stephane Bortzmeyer <stephane at sources.org> wrote 
 a message of 10 lines which said:

> > Like so?
> > 
> > gemini://example.com/robots.txt?user-agent=DetailsOfTheRobot ?
> 
> Good idea, this is what I do now.

Note that it is not guaranteed to work because of broken (IMHO)
servers. For instance,
<gemini://alexschroeder.ch/robots.txt?robot=true> redirects to
<gemini://alexschroeder.ch/page/robots.txt?robot=true> which returns a
code 50 :-(

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Fri, Dec 11, 2020 at 08:38:12PM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 29 lines which said:

> Gemini is based around the idea of radical familiarity. Creating a
> new robots standard breaks that, and makes things more
> complicated. There are existing complete robots.txt standards

The problem is precisely the final S. I can add that most of the
"standards" are poorly written and very incomplete.

> but let's just pick a standard that works and make it official.

OK, that's fine with me.

> https://tools.ietf.org/html/draft-koster-rep-04

(Better to indicate the URL without the version number, to get the
latest version.)

It seems well-specified, although quite rich, so more difficult to
implement.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 12, 2020, at 15:28, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> Note that it is not guaranteed to work because of broken (IMHO)
> servers. For instance,
> <gemini://alexschroeder.ch/robots.txt?robot=true> redirects to
> <gemini://alexschroeder.ch/page/robots.txt?robot=true> which returns a
> code 50 :-(

Bummer. But to be expected. Especially from Alex, who likes to, hmmm, experiment wildly :)

Still seems to be a reasonable approach. Sean mentioned using #fragment to 
the same effect.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Sat, Dec 12, 2020 at 05:52:39PM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 17 lines which said:

> But to be expected.

I had to add blacklist support in my crawler to explicitly exclude
some... creative capsules.

> Sean mentioned using #fragment to the same effect.

That would be wrong since URI fragments are not sent to the server,
unlike queries.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 11, 2020, at 21:38, colecmac at protonmail.com wrote:
> 
> I am no big fan of Google, but they are the kings of crawling and it makes sense
> to go with them here.

Interesting read:

Google's Got A Secret
https://knuckleheads.club

TL;DR: There Should Be A Public Cache Of The Web

Link to individual message.

colecmac@protonmail.com <colecmac (a) protonmail.com>

On Saturday, December 12, 2020 4:46 AM, Leo <list at gkbrk.com> wrote:

> > Why are we defining new standards and filenames? bots.txt, .well-known,
> > etc.
>
> Just want to point out that the .well-known path for machine-readable
> data about an origin is a proposed IETF standard that has relatively
> widespread use today. It is filed under RFC8615, and is definitely not a
> standard that was invented in this thread.

Yes, I'm aware of .well-known. I meant that using it to hold a robots.txt-type
file would be a new filepath.

> The first paragraph of the introduction even references the robots file.
>
> While I don't necessarily agree with the naming of bots.txt, I see no
> problem with putting these files under a .well-known directory.

My only problem was that it would be reinventing something that already
exists, but I likewise have no issue with the idea of .well-known in general.

> > We don't need this.
>
> Thanks for making this mailing list a lot more efficient and talking
> about what the Gemini community needs in a 4-word sentence.

Sorry, perhaps that was too curt. I don't intend to speak for everyone,
it's only my opinion. However it wasn't just a 4-word sentence, I backed
up my opinion with the rest of my email.

> Even if the original path of /robots.txt is kept, I think it makes sense
> to clarify an algorithm in non-ambiguous steps in order to get rid of
> the disagreements in edge-cases.
>
> > let's just pick a standard that works and make it official.
>
> The point is that the standard works for simple cases, but leaves a lot
> to be desired when it comes to clarifying more complex cases. This
> results in a lot of robots.txt implementations disagreeing about what is
> allowed and not allowed.

I agree, that's why I was trying to pick a standard instead of developing
our own. Picking a well-written standard will cover all these cases.

> Additionally by crawling the Web, you can see that people tend to extend
> robots.txt in non-standard ways and this only gets incorporated into
> Google's crawlers if the website is important enough.

Ok? I don't see how that's relevant. By picking a standard here and sticking
to it, we can avoid that.

> > I am no big fan of Google, but they are the kings of crawling and it
> > makes sense to go with them here.
>
> The kings of crawling deemed HTTP and AMP the most suitable protocol and
> markup format for content to be crawled, why don't we stop inventing
> standards like Gemini and Gemtext and go with them here.

Again I don't see how that's relevant. Yes "we" don't like AMP here, yes
we don't like HTTP, etc. This doesn't make the robots.txt standard I sent
a bad one.

> > The spec makes many example references to HTTP, but note that it is
> > fully protocol-agnostic, so it works fine for Gemini.
>
> Gemtext spec makes references to Gemini, but it is fully
> protocol-agnostic, so it works fine with HTTP. Similarly, Gemini makes
> many references to Gemtext, but it is content-type agnostic so it works
> fine with HTML. But we thought we could be better off shedding
> historical baggage and reinvented not one, but two main concepts of the
> traditional Web.
>

This quote, along with this line:

> why don't we stop inventing standards like Gemini and Gemtext and go with
> them [Google] here.

makes me think you misunderstand some of Gemini's ideas. Solderpunk has
talked about the idea of "radical familiarity" on here, and how Gemini uses
well known (ha) protocols as much as possible, like TLS, media types, URLs,
etc. Gemini tries to *avoid* re-inventing! Obviously the protocol itself
and gemtext are the exceptions to this, but otherwise, making niche protocols
only for Gemini is not the way to go (in my opinion). It makes things harder
to implement and understand for developers.


Cheers,
makeworld

Link to individual message.

---

Previous Thread: Three possible uses for IRIs

Next Thread: Some reading on IRIs and IDNs