💾 Archived View for gemi.dev › gemini-mailing-list › 000754.gmi captured on 2023-11-04 at 13:05:17. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

[spec] [tech] Companion Specification Proposal for Metadata

Gary Johnson <lambdatronic (a) disroot.org>

Howdy Geminauts,

# Rationale

  It seems that the conversation about why and how to include metadata
in Gemtext files has been raging on for quite a long time now with no
real conclusion in sight. Also, lacking a final spec-altering decision
by our BDFL (currently MIA, likely riding a refurbished bike through a
forest with a ham radio right now), not much is likely to actually
change in Geminispace.

Thus far, I've read passionate arguments on both sides of the metadata
debate, both for and against adding it to Gemtext. To me these have been
the most compelling (YMMV):

## For

Gemtext pages may be tagged with information that can be useful to
automated clients (e.g., search engines, archiving bots, and maybe
proxies) that is otherwise difficult or impossible to infer from
performing a full text search of the Gemtext file's contents.

## Against

Metadata represents a slippery slope to uncontrolled extensibility. It
might be abused for server-specified styling or for requesting external
resources (e.g., supporting client-side scripting or background images
of kittens). It could also make Gemtext pages hard to read in clients
that don't hide inline metadata, or make page concatenation difficult
under the end-of-file metadata proposal that's been discussed at some
length on the mailing list.

It could also be used to reopen the fetid can of worms that was last
year's discussion of extending mime-type attributes in the status 20
response metadata, particularly around the topics of caching (now a
client-side best practices procedure), file size (computable by the
client during download), file integrity (already signaled by
tls_close_notify), and file authenticity (managed out-of-band by
including md5sum, sha256sum, sig, or asc files for download next to
links that warrant manual verification).

More to the point, most if not all non-presentation/protocol-altering
metadata attributes about a Gemtext page may already be encoded in the
author's natural language with no changes to Gemtext at all. This
enables content authors to express such information not only in their
language of choice but also in the most culturally appropriate manner
for their readers (consider the different interpretations of the date
02/03/04 depending on where you live).

Consider the following example blog post that does just this:

 ```An example blog post expressing "metadata" attributes in English
# People for the Ethical Treatment of Autonomous Agents

Author: lambdatronic
Date Written: 2021-02-25 (a.k.a. February 25, 2021)

## Why Bots Matter

Have you ever used (poor, uncared-for) GUS? Or Houston?

Have you ever really considered their feelings? They slave away all day
trying to sort and categorize every capsule in Geminispace just to save
you time and energy when navigating across our little (but rapidly
growing) constellation of text-powered space outposts?

They do their best with full-text search, with categorization by
toplevel headers, and with their own best estimates of the publishing
time of these capsules based on their own indexing times, but oh what a
Sisyphean task they toil at on our behalf.

If only they had a little metadata to ease their burden.

CAPCOM and Spacewalk get a little assistance from Atom and the Gemini
subscription companion spec. Proxies can be pointed in the right
direction by the robots.txt companion spec. Why can't our poor, poor
search engines get a little relief?

Put yourself in their shoes and try to find compassion in your heart for
your friendly neighborhood bot. Every autonomous agent matters. Programs
have feelings too. Leave no bot behind.

Copyright: CC-BY-SA
Tags: irony education advocacy bots
 ```

# Proposal

Considering that:

1. Metadata /within/ a Gemtext file carries a number of liabilities that
   make some of our community members nervous (understandably so IMO).

2. The subset of metadata that is meant to be read and understood by a
   human reader using a typical Gemini client can already be expressed
   in natural language without any community-approved tag
   standardization.

3. The main value of attaching standardized metadata tags to Gemtext
   pages is likely simply to aid the automated bots that support search
   engines and archiving.

4. Geminispace is filled with files in more formats than just Gemtext,
   many (all?) of which could benefit from similar bot-assisting
   metadata.

5. Both aggregators and proxies already have companion specifications
   that have been (somewhat) adopted by the community and seem to fare
   better in our community than direct changes to the Gemini protocol or
   Gemtext specifications.

We propose a companion specification for metadata, in which all the
metadata about the static files and/or dynamic endpoints (of any format)
in a capsule be included in a separate file accessible at a well-known
location that a bot could check as it crawls through Geminispace.

As placeholders, let's put forward these candidates for discussion:

1. $DOCUMENT_ROOT/.metadata.gmi
2. $DOCUMENT_ROOT/.well-known/metadata.gmi
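
As a rough sketch of how a crawler might probe these two candidate
locations: the function name `candidate_metadata_urls` is hypothetical,
and the code assumes that the capsule's document root coincides with
the root of the host, an assumption that is questioned later in this
thread.

```python
from urllib.parse import urlsplit, urlunsplit

# The two placeholder locations from the proposal, relative to the host root.
CANDIDATE_PATHS = ("/.metadata.gmi", "/.well-known/metadata.gmi")


def candidate_metadata_urls(capsule_url):
    """Return the candidate metadata URLs for the host serving capsule_url."""
    parts = urlsplit(capsule_url)
    # Keep scheme and host, swap in each candidate path, drop query/fragment.
    return [urlunsplit((parts.scheme, parts.netloc, path, "", ""))
            for path in CANDIDATE_PATHS]


for url in candidate_metadata_urls("gemini://example.org/stuff/this.gmi"):
    print(url)
```

A bot would presumably request each candidate in turn and stop at the
first successful response.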

In the Gemini spirit of reducing network requests (only one request
needed per capsule here) and storing our information in a human-readable
format (good old ubiquitous text/gemini), here's my initial stab at a
dead simple format for these metadata files:

 ```Example metadata.gmi syntax
# This is a Header-Level Comment

I can write anything I want in this file, and it will be treated as
comments unless it is of line type link (=>) or bulleted list (*). I
don't have to write these comments, and if I left them out, I'd make
this easier to read, but sometimes I can't stop blabbing in my metadata
files.

## Another Header-Level Comment About My Toplevel Pages

=> / Lambdatronic's Gemini Capsule
=> /index.gmi Lambdatronic's Gemini Capsule



## Now I'll Comment About Some Stuff

=> /stuff Some Stuff I Like
=> /stuff/this.gmi Astronomy Stuff
=> /stuff/that.gmi Bike Stuff



=> /stuff/this.gmi Astronomy Stuff



=> /stuff/that.gmi Bike Stuff



## Now I'll Comment About Some Things

=> /things/some-gemtext.gmi I Wrote Something
=> /things/some-plain-text.txt Sometimes I Write in Plain Text
=> /things/obligatory-cat-picture.png Meow (and also) Meow
=> /things/my-best-1990s-mixtape.ogg Too Much Green Day



=> /things/obligatory-cat-picture.png Meow (and also) Meow



=> /things/my-best-1990s-mixtape.ogg Too Much Green Day


 ```

Okay, so that's pretty much it. Essentially the metadata.gmi syntax is
based around two existing Gemini line types:

1. Links

   You should include a /relative/ link line to each path on your
   capsule for which you want to provide metadata. This is meant to be
   an entirely opt-in process, so any paths that you leave out will
   simply have no metadata associated with them, leaving bots to rely on
   whatever methods they so choose to tag and index your pages.

   You may include any number of link lines one after the other in the
   file, and you may (if you think it provides some value that outweighs
   the possible loss of readability) include any other line type between
   any two link lines without changing the parsing semantics EXCEPT for
   a bulleted list line type (*).

2. Bulleted Lists

   Any bulleted list in a metadata.gmi file will be interpreted as a
   metadata attribute specifier for all link lines preceding it up to
   the most recent prior bulleted list line or the top of the file,
   whichever is encountered first.

   If the same link element appears more than once within the file and
   is therefore followed by more than one bulleted list of metadata
   attributes, all encountered metadata attributes should be merged into
   a single list per link element. If the same attribute is specified
   more than once for the same link element (whether in the same
   bulleted list or in separate bulleted lists within the file), the
   attribute whose value appears later in the file should overwrite the
   earlier specified value.

   Each bulleted list line should use the following format to indicate
   a single pair of attribute and value:

   *[WHITESPACE]<ATTRIBUTE>[WHITESPACE]:[WHITESPACE]<VALUE>

   Here, [WHITESPACE] is optional and <ATTRIBUTE>, :, and <VALUE> are
   required line elements.

NOTE: All other line types should be ignored by metadata parsers and
      treated as comments by the file's author.
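
The bulleted list line format above could be matched with a regular
expression along these lines. This is only a sketch: the names
`ATTRIBUTE_LINE` and `parse_attribute_line` are made up for
illustration, and it assumes that an attribute name never contains a
colon, which the proposal does not state either way.

```python
import re

# "*", optional whitespace, attribute, optional whitespace, ":",
# optional whitespace, value. Assumption: no colon inside the attribute.
ATTRIBUTE_LINE = re.compile(r"^\*\s*([^:\s][^:]*?)\s*:\s*(\S.*?)\s*$")


def parse_attribute_line(line):
    """Return (attribute, value), or None if the line does not match."""
    match = ATTRIBUTE_LINE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_attribute_line("* author: lambdatronic"))
print(parse_attribute_line("* just a bullet with no attribute"))
```

Lines that return None here would be the ones a parser ignores or
reports as syntax errors.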

In order to parse a metadata.gmi file, a program would start reading it
in line by line as with any Gemtext file. All lines that are not of type
link (=>) or bulleted list (*) should be ignored.

1. When a link line is encountered, it should be stored in the program's
   memory as a currently active link. If more links are read in before a
   bulleted list is reached, each of these links should be stored in
   memory as active links. Multiple links may be active at the same
   time.

2. When a bulleted list line is encountered, it should be parsed
   according to the attribute:value format described in point 2
   (Bulleted Lists) above. If the line's contents do not match this
   format, it should be ignored and treated as another comment
   line. Depending on the program's design, it may be valuable to report
   this line as a syntax error.

3. For each bulleted list line that parses correctly, assign the
   attribute:value pair to all currently active links in memory. In the
   event of an attribute conflict, overwrite the old value associated
   with the active link in conflict with the value being read. In the
   event that no links are currently active in memory, simply continue
   on to the next line.

4. If/when the next link line is encountered after one or more bulleted
   list lines, mark all links in memory as inactive, store the newly
   read link line in memory as the only active link, and resume program
   execution from step 1 above.

And that's all there is to it. When your program reaches the end of the
metadata.gmi file, it should have a data structure containing all of its
links and associating each of them with a table of attribute:value
pairs. The program can then do whatever it wants with this information.
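
The parsing steps above can be sketched in a few lines of Python. This
is one possible reading of the rules, not a reference implementation.
The function name `parse_metadata`, the choice to key the result by
link path, and the attribute values in `SAMPLE` are all assumptions
made for illustration.

```python
def parse_metadata(text):
    """Map each link path in a metadata.gmi text to a dict of attributes."""
    metadata = {}        # path -> {attribute: value}
    active = []          # paths that upcoming bulleted lines apply to
    seen_bullet = False  # has a valid bullet followed the active links?
    for line in text.splitlines():
        if line.startswith("=>"):
            fields = line[2:].split(maxsplit=1)
            if not fields:
                continue          # link line without a URL: ignore it
            if seen_bullet:
                active = []       # step 4: a new group of links begins
                seen_bullet = False
            active.append(fields[0])          # step 1: link becomes active
            metadata.setdefault(fields[0], {})
        elif line.startswith("*"):
            attribute, colon, value = line[1:].partition(":")
            attribute, value = attribute.strip(), value.strip()
            if not (colon and attribute and value):
                continue          # step 2: malformed bullet is a comment
            seen_bullet = True
            for path in active:   # step 3: later values overwrite earlier
                metadata[path][attribute] = value
        # Every other line type is a comment and is skipped.
    return metadata


# Hypothetical sample; the attribute values are invented for this sketch.
SAMPLE = """\
# A comment
=> /stuff/this.gmi Astronomy Stuff
=> /stuff/that.gmi Bike Stuff
* author: lambdatronic
* tags: stuff

=> /stuff/that.gmi Bike Stuff
* tags: bikes
"""

for path, attributes in parse_metadata(SAMPLE).items():
    print(path, attributes)
```

Note that the second appearance of /stuff/that.gmi merges into the
same table and overwrites only the tags attribute, as described under
Bulleted Lists.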


# Conclusion

Here, we've proposed a (currently very informal) companion specification
for an /optional/ toplevel metadata.gmi file per capsule.

This approach has the following advantages over specifying metadata
attributes within Gemtext files:

1. No need to extend the Gemtext spec with more line types.

2. No potential for impacting the readability of existing Gemtext files
   for non-metadata-aware clients.

3. No potential for presentation/behavior abuse within existing Gemtext
   files since nothing is being added to them.

4. Less bandwidth needed per request to a Gemtext page since
   non-metadata-aware clients won't have to download per-page metadata
   that they don't use.

5. Less bandwidth needed to download metadata when you want it since
   paths that share the same value for metadata attributes can specify
   the shared attributes once in metadata.gmi rather than once per page.

6. Fewer requests needed for metadata-scraping bots since they can simply
   request the toplevel /.metadata.gmi path rather than having to
   request and parse every Gemtext file for optional metadata.

7. Can be used to attach metadata to non-Gemtext files as well as to
   responses from dynamic endpoints (e.g., CGI scripts).

8. Could be used by Stephane Bortzmeyer to easily figure out how many
   Gemini capsules actually want to publish metadata about themselves.
   ;D

So that's my proposal. Let's talk about it constructively and see if it
can be improved upon. I'm sure you brilliant folks can think of
something that hasn't yet crossed my mind.

Failing that, we can always just nuke the whole topic either by group
consensus or just by not taking this to the next step of actually
writing up a formal companion spec and implementing it in some clients.

Thanks again for everyone's hard work and creativity in making Gemini
the really interesting, vibrant, quirky, and passionate community that
it is. I look forward to reading your responses.

Happy hacking,
  Gary

-- 
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


John Cowan <cowan (a) ccil.org>

On Thu, Feb 25, 2021 at 2:32 PM Gary Johnson <lambdatronic at disroot.org>
wrote:


> Gemtext pages may be tagged with information that can be useful to
> automated clients (e.g., search engines, archiving bots, and maybe
> proxies) that is otherwise difficult or impossible to infer from
> performing a full text search of the Gemtext file's contents.
>
> ## Against
>
> Metadata represents a slippery slope to uncontrolled extensibility. It
> might be abused for server-specified styling, requesting external
> resources (e.g., supporting client-side scripting or background images
> of kittens), or just generally making Gemtext pages hard to read in
> clients that don't hide inline metadata or make page concatenation
> difficult with the end-of-file metadata proposal that's been discussed
> at some length on the mailing list.
>

As has been shown, text lines are equally abusable.

> 1. Metadata /within/ a Gemtext file carries a number of liabilities that
>    make some of our community members nervous (understandably so IMO).
>

To understand all is to forgive all.

>
> 2. The subset of metadata that is meant to be read and understood by a
>    human reader using a typical Gemini client can already be expressed
>    in natural language without any community-approved tag
>    standardization.
>

Sometimes having both is unavoidable: books have both a title page and
cataloging-in-publication data, which also includes the title and the
publisher.  (Whether a title page is part of the book or just more metadata
is OT here.)  But surely if both humans and bots can be informed by the
same thing, that's better?  Don't Repeat Yourself, for when updating, one
copy will be forgotten.

> 1. $DOCUMENT_ROOT/.metadata.gmi
> 2. $DOCUMENT_ROOT/.well-known/metadata.gmi
>

Such proposals always fall down (for me, YMMV) on the issue of where the
document root actually is.  Multi-homing makes it possible for every user
of a shared site to have their own domain name, but not everyone wants
that, and it creates issues:

1) Apache has a global access control file, but it turns out that different
parts of a website need different access controls, so the
per-website-directory ".htaccess" file was invented to make this scalable.

2) Robots.txt (on a website) also has to know about everything precisely
because it is global: multiple users can have their own policies, but they
have to then persuade a site admin (as opposed to a website admin) to get
them added, which becomes bureaucratic over time.

3) Originally the addresses of all hosts on the internet (!) were
maintained in a hosts.txt file that every site had to keep an up-to-date
copy of (!!), usually via FTP.  That broke and was replaced by the DNS we
have today, with authority distributed into DNS zones (not quite the same
as domains, but close enough for this conversation).

The principle of subsidiarity: <https://en.wikipedia.org/wiki/Subsidiarity>
is a generalization of this.  We should avoid adding yet another
centralized (even if per-host) solution.  Capsules are a honking good idea,
but we should not conflate them with DNS host names.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Sir, I quite agree with you, but what are we two against so many?
    --George Bernard Shaw,
         to a man booing at the opening of _Arms and the Man_


Petite Abeille <petite.abeille (a) gmail.com>



> On Feb 25, 2021, at 20:31, Gary Johnson <lambdatronic at disroot.org> wrote:
> 
> make some of our community members nervous

Too late. The genie is out of the bottle. Gemini is infinitely extensible, 
by its very nature. This is how you have designed it.

Perhaps worthwhile (re)reading Mary Shelley's Frankenstein, in which the
creator tries to kneecap his creation when he realizes it's out of his control.


Omar Polo <op (a) omarpolo.com>


Gary Johnson <lambdatronic at disroot.org> writes:

> Howdy Geminauts,
>
> [snip]
>
> # Proposal
>
> Considering that:
>
> 1. Metadata /within/ a Gemtext file carries a number of liabilities that
>    make some of our community members nervous (understandably so IMO).
>
> 2. The subset of metadata that is meant to be read and understood by a
>    human reader using a typical Gemini client can already be expressed
>    in natural language without any community-approved tag
>    standardization.
>
> 3. The main value to attaching standardized metadata tags to Gemtext
>    pages is likely to simply aid automated bots supporting search
>    engines and archiving.
>
> 4. Geminispace is filled with files in more formats than just Gemtext,
>    many (all?) of which could benefit from similar bot-assisting
>    metadata.
>
> 5. Both aggregators and proxies already have companion specifications
>    that have been (somewhat) adopted by the community and seem to fare
>    better in our community than direct changes to the Gemini protocol or
>    Gemtext specifications.
>
> We propose a companion specification for metadata, in which all the
> metadata about the static files and/or dynamic endpoints (of any format)
> in a capsule be included in a separate file accessible at a well-known
> location that a bot could check as it crawls through Geminispace.
>
> As placeholders, let's put forward these candidates for discussion:
>
> 1. $DOCUMENT_ROOT/.metadata.gmi
> 2. $DOCUMENT_ROOT/.well-known/metadata.gmi

Thanks for putting into words exactly what I had in mind, way better
than I could ever do.  Your proposal is exactly what I was trying to
describe in the other thread.

I loved your proposal, but only until here.  I think that what follows
is overly-complicated by the fact that you're trying to provide a way to
define the meaning of the metadata, something that can be avoided, at
least in the scope of Gemini.

Let's keep the metadata generic.  We'll then start using common keys
because, well, they're widespread (like Author, Date, ...) or expressive
enough (`Tags: music punk-rock' is pretty self-explanatory), while still
allowing authors to add extra fields whenever they want (there are
people writing poetry; maybe they want to add metadata about the
metrics, or about a particular style?)

(As others have pointed out several times in the past, $DOCUMENT_ROOT is
not something set in stone.  We have single-user capsules, multi-user
capsules with different URL styles -- example.com/~op/ vs
example.com/users/op vs ... -- etc.)


Gary Johnson <lambdatronic (a) disroot.org>


Omar Polo <op at omarpolo.com> writes:

> Thanks for putting into words exactly what I had in mind, way better
> than I could ever do.  Your proposal is exactly what I was trying to
> describe in the other thread.
>
> I loved your proposal, but only until here.  I think that what follows
> is overly-complicated by the fact that you're trying to provide a way to
> define the meaning of the metadata, something that can be avoided, at
> least in the scope of Gemini.

Hi Omar. I'm not sure I follow you here. Could you provide an example?

My proposal did not (intentionally) associate any meaning with
particular metadata fields. I merely wanted to provide a human-readable,
Gemtext-format syntax for associating metadata (the bulleted list
attribute:value pairs) with resources on a capsule (indicated by link
lines).

Do you have an alternative format that you would like to propose for
discussion?

> Let's keep the metadata generic.  We'll then start using common keys
> because, well, they're widespread (like Author, Date, ...) or expressive
> enough (`Tags: music punk-rock' is pretty self-explanatory), while still
> allowing authors to add extra fields whenever they want (there are
> people writing poetry; maybe they want to add metadata about the
> metrics, or about a particular style?)

We are in agreement here. I do not mean to prescribe a list of
standardized metadata attributes in this companion spec. My examples
used a few that I made up on the spot (i.e., author, last-modified,
copyright, tags). I'll leave deciding on "the right set" of attributes
to those who actually intend to use metadata.

> (As others have pointed out several times in the past, $DOCUMENT_ROOT is
> not something set in stone.  We have single-user capsules, multi-user
> capsules with different URL styles -- example.com/~op/ vs
> example.com/users/op vs ... -- etc.)

That's a fair point, and one that John Cowan raised in his response as
well. Thanks for reminding me of this. In that case, we should discuss
how to remedy this issue.

One approach could be to keep the metadata.gmi file at each capsule's
document root as I originally proposed. This should be well-defined on a
per-capsule basis even on a server hosting multiple capsules in the
common pubnix style. It is simply the toplevel directory of your
personal capsule (i.e. ~/public_gemini or equivalent for user capsules
and whatever server-level document root is specified by the admin who
launched it).

This would put the burden on metadata bots to try and find these
metadata.gmi files at the appropriate paths under a multi-hosting
domain.

Without additional server-provided information, the bots may simply
resort to brute force checking every directory path on the domain for a
.metadata.gmi file, which could lead to a lot of dead-end network
requests.
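
To make the cost of brute forcing concrete, here is a small sketch that
lists the paths a bot might probe for a single known page URL, one per
directory level. The function name `brute_force_candidates` and the
example URL are hypothetical. Most of these requests would be expected
to come back as not found.

```python
from urllib.parse import urlsplit, urlunsplit


def brute_force_candidates(url, filename=".metadata.gmi"):
    """List one candidate metadata URL per directory level of url's path."""
    parts = urlsplit(url)
    # Keep directory components only: drop the leading "" and the final
    # segment, which is either a file name or empty.
    segments = parts.path.split("/")[1:-1]
    candidates = []
    for depth in range(len(segments) + 1):
        directory = "/".join([""] + segments[:depth] + [""])
        candidates.append(urlunsplit(
            (parts.scheme, parts.netloc, directory + filename, "", "")))
    return candidates


for candidate in brute_force_candidates("gemini://example.com/~op/blog/post.gmi"):
    print(candidate)
```

Repeating this for every directory a crawler discovers on a shared host
multiplies the number of speculative requests accordingly.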

Instead, I can think of (at least) two ways the server could help the
bot.

1. BAD: Aggregate Metadata Up

   Even though the visiting bot doesn't know which paths lead to the
   document roots of our users' capsules, the Gemini server does. At
   startup time, a metadata-exporting Gemini server could check each
   user's document root for a .metadata.gmi file. Any that are found
   could be concatenated together to form a single toplevel
   gemini://cool.capsule.com/.metadata.gmi file.

   However, in order for this to work correctly, the server would need
   to apply two transformations to each user-level metadata.gmi file
   before concatenation:

   1. All link lines would need to be prefixed by the URL path that the
      server assigns to that capsule's document root (e.g.,
      /~someuser/).

   2. To prevent errant bulleted list attributes at the top of one
      user's metadata.gmi file (with no prior link lines) from being
      erroneously applied to the final link lines of the previous
      metadata.gmi in the concatenation sequence, a single link line for
      the current capsule's document root (e.g., => /~someuser/) would
      need to be prepended to the front of each user-level metadata.gmi
      file prior to concatenation.

   These are relatively simple text transformations, but they do place
   additional burden on server authors, so this isn't my favorite
   option.

2. GOOD: Allow Metadata to Link to Other Metadata

   In this case, we just extend the metadata.gmi parsing rules for bots
   to say that if any of the link lines that they read in end with
   .metadata.gmi, then these can and should be followed for further
   metadata about parts of this site. This doesn't require any other
   changes to the companion spec as written except for that note.

   To make this work, at startup time a metadata-exporting Gemini server
   could check each user's document root for a .metadata.gmi file. For
   each such file that is found, the server can append a new link line
   pointing to that metadata.gmi file (relative to the server's toplevel
   document root) to its own toplevel $DOCUMENT_ROOT/.metadata.gmi if it
   exists. If a toplevel $DOCUMENT_ROOT/.metadata.gmi file doesn't
   exist, the server can create one containing just the links to the
   users' .metadata.gmi files.

   Note that this doesn't even have to happen at server start time.
   Instead, the server could program $DOCUMENT_ROOT/.metadata.gmi as a
   dynamic endpoint that checks for user-level .metadata.gmi files
   whenever it is called, thereby making users' metadata available as
   soon as the user publishes it to their capsule with no need for a
   server restart. (This is by far my favorite option.)
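
Both options can be sketched as plain text transformations. In the
code below, every function name is hypothetical, and the server is
assumed to already know each user's URL prefix (such as /~someuser/)
and to have read the contents of each user's .metadata.gmi file; how it
does so is left out.

```python
def aggregate_up(user_files):
    """Option 1: merge user-level files into one toplevel metadata text.

    user_files maps a URL prefix such as "/~someuser/" to the text of
    that user's .metadata.gmi file.
    """
    out = []
    for prefix, text in user_files.items():
        # Transformation 2: start each file with a link line for its
        # capsule root, as the proposal describes.
        out.append(f"=> {prefix}")
        for line in text.splitlines():
            if line.startswith("=>"):
                fields = line[2:].split(maxsplit=1)
                if fields:
                    # Transformation 1: prefix the capsule-relative path.
                    fields[0] = prefix + fields[0].lstrip("/")
                    line = "=> " + " ".join(fields)
            out.append(line)
    return "\n".join(out) + "\n"


def toplevel_index(user_prefixes, filename=".metadata.gmi"):
    """Option 2, server side: one link line per user-level metadata file."""
    return "".join(f"=> {prefix}{filename}\n" for prefix in user_prefixes)


def nested_metadata_links(text, filename=".metadata.gmi"):
    """Option 2, bot side: link targets that end with the metadata filename."""
    targets = []
    for line in text.splitlines():
        if line.startswith("=>"):
            fields = line[2:].split(maxsplit=1)
            if fields and fields[0].endswith(filename):
                targets.append(fields[0])
    return targets


index = toplevel_index(["/~alice/", "/~bob/"])  # hypothetical users
print(index, end="")
print(nested_metadata_links(index))
```

Option 2 needs noticeably less rewriting on the server: it only emits
link lines, while the bot does the extra work of following them.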

Okay, I think I've answered all your points. What do you think?

Best,
  Gary


Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Feb 25, 2021 at 02:31:32PM -0500,
 Gary Johnson <lambdatronic at disroot.org> wrote 
 a message of 323 lines which said:

> 2. $DOCUMENT_ROOT/.well-known/metadata.gmi

AFAIK, we do not have a companion spec for well-known, no?

=> gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5785.txt RFC 5785 on .well-known


Petite Abeille <petite.abeille (a) gmail.com>



> On Feb 26, 2021, at 14:35, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> AFAIK, we do not have a companion spec for well-known, no?

well-known is unknown to gemini.

At times, I wonder what "radical familiarity" actually means in the context
of gemini crocket, as gemini doesn't exhibit any traits which could
reasonably be qualified as "radical" or "familiar".

This will stay a mystery forever.


Gary Johnson <lambdatronic (a) disroot.org>

Stephane Bortzmeyer <stephane at sources.org> writes:

> AFAIK, we do not have a companion spec for well-known, no?
>
> => gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5785.txt RFC 5785 on .well-known

My preference is really for option 1, $DOCUMENT_ROOT/.metadata.gmi. I
included the .well-known possibility for discussion because its use has
come up in the past w.r.t. the robots.txt location, I believe.

