
[spec] [tech] Companion Specification Proposal for Metadata

Gary Johnson lambdatronic at disroot.org

Thu Feb 25 19:31:32 GMT 2021

- - - - - - - - - - - - - - - - - - - 

Howdy Geminauts,

# Rationale

It seems that the conversation about why and how to include metadata in Gemtext files has been raging on for quite a long time now with no real conclusion in sight. Also, lacking a final spec-altering decision by our BDFL (currently MIA, likely riding a refurbished bike through a forest with a ham radio right now), not much is likely to actually change in Geminispace.

Thus far, I've read passionate arguments on both sides of the metadata debate, both for and against adding it to Gemtext. To me these have been the most compelling (YMMV):

For

Gemtext pages may be tagged with information that can be useful to automated clients (e.g., search engines, archiving bots, and maybe proxies) that is otherwise difficult or impossible to infer from performing a full text search of the Gemtext file's contents.

Against

Metadata represents a slippery slope to uncontrolled extensibility. It might be abused for server-specified styling, requesting external resources (e.g., supporting client-side scripting or background images of kittens), or just generally making Gemtext pages hard to read in clients that don't hide inline metadata, or making page concatenation difficult with the end-of-file metadata proposal that's been discussed at some length on the mailing list.

It could also be used to reopen the fetid can of worms that was last year's discussion of extending mime-type attributes in the status 20 response metadata, particularly around the topics of caching (now a client-side best practices procedure), file size (computable by the client during download), file integrity (already signaled by tls_close_notify), and file authenticity (managed out-of-band by including md5sum, sha256sum, sig, or asc files for download next to links that warrant manual verification).

More to the point, most if not all non-presentation/protocol-altering metadata attributes about a Gemtext page may already be encoded in the author's natural language with no changes to Gemtext at all. This enables content authors to express such information not only in their language of choice but also in the most culturally appropriate manner for their readers (consider the different interpretations of the date 02/03/04 depending on where you live).

Consider the following example blog post that does just this:


Author: lambdatronic
Date Written: 2021-02-25 (a.k.a. February 25, 2021)

## Why Bots Matter

Have you ever used (poor, uncared-for) GUS? Or Houston?

Have you ever really considered their feelings? They slave away all day trying to sort and categorize every capsule in Geminispace just to save you time and energy when navigating across our little (but rapidly growing) constellation of text-powered space outposts?

They do their best with full-text search, with categorization by toplevel headers, and with their own best estimates of the publishing time of these capsules based on their own indexing times, but oh what a Sisyphean task they toil at on our behalf.

If only they had a little metadata to ease their burden.

CAPCOM and Spacewalk get a little assistance from Atom and the Gemini subscription companion spec. Proxies can be pointed in the right direction by the robots.txt companion spec. Why can't our poor, poor search engines get a little relief?

Put yourself in their shoes and try to find compassion in your heart for your friendly neighborhood bot. Every autonomous agent matters. Programs have feelings too. Leave no bot behind.

Copyright: CC-BY-SA
Tags: irony education advocacy bots

# Proposal

Considering that:

1. Metadata /within/ a Gemtext file carries a number of liabilities that make some of our community members nervous (understandably so IMO).

2. The subset of metadata that is meant to be read and understood by a human reader using a typical Gemini client can already be expressed in natural language without any community-approved tag standardization.

3. The main value to attaching standardized metadata tags to Gemtext pages is likely to simply aid automated bots supporting search engines and archiving.

4. Geminispace is filled with files in more formats than just Gemtext, many (all?) of which could benefit from similar bot-assisting metadata.

5. Both aggregators and proxies already have companion specifications that have been (somewhat) adopted by the community and seem to fare better in our community than direct changes to the Gemini protocol or Gemtext specifications.

We propose a companion specification for metadata, in which all the metadata about the static files and/or dynamic endpoints (of any format) in a capsule be included in a separate file accessible at a well-known location that a bot could check as it crawls through Geminispace.

As placeholders, let's put forward these candidates for discussion:

1. $DOCUMENT_ROOT/.metadata.gmi
2. $DOCUMENT_ROOT/.well-known/metadata.gmi

In the Gemini spirit of reducing network requests (only one request needed per capsule here) and storing our information in a human-readable format (good old ubiquitous text/gemini), here's my initial stab at a dead simple format for these metadata files:

I can write anything I want in this file, and it will be treated as comments unless it is of line type link (=>) or bulleted list (*). I don't have to write these comments, and if I left them out, I'd make this easier to read, but sometimes I can't stop blabbing in my metadata files.

Another Header-Level Comment About My Toplevel Pages

=> / Lambdatronic's Gemini Capsule
=> /index.gmi Lambdatronic's Gemini Capsule

Now I'll Comment About Some Stuff

=> /stuff Some Stuff I Like
=> /stuff/this.gmi Astronomy Stuff
=> /stuff/that.gmi Bike Stuff

=> /stuff/this.gmi Astronomy Stuff

=> /stuff/that.gmi Bike Stuff

Now I'll Comment About Some Things

=> /things/some-gemtext.gmi I Wrote Something
=> /things/some-plain-text.txt Sometimes I Write in Plain Text
=> /things/obligatory-cat-picture.png Meow (and also) Meow
=> /things/my-best-1990s-mixtape.ogg Too Much Green Day

=> /things/obligatory-cat-picture.png Meow (and also) Meow

=> /things/my-best-1990s-mixtape.ogg Too Much Green Day

Okay, so that's pretty much it. Essentially the metadata.gmi syntax is based around two existing Gemini line types:

1. Links

You should include a /relative/ link line to each path on your capsule for which you want to provide metadata. This is meant to be an entirely opt-in process, so any paths that you leave out will simply have no metadata associated with them, leaving bots to rely on whatever methods they so choose to tag and index your pages.

You may include any number of link lines one after the other in the file, and you may (if you think it provides some value that outweighs the possible loss of readability) include any other line type between any two link lines without changing the parsing semantics EXCEPT for a bulleted list line type (*).

2. Bulleted Lists

Any bulleted list in a metadata.gmi file will be interpreted as a metadata attribute specifier for all link lines preceding it up to the most recent prior bulleted list line or the top of the file, whichever is encountered first.

If the same link element appears more than once within the file and is therefore followed by more than one bulleted list of metadata attributes, all encountered metadata attributes should be merged into a single list per link element. If the same attribute is specified more than once for the same link element (whether in the same bulleted list or in separate bulleted lists within the file), the attribute whose value appears later in the file should overwrite the earlier specified value.

Each bulleted list line should use the following format to indicate a single pair of attribute and value:

*[WHITESPACE]<ATTRIBUTE>[WHITESPACE]:[WHITESPACE]<VALUE>

Here, [WHITESPACE] is optional and <ATTRIBUTE>, :, and <VALUE> are required line elements.

NOTE: All other line types should be ignored by metadata parsers and treated as comments by the file's author.
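
To make the attribute line format a bit more concrete, here is a rough Python sketch of how a parser might recognize a single bulleted list line. The function name `parse_attribute_line` and the regular expression are my own illustrative choices, not part of the proposal. It also assumes that attribute names never contain a colon, so the first colon on the line is the separator; the proposal doesn't settle that point.

```python
import re
from typing import Optional, Tuple

# Matches "*[WHITESPACE]<ATTRIBUTE>[WHITESPACE]:[WHITESPACE]<VALUE>".
# Assumption (not stated in the proposal): the attribute name contains no
# colon, so the first colon separates the attribute from its value.
ATTRIBUTE_LINE = re.compile(r"^\*\s*([^:\s][^:]*?)\s*:\s*(\S.*?)\s*$")


def parse_attribute_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (attribute, value), or None if the line does not match."""
    match = ATTRIBUTE_LINE.match(line)
    if match is None:
        return None  # not a valid attribute line: treat it as a comment
    return match.group(1), match.group(2)
```

A line such as `* just a plain bullet` has no colon, so it yields `None` and would simply be skipped (or reported as a syntax error, depending on the program's design).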

In order to parse a metadata.gmi file, a program would start reading it in line by line as with any Gemtext file. All lines that are not of type link (=>) or bulleted list (*) should be ignored.

1. When a link line is encountered, it should be stored in the program's memory as a currently active link. If more links are read in before a bulleted list is reached, each of these links should be stored in memory as active links. Multiple links may be active at the same time.

2. When a bulleted list line is encountered, it should be parsed according to the attribute=value specification described in point 2 (Bulleted Lists) above. If the line's contents do not match this specification, it should be ignored and treated as another comment line. Depending on the program's design, it may be valuable to report this line as a syntax error.

3. For each bulleted list line that parses correctly, assign the attribute=value pair to all currently active links in memory. In the event of an attribute conflict, overwrite the old attribute=value pair associated with the active link in conflict with the attribute=value pair being read. In the event that no links are currently active in memory, simply continue on to the next line.

4. If/when the next link line is encountered, mark all links in memory as inactive, store the currently read link line in memory as an active link, and resume program execution from step 1 above.

And that's all there is to it. When your program reaches the end of the metadata.gmi file, it should have a data structure containing all of its links and associating each of them with a table of attribute=value pairs. The program can then do whatever it wants with this information.
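
For anyone who prefers code to prose, the four steps above could be sketched in Python roughly like this. Treat it as one possible reading of the informal rules rather than a reference implementation. The function name `parse_metadata`, the choice of a dict keyed by link path, and the attribute names in the sample are all made up for illustration.

```python
from typing import Dict, List


def parse_metadata(text: str) -> Dict[str, Dict[str, str]]:
    """Map each link path in a metadata.gmi file to its attribute table."""
    metadata: Dict[str, Dict[str, str]] = {}
    active: List[str] = []   # links currently waiting for attributes
    seen_attribute = False   # True once a valid bullet followed the active links

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("=>"):
            parts = line[2:].split(None, 1)
            if not parts:
                continue  # link line with no path: ignore it
            if seen_attribute:
                # Step 4: a link after a bulleted list starts a new group.
                active = []
                seen_attribute = False
            path = parts[0]
            active.append(path)           # step 1: store as an active link
            metadata.setdefault(path, {})
        elif line.startswith("*"):
            # Step 2: split on the first colon (an assumption of this sketch).
            attribute, separator, value = line[1:].partition(":")
            attribute, value = attribute.strip(), value.strip()
            if not separator or not attribute or not value:
                continue  # malformed bullet: treated as a comment
            seen_attribute = True
            for path in active:
                # Step 3: later values overwrite earlier ones.
                metadata[path][attribute] = value
        # Every other line type is a comment and is skipped.
    return metadata


if __name__ == "__main__":
    sample = "\n".join([
        "Any other line is a comment.",
        "=> / Lambdatronic's Gemini Capsule",
        "=> /index.gmi Lambdatronic's Gemini Capsule",
        "* author: lambdatronic",
        "=> /stuff/this.gmi Astronomy Stuff",
        "* tags: astronomy",
    ])
    for path, attributes in parse_metadata(sample).items():
        print(path, attributes)
```

Links that never receive any attributes still show up in the result with an empty table, which matches the idea that the final data structure contains all of the file's links.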

# Conclusion

Here, we've proposed a (currently very informal) companion specification for an /optional/ toplevel metadata.gmi file per capsule.

This approach has the following advantages over specifying metadata attributes within Gemtext files:

1. No need to extend the Gemtext spec with more line types.

2. No potential for impacting the readability of existing Gemtext files for non-metadata-aware clients.

3. No potential for presentation/behavior abuse within existing Gemtext files since nothing is being added to them.

4. Less bandwidth needed per request to a Gemtext page since non-metadata-aware clients won't have to download per-page metadata that they don't use.

5. Less bandwidth needed to download metadata when you want it since paths that share the same value for metadata attributes can specify the shared attributes once in metadata.gmi rather than once per page.

6. Fewer requests needed for metadata-scraping bots since they can simply request the toplevel /.metadata.gmi path rather than having to request and parse every Gemtext file for optional metadata.

7. Can be used to attach metadata to non-Gemtext files as well as to responses from dynamic endpoints (e.g., CGI scripts).

8. Could be used by Stephane Bortzmeyer to easily figure out how many Gemini capsules actually want to publish metadata about themselves. ;D

So that's my proposal. Let's talk about it constructively and see if it can be improved upon. I'm sure you brilliant folks can think of something that hasn't yet crossed my mind.

Failing that, we can always just nuke the whole topic either by group consensus or just by not taking this to the next step of actually writing up a formal companion spec and implementing it in some clients.

Thanks again for everyone's hard work and creativity in making Gemini the really interesting, vibrant, quirky, and passionate community that it is. I look forward to reading your responses.

Happy hacking, Gary

--
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html