
Metadata Without A Proposal

nothien at uber.space

Fri Feb 26 10:51:08 GMT 2021

- - - - - - - - - - - - - - - - - - - 

Hi!

I've lost track of the currently raging metadata thread entirely, and so I've started this as a new post.

Thus far, I think there's general consensus on the following needs for any metadata proposal:

1. Must degrade gracefully for clients that don't understand metadata.

2. Must not be English-specific. Although the majority of gemtext/Gemini content is in English at the moment, we want more diversity. "Forcing" (by convention) the usage of English upon non-English users is unwanted. This rules out some of the current proposals which are oriented around 'tags', e.g. 'author' or 'license'. Theoretically, you could have a list of tags for different languages, but that would grow into a horrifically long list, and is generally unsustainable.

3. Must be machine-parsable. Search engines, archivers, and other crawler-style clients need to be attended to. Some of the information they need is: date, author, and license.

4. Must not affect presentation. gemtext as a whole is about separating content from presentation. Some of the earlier metadata proposals referred to metadata for presentation, e.g. to specify a color to view the text in. This is against the spirit of gemtext/Gemini (if not the spec).

5. Must be difficult to extend. Again, this comes from the general Gemini philosophy that anything that can be misused will be misused. This rules out lots of current proposals because they specify tags, and the usage of tags can only be controlled by convention, which is subject to change.

6. Must be accessible. Some proposals discussed the usage of emojis, and others have opted for creating new unofficial line types. These don't degrade gracefully for things like screen readers until those tools adopt the metadata proposal. That's not great.

I think that we don't need a "metadata proposal" to solve any of these problems. We already have everything we need in pre-existing formats and specifications. Only three metadata fields are really necessary: date, author, and license. New fields, if completely necessary, need to be handled on a case-by-case basis.

Dates

Dating content is mostly relevant to search engines, so that old (or new) results can be filtered out. My proposal with dates is to use what we already have - the gmisub companion spec. If any content (e.g. an article) has an associated date, the index page should list the content page, with its date, in gmisub format. If content pages don't have any associated date, simply don't list a date in the index. Search engines and crawlers can still choose to include date information based on when they last crawled the page.

=> gemini://gemini.circumlunar.space/docs/companion/
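
As a rough illustration of how a crawler might pull dates out of such an index, here is a minimal Python sketch. It assumes that entries are gemtext link lines whose label begins with a YYYY-MM-DD date; the names `ENTRY` and `parse_index` are made up for this example and come from no spec.

```python
import re
from datetime import date

# Assumed entry shape: "=> URL YYYY-MM-DD Title" (a link line whose label
# begins with an ISO date). Links without a date simply do not match.
ENTRY = re.compile(r"^=>\s*(\S+)\s+(\d{4}-\d{2}-\d{2})(?:\s+(.*))?$")

def parse_index(text):
    """Map each dated link URL in an index page to (date, title)."""
    entries = {}
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue  # headings, prose, and undated links are ignored
        url, day, title = m.groups()
        try:
            parsed = date.fromisoformat(day)
        except ValueError:
            continue  # shaped like a date, but not a real calendar date
        # Drop a leading separator such as "- " or ": " from the title.
        entries[url] = (parsed, (title or "").lstrip("-: ").strip())
    return entries
```

Pages that are linked without a date never appear in the result, which matches the idea that undated content just carries no date.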

One question this raises is what index page to use. I think that the engine should search through parent directories until it finds one which fits the gmisub format and has the content page (they would need to do this anyways in order to crawl the capsule containing the content page). If the engine already knows about an index page which is on the same capsule and that has the content page, it can use that.
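
The parent-directory search could be sketched roughly as follows. This is only an illustration: `candidate_indexes` and `find_index` are hypothetical names, and `fetch` and `lists_page` are stand-in callbacks that a real crawler would have to supply.

```python
from urllib.parse import urlsplit, urlunsplit

def candidate_indexes(page_url):
    """Yield parent directory URLs of page_url, nearest first, up to the root."""
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    # Drop the page itself, then peel off one directory per step.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "".join(s + "/" for s in segments[:depth])
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

def find_index(page_url, fetch, lists_page):
    """Return the nearest parent URL whose text lists page_url, else None.

    fetch(url) returns the page text or None; lists_page(text, page_url)
    returns True if the text is an index containing the page. Both are
    hypothetical callbacks, not part of any existing library.
    """
    for url in candidate_indexes(page_url):
        text = fetch(url)
        if text is not None and lists_page(text, page_url):
            return url
    return None
```

For `gemini://example.org/blog/2021/post.gmi`, the candidates are tried in the order `/blog/2021/`, `/blog/`, then `/`.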

Licenses

We already have a great convention for licenses: giving it on the last line of the document, with the line starting with `--`. For example:


```
-- CC-BY-SA nothien
```

All we need to do with this convention is to formalize it as a companion specification, maybe as `-- [SPDX license identifier] [owner]`.
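
To show how little a client would need in order to read such a line, here is a minimal Python sketch. It assumes the license is a single identifier without spaces and that everything after it is the owner; `parse_license_line` is a made-up name, and nothing here checks the identifier against the SPDX list.

```python
def parse_license_line(document):
    """Return (license_id, owner) from a trailing '-- LICENSE OWNER' line.

    Returns None when the last non-blank line does not start with '--'.
    Assumption: the license identifier contains no spaces.
    """
    lines = [ln for ln in document.splitlines() if ln.strip()]
    if not lines or not lines[-1].startswith("--"):
        return None
    fields = lines[-1][2:].split(None, 1)
    if not fields:
        return None  # a bare '--' with nothing after it
    license_id = fields[0]
    owner = fields[1].strip() if len(fields) > 1 else None
    return license_id, owner
```

A client that doesn't know the convention just displays the line as ordinary text, so it degrades gracefully.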

Authors

There are two possibilities I see with author metadata: either take it from the license line, discussed above, or extend the gmisub spec to also allow for an optional author field.

```
URL YYYY-MM-DD (Author) Title
```

We can tweak the format around a bit so that currently existing titles which start with parenthesized text aren't misinterpreted. In addition, one shouldn't have to repeat the author field for every line; we can have some system like only requiring the author field when it is different from the immediately previous author. I prefer the first option, but I haven't explored when the license owner would differ from the author (which I think is the case for e.g. news companies).
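
The "only repeat the author when it changes" idea could look something like the following Python sketch. The line shape is the untweaked `URL YYYY-MM-DD (Author) Title` from above, with an optional leading `=>`; `LINE` and `parse_entries` are names invented for this example.

```python
import re

# Assumed line shape: "URL YYYY-MM-DD (Author) Title", where "(Author) " is
# optional and a leading "=>" link marker is tolerated.
LINE = re.compile(
    r"^(?:=>\s*)?(\S+)\s+(\d{4}-\d{2}-\d{2})\s+(?:\(([^)]*)\)\s+)?(.+)$"
)

def parse_entries(lines):
    """Return (url, date, author, title) tuples, inheriting omitted authors."""
    entries = []
    current_author = None  # no author known until a line names one
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        url, day, author, title = m.groups()
        if author is not None:
            current_author = author  # an explicit author replaces the previous one
        entries.append((url, day, current_author, title))
    return entries
```

Note that this naive version still treats a title that begins with parenthesized text as an author, which is exactly the ambiguity the format would need to be tweaked around.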

Other Fields

Clearly, other fields aren't supported by this. If you want to place additional metadata in your content, then I suggest writing it in natural language. If it is absolutely necessary to have it machine-parsable (so that it can be specially understood by e.g. search engines) then we can talk about that here on the ML, but others have argued against e.g. tags because they allow easily manipulating search results. Expect resistance.

Metadata for Storage

Author and license metadata is stored within the page itself, and so that's not a problem. Personally, I store date information in the filename of the document (e.g. 2021-02-26-proposal.gmi), but I understand that this doesn't work for everyone: in that case, see below.
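
For the filename approach, capsule-local tooling could recover the date with something like this minimal Python sketch. It assumes the filename starts with a YYYY-MM-DD date, as in the example above; `date_from_filename` is a hypothetical helper name.

```python
import re
from datetime import date
from pathlib import PurePosixPath

def date_from_filename(path):
    """Return the leading YYYY-MM-DD date of a filename, or None."""
    m = re.match(r"(\d{4}-\d{2}-\d{2})", PurePosixPath(path).name)
    if not m:
        return None
    try:
        return date.fromisoformat(m.group(1))
    except ValueError:
        return None  # digits in the right shape, but not a real date
```

Such a helper could be used when generating the index page, so the date is written once in the filename and listed from there.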

There are legitimate uses for additional metadata when storing gemtext, such as for capsule-local tagging. These fields should be stored using any arbitrary convention in the content: after all, these fields are not meant to be parsed by external client software (i.e. search engines and crawlers), but are only parsed by capsule-local software (such as to organize content by tag).

Conclusion

I don't think we need a 'metadata proposal' to achieve the goals we're looking for. The format conventions are already mostly in place; we just need to formalize them.

~aravk | ~nothien