(proposal) on metadata in documents

1. smlckz (a) tilde.pink (smlckz (a) tilde.pink)

I am proposing a convention for putting human- and machine-readable metadata
in documents (in the ''geminisphere''). This is completely optional for
document writers.

The metadata should be placed at the end of the document so that the 
viewers can view the content first.

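For now, I am proposing the following metadata for inclusion in documents
(all of which is optional):

*  the date (and maybe time) when the document was published

*  the date (and maybe time) when the document was last modified

*  copyright information and/or license of the document

IMO, we should use ISO 8601 for the date/time in metadata.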

Clients may use the information, but may not hide the metadata.
Spiders/bots can also use the information
(when indexing/archiving documents).

Now the question for you is: how should the metadata be formatted?
Please share your thoughts on it.


~smlckz


2. Drew DeVault (sir (a) cmpwn.com)

On Sat Nov 14, 2020 at 12:07 PM EST, smlckz at tilde.pink wrote:
> * the date (and maybe time) when the document was published
>
> * the date (and maybe time) when the document was last modified

Can already pull this out of RSS


3. Sean Conner (sean (a) conman.org)

It was thus said that the Great Drew DeVault once stated:
> On Sat Nov 14, 2020 at 12:07 PM EST, smlckz at tilde.pink wrote:
> > * the date (and maybe time) when the document was published
> >
> > * the date (and maybe time) when the document was last modified
> 
> Can already pull this out of RSS

  Which version?  There are around half a dozen variations that are not all
compatible with each other.

  -spc (And that also assumes you have an RSS feed for every page on the
	site)


4. Sean Conner (sean (a) conman.org)

It was thus said that the Great smlckz at tilde.pink once stated:
> I am proposing a convention of putting human and machine readable metadata 
> in documents (in ''geminisphere''). This is completely optional for 
> document writers.
> 
> The metadata should be placed at the end of the document so that the 
> viewers can view the content first.
> 
> For now, I am proposing the following metadata for inclusion in documents 
> (all of which is optional):
> 
> *  the date (and maybe time) when the document was published
> 
> *  the date (and maybe time) when the document was last modified
> 
> *  copyright information and/or license of the document
> 
> IMO, we should use ISO 8601 for the date/time in metadata.
> 
> The clients may use the information, but may not hide the metadata.
> The spiders/bots can also use the information
> (when indexing/archiving documents) as well.
> 
> Now the question for you is how the metadata is formatted?
> Please share your thoughts on it.

  Okay.

Created 2020-11-14T17:34:19-0500
Modified 2020-11-14T17:50:03-0500
Copyright 2020 by Sean Conner.

  The timestamp was created with the following Unix command: "date +%FT%T%z"
so that's pretty easy.  And you know, if you move the lines to the top of
the document, put the Modified: header first, a client would only have to
read the first 34 bytes of the document to see if it's modified, and if it
hasn't since the client last read it, the client can close the connection. 
Caching solved!

  -spc (Add a Size header and you solve the size problem as well!)
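
For readers who want to reproduce that timestamp outside the shell, here is a minimal Python sketch of the same layout as `date +%FT%T%z`. The helper name `stamp` and the fixed UTC-5 offset are assumptions made for illustration, not part of the proposal. It also checks the byte count mentioned above: the example Modified line, written without a colon and ending in a single newline, comes to 34 bytes.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%S%z"  # same layout as `date +%FT%T%z`


def stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime the way `date +%FT%T%z` prints it."""
    return moment.strftime(FMT)


# The Modified timestamp from the example, rebuilt by hand (UTC-5 assumed).
est = timezone(timedelta(hours=-5))
modified = datetime(2020, 11, 14, 17, 50, 3, tzinfo=est)
line = "Modified " + stamp(modified) + "\n"

# Parsing the text form gives back the same instant.
parsed = datetime.strptime("2020-11-14T17:50:03-0500", FMT)

print(line, end="")
print(len(line.encode("utf-8")), "bytes")
```

Because the offset is part of the string, two timestamps written in different time zones still compare correctly once parsed.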


5. Petite Abeille (petite.abeille (a) gmail.com)



> On Nov 14, 2020, at 23:57, Sean Conner <sean at conman.org> wrote:
> 
> And you know, if you move the lines to the top of the document

Furthermore, if you add an empty line after all these, hmmm, lines, you 
have reinvented MIME! Hurray! Smells like 1982 all over again! Long live RFC822!


6. Petite Abeille (petite.abeille (a) gmail.com)



> On Nov 14, 2020, at 23:52, Sean Conner <sean at conman.org> wrote:
> 
>  Which version?

Atom (Web standard), RFC 4287, December 2005.

>  There are around half a dozen variations that are not all compatible with each other.

Ignore them. No point in replaying a decade of bickering. 

> 
>  -spc (And that also assumes you have an RSS feed for every page on the
> 	site)

...


7. smlckz (a) tilde.pink (smlckz (a) tilde.pink)

On Sat, 14 Nov 2020, Drew DeVault wrote:
> Can already pull this out of RSS

Not every document has, or needs to have, an RSS feed associated with it.
On pages that do have an RSS feed, you need to parse XML, and who likes that?
The proposed convention is meant to be simple to parse, write and understand. 
No need for another library.

~smlckz


8. smlckz (a) tilde.pink (smlckz (a) tilde.pink)

On Sat, 14 Nov 2020, Sean Conner wrote:
>  Okay.
>
> Created 2020-11-14T17:34:19-0500
> Modified 2020-11-14T17:50:03-0500
> Copyright 2020 by Sean Conner.
>
>  The timestamp was created with the following Unix command: "date +%FT%T%z"
> so that's pretty easy.

That's one way of doing that. Can we do better than that?

> And you know, if you move the lines to the top of
> the document, put the Modified: header first, a client would only have to
> read the first 34 bytes of the document to see if it's modified, and if it
> hasn't since the client last read it, the client can close the connection.
> Caching solved!
>
>  -spc (Add a Size header and you solve the size problem as well!)
>

We don't want or need anything like that. That is a breaking change to the 
spec so breaks all existing clients.

Let me change my wording a little bit.

>> The metadata should be placed at the end of the document so that the
>> viewers can view the content first.
I should have said ''must'' instead of ''should''.
>> The clients may use the information, but may not hide the metadata.
''must'' not hide.

mmmh..

~smlckz


9. Jon (jon (a) shit.cx)

On 14/11/20 17:07, smlckz at tilde.pink wrote:
> For now, I am proposing the following metadata for inclusion in
> documents (all of which is optional):
> 
> *  the date (and maybe time) when the document was published
> 
> *  the date (and maybe time) when the document was last modified
> 
> *  copyright information and/or license of the document

I would also like a field for the source of the document. This will
allow people to take local copies of documents without losing track of
where they came from and where updated versions may be found. It may
also simplify mirroring sites, especially given there is a field for
the license information.

I've had a dilemma about whether I should be placing the site's name at
the top-level heading or elsewhere. If there were a location field, I
would avoid placing the site's name at the top-most header, reclaiming
another layer of heading depth in the process.

Regardless of whether this becomes an official standard, I will probably
adopt this or something like it because it makes so much sense.

I would additionally like a field for the author and author's email
address, but that's less important to me.

I'm also proposing that these additional fields are optional.

-- 
Jon


10. Sean Conner (sean (a) conman.org)

It was thus said that the Great smlckz at tilde.pink once stated:
> On Sat, 14 Nov 2020, Sean Conner wrote:
> > Okay.
> >
> >Created 2020-11-14T17:34:19-0500
> >Modified 2020-11-14T17:50:03-0500
> >Copyright 2020 by Sean Conner.
> >
> > The timestamp was created with the following Unix command: "date +%FT%T%z"
> >so that's pretty easy.
> 
> That's one way of doing that. Can we do better than that?

  What's wrong with that format?  It's an ISO standard, it's locale neutral,
easy to parse, and it's easy for humans to read.  I don't think you really
can do better than that.  Unless you really want to parse dates like

	vuos, skáb 16. b. 2020 02:27:32 CET

> >And you know, if you move the lines to the top of
> >the document, put the Modified: header first, a client would only have to
> >read the first 34 bytes of the document to see if it's modified, and if it
> >hasn't since the client last read it, the client can close the connection.
> >Caching solved!
> >
> > -spc (Add a Size header and you solve the size problem as well!)
> >
> 
> We don't want or need anything like that. That is a breaking change to the 
> spec so breaks all existing clients.

  Like adding headers isn't a breaking change?  And I'm not adding the size
to the MIME type, but to the other "fields", something like:

Created 2020-11-15T20:29:01-0500
Modified 2020-11-15T20:29:01-0500
Copyright 2020 by Sean Conner
Size 806
Cache not-on-your-life
User-Agent myGeminiClient-1.13
MD5sum fd888c3218f34e71dc57221143d44ccb
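
As a rough illustration of how little code such "Name value" lines need, here is a Python sketch. It rests on one assumption that nobody in the thread has agreed on: the fields form the final run of non-blank lines, separated from the content by a blank line. The function name `trailing_fields` is made up for this example.

```python
def trailing_fields(document: str) -> dict:
    """Collect 'Name value' lines from the last block of non-blank lines."""
    lines = document.rstrip("\n").split("\n")
    fields = {}
    # Walk backwards and stop at the blank line that ends the content.
    for line in reversed(lines):
        if not line.strip():
            break
        # The first space separates the field name from its value.
        name, _, value = line.partition(" ")
        fields[name] = value
    return fields


sample = (
    "# A page\n"
    "\n"
    "Some content.\n"
    "\n"
    "Created 2020-11-15T20:29:01-0500\n"
    "Modified 2020-11-15T20:29:01-0500\n"
    "Copyright 2020 by Sean Conner\n"
    "Size 806\n"
)
fields = trailing_fields(sample)
print(fields["Modified"])
```

A document whose last paragraph is ordinary prose would be misread as fields, which is one argument for an explicit separator or prefix.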

> Let me change my wordings a little bit.
> 
> >>The metadata should be placed at the end of the document so that the
> >>viewers can view the content first.
> I should have said ''must'' instead of ''should''.

  Kill joy.

> >>The clients may use the information, but may not hide the metadata.
> ''must'' not hide.
> 
> mmmh..

  Mmmmmh indeed ...

  -spc


11. Philip Linde (linde.philip (a) gmail.com)

On Sat, 14 Nov 2020 17:57:34 -0500
Sean Conner <sean at conman.org> wrote:

> -spc (Add a Size header and you solve the size problem as well!)

Nope, not for as long as it's optional (can never reliably tell that
you've received the complete document if the connection dies before
receiving the full header) and only part of text/gemini (can never tell
what the size is for other document types, which IMO are more likely
to be a concern size-wise; a photo's size easily exceeds that of a
Bible sized text/gemini document).

-- 
Philip

12. Philip Linde (linde.philip (a) gmail.com)

On Sun, 15 Nov 2020 19:41:47 +0000 (UTC)
smlckz at tilde.pink wrote:

> We don't want or need anything like that. That is a breaking change to
> the spec so breaks all existing clients.

How so? My client will never interpret such meta-data, and I see the
lack of provisions for it in text/gemini as a feature, but I don't see
how adding it would break my client. To my client, it's regular text
lines in the document body.

The main advantage of this proposal is that the spec really doesn't need
to be concerned with it. It's still text/gemini and there are no
changes to the protocol. That makes it a great honeypot. Hopefully,
most future suggestions for changes to the protocol can instead be
added to an evergrowing list of in-document header fields that no one
will implement.

-- 
Philip

13. smlckz (a) tilde.pink (smlckz (a) tilde.pink)

On Mon, 16 Nov 2020, Philip Linde wrote:

> On Sun, 15 Nov 2020 19:41:47 +0000 (UTC)
> smlckz at tilde.pink wrote:
>
>> We don't want or need anything like that. That is a breaking change to
>> the spec so breaks all existing clients.
>
> How so? My client will never interpret such meta-data, and I see the
> lack of provisions for it in text/gemini as a feature, but I don't see
> how adding it would break my client. To my client, it's regular text
> lines in the document body.
What I thought was that if header metadata were introduced, it would need to
be hidden (at least), as it degrades the user experience. You as a visitor do not
want to see that the document is X bytes in size, or the md5sum or sha512sum of
the document, before the actual content.

And if they need to be hidden, the spec needs to be changed and that
would break existing clients. Hopefully you can see what I meant.

For this reason, I had to be ''Kill joy'' and amend my proposal so that
the metadata must be placed at the end of the document.

> The main advantage of this proposal is that the spec really doesn't need
> to be concerned with it. It's still text/gemini and there are no
> changes to the protocol. That makes it a great honeypot. Hopefully,
> most future suggestions for changes to the protocol can instead be
> added to an evergrowing list of in-document header fields that no one
> will implement.

Not only text/gemini, but also any other text/* format. I have clearly
stated that this is just metadata. As each field is optional, 
unrecognised fields would be ignored.

> -- 
> Philip
>

~smlckz


14. smlckz (a) tilde.pink (smlckz (a) tilde.pink)

On Sun, 15 Nov 2020, Sean Conner wrote:

>  What's wrong with that format?  It's an ISO standard, it's locale neutral,
> easy to parse, and it's easy for humans to read.  I don't think you really
> can do better than that.  Unless you really want to parse dates like
>
> 	vuos, skáb 16. b. 2020 02:27:32 CET

I am not against the ISO 8601 format and don't want to dive into the l10n mess.
I wonder whether we need a separator between content and metadata, or whether
it'd be better to put the whole metadata into a preformatted text block
with alt-text of `metadata`, or to use some prefix for each line of
metadata. What do you think?
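
To make the preformatted-block option concrete, here is a small Python sketch of how a crawler might pull out such a block. It assumes the usual gemtext rule that a line starting with three backticks toggles preformatted mode and that the text after the opening fence is the alt text. The function name `metadata_block` and the choice to keep only the last matching block are illustrative assumptions.

```python
FENCE = "`" * 3  # gemtext toggles preformatted mode on lines starting with this


def metadata_block(gemtext: str, alt: str = "metadata") -> list:
    """Return the lines of the last preformatted block whose alt text is `alt`."""
    found = []
    current = None  # lines of the block being read, or None outside a block
    keep = False
    for line in gemtext.splitlines():
        if line.startswith(FENCE):
            if current is None:
                # Opening fence: whatever follows the backticks is the alt text.
                keep = line[len(FENCE):].strip() == alt
                current = []
            else:
                # Closing fence: remember the block if its alt text matched.
                if keep:
                    found = current
                current = None
        elif current is not None:
            current.append(line)
    return found


page = "\n".join([
    "# Title",
    "Body text.",
    FENCE + "metadata",
    "Created 2020-11-14T17:34:19-0500",
    "Copyright 2020 by Sean Conner.",
    FENCE,
])
fields = metadata_block(page)
print(fields)
```

Clients that know nothing about the convention would simply show the block as preformatted text, so nothing is hidden from the reader.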


~smlckz


15. cbabcock (a) asciiking.com (cbabcock (a) asciiking.com)

November 16, 2020 4:17 AM, smlckz at tilde.pink wrote:

> I am not against ISO 8601 format and don't want to dive into l10n mess.
> I wonder if we need a separator between content and metadata or not, or
> it'd be better to put the whole metadata into a preformatted text block
> with alt-text of `metadata`, or using some prefix for each line of metadata
> field. What do you think?
> 
> ~smlckz

The least-friction way to implement metadata would be to present it as preformatted YAML:

 ``` yaml
title:  'This is the title: it contains a colon'
author:
- Author One
- Author Two
keywords: [nothing, nothingness]
abstract: |
  This is the abstract.

  It consists of two paragraphs.
 ```

A variation that I believe would have the most utility in clients
would be a new syntactic feature called metadata, implemented similarly to
preformatted blocks, like:

--- yaml
title:  'This is the title: it contains a colon'
author:
- Author One
- Author Two
keywords: [nothing, nothingness]
abstract: |
  This is the abstract.

  It consists of two paragraphs.
---

Not evangelizing yaml in this context, though it's not a bad fit. Just 
taking my example from Pandoc documentation - 
https://pandoc.org/MANUAL.html#metadata-blocks 

Obviously document writers *can* include metadata in any document they 
write, so the question would be whether the value added by encouraging a 
uniform presentation is worth defining a metadata line type in the 
specification. I think doing so enables interesting options for client 
code, and that creating a line type as opposed to dictating page placement 
is more idiomatically gemini text.

Chris


16. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

On Saturday, November 14, 2020 6:24 PM, Petite Abeille <petite.abeille at 
gmail.com> wrote:

> > On Nov 14, 2020, at 23:52, Sean Conner sean at conman.org wrote:
> > Which version?
>
> Atom (Web standard), RFC 4287, December 2005.
>
> > There are around half a dozen variations that are not all compatible with each other.
>
> Ignore them. No point in replaying a decade of bickering.


Seconded. Atom is a strong spec and has no other versions, extensions, date issues, etc.
to deal with. It's already used for much of Gemini, thanks to CAPCOM.


makeworld


17. text (a) sdfeu.org (text (a) sdfeu.org)

Reading about favicon in Gemini by way of an extra request to the host 
serving a requested document, I searched for "metadata" in old threads.

It seems to me the quoted message from 2020-11-14 below could serve as a 
solution to the debate.

Note that the favicon RFC basically uses this approach itself, stating 
some `key: value` pairs within the document.

=> gemini://mozz.us/files/rfc_gemini_favicon.gmi RFC: Adding Emoji Favicons to Gemini

Why not use this kind of structured metadata lines for an (as per RFC 
still unmotivated) favicon convention?

Favicon: #


On Sat, 14 Nov 2020 17:07:30 +0000, smlckz wrote:

> I am proposing a convention of putting human and machine readable
> metadata in documents (in ''geminisphere''). This is completely optional
> for document writers.
> 
> The metadata should be placed at the end of the document so that the
> viewers can view the content first.
> 
> For now, I am proposing the following metadata for inclusion in
> documents (all of which is optional):
> 
> *  the date (and maybe time) when the document was published
> 
> *  the date (and maybe time) when the document was last modified
> 
> *  copyright information and/or license of the document
> 
> IMO, we should use ISO 8601 for the date/time in metadata.
> 
> The clients may use the information, but may not hide the metadata.
> The spiders/bots can also use the information (when indexing/archiving
> documents) as well.
> 
> Now the question for you is how the metadata is formatted?
> Please share your thoughts on it.
> 
> 
> ~smlckz


18. Solene Rapenne (solene (a) perso.pw)

On Sun, 21 Feb 2021 13:06:53 -0000 (UTC)
text at sdfeu.org:

> Reading about favicon in Gemini by way of an extra request to the host 
> serving a requested document, I searched for "metadata" in old threads.
> 
> It seems to me the quoted message from 2020-11-14 below could serve as a 
> solution to the debate.
> 
> Note that the favicon RFC basically uses this approach itself, stating 
> some `key: value` pairs within the document.
> 
> => gemini://mozz.us/files/rfc_gemini_favicon.gmi RFC: Adding Emoji Favicons to Gemini
> 
> Why not use this kind of structured metadata lines for an (as per RFC 
> still unmotivated) favicon convention?
> 
> Favicon: #
> 

It seems you are suggesting implementing the equivalent of HTTP headers,
which are key: value pairs that are not part of the document but are transmitted
in the reply. Currently Gemini only returns the status code, the content type
and potentially the language (this is not mandatory).

That's an endless rabbit hole that the Gemini protocol had better
not explore, because it allows endless extensibility.


19. Oliver Simmons (oliversimmo (a) gmail.com)

On Sun, 21 Feb 2021 at 13:13, Solene Rapenne <solene at perso.pw> wrote:
>
> It seems your are suggesting implementing equivalent of http headers
> that are key: values pair and are not part of the document but is transmitted
> in the reply. Currently gemini only returns the status code, the content type
> and potentially the language (this is not mandatory).
>
> That's an endless rabbithole that the Gemini protocol should better
> not explore because it allows endless extendability.

This isn't headers of any sort, it's document metadata, similar to
HTML's <head> and <meta>.


20. text (a) sdfeu.org (text (a) sdfeu.org)

On Sun, 21 Feb 2021 19:51:39 +0000, Oliver Simmons wrote:
> This isn't headers of any sort, it's document metadata, similar to
> HTML's <head> and <meta>.

I loved Opera's native navigation support for HTML's rel prev/next tags.

https://www.w3.org/TR/2018/SPSD-html32-20180315/#link states:
> LINK provides a media independent method for defining relationships
> with other documents and resources. LINK has been part of HTML since the
> very early days, although few browsers as yet take advantage of it (most
> still ignore LINK elements).

https://news.ycombinator.com/item?id=11515888 has some comments on it:
> Literally one of the greatest things about the Opera browser was that
> you could browse an entire forum or whatever (longform article etc) with
> the Space key


21. Oliver Simmons (oliversimmo (a) gmail.com)

On Sun, 21 Feb 2021 at 20:29, <text at sdfeu.org> wrote:
>
> I loved Opera's native navigation support for HTML's rel prev/next tags.
>
> https://www.w3.org/TR/2018/SPSD-html32-20180315/#link states:
> > LINK provides a media independent method for defining relationships
> > with other documents and resources. LINK has been part of HTML since the
> > very early days, although few browsers as yet take advantage of it (most
> > still ignore LINK elements).
>
> https://news.ycombinator.com/item?id=11515888 has some comments on it:
> > Literally one of the greatest things about the Opera browser was that
> > you could browse an entire forum or whatever (longform article etc) with
> > the Space key
>
>

That sounds neat, would be useful for orbits/webrings (such as LEO).
The current link system works ok, but can be a bit clunky.


22. Sean Conner (sean (a) conman.org)

It was thus said that the Great Oliver Simmons once stated:
> On Sun, 21 Feb 2021 at 13:13, Solene Rapenne <solene at perso.pw> wrote:
> >
> > It seems your are suggesting implementing equivalent of http headers
> > that are key: values pair and are not part of the document but is transmitted
> > in the reply. Currently gemini only returns the status code, the content type
> > and potentially the language (this is not mandatory).
> >
> > That's an endless rabbithole that the Gemini protocol should better
> > not explore because it allows endless extendability.
> 
> This isn't headers of any sort, it's document metadata, similar to
> HTML's <head> and <meta>.

  One can go overboard on metadata.  Check out the source on this joker's
website:

	http://boston.conman.org/2021/02/17.1

Almost a *hundred* lines of metadata!  Madness!  Madness I say!

  -spc (and he's not even sure he has enough!  The fool!)


23. Björn Wärmedal (bjorn.warmedal (a) gmail.com)

I've seen some people put key/value type metadata in their gmi files
already ("tags: this,that,whatevs" for example). Go ahead and do it if
you want; I personally like that you want to put them at the end of
the document, where they won't bother anybody.

As for including it in the spec... I'd rather not. Treat them as
optional extensions :)

Cheers,
ew0k


24. Oliver Simmons (oliversimmo (a) gmail.com)

On Mon, 22 Feb 2021 at 07:57, Björn Wärmedal <bjorn.warmedal at gmail.com> wrote:
>
> I've seen some people put key/value type metadata in their gmi files
> already ("tags: this,that,whatevs" for example). Go ahead and do it if
> you want; I personally like that you want to put them at the end of
> the document, where they won't bother anybody.
>
> As for including it in the spec... I'd rather not. Treat them as
> optional extensions :)
>

The spec has "advanced line types" which are treated as optional:
> 5.5 Advanced line types
> The following advanced line types MAY be recognised by advanced clients.
> Simple clients may treat them all as text lines as per 5.4.1 without any
> loss of essential function.

Having the format as part of the spec would be good. I don't think
having an official list of key:values in the spec should be a thing
though; that should be separate.


25. Petite Abeille (petite.abeille (a) gmail.com)



> On Feb 22, 2021, at 11:24, Oliver Simmons <oliversimmo at gmail.com> wrote:
> 
> Having the format as part of the spec would be good

text/parameters

   The ABNF [RFC5234] grammar for "text/parameters" content is:

   file             = *((parameter / parameter-value) CRLF)
   parameter        = 1*visible-except-colon
   parameter-value  = parameter *WSP ":" value
   visible-except-colon = %x21-39 / %x3B-7E    ; VCHAR - ":"
   value            = *(TEXT-UTF8char / WSP)
   TEXT-UTF8char    = <as defined in Section 20.1>
   WSP              = <See RFC 5234> ; Space or HTAB
   VCHAR            = <See RFC 5234>
   CRLF             = <See RFC 5234>


https://tools.ietf.org/html/rfc7826#page-305 
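
For a sense of how small a parser for that grammar can be, here is a Python sketch. It is an approximation under stated assumptions: the value is accepted as any characters up to the CRLF rather than being checked against the TEXT-UTF8char rule from Section 20.1, and surrounding whitespace is trimmed from values as a convenience. The name `parse_parameters` is made up for this example.

```python
import re

# parameter       = 1*visible-except-colon   (%x21-39 / %x3B-7E)
# parameter-value = parameter *WSP ":" value
LINE = re.compile(r"([\x21-\x39\x3B-\x7E]+)(?:[ \t]*:(.*))?")


def parse_parameters(body: str) -> dict:
    """Map each parameter name to its value, or to None when it has no value."""
    result = {}
    for raw in body.split("\r\n"):
        if raw == "":
            continue  # the piece after the final CRLF is empty
        match = LINE.fullmatch(raw)
        if match is None:
            raise ValueError("not a text/parameters line: %r" % raw)
        name, value = match.groups()
        # Assumption: trim the optional whitespace around the value.
        result[name] = None if value is None else value.strip(" \t")
    return result


body = (
    "Created: 2020-11-14T17:34:19-0500\r\n"
    "License : CC0\r\n"
    "Draft\r\n"
)
params = parse_parameters(body)
print(params)
```

Only the first colon ends the name, so a value may itself contain colons, which suits ISO 8601 timestamps with `hh:mm:ss` times.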




26. Oliver Simmons (oliversimmo (a) gmail.com)

On Mon, 22 Feb 2021 at 10:31, Petite Abeille <petite.abeille at gmail.com> wrote:
>
> text/parameters
>
>    The ABNF [RFC5234] grammar for "text/parameters" content is:
>
>    file             = *((parameter / parameter-value) CRLF)
>    parameter        = 1*visible-except-colon
>    parameter-value  = parameter *WSP ":" value
>    visible-except-colon = %x21-39 / %x3B-7E    ; VCHAR - ":"
>    value            = *(TEXT-UTF8char / WSP)
>    TEXT-UTF8char    = <as defined in Section 20.1>
>    WSP              = <See RFC 5234> ; Space or HTAB
>    VCHAR            = <See RFC 5234>
>    CRLF             = <See RFC 5234>
>
>
> https://tools.ietf.org/html/rfc7826#page-305
>

That's perfect! It's pretty much what I was describing, following an
existing spec would be great! :)


27. Petite Abeille (petite.abeille (a) gmail.com)



> On Feb 22, 2021, at 11:51, Oliver Simmons <oliversimmo at gmail.com> wrote:
> 
> That's perfect! It's pretty much what I was describing, following an
> existing spec would be great! :)

In terms of keys, RFC822 & Co. is not a bad place to start.

+ modernization, i.e. ISO 8601.


