Subject: Gemini index format

From: Sean Conner <sean@conman.org>

Date: Thu, 6 Sep 2019

Content-Type: text/gemini

Status: PROPOSED

A Proposed Formatting Specification

for Gemini Index files.

by Sean Conner

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",

"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and

"OPTIONAL" in this document are to be interpreted as described in BCP 14

[RFC2119] [RFC8174] when, and only when, they appear in all capitals, as

shown here.

[RFC2119] BCP 14

[RFC8174] (update)

A Gemini index file, regardless of character encoding [1], shall only

consist of the space character [2], graphic characters [3] and a limited set

of control characters out of the C0 set [4]; the C1 control set [5] is

outright rejected and MUST NOT appear in a Gemini index file.

The following C0 control set characters are allowed:

HT (character 9) Horizontal Tab.

Classified as "whitespace"

LF (character 10) Line Feed.

Classified as "end of line marker"

CR (character 13) Carriage Return.

Classified as "end of line marker"

SP (character 32) Space.

Classified as "whitespace"

Any other C0 control character MUST NOT appear in a Gemini index file.

Characters not defined as "end of line marker" or "whitespace" are

considered, per this specification, to be "graphical characters".

The two characters CR and LF MUST appear in that order in a Gemini index

file. It is unspecified (at this time) what should happen if a single LF or

CR is encountered. Both characters together constitute the "end of line

marker". It is also unspecified (at this time) what should happen if a C0

control character not listed above, or a C1 control character is encountered

in a Gemini index file.
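
As an illustration of the character classes above, here is a minimal sketch in Python. It assumes the file has already been decoded to text, so that the C1 set corresponds to code points 128 through 159; the function names and the "forbidden" label are hypothetical and are not part of this proposal.

```python
def classify(ch: str) -> str:
    """Classify one decoded character using the categories in this document."""
    code = ord(ch)
    if ch in (" ", "\t"):
        return "whitespace"
    if ch in ("\r", "\n"):
        return "end of line marker"
    # The remaining C0 controls (0-31) and, by assumption, the C1 controls
    # at code points 128-159 are not allowed ("forbidden" is our own label).
    if code < 32 or 128 <= code <= 159:
        return "forbidden"
    # Everything else, including DEL (127), which the text does not mention.
    return "graphical character"


def first_forbidden(text: str) -> int:
    """Return the index of the first disallowed character, or -1 if none."""
    for index, ch in enumerate(text):
        if classify(ch) == "forbidden":
            return index
    return -1
```

Since the handling of a disallowed character is unspecified, the sketch only reports where one occurs and leaves the decision to the caller.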

A "line of text" is any sequence of "whitespace" and "graphical characters"

followed by an "end of line marker".

A "line of text" that starts with the character sequence => is considered a

"link line" and contains a link to another document. The BNF [RFC5234] for

a "link line" is:

link = mark WSP url [ WSP text ] CRLF

mark = "=>"

url = 1*(%x21-7E)

; see [RFC3986] for syntax

text = 1*(%x20-FF)

; see [RFC3629] for format

CR = %x0D

LF = %x0A

CRLF = CR LF

SP = %x20

HTAB = %x09

WSP = SP / HTAB

[RFC5234] BNF syntax

[RFC3986] URL syntax

[RFC3629] UTF-8 format

For maximum interoperability, the text portion (if present) should be at

most 40 characters in length; if longer, it is up to the client to handle it

as it sees fit. The client MAY "wrap", SHOULD "reflow", or MAY "cut off" the text.

If the text portion doesn't appear, then the URL MUST be displayed as the

text portion, subject to the same limitations just mentioned.
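
A rough sketch of how a client might parse a "link line" follows. It reads the url and text rules as runs of the listed characters, accepts exactly one whitespace character after the mark as the BNF is written, and falls back to the URL when no text is given. The names LINK_RE and parse_link are hypothetical.

```python
import re

# mark, one WSP, a run of %x21-7E for the URL, then optionally one WSP
# followed by the display text.
LINK_RE = re.compile(r"=>[ \t]([\x21-\x7e]+)(?:[ \t](.+))?")


def parse_link(line: str):
    """Return (url, display_text) for a link line, or None otherwise."""
    if line.endswith("\r\n"):
        line = line[:-2]  # drop the "end of line marker"
    match = LINK_RE.fullmatch(line)
    if match is None:
        return None
    url, text = match.group(1), match.group(2)
    # With no text portion, the URL itself is displayed.
    return url, text if text is not None else url
```

The length handling (40 characters, then wrap, reflow or cut off) is deliberately left out here, since that choice belongs to the client.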

To "wrap" text, once the text has reached the right edge of the screen [6],

the text resumes at the left edge, even if it cuts a word in half. Upon

encountering an "end of line marker", move to the next line. For example, to

"wrap" the following paragraphs:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum mauris

leo, condimentum vitae varius at, elementum sed odio. Sed commodo felis

lacinia blandit vestibulum.

Duis vel sagittis massa. Maecenas sodales dui

tristique velit luctus tincidunt a sit amet neque. Sed vitae velit in

sapien semper accumsan. Nulla sem odio, malesuada a viverra at, tristique

eu tortor.

"Quisque auctor porta enim, eget tincidunt augue cursus non. Nulla at

condimentum purus. Curabitur maximus malesuada risus, at ultrices nisl

luctus vel. Nulla eget est luctus, dignissim urna vel, luctus felis. Donec

facilisis malesuada porta. Nulla elementum felis ut justo sollicitudin

pellentesque. Vestibulum faucibus, ipsum tincidunt volutpat lacinia, turpis

libero bibendum sem, in malesuada turpis lectus et neque."

with a width of 30 characters:

123456789012345678901234567890

------------------------------

"Lorem ipsum dolor sit amet, c

onsectetur adipiscing elit. Ve

stibulum mauris leo, condiment

um vitae varius at, elementum

sed odio. Sed commodo felis la

cinia blandit vestibulum.

Duis vel sagittis massa. Maece

nas sodales dui tristique veli

t luctus tincidunt a sit amet

neque. Sed vitae velit in sapi

en semper accumsan. Nulla sem

odio, malesuada a viverra at,

tristique eu tortor.

"Quisque auctor porta enim, eg

et tincidunt augue cursus non.

Nulla at condimentum purus. C

urabitur maximus malesuada ris

us, at ultrices nisl luctus ve

l. Nulla eget est luctus, dign

issim urna vel, luctus felis.

Donec facilisis malesuada port

a. Nulla elementum felis ut ju

sto sollicitudin pellentesque.

Vestibulum faucibus, ipsum ti

ncidunt volutpat lacinia, turp

is libero bibendum sem, in mal

esuada turpis lectus et neque.

"

To "reflow" text, lines are broken at whitespace [7], where an "end of line

marker" is placed to start the next line, and any existing "end of line

markers" are ignored unless there are two in a row. For example, the two

example paragraphs "reflowed" at 30 characters:

123456789012345678901234567890

------------------------------

"Lorem ipsum dolor sit amet,

consectetur adipiscing elit.

Vestibulum mauris leo,

condimentum vitae varius at,

elementum sed odio. Sed

commodo felis lacinia blandit

vestibulum. Duis vel

sagittis massa. Maecenas

sodales dui tristique velit

luctus tincidunt a sit amet

neque. Sed vitae velit in

sapien semper accumsan.

Nulla sem odio, malesuada a

viverra at, tristique eu

tortor.

"Quisque auctor porta enim,

eget tincidunt augue cursus

non. Nulla at condimentum

purus. Curabitur maximus

malesuada risus, at ultrices

nisl luctus vel. Nulla eget

est luctus, dignissim urna

vel, luctus felis. Donec

facilisis malesuada porta.

Nulla elementum felis ut

justo sollicitudin

pellentesque. Vestibulum

faucibus, ipsum tincidunt

volutpat lacinia, turpis

libero bibendum sem, in

malesuada turpis lectus et

neque."

If a suitable breaking point cannot be found (no whitespace or "end of line

markers" found), then the line MUST be wrapped to the next line, at which

the "reflow" algorithm is picked up again. An example of a paragraph that

exhibits such behavior, again at 30 characters:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum mauris

leo, condimentumvitaevariusat,elementum sed odio. Sed commodo felis

lacinia blandit vestibulum. Duis vel sagittis massa. Maecenas sodales dui

tristique velit luctus tincidunt a sit amet neque. Sed vitae velit in

sapien semper accumsan. Nulla sem odio, malesuada a viverra at, tristique

eu tortor."

123456789012345678901234567890

------------------------------

"Lorem ipsum dolor sit amet,

consectetur adipiscing elit.

Vestibulum mauris leo,

condimentumvitaevariusat,eleme

ntum sed odio. Sed commodo

felis lacinia blandit

vestibulum. Duis vel

sagittis massa. Maecenas

sodales dui tristique velit

luctus tincidunt a sit amet

neque. Sed vitae velit in

sapien semper accumsan.

Nulla sem odio, malesuada a

viverra at, tristique eu

tortor."
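
The "reflow" behaviour, including the fallback to wrapping, could be sketched as below. This is only one possible reading: it assumes every character occupies one column, collapses runs of whitespace to a single space, starts an over-long word on a fresh line before wrapping it (as in the example above), and separates paragraphs with one empty row. The function name reflow is hypothetical.

```python
def reflow(text: str, width: int) -> list[str]:
    """Reflow CRLF-delimited text to rows of at most `width` characters."""
    rows = []
    # Two "end of line markers" in a row separate paragraphs.
    for paragraph in text.split("\r\n\r\n"):
        current = ""
        # split() also discards single end of line markers inside a paragraph.
        for word in paragraph.split():
            candidate = word if not current else current + " " + word
            if len(candidate) <= width:
                current = candidate
                continue
            if current:
                rows.append(current)
            # No suitable breaking point: wrap the word, then resume reflowing.
            while len(word) > width:
                rows.append(word[:width])
                word = word[width:]
            current = word
        rows.append(current)
        rows.append("")  # empty row between paragraphs
    return rows[:-1]     # drop the separator after the last paragraph
```

Breaking at a dash or soft hyphen is not attempted; only whitespace counts as a breaking point in this sketch.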

To "cut off", any characters past the right edge of the screen are simply

discarded until the next "end of line marker":

123456789012345678901234567890

------------------------------

"Lorem ipsum dolor sit amet, c

leo, condimentum vitae varius

lacinia blandit vestibulum.

Duis vel sagittis massa. Maec

tristique velit luctus tincidu

sapien semper accumsan. Nulla

eu tortor.

"Quisque auctor porta enim, eg

condimentum purus. Curabitur

luctus vel. Nulla eget est lu

facilisis malesuada porta. Nu

pellentesque. Vestibulum fauc

libero bibendum sem, in malesu
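
The two simpler strategies, "wrap" and "cut off", can be sketched in a few lines each. The sketch assumes one character per screen column and input lines separated by CR LF; the function names are hypothetical.

```python
def wrap(text: str, width: int) -> list[str]:
    """Break each line every `width` characters, even in mid-word."""
    rows = []
    for line in text.split("\r\n"):
        if not line:
            rows.append("")  # keep empty lines as empty rows
            continue
        for start in range(0, len(line), width):
            rows.append(line[start:start + width])
    return rows


def cut_off(text: str, width: int) -> list[str]:
    """Keep the first `width` characters of each line and discard the rest."""
    return [line[:width] for line in text.split("\r\n")]
```

Both functions return one string per screen row, so a caller can print them or hand them to whatever drawing routine the client uses.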

A "line of text" that starts with one or more "whitespace" characters,

followed by "graphical characters" is a "fixed line" and MUST NOT be

"reflowed"; it MAY be "wrapped" or it MAY be "cut off". For maximum

interoperability, such "fixed" lines SHOULD be 40 characters or less. The

BNF for a "fixed" line:

fixed = 1*WSP VCHAR *text CRLF

VCHAR = %x21-FF

; see [RFC3629] for format

[RFC3629] UTF-8 format

It is an ambiguous condition when a line consists of only whitespace

characters, and such a line SHOULD NOT appear in a Gemini index file.

A "line of text" that does not start with whitespace or the character

sequence => is subject to being "reflowed" until two consecutive "end of

line markers" are encountered.
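
Putting the three kinds of line together, a client might sort each "line of text" as sketched below. The labels returned ("link", "fixed", "text", "blank", "ambiguous") and the name line_kind are hypothetical; treating an empty line as "blank" reflects its role as the second of two consecutive "end of line markers".

```python
def line_kind(line: str) -> str:
    """Classify one line, given with or without its trailing CR LF."""
    body = line[:-2] if line.endswith("\r\n") else line
    if body == "":
        return "blank"      # ends a run of reflowable text
    if body.startswith("=>"):
        return "link"
    if body[0] in (" ", "\t"):
        # Whitespace followed by graphical characters is a "fixed line";
        # whitespace alone is the ambiguous case and is only reported here.
        return "fixed" if body.strip(" \t") else "ambiguous"
    return "text"           # subject to being "reflowed"
```

A renderer could then reflow consecutive "text" lines as one paragraph, and wrap or cut off "fixed" lines without reflowing them.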

It should be noted that this document follows the format given above, with

no fixed line longer than 40 characters and no link text longer than 40

characters. The rest of the text can be reflowed at any given width. This

should give a feeling for what such a document would look like.

* * * * *

[1] An assumption is being made that any character encoding system used is

based on US-ASCII, which defines the first 128 characters.

[2] US-ASCII character 32. It is considered both a control character, as a

finer-grained companion to the separation characters FS (file separator),

GS (group separator), RS (record separator) and US (unit separator), and a

graphic character, despite not having a graphical representation. For

UTF-8, this will also include the variations on white space, such as thin

spacing, zero-width spacing, etc.

[3] Any character with a visual representation, or as part of a visual

representation.

[4] Characters 0 through 31.

[5] The so-called ANSI escape codes. See Wikipedia for more information.

C1 Set

[6] This assumes a "left-to-right" ordering of characters. Other orderings

of rendering text are out of scope for this document.

[7] An ambitious implementation may want to break at a dash (-) or a soft

hyphen (Unicode character U+00AD).