Subject: Gemini index format
From: Sean Conner <sean@conman.org>
Date: Thu, 6 Sep 2019
Content-Type: text/gemini
Status: PROPOSED
A Proposed Formatting Specification
for Gemini Index files.
by Sean Conner
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP 14
[RFC2119] [RFC8174] when, and only when, they appear in all capitals, as
shown here.
A Gemini index file, regardless of character encoding [1], shall only
consist of the space character [2], graphic characters [3] and a limited set
of control characters out of the C0 set [4]; the C1 control set [5] is
outright rejected and MUST NOT appear in a Gemini index file.
The following C0 control set characters are allowed:
HT (character 9) Horizontal Tab.
Classified as "whitespace"
LF (character 10) Line Feed.
Classified as "end of line marker"
CR (character 13) Carriage Return.
Classified as "end of line marker"
SP (character 32) Space.
Classified as "whitespace"
Any other C0 control character MUST NOT appear in a Gemini index file.
Any character not defined as an "end of line marker" or "whitespace" is
considered, per this specification, to be a "graphical character".
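The character classes above can be sketched in Python. This is only an illustration under stated assumptions, not part of the proposal: the names CharClass, classify and forbidden_positions are hypothetical, the input is assumed to be already decoded into Unicode code points, and the C1 set is assumed to be U+0080 through U+009F. DEL (character 127) is not mentioned in this document, so the sketch simply treats it as graphical.

```python
from enum import Enum


class CharClass(Enum):
    WHITESPACE = "whitespace"
    END_OF_LINE = "end of line marker"
    GRAPHICAL = "graphical character"
    FORBIDDEN = "forbidden control character"


def classify(ch: str) -> CharClass:
    """Classify a single decoded character per the rules above."""
    cp = ord(ch)
    if cp in (0x09, 0x20):          # HT, SP
        return CharClass.WHITESPACE
    if cp in (0x0A, 0x0D):          # LF, CR
        return CharClass.END_OF_LINE
    # Remaining C0 characters, plus C1 (assumed here to be U+0080..U+009F).
    if cp < 0x20 or 0x80 <= cp <= 0x9F:
        return CharClass.FORBIDDEN
    # Everything else, including DEL (an assumption), counts as graphical.
    return CharClass.GRAPHICAL


def forbidden_positions(text: str) -> list[tuple[int, int]]:
    """Return (index, code point) for every character that MUST NOT appear."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if classify(ch) is CharClass.FORBIDDEN
    ]
```

What a client does with a non-empty result is left open, since this document does not specify how such characters should be handled.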
The two characters CR and LF MUST appear in that order in a Gemini index
file. It is unspecified (at this time) what should happen if a single LF or
CR is encountered. Both characters together constitute the "end of line
marker". It is also unspecified (at this time) what should happen if a C0
control character not listed above, or a C1 control character is encountered
in a Gemini index file.
A "line of text" is any sequence of "whitespace" and "graphical characters"
followed by an "end of line marker".
A "line of text" that starts with the character sequence => is considered a
"link line" and contains a link to another document. The BNF [RFC5234] for
a "link line" is:
link = mark WSP url [ WSP *text ] CRLF
mark = "=>"
url = 1*(%x21-7E)
; see [RFC3986] for syntax
text = %x20-FF
; see [RFC3629] for format
CR = %x0D
LF = %x0A
CRLF = CR LF
SP = %x20
HTAB = %x09
WSP = SP / HTAB
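A minimal Python sketch of a parser for the "link line" grammar follows. It is an illustration only: the names Link and parse_link_line are hypothetical, the line is assumed to be already decoded, and the sketch is deliberately more lenient than the BNF in that it accepts one or more whitespace characters where the grammar shows a single WSP. It checks only the character range of the URL, not the full [RFC3986] syntax.

```python
import re
from typing import NamedTuple, Optional


class Link(NamedTuple):
    url: str
    text: Optional[str]   # None when the text portion is absent


# "=>", whitespace, a run of %x21-7E, then optional whitespace and text.
# Accepting runs of whitespace is an assumption of this sketch.
_LINK_RE = re.compile(r"=>[ \t]+([\x21-\x7E]+)(?:[ \t]+(.*))?")


def parse_link_line(line: str) -> Optional[Link]:
    """Return a Link for a "link line", or None for any other line."""
    if line.endswith("\r\n"):
        line = line[:-2]              # drop the end of line marker
    match = _LINK_RE.fullmatch(line)
    if match is None:
        return None
    text = match.group(2) or None     # treat an empty text portion as absent
    return Link(match.group(1), text)
```

For example, parse_link_line("=> gemini://example.org/ Example site\r\n") yields the URL "gemini://example.org/" and the text "Example site" (example.org is a placeholder host).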
For maximum interoperability, the text portion (if present) should be at
most 40 characters in length; if it is longer, it is up to the client to
handle it as it sees fit. The client SHOULD "reflow" the text, but it MAY
"wrap" it or "cut off" the text instead.
If the text portion doesn't appear, then the URL MUST be displayed as the
text portion, subject to the same limitations just mentioned.
To "wrap" text, once the text has reached the right edge of the screen [6],
the text resumes at the left edge, even if it cuts a word in half. Upon
encountering an "end of line marker", move to the next line. For example, to
"wrap" the following paragraphs:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum mauris
leo, condimentum vitae varius at, elementum sed odio. Sed commodo felis
lacinia blandit vestibulum.
Duis vel sagittis massa. Maecenas sodales dui
tristique velit luctus tincidunt a sit amet neque. Sed vitae velit in
sapien semper accumsan. Nulla sem odio, malesuada a viverra at, tristique
eu tortor.
"Quisque auctor porta enim, eget tincidunt augue cursus non. Nulla at
condimentum purus. Curabitur maximus malesuada risus, at ultrices nisl
luctus vel. Nulla eget est luctus, dignissim urna vel, luctus felis. Donec
facilisis malesuada porta. Nulla elementum felis ut justo sollicitudin
pellentesque. Vestibulum faucibus, ipsum tincidunt volutpat lacinia, turpis
libero bibendum sem, in malesuada turpis lectus et neque."
with a width of 30 characters:
123456789012345678901234567890
------------------------------
"Lorem ipsum dolor sit amet, c
onsectetur adipiscing elit. Ve
stibulum mauris leo, condiment
um vitae varius at, elementum
sed odio. Sed commodo felis la
cinia blandit vestibulum.
Duis vel sagittis massa. Maece
nas sodales dui tristique veli
t luctus tincidunt a sit amet
neque. Sed vitae velit in sapi
en semper accumsan. Nulla sem
odio, malesuada a viverra at,
tristique eu tortor.
"Quisque auctor porta enim, eg
et tincidunt augue cursus non.
Nulla at condimentum purus. C
urabitur maximus malesuada ris
us, at ultrices nisl luctus ve
l. Nulla eget est luctus, dign
issim urna vel, luctus felis.
Donec facilisis malesuada port
a. Nulla elementum felis ut ju
sto sollicitudin pellentesque.
Vestibulum faucibus, ipsum ti
ncidunt volutpat lacinia, turp
is libero bibendum sem, in mal
esuada turpis lectus et neque.
"
To "reflow" text, lines are broken at whitespace [7], where an "end of line
marker" is placed to start the next line, and any existing "end of line
markers" are ignored unless there are two in a row. For example, the two
example paragraphs "reflowed" at 30 characters:
123456789012345678901234567890
------------------------------
"Lorem ipsum dolor sit amet,
consectetur adipiscing elit.
Vestibulum mauris leo,
condimentum vitae varius at,
elementum sed odio. Sed
commodo felis lacinia blandit
vestibulum. Duis vel
sagittis massa. Maecenas
sodales dui tristique velit
luctus tincidunt a sit amet
neque. Sed vitae velit in
sapien semper accumsan.
Nulla sem odio, malesuada a
viverra at, tristique eu
tortor.
"Quisque auctor porta enim,
eget tincidunt augue cursus
non. Nulla at condimentum
purus. Curabitur maximus
malesuada risus, at ultrices
nisl luctus vel. Nulla eget
est luctus, dignissim urna
vel, luctus felis. Donec
facilisis malesuada porta.
Nulla elementum felis ut
justo sollicitudin
pellentesque. Vestibulum
faucibus, ipsum tincidunt
volutpat lacinia, turpis
libero bibendum sem, in
malesuada turpis lectus et
neque."
If a suitable breaking point cannot be found (no whitespace or "end of line
markers" found), then the line MUST be wrapped to the next line, at which
point the "reflow" algorithm is picked up again. An example of a paragraph that
exhibits such behavior, again at 30 characters:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum mauris
leo, condimentumvitaevariusat,elementum sed odio. Sed commodo felis
lacinia blandit vestibulum. Duis vel sagittis massa. Maecenas sodales dui
tristique velit luctus tincidunt a sit amet neque. Sed vitae velit in
sapien semper accumsan. Nulla sem odio, malesuada a viverra at, tristique
eu tortor."
123456789012345678901234567890
------------------------------
"Lorem ipsum dolor sit amet,
consectetur adipiscing elit.
Vestibulum mauris leo,
condimentumvitaevariusat,eleme
ntum sed odio. Sed commodo
felis lacinia blandit
vestibulum. Duis vel
sagittis massa. Maecenas
sodales dui tristique velit
luctus tincidunt a sit amet
neque. Sed vitae velit in
sapien semper accumsan.
Nulla sem odio, malesuada a
viverra at, tristique eu
tortor."
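The "reflow" rule, including the fallback to wrapping when no breaking point exists, can be sketched in Python as follows. This is a hedged illustration, not a reference implementation: the function name reflow is hypothetical, the text is assumed to be already decoded with CR LF as the "end of line marker", and runs of whitespace are collapsed to a single space, which is a simplification this document does not require.

```python
import re


def reflow(text: str, width: int) -> list[str]:
    """Greedily fill rows of at most `width` characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    rows: list[str] = []
    # Two (or more) consecutive end of line markers separate paragraphs.
    paragraphs = re.split(r"(?:\r\n){2,}", text.strip("\r\n"))
    for n, paragraph in enumerate(paragraphs):
        if n > 0:
            rows.append("")               # blank row between paragraphs
        current = ""
        # Single end of line markers are treated like any other whitespace.
        for word in re.findall(r"[^ \t\r\n]+", paragraph):
            candidate = word if not current else current + " " + word
            if len(candidate) <= width:
                current = candidate
                continue
            if current:
                rows.append(current)
            # No suitable breaking point: wrap the oversized word, then
            # pick up the reflow again with whatever is left of it.
            while len(word) > width:
                rows.append(word[:width])
                word = word[width:]
            current = word
        if current:
            rows.append(current)
    return rows
```

Run at a width of 30 on the paragraph above, the sketch breaks the long run after "condimentumvitaevariusat,eleme" and continues the next row with "ntum sed odio.", as in the example, apart from the collapsed spacing.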
To "cut off", any characters past the right edge of the screen are simply
discarded until the next "end of line marker":
123456789012345678901234567890
------------------------------
"Lorem ipsum dolor sit amet, c
leo, condimentum vitae varius
lacinia blandit vestibulum.
Duis vel sagittis massa. Maec
tristique velit luctus tincidu
sapien semper accumsan. Nulla
eu tortor.
"Quisque auctor porta enim, eg
condimentum purus. Curabitur
luctus vel. Nulla eget est lu
facilisis malesuada porta. Nu
pellentesque. Vestibulum fauc
libero bibendum sem, in malesu
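The two simpler strategies, "wrap" and "cut off", can be sketched in Python as follows. This is an illustration under stated assumptions: the names wrap, cut_off and _split_lines are hypothetical, the text is assumed to be already decoded with CR LF as the "end of line marker", and each character is assumed to occupy one column. The wrap sketch follows the prose rule literally, so every "end of line marker" in the input starts a new row.

```python
def _split_lines(text: str) -> list[str]:
    """Split on CR LF, dropping the empty piece after a final marker."""
    lines = text.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def wrap(text: str, width: int) -> list[str]:
    """Continue on the next row at the right edge, even mid-word."""
    if width < 1:
        raise ValueError("width must be at least 1")
    rows: list[str] = []
    for line in _split_lines(text):
        if line == "":
            rows.append("")               # keep empty lines as empty rows
        for start in range(0, len(line), width):
            rows.append(line[start:start + width])
    return rows


def cut_off(text: str, width: int) -> list[str]:
    """Discard everything past the right edge of each line."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return [line[:width] for line in _split_lines(text)]
```

With a width of 3, the input "abcdefgh" followed by CR LF wraps to the rows "abc", "def" and "gh", while cutting it off leaves only "abc".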
A "line of text" that starts with one or more "whitespace" characters,
followed by "graphical characters" is a "fixed line" and MUST NOT be
"reflowed"; it MAY be "wrapped" or it MAY be "cut off". For maximum
interoperability, such "fixed" lines SHOULD be 40 characters or less. The
BNF for a "fixed" line:
fixed = 1*WSP VCHAR *text CRLF
VCHAR = %x21-FF
; see [RFC3629] for format
It is an ambiguous condition when a line consists of only whitespace
characters, and such a line SHOULD NOT appear in a Gemini index file.
A "line of text" that does not start with whitespace or the character
sequence => is subject to being "reflowed" until two consecutive "end of
line markers" are encountered.
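Putting the three kinds of line together, a client might first decide how each line is to be treated. The Python sketch below is one possible reading, not a definitive one: the function name line_kind and the labels it returns are hypothetical, and reporting whitespace-only lines as "ambiguous" simply mirrors the wording above without settling how a client should render them.

```python
def line_kind(line: str) -> str:
    """Return "link", "fixed", "text", "blank" or "ambiguous"."""
    # Drop the end of line marker if the caller left it in place.
    body = line[:-2] if line.endswith("\r\n") else line
    if body.startswith("=>"):
        return "link"                     # see the "link line" grammar
    if body == "":
        return "blank"                    # second of two end of line markers
    if body.strip(" \t") == "":
        return "ambiguous"                # only whitespace: SHOULD NOT appear
    if body[0] in " \t":
        return "fixed"                    # MUST NOT be reflowed
    return "text"                         # subject to being reflowed
```

A renderer could then dispatch on the result, for instance reflowing consecutive "text" lines as one paragraph until a "blank" line is reached.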
It should be noted that this document follows the format given above, with
no fixed line longer than 40 characters and no link text longer than 40
characters. The rest of the text can be reflowed at any given width. This
should give a feeling for what such a document would look like.
* * * * *
[1] An assumption is being made that any character encoding system used is
based on US-ASCII, which defines the first 128 characters.
[2] US-ASCII character 32. It is considered both a control character,
acting as a finer-grained separator alongside the unit separation characters
FS (file separator), GS (group separator), RS (record separator) and US
(unit separator), and a graphic character, despite not having a graphical
representation. For UTF-8, this will also include the variations on white
space, such as thin spacing, zero-width spacing, etc.
[3] Any character with a visual representation, or as part of a visual
representation.
[4] Characters 0 through 31.
[5] The so-called ANSI escape codes. See Wikipedia for more information.
[6] This assumes a "left-to-right" ordering of characters. Other orderings
for rendering text are out of scope for this document.
[7] An ambitious implementation may want to break at a dash (-) or a soft
hyphen (Unicode character U+00AD).