Escaping in gemtext

🗣️ From: Sean Conner (sean (a) conman.org)
📅 Sent: 2020-11-10 07:15
📧 Message 11 of 22

  There's quite a bit to unpack here.

  First, let me say that the issue here is in-band signalling.  You get this
issue whenever you use some value (or character in this case) to signal
some change in interpretation of data, and you need to use the value (or
character) *as* data and not a signal.

  HTML has this issue as well, in that it needs a way to designate a markup
tag, and it uses '<' for that (based upon its use in SGML).  But if one
needs to use '<' in regular text it needs to be escaped.  So (again, from
SGML) they use the '&' character to introduce named entities---a
representation of a character that could not otherwise be typed.  But that
means if you want to use '&' as a character, it too, needs to be escaped. 
So that means in HTML, if you want to display a '&' you escape it as "&amp;"
[1].  And to display a '<' you escape it as "&lt;".
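
  A quick sketch of that escaping, using Python's standard html module (the
language and the sample string are just picked for illustration):

```python
import html

text = "if a < b && c"

# '&' has to be escaped first, otherwise the '&' introduced by "&lt;"
# would itself get escaped; html.escape() takes care of that ordering.
escaped = html.escape(text, quote=False)
print(escaped)

# Going the other way recovers the original text.
assert html.unescape(escaped) == text
```

  Note that both directions have to look at every character, which is
exactly the cost being discussed here.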

  Gemtext does *not* have such a facility, as it complicates the processing
of the text, which is something solderpunk wanted to keep simple.  Escaping
data complicates this (I've seen complications with the proper encoding of
URLs for instance).  There is no solution (aside from serving up a plain
text file) that will easily solve this issue.
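
  To make the problem concrete, here is a minimal sketch of the line-oriented
handling, assuming the only rule is "a line starting with three graves
toggles preformatted mode" (it ignores links, headings and everything else
in gemtext; split_gemtext is a made-up name):

```python
def split_gemtext(lines):
    """Yield (kind, text) pairs, where kind is 'text' or 'pre'."""
    preformatted = False
    for line in lines:
        if line.startswith("```"):
            # In-band signal: the toggle line itself is never shown.
            preformatted = not preformatted
            continue
        yield ("pre" if preformatted else "text", line)

doc = [
    "```",
    "how to start a block:",
    "```",              # meant as data, but read as a toggle
    "some diagram",
    "```",              # meant to close the block, actually opens a new one
    "trailing text",
]
result = list(split_gemtext(doc))
for kind, text in result:
    print(kind, text)
```

  The second toggle line was supposed to be part of the diagram, and there
is nothing the author can type to say so.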

  Now on with the rest of the commentary ... 

It was thus said that the Great Ryan Westlund once stated:
> The main reason I don't prefer it to my own suggestion is that it would
> still mean that preformatted lines might need to be altered in some way
> (if the preformatted lines contain "\```" or something), instead of
> allowing to paste them in unmodified and only have to modify the
> preformatting toggle lines.

  Sorry, no way around that.  I mean, one *could* use HTML entities:

 ```
Blah blah blah blah.  And now a preformatted block of code:

&DiacriticalGrave;`` 
This is a diagram
&DiacriticalGrave;`` 

 ```

but then the processing becomes harder as the client would then have to scan
character by character, converting entities to characters.  Or perhaps use
the standard '\' as an escape character:

 ```
Blah blah blah blah.  And now a preformatted block of code:

\```
This is a diagram
``\`
 ```

as long as at least one of the grave characters is escaped, it won't trigger
block mode (on or off).  But again, you have to process everything character
by character to handle the '\' character.  Or just decide that the following
four characters at the start of a line

	\```

is to be presented, verbatim, as 

	```

and *not* trigger block mode.  As mentioned earlier, one could just use more
than three such characters:

 ````````````
Look Ma!  Block mode!

It's defined as

 ```
...

See?

 ````````

But it's not really defined as three, but more than three, and again, you
have issues.  Other possibilities---use the first non ` character as a final
delimiter:

 ```|
To define a block mode, use three grave accents in a row, with another
character that doesn't appear in the text; said character will then end the
block mode.  For example:

 ```@
this is block mode
@

See?
|

Or perhaps a sequence of characters?

 ```end-of-line
To define a block mode, use three grave accents in a row, followed by a
sequence of non-blank characters; said sequence will end the block mode:

 ```EOF
This is block mode
EOF

See?
end-of-line
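
  A rough sketch of how a client might handle that, assuming the block ends
at a line that exactly matches whatever followed the three graves, and that
a bare three-grave line falls back to being closed by another bare
three-grave line (both assumptions, and split_tagged is a made-up name; none
of this is in the spec):

```python
def split_tagged(lines):
    """Yield (kind, text) pairs, where kind is 'text' or 'pre'."""
    terminator = None          # None means we are in normal text mode
    for line in lines:
        if terminator is None:
            if line.startswith("```"):
                # Whatever follows the graves is the line that ends the block.
                terminator = line[3:].strip() or "```"
            else:
                yield ("text", line)
        elif line.rstrip() == terminator:
            terminator = None  # block is closed, terminator line is dropped
        else:
            yield ("pre", line)

doc = [
    "```end-of-line",
    "To define a block mode ...",
    "```EOF",
    "This is block mode",
    "EOF",
    "See?",
    "end-of-line",
    "after",
]
result = list(split_tagged(doc))
```

  It is still line by line, but the client now has to carry the terminator
around instead of a single flag.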

  Or how about this variant:

 ```end-of-line
Blah blah blah
 ```EOF
This is a sample block mode
 ```EOF
See?
 ```end-of-line

  I mean, you can go crazy with this stuff.  But every option involves more
processing than happens now.
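
  As an example of "more processing", here is a sketch of what is probably
the cheapest option above, the rule that a line starting with a backslash
followed by three graves is shown as three graves.  It assumes the rule
applies both inside and outside block mode, which is a guess, and
split_with_escape is a made-up name:

```python
def split_with_escape(lines):
    """Yield (kind, text) pairs, where kind is 'text' or 'pre'."""
    preformatted = False
    for line in lines:
        if line.startswith("```"):
            preformatted = not preformatted
            continue
        if line.startswith("\\```"):
            # Escaped toggle: drop the backslash, keep the graves as data.
            line = line[1:]
        yield ("pre" if preformatted else "text", line)

doc = [
    "```",
    "\\```",
    "This is a diagram",
    "\\```",
    "```",
    "done",
]
result = list(split_with_escape(doc))
```

  That is one extra prefix check per line and no character-by-character
scan, but it is still a check that clients do not have to do today.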

  This is also not to say I endorse or condemn any of these methods.

> For the sake of use case: I write Python tutorials in Markdown, as well as
> the specification for Sanemark, a variant of Markdown.  

  Do you know that Markdown was created by John Gruber as an easy way to
create HTML pages, with shortcuts for the tags he used the most often,
leaving the more obscure or harder to support tags to HTML itself?  I mean,
why else would his Markdown include the ability to include HTML?  If he
needed an image (and I don't think he includes many images) he would type
the <IMG ...  > tag by hand.  The variations come when people wanted to
*replace* all HTML with this weird shorthand notation (and then go on to
generate HTML from it) [2].

> Several similar
> issues have come up for me before with Markdown (this specific one
> would've been a major obstacle for the Sanemark spec if Markdown didn't
> implement what I suggest, because leading space is significant).

  So I looked up Sanemark, because I was curious.  And I came across this
bit where you said: [9]

> The rules for HTML blocks are overcomplicated as hell.  The spec defines 7
> different kinds of them, including support for obscure bullshit that
> should never have been invented like <?php and CDATA, and a fucking
> hardcoded list of all block-level HTML tags.  Nevermind future-proofing, I
> guess custom elements can go fuck themselves?

  The reason that abominations like "<?php" and "<![CDATA[" exist is because
*people wanted support for it!*  You might consider them abominations (I'll
agree with PHP [6]) but not everybody does, and they do solve real issues [7]. 
And isn't trying to change the Gemini text specification a form of
contravariance? [8]  I'm just asking ...

  I'm also reminded of this quote from Bjarne Stroustrup, creator of C++:

	There are just two kinds of [programming] languages: the ones
	everybody complains about and the ones nobody uses.

  -spc (But the nice thing about standards is that there are so many to
	choose from ... )

[1]	Yes, there are four more ways to get that as well, "&#38;",
	"&#X26;", "&#x26;" and "<![CDATA[&]]>", but I don't want to digress
	too much here ...

[2]	Personally, I'm not a fan of Markdown (to the degree that I rejected a
	pull request for GLV-1.12556 [3]) and *I* even created my own markup
	language [4] to make it easier for me to write blog posts.  But I
	don't *store* the blog entries in this markup language, but in their
	final HTML form.  That way, I can modify the language (it's already
	happened at least once) without having to maintain backwards
	compatibility or having to update dozens, perhaps hundreds, of
	previously written entries.

	If you are curious enough, here's the code to format it [5].

[3]	https://github.com/spc476/GLV-1.12556/pull/2

[4]	https://github.com/spc476/mod_blog/blob/master/NOTES/testmsg

[5]	https://github.com/spc476/mod_blog/blob/master/Lua/format.lua

[6]	I personally hate the language, but I can't deny that it lets many
	people who would otherwise not be able to express themselves,
	express themselves.  I console myself with the fact that I don't
	have to maintain such code, thank God.

[7]	You want to easily embed HTML sample code in HTML?  You don't have
	to entity escape every '<' and '&' and "'" but instead drop that
	mess into a single <![CDATA[ ...  ]]> block and there you go. 
	Sample HTML code in an HTML page without pain (as long as there
	isn't a <![CDATA[ ]]> block inside; in that case, you can
	escape the leading '<' and trailing '>' with entities, which is
	*still* a lot less work than trying to safely convert HTML with
	entities).

[8]	https://yujiri.xyz/software/specs_are_contravariant

[9]	https://yujiri.xyz/sanemark