[USER] Weird Title Rendering on Various Clients

The Gnuserland <gnuserland (a) mailbox.org>

Hi Geminauts,

I am trying to understand what went wrong with the document below; the 
very first title is not rendered:

gemini://gnuser.land/gemlog/draft/bug01.gmi

The "bug01.gmi" was started in Micro and finalized in Mousepad (Debian, 
XFCE4).

Then I copied all the content into Geany (Debian, XFCE4), removed some 
whitespace, and saved it as "bug02.gmi", and the title was formatted 
properly.

gemini://gnuser.land/gemlog/draft/bug02.gmi

A diff of the two files shows there is actually a **difference** at the 
very first line, but what is it?

diff -bZ bug01.gmi bug02.gmi
1c1
< ?# Title 1
---
> # Title 1

Thanks!

TGL


Alexis <flexibeast (a) gmail.com>


The Gnuserland <gnuserland at mailbox.org> writes:

> A diff made to both files shows there is actually a
> **difference** at the very first line, but what is that?
>
> diff -bZ bug01.gmi bug02.gmi
> 1c1
> < ?# Title 1
> ---
> > # Title 1

Perhaps try running hexdump(1) (or something analogous) on both 
files and compare them byte-for-byte.


Alexis.
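
For anyone without hexdump(1) or xxd at hand, the byte-for-byte comparison suggested above can be approximated in a few lines of Python. This is only a sketch: the helper names `hexdump` and `first_difference` and the sample bytes are made up for illustration, and real files would be read with `open(path, "rb")`.

```python
def hexdump(data: bytes, width: int = 16) -> list:
    """Format bytes as offset, hex pairs, and printable ASCII (similar to xxd)."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Non-printable bytes are shown as "." so invisible content still shows up.
        text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        lines.append(f"{offset:08x}: {hex_part:<{width * 3 - 1}}  {text_part}")
    return lines


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    return None if len(a) == len(b) else min(len(a), len(b))


# Hypothetical sample data standing in for the start of two files;
# for real files use open(path, "rb").read(32) instead.
first = b"\x00# Title 1\r\n"
second = b"# Title 1\r\n"

for line in hexdump(first) + hexdump(second):
    print(line)
print("first difference at offset:", first_difference(first, second))
```

Reading the files in binary mode matters here, since a text-mode read may already hide or transform the bytes in question.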


The Gnuserland <gnuserland (a) mailbox.org>

Thanks for the suggestion. Something came out, but I'm not sure what it 
means:

< 00000000: efbb bf23 2054 6974 6c65 2031 0d0a 0d0a  ...# Title 1....
---
> 00000000: 2320 5469 746c 6520 310d 0a0d 0a23 2054  # Title 1....# T

You can check the full output here:

gemini://gnuser.land/gemlog/draft/xxd.gmi

TGL

P.S. I really loved how sweet it was to set up this page with Gemini!

On 7/6/21 12:09 AM, Alexis wrote:
>
> The Gnuserland <gnuserland at mailbox.org> writes:
>
>> A diff made to both files shows there is actually a **difference** at
>> the very first line, but what is that?
>>
>> diff -bZ bug01.gmi bug02.gmi
>> 1c1
>> < ?# Title 1
>> ---
>> > # Title 1
>
> Perhaps try running hexdump(1) (or something analogous) on both files 
> and compare them byte-for-byte.
>
>
> Alexis.


Alexis <flexibeast (a) gmail.com>


The Gnuserland <gnuserland at mailbox.org> writes:

> Thanks for the suggestion, something came out, but I'm not sure
> what it means:
>
> < 00000000: efbb bf23 2054 6974 6c65 2031 0d0a 0d0a  ...# Title 1....
> ---
> > 00000000: 2320 5469 746c 6520 310d 0a0d 0a23 2054  # Title 1....# T

That's a Byte Order Mark (the bytes ef bb bf, which are U+FEFF encoded as 
UTF-8) at the start of the first file.

http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html


Alexis.
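
As a rough illustration, the signature can be checked for in Python with the standard `codecs.BOM_UTF8` constant. The byte strings below are taken from the two xxd lines above; the helper name `has_utf8_bom` is just for this sketch.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with ef bb bf, the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


# First bytes of each file, as shown in the xxd output.
bug01_start = bytes.fromhex("efbbbf23205469746c652031")
bug02_start = bytes.fromhex("23205469746c652031")

print(has_utf8_bom(bug01_start))  # True
print(has_utf8_bom(bug02_start))  # False

# The plain "utf-8" codec keeps the BOM as a U+FEFF character;
# the "utf-8-sig" codec drops a leading BOM while decoding.
print(repr(bug01_start.decode("utf-8")))      # '\ufeff# Title 1'
print(repr(bug01_start.decode("utf-8-sig")))  # '# Title 1'
```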


skyjake <skyjake (a) dengine.net>

On 6. Jul 21, at 7.52, The Gnuserland <gnuserland at mailbox.org> wrote:

> Thanks for the suggestion, something came out, but I'm not sure what it means:
> 
> < 00000000: efbb bf23 2054 6974 6c65 2031 0d0a 0d0a  ...# Title 1....
> ---
> > 00000000: 2320 5469 746c 6520 310d 0a0d 0a23 2054  # Title 1....# T

I checked how Lagrange handles the Byte Order Mark (BOM), and sure enough 
it breaks the first line's type detection.

Fixed for future releases!

--jaakko
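
To see why a leading BOM can break line type detection, here is a simplified sketch. It is not Lagrange's actual code: `line_type` and `line_type_skipping_bom` are hypothetical helpers that only look at a line's first characters, which is roughly how gemtext line types are told apart.

```python
BOM = "\ufeff"  # U+FEFF, what the bytes ef bb bf decode to


def line_type(line: str) -> str:
    """Classify a gemtext line by its leading characters (simplified)."""
    if line.startswith("=>"):
        return "link"
    if line.startswith("#"):
        return "heading"
    if line.startswith("* "):
        return "list item"
    if line.startswith(">"):
        return "quote"
    return "text"


def line_type_skipping_bom(first_line: str) -> str:
    """Same check, but ignore one leading U+FEFF on the document's first line."""
    if first_line.startswith(BOM):
        first_line = first_line[1:]
    return line_type(first_line)


first_line = BOM + "# Title 1"
print(line_type(first_line))               # text
print(line_type_skipping_bom(first_line))  # heading
```

Skipping the BOM is only one possible fix; whether a client should do that is a separate question.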


The Gnuserland <gnuserland (a) mailbox.org>

Thank you guys,

You are amazing. :)


Actually Mousepad has an option that says: "Write Unicode BOM"

Hence my question is: does it need to be checked or unchecked?


Anyway so far the clients I have tried that do not render the bug01.gmi 
page properly are:

* Amfora

* Lagrange

* Telescope

Clients that render the page properly:

* Geminauts

Cheers,

TGL


On 7/6/21 1:56 AM, skyjake wrote:
> On 6. Jul 21, at 7.52, The Gnuserland <gnuserland at mailbox.org> wrote:
>
>> Thanks for the suggestion, something came out, but I'm not sure what it means:
>>
>> < 00000000: efbb bf23 2054 6974 6c65 2031 0d0a 0d0a  ...# Title 1....
>> ---
>> > 00000000: 2320 5469 746c 6520 310d 0a0d 0a23 2054  # Title 1....# T
>
> I checked how Lagrange handles the Byte Order Mark (BOM), and sure 
> enough it breaks the first line's type detection.
>
> Fixed for future releases!
>
> --jaakko


Jonathan McHugh <indieterminacy (a) libre.brussels>

Is this bug something already covered under the Torture Test?
=>  gemini://gemini.conman.org/test/torture

If not, should it (and other points of concern) be appended?

I wonder, do the Torture Test and similar services get used by 
browsers as part of a CI workflow?


====================
Jonathan McHugh
indieterminacy at libre.brussels

July 6, 2021 3:19 PM, "The Gnuserland" <gnuserland at mailbox.org> wrote:

> Thank you guys,
> 
> You are amazing. :)
> 
> Actually Mousepad has an option that says: "Write Unicode BOM"
> 
> Hence my question is: does it need to be checked or unchecked?
> 
> Anyway so far the clients I have tried that do not render the bug01.gmi
> page properly are:
> 
> * Amfora
> 
> * Lagrange
> 
> * Telescope
> 
> Clients that render the page properly:
> 
> * Geminauts
> 
> Cheers,
> 
> TGL
> 
> On 7/6/21 1:56 AM, skyjake wrote:
> 
>> On 6. Jul 21, at 7.52, The Gnuserland <gnuserland at mailbox.org> wrote:
>> 
>>> Thanks for the suggestion, something came out, but I'm not sure what it means:
>>> 
>>> < 00000000: efbb bf23 2054 6974 6c65 2031 0d0a 0d0a  ...# Title 1....
>>> ---
>>> > 00000000: 2320 5469 746c 6520 310d 0a0d 0a23 2054  # Title 1....# T
>> 
>> I checked how Lagrange handles the Byte Order Mark (BOM), and sure
>> enough it breaks the first line's type detection.
>> 
>> Fixed for future releases!
>> 
>> --jaakko


mbays <mbays (a) sdf.org>



>I checked how Lagrange handles the Byte Order Mark (BOM), and sure 
>enough it breaks the first line's type detection.

If we follow section 6 of RFC 3629, it looks like the right thing to do 
is to interpret this character as a zero width no-break space, even if it 
is the first character of the UTF-8-encoded gemtext. So then the first 
line should be interpreted as a text line.

Another thing to clarify in the next version of the spec.

"""
A protocol SHOULD forbid use of U+FEFF as a signature for those textual 
protocol elements that the protocol mandates to be always UTF-8, the 
signature function being totally useless in those cases.

A protocol SHOULD also forbid use of U+FEFF as a signature for those 
textual protocol elements for which the protocol provides character 
encoding identification mechanisms, when it is expected that 
implementations of the protocol will be in a position to always use the 
mechanisms properly.  This will be the case when the protocol elements 
are maintained tightly under the control of the implementation from the 
time of their creation to the time of their (properly labeled) 
transmission.

[...]

When a protocol forbids use of U+FEFF as a signature for a certain 
protocol element, then any initial U+FEFF in that protocol element MUST 
be interpreted as a "ZERO WIDTH NO-BREAK SPACE".
"""
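
A small sketch of what that reading implies in practice, assuming a client decodes with the plain `utf-8` codec and does no BOM handling of its own. The function name `describe_first_line` is hypothetical.

```python
import unicodedata


def describe_first_line(raw: bytes):
    """Decode as plain UTF-8 (no BOM stripping) and inspect the first line."""
    text = raw.decode("utf-8")
    first_line = text.splitlines()[0]
    first_char_name = unicodedata.name(first_line[0])
    # Under this reading, U+FEFF is ordinary content, so the line
    # does not start with "#" and is not a heading.
    is_heading = first_line.startswith("#")
    return first_char_name, is_heading


print(describe_first_line(b"\xef\xbb\xbf# Title 1\r\n"))
# ('ZERO WIDTH NO-BREAK SPACE', False)
print(describe_first_line(b"# Title 1\r\n"))
# ('NUMBER SIGN', True)
```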


Andrew Singleton <singletona082 (a) gmail.com>

> Another thing to clarify in the next version of the spec.

Given my own recent face-slamming against the BOM, I wouldn't mind the 
spec covering how to handle it.


skyjake <skyjake (a) dengine.net>

On 7. Jul 21, at 20.40, Andrew Singleton <singletona082 at gmail.com> wrote:
> 
>> Another thing to clarify in the next version of the spec.
> 
> Given my own recent face slamming against BOM, I wouldn't mind handling 
> of this as part of the spec.

I agree. While this is a relatively minor issue, it's always better to 
avoid undefined behavior (that depends on invisible characters!).

Submitted to GitLab:
=> https://gitlab.com/gemini-specification/protocol/-/issues/36 

--jaakko


mbays <mbays (a) sdf.org>



> [BOM]
>Submitted to GitLab:
>=> https://gitlab.com/gemini-specification/protocol/-/issues/36

From that issue:
> 1. The server MUST remove the BOM when serving UTF-8 content that 
> begins with a BOM.

One thing to consider is that a zero width space could be used at the 
start of a line as a quoting mechanism, so that a line which would 
otherwise be parsed as markup (say, one starting with "#") is shown as a 
plain text line. It isn't a nice solution to gemtext's quoting problem, 
but there doesn't seem to be a nice solution. So then even if U+FEFF is 
the "wrong" zero width space character to use, it might be used, and then 
it would be strange for it to work on every line except the first. So 
this suggests we should follow RFC 3629 when it 
says 'When a protocol forbids use of U+FEFF as a signature for a certain 
protocol element, then any initial U+FEFF in that protocol element MUST 
be interpreted as a "ZERO WIDTH NO-BREAK SPACE".'
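
The inconsistency described here can be sketched as follows. This assumes a hypothetical server-side helper, `strip_leading_bom`, that implements the proposed "remove the BOM" rule, plus a deliberately naive heading check; neither comes from any real server or client.

```python
BOM = "\ufeff"


def strip_leading_bom(body: str) -> str:
    """Proposed server behaviour: drop a single U+FEFF at the very start."""
    return body[1:] if body.startswith(BOM) else body


def is_heading(line: str) -> bool:
    """Naive gemtext check: a heading line starts with '#'."""
    return line.startswith("#")


# An author uses U+FEFF to "quote" two lines so they show as plain text.
document = BOM + "# shown literally\n" + "Some text\n" + BOM + "# also literal\n"

served = strip_leading_bom(document)
print([is_heading(line) for line in served.splitlines()])
# [True, False, False]: the trick survives on line 3 but not on line 1
```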


Jason McBrayer <jmcbray (a) carcosa.net>


The Gnuserland writes:

> Actually Mousepad has an option that says: "Write Unicode BOM"
>
> Hence my question is: does it need to be checked or unchecked?

The BOM at the start of the file is invalid for UTF-8 documents. I
*think* it's required for UTF-16, but I may be wrong and I'm too tired
to look it up. If you're writing UTF-8 documents, you should not include
a BOM (and your editor shouldn't write one if it knows the encoding is
supposed to be UTF-8).

-- 
Jason McBrayer      | "Strange is the night where black stars rise,
jmcbray at carcosa.net | and strange moons circle through the skies,
                    | but stranger still is lost Carcosa."
                    | -- Robert W. Chambers, The King in Yellow
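
If an editor has already written the signature, one way to get rid of it is to re-encode the content. A minimal sketch, assuming the input really is UTF-8; `without_bom` is a made-up helper name, while `utf-8-sig` is Python's standard codec that drops a leading BOM on decode and adds one on encode.

```python
def without_bom(data: bytes) -> bytes:
    """Return the same UTF-8 content with any leading BOM removed."""
    # "utf-8-sig" skips a leading BOM if present; "utf-8" never writes one.
    return data.decode("utf-8-sig").encode("utf-8")


print(without_bom(b"\xef\xbb\xbf# Title 1\r\n"))  # b'# Title 1\r\n'
print(without_bom(b"# Title 1\r\n"))              # b'# Title 1\r\n' (unchanged)

# For comparison, this is what an editor with "Write Unicode BOM" enabled
# would be expected to produce:
print("# Title 1".encode("utf-8-sig"))            # b'\xef\xbb\xbf# Title 1'
```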


skyjake <skyjake (a) dengine.net>

On 14. Jul 21, at 2.30, Jason McBrayer <jmcbray at carcosa.net> wrote:

> The BOM at the start of the file is invalid for UTF-8 documents. I
> *think* it's required for UTF-16, but I may be wrong and I'm too tired
> to look it up.

BOM is not invalid for UTF-8, although it has limited usefulness:

> UTF-8 can contain a BOM. However, it makes no difference as to the 
> endianness of the byte stream. UTF-8 always has the same byte order. An 
> initial BOM is only used as a signature -- an indication that an otherwise 
> unmarked text file is in UTF-8.

=> https://www.unicode.org/faq/utf_bom.html#bom1

I recommend this FAQ to anyone wondering about what to do with BOMs; 
there's plenty of good info.

--jaakko
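
To illustrate the "signature" role, here is a rough sketch of how a program might guess the encoding of an otherwise unmarked file from its first bytes. The function `sniff_signature` and the fallback to UTF-8 are assumptions for this example; the BOM constants are the standard ones from Python's `codecs` module.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (ff fe 00 00)
# begins with the UTF-16 LE BOM (ff fe).
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_signature(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, or `default` if none."""
    for signature, codec in _SIGNATURES:
        if data.startswith(signature):
            return codec
    return default


sample = b"\xef\xbb\xbf# Title 1"
codec = sniff_signature(sample)
print(codec)                 # utf-8-sig
print(sample.decode(codec))  # # Title 1
```

For UTF-16 and UTF-32 the BOM also tells the decoder the byte order; for UTF-8 it carries no such information.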

