Text reflow woes (or: I want bullets back!)

🗣️ From: Sean Conner (sean (a) conman.org)
📅 Sent: 2020-01-18 23:36
📧 Message 134 of 148

It was thus said that the Great Brian Evans once stated:
> Aaron Janse writes:
> > Hmmm. It does seem, though, that *allowing* ANSI colors would require
> > non-terminal clients to strip ANSI colors, which would be a PITA,
> > especially considering that ANSI is a hot mess (I built an ANSI parser
> > a while ago [1])
> 
> Currently Bombadillo has a few different modes. The normal mode removes 
> ansi escape codes. As I am parsing a document if I read an `\033` character I 
> just toggle an escape code boolean and then consume until I read an A-Za-z
> character (and consume that char as well). It works very quickly and handles
> removing them quite well. I do the same thing for the color mode for any
> escape codes that do not end in `m`. That said, it may not work as well for
> people not parsing by writing characters into a buffer char by char.
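
  A minimal sketch of the toggle-and-consume approach described in the
quote above, written in Python purely for illustration.  It is not
Bombadillo's actual code, and the function name `strip_escapes_simple` is
hypothetical.

```python
def strip_escapes_simple(text: str) -> str:
    """Drop ESC and everything after it, up to and including the next ASCII letter."""
    out = []
    in_escape = False
    for ch in text:
        if in_escape:
            # Still inside an escape sequence: discard characters until an
            # ASCII letter (A-Za-z) ends it. The letter is discarded too.
            if ch.isascii() and ch.isalpha():
                in_escape = False
            continue
        if ch == "\x1b":
            # ESC starts a sequence; set the flag and drop the ESC itself.
            in_escape = True
            continue
        out.append(ch)
    return "".join(out)
```

  Note that a sequence whose final character is not a letter (for example
one ending in `~`) keeps the flag set, so this sketch goes on discarding
text until the next letter turns up.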

  Having written an ECMA-48 parser (ECMA-48 defines the terminal control
codes everybody calls ANSI escape codes, even though ANSI doesn't define
them), I can say that approach will probably catch 99% of the control codes
in use.  But the actual definition is (RFC-5234 ABNF):

	CSI   = %d27 "["
	      / %d155       ; ISO-8859-1 or similar
	      / %d194 %d155 ; UTF-8 encoding
	param = %d48-63     ; chars '0' through '?'
	meta  = %d32-47     ; chars ' ' through '/'
	cmd   = %d64-126    ; chars '@' through '~'

	sequence = CSI *param *meta cmd
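
  As a rough illustration, the grammar above maps fairly directly onto a
regular expression.  The sketch below is Python, not the Lua/LPeg code
referenced later in this message, and it assumes the input has already been
decoded to text, so the C1 form of CSI appears as the single codepoint
U+009B rather than as raw bytes.  The names `CSI_SEQUENCE` and `strip_csi`
are made up for this example.

```python
import re

# One alternative per grammar rule, in the same order as the definition.
CSI_SEQUENCE = re.compile(
    r"(?:\x1b\[|\x9b)"   # CSI:   ESC '[' or the C1 codepoint U+009B
    r"[\x30-\x3f]*"      # param: '0' through '?'
    r"[\x20-\x2f]*"      # meta:  ' ' through '/'
    r"[\x40-\x7e]"       # cmd:   '@' through '~'
)

def strip_csi(text: str) -> str:
    """Remove every complete CSI sequence from already-decoded text."""
    return CSI_SEQUENCE.sub("", text)
```

  Because the final character is matched as '@' through '~' rather than as
a letter, sequences such as `ESC [ 3 ~` are removed whole.  This covers CSI
sequences only; other ECMA-48 sequences would need their own handling.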

  There are other ECMA-48 sequences that could prove dangerous if not
filtered for.  I do have Lua code to parse these [1][2] and use them in my
current gopher client to filter them out (and yes, I have come across sites
that embed ECMA-48 control codes).

> 2. Do a simple find and replace on the whole document for '\033' and replace
>     it with "ESC". While this will still leave the codes displaying to the viewer
>     they will not actually render, thus you do not need to worry about line
>      movement, screen clears, etc.

  You might want to replace the following codepoints to render control codes
harmless:

	0 - 31	; C0 set, except interpret the range from 7-13 inclusive
	127	; DEL
	128-159	; C1 set

  I say codepoints because in UTF-8, the C1 set is represented by the
two-byte sequences

	194 128 through 194 159
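
  A minimal sketch of that replacement in Python, working on decoded
codepoints so that each C1 control is a single character regardless of how
it was encoded on the wire.  The function name `neutralize_controls` and
the `<XX>` hex placeholder format are assumptions made for this example;
any visible substitute would do.

```python
def neutralize_controls(text: str) -> str:
    """Replace C0 (except 7-13), DEL and C1 codepoints with a visible <XX> tag."""
    out = []
    for ch in text:
        cp = ord(ch)
        # 7-13 (BEL, BS, HT, LF, VT, FF, CR) are left for the client to interpret.
        keep_c0 = 7 <= cp <= 13
        is_control = cp <= 31 or cp == 127 or 128 <= cp <= 159
        if is_control and not keep_c0:
            # Hypothetical placeholder: two uppercase hex digits in angle brackets.
            out.append(f"<{cp:02X}>")
        else:
            out.append(ch)
    return "".join(out)
```

  With this, an embedded `ESC [ 3 1 m` shows up as the literal text
`<1B>[31m` instead of changing the terminal's state.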

-spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/iso/control.lua

	This handles encodings in ISO-8859-1 and similar.  I have a UTF-8
	one that is separate.  This one just returns the escape sequence as
	a unit with no further parsing of the actual sequence.

[2]	https://github.com/spc476/LPeg-Parsers/blob/master/iso/ctrl.lua

	This does a more complete parse of the escape sequence, to include
	its name (if any).  Again, this is for ISO-8859-1 and similar
	encodings.  I have another version for UTF-8.
