[tech] Zero-width characters and tracking via pasted text

nervuri <nervuri (a) disroot.org>

Public service announcement: Zero-width characters can be used to embed
hidden information inside of plain text.  For example, a page can be
dynamically generated server-side to include, between every few words:



By copying text from the page and pasting it somewhere public, you would
be revealing this information to anyone who knew how to look for it.
This subtle form of tracking would work just as well on Gemini as it
does on the web.  Gopher is more protected from this, as many clients
are ASCII-only.
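
As an illustration only, here is a minimal Python sketch of how such an embedding could work. The function names, the choice of U+200B and U+200C as the two "bit" characters, and the example identifier are all assumptions made for this sketch, not something taken from a real tracking system. For simplicity it hides the whole payload after the first word rather than spreading it between every few words.

```python
# Hypothetical sketch: hide a short identifier in text using two
# zero-width characters as binary digits.
ZERO = "\u200b"  # ZERO WIDTH SPACE, stands for bit 0 (assumed convention)
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, stands for bit 1 (assumed convention)


def embed(visible: str, secret: str) -> str:
    """Insert the secret, as invisible bits, right after the first word."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    payload = "".join(ONE if bit == "1" else ZERO for bit in bits)
    first, sep, rest = visible.partition(" ")
    return first + payload + sep + rest


def extract(text: str) -> str:
    """Recover the secret from the ZERO/ONE characters found in the text."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore an incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


page = embed("Copy this sentence and paste it somewhere public.", "visitor-1234")
print(extract(page))
```

The marked text looks identical to the original in most renderers, yet anyone who knows the convention can read the identifier back out of a pasted copy.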

I'm gathering information about this sort of thing (and text
steganography more broadly) at:

gemini://rawtext.club/~nervuri/stega.gmi
gemini://rawtext.club/~nervuri/zero-width.gmi

The first is an explanation of the problem, with links to various tools
and references.

The second is a test to check which software displays (or warns about
the presence of) zero-width characters.

Contributions are welcome.  I'm especially interested in what software
passes the 0-width character test and what 0-width chars I've missed.
If you know of a good tool for detecting plain text steganography, do
tell.

P.S.  In Amfora I noticed a visual glitch when I opened
gemini://rawtext.club/~nervuri/zero-width.gmi and scrolled up and down
for a bit.


nervuri <nervuri (a) disroot.org>

It might be interesting to also test our e-mail clients, so below you'll
find all zero-width Unicode characters I've gathered so far, placed
between underscores.

First, as a point of reference, here are a few positive-width Unicode
characters:
0020: _ _ | 00E9: _é_ | 03A9: _Ω_ | 5B57: _字_ | 1F407: _🐇_

# Zero-width characters

 061C: _?_

 180E: _?_

 200B: _?_
 200C: _?_
 200D: _?_
 200E: _?_
 200F: _?_

 202A: _?_
 202B: _?_
 202C: _?_
 202D: _?_
 202E: _?_

 2060: _?_
 2061: _?_
 2062: _?_
 2063: _?_
 2064: _?_
 2066: _?_
 2067: _?_
 2068: _?_
 2069: _?_

 206A: _?_
 206B: _?_
 206C: _?_
 206D: _?_
 206E: _?_
 206F: _?_

 FEFF: _?_
 FFF9: _?_
 FFFA: _?_
 FFFB: _?_

E0001: _?_

E0020: _?_
... (E0020-E007F used for invisibly tagging texts by language)
E007F: _?_

This is probably not a complete list.  Contact me if you know of any
others.

Unicode currently contains 143,859 characters (as of version 13.0).

Unicode Character Database:
https://www.unicode.org/Public/UCD/latest/
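
For anyone who wants to check a piece of text against the list above, here is a rough Python sketch. The set is copied from the code points in this message (so it is only as complete as that list), and the function name `find_zero_width` is made up for the example.

```python
import unicodedata

# Code points taken from the list in this message; not guaranteed complete.
ZERO_WIDTH = {
    0x061C, 0x180E,
    *range(0x200B, 0x2010),    # 200B-200F
    *range(0x202A, 0x202F),    # 202A-202E
    *range(0x2060, 0x2065),    # 2060-2064
    *range(0x2066, 0x2070),    # 2066-206F
    0xFEFF, 0xFFF9, 0xFFFA, 0xFFFB,
    0xE0001,
    *range(0xE0020, 0xE0080),  # E0020-E007F (tag characters)
}


def find_zero_width(text):
    """Return (index, 'U+XXXX', name) for every listed character in text."""
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) in ZERO_WIDTH:
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


for hit in find_zero_width("plain\u200btext with a hidden\u2060joiner"):
    print(hit)
```

An empty result only means that none of the listed characters are present; other steganographic methods would go unnoticed.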


Stephane Bortzmeyer <stephane (a) sources.org>

On Sun, Mar 14, 2021 at 04:12:12PM +0000,
 nervuri <nervuri at disroot.org> wrote 
 a message of 34 lines which said:

> This subtle form of tracking would work just as well on Gemini as it
> does on the web.

This is technically interesting but do you suggest that Gemini be
modified in one way or the other, to limit the risks? And, if so, how?

As you note in <gemini://rawtext.club/~nervuri/stega.gmi>, it can
perfectly be done without zero-width characters. A trivial way is to
encode the hidden information in a number of ordinary spaces at the
end of each line.
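
A minimal Python sketch of that trailing-space idea, under the assumption that each line carries one small integer as its count of trailing spaces (the function names are invented for this example):

```python
def embed_trailing(lines, values):
    """Encode each integer in values as that many trailing spaces on a line."""
    if len(values) > len(lines):
        raise ValueError("not enough lines to carry the payload")
    # Lines beyond the payload are stripped, so they read back as 0.
    padded = list(values) + [0] * (len(lines) - len(values))
    return [line.rstrip(" ") + " " * n for line, n in zip(lines, padded)]


def extract_trailing(lines):
    """Count the trailing spaces on each line."""
    return [len(line) - len(line.rstrip(" ")) for line in lines]


marked = embed_trailing(["first line", "second line", "third line"], [3, 0, 2])
print(extract_trailing(marked))
```

Many editors and mail clients strip trailing whitespace, so this channel is more fragile than it looks.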


Sean Conner <sean (a) conman.org>

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Sun, Mar 14, 2021 at 04:12:12PM +0000,
>  nervuri <nervuri at disroot.org> wrote 
>  a message of 34 lines which said:
> 
> > This subtle form of tracking would work just as well on Gemini as it
> > does on the web.
> 
> This is technically interesting but do you suggest that Gemini be
> modified in one way or the other, to limit the risks? And, if so, how?
> 
> As you note in <gemini://rawtext.club/~nervuri/stega.gmi>, it can
> perfectly be done without zero-width characters. A trivial way is to
> encode the hidden information in a number of ordinary spaces at the
> end of each line.

  Or by word choice, or word order, or homographs [1].  There are many ways
to do this.

  -spc

[1]	https://en.wikipedia.org/wiki/IDN_homograph_attack


nervuri <nervuri (a) disroot.org>

On Sun, Mar 14, 2021, Stephane Bortzmeyer wrote:
>This is technically interesting but do you suggest that Gemini be
>modified in one way or the other, to limit the risks? And, if so, how?
>
>As you note in <gemini://rawtext.club/~nervuri/stega.gmi>, it can
>perfectly be done without zero-width characters. A trivial way is to
>encode the hidden information in a number of ordinary spaces at the
>end of each line.

Zero-width characters are *by far* the most potent way to do this - you
can encode any number of bits between any two visible characters.  The
other methods are nowhere near as efficient.

As for ways to limit the risks... that's the hard part.  I don't think
it's a matter of changing Gemini.  The best place to put a solution to
this problem is the OS's clipboard utility.  However, browsers can help
insofar as they can interact with the clipboard, by letting users know
when copied text contains zero-width characters (and perhaps homoglyphs,
etc).  Another approach would be to replace zero-width chars with, say,
emojis (a browser extension actually does this), but it would need to
have an on/off toggle, because these characters can be used for good
reason.
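
To make the idea concrete, here is a small Python sketch of the "warn, or replace with something visible" approach. It uses `<U+XXXX>` markers rather than emojis, covers only a subset of invisible characters chosen for illustration, and the names `reveal` and `warn_if_hidden` are hypothetical; it does not describe how any existing browser or extension works.

```python
import re

# Assumed subset of invisible characters; a real tool would need a fuller list.
INVISIBLE = re.compile(
    "[\u200b-\u200f\u202a-\u202e\u2060-\u2064\u2066-\u206f\ufeff]"
)


def reveal(text, enabled=True):
    """Replace each matched character with a visible <U+XXXX> marker.

    The enabled flag is the on/off toggle: when False, text is untouched.
    """
    if not enabled:
        return text
    return INVISIBLE.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)


def warn_if_hidden(text):
    """Return a warning string if text contains matched characters, else None."""
    count = len(INVISIBLE.findall(text))
    if count == 0:
        return None
    return f"copied text contains {count} invisible character(s)"


copied = "some\u200b copied\u200c text"
print(warn_if_hidden(copied))
print(reveal(copied))
```

A clipboard utility could call something like `warn_if_hidden` on copy and offer `reveal` on demand, leaving the text unchanged by default.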

The guiding principle is that users must be able to see what's going on
within the "plain" text that they're working with.  If developers pick
it up and figure out solutions, that would be great.


Oliver Simmons <oliversimmo (a) gmail.com>

On Sun, 14 Mar 2021 at 16:55, nervuri <nervuri at disroot.org> wrote:
>
> First, as a point of reference, here are a few positive-width Unicode
> characters:
> 0020: _ _ | 00E9: _é_ | 03A9: _Ω_ | 5B57: _字_ | 1F407: __
>

All fine for me!
(Gmail seems to strip emoji in plain-text replies though, which is rather odd.)

>  FFF9: _?_
>  FFFA: _?_
>  FFFB: _?_

These three show as the replacement box for me.
I've never quite understood what the "inter annotation" whatever
characters are - but I think they're some form of control character so
having them display as a box when used incorrectly might be correct.

>
> E0020: _?_
> ... (E0020-E007F used for invisibly tagging texts by language)
> E007F: _?_
>

These *were* used for tagging texts by language, but have been
deprecated in favour of using other non-Unicode metadata for this
purpose.
They are planned to be used in emojis and are (were?) used (but not
widely supported) for country codes/flags with codes longer than 2
characters (3?), such as USA states or counties of England.
Wikipedia has a ~ok description of their history.
=> https://en.wikipedia.org/wiki/Tags_(Unicode_block)

-Oliver Simmons


nervuri <nervuri (a) disroot.org>

On Mon, Mar 15, 2021, Oliver Simmons wrote:
>> E0020: _?_
>> ... (E0020-E007F used for invisibly tagging texts by language)
>> E007F: _?_
>
>These *were* used for tagging texts by language, but have been
>deprecated in favour of using other non-Unicode metadata for this
>purpose.
>They are planned to be used in emojis and are (were?) used (but not
>widely supported) for country codes/flags with codes longer than 2
>characters (3?), such as USA states or counties of England.
>Wikipedia has a ~ok description of their history.
>=> https://en.wikipedia.org/wiki/Tags_(Unicode_block)

Thanks, I replaced "used" with "formerly used".  Wikipedia says "The
release of Emoji 5.0 in March 2017 considers these characters to be
emoji for use as modifiers in special sequences."  I take that to mean
that they will remain zero-width, but will generate emojis when used in
special sequences, as with the flag of England:

🏴󠁧󠁢󠁥󠁮󠁧󠁿
=
🏴<U+E0067><U+E0062><U+E0065><U+E006E><U+E0067><U+E007F>

Unicode keeps getting weirder.
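
Here is a short Python sketch of how that flag sequence is put together, as far as I understand it: a visible base character (U+1F3F4), then one tag character per ASCII letter of the region code (the letter's code point plus 0xE0000), then CANCEL TAG (U+E007F). The function names are made up for the example.

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG, the visible base character
CANCEL_TAG = "\U000E007F"  # CANCEL TAG, terminates the sequence
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at this offset


def subdivision_flag(code):
    """Build the emoji tag sequence for a lowercase code such as 'gbeng'."""
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in code)
    return BLACK_FLAG + tags + CANCEL_TAG


def describe(text):
    """Spell out every code point in <U+XXXX> notation."""
    return "".join(f"<U+{ord(c):04X}>" for c in text)


print(describe(subdivision_flag("gbeng")))
```

Software without support for the sequence shows only the black flag, with the tag characters still present but invisible.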

