Public service announcement: Zero-width characters can be used to embed hidden information inside of plain text. For example, a page can be dynamically generated server-side to include, between every few words:
It might be interesting to also test our e-mail clients, so below you'll find all zero-width Unicode characters I've gathered so far, placed between underscores.

First, as a point of reference, here are a few positive-width Unicode characters:

0020: _ _ | 00E9: _é_ | 03A9: _Ω_ | 5B57: _字_ | 1F407: _🐇_

# Zero-width characters

061C: _?_
180E: _?_
200B: _?_
200C: _?_
200D: _?_
200E: _?_
200F: _?_
202A: _?_
202B: _?_
202C: _?_
202D: _?_
202E: _?_
2060: _?_
2061: _?_
2062: _?_
2063: _?_
2064: _?_
2066: _?_
2067: _?_
2068: _?_
2069: _?_
206A: _?_
206B: _?_
206C: _?_
206D: _?_
206E: _?_
206F: _?_
FEFF: _?_
FFF9: _?_
FFFA: _?_
FFFB: _?_
E0001: _?_
E0020: _?_
... (E0020-E007F used for invisibly tagging texts by language)
E007F: _?_

This is probably not a complete list. Contact me if you know of any others. Unicode currently contains 143,859 characters.

Unicode Character Database:
https://www.unicode.org/Public/UCD/latest/
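One way to look for candidates missing from the list above is to scan the Unicode data that ships with a programming language. The following is a minimal sketch in Python, assuming that the general category Cf ("format") is a reasonable first filter. That assumption is rough: some Cf characters are visible (U+0600 ARABIC NUMBER SIGN, for example), and some invisible characters sit in other categories (U+034F COMBINING GRAPHEME JOINER is Mn). The function name `invisible_candidates` is made up for this example, and the result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def invisible_candidates():
    """Return every code point whose general category is Cf (format).

    This is only a heuristic for "probably zero-width"; see the caveats
    in the text above.
    """
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Cf"
    ]


candidates = invisible_candidates()
print(f"Unicode {unicodedata.unidata_version}: {len(candidates)} Cf code points")

# Show the first few with their names, in the same hex notation as the list.
for cp in candidates[:10]:
    print(f"{cp:04X}: {unicodedata.name(chr(cp), '<unnamed>')}")
```

Comparing the output against the hand-gathered list shows which code points are worth testing next.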
On Sun, Mar 14, 2021 at 04:12:12PM +0000,
 nervuri <nervuri at disroot.org> wrote
 a message of 34 lines which said:

> This subtle form of tracking would work just as well on Gemini as it
> does on the web.

This is technically interesting but do you suggest that Gemini be modified in one way or the other, to limit the risks? And, if so, how?

As you note in <gemini://rawtext.club/~nervuri/stega.gmi>, it can perfectly be done without zero-width characters. A trivial way is to encode the hidden information in a number of ordinary spaces at the end of each line.
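The trailing-space idea can be sketched as follows. This is an illustration only, under assumptions of its own: each line carries a fixed number of bits (3 here, so 0 to 7 trailing spaces), and the function names `embed_trailing_spaces` and `extract_trailing_spaces` are invented for the example.

```python
def embed_trailing_spaces(lines, payload_bits, bits_per_line=3):
    """Hide a bit string by appending 0..(2**bits_per_line - 1) spaces per line."""
    if len(payload_bits) > len(lines) * bits_per_line:
        raise ValueError("payload does not fit in the given lines")
    out = []
    pos = 0
    for line in lines:
        chunk = payload_bits[pos:pos + bits_per_line]
        pos += len(chunk)
        # Pad a short final chunk with zeros; lines past the payload get no spaces.
        count = int(chunk.ljust(bits_per_line, "0"), 2) if chunk else 0
        out.append(line.rstrip(" ") + " " * count)
    return out


def extract_trailing_spaces(lines, bits_per_line=3):
    """Read the bit string back from the number of trailing spaces on each line."""
    bits = []
    for line in lines:
        count = len(line) - len(line.rstrip(" "))
        bits.append(format(count, f"0{bits_per_line}b"))
    return "".join(bits)


page = ["# Welcome", "This is a page.", "=> gemini://example.org/ A link"]
marked = embed_trailing_spaces(page, "101001110")
print(extract_trailing_spaces(marked))
```

The capacity is only a few bits per line, and any client or editor that trims trailing whitespace destroys the mark.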
It was thus said that the Great Stephane Bortzmeyer once stated:
> On Sun, Mar 14, 2021 at 04:12:12PM +0000,
>  nervuri <nervuri at disroot.org> wrote
>  a message of 34 lines which said:
>
> > This subtle form of tracking would work just as well on Gemini as it
> > does on the web.
>
> This is technically interesting but do you suggest that Gemini be
> modified in one way or the other, to limit the risks? And, if so, how?
>
> As you note in <gemini://rawtext.club/~nervuri/stega.gmi>, it can
> perfectly be done without zero-width characters. A trivial way is to
> encode the hidden information in a number of ordinary spaces at the
> end of each line.

Or by word choice, or word order, or homographs [1]. There are many ways to do this.

-spc

[1] https://en.wikipedia.org/wiki/IDN_homograph_attack
On Sun, Mar 14, 2021, Stephane Bortzmeyer wrote:
>This is technically interesting but do you suggest that Gemini be
>modified in one way or the other, to limit the risks? And, if so, how?
>
>As you note in <gemini://rawtext.club/~nervuri/stega.gmi>, it can
>perfectly be done without zero-width characters. A trivial way is to
>encode the hidden information in a number of ordinary spaces at the
>end of each line.

Zero-width characters are *by far* the most potent way to do this - you can encode any number of bits between any two visible characters. The other methods are nowhere near as efficient.

As for ways to limit the risks... that's the hard part. I don't think it's a matter of changing Gemini. The best place to put a solution to this problem is the OS's clipboard utility. However, browsers can help insofar as they can interact with the clipboard, by letting users know when copied text contains zero-width characters (and perhaps homoglyphs, etc).

Another approach would be to replace zero-width chars with, say, emojis (a browser extension actually does this), but it would need to have an on/off toggle, because these characters can be used for good reason.

The guiding principle is that users must be able to see what's going on within the "plain" text that they're working with. If developers pick it up and figure out solutions, that would be great.
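To make both halves of this concrete, here is a minimal sketch of hiding bytes between two visible characters and of making them visible again. The choice of U+200B for a 0 bit and U+200C for a 1 bit is arbitrary, and the names `hide`, `recover` and `reveal` are invented for this example; a real clipboard or browser check would need to cover the whole list of zero-width characters, not just these two.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, stands for bit 0 (arbitrary choice)
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, stands for bit 1 (arbitrary choice)


def hide(cover: str, secret: bytes, after: int = 1) -> str:
    """Insert `secret` as zero-width characters after the first `after` characters."""
    bits = "".join(f"{byte:08b}" for byte in secret)
    payload = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return cover[:after] + payload + cover[after:]


def recover(text: str) -> bytes:
    """Collect the two marker characters and turn them back into bytes."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore an incomplete trailing byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def reveal(text: str) -> str:
    """Replace the marker characters with a visible <U+XXXX> notation."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in (ZERO, ONE) else ch for ch in text
    )


marked = hide("Hello world", b"id7")
print(len("Hello world"), len(marked))  # the marked text is longer but looks the same
print(recover(marked))
print(reveal(hide("ab", b"\x01")))
```

`reveal` corresponds to the "show the user what is in the text" approach; stripping the characters outright would be simpler, but it would also break text that uses them legitimately.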
On Sun, 14 Mar 2021 at 16:55, nervuri <nervuri at disroot.org> wrote:
>
> First, as a point of reference, here are a few positive-width Unicode
> characters:
> 0020: _ _ | 00E9: _é_ | 03A9: _Ω_ | 5B57: _字_ | 1F407: __
>

All fine for me!
(GMail seems to strip emoji in plain-text replies though.. which is rather odd.)

> FFF9: _?_
> FFFA: _?_
> FFFB: _?_

These three show as the replacement box for me.
I've never quite understood what the "inter annotation" whatever characters are - but I think they're some form of control character so having them display as a box when used incorrectly might be correct.

>
> E0020: _?_
> ... (E0020-E007F used for invisibly tagging texts by language)
> E007F: _?_
>

These *were* used for tagging texts by language, but have been deprecated in favour of using other non-Unicode metadata for this purpose.
They are planned to be used in emojis and are (were?) used (but not widely supported) for country codes/flags with codes longer than 2 characters (3?), such as USA states or counties of England.
Wikipedia has a ~ok description of their history.
=> https://en.wikipedia.org/wiki/Tags_(Unicode_block)

-Oliver Simmons
On Mon, Mar 15, 2021, Oliver Simmons wrote:
>> E0020: _?_
>> ... (E0020-E007F used for invisibly tagging texts by language)
>> E007F: _?_
>
>These *were* used for tagging texts by language, but have been
>deprecated in favour of using other non-Unicode metadata for this
>purpose.
>They are planned to be used in emojis and are (were?) used (but not
>widely supported) for country codes/flags with codes longer than 2
>characters (3?), such as USA states or counties of England.
>Wikipedia has a ~ok description of their history.
>=> https://en.wikipedia.org/wiki/Tags_(Unicode_block)

Thanks, I replaced "used" with "formerly used".

Wikipedia says "The release of Emoji 5.0 in March 2017 considers these characters to be emoji for use as modifiers in special sequences."

I take that to mean that they will remain zero-width, but will generate emojis when used in special sequences, as with the flag of England:

(flag of England) = 🏴<U+E0067><U+E0062><U+E0065><U+E006E><U+E0067><U+E007F>

Unicode keeps getting weirder.
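For anyone who wants to test this in their own client, the sequence can be built programmatically. This is a sketch based on my reading of how emoji tag sequences are laid out: U+1F3F4 (waving black flag), then one tag character per letter of the region code, then U+E007F (cancel tag). The tag characters mirror ASCII at an offset of 0xE0000. The helper names `subdivision_flag` and `describe` are made up for this example, and whether the result renders as a flag depends entirely on the font and platform.

```python
TAG_BASE = 0xE0000          # tag characters mirror ASCII at this offset
BLACK_FLAG = "\U0001F3F4"   # WAVING BLACK FLAG, the visible base character
CANCEL_TAG = "\U000E007F"   # CANCEL TAG, terminates the sequence


def subdivision_flag(code: str) -> str:
    """Build a tag sequence for a region code such as "gbeng" (England)."""
    tags = "".join(chr(TAG_BASE + ord(ch)) for ch in code.lower())
    return BLACK_FLAG + tags + CANCEL_TAG


def describe(text: str) -> str:
    """Spell out every code point in <U+XXXX> notation."""
    return "".join(f"<U+{ord(ch):04X}>" for ch in text)


england = subdivision_flag("gbeng")
print(england)            # a flag, or a black flag, depending on the renderer
print(len(england))       # number of code points in the sequence
print(describe(england))
```

Where the sequence is not supported, the tag characters stay invisible and only the black flag shows.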
---