Tracking via pasted text

Plain text steganography and how it can be used against you

Zero-width characters can be used to embed hidden information inside plain text. This is of primary concern to journalists and their sources, but it can affect anyone browsing the Internet. For example, a page can be dynamically generated server-side to include, between every few words, an invisible identifier tied to the visitor (such as a username or IP address).

By copying text from the page and pasting it somewhere public, you would be revealing this information to anyone who knew how to look for it. Details and demo in this article:

Be careful what you copy: Invisibly inserting usernames into text with Zero-Width Characters (Tim Ross, 2018)

To check if your browser displays zero-width characters, open:

Zero-width character test (Gemini)

Zero-width character test (Gopher)

Zero-width character test (Web)

Other plain text watermarking techniques / canary traps are explained on Zach Aysan's blog:

Zero-Width Characters: Invisibly fingerprinting text (2017)

Text Fingerprinting Update: Stories and ideas from readers (2018)

To fingerprint text, server software would only need to encode a hidden number inside it, repeated between every few words, matching a log entry that contains information about the visitor (username, IP address, cookie, browser details, referrer link, timestamp). To make pasted excerpts easy to find online, the software could similarly hide a unique page-specific identifier within the text, which can later be entered into search engines.
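As a rough illustration of the idea, here is a minimal Python sketch. The choice of two zero-width characters as bits, the 16-bit width, and the function names are all assumptions made for this example; real fingerprinting software may work quite differently.

```python
import re

# Assumed encoding for this sketch: two invisible characters act as bits.
ZERO = "\u200b"  # ZERO WIDTH SPACE      -> bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER -> bit 1
BITS = 16        # hypothetical width of the hidden number


def encode_id(visitor_id: int, bits: int = BITS) -> str:
    """Turn a number into a run of invisible characters."""
    if not 0 <= visitor_id < 2 ** bits:
        raise ValueError("visitor_id does not fit in the chosen width")
    return "".join(ONE if b == "1" else ZERO for b in format(visitor_id, f"0{bits}b"))


def fingerprint(text: str, visitor_id: int, every: int = 3) -> str:
    """Append the invisible mark to every `every`-th word."""
    mark = encode_id(visitor_id)
    words = text.split(" ")
    for i in range(every - 1, len(words), every):
        words[i] += mark
    return " ".join(words)


def extract_ids(text: str, bits: int = BITS) -> list:
    """Recover every complete hidden number found in a pasted excerpt."""
    pattern = "[" + ZERO + ONE + "]{" + str(bits) + "}"
    return [
        int("".join("1" if c == ONE else "0" for c in run), 2)
        for run in re.findall(pattern, text)
    ]
```

Because the mark is repeated, even a short excerpt that contains one complete run is enough to look up the matching log entry.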

To achieve this, aside from zero-width characters, the software could use some of the other techniques described by Zach Aysan: "differences in dashes (en, em, and hyphens), quotes (straight vs curly), word spelling (color vs colour), and the number of spaces after sentence endings", different types of spaces, homoglyphs (a vs а), diacritic forms (ț vs ţ), ligatures (ﬁ vs fi, Ⅳ vs IV, ½ vs 1/2), as well as inserting hard-to-detect typos into the text.
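NFKC normalisation (see Further reading) folds some of these variants but not others. The sketch below uses Python's standard unicodedata module to show which characters survive; the helper names fold and survivors are made up for this example.

```python
import unicodedata


def fold(text: str) -> str:
    """Apply NFKC normalisation, which replaces compatibility characters."""
    return unicodedata.normalize("NFKC", text)


def survivors(text: str) -> list:
    """List the non-ASCII characters that are still present after NFKC."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in fold(text)
        if ord(ch) > 127
    ]
```

With this sketch, the ligature ﬁ and the Roman numeral Ⅳ are folded to plain ASCII letters, while a Cyrillic а, the two t-with-diacritic forms, and the zero-width space all pass through unchanged. Normalisation alone is therefore only a partial defence.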

Solutions

A partial solution is to convert the text to ASCII, if the language allows. There are also tools such as:

Less (CLI)

- displays zero-width characters when used with the "-U" option.

SafeText (CLI)

- also detects some homoglyphs. It started out well, but development has stopped; in its current state, there are many problematic characters that it does not detect - see issues: https://github.com/DavidJacobson/SafeText/issues

Several browser extensions that detect *a few* zero-width characters.

However, they don't protect against the more sophisticated versions of this hack. A more complete tool would have to include not just a list of forbidden/allowed characters, but also a spellchecker and a way to detect trailing whitespace: an x-ray mode that might be triggered when dubious text is detected in the clipboard. The problem is not limited to text either, since image-based steganography can be used in a similar way. A technical solution might never be perfect, but it could cover the vast majority of cases.
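A very small part of such an x-ray mode could look like the Python sketch below. It assumes a strict allow-list of printable ASCII plus newline and tab, which only suits text in languages that fit in ASCII; the function name xray and the marker format are invented for this example, and spellchecking is not attempted.

```python
import re


def xray(text: str) -> str:
    """Return the text with every character outside the allow-list made visible."""
    out = []
    for ch in text:
        if ch in "\n\t" or " " <= ch <= "~":
            out.append(ch)  # printable ASCII, newline or tab: keep as-is
        else:
            out.append(f"<U+{ord(ch):04X}>")  # anything else: show its code point
    visible = "".join(out)
    # Flag runs of spaces or tabs at the end of each line.
    return re.sub(
        r"[ \t]+$",
        lambda m: f"<{len(m.group())} trailing>",
        visible,
        flags=re.MULTILINE,
    )
```

For instance, a zero-width space between "a" and "b" would show up as a<U+200B>b, and a Cyrillic а hidden inside a Latin word would be printed as <U+0430>.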

An almost perfect non-technical solution is to retype the text. You can also try downloading the page twice from different accounts / IP addresses and diff the two versions, or check if the hashes match. Another solution is to take a screenshot of the text and run it through OCR software.
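The comparison of two downloads can be sketched in Python with the standard hashlib and difflib modules. The function names are hypothetical, and a mismatch is only a hint: pages can also differ for harmless reasons such as timestamps or rotating content.

```python
import difflib
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 of a download; two people can compare digests without sharing the text."""
    return hashlib.sha256(data).hexdigest()


def escape(s: str) -> str:
    """Make invisible or non-ASCII characters readable, e.g. as \\u200b."""
    return s.encode("unicode_escape").decode("ascii")


def differing_spans(a: str, b: str) -> list:
    """Return (offset in a, escaped span of a, escaped span of b) for each difference."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (i1, escape(a[i1:i2]), escape(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

If the digests match, the two copies are byte-for-byte identical, which rules out per-visitor marks but not a page-specific identifier shared by every visitor.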

Tools for text steganography

StegCloak

Spam Mimic (see Encode -> Alternate encodings)

zwfp

SNOW

WORDLISTTEXTSTEGANOGRAPHY & EMAILSTEGANO

inØsight — Zero Width Obfuscation (extension for Firefox and Chromium)

Zero Width Shortener - Shorten URLs using invisible spaces

Unicode character search

Further reading

Text steganography

Text based steganography (Robert Lockwood and Kevin Curran, 2017)

Text Steganography with Multi level Shielding (Sharon Rose Govada et al., 2012)

Any efficient text-based steganographic schemes? (crypto.stackexchange.com)

Steganography to hide text within text (security.stackexchange.com)

Chaffing and winnowing (Wikipedia)

Control characters

Zero-width space (Wikipedia)

Article explaining the role of a few zero-width characters

Partial list of Unicode spaces

Unicode control characters (Wikipedia)

Tags (Unicode block) (Wikipedia)

Unicode Character Database

Homoglyphs

Homoglyph (Wikipedia)

Confusable detection

confusables.txt

NFKC normalisation

"Apply NFKC normalisation" - SafeText issue

Unicode Normalization FAQ

Unicode Normalization Forms

Unicode security considerations

Unicode Security Issues FAQ

Unicode Security Considerations - Technical Report

_____________________

Published: 2021-02-20

Updated: 2021-05-10

Source (contributions welcome)

License: CC-BY-SA