💾 Archived View for bbs.geminispace.org › u › istvan › 15968 captured on 2024-06-16 at 17:42:09. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Comment by 💎 istvan

Re: "How Can We Determine Files Types and Text File Encodings?"

In: s/Gemini

UTF-8 isn't a file type: it's a scheme for encoding text as bytes. The decoded characters are then mapped to glyphs stored in a font.

You can include as many different encodings as you wish in a text file. For example, I can make you a single text file containing UTF-16LE, UTF-8, Japanese Shift-JIS, Japanese EUC, Chinese Big5, Chinese GB2312, and ASCII encoded text.

Anything that opens it will crap its pants and show mostly garbage, because a document is assumed to use a single text encoding. There is nothing in plain text to hint which encoding should be used. At best, a text editor can make a heuristic guess, but non-UTF encodings often still need to be configured manually.
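
A rough Python sketch of that failure mode, for anyone who wants to see it. The sample string and the helper name `try_decode` are made up for illustration; only the standard codecs are used.

```python
from typing import Optional

# Hypothetical sample text; any non-ASCII string would do.
text = "日本語"

# One "file" whose bytes switch encoding part-way through.
mixed = (
    text.encode("utf-8")
    + text.encode("shift_jis")
    + text.encode("utf-16-le")
)

def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# A strict UTF-8 decode rejects the whole file once it reaches the Shift-JIS part.
strict_utf8 = try_decode(mixed, "utf-8")

# A lenient decode keeps going and substitutes U+FFFD for every bad sequence.
lossy_utf8 = mixed.decode("utf-8", errors="replace")

print(strict_utf8)
print(lossy_utf8)
```

The first part comes out readable and the rest is replacement characters, which is about what an editor shows when it assumes one encoding for the whole document.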

💎 istvan

Apr 04 · 2 months ago

5 Later Comments ↓

💎 istvan · Apr 04 at 22:04:

If you are asking about file types, which is a completely different question, there are typically magic bytes of some form near the start of the file that can be used to make a guess.

Ultimately, it's the responsibility of the software to figure this out.

If you replace a PNG's magic bytes with JPEG's, your OS might guess it is a JPEG and pass it to an image editor. The image editor will attempt to parse it as a JPEG, find out the data just doesn't work, and complain that you passed a broken/invalid JPEG.

So the problem is for the final processing stage to solve. MIME types and magic bytes are just shorthand to help guess which software to pass the file to for further processing.
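
A minimal sketch of that kind of guess, assuming only two formats matter. The function name `guess_type` and the placeholder payload are invented for the example; the PNG and JPEG signatures are the standard ones.

```python
# Leading signatures ("magic bytes") for two image formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"

def guess_type(data: bytes) -> str:
    """Guess a MIME type from the leading bytes only; the payload is never parsed."""
    if data.startswith(PNG_MAGIC):
        return "image/png"
    if data.startswith(JPEG_MAGIC):
        return "image/jpeg"
    return "application/octet-stream"

# Hypothetical stand-in for a PNG file: real signature plus a placeholder body.
fake_png = PNG_MAGIC + b"rest of the PNG data"

# Swap the signature for JPEG's and leave the body untouched.
disguised = JPEG_MAGIC + fake_png[len(PNG_MAGIC):]

print(guess_type(fake_png))
print(guess_type(disguised))
```

The guess flips to JPEG even though the body is unchanged. Only a real JPEG parser would notice that the rest of the data makes no sense.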

🐙 norayr · Apr 05 at 17:02:

i guess you know about the 'file' utility.

🚂 MrSVCD · Apr 05 at 22:01:

To make your life a little easier, you can write a utility that detects ASCII and UTF-8 text. The rest you can't automate, since there is no real way to distinguish between different code pages other than having a human check whether the result looks correct.
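
Such a utility could look something like the sketch below. It checks the byte patterns by hand, following the well-formed UTF-8 byte ranges from the Unicode standard, rather than calling a decoder. The names `is_valid_utf8` and `classify` are made up, and this is not a claim about how iconv or any other tool does it internally.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check every byte sequence against the well-formed UTF-8 ranges."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:            # single-byte (ASCII) character
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range for
        # the FIRST continuation byte; the narrower ranges rule out
        # overlong forms, surrogates, and values above U+10FFFF.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong 3-byte forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes UTF-16 surrogates
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # caps at U+10FFFF
        else:
            return False                   # 0x80..0xC1 and 0xF5..0xFF never lead
        if i + need >= n:
            return False                   # sequence is cut off at end of data
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True

def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for everything else."""
    if all(b <= 0x7F for b in data):
        return "ascii"
    if is_valid_utf8(data):
        return "utf-8"
    return "unknown"

print(classify(b"hello"))
print(classify("héllo".encode("utf-8")))
print(classify("日本語".encode("shift_jis")))
```

Note that "utf-8" here only means the bytes are well-formed UTF-8, not that the author meant them that way, and "unknown" is exactly the bucket that still needs a human.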

🚀 blah_blah_blah [OP] · Apr 10 at 00:04:

@mozz

But why do you think a polyglot file is a security issue? I don't see how it would be more insecure than any other untrusted file.

Secure software has to presume that user input is hostile. One form of hostility is the polyglot file, which appears to be one thing while (in addition, under certain circumstances) being something else.

🚀 blah_blah_blah [OP] · Apr 10 at 00:44:

The responses to my post confirm my view that the final determinant of a file's type or encoding is human judgment about whether expected software chokes on the data or not. I guess only I find this an intriguing topic, or an alarming one.

Original Post

🌒 s/Gemini

How Can We Determine Files Types and Text File Encodings? — Determining File Types I have a security question. How can we verify that a UTF-8 file contains only UTF-8 encoded bytes? Running iconv all the time (the preferred solution) isn't appropriate in every situation, and only pushes back the question: how does iconv perform the verification? Other proposals suggest pushing text through UTF-8 language tools, like `read().decode('UTF-8')` in Python, but, again, the /how/ remains...

💬 blah_blah_blah · 7 comments · Apr 04 · 2 months ago