💾 Archived View for zelena.flounder.online › gemlog › 2023-11-12_BOM_Breaks_Flounder.gmi captured on 2024-05-12 at 15:06:33. Gemini links have been rewritten to link to archived content


BOM Breaks Flounder, and How to Fix

I recently broke my Flounder feed and the formatting on a couple of posts.

The issue started after I used LibreOffice Writer to spell-check the files. Afterward, the first line of each file no longer rendered with the correct formatting. The feed could also no longer locate the title, and fell back to treating it as a generic first line. The HTML mirror lacked formatting as well.

After a lot of digging, I found the issue was that LibreOffice Writer saves text files with a Byte Order Mark (BOM) at the very start of the file.

[Wikipedia] Byte order mark

This is an invisible character at the start of a text file that hints to whatever is reading the file how the text is encoded, so it can be read correctly. Despite being part of the Unicode standard, this mark causes problems for many tools that check the start of a text file. For example, a Bash script that starts with a BOM will not have its shebang recognized.

The same thing happens with Flounder. It reads the first character of the file, checking for `#`. It finds the BOM instead and assumes that the first line is not a gemtext header.
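To see the problem for yourself, here is a rough sketch that dumps the first bytes of two scratch files and runs a first-byte check on them. It is only an illustration of the idea, not Flounder's actual code, and `starts_with_hash` is a name I made up for the example.

```shell
#!/bin/sh
# Illustration only: this is not Flounder's actual code.
# starts_with_hash is a hypothetical helper that checks whether
# the first byte of a file is "#" (hex 23).
starts_with_hash() {
	[ "$(head -c 1 "$1" | od -An -tx1 | tr -d ' \n')" = "23" ]
}

# Scratch files: one with the UTF-8 BOM (bytes EF BB BF), one without.
with_bom=$(mktemp)
without_bom=$(mktemp)
printf '\357\273\277# Title\n' > "$with_bom"
printf '# Title\n' > "$without_bom"

# Dump the first four bytes of each file as hex.
head -c 4 "$with_bom" | od -An -tx1      # ef bb bf 23
head -c 4 "$without_bom" | od -An -tx1   # 23 20 54 69

starts_with_hash "$with_bom" || echo "with BOM: header not detected"
starts_with_hash "$without_bom" && echo "without BOM: header detected"

rm -f "$with_bom" "$without_bom"
```

The `#` is still there in the first file, it is just the fourth byte instead of the first.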

How to Fix

To fix this, make sure your gemtext files are UTF-8 encoded WITHOUT a BOM.

Encoding as UTF-8

Use the `file` command to find out what the current encoding of the file is.

file -i "file.gmi"
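If you only want the charset and not the whole MIME line, `file` can print just that part. This is a small sketch assuming your `file` supports the `-b` and `--mime-encoding` options; `charset_of` is a hypothetical helper name used only for this example.

```shell
#!/bin/sh
# charset_of is a hypothetical helper name.
# -b drops the "filename:" prefix; --mime-encoding prints only the charset.
charset_of() {
	file -b --mime-encoding "$1"
}

# Scratch file containing a non-ASCII character encoded as UTF-8.
demo=$(mktemp)
printf '# Caf\303\251\n' > "$demo"

file_encoding=$(charset_of "$demo")
echo "charset: $file_encoding"

rm -f "$demo"
```

The value stored in `file_encoding` is what gets passed to `iconv -f` in the next step.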

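Use the `iconv` command to translate the current encoding into UTF-8. Replace `$file_encoding` with whatever the current encoding is. The output goes to a temporary file first, because redirecting back into the file that is being read would corrupt it.

iconv -f "$file_encoding" -t "UTF-8" < "file.gmi" > "file.gmi.tmp" && mv "file.gmi.tmp" "file.gmi"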

The file should now be in UTF-8.
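Here is a minimal sketch of that conversion wrapped in a function, run against a scratch Latin-1 file so the byte change is visible. `to_utf8` and the `.tmp.$$` suffix are my own choices for the example, not anything standard.

```shell
#!/bin/sh
# to_utf8 is a hypothetical helper name. Usage: to_utf8 FILE SOURCE_ENCODING
# Converts into a temp file and only replaces the original if iconv succeeds.
to_utf8() {
	tmp="$1.tmp.$$"
	if iconv -f "$2" -t UTF-8 < "$1" > "$tmp"; then
		mv -f "$tmp" "$1"
	else
		rm -f "$tmp"
		return 1
	fi
}

# Scratch Latin-1 file where "é" is the single byte E9 (octal 351).
demo=$(mktemp)
printf 'caf\351\n' > "$demo"

to_utf8 "$demo" ISO-8859-1
od -An -tx1 "$demo"   # "é" is now the two bytes c3 a9

rm -f "$demo"
```

If `iconv` fails, for example because the source encoding is wrong, the original file is left untouched.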

Removing the BOM

I chose to use sed to strip the BOM because it lets me limit exactly where the match can happen. It searches for the BOM at the very start of the file only and replaces it with an empty string.

The `1` at the beginning of the substitute command here tells sed to only edit the first line of the file. The `^` at the start of the search tells it to only match at the start of the line. With these two restrictions, it should not be possible to accidentally edit other parts of the file.

Use this command to strip the BOM from the file. It assumes GNU sed: the `\x` byte escapes are a GNU extension, and BSD sed expects an extra argument after `-i`.

sed -i '1s/^\xef\xbb\xbf//' "file.gmi"
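If you do not have GNU sed, the same result is possible with more basic tools. This is a sketch of an alternative, not the method used in the rest of this post: it checks the first three bytes and, only if they are the BOM, copies everything from the fourth byte onward. `strip_bom` is a hypothetical helper name.

```shell
#!/bin/sh
# strip_bom is a hypothetical helper name.
# Removes the UTF-8 BOM (EF BB BF) only if the file actually starts with it.
strip_bom() {
	if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
		tmp="$1.tmp.$$"
		# tail -c +4 outputs the file starting at its fourth byte.
		tail -c +4 "$1" > "$tmp" && mv -f "$tmp" "$1"
	fi
}

# Scratch file that starts with a BOM.
demo=$(mktemp)
printf '\357\273\277# Title\nBody\n' > "$demo"

strip_bom "$demo"
head -c 1 "$demo" | od -An -tx1   # 23, the "#" is the first byte again

rm -f "$demo"
```

Files without a BOM are left alone, so it is safe to run this more than once.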

Sanitizer Shell Script

I wrote a small shell script to automate the process.

Copy this code to a new file, then give it executable permission. Pass it any files you want sanitized as arguments.

#!/bin/sh

# Zelena © 2023
# 
# This program is free software: you can redistribute it and/or modify it under the terms of the Affero GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Affero GNU General Public License for more details.
# 
# You should have received a copy of the Affero GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/agpl.txt>. 


# Clean gemtext files for compatibility with flounder.online.
for file in "$@"; do
	echo "Sanitizing \"$file\"…"
	
	# Find safe temp file location
	tmp_file="${file}.tmp"
	inc=0
	
	while [ -f "$tmp_file" ]; do
		tmp_file="${file}.tmp${inc}"
		inc=$((inc+1))
	done
	
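	# Get original file encoding
	file_encoding=$(file -i "$file" | sed "s/.*charset=\(.*\)$/\1/")
	
	# Convert text to UTF-8.
	# Written to a temp file so the original is never read and written at once.
	# If conversion fails, skip the file instead of overwriting it.
	if ! iconv -f "$file_encoding" -t "UTF-8" < "$file" > "$tmp_file"; then
		echo "Could not convert \"$file\" from \"$file_encoding\", skipping." >&2
		rm -f "$tmp_file"
		continue
	fi
	
	# Strip BOM from start of file.
	sed -i '1s/^\xef\xbb\xbf//' "$tmp_file"
	
	# Replace original with sanitized version.
	mv -f "$tmp_file" "$file"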
done

Example:

./sanitize_gemtext.sh file1.gmi file2.gmi