I maintain a wiki engine called Oddmuse. It’s the software used to run my blog, for example. It is written in an older scripting language called *Perl*. Perl predates Unicode, which is why the use of UTF-8 or UTF-16 is not mandated. That, in turn, means that strings are usually treated as bytes, and a UTF-8 encoded character is only visible as two bytes.
Consider this regular expression to match WikiWords: `[A-Z][a-z]+[A-Z][a-z]+`
How would you extend it to parse ÖlPlattform!?
Assume the following Perl code was written in a source file that was UTF-8 encoded:
$str = "OelPlattform"; print "OelPlattform YES\n" if $str =~ /[[:upper:]][[:lower:]]+[[:upper:]]\w+/; $str = "ÖlPlattform"; print "ÖlPlattform YES\n" if $str =~ /[[:upper:]][[:lower:]]+[[:upper:]]\w+/;
This will just print `OelPlattform YES` because what looks like “ÖlPlattform” actually starts with the bytes C3 96, and C3 is not an upper case letter. It’s actually unclear what it is. In a Latin-1 environment the C3 would print as Ã, the dreaded sign of encoding errors!
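This is also a preview of the fix described below: once Perl is told that the source is UTF-8, the Ö becomes a single character and the character class matches. A minimal sketch, not Oddmuse code:

use utf8;                   # the source file itself is UTF-8 encoded
binmode(STDOUT, ':utf8');   # avoid "wide character in print" warnings

$str = "ÖlPlattform";
print "ÖlPlattform YES\n"
  if $str =~ /[[:upper:]][[:lower:]]+[[:upper:]]\w+/; # Ö is now one upper case letter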
I wanted to keep Oddmuse encoding agnostic. Users could specify a different encoding which would be served together with the page HTML such that they could have wikis using GB 2312. This is why Oddmuse contained the following line and similar code:
# 1. we treat input and output as bytes
eval { local $SIG{__DIE__}; binmode(STDOUT, ":raw"); };
This resulted in problems when some packages I was using did in fact produce UTF-8 and so I had to use code as follows:
eval { local $SIG{__DIE__}; binmode(STDOUT, ":utf8"); }
  if $HttpCharset eq 'UTF-8';
print RSS($3 ? $3 : 15, split(/\s+/, UnquoteHtml($4)));
eval { local $SIG{__DIE__}; binmode(STDOUT, ":raw"); };
I’m not sure why I surrounded it all with an eval; I assume it was to support an older version of Perl.
Ok, so I wanted to get rid of all that.
The solution seems deceptively simple: add `use utf8;` to the source files and open all files using the UTF-8 encoding layer.
When printing UTF-8 to STDOUT, you need to tell Perl that STDOUT can in fact handle multi-byte characters. Since the HTML produced is UTF-8 encoded, I know that this is true. If you don’t, you’ll get “wide character in print” warnings.
binmode(STDOUT, ':utf8');
You need to be careful with all input and output, however.
open(F, '<:encoding(UTF-8)', $RcFile);
The same is true for output:
open(OUT, '>:encoding(UTF-8)', $file) or ReportError(Ts('Cannot write %s', $file) . ": $!", '500 INTERNAL SERVER ERROR');
Oddmuse also offers the ability to *include* other pages (Transclusion) and to produce feeds. This can be a problem: the default page processing is to parse the raw text and start printing HTML as soon as possible, because I have always felt that it was more expedient to start printing the top of the page while the rest was still being parsed. What happens when I don’t want to do this, e.g. when I’m in the middle of building the RSS feed?
The solution I had been using was to redirect STDOUT to a variable. Perl calls this a “memory file.” The problem is the encoding of this memory file:
Here’s what I had to write:
open(STDOUT, '>', \$page) or die "Can't open memory file: $!";
binmode(STDOUT, ":utf8");
PrintPageHtml();
utf8::decode($page);
I think this works because `binmode` tells all the `print` instructions that it’s ok to print multi-byte characters and `utf8::decode` makes sure that all those bytes are in fact decoded back to Perl’s internal representation.
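Here is the same dance as a stand-alone sketch (not the actual Oddmuse code), using a lexical handle instead of STDOUT:

use utf8;                     # the source file itself is UTF-8
my $page = '';
open(my $mem, '>', \$page) or die "Can't open memory file: $!";
binmode($mem, ':utf8');       # printing wide characters is now allowed
print $mem "ÖlPlattform";     # ends up in $page as UTF-8 bytes
close($mem);
utf8::decode($page);          # turn those bytes back into characters
print length($page), "\n";    # 11 characters, not 12 bytes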
Then I discovered that I needed to look at the *bytes* if I wanted to URL-encode strings:
utf8::encode($str); # turn to byte string
my @letters = split(//, $str);
my %safe = map {$_ => 1} ('a' .. 'z', 'A' .. 'Z', '0' .. '9',
			  '-', '_', '.', '!', '~', '*', "'", '(', ')', '#');
foreach my $letter (@letters) {
  $letter = sprintf("%%%02x", ord($letter)) unless $safe{$letter};
}
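To see why the string has to be turned into bytes first, consider a sketch of what the same loop does to a single Ö: without the `utf8::encode` call, `ord` would see one character with code 0xD6 and produce the Latin-1 escape `%d6` instead of the UTF-8 escape `%c3%96`.

use utf8;
$str = "Ö";
utf8::encode($str);               # now two bytes, C3 96
my @letters = split(//, $str);
foreach my $letter (@letters) {
  $letter = sprintf("%%%02x", ord($letter));
}
print join('', @letters), "\n";   # prints %c3%96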
Now that I’m looking at the above I wonder what sort of bugs I’m introducing with the inverse operation that I haven’t changed:
$str =~ s/%([0-9a-f][0-9a-f])/chr(hex($1))/ge;
I feel that this requires a call to `utf8::decode` when done! Strangely enough none of my tests have picked this up. ❓
(Actually I think I know why I haven’t stumbled across this problem: I only use the function to decode the Cookie, and all the functions accessing the cookie go through an extra encoding/decoding step that would not be necessary if I had fixed the URL-decoding function. 💡)
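For the record, this is what I assume the fixed decoding code would have to look like (untested, and not what Oddmuse currently does):

$str =~ s/%([0-9a-f][0-9a-f])/chr(hex($1))/ge; # %c3%96 back to the bytes C3 96
utf8::decode($str);                            # and the bytes back to the character Ö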
Another problem I stumbled upon: directories. Directory names often ended up Latin-1 encoded on disk.
utf8::encode($newdir);
return if -d $newdir;
mkdir($newdir, 0775)
  or ReportError(Ts('Cannot create %s', $newdir) . ": $!", '500 INTERNAL SERVER ERROR');
The reason I didn’t discover that I had the same problem with filenames was that I’m using a compatibility layer on my Mac, where I do my development. The Mac uses UTF-8 NFD instead of UTF-8 NFC, which is the standard on the web. Thus, if you take the bytes encoding a filename from the web and create the file, or if you go the other way, you have a problem. I store the index of all pages in a file. When a new page is created, I get the page name (NFC encoded) from the web and store it in that file. When I read the file, the content contains the NFC bytes, and with these I cannot find the NFD encoded file (because the filesystem changed the encoding as it wrote the file). I hated it so much. Thus, the Mac compatibility layer does an extra encoding and decoding to get everything from NFD to NFC, and thereby protected me from this error.
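To make the difference concrete, here is a small sketch using the standard Unicode::Normalize module; this is just an illustration, not the compatibility layer itself:

use utf8;
use Unicode::Normalize qw(NFC NFD);
my $nfc = NFC("ÖlPlattform");  # Ö as one precomposed code point (what the web sends)
my $nfd = NFD("ÖlPlattform");  # Ö as O plus combining diaeresis (what the Mac filesystem stores)
print "different\n" if $nfc ne $nfd;             # the two strings are not equal
print length($nfc), " vs ", length($nfd), "\n";  # 11 vs 12 characters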
As soon as I installed it on my sites, however—they all use Debian and ext3 filesystems, I think—I had a problem.
The necessary fix:
utf8::encode($file);
if (open(IN, '<:encoding(UTF-8)', $file)) {
  local $/ = undef; # Read complete files
  my $data=<IN>;
  close IN;
  return (1, $data);
}
And:
utf8::encode($file);
open(OUT, '>:encoding(UTF-8)', $file)
  or ReportError(Ts('Cannot write %s', $file) . ": $!", '500 INTERNAL SERVER ERROR');
print OUT $string;
close(OUT);
Another stumbling block was that the non-breaking space was no longer just a byte sequence like any other, namely C2 A0. Perl suddenly recognized it as *whitespace*! This is a problem if a path contains non-breaking spaces. Ordinary spaces weren’t an issue, because the old code translated spaces to underscore characters. But wherever I had been “smart” and used a non-breaking space, I now had a problem: the `glob` function splits its argument on *whitespace*. Where there was one pattern, I now had two broken patterns!
Here’s an example:
glob(GetKeepDir(shift) . '/*.kp'); # files such as 1.kp, 2.kp, etc.
Here’s another example:
foreach (glob("$PageDir/*/*.pg $PageDir/*/.*.pg"))
The solution is to `use File::Glob ':glob'` and replace every occurrence of `glob` with `bsd_glob`. Wow, my application was very much unsuited to filenames containing whitespace and I hadn’t even realized it!
foreach (bsd_glob("$PageDir/*/*.pg"), bsd_glob("$PageDir/*/.*.pg"))
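For what it’s worth, here is a sketch of the difference, using a made-up directory name with a plain space in it; the real culprit in my case was the non-breaking space:

use File::Glob ':glob';                # provides bsd_glob

my $dir = 'keep dir';                  # hypothetical path containing whitespace
my @broken = CORE::glob("$dir/*.kp");  # the core glob splits this into two patterns
my @fixed  = bsd_glob("$dir/*.kp");    # bsd_glob treats it as a single pattern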
Remember the regular expression to detect wiki words I used at the top? This was the actual regular expression I had been using:
$WikiWord = '[A-Z]+[a-z\x80-\xff]+[A-Z][A-Za-z\x80-\xff]*';
Essentially, wiki words only worked if the first letter was an ASCII upper case letter.
At first, I switched this to the following regular expression (trying to minimize changes):
$WikiWord = '[A-Z]+[a-z\x{0080}-\x{ffff}]+[A-Z][A-Za-z\x{0080}-\x{ffff}]*';
It turns out that Perl 5.8 chokes on this regular expression, however: FFFE and FFFF are noncharacters. I had to change the regular expression.
$WikiWord = '[A-Z]+[a-z\x{0080}-\x{fffd}]+[A-Z][A-Za-z\x{0080}-\x{fffd}]*'; # exclude noncharacters FFFE and FFFF
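A quick sanity check of where that leaves things (and presumably why “fix that regular expression” is still on the to-do list below): the first letter of a wiki word still has to be an ASCII upper case letter.

use utf8;
binmode(STDOUT, ':utf8');
$WikiWord = '[A-Z]+[a-z\x{0080}-\x{fffd}]+[A-Z][A-Za-z\x{0080}-\x{fffd}]*';
print "OelPlattform YES\n" if "OelPlattform" =~ /$WikiWord/; # matches
print "ÖlPlattform YES\n" if "ÖlPlattform" =~ /$WikiWord/;   # no match: Ö is not in A-Z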
I’m sure this list isn’t complete but I’m sure it’s long enough to illustrate my main point: this is painful. It’s HTML quoting all over again.
#Perl #Software
(Please contact me if you want to remove your comment.)
⁂
Things to do for Oddmuse:
1. fix that regular expression
2. fix that cookie encoding issue
Can it recognize new WikiWords such as “ÖlPlattform” thanks to changing the regular expression to match them? As far as I understand it, does it only need changes to those regexes, or also changes to the way strings (from URLs, for instance) and files are read (and written)?
– JuanmaMP 2012-07-22 01:16 UTC
---
Actually, I think that a simple change of the regular expressions is all that is needed. 😄
– Alex Schroeder 2012-07-22 05:11 UTC