2017-01-25 Finding Encoding Bugs

When I look at one of my log files, I see the following:

alex@sibirocobombus:~/communitywiki$ head -n 3 ~/farm/communitywiki.log
[Sun Jan 22 08:59:11 2017] hypnotoad: utf8 "\xFD" does not map to Unicode at /home/alex/farm/wiki.pl line 3455, <$FILE> line 1.
[Sun Jan 22 08:59:11 2017] hypnotoad: utf8 "\xFD" does not map to Unicode at /home/alex/farm/wiki.pl line 3455, <$FILE> line 1.
[Sun Jan 22 08:59:11 2017] hypnotoad: utf8 "\xFD" does not map to Unicode at /home/alex/farm/wiki.pl line 3455, <$FILE> line 1.

What could be the problem? Let’s find the pages containing this byte.

alex@sibirocobombus:~/communitywiki$ grep -l -P '\xFD' page/*.pg
page/MultilingualExperiment.pg
page/ProjectSpaceMultiLingual.pg

Can we reproduce the problem? Apparently not: opening either page produces no warning, so whatever grep matched in them appears to be valid UTF-8.

alex@sibirocobombus:~/communitywiki$ WikiDataDir=/home/alex/communitywiki perl -e 'package OddMuse; $RunCGI=0; do "/home/alex/src/oddmuse/wiki.pl"; Init(); ReadIndex(); OpenPage("MultilingualExperiment"); print "$Page{text}\n";'>/dev/null
alex@sibirocobombus:~/communitywiki$ WikiDataDir=/home/alex/communitywiki perl -e 'package OddMuse; $RunCGI=0; do "/home/alex/src/oddmuse/wiki.pl"; Init(); ReadIndex(); OpenPage("ProjectSpaceMultiLingual"); print "$Page{text}\n";'>/dev/null
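
One likely explanation for the false lead: in a UTF-8 locale, `grep -P '\xFD'` probably matches the character U+00FD (ý, stored as the two bytes C3 BD) rather than the lone byte FD that Perl complains about. A minimal sketch of a byte-wise search, assuming a grep that compares plain bytes under `LC_ALL=C`; the function name `find_raw_fd` is made up for this example:

```shell
# List the files that contain the raw byte 0xFD (octal 375).
# LC_ALL=C makes grep compare bytes instead of decoded characters,
# so a valid "ý" (bytes C3 BD) does not match.
find_raw_fd() {
  LC_ALL=C grep -l "$(printf '\375')" "$@"
}
```

It would be called as `find_raw_fd page/*.pg` from the wiki's data directory.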

How about simply opening all the files?

alex@sibirocobombus:~/communitywiki$ WikiDataDir=/home/alex/communitywiki perl -e 'package OddMuse; $RunCGI=0; do "/home/alex/src/oddmuse/wiki.pl"; Init(); ReadIndex(); OpenPage("ProjectSpaceMultiLingual"); for (@IndexList) { OpenPage($_); print "$OpenPageName\n" };'
...
2007-01-06
2007-01-09_NewsCwb
[Wed Jan 25 14:49:21 2017] -e: utf8 "\xFD" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
[Wed Jan 25 14:49:21 2017] -e: utf8 "\xFD" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
[Wed Jan 25 14:49:21 2017] -e: utf8 "\xFD" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
2007-01-13
2007-01-27_HwoToDo
...
RSS_3.0
rssanchor
RssExclude
[Wed Jan 25 14:49:26 2017] -e: utf8 "\xE8" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
[Wed Jan 25 14:49:26 2017] -e: utf8 "\xE8" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
[Wed Jan 25 14:49:26 2017] -e: utf8 "\xE8" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
RssInterwikiTranslate
RuleOfOrder
...

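Interesting! The warnings are emitted while a page is being opened, before its name is printed, so the offending pages are the ones listed right after the warnings: 2007-01-13 and RssInterwikiTranslate. Looking at the first one with `less` shows the raw bytes: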

alex@sibirocobombus:~/communitywiki$ less page/2007-01-13.pg
...
diff-minor: <p><strong>Changed:</strong></p>
        <div class="old"><p>&lt; Well, I'll be donating another $75 this year, and hoping that my sites get moved to the  new machine they'll be buying soon. ;) My previous experience with commercial hosting has been less than satisfactory (at $20 per month), so I'd only make such a move if a group of people convinced me. There's shell access, cron jobs, Perl modules <strong class="changes"><FD><FD><FD></strong> a lengthy list of requirements I have personally...</p></div><p><strong>to</strong></p>
...

Thus, I edited 2007-01-13 and RssInterwikiTranslate, removing anything that looked weird in a major edit. From now on, I hope to no longer see these warnings.
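
Editing through the wiki is the route taken here. For comparison, a hedged sketch of a different route: producing a cleaned copy of a file with `iconv -c`, which drops whatever it cannot decode. It assumes an `iconv` that supports `-c`, the name `strip_invalid_utf8` is made up for this example, and it only writes a copy, since changing page files behind the wiki's back may have side effects not covered here.

```shell
# Write a copy of file $1 to $2, dropping every byte sequence
# that is not valid UTF-8. The original file is left untouched.
strip_invalid_utf8() {
  # Some iconv versions still exit non-zero with -c after skipping
  # bad input, so the exit status is deliberately ignored.
  iconv -c -f UTF-8 -t UTF-8 "$1" > "$2" || true
}
```

Running `strip_invalid_utf8 page/2007-01-13.pg /tmp/2007-01-13.pg` and then comparing the two files with `cmp` would show whether anything was removed.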

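Alternatively, consider this little script written by CapnDan on the #oddmuse channel, Freenode. It reads every file named on the command line as UTF-8, so the same warnings show up right below the header of each broken file: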

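#!/usr/bin/env perl
# Read each file given on the command line as UTF-8. Bytes that cannot be
# decoded trigger "does not map to Unicode" warnings on STDERR.
use strict;
use warnings;
$| = 1; # unbuffered STDOUT keeps headers and warnings in order
foreach my $file (@ARGV) {
  print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n--- $file ---\n";
  # @ARGV already holds bytes, so the name is passed to open as-is;
  # utf8::encode would mangle non-ASCII filenames here
  if (open(my $IN, '<:encoding(UTF-8)', $file)) {
    local $/ = undef; # Read complete files
    my $data = <$IN>; # decoding happens here
    close $IN;
  } else {
    warn "Cannot open $file: $!\n";
  }
}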
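
The same kind of check can be sketched without Perl. This version assumes an `iconv` command that exits with a non-zero status when its input is not valid UTF-8; `list_invalid_utf8` is a name made up for this example, and unreadable files are reported as well because `iconv` fails on them too.

```shell
# Print the name of every file that does not decode as UTF-8.
list_invalid_utf8() {
  for f in "$@"; do
    # Converting UTF-8 to UTF-8 changes nothing; only the exit status matters.
    iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1 || printf '%s\n' "$f"
  done
}
```

Called as `list_invalid_utf8 page/*.pg`, it prints one filename per line and nothing at all if every file decodes cleanly.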

​#Oddmuse