When I look at one of my log files, I see the following:
    alex@sibirocobombus:~/communitywiki$ head -n 3 ~/farm/communitywiki.log
    [Sun Jan 22 08:59:11 2017] hypnotoad: utf8 "\xFD" does not map to Unicode at /home/alex/farm/wiki.pl line 3455, <$FILE> line 1.
    [Sun Jan 22 08:59:11 2017] hypnotoad: utf8 "\xFD" does not map to Unicode at /home/alex/farm/wiki.pl line 3455, <$FILE> line 1.
    [Sun Jan 22 08:59:11 2017] hypnotoad: utf8 "\xFD" does not map to Unicode at /home/alex/farm/wiki.pl line 3455, <$FILE> line 1.
What could be the problem? Let’s find the pages containing this byte.
    alex@sibirocobombus:~/communitywiki$ grep -l -P '\xFD' page/*.pg
    page/MultilingualExperiment.pg
    page/ProjectSpaceMultiLingual.pg
Can we reproduce the problem? Apparently not by simply opening these two pages: both load without a warning, so the sequences grep found appear to be valid UTF-8.
    alex@sibirocobombus:~/communitywiki$ WikiDataDir=/home/alex/communitywiki perl -e 'package OddMuse; $RunCGI=0; do "/home/alex/src/oddmuse/wiki.pl"; Init(); ReadIndex(); OpenPage("MultilingualExperiment"); print "$Page{text}\n";'>/dev/null
    alex@sibirocobombus:~/communitywiki$ WikiDataDir=/home/alex/communitywiki perl -e 'package OddMuse; $RunCGI=0; do "/home/alex/src/oddmuse/wiki.pl"; Init(); ReadIndex(); OpenPage("ProjectSpaceMultiLingual"); print "$Page{text}\n";'>/dev/null
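A guess, not something I verified on this machine: in a UTF-8 locale, `grep -P '\xFD'` probably matches the character U+00FD (ý), which is stored as the two bytes C3 BD. That would explain why the pages it lists are valid UTF-8. Here is a minimal sketch for hunting the raw byte instead, assuming a POSIX shell and grep; `raw_fd_files` is a hypothetical helper name.

```shell
# Hypothetical helper: list the files that contain the raw byte 0xFD.
# LC_ALL=C makes grep compare bytes instead of decoding UTF-8 characters,
# and printf '\375' (octal for 0xFD) produces the byte to search for.
raw_fd_files() {
    LC_ALL=C grep -l -- "$(printf '\375')" "$@"
}
```

It would be called as `raw_fd_files page/*.pg`. A page that merely contains ý as a properly encoded character should not show up in the result.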
How about simply opening all the files?
    alex@sibirocobombus:~/communitywiki$ WikiDataDir=/home/alex/communitywiki perl -e 'package OddMuse; $RunCGI=0; do "/home/alex/src/oddmuse/wiki.pl"; Init(); ReadIndex(); OpenPage("ProjectSpaceMultiLingual"); for (@IndexList) { OpenPage($_); print "$OpenPageName\n" };'
    ...
    2007-01-06
    2007-01-09_NewsCwb
    [Wed Jan 25 14:49:21 2017] -e: utf8 "\xFD" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
    [Wed Jan 25 14:49:21 2017] -e: utf8 "\xFD" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
    [Wed Jan 25 14:49:21 2017] -e: utf8 "\xFD" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
    2007-01-13
    2007-01-27_HwoToDo
    ...
    RSS_3.0
    rssanchor
    RssExclude
    [Wed Jan 25 14:49:26 2017] -e: utf8 "\xE8" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
    [Wed Jan 25 14:49:26 2017] -e: utf8 "\xE8" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
    [Wed Jan 25 14:49:26 2017] -e: utf8 "\xE8" does not map to Unicode at /home/alex/src/oddmuse/wiki.pl line 2845.
    RssInterwikiTranslate
    RuleOfOrder
    ...
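Interesting! Each warning is printed while a page is being opened, just before its name, so the culprits are 2007-01-13 and RssInterwikiTranslate, not the two pages grep found. And `less` shows the offending bytes: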
    alex@sibirocobombus:~/communitywiki$ less page/2007-01-13.pg
    ...
    diff-minor: <p><strong>Changed:</strong></p> <div class="old"><p>< Well, I'll be donating another $75 this year, and hoping that my sites get moved to the new machine they'll be buying soon. ;) My previous experience with commercial hosting has been less than satisfactory (at $20 per month), so I'd only make such a move if a group of people convinced me. There's shell access, cron jobs, Perl modules <strong class="changes"><FD><FD><FD></strong> a lengthy list of requirements I have personally...</p></div><p><strong>to</strong></p>
    ...
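So I edited 2007-01-13 and RssInterwikiTranslate, removing anything that looked weird and saving each as a major edit. From now on, I hope to no longer see these warnings.
Alternatively, consider this little script written by CapnDan on the #oddmuse channel on Freenode. It prints the name of each file given on the command line and then reads the whole file through a UTF-8 decoding layer, so a file with invalid bytes should produce the same warning right below its name: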
    #!/usr/bin/env perl
    use utf8;
    foreach (@ARGV) {
      my $file = $_;
      print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n--- $file ---\n";
      utf8::encode($file); # filenames are bytes!
      if (open(my $IN, '<:encoding(UTF-8)', $file)) {
        local $/ = undef; # Read complete files
        my $data=<$IN>;
        close $IN;
      };
    };
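A similar check can probably be done straight from the shell. This is a minimal sketch, assuming `iconv` is installed and exits with a non-zero status on input it cannot decode (glibc's does); `invalid_utf8_files` is a hypothetical name.

```shell
# Hypothetical helper: print the name of every file that is not valid UTF-8.
# iconv converts UTF-8 to UTF-8, which changes nothing but fails on any
# byte sequence it cannot decode; its output and messages are discarded.
invalid_utf8_files() {
    for f in "$@"; do
        iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}
```

Running `invalid_utf8_files page/*.pg` would then list only the pages that need fixing, instead of printing a header for every file.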
#Oddmuse