This is how I’m parsing page data currently:
    sub ParseData {
      my $data = shift;
      my %result;
      while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) {
        my ($key, $value) = ($1, $2);
        $value =~ s/\n\t/\n/g;
        $result{$key} = $value;
      }
      return %result;
    }
Page data is in something resembling RFC 822 format: an identifier (the key), a colon, a space, some text (the value), and a newline. If the value itself contains newlines, they are “escaped” by inserting a tab character after each one. Here’s a shortened example:
    ip: 68.174.154.124
    summary: No, backlinks are useful to humans. BillSeitz likes them right on the page.
    diff-major: 1
    text: When a normal forward link points from A to B, the backlink is the automatically generated link back from B to A. On systems where you don't have unidirectional forward links but only bidirectional links, it makes no sense to talk about forward links and back links.
    	
    	From a theoretical point of view, back links are interesting because without back links, it is possible to get to a page without any forward links pointing away.
    	
    	The backlink is your only way to get "back" to the hypertext.
    languages: en
This page has the keys ip, summary, diff-major, text, and languages. The text value has multiple lines, including two empty lines.
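To make the escaping rule concrete, here is a minimal sketch of the reverse direction: turning a hash back into page data. `FormatData` is a hypothetical helper name chosen for this illustration, and the sample hash is made up. The parser is the regex-based `ParseData` from the top of this post, repeated so the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper (name assumed for illustration): serialize a hash
# into "key: value\n" records, escaping embedded newlines as "\n\t".
sub FormatData {
  my %data = @_;
  my $result = '';
  foreach my $key (sort keys %data) {
    my $value = $data{$key};
    $value =~ s/\n/\n\t/g;    # a tab marks a continuation line
    $result .= "$key: $value\n";
  }
  return $result;
}

# The regex-based parser from the post, repeated so the sketch is self-contained.
sub ParseData {
  my $data = shift;
  my %result;
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) {
    my ($key, $value) = ($1, $2);
    $value =~ s/\n\t/\n/g;    # undo the escaping
    $result{$key} = $value;
  }
  return %result;
}

# Made-up sample page with an empty line inside the text value.
my %page = (
  'ip'         => '68.174.154.124',
  'diff-major' => 1,
  'text'       => "First paragraph.\n\nSecond paragraph.",
  'languages'  => 'en',
);

my $raw  = FormatData(%page);
my %back = ParseData($raw);
print $raw;
print "round trip ok\n" if $back{text} eq $page{text};
```

Note that an empty line inside a value ends up as a line holding nothing but a tab, which is why the parser can still tell it apart from the start of the next key.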
When I run this on a snapshot of CommunityWiki on my laptop:
    Alpinobombus:~/Documents/CommunityWiki alex$ time perl time1.pl
    Pages: 2544

    real    1m7.868s
    user    0m55.660s
    sys     0m1.410s
Then I rewrote it:
    sub ParseData {
      my $data = shift;
      my %result;
      my $end = index($data, ': ');
      my $key = substr($data, 0, $end);
      my $start = $end += 2; # skip ': '
      while ($end = index($data, "\n", $end) + 1) { # include \n
        next if substr($data, $end, 1) eq "\t"; # continue after \n\t
        $result{$key} = substr($data, $start, $end - $start - 1); # strip last \n
        $start = $end; # the next key begins right after \n
        $end = index($data, ': ', $start);
        last if $end == -1;
        $key = substr($data, $start, $end - $start);
        $start = $end += 2; # skip ': '
      }
      $result{$key} = substr($data, $start) unless $end; # last value if there is no final \n
      foreach (keys %result) { $result{$_} =~ s/\n\t/\n/g };
      return %result;
    }
Result:
    Alpinobombus:~/Documents/CommunityWiki alex$ time perl time3.pl
    Pages: 2544

    real    0m5.787s
    user    0m4.130s
    sys     0m1.250s
Wow! More than 10 × faster! Wall-clock time dropped from about 68 seconds to under 6.
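One way to sanity-check a rewrite like this is to confirm that both versions return the same hash and then time them on identical input. The sketch below does that with the core Benchmark module. The names `ParseDataRegex` and `ParseDataIndex` and the synthetic sample page are assumptions made for this illustration. The index-based version assumes the data ends with a newline, and inside the loop it uses `$start = $end += 2` so that the value begins right after the colon and space. Numbers from a synthetic page will not match the CommunityWiki timings.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Regex-based version, as in the first listing.
sub ParseDataRegex {
  my $data = shift;
  my %result;
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) {
    my ($key, $value) = ($1, $2);
    $value =~ s/\n\t/\n/g;
    $result{$key} = $value;
  }
  return %result;
}

# index/substr version; assumes the data ends with a newline.
sub ParseDataIndex {
  my $data = shift;
  my %result;
  my $end = index($data, ': ');
  my $key = substr($data, 0, $end);
  my $start = $end += 2;                         # skip ': '
  while ($end = index($data, "\n", $end) + 1) {  # position just after \n
    next if substr($data, $end, 1) eq "\t";      # \n\t continues the value
    $result{$key} = substr($data, $start, $end - $start - 1);
    $start = $end;                               # next key starts here
    $end = index($data, ': ', $start);
    last if $end == -1;
    $key = substr($data, $start, $end - $start);
    $start = $end += 2;                          # skip ': '
  }
  $result{$_} =~ s/\n\t/\n/g foreach keys %result;
  return %result;
}

# Synthetic page (made up for this sketch) with a long multi-line text value.
my $text = join("\n\t", ('Some wiki text on one line.') x 200);
my $page = "ip: 127.0.0.1\nsummary: test\ntext: $text\nlanguages: en\n";

# Both parsers should agree before their timings are worth comparing.
my %a = ParseDataRegex($page);
my %b = ParseDataIndex($page);
my $same = keys(%a) == keys(%b);
$same &&= exists $b{$_} && $a{$_} eq $b{$_} foreach keys %a;
print $same ? "same result\n" : "results differ\n";

# Run each parser for at least one CPU second and print a comparison table.
cmpthese(-1, {
  regex => sub { my %r = ParseDataRegex($page) },
  index => sub { my %r = ParseDataIndex($page) },
});
```

Checking for equal output first matters here because an off-by-two in the `substr` offsets is easy to miss and would not slow the code down at all.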
This is significant because the default search for Oddmuse goes through all the files, opening and searching them. At first I thought that opening the files was the slow part. But as it turns out, *parsing* is what takes so long!
#Oddmuse