2006-10-07 Page Parsing

This is how I’m parsing page data currently:

sub ParseData {
  my $data = shift;
  my %result;
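  # key, ': ', then the value up to the next line that does not start
  # with a space or tab (or up to the end of the data)
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) {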
    my ($key, $value) = ($1, $2);
    $value =~ s/\n\t/\n/g;
    $result{$key} = $value;
  }
  return %result;
}

Page data is in something resembling RFC 822 format: an identifier (the key), a colon, a space, some text (the value), and a newline. If the value itself contains newlines, they are “escaped” by inserting a tab character after each one, so every continuation line starts with a tab. Here’s a shortened example:

ip: 68.174.154.124
summary: No, backlinks are useful to humans. BillSeitz likes them right on the page.
diff-major: 1
text: When a normal forward link points from A to B, the backlink is the automatically generated link back from B to A.

	On systems where you don't have unidirectional forward links but only bidirectional links, it makes no sense to talk about forward links and back links.

	From a theoretical point of view, back links are interesting because without back links, it is possible to get to a page without any forward links pointing away.  The backlink is your only way to get "back" to the hypertext.
languages: en

This page has the keys ip, summary, diff-major, text, and languages. The text value has multiple lines, including two empty lines; in the stored data every continuation line, even an empty one, begins with a tab.
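To make the format concrete, here is a minimal sketch of the opposite direction: turning a hash back into page data. EncodeData is a hypothetical name chosen for this illustration, not necessarily what Oddmuse calls it, and the keys are sorted only to make the output predictable.

```perl
use strict;
use warnings;

# Hypothetical inverse of ParseData: serialize a hash into page data.
sub EncodeData {
  my %data = @_;
  my $result = '';
  foreach my $key (sort keys %data) {
    my $value = $data{$key};
    $value =~ s/\n/\n\t/g;        # "escape" every newline inside the value
    $result .= "$key: $value\n";  # key, colon, space, value, newline
  }
  return $result;
}

my $page = EncodeData(ip => '68.174.154.124', text => "first\n\nthird");
print $page;
```

The empty second line of the text value comes out as a line holding a single tab, which is what lets a parser tell a continuation line from the start of the next key.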

Here is what I get when I run this parser on a snapshot of CommunityWiki on my laptop:

Alpinobombus:~/Documents/CommunityWiki alex$ time perl time1.pl
Pages: 2544

real    1m7.868s
user    0m55.660s
sys     0m1.410s

Then I rewrote it using index and substr instead of a regular expression:

sub ParseData {
  my $data = shift;
  my %result;
  my $end = index($data, ': ');
  my $key = substr($data, 0, $end);
  my $start = $end += 2; # skip ': '
  while ($end = index($data, "\n", $end) + 1) { # include \n
    next if substr($data, $end, 1) eq "\t";     # continue after \n\t
    $result{$key} = substr($data, $start, $end - $start - 1); # strip last \n
    $start = $end;
    $end = index($data, ': ', $start);
    last if $end == -1;
    $key = substr($data, $start, $end - $start);
    $start = $end += 2; # skip ': '
  }
  $result{$key} = substr($data, $start) unless $end; # last value if data lacks a final \n
  foreach (keys %result) { $result{$_} =~ s/\n\t/\n/g };
  return %result;
}

Result:

Alpinobombus:~/Documents/CommunityWiki alex$ time perl time3.pl
Pages: 2544

real    0m5.787s
user    0m4.130s
sys     0m1.250s

Wow! More than 10× faster: wall-clock time dropped from about 68 seconds to under 6.
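For anyone who wants to repeat the comparison without a wiki snapshot, here is a rough sketch using the core Benchmark module. ParseRegex and ParseIndex are names I made up for the two approaches, and the test page is synthetic, so the numbers will not match the timings above. The index-based version here assumes the data ends with a newline.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Regex-based approach.
sub ParseRegex {
  my $data = shift;
  my %result;
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) {
    my ($key, $value) = ($1, $2);
    $value =~ s/\n\t/\n/g;
    $result{$key} = $value;
  }
  return %result;
}

# index/substr-based approach; assumes $data ends with "\n".
sub ParseIndex {
  my $data = shift;
  my %result;
  my $end = index($data, ': ');
  my $key = substr($data, 0, $end);
  my $start = $end += 2;                         # skip ': '
  while ($end = index($data, "\n", $end) + 1) {  # position after next \n
    next if substr($data, $end, 1) eq "\t";      # continuation line
    $result{$key} = substr($data, $start, $end - $start - 1);
    $start = $end;
    $end = index($data, ': ', $start);
    last if $end == -1;
    $key = substr($data, $start, $end - $start);
    $start = $end += 2;                          # skip ': '
  }
  foreach (keys %result) { $result{$_} =~ s/\n\t/\n/g }
  return %result;
}

# Synthetic page: three short values and one long multi-line text value.
my $page = "ip: 127.0.0.1\n"
         . "summary: test\n"
         . "text: " . join("\n\t", ("some line of wiki text") x 200) . "\n"
         . "languages: en\n";

# Sanity check before timing: both parsers should agree.
my %by_regex = ParseRegex($page);
my %by_index = ParseIndex($page);
foreach my $key (keys %by_regex) {
  die "mismatch for $key" unless $by_regex{$key} eq $by_index{$key};
}

# Run each parser for at least one CPU second and print a comparison table.
cmpthese(-1, {
  regex => sub { my %r = ParseRegex($page) },
  index => sub { my %r = ParseIndex($page) },
});
```

Checking that both parsers return the same hash before timing them guards against benchmarking a fast but wrong implementation.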

This is significant because the default search for Oddmuse goes through all the files, opening and searching each one. At first I thought that opening the files was what took so long. As it turns out, *parsing* was the bottleneck!

#Oddmuse