2015-01-13 Handwritten Optimization

Oddmuse datafiles are stored in a simple key-value pair format similar to email headers:

key1: some value
key2: a very long value
        continued on the next line
        (with a TAB).

(Learn more...)

Learn more

I used to have this straight forward regular expression based code to parse this data:

sub ParseData {
  my $data = shift;
  my %result;
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) {
    my ($key, $value) = ($1, $2);
    $value =~ s/\n\t/\n/g;
    $result{$key} = $value;
  }
  return %result;
}

In 2006, I found the code was really slow and so it was rewritten.

sub ParseData {      # called a lot during search, so it was optimized
  my $data = shift;   # by eliminating non-trivial regular expressions
  my %result;
  my $end = index($data, ': ');
  my $key = substr($data, 0, $end);
  my $start = $end += 2;           # skip ': '
  while ($end = index($data, "\n", $end) + 1) { # include \n
    next if substr($data, $end, 1) eq "\t";     # continue after \n\t
    $result{$key} = substr($data, $start, $end - $start - 1); # strip last \n
    $start = index($data, ': ', $end); # starting at $end begins the new key
    last if $start == -1;
    $key = substr($data, $end, $start - $end);
    $end = $start += 2;   # skip ': '
  }
  $result{$key} .= substr($data, $end, -1); # strip last \n
  $result{$_} =~ s/\n\t/\n/g foreach (keys %result);
  return %result;
}

Around 2014/2015 I moved Emacs Wiki to a new host and discovered that this implementation was *very* slow for large files. I moved back to the old implementation from before 2006 and it was fast again.

Can you explain this? Did regular expression parsing improve dramatically? The new host runs Perl v5.18.2. The old host runs Perl v5.14.2.

☯

OK, I tested the following implementations. For whatever reasons, the version that seemed so fast in 2006 is a hundred times slower than other approaches!

test.pl:

sub ParseData1 {      # called a lot during search, so it was optimized
  my $data = shift;   # by eliminating non-trivial regular expressions
  my %result;
  my $end = index($data, ': ');
  my $key = substr($data, 0, $end);
  my $start = $end += 2;           # skip ': '
  while ($end = index($data, "\n", $end) + 1) { # include \n
    next if substr($data, $end, 1) eq "\t";     # continue after \n\t
    $result{$key} = substr($data, $start, $end - $start - 1); # strip last \n
    $start = index($data, ': ', $end); # starting at $end begins the new key
    last if $start == -1;
    $key = substr($data, $end, $start - $end);
    $end = $start += 2;   # skip ': '
  }
  $result{$key} .= substr($data, $end, -1); # strip last \n
  $result{$_} =~ s/\n\t/\n/g foreach (keys %result);
  return %result;
}

sub ParseData2 {
  my $data = shift;
  my %result;
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) {
    my ($key, $value) = ($1, $2);
    $value =~ s/\n\t/\n/g;
    $result{$key} = $value;
  }
  return %result;
}

sub ParseData3 {
  my @lines = split(/\n/, shift);
  my $key, $value, %result;
  for my $line (@lines) {
    if ($line =~ /(\S+?): (.*)/) {
      $result{$key} = $value if $key;
      ($key, $value) = ($1, $2 . "\n");
    } elsif (substr($line, 0, 1) eq "\t") {
      $value .= substr($line, 1) . "\n";
    } else {
      die "Format error after key $key\n";
    }
  }
  $result{$key} = $value if $key;
  return %result;
}

use Time::HiRes qw/time/;

my $file = '/home/alex/emacswiki/page/bookmark+-1.el.pg';
my $n = 1;
my $now = time;
my $data;

if (open(IN, '<:utf8', $file)) {
  local $/ = undef; # Read complete files
  $data=<IN>;
  close IN;
} else {
  die "Cannot open $file: $!\n";
}

printf "Reading file: %.4f\n", time - $now;
printf "Lines read: %d\n", scalar(() = $data =~ /\n/g);

$now = time;
for my $i (1 .. $n) {
  ParseData1($data);
}
printf "ParseData1 $n times: %.4f\n", time - $now;

my $n = 100;

$now = time;
for my $i (1 .. $n) {
  ParseData2($data);
}
printf "ParseData2 $n times: %.4f\n", time - $now;

$now = time;
for my $i (1 .. $n) {
  ParseData3($data);
}
printf "ParseData3 $n times: %.4f\n", time - $now;

The result:

alex@kallobombus:~$ perl test.pl
Reading file: 0.0012
Lines read: 23334
ParseData1 1 times: 20.8207
ParseData2 100 times: 15.9124
ParseData3 100 times: 15.6762

#Perl

Comments

(Please contact me if you want to remove your comment.)

⁂

A few optimizations were made in Perl 5.16. I don’t know if they affected this though.

– Anonymous 2015-01-21 15:50 UTC