Oddmuse datafiles are stored in a simple key-value pair format similar to email headers:
key1: some value key2: a very long value continued on the next line (with a TAB).
(Learn more...)
I used to have this straight forward regular expression based code to parse this data:
sub ParseData { my $data = shift; my %result; while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) { my ($key, $value) = ($1, $2); $value =~ s/\n\t/\n/g; $result{$key} = $value; } return %result; }
In 2006, I found the code was really slow and so it was rewritten.
sub ParseData { # called a lot during search, so it was optimized my $data = shift; # by eliminating non-trivial regular expressions my %result; my $end = index($data, ': '); my $key = substr($data, 0, $end); my $start = $end += 2; # skip ': ' while ($end = index($data, "\n", $end) + 1) { # include \n next if substr($data, $end, 1) eq "\t"; # continue after \n\t $result{$key} = substr($data, $start, $end - $start - 1); # strip last \n $start = index($data, ': ', $end); # starting at $end begins the new key last if $start == -1; $key = substr($data, $end, $start - $end); $end = $start += 2; # skip ': ' } $result{$key} .= substr($data, $end, -1); # strip last \n $result{$_} =~ s/\n\t/\n/g foreach (keys %result); return %result; }
Around 2014/2015 I moved Emacs Wiki to a new host and discovered that this implementation was *very* slow for large files. I moved back to the old implementation from before 2006 and it was fast again.
Can you explain this? Did regular expression parsing improve dramatically? The new host runs Perl v5.18.2. The old host runs Perl v5.14.2.
☯
OK, I tested the following implementations. For whatever reasons, the version that seemed so fast in 2006 is a hundred times slower than other approaches!
test.pl:
sub ParseData1 { # called a lot during search, so it was optimized my $data = shift; # by eliminating non-trivial regular expressions my %result; my $end = index($data, ': '); my $key = substr($data, 0, $end); my $start = $end += 2; # skip ': ' while ($end = index($data, "\n", $end) + 1) { # include \n next if substr($data, $end, 1) eq "\t"; # continue after \n\t $result{$key} = substr($data, $start, $end - $start - 1); # strip last \n $start = index($data, ': ', $end); # starting at $end begins the new key last if $start == -1; $key = substr($data, $end, $start - $end); $end = $start += 2; # skip ': ' } $result{$key} .= substr($data, $end, -1); # strip last \n $result{$_} =~ s/\n\t/\n/g foreach (keys %result); return %result; } sub ParseData2 { my $data = shift; my %result; while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/sg) { my ($key, $value) = ($1, $2); $value =~ s/\n\t/\n/g; $result{$key} = $value; } return %result; } sub ParseData3 { my @lines = split(/\n/, shift); my $key, $value, %result; for my $line (@lines) { if ($line =~ /(\S+?): (.*)/) { $result{$key} = $value if $key; ($key, $value) = ($1, $2 . "\n"); } elsif (substr($line, 0, 1) eq "\t") { $value .= substr($line, 1) . "\n"; } else { die "Format error after key $key\n"; } } $result{$key} = $value if $key; return %result; } use Time::HiRes qw/time/; my $file = '/home/alex/emacswiki/page/bookmark+-1.el.pg'; my $n = 1; my $now = time; my $data; if (open(IN, '<:utf8', $file)) { local $/ = undef; # Read complete files $data=<IN>; close IN; } else { die "Cannot open $file: $!\n"; } printf "Reading file: %.4f\n", time - $now; printf "Lines read: %d\n", scalar(() = $data =~ /\n/g); $now = time; for my $i (1 .. $n) { ParseData1($data); } printf "ParseData1 $n times: %.4f\n", time - $now; my $n = 100; $now = time; for my $i (1 .. $n) { ParseData2($data); } printf "ParseData2 $n times: %.4f\n", time - $now; $now = time; for my $i (1 .. $n) { ParseData3($data); } printf "ParseData3 $n times: %.4f\n", time - $now;
The result:
alex@kallobombus:~$ perl test.pl Reading file: 0.0012 Lines read: 23334 ParseData1 1 times: 20.8207 ParseData2 100 times: 15.9124 ParseData3 100 times: 15.6762
#Perl
(Please contact me if you want to remove your comment.)
⁂
A few optimizations were made in Perl 5.16. I don’t know if they affected this though.
– Anonymous 2015-01-21 15:50 UTC