I was looking at an Oddmuse wiki with 3741 pages and ran `time perl wiki.pl search=fnord raw=1` a few times. It took 1.502 seconds on the second run (the first run is slower because the disk cache is still cold). Then I rewrote the SearchString code to compile the regular expressions once, outside the loop. The second search now took about 1.480 seconds. Talk about minimal gain!
This is disappointing. RadomirDopieralski pointed me to *Regular Expression Matching Can Be Simple And Fast* by Russ Cox. Maybe that explains why Daniel MacKay’s fix to Oddmuse search, which uses `grep` as a filter, is so much faster than the standard Oddmuse search.
```perl
sub SearchTitleAndBody {
  my ($query, $func, @args) = @_;
  # skip null entries, compile regular expressions
  my @regexps = map { qr/$_/i } grep /./, $query =~ /\"([^\"]+)\"|(\S+)/g;
  my @found;
  my $lang = GetParam('lang', '');
  foreach my $id (AllPagesList()) {
    my $name = NormalToFree($id);
    my ($text) = PageIsUploadedFile($id); # set to mime-type if this is an uploaded file
    if (not $text) { # not uploaded file, therefore allow searching of page body
      OpenPage($id); # this opens a page twice if it is not uploaded, but that's ok
      if ($lang) {
        my @languages = split(/,/, $Page{languages});
        next if (@languages and not grep(/$lang/, @languages));
      }
      $text = $Page{text};
    }
    if (SearchString($name . "\n" . $text, @regexps)) {
      push(@found, $id);
      &$func($id, @args) if $func;
    }
  }
  return @found;
}

sub SearchString {
  my $data = shift;
  foreach my $regexp (@_) {
    return 0 unless ($data =~ /$regexp/);
  }
  return 1;
}
```
I think Radomir is right when he says that Perl probably caches all the regular expressions anyway, so there is very little time wasted in the original loop.
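This would also explain the numbers: Perl only recompiles a pattern like `/$str/i` if the interpolated string has changed since the last match. Here is a minimal sketch to check the claim, using the core Benchmark module; the test data and names are invented and not part of the wiki code:

```perl
use Benchmark qw(cmpthese);

my $data = "some page text with fnord in it\n" x 1000;
my $str  = 'fnord';
my $qr   = qr/$str/i;  # compiled once, up front

# If Perl caches the compiled pattern, the two should come out nearly tied.
cmpthese(-1, {
    precompiled  => sub { $data =~ /$qr/ },
    interpolated => sub { $data =~ /$str/i },
});
```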
Then again, it seems that regular expressions are not the problem. I returned to the old code and tried `index($data, $str) >= 0` instead of `($data =~ /$str/i)` but got no gain at all. Gaah!
```perl
sub SearchTitleAndBody {
  my ($string, $func, @args) = @_;
  my @found;
  my $lang = GetParam('lang', '');
  foreach my $id (AllPagesList()) {
    my $name = NormalToFree($id);
    my ($text) = PageIsUploadedFile($id); # set to mime-type if this is an uploaded file
    if (not $text) { # not uploaded file, therefore allow searching of page body
      OpenPage($id); # this opens a page twice if it is not uploaded, but that's ok
      if ($lang) {
        my @languages = split(/,/, $Page{languages});
        next if (@languages and not grep(/$lang/, @languages));
      }
      $text = $Page{text};
    }
    if (SearchString(uc($string), uc($name . "\n" . $text))) {
      push(@found, $id);
      &$func($id, @args) if $func;
    }
  }
  return @found;
}

sub SearchString {
  my ($string, $data) = @_;
  my @strings = grep /./, $string =~ /\"([^\"]+)\"|(\S+)/g; # skip null entries
  foreach my $str (@strings) {
    return 0 unless index($data, $str) >= 0;
  }
  return 1;
}
```
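For what it’s worth, the same kind of Benchmark sketch would compare the two approaches head to head; if the regular expression engine were the bottleneck, `index` ought to win clearly. (Test data invented, assuming the pattern is a plain substring.)

```perl
use Benchmark qw(cmpthese);

my $text  = ("lorem ipsum dolor sit amet\n" x 5000) . "fnord";
my $upper = uc $text;  # uppercase once, as the code above does

# Plain substring search versus a case-insensitive regexp match.
cmpthese(-1, {
    regexp => sub { $text =~ /fnord/i },
    index  => sub { index($upper, 'FNORD') >= 0 },
});
```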
I must therefore assume that for the simple regular expressions I’m using, the regular expression machinery is not the limiting factor. It must be page opening, page parsing, or something like that.
I then tried to avoid parsing the page itself. OpenPage now takes an extra argument that short-circuits the parsing:
```perl
sub OpenPage { # Sets global variables
  my ($id, $raw) = @_;
  return if $OpenPageName eq $id;
  $OpenPageName = $id;
  if ($IndexHash{$id}) {
    my $data = ReadFileOrDie(GetPageFile($id));
    return $data if $raw; # must call ParseData later!
    %Page = ParseData($data);
  } else {
    %Page = ();
    $Page{ts} = $Now;
    $Page{revision} = 0;
    if ($id eq $HomePage and (open(F, $ReadMe) or open(F, 'README'))) {
      local $/ = undef;
      $Page{text} = <F>;
      close F;
    } elsif ($CommentsPrefix and $id =~ /^$CommentsPrefix(.*)/o) {
      # do nothing
    }
  }
}
```
And I use this in a version similar to the very first copy of the code, except that I call SearchString twice: once against the raw file data as a cheap filter, and again against the parsed page text:
```perl
if (not $text) { # not uploaded file, therefore allow searching of page body
  my $data = OpenPage($id, 1); # this may open a page twice if not uploaded
  next unless SearchString($name . "\n" . $data, @regexps); # shortcut: avoid parsing page data
  %Page = ParseData($data);
  if ($lang) {
    my @languages = split(/,/, $Page{languages});
    next if (@languages and not grep(/$lang/, @languages));
  }
  $text = $Page{text};
}
if (SearchString($name . "\n" . $text, @regexps)) {
  push(@found, $id);
  &$func($id, @args) if $func;
}
```
This reduced the time used by a bit more than 2%. How disappointing.
#Regexp #Perl #Oddmuse
⁂
wow, thanks for doing those tests and posting the results, Alex — maybe the current search code is “close enough” to optimal that we shouldn’t worry about tuning it any more ...
– Mark Zimmermann 2008-10-02 01:43 UTC
---
Well, the only strange thing is that if you run `perl -d:DProf wiki.pl search=fnord raw=1; dprofpp` you’ll see that around a third of the time is spent in ParsePage. There must still be ways to gain a significant advantage.
– Alex Schroeder 2008-10-02 04:49 UTC
---
Based on an old idea by Daniel MacKay, I tried the following:
```
< foreach my $id (AllPagesList()) {
---
> foreach my $id (GrepFiltered($string, AllPagesList())) {
```
and
```perl
sub GrepFiltered { # grep is so much faster!!
  my ($string, @pages) = @_;
  return @pages unless $UseGrep;
  my @result;
  my $regexp = quotemeta join '|',
    map { index($_, '|') == -1 ? $_ : "($_)" }
    grep /./, $string =~ /\"([^\"]+)\"|(\S+)/g; # this acts as OR
  open(F, "grep -l -i $regexp $PageDir/*/*.pg|");
  while (<F>) {
    if (m/.*\/(.*)\.pg/) {
      push(@result, $1);
    }
  }
  close(F);
  return @result;
}
```
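To make this concrete: for the `fnord` test above, the pipeline boils down to roughly the following command (the page directory path is illustrative):

```
grep -l -i fnord /path/to/page/*/*.pg
```

`quotemeta` keeps any shell metacharacters in the query from being interpreted by the shell, `-l` makes grep print only the names of matching files, and the loop strips those back down to page names. Grep can afford to scan every page file because it does nothing else per file: no parsing, no per-page Perl subroutine calls.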
And guess what? `time perl wiki.pl search=fnord raw=1` runs in 0.375 secs. That’s about ¼ of the original running time!
Grep does in fact rule over Perl.
– Alex Schroeder 2008-10-02 22:17 UTC
---
wow, Alex! — a factor of 4 is huge ... hmmm ... I must study the above code ... any chance it could sneak into some future Oddmuse version? *(I’m still running 1.835 — apologies, I know I need to update it!)*
– Mark Zimmermann 2008-10-05 11:41 UTC
---
I immediately added it to the latest and greatest version of Oddmuse. :D
– Alex Schroeder 2008-10-05 13:34 UTC