💾 Archived View for marginalia.nu › projects › edge › design-notes.gmi captured on 2022-04-29 at 11:33:04. Gemini links have been rewritten to link to archived content


Notes on Designing a Search Engine

robots.txt

People put lists of very specific URLs they do not want you to look at in robots.txt, and I don't just mean secret admin log-in pages (even though that happens too), but embarrassing stuff: dirt, like the awkward August 2003 issue of the campus magazine where the dean awarded Kony philanthropist of the year. It keeps the search engines out, but human beings can read these files too.

Speaking of robots.txt, there is no real standard. Adherence is best-effort for every search engine, and the number of weird directives you'll find is staggering. Oh, and ASCII art too, little messages. It's cute, but not something you should do if crawler adherence actually matters.
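Since adherence is best-effort, a crawler typically boils robots.txt down to simple prefix rules. Here is a minimal sketch of that idea (a hypothetical helper, not the crawler's actual parser; real files contain far more than this handles, such as Allow, Crawl-delay, wildcards, and the aforementioned ASCII art):

```java
import java.util.ArrayList;
import java.util.List;

// Best-effort robots.txt handling: collect Disallow prefixes that apply
// to our user-agent (or '*'), then check request paths against them.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    RobotsRules(String robotsTxt, String userAgent) {
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            // Strip comments and whitespace
            String trimmed = line.split("#", 2)[0].trim();
            if (trimmed.toLowerCase().startsWith("user-agent:")) {
                String agent = trimmed.substring("user-agent:".length()).trim();
                applies = agent.equals("*") || agent.equalsIgnoreCase(userAgent);
            } else if (applies && trimmed.toLowerCase().startsWith("disallow:")) {
                String path = trimmed.substring("disallow:".length()).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    boolean isAllowed(String path) {
        return disallowed.stream().noneMatch(path::startsWith);
    }
}
```

Prefix matching is the lowest common denominator nearly all crawlers agree on, which is also why those very specific embarrassing URLs in the file work as an index of exactly what to look at.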

Standards

The HTML standard is not a standard. A major American university uses <title> tags for its navigational links. It's a technological marvel how coherently web browsers deal with the completely incoherent web they browse.

Quality measure

The search engine evaluates the "quality" of a web page with a formula that, a bit simplified, looks like

       length_text     -script_tags
  Q =  -----------  x e
       length_markup

As a consequence, the closer to plain text a website is, the higher it'll score. The more markup it has in relation to its text, the lower it will score. Each script tag is penalized. One script tag will still give the page a relatively high score, provided everything else is premium quality; but once you start having multiple script tags, you'll very quickly find yourself at the bottom of the search results.

Modern web sites have a lot of script tags. The web page of Rolling Stone Magazine has over a hundred script tags in its HTML code. Its quality rating is on the order of 10^-51.
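The formula above can be sketched in a few lines of Java. This is a hypothetical illustration with made-up names, not the engine's actual code, but it shows why script tags dominate: the exponential term crushes the markup ratio long before the ratio itself matters.

```java
// Illustration of the quality formula above:
//   Q = (length_text / length_markup) * e^(-script_tags)
class QualityMeasure {
    static double score(int textLength, int markupLength, int scriptTags) {
        // Ratio of visible text to total markup, scaled down
        // exponentially by the number of script tags
        return ((double) textLength / markupLength) * Math.exp(-scriptTags);
    }
}
```

With a text-to-markup ratio of 0.8, one script tag still leaves a score around 0.29, but a hundred script tags pushes the score below 10^-44 regardless of how clean the markup otherwise is.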

/log/10-astrolabe-2-sampling-bias.gmi

Link Farms

Smut and link farms seem to go hand-in-hand, to the extent that I have at times filtered out the first to get at the other.

/log/04-link-farms.gmi

Trade-offs

There is a constant trade-off between usefulness and efficiency. That is a necessity when running a search engine, a workload typically reserved for a datacenter, on consumer hardware. Do you need to be able to search for slpk-ya-fxc-sg-wh, the serial number of a Yamaha subwoofer, if it comes at the cost of polluting the index with such highly unique entities? At the cost of speed and size? What about Day[9]? Are the conventions of occasional digital handles enough to justify increasing the search term dictionary by 20%?
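One way to frame that trade-off in code is a term-admission predicate that decides whether a token earns a dictionary entry. This is a hypothetical sketch of the idea, not the engine's actual policy, and the thresholds are invented:

```java
// Hypothetical term-admission filter: reject serial-number-like tokens
// to keep the dictionary small, at the cost of not being able to
// search for them.
class TermFilter {
    static boolean admit(String term) {
        // Reject long dash-riddled tokens like "slpk-ya-fxc-sg-wh"
        long dashes = term.chars().filter(c -> c == '-').count();
        if (term.length() > 12 && dashes >= 3) {
            return false;
        }
        // Reject mostly-numeric tokens, e.g. raw serial numbers
        long digits = term.chars().filter(Character::isDigit).count();
        return digits <= term.length() / 2;
    }
}
```

Every rule like this trades recall for index size: tighten it and the dictionary shrinks, but some legitimate query (a product code, a handle like Day[9]) stops being findable.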

Standard searches

It's hard to quantify qualitative aspects, but I have some standard tasks I use to evaluate how well the search engine works.

While the goal of the search engine is to offer an interesting degree of inaccuracy, it can't be too inaccurate either, to the point of being useless or returning basically random links. The challenge is promoting sufficiently relevant results. R.P. Feynman is an interesting man, but that doesn't make his cursory mention of silly putty an interesting result. Likewise, people seem to love to attribute "man is the measure of all things" to Protagoras, but relatively few articles are actually relevant to the man himself.

Description extraction

The most effective way of extracting a meaningful snippet of text from a web site seems to be to simply look for a piece of text that has a relatively low proportion of markup. 50% seems a decent enough cut-off.

I've tried various approaches, and this relatively simple one works by far the best. The problem, in general, is telling navigation apart from content. It's better to have no summary than summaries that look like

Home Blog About RSS feed Follow me on instagram | | | | | (C) 2010 Brown Horse Industries CC-BY-SA 3.0

This is the actual code I use:

// Requires org.jsoup.nodes.Document and java.util.Optional.
// Builds a summary from <p> elements whose visible text makes up more
// than half of their HTML, i.e. paragraphs that are mostly prose
// rather than markup-heavy navigation.
private Optional<String> extractSummaryRaw(Document parsed) {
  StringBuilder content = new StringBuilder();

  parsed.getElementsByTag("p").forEach(
        elem -> {
          // Keep only elements where text outweighs markup
          if (elem.text().length() > elem.html().length()/2) {
            content.append(elem.text());
          }
      }
  );

  // Discard trivially short summaries
  if (content.length() > 10) {
    return Optional.of(content.toString());
  }
  return Optional.empty();
}

Links

/projects/edge/index.gmi

/topic/astrolabe.gmi

Navigation

Back to Index

Reach me at kontakt@marginalia.nu