[Geocities-esque "UNDER CONSTRUCTION" GIF here]
Despite the apparent simplicity of this basic algorithm, web crawling has many inherent challenges
- Marc Najork
I enjoy building crawlers. I've done so in both a hobbyist and a professional context for over 15 years. Crawlers are so interesting to me because there is such an enormous range of usable capability. The simplest crawler can be only a dozen lines in a high-level language and will work. However, crawlers can scale to incredible complexity, allowing them to process hundreds of pages a second per crawling node. Let's start with a simple crawler design and add capabilities to it.
Note: These techniques apply equally to writing any crawler, regardless of protocol. In fact, early web crawlers handled HTTP, FTP, and Gopher. I may use Gemini in my examples, but nothing is specific to any one protocol.
The most basic crawler looks something like this:
Queue = []
Queue.enqueue(startUrl)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)
    NewLinks = ExtractLinks(Response)
    Queue.Enqueue(NewLinks)
}
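To make this concrete, here is a minimal, runnable Python sketch of the same loop. It uses HTTP and a crude regex link extractor purely for illustration (the algorithm itself is protocol-agnostic), and caps the number of pages so the example terminates.

```python
# A runnable toy version of the basic crawler above. HTTP, the regex link
# extraction, and the page cap are simplifications for the example only.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, max_pages=25):
    queue = deque([start_url])
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip URLs that fail to download
        pages += 1
        # ExtractLinks: crude; a real crawler would use a proper parser.
        for link in re.findall(r'href="([^"]+)"', body):
            queue.append(urljoin(url, link))  # resolve relative links

crawl("https://example.com/")
```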
It has:
* A queue of URLs to visit, seeded with a starting URL
* A loop that runs until the queue is empty
* Logic to request each URL, extract the links from the response, and add them to the queue
Even with such a simple design, there are minor modifications that have big implications for the behavior of the crawler. In the example, we use a queue to store our set of URLs to visit. A queue is a First-In-First-Out data structure. We could have used a stack instead, which is a Last-In-First-Out data structure. Using a queue or a stack changes the behavior of our crawler:
* A queue makes the crawl breadth-first: URLs are visited in the order they were discovered.
* A stack makes the crawl depth-first: the crawler keeps following the most recently discovered links deeper and deeper before returning to earlier ones.
Crawlers typically use a breadth-first approach, because this allows them to visit a wide variety of URLs from different sources, whereas depth-first can result in the crawler plunging into a specific area and consuming all of its links. If a site has hundreds or thousands of URLs, or even dynamically generates URLs so there is an infinite number of them, this can result in the crawler getting stuck and never visiting any other sites. So our simplest crawler will use a queue.
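As a quick illustration in Python (the placeholder page names are just for the example), a collections.deque can behave as either structure:

```python
# The same container gives breadth-first or depth-first ordering depending
# on which end we pop from.
from collections import deque

to_visit = deque(["page1", "page2", "page3"])

first = to_visit.popleft()  # FIFO: queue behavior -> breadth-first crawl
last = to_visit.pop()       # LIFO: stack behavior -> depth-first crawl
```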
Our basic design has a lot of room for improvement.
First, our crawler should do something valuable with the responses it gets, such as index them for a search engine. So let's add a placeholder DoSomething() function for that.
Queue = []
Queue.enqueue(startUrls)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)
    DoSomething(Response)
    NewLinks = ExtractLinks(Response)
    Queue.Enqueue(NewLinks)
}
Next, we don't want our crawler to visit the same URL multiple times (at least for now). So we need to keep track of not only the URLs the crawler has already visited, but also the URLs that the crawler already has in its queue. Let's pass our list of extracted URLs through a filter function to remove links we have already seen.
Now our crawler code looks like this:
Queue = []
Queue.enqueue(startUrls)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)
    DoSomething(Response)
    NewLinks = ExtractLinks(Response)
    NewLinks = RemoveAlreadySeen(NewLinks)
    Queue.Enqueue(NewLinks)
}
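A minimal sketch of how RemoveAlreadySeen() could be implemented in Python, assuming URLs have already been normalized to canonical strings (more on that below):

```python
# A possible RemoveAlreadySeen: a set of every URL that has ever been queued
# or visited. Assumes URLs are already normalized strings.
seen_urls = set()

def remove_already_seen(new_links):
    fresh = [url for url in new_links if url not in seen_urls]
    seen_urls.update(fresh)  # mark them now, so they are never queued twice
    return fresh
```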
If we are adding code that tries to track whether we have seen a URL before, we need a way to tell if two URLs are the same. This is more complex than simple string comparison: URLs can differ in hostname case, default ports, trailing slashes, percent-encoding, and fragments while still pointing to the same resource.
This means the root of any crawler is a robust set of URL normalization functions.
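Here is a sketch of what such a normalizer might look like in Python. Exactly which rules to apply (for example, whether to strip "index.gmi") is a policy choice for your crawler, not a standard; the helper below is illustrative.

```python
# A sketch of a URL normalizer using the standard library.
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"gemini": 1965, "http": 80, "https": 443}

def normalize_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep the port only if it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop fragments; they never change what the server returns.
    return urlunsplit((scheme, host, path, parts.query, ""))
```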
Consider a resource, gemini://example.com/bar.gmi, which contains a relative hyperlink to "foo/a.gmi".
page: gemini://example.com/bar.gmi
=> foo/a.gmi a link
Due to a site misconfiguration, when you visit "foo/a.gmi" you get the same page content, again with a relative URL:
page: gemini://example.com/foo/a.gmi
=> foo/a.gmi a link
This creates a crawler trap, where the crawler keeps finding new URLs:
gemini://example.com/foo/a.gmi
gemini://example.com/foo/foo/a.gmi
gemini://example.com/foo/foo/foo/a.gmi
gemini://example.com/foo/foo/foo/foo/a.gmi
gemini://example.com/foo/foo/foo/foo/foo/a.gmi
...
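You can reproduce this runaway pattern with a few lines of Python. The uses_relative/uses_netloc tweak is needed because urllib does not recognize the gemini scheme by default; it is a common workaround in Gemini tooling rather than anything specific to this crawler.

```python
# Demonstrating the trap: resolving the same relative link against each newly
# discovered URL produces an ever-longer URL.
from urllib.parse import urljoin, uses_netloc, uses_relative

# Teach urllib that gemini URLs support relative references and a netloc.
uses_relative.append("gemini")
uses_netloc.append("gemini")

url = "gemini://example.com/bar.gmi"
for _ in range(4):
    url = urljoin(url, "foo/a.gmi")
    print(url)
# gemini://example.com/foo/a.gmi
# gemini://example.com/foo/foo/a.gmi
# gemini://example.com/foo/foo/foo/a.gmi
# gemini://example.com/foo/foo/foo/foo/a.gmi
```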
While eventually you will hit a limit on the size of the URL, ideally you want to avoid something like this. This is an example of how processing duplicate content can trap you. But duplicate content has other problems too. These two URLs could point to the same content:
gemini://example.com/foo/
gemini://example.com/foo/index.gmi
You probably don't want to process both of them. This is especially true as your processing gets smarter and thus takes longer. You want to avoid duplicate content. This means modifying our crawler so that it detects when it has seen a response before. Depending on the protocol, this may involve just hashing the response body (e.g. Gemini), or normalizing some response headers, removing others, and including them in the hash (e.g. HTTP).
This is usually best accomplished by hashing a response and keeping a table of all the hashes you have already seen. So now our crawler looks like this:
Queue = []
Queue.enqueue(startUrl)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)
    if(!SeenBefore(Response)) {
        DoSomething(Response)
        NewLinks = ExtractLinks(Response)
        NewLinks = RemoveAlreadySeen(NewLinks)
        Queue.Enqueue(NewLinks)
    }
}
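Here is one way SeenBefore() might look for a Gemini-style response, as a sketch that assumes we have the raw response body as bytes; for HTTP you would likely fold selected, normalized headers into the hash as well.

```python
# Sketch of SeenBefore: hash the response body and remember every hash.
import hashlib

seen_hashes = set()

def seen_before(body_bytes):
    digest = hashlib.sha256(body_bytes).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```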
Crawlers always have a fundamental tension. They have potentially hundreds of thousands (in the case of Gemini) to millions (the early web) to billions (the modern web) of URLs to visit. So they must work very, very quickly.
However, despite the need for speed, anyone running a crawler that overloads servers soon learns that not only is such behavior considered unacceptable, it paradoxically limits your effectiveness.
In other words, going too fast will actually make you perform worse than going more slowly. Why? An overloaded server starts responding slowly, returning errors, or blocking your crawler outright, so you end up collecting less content, not more.
At the very least, a crawler should not attempt to download multiple URLs from the same server simultaneously. Better, it should impose a limit on the portion of a web server's resources it consumes. Thus it is the responsibility of crawlers to include a delay between requests.
That looks something like this:
Queue = []
Timer
Queue.enqueue(startUrl)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Timer = new Timer()
    Timer.start()
    Response = Request(Url)
    if(!SeenBefore(Response)) {
        DoSomething(Response)
        NewLinks = ExtractLinks(Response)
        NewLinks = RemoveAlreadySeen(NewLinks)
        Queue.Enqueue(NewLinks)
    }
    while(Timer.ElapsedSec < 1.5) {
        sleep()
    }
}
Here we are using a timer so that, regardless of how quickly the site responds, the crawler will not request URLs faster than once every 1.5 seconds.
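In Python this could be sketched with a monotonic clock. The 1.5 second figure is just the example value used above, and fetch() is a placeholder for however the request is actually made.

```python
# Sketch: ensure at least MIN_DELAY seconds elapse between the *start* of one
# request and the start of the next, regardless of how fast the server is.
import time

MIN_DELAY = 1.5

def crawl_politely(queue, fetch):
    while queue:
        started = time.monotonic()
        url = queue.popleft()          # queue is assumed to be a deque
        response = fetch(url)          # placeholder for the request logic
        # ... process the response ...
        elapsed = time.monotonic() - started
        if elapsed < MIN_DELAY:
            time.sleep(MIN_DELAY - elapsed)
```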
So far, our crawler has operated like this:
        +--------------+
  +---->|    Queue     |
  |     +------+-------+
  |            |
  | Add New    | Pull off URL
  | URLs to    | to visit
  | Queue      |
  |            v
+-+------------+-+
|    Crawler     |
+------+--+------+
       |  ^
       |  | Requests/Responses
       v  |
  Gemini Capsules
While we have made our crawler smarter, we haven't increased its throughput. It only requests a single URL at a time. In fact, we just added a timer to make it polite, so its speed is capped. Most of the time spent requesting and downloading a URL is spent waiting on the network, so this is a perfect opportunity to download multiple URLs in parallel.
+-------------------+
|     URL Queue     |
+-------------------+
    /      |      \
+----+  +----+  +----+
| A  |  | B  |  | C  |
+----+  +----+  +----+
There are some immediate and obvious challenges with having multiple threads or workers like this.
When dealing with a single thread, knowing if the crawler was "done" was pretty easy: if the queue was empty, there was no more work to do. This is not the case with a multi-threaded crawler. A and B may not be doing any work, and they may check the queue and see that it is empty. However, if C is still making a request or processing a response, it may yet find more URLs to add to the queue. The crawler is only "done" if no threads are actively doing work AND the queue is empty.
while(Queue.Length > 0 || threadsDoingWork > 0) {
    // crawler is still running
}
For simplicity, let's define an abstract function IsCrawlerComplete(). This will handle checking all of this for us.
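A sketch of what IsCrawlerComplete() could look like in Python, assuming each worker calls begin_work() and end_work() around processing a URL; the counter, lock, and helper names here are illustrative, not part of the original pseudocode.

```python
# Sketch: the crawler is complete only when the queue is empty AND no worker
# is in the middle of processing a URL (which might yield new links).
import threading

lock = threading.Lock()
active_workers = 0

def begin_work():
    global active_workers
    with lock:
        active_workers += 1

def end_work():
    global active_workers
    with lock:
        active_workers -= 1

def is_crawler_complete(queue):
    with lock:
        return len(queue) == 0 and active_workers == 0
```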
However there are other, more subtle challenges when building a multi-threaded crawler.
Imagine that one of the crawler threads has visited example.com and added a number of new links to the queue. So inside our queue there is a run of consecutive URLs, all pointing to example.com. In a multi-threaded design with a single crawl queue, multiple different threads will pull work off that queue. They are now all sending requests for different URLs on example.com. This circumvents the throttling mechanism we put into our crawler before. In fact, it's worse than our original design. In our original design, we could only send a new request after the previous request had finished. In our new design, we could have 5, 10, even 20 requests to the same site, all in flight at the same time!
We need a way to have multiple workers, while ensuring that the URLs for a given site cannot be processed by more than one worker at a time.
Najork's work on the Mercator crawler produced several papers about this. A simplified version of their approach is to add URLs into N different queues, where each queue has a dedicated worker. A URL is sorted into a specific queue by hashing its hostname modulo N. This ensures that URLs for a specific host always end up in the same queue, and thus multiple workers cannot make requests to the same site simultaneously. This allows our politeness policy to continue to work, while dramatically increasing throughput.
              Incoming URL
                   |
                   v
              +---------+
              |  Hash   |
              +---------+
             /     |     \
   +-------+ +-------+ +-------+     +-------+
   |Queue  | |Queue  | |Queue  |     |Queue  |
   |  A    | |  B    | |  C    | ... |  N    |
   +---+---+ +---+---+ +---+---+     +---+---+
       |         |         |             |
       v         v         v             v
   +-------+ +-------+ +-------+     +-------+
   |Worker | |Worker | |Worker |     |Worker |
   +-------+ +-------+ +-------+     +-------+
To do this, we replace our Queue with a "URL Frontier". The URL frontier represents the URLs that still need to be visited. For now, we implement the URL frontier as a set of N queues. We need new functions to add a URL to the frontier and to get the next URL from it. Our GetNext() function requires that the worker provide the ID of the queue it is associated with. Remember, in this model we have N workers for N queues in a 1:1 mapping. Here is what our URL Frontier looks like:
//URL Frontier
Frontier.Queues = [N] //array of N queues

Frontier.AddUrl(url) {
    var num = hash(url.hostname) % N
    Frontier.Queues[num].enqueue(url)
}

Frontier.GetNext(queueNum) {
    return Frontier.Queues[queueNum].dequeue()
}
Under this new system, a worker could check the frontier and find that there is no URL waiting for it, so we need to account for that. Now our crawler looks like this:
//Function for each worker thread. The thread's ID, 0...N, is passed in
Worker(id) {
    while(!IsCrawlerComplete()) {
        Url = Frontier.GetNext(id)
        Timer = new Timer()
        Timer.start()
        if(Url != null) {
            Response = Request(Url)
            if(!SeenBefore(Response)) {
                DoSomething(Response)
                NewLinks = ExtractLinks(Response)
                NewLinks = RemoveAlreadySeen(NewLinks)
                Frontier.AddUrl(NewLinks)
            }
        }
        while(Timer.ElapsedSec < 1.5) {
            sleep()
        }
    }
}
How much can this increase throughput? A tremendous amount. You could potentially have N workers, where N is the total number of sites you want to visit. This is wasteful, since the vast majority of sites don't have that many pages, while a few sites have a very large number of pages. In a well designed crawler, you can usually increase N until you run into one of a few different bottlenecks, such as DNS resolution, network bandwidth, or how quickly you can process and store the responses.
Actually, simply hashing the hostname is not good enough, because of virtual hosting. Virtual hosting allows multiple sites to be served by the same single IP address. Since network connections are made to IP addresses, not hostnames, when we talk about crawlers needing to be polite and not make too many connections or requests in parallel, our atomic unit is an IP address, not a hostname.
A good example of this in Gemini space is Flounder. Flounder serves more than a hundred capsules, all as subdomains of the flounder.online hostname (e.g. foo.flounder.online, bar.flounder.online). All of these, however, are served from the same IP address. A crawler with multiple workers, with its worker queues partitioned by hostname, will quickly overwhelm flounder.online, and you will get errors and timeouts trying to access its content.
To resolve this issue, we should use the IP address that a hostname resolves to when determining which queue to place a URL into.
This is another example of how DNS resolution is critical for crawlers. In fact, as we will see, DNS resolution is often a bottleneck for crawlers.
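As a sketch of this in Python (the same modulus-of-a-hash approach as above, just keyed on the resolved IP; the function name and queue count are illustrative):

```python
# Sketch: assign a URL to a worker queue based on the IP its host resolves to,
# so all capsules behind one IP share a single queue and a single worker.
import socket
import zlib
from urllib.parse import urlsplit

NUM_QUEUES = 16  # example value

def queue_for_url(url):
    host = urlsplit(url).hostname
    ip = socket.gethostbyname(host)        # DNS lookup; worth caching
    return zlib.crc32(ip.encode()) % NUM_QUEUES
```

Using crc32 (rather than Python's built-in hash()) keeps the queue assignment stable across runs of the crawler.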
Robots.txt provides a way for a site owner to tell crawlers which areas of the site they should avoid. You will want to respect this.
You will also find sites or URLs that don't make sense to crawl or index, such as pages whose content changes on every request.
Maintaining a list like this is difficult to scale. Ideally you want your crawler to be able to detect things like this and automatically start skipping them (don't request URLs whose content rapidly changes, etc.). However, even early web crawlers, through the mid 2000s, maintained lists of known bad URLs or patterns they would skip.
You can combine your robots.txt filter and your "URLs to skip" feature into a single module, FilterAllowedLinks(), since they are basically doing the same thing: checking a URL against a list of rules and skipping it if it matches one. So now our crawler looks like this:
//Function for each worker thread. The thread's ID, 0...N, is passed in
Worker(id) {
    while(!IsCrawlerComplete()) {
        Url = Frontier.GetNext(id)
        Timer = new Timer()
        Timer.start()
        if(Url != null) {
            Response = Request(Url)
            if(!SeenBefore(Response)) {
                DoSomething(Response)
                NewLinks = ExtractLinks(Response)
                NewLinks = RemoveAlreadySeen(NewLinks)
                NewLinks = FilterAllowedLinks(NewLinks)
                Frontier.AddUrl(NewLinks)
            }
        }
        while(Timer.ElapsedSec < 1.5) {
            sleep()
        }
    }
}
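Finally, here is a sketch of what FilterAllowedLinks() could look like in Python, assuming the robots.txt rules have already been fetched and reduced to per-host disallowed path prefixes; the rule structures and example entries are purely illustrative.

```python
# Sketch: drop URLs that match a robots.txt disallow rule or a known-bad pattern.
import re
from urllib.parse import urlsplit

# host -> disallowed path prefixes, built from each site's robots.txt
robots_rules = {"example.com": ["/cgi-bin/", "/private/"]}

# hand-maintained patterns for URLs not worth crawling
skip_patterns = [re.compile(r"/random\b")]

def filter_allowed_links(urls):
    allowed = []
    for url in urls:
        parts = urlsplit(url)
        prefixes = robots_rules.get(parts.hostname, [])
        if any(parts.path.startswith(p) for p in prefixes):
            continue
        if any(p.search(parts.path) for p in skip_patterns):
            continue
        allowed.append(url)
    return allowed
```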