Notes on Writing Web Scrapers

Author: cushychicken

Score: 128

Comments: 58

Date: 2021-12-02 18:03:07

Web Link

________________________________________________________________________________

marginalia_nu wrote at 2021-12-02 19:09:37:

Be nice to your data sources.

Very much this. If I know I may be about to consume a non-negligible amount of resources from a server by indexing a website with lots of subdomains, I typically send the webmaster an email asking them if this is fine, telling them what I use the data for and how often my crawler re-visits a site, and asking whether I should set any specific crawl delays or crawl at particular times of day.

In every case I've done this, they have been fine with it. It goes down a lot better than just barging in and grabbing data. This also gives them a way of reaching me if something should go wrong. I'm not just some IP address grabbing thousands of documents, I'm a human person they can talk to.

If I got a no, I would respect that too, not try to circumvent it like some entitled asshole. No means no.

GekkePrutser wrote at 2021-12-03 04:34:49:

I wouldn't do that. It's better to stay under the radar IMO. They usually won't notice the traffic if it's just for personal reasons. And who says the internet can't be used for automatic retrieval?

I'm not going to pound websites with requests every minute of course but I think this falls under legitimate use. Whether I click the button myself or just schedule a script to do it shouldn't matter so much.

But if you ask, you draw attention to it and get their legal department involved (after all, most websites are not run by a single webmaster in their bedroom), and they will most likely say no, because legal people are hesitant to commit to anything.

But maybe my use case is different. I just scrape stuff to check if something I want is back in stock, to download the daily PDF newspaper I pay for, to archive forum posts I've written, stuff like that. I don't index whole sites.

But yeah I do make sure I don't bombard them with requests, though this is more from a "staying under the radar" point of view. And indeed to avoid triggering stuff like cloudflare.

But if you're scraping to run your own search engine and offer the results to the public the situation is much more complex of course, both technically and legally.

cushychicken wrote at 2021-12-02 19:36:26:

I've never actually reached out to a webmaster to ask permission, but I think that's a great idea. (They may even have some suggestions for a better way to achieve what I'm doing.)

How do you typically find contact info for someone like that?

I'm running _very_ generous politeness intervals at the moment to try and ensure I'm not a nuisance - one query every two seconds.
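
Something like this, as a minimal sketch of a fixed politeness interval (the delay constant and function name are illustrative, not the project's actual code):

    import time

    import requests

    POLITENESS_DELAY_S = 2  # one query every two seconds

    def polite_get(session: requests.Session, url: str) -> requests.Response:
        """Fetch a URL, then pause so the next request can't fire too soon."""
        response = session.get(url, timeout=30)
        time.sleep(POLITENESS_DELAY_S)
        return response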

marginalia_nu wrote at 2021-12-02 19:42:17:

If you can't find the address on the website, sometimes it's in robots.txt, you could also try support@domain, admin@domain or webmaster@domain.

Contact forms can work as well; they usually seem to get forwarded to the IT department if they look like they have IT stuff in them, and I've had reasonable success getting hold of the right people that way.

cushychicken wrote at 2021-12-02 19:48:29:

I'll give that a shot!

Do you know if this is generally an in-house position for companies that use third party platforms?

I ask because Workday has been the absolute bane of my indexing existence, and I suspect they make it hard so they can own distribution of jobs to approved search engines. (Makes it easier to upcharge that way, I suppose.)

If the administrator for the job site is the Workday customer (i.e. Qualcomm or NXP or whoever is using Workday to host their job ads), I'd suspect I'd have a chance at getting a better way to index their jobs. (My god, I'd love API access if that's a thing I can get. I'd be a fly on the wall in most cases - one index a day is plenty for my purposes!)

Terry_Roll wrote at 2021-12-02 23:06:39:

If their security is any good, their firewall may tarpit you anyway if they can see you are spidering links quicker than a human can read and your user agent and/or IP address (range) offers no clues.

buffet_overflow wrote at 2021-12-02 21:24:42:

This is a nice approach. I generally leave a project specific email in the request headers with a similar short summary of my goals.
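
For example, assuming Python's requests library (the header values here are placeholders):

    import requests

    # Identify the bot and give the site operator a way to reach a human.
    HEADERS = {
        "User-Agent": "my-scraper/0.1 (+https://example.com/about-this-bot)",
        "From": "scraper-project@example.com",  # placeholder project email
    }

    response = requests.get("https://example.com/jobs", headers=HEADERS, timeout=30)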

marginalia_nu wrote at 2021-12-02 22:11:55:

Yeah, my User-Agent is "search.marginalia.nu", and my contact information is not hard to find on that site. Misbehaving bots are very annoying and it's incredibly frustrating when you can't get hold of the owner to tell them about it.

gjs278 wrote at 2021-12-02 21:20:30:

complete waste of time

ggambetta wrote at 2021-12-02 20:47:22:

I've written scrapers over the years, mostly for fun, and I've followed a different approach. Re. "don't interrupt the scrape", whenever URLs are stable, I keep a local cache of downloaded pages, and have a bit of logic that checks the cache first when retrieving an URL. This way you can restart the scrape at any time and most accesses will not hit the network until the point where the previous run was interrupted.

This also helps with the "grab more than you think you need" part - just grab the whole page! If you later realize you needed to extract more than you thought, you have everything in the local cache, ready to be processed again.
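
A minimal sketch of that cache-first pattern (directory layout and naming are illustrative):

    import hashlib
    import pathlib

    import requests

    CACHE_DIR = pathlib.Path("page_cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def fetch(url: str) -> str:
        """Return the page HTML, hitting the network only on a cache miss."""
        cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
        if cache_file.exists():
            return cache_file.read_text(encoding="utf-8")
        html = requests.get(url, timeout=30).text
        cache_file.write_text(html, encoding="utf-8")
        return html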

justsomeuser wrote at 2021-12-03 10:09:00:

I would use SQLite between each of the stages so each process can be restarted, run concurrently and observed with queries.

cushychicken wrote at 2021-12-02 21:22:30:

You're not the first person in this thread to suggest grabbing the whole page text. I've never tried, just because I assumed it was so much space as to be impractical, but I don't see the harm in trying!

muxator wrote at 2021-12-02 21:48:55:

My current favorite cache for whole pages is a single sqlite file with the page source stored with brotli compression. Additional columns for any metadata you might need (URL, scraping sessionid, age). The resulting file is big (but brotli for this is even better than zstd), and having a single file is very convenient.
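
Roughly this shape, assuming Python with the brotli package (the schema is illustrative, not a description of any particular project):

    import sqlite3
    import time

    import brotli  # pip install brotli

    conn = sqlite3.connect("pages.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url        TEXT PRIMARY KEY,
            session_id TEXT,
            fetched_at REAL,
            body       BLOB   -- brotli-compressed page source
        )
    """)

    def store_page(url: str, session_id: str, html: str) -> None:
        compressed = brotli.compress(html.encode("utf-8"))
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, session_id, time.time(), compressed),
        )
        conn.commit()

    def load_page(url: str) -> str:
        row = conn.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
        return brotli.decompress(row[0]).decode("utf-8")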

jmnicolas wrote at 2021-12-03 10:26:29:

It is not advised to use SQLite with several threads. Are you running only a single thread?

muxator wrote at 2021-12-03 10:45:03:

You can't use SQLite from multiple threads if you are in single-thread mode [0]. If you are in multi-thread mode you can use multiple connections to the DB from different threads; if you want to share a single connection, you have to serialize access to it yourself, or use SERIALIZED mode.

That's if you do not want to write any extra logic. If you do, there are other possibilities, for example using a queue and a dedicated thread for handling database access, but I personally do not think there are many advantages to that more complicated approach.

[0]

https://www.sqlite.org/threadsafe.html
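
A sketch of that queue-plus-dedicated-writer-thread variant, using Python's built-in sqlite3, queue and threading modules (table and names are illustrative):

    import queue
    import sqlite3
    import threading

    write_queue: queue.Queue = queue.Queue()

    def db_writer() -> None:
        """One thread owns the connection; every other thread just enqueues."""
        conn = sqlite3.connect("scrape.db")
        conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
        while True:
            item = write_queue.get()
            if item is None:  # sentinel: shut down
                break
            conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", item)
            conn.commit()
        conn.close()

    threading.Thread(target=db_writer, daemon=True).start()

    # Any scraper thread can now safely do:
    write_queue.put(("https://example.com/job/123", "<html>...</html>"))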

jmnicolas wrote at 2021-12-03 14:59:53:

Thanks, I wasn't aware of this. I'm not sure if it was old info or I "imagined" it but in any case it's good to know.

I hope it wasn't true in 2019, since I painstakingly wrote a multi-threaded app that was accessing SQLite on only 1 thread!

Chris2048 wrote at 2021-12-03 04:54:05:

But how do you deal with dynamic pages, i.e. content that changes each time - would you need to pattern-match?

nlh wrote at 2021-12-03 12:47:27:

The problem with waiting 1-2 seconds between requests is that if you’re trying to scrape on the scale of millions of pages, the difference between 30 parallel requests / sec and a single request every 1-2 seconds is the difference between a process that takes 9 hours and a month.
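
(Back of the envelope for one million pages: 1,000,000 requests at 30/sec is about 33,000 seconds, a bit over 9 hours; at one request every 2 seconds it's 2,000,000 seconds, roughly 23 days.)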

So I think there’s a balance to be struck - I’d argue you should absolutely be thoughtful about your target - if they notice you or you break them, that could be problematic for both of you. But if you’re TOO conservative, the job will never get done in a reasonable timeframe.

cushychicken wrote at 2021-12-03 13:04:05:

_The problem with waiting 1-2 seconds between requests is that if you’re trying to scrape on the scale of millions of pages, the difference between 30 parallel requests / sec and a single request every 1-2 seconds is the difference between a process that takes 9 hours and a month_

Fortunately for me, I'm almost assuredly never going to have to do this on the scale of millions of pages. If time proves me wrong, I suspect I'll be hiring someone with more expertise to take over that part of the project.

I'm definitely biasing towards a very conservative interval. Optimizing the runtime is more to help with tightening the iteration cycles for me, the sole developer, instead of limiting the job size to a reasonable timeframe.

throwaway894345 wrote at 2021-12-02 22:25:08:

When you’re reading >1000 results a day, and you’re inserting generous politeness intervals, an end-to-end scrape is expensive and time consuming. As such, you don’t want to interrupt it. Your exception handling should be air. Tight.

Seems like it would be more robust to just persist the state of your scrape so you can resume it. In general, I try to minimize the amount of code that a developer _MUST_ get right.

cushychicken wrote at 2021-12-03 00:03:00:

_Seems like it would be more robust to just persist the state of your scrape so you can resume it._

Say more about this. I'm not a software engineer by training, so I don't really know what "persist your state" would look like in this case.

franga2000 wrote at 2021-12-03 11:18:38:

Generally this means that you should be writing all the important information about the program to disk so in the event of a crash, you can read it back and continue where you left off. Alternatively, your state could BE on disk (or whatever persistent store, like a DB) and your program should process it in atomic chunks with zero difference between a "from scratch" and "resumed" run.

I work on web scrapers at [day job] and can say the latter approach is far better, but the former is far more common. An implementation of the former could be as simple as "dump the current URL queue to a CSV file every minute".

As for doing this "properly" in a way that works at scale, my preferred way is with a producer-consumer pipeline and out-of-process shared queues in between each pair. So, for example, you have 4 queues: URL queue, verify queue, reponse queue, item queue and 5 stages: fetch (reading from URLQ and writing to verifyQ), response check (reading from verifyQ, writing good responses to reponseQ and bad response URLs back to fetchQ), parse (reading from responseQ and writing to itemQ) and publish (reading from itemQ and writing to database or whatever).

This can be both horrendously overcomplicated and beautifully simple. I've implemented it both with straight-up bash scripts and with dedicated queues and auto-scaling armies of Docker containers.
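
A rough single-process sketch of that pipeline shape, assuming Python threads and in-memory queues (a real deployment would use out-of-process queues as described above; all names are illustrative):

    import queue
    import threading
    import time

    import requests

    url_q: queue.Queue = queue.Queue()       # URLs waiting to be fetched
    verify_q: queue.Queue = queue.Queue()    # raw responses waiting to be checked
    response_q: queue.Queue = queue.Queue()  # good responses waiting to be parsed
    item_q: queue.Queue = queue.Queue()      # parsed items waiting to be published

    def fetch_stage():
        while True:
            url = url_q.get()
            verify_q.put((url, requests.get(url, timeout=30)))
            time.sleep(1)  # politeness interval

    def check_stage():
        while True:
            url, resp = verify_q.get()
            if resp.ok:
                response_q.put((url, resp.text))
            else:
                url_q.put(url)  # bad response: send the URL back for a retry

    def parse_stage():
        while True:
            url, html = response_q.get()
            item_q.put({"url": url, "length": len(html)})  # stand-in for real parsing

    def publish_stage():
        while True:
            print(item_q.get())  # stand-in for writing to a database

    for stage in (fetch_stage, check_stage, parse_stage, publish_stage):
        threading.Thread(target=stage, daemon=True).start()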

MaxDPS wrote at 2021-12-03 02:48:09:

If you are scraping 1000 pages and your program crashes after the 800th page, you are going to be much happier if you have the data for the 800 pages saved somewhere vs having to start all over.

cushychicken wrote at 2021-12-03 12:31:37:

Got it. Yeah, I can give this approach a shot.

One of the things I've been doing with this scraping project is "health checking" job posting links. There's nothing more annoying than clicking an interesting looking link on a job site, only to find it's been filled. (This is one of the lousiest parts of Indeed and similar, IMO.) I wrote some pretty simple routines to do that checking.

Caching solves the problem of potentially missing data while the scraper is running, but it doesn't really alleviate the network strain of requesting pages to see that they are actually still posted, valid job links.
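
For what it's worth, a cheap way to health-check a link without pulling the whole page again is a HEAD request; a sketch (not the project's actual routine, and some servers reject HEAD, hence the GET fallback):

    import requests

    def link_is_alive(url: str) -> bool:
        """True if the posting URL still resolves to a successful response."""
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code == 405:  # server doesn't support HEAD
                resp = requests.get(url, timeout=10)
            return resp.ok
        except requests.RequestException:
            return False

This only catches links that stop resolving; a posting that still returns 200 but says "position filled" would need page-level checks.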

derac wrote at 2021-12-03 00:10:50:

Write the pages you are scraping to a cache. The simplest way would be to write them each to a folder. Check if the page you are going to scrape is cached before trying to request it from the server.

throwaway894345 wrote at 2021-12-03 00:31:12:

It just means to write to disk where you left off, like saving your progress in a video game.

dzolob wrote at 2021-12-03 03:15:46:

I just want to add that it's very important to look at the GET/POST and AJAX calls within the site. When properly understood, they can be worked in your favor, taking away a lot of the complexity of your scrapers.

gullywhumper wrote at 2021-12-03 17:04:18:

Agreed - frequently the necessary data is available as nicely formatted JSON. Sometimes you can also modify the query: rather than getting 10 results per request, you can get 100.
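
For example (a purely hypothetical endpoint and parameter names; the point is just that the page-size knob often lives in the query string):

    import requests

    # Hypothetical JSON search endpoint spotted in the browser's network tab.
    API_URL = "https://careers.example.com/api/search"

    resp = requests.get(API_URL, params={"query": "firmware", "limit": 100}, timeout=30)
    for job in resp.json().get("results", []):
        print(job.get("title"), job.get("location"))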

cushychicken wrote at 2021-12-02 18:58:51:

Author heeyah. Would love your feedback, either here or at @cushychicken on the tweet street.

cogburnd02 wrote at 2021-12-02 19:53:17:

There's a project called woob (

https://woob.tech/

) that implements Python scripts that are a little bit like scrapers, but only 'scrape' on demand, in response to requests from console & desktop programs.

How much of this article do you think would apply to something like that? e.g. something like 'wait a second (or even two!) between successive hits' might not be necessary (one could perhaps shorten it to 1/4 second) if one is only doing a few requests followed by long periods of no requests.

cushychicken wrote at 2021-12-02 22:04:33:

Interesting question. My first instinct is to say that woob seems closer in use case to a browser than a scraper, as it seems largely geared towards making rich websites more easily accessible. (If I'm reading the page right, anyway.) A scraper is basically just hitting a web page over and over again, as fast as you can manage.

The trick, IMO, is to keep your load on a server closer to browser level than scraper level. Make sense?

gringo_flamingo wrote at 2021-12-02 19:11:37:

I can truly relate to this article, especially where you mentioned trying to extract only the specific contents of the elements that you need, without bloating your software. To me, that seemed intuitive with the minimal experience I have in web scraping. However, I ended up fighting the frameworks. Me being stubborn, I did not try your approach and kept trying to be a perfectionist about it LMAO. Thank you for this read, glad I am not the only one who has been through this. Haha...

cushychicken wrote at 2021-12-02 19:26:15:

Yeah it's an easy thing to get into a perfectionist streak over.

Thinking about separation of concerns helped me a _lot_ in getting over the hump of perfectionism. Once I realized I was trying to make my software do too much, it was easier to see how it would be much less work to write as two separate programs bundled together. (Talking specifically about the extract/transform stages here.)

Upon reflection, this project has been just as much self-study of good software engineering practices as it has been learning how to scrape job sites XD

funnyflamigo wrote at 2021-12-02 19:06:30:

Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?

Let's say you're scraping product info from a large list of products. I'm assuming you mean if it's strange one-off type errors to handle those, and you'd stop altogether if too many fail? Otherwise you'd just be DOS'ing the site.

cushychicken wrote at 2021-12-02 19:33:15:

_Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?_

Sure! I can get a little more concrete about this project more easily than I can comment on your hypothetical about a large list of products, though, so forgive me in advance for pivoting on the scenario here.

I'm scraping job pages. Typically, one job posting == one link. I can go through that link for the job posting and extract data from given HTML elements using CSS selectors or XPath statements. However, sometimes the data I'm looking for isn't structured in a way I expect. The major area I notice variations in job ad data is location data. There are a zillion little variations in how you can structure the location of a job ad. City+country, city+state+country, comma separated, space separated, localized states, no states or provinces, all the permutations thereof.

I've written the extractor to expect a certain format of location data for a given job site - let's say "<city>, <country>", for example. If the scraper comes across an entry that happens to be "<city>, <state>, <country>", it's generally not smart enough to generalize its transform logic to deal with that. So, to handle it, I mark that particular job page link as needing human review, so it pops up as an ERROR in my logs, and as an entry in the database that has post_status == 5. After that, it gets inserted into the database, but not posted live onto the site.

That way, I can go in and manually fix the posting, approve it to go on the site (if it's relevant), and, ideally, tweak the scraper logic so that it handles transforms of that style of data formatting as well as the "<city>, <country>" format I originally expected.

Does that make sense?

I suspect I'm just writing logic to deal with malformed/irregular entries that humans make into job sites XD
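
A stripped-down sketch of that flag-for-review idea (post_status == 5 is from the description above; everything else is illustrative):

    NEEDS_REVIEW = 5  # post_status value for "needs human review"
    ACTIVE = 1        # hypothetical value for a cleanly parsed post

    def parse_location(raw: str) -> dict:
        """Expect '<city>, <country>'; anything else gets flagged for review."""
        parts = [p.strip() for p in raw.split(",")]
        if len(parts) == 2:
            return {"city": parts[0], "country": parts[1], "post_status": ACTIVE}
        return {"raw_location": raw, "post_status": NEEDS_REVIEW}

    row = parse_location("San Diego, CA, USA")
    if row["post_status"] == NEEDS_REVIEW:
        print("ERROR: unexpected location format, flagged for review:", row["raw_location"])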

marginalia_nu wrote at 2021-12-02 19:39:22:

I've had a lot of success just saving the data into gzipped tarballs, like a few thousand documents per tarball. That way I can replay the data and tweak the algorithms without causing traffic.
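
Something along these lines, assuming Python's tarfile module (the batching scheme is illustrative):

    import io
    import tarfile

    def archive_documents(docs: dict, archive_path: str) -> None:
        """Write a batch of {name: html} documents into one gzipped tarball."""
        with tarfile.open(archive_path, "w:gz") as tar:
            for name, html in docs.items():
                data = html.encode("utf-8")
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

    def replay_documents(archive_path: str):
        """Yield (name, html) pairs back out of the tarball without re-crawling."""
        with tarfile.open(archive_path, "r:gz") as tar:
            for member in tar:
                yield member.name, tar.extractfile(member).read().decode("utf-8")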

cushychicken wrote at 2021-12-02 19:44:45:

Is that still practical even if you're storing the page text?

The reason I don't do that is because I have a few functions that analyze the job descriptions for relevance, but don't store the post text. I mostly did that to save space - I'm just aggregating links to relevant roles, not hosting job posts.

I figured saving ~1000 job descriptions would take up a needlessly large chunk of space, but truth be told I never did the math to check.

Edit: I understand scrapy does something similar to what you're describing; I've considered using it as my scraper frontend but haven't gotten around to doing the work for it yet.

marginalia_nu wrote at 2021-12-02 19:52:32:

Yeah, sure. The text itself is usually at most a few hundred Kb, and HTML compresses extremely well. Like it's pretty slow to unpack and replay the documents, but it's still a lot faster than downloading them again.

MrMetlHed wrote at 2021-12-02 21:23:34:

And it's friendlier to the server you're getting the data from.

As a journalist, I have to scrape government sites now and then for datasets they won't hand over via FOIA requests ("It's on our site, that's the bare minimum to comply with the law so we're not going to give you the actual database we store this information in.") They're notoriously slow and often will block any type of systematic scraping. Better to get whatever you can and save it, then run your parsing and analysis on that instead of hoping you can get it from the website again.

muxator wrote at 2021-12-02 21:53:27:

First of all, thanks for marginalia.nu.

Have you considered storing compressed blobs in a sqlite file? Works fine for me, you can do indexed searches on your "stored" data, and can extract single pages if you want.

marginalia_nu wrote at 2021-12-02 22:05:42:

The main reason I'm doing it this way is because I'm saving this stuff to a mechanical drive, and I want consistent write performance and low memory overhead. Since it's essentially just an archive copy, I don't mind if it takes half an hour to chew through looking for some particular set of files. Since this is a format designed for tape drives, it causes very little random access. It's important that writes are relatively consistent since my crawler does this while it's crawling, and it can reach speeds of 50-100 documents per second, which would be extremely rough on any sort of database based on a single mechanical hard drive.

These archives are just an intermediate stage that's used if I need to reconstruct the index to tweak say keyword extraction or something, so random access performance isn't something that is particularly useful.

gardnr wrote at 2021-12-03 04:48:15:

Have you thought about pushing the links onto a queue and running multiple scrapers off that queue? You'd need to build in some politeness mechanism to make sure you're not hitting the same domain/ip address too often but it seems like a better option than a serial process.
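
A minimal per-domain politeness check for that kind of setup might look like this (a sketch; shared state would need a lock if workers run in separate threads or processes):

    import time
    from urllib.parse import urlparse

    MIN_DOMAIN_INTERVAL_S = 2.0
    _last_hit: dict = {}  # domain -> timestamp of the most recent request

    def wait_for_turn(url: str) -> None:
        """Sleep until at least MIN_DOMAIN_INTERVAL_S has passed for this domain."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - _last_hit.get(domain, 0.0)
        if elapsed < MIN_DOMAIN_INTERVAL_S:
            time.sleep(MIN_DOMAIN_INTERVAL_S - elapsed)
        _last_hit[domain] = time.monotonic()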

yakshaving_jgt wrote at 2021-12-02 23:17:28:

Why 5, exactly? This struck me as odd in the article. Perhaps I missed something. Are there other statuses? Why are statuses numeric?

marginalia_nu wrote at 2021-12-03 13:37:54:

Integer columns are significantly faster and smaller than strings in a SQL database. It adds up quickly if you have a sufficiently large database.

I use the following scheme:

   1 - exhausted
   0 - alive
  -1 - blocked (by my rules)
  -2 - redirected
  -3 - error

yakshaving_jgt wrote at 2021-12-03 14:20:39:

The author is scraping fewer than 1,000 records per day, or roughly 365,000 records per year.

On my own little SaaS project, the difference between querying an integer and a varchar like “active” is imperceptible, and that’s in a table with 7,000,000 rows.

It would take the author 19 years to run into the scale that I’m running at, where this optimisation is meaningless. And that’s assuming they don’t periodically clean their database of stale data, which they should.

So this looks like a premature optimisation to me, which is why it stood out as odd to me in the article.

marginalia_nu wrote at 2021-12-03 15:07:15:

I'd put it closer to the category of best practices than premature optimization. It's pretty much always a good idea. It's not that skipping it will break things, but the alternative is slower and uses more resources in a way that affects all queries, since larger datatypes elongate the records, and locality is tremendously important in all aspects of software performance.

yakshaving_jgt wrote at 2021-12-03 17:06:06:

I disagree. I think a _better_ "best practice" is to make the _meaning_ behind the code as clear as possible. In this case, the code/data is _less_ clear, and there is zero performance benefit.

marginalia_nu wrote at 2021-12-03 17:17:29:

There is absolutely a performance benefit to reducing your row sizes. It both reduces the amount of disk I/O and the amount of CPU cache misses and in many cases also increases the amount of data that can be kept in RAM.

You can map meaning onto the column in your code, as most languages have enums that are free in terms of performance. It does not make sense to burden the storage layer with this, as it lacks this feature.
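
In Python, for instance, that mapping can be as simple as an IntEnum over the scheme from the earlier comment:

    from enum import IntEnum

    class DocStatus(IntEnum):
        EXHAUSTED = 1
        ALIVE = 0
        BLOCKED = -1     # blocked by my rules
        REDIRECTED = -2
        ERROR = -3

    # Stored as a plain integer, read back with a meaningful name:
    status = DocStatus(-2)
    print(status.name)   # "REDIRECTED"
    print(int(status))   # -2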

yakshaving_jgt wrote at 2021-12-03 17:24:34:

The performance benefit is negligible at the scale the author of the article is operating at. You already alluded to this point being context-dependent earlier when you said:

> if you have a sufficiently large database

Roughly 360,000 rows per year is not sufficiently large. It's _tiny_.

cushychicken wrote at 2021-12-03 12:23:25:

It's arbitrary.

I have a field, post_status, in my backend database, that I use to categorize posts. Each category is a numeric code so SQL can filter it relatively quickly. I have statuses for active posts, dead posts, ignored links, links needing review, etc.

It's a way for me to sort through my scraper's results quickly.

yakshaving_jgt wrote at 2021-12-03 14:43:02:

I think you have a case of premature optimisation there, as I wrote in a recent comment[0].

[0]:

https://news.ycombinator.com/item?id=29430281

cushychicken wrote at 2021-12-03 15:25:17:

Not sure what's premature here. The optimization is to allow me, a human, to find a certain class of database records quickly. I also chose a method that I understand to be snappy on the SQL side as well.

What would you suggest as a non-optimized alternative? That might make your point about premature optimization clearer.

yakshaving_jgt wrote at 2021-12-03 16:59:55:

There is indeed a trade-off, and the direction I would have chosen is to use meaningful status names as opposed to magic numbers. My reasoning being that maintenance cost in terms of how self-explanatory the system is makes more sense to me economically than obscuring the meaning behind some of the code/data for a practically non-existent performance benefit.

After all, hardware is cheap, but developer time isn't.

For a more concrete example, I might have chosen the value `'pending'` (or similar) instead of `5`. Active listings might have status `'active'`. Expired ones might have status `'expired'`, _etc._

derac wrote at 2021-12-03 00:14:21:

It's arbitrary.

enigmatic02 wrote at 2021-12-02 22:18:37:

Ah the sanitization links were great, thanks!

Do you plan on handling sanitization of roles so people can search by that? I ended up using a LONG CASE WHEN statement to group roles into buckets, probably not ideal

Doing something similar to you, but focused on startups and jobs funded by Tier 1 investors:

https://topstartups.io/

cushychicken wrote at 2021-12-03 13:13:59:

_Do you plan on handling sanitization of roles so people can search by that? I ended up using a LONG case when statement to group roles into buckets, probably not ideal_

Probably not. I get the impression that job titles are one of the ways that recruiters and hiring managers "show their shine", so to speak, in terms of marketing open roles. Job titles can convey explicit and important information about the role's focus - things like "FPGA", or "New College Grad", or "DSP", for example. It's an opportunity for them to showcase what's special about the role in a succinct way. Sanitizing that away would reduce the amount of information given to all sides in the transaction. It also seems like a really broad task; there are way more niche specialties in this field than there are succinct buckets to place them in, job-title-wise.

I've found it more useful to tag roles based upon title and contents of the job description. It's a way to get the same info across without obscuring the employer's intent.

yolo3000 wrote at 2021-12-02 22:58:35:

Do you think there's a lot of job boards lately? Or have they always been there? It feels like I've seen a lot of them popping up this year.

ed25519FUUU wrote at 2021-12-02 23:30:25:

Scraping 1000 results a day is really not any kind of web scraping scale. There are barely any of the same considerations as for systems that scrape tens of millions a day.

You could easily store those kinds of results in a local DB for offline processing and resuming.

cushychicken wrote at 2021-12-03 00:00:29:

Even at my cutesy, boutique scale, there is plenty there to obsess about. XD