Sitemap Failures

We notice that scrapes are taking longer and longer. A pet theory is that sitemap fetches are timing out. We've changed the curl -m (max-time) limit from 10 to 5 seconds, and now we wonder what effect that will have.

We see a long-term increase in the scrape run time from 20 to 80 minutes. The sudden drops are not real improvements: they are the clock wrapping around at 80. See plots.

We plot a reasonably accurate count of active sites by counting the sites that return a correctly formatted sitemap. This count stays in the mid 800s for each scrape, without the volatility we see in the run times above.

We see timeouts in the log as JSON parse failures. Empty results produce a complaint that a JSON text must contain at least two octets. See logs.
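
To make the distinction between these failure modes concrete, here is a minimal Python sketch, not the scraper's own code. It assumes a sitemap is a JSON array and that a timed-out fetch leaves either an empty body or a truncated one; the function name classify_sitemap is hypothetical.

```python
import json


def classify_sitemap(body: str) -> str:
    """Return 'empty', 'malformed', or 'ok' for one fetched sitemap body."""
    if not body.strip():
        # Nothing was received, e.g. the fetch hit its time limit
        # before any data arrived.
        return "empty"
    try:
        pages = json.loads(body)
    except json.JSONDecodeError:
        # A transfer cut off midway leaves truncated JSON.
        return "malformed"
    # Assumption: a correctly formatted sitemap is a JSON array.
    return "ok" if isinstance(pages, list) else "malformed"
```

Under these assumptions, only bodies classified as "ok" would count toward the active-site total, while "empty" and "malformed" would both show up as fetch failures.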

We use this script to count how many sitemap fetch failures we had each run over the last week.

This count was taken on Sept 18, 2016, right before we shortened the timeout from 10 to 5 seconds.
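
The counting script itself is not reproduced on this page. As an illustration only, a per-run tally of this kind might look like the Python sketch below. The input shape (pairs of run id and log line), the FAILURE pattern, and the name failures_per_run are all assumptions, since the real log layout is not shown here.

```python
import re
from collections import Counter

# Hypothetical pattern: the page only says that failures appear as JSON
# parse errors and that empty results mention "two octets".
FAILURE = re.compile(r"parse|two octets", re.IGNORECASE)


def failures_per_run(lines):
    """Tally failure lines per run from (run_id, log_line) pairs."""
    counts = Counter()
    for run_id, line in lines:
        if FAILURE.search(line):
            counts[run_id] += 1
    return counts
```

Comparing such tallies for runs before and after the change from 10 to 5 seconds would show whether the shorter limit produces more failed fetches.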