Friendly bot or foe?

By lucastech
scrapers crawlers bots web hosting

Web crawlers have been around for decades, but a new breed of scraper is abusing web servers unlike any before.

Happy 4th of July! Or so I thought.

I manage servers hosting a handful of WordPress sites for a client of mine. A few months back, I began getting complaints that the sites were down. The servers were never down, but they were often under extreme load. I built this dashboard to help me keep an eye on the CPU usage of each server, and it's continued to pay dividends.

A wild CPU load appears - how it began

For a while, there were always a few crawlers that would pop up and hammer a site for a few hours if I let them. I got pretty comfortable with identifying them and blocking the offending IP. AWS hosts a mix of valid bots and crawlers (think Semrush, newsletter crawlers, search engines, etc.) alongside these abusive bots, which forge their user agent to look like a common browser and have virtually unlimited IPs to draw from, so I never wanted to block a whole subnet. Instead, it's largely been a game of whack-a-mole that starts over every day.
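Spotting the offending IP mostly comes down to counting requests per client in the access log. Here is a minimal sketch of that idea, assuming the log uses the common/combined format where each line starts with the client IP. The function name `top_talkers`, the threshold, and the sample lines are all made up for illustration; this is not the exact tooling behind the dashboard.

```python
import re
from collections import Counter

# Assumption: each access log line begins with the client IP,
# as in the common/combined log format.
IP_RE = re.compile(r"^(\S+) ")

def top_talkers(lines, threshold):
    """Return (ip, count) pairs for clients with at least `threshold` requests."""
    counts = Counter()
    for line in lines:
        match = IP_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    # most_common() yields the busiest clients first
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]

# Made-up sample lines using documentation-only IP ranges.
sample = [
    '203.0.113.5 - - [04/Jul/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [04/Jul/2025:10:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [04/Jul/2025:10:00:03 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [04/Jul/2025:10:00:04 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for ip, n in top_talkers(sample, threshold=3):
    print(ip, n)
```

A count like this only catches a single noisy address. It does nothing against a scraper that spreads its requests thinly across many IPs.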

Over the past few weeks, however, it's become clear that a new, much more aggressive and sophisticated scraping operation has been hitting the websites that I'm hosting.

Scrapers, left unchecked

Out of curiosity, I figured I'd let them scrape until they wore themselves out. After about 20 minutes, the same aggressive scraper began scraping another site hosted on a different server (see teal line). That site is much smaller and a bit more performant, so it handled the parallel requests much better.

Hourly requests, IPs and response times

After a bit more analysis of the logs behind the purple line, it was clear that there had been a dramatic increase in both requests and requesting IPs. Nearly 1,500 IP addresses were used to make 14,000 requests in just under an hour. While I'm still working on identifying exactly which IPs are to blame, it's easy to see that these abusive scrapers are getting much more sophisticated. One thing is clear: they do not care about the servers or legitimate users they are impacting with their careless scraping.
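For anyone wanting to run a similar hourly breakdown on their own logs, here is a rough sketch that tallies requests and distinct IPs per hour. It assumes combined-format log lines with a timestamp like `[04/Jul/2025:10:00:01 +0000]`. The `hourly_summary` helper and the sample data are hypothetical, and this is not necessarily how the chart above was produced.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: combined log format, i.e. `IP ident user [timestamp] ...`
LOG_RE = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+)\]")

def hourly_summary(lines):
    """Map each hour to a (request_count, unique_ip_count) tuple."""
    requests = defaultdict(int)
    ips = defaultdict(set)
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that don't look like access log entries
        ip, stamp = match.groups()
        ts = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hour = ts.strftime("%Y-%m-%d %H:00")
        requests[hour] += 1
        ips[hour].add(ip)
    return {h: (requests[h], len(ips[h])) for h in sorted(requests)}

# Made-up sample lines using documentation-only IP ranges.
sample_log = [
    '203.0.113.5 - - [04/Jul/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '198.51.100.7 - - [04/Jul/2025:10:15:30 +0000] "GET /a HTTP/1.1" 200 512',
    '203.0.113.5 - - [04/Jul/2025:10:59:59 +0000] "GET /b HTTP/1.1" 200 512',
    '192.0.2.9 - - [04/Jul/2025:11:00:00 +0000] "GET /c HTTP/1.1" 200 512',
]

for hour, (reqs, unique_ips) in hourly_summary(sample_log).items():
    print(hour, reqs, unique_ips)
```

A sudden jump in unique IPs per hour without a matching rise in real visitors is the kind of pattern that points to a distributed scraper rather than a single noisy crawler.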