The bots that keep on giving

The red line on the chart was initially what caught my eye, but before I could look into it the purple line for web 5 began to skyrocket again.

Over the weekend I built some additional tools to help me better understand what exactly was going on during these periods of high volume. I knew that there was a large increase in IPs hitting the site, which drove up requests and load times in tandem. However, the hourly view did not feel granular enough, so I built a 15-minute interval view.
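For anyone who wants to try something similar, here is a minimal sketch of that kind of bucketing. It is not my actual tooling; the `(timestamp, ip)` record shape and the function names are assumptions made for illustration, and the rounding only works for interval sizes that divide evenly into an hour.

```python
from collections import defaultdict
from datetime import datetime


def bucket_start(ts: datetime, minutes: int = 15) -> datetime:
    """Round a timestamp down to the start of its N-minute interval.

    Assumes `minutes` divides 60 evenly (5, 10, 15, 20, 30, 60).
    """
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


def summarize(records, minutes: int = 15) -> dict:
    """Count requests and unique IPs per interval.

    `records` is a hypothetical iterable of (timestamp, ip) pairs
    already parsed out of the access log.
    Returns {interval_start: (request_count, unique_ip_count)}.
    """
    requests = defaultdict(int)
    ips = defaultdict(set)
    for ts, ip in records:
        start = bucket_start(ts, minutes)
        requests[start] += 1
        ips[start].add(ip)
    return {start: (requests[start], len(ips[start]))
            for start in sorted(requests)}
```

Tracking unique IPs next to the request count matters here, because the spikes showed up in both numbers at once.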

When I loaded up today's logs, I noticed that this was actually the second attack of the day; I had slept through the first one. I wanted to compare the high-volume period to the period immediately before it, so I built a quick comparison zoom on the graph.

Individual IP addresses were not very helpful for understanding the requests, so I took it a layer deeper and aggregated by IP subnet. While this did highlight that there were a lot of IPs from the same subnets, it did not produce the insights I had hoped for. So I went a level deeper still and aggregated the subnets by the organizations which control them.
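A rough sketch of those two aggregation steps, using Python's standard `ipaddress` module, might look like the following. The prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions picked for illustration, and `org_for_subnet` is a hypothetical lookup table that you would have to fill from WHOIS or ASN data yourself.

```python
import ipaddress
from collections import Counter


def subnet_of(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Return the enclosing subnet of an address as a CIDR string.

    The default prefix lengths are illustrative assumptions.
    """
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False lets ip_network() mask off the host bits for us.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def requests_by_subnet(ips) -> Counter:
    """Count requests per subnet, given one IP string per request."""
    return Counter(subnet_of(ip) for ip in ips)


def requests_by_org(ips, org_for_subnet: dict) -> Counter:
    """Count requests per controlling organization.

    `org_for_subnet` is a hypothetical {cidr_string: org_name} mapping;
    subnets missing from it are grouped under "unknown".
    """
    counts = Counter()
    for ip in ips:
        counts[org_for_subnet.get(subnet_of(ip), "unknown")] += 1
    return counts
```

Rolling subnets up to organizations collapses many separate ranges into a single row per operator, which is what makes the surges readable.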

To start, let's look at a period of time that was not involved in any intense traffic spikes.

This makes it pretty easy to see that there are quite a few organizations regularly hitting this site. Amazon is pretty consistent, and so are Google, DigitalOcean and Microsoft. This isn't really a surprise given that most cloud services here in the US are hosted by these providers. The website that we're discussing today is a local publication for a city in the US. There really isn't any relevant information for any international audience. That makes looking at the traffic for legitimate users much easier. We can see that there were a few non-US spikes in traffic, but nothing too crazy, likely a few random bots scraping around.

Let's look at the first spike in this view, compared to the previous period.

It is pretty obvious that we have a handful of organizations suddenly surging in traffic to this site. I was surprised to see that most of the largest offenders were previously on the list, but their volume had increased significantly (1,000% or more). It's also pretty clear that this traffic surge, and the increased load times that came with it, caused our legitimate traffic (Google, Amazon) to suffer as well. Almost all of the worst offenders are international data centers, but there are a few inside the US as well. I suspect that some portion of this traffic is legitimate users who are leveraging VPNs hosted by these same data centers.

Looking at the spike later in the day, we see an almost identical list of organizations spiking dramatically (1,000% or more) in their requests, while normal traffic gets pushed down, presumably due to serialized requests and increased request times. I also noticed that the time of day they scrape seems to vary: on the 4th of July they started at 22:00 UTC, but today they started around 14:30.
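The period-over-period comparison behind those percentages can be sketched roughly like this. The input shape (a dict of request counts per organization for each period) and the function names are assumptions for illustration, and the 1,000% default simply mirrors the figure mentioned above.

```python
def percent_change(previous: dict, current: dict) -> dict:
    """Percent change in request count per organization.

    Both arguments are hypothetical {org_name: request_count} dicts.
    Organizations with no requests in the previous period map to None,
    since there is no baseline to divide by.
    """
    changes = {}
    for org in previous.keys() | current.keys():
        before = previous.get(org, 0)
        after = current.get(org, 0)
        changes[org] = None if before == 0 else (after - before) / before * 100
    return changes


def surging(previous: dict, current: dict, threshold: float = 1000.0) -> list:
    """Organizations whose traffic grew by at least `threshold` percent,
    largest increase first."""
    changes = percent_change(previous, current)
    hits = [(org, pct) for org, pct in changes.items()
            if pct is not None and pct >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Keep in mind that a 1,000% increase means eleven times the earlier volume, not ten, and that organizations which only appear during the spike need to be looked at separately because they have no baseline.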

I also tried to do some analysis on the user agents, but the bots all rotate through forged user agents, even from the same IP. This just highlights the malicious nature of these bots, purposefully trying to evade detection while abusing sites at their whim.
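One simple way to surface that rotation is to count the distinct user agents seen per IP. This is only a sketch: the `(ip, user_agent)` record shape and the cutoff of five agents are assumptions, and a shared NAT or VPN exit can legitimately show several agents from one address.

```python
from collections import defaultdict


def user_agents_per_ip(records) -> dict:
    """Count distinct user-agent strings seen from each IP.

    `records` is a hypothetical iterable of (ip, user_agent) pairs.
    """
    seen = defaultdict(set)
    for ip, user_agent in records:
        seen[ip].add(user_agent)
    return {ip: len(agents) for ip, agents in seen.items()}


def rotating_ips(records, min_agents: int = 5) -> list:
    """IPs that presented at least `min_agents` distinct user agents.

    The default cutoff is an arbitrary illustrative value.
    """
    counts = user_agents_per_ip(records)
    return sorted(ip for ip, n in counts.items() if n >= min_agents)
```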

I think next I'll spend some time looking at what exactly they're requesting. Now that I know they're scraping the same site repeatedly every day, it'll be interesting to see if there are any patterns that emerge there.