I went ahead and replaced the IP (Internet Protocol) addresses in the log file with ASNs (Autonomous System Numbers) to find the network that sent the most requests to my blog during the month of February.
Table: Top 10 networks requesting a page from blog

network                                             requests
------------------------------------------------------------
MICROSOFT-CORP-MSN-AS-BLOCK, US                        78889
OVH, FR                                                31837
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN     25019
HETZNER-AS, DE                                         23840
GOOGLE-CLOUD-PLATFORM, US                              21431
CSTL, US                                               17225
HURRICANE, US                                          15495
AMAZON-AES, US                                         14430
FACEBOOK, US                                           13736
AKAMAI-LINODE-AP Akamai Connected Cloud, SG            12673
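A tally like the one above could be produced along these lines. This is only a minimal sketch, not the script actually used for this post: it assumes the client IP is the first whitespace-separated field of each log line, and the prefix table, network names and sample lines are all made up for illustration (a real lookup table would come from a BGP or ASN dataset).

```python
import ipaddress
from collections import Counter

# Hypothetical prefix-to-network table (documentation address ranges only).
PREFIXES = {
    "203.0.113.0/24": "EXAMPLE-NET-A, US",
    "198.51.100.0/24": "EXAMPLE-NET-B, FR",
}
NETWORKS = [(ipaddress.ip_network(p), name) for p, name in PREFIXES.items()]


def network_for(ip):
    """Return the name of the most specific matching prefix, or 'UNKNOWN'."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return "UNKNOWN"  # first field was not an IP address
    matches = [
        (net.prefixlen, name)
        for net, name in NETWORKS
        if net.version == addr.version and addr in net
    ]
    return max(matches)[1] if matches else "UNKNOWN"


def tally(lines):
    """Count requests per network, most requests first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[network_for(fields[0])] += 1
    return counts.most_common()


# Made-up log lines, just to exercise the functions.
LINE = '{} - - [01/Feb/2000:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "example-agent"'
SAMPLE = [
    LINE.format(ip)
    for ip in ("203.0.113.7", "203.0.113.8", "203.0.113.7",
               "198.51.100.1", "198.51.100.2", "192.0.2.1")
]

for name, n in tally(SAMPLE):
    print(f"{n:>6}  {name}")
```

The linear scan over prefixes is fine for a toy table; with a full routing table, a radix tree or an existing ASN database library would be the more sensible choice.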
Even though Alibaba US has the most unique IPs hitting my blog [1], Microsoft is still the network making the most requests. So let's see how Microsoft presents itself to my web server. Here are the user agents it sends:
Table: Web agents from the Microsoft Network

agent  requests
------------------------------
Go-http-client/2.0  43236
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)  23978
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36  7953
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0  2955
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot  210
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot  161
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)  123
'DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot)'  122
Python/3.9 aiohttp/3.10.6  28
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.36 Safari/537.36  14
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.114 Safari/537.36  14
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.68  10
DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html)  10
DuckAssistBot/1.1; (+http://duckduckgo.com/duckassistbot.html)  10
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36  6
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.143 Safari/537.36  6
python-requests/2.32.3  5
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.142 Safari/537.36  5
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36  4
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0  4
DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot)  4
Twingly Recon  3
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)  3
Mozilla/5.0 (compatible; Twingly Recon; twingly.com)  3
python-requests/2.28.2  2
newspaper/0.9.1  2
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36  2
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b  2
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36  2
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/) Bot  1
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/)  1
Mozilla/5.0 (Windows NT 6.1; WOW64) SkypeUriPreview Preview/0.5 skype-url-preview@microsoft.com  1
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36  1
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36  1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48  1
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/) Bot  1
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/)  1
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/) Bot  1
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/)  1
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/) Bot  1
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/)  1
The top result comes from a single IP address and probably deserves a separate post of its own [2], since it's weird and annoying. As for the rest (you got Bing, you got OpenAI, you got several Mastodon instances), it seems like most of these come from Microsoft's cloud offering. A mixture of things.
What about Facebook?
Table: Web agents from Facebook

agent  requests
------------------------------
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)  13497
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)  207
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36  12
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36  4
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36  4
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36  4
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/59.0  4
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0  2
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36  2
Hmm … looks like I have a few readers at Facebook, but other than that, nothing terribly interesting.
Alibaba, on the other hand, is frightening. Out of 25,019 requests, it presented 581 different user agents. From looking at what was requested, I don't think it's 500 Chinese people reading my blog; it's definitely bots crawling my site (and amusingly, there are requests for the /robots.txt file, but without a consistent user agent to go by, it's hard to block them via that file).
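For what it's worth, counting the distinct user agents from one network can be sketched as below. The assumptions here are mine: the log is in the common "combined" format, where the user agent is the last double-quoted field, and the lines have already been filtered down to a single network. The sample lines are invented, and a user agent that itself contains a double quote would trip up this simple pattern.

```python
import re
from collections import Counter

# Matches the last double-quoted field on a line (the user agent in combined format).
UA_RE = re.compile(r'"([^"]*)"\s*$')


def agent_counts(lines):
    """Count requests per user agent string."""
    counts = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up log lines from one hypothetical network.
SAMPLE = [
    '203.0.113.7 - - [01/Feb/2000:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "agent-a/1.0"',
    '203.0.113.8 - - [01/Feb/2000:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "agent-b/2.0"',
    '203.0.113.9 - - [01/Feb/2000:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "agent-a/1.0"',
]

counts = agent_counts(SAMPLE)
print(f"{sum(counts.values())} requests, {len(counts)} distinct user agents")
```

A high ratio of distinct agents to requests is the kind of signal described above: there is no single token to name in robots.txt.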
I can draw one conclusion here: filtering by ASN can help tremendously, but it comes with the risk of blocking legitimate traffic.
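To put a rough number on that trade-off, here is a small worked example using the Facebook table from earlier in this post. The counts are copied from that table; the split into "crawler" and "browser-like" agents is my own heuristic (anything not starting with one of Facebook's two crawler names is treated as browser-like), and a browser-like user agent is no proof of a human reader.

```python
# Request counts per user agent for the Facebook network, copied from the table above.
WIN10 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
FACEBOOK_AGENTS = {
    "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)": 13497,
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)": 207,
    WIN10 + "Chrome/132.0.0.0 Safari/537.36": 12,
    WIN10 + "Chrome/131.0.0.0 Safari/537.36": 4,
    WIN10 + "Chrome/130.0.0.0 Safari/537.36": 4,
    WIN10 + "Chrome/125.0.0.0 Safari/537.36": 4,
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/59.0": 4,
    WIN10 + "Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0": 2,
    WIN10 + "Chrome/126.0.0.0 Safari/537.36": 2,
}

# Heuristic (an assumption): these prefixes mark self-identified crawlers.
CRAWLER_PREFIXES = ("meta-externalagent/", "facebookexternalhit/")


def collateral(agent_counts):
    """Return (total requests, requests from agents that do not look like crawlers)."""
    total = sum(agent_counts.values())
    browser_like = sum(
        n for agent, n in agent_counts.items()
        if not agent.startswith(CRAWLER_PREFIXES)
    )
    return total, browser_like


total, browser_like = collateral(FACEBOOK_AGENTS)
print(f"{browser_like} of {total} requests ({browser_like / total:.2%}) look like browsers")
# prints: 32 of 13736 requests (0.23%) look like browsers
```

So blocking that whole ASN would drop 13,736 requests, of which 32 might have been actual readers. Whether that is an acceptable loss is a judgment call.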