I just finished eating a nice risotto. Hmmmm. So nice!
But now I’m angry again. Angry about programmers programming bots for the CO₂ god. But let’s start at the beginning. I was posting a toot on Mastodon:
Greta Götz writing about the upcoming year, recommending podcasts, philosophers, also mentioning @neauoire and @rek, Emacs, teaching, and lots of other stuff. – Happy 2022: Points for systems resonance
A second later I was confused: ö or oe? I mean, I usually write Schroeder but it’s actually Schröder, but … I wanted to check, and went back to the link. But now I got an error:
“The website is temporarily unable to service your request as it exceeded resource limit. Please try again later.”
Of course it is. Because Mastodon, Pleroma, and all those other Fediverse servers each fetch their own copy of the page to generate a link preview. There is in fact an interesting discussion on issue 4486. Personally, I happen to agree with what ddevault writes:
If it were up to me, I would probably not fetch previews at all until the user clicks the post and asks for them. … If you can’t find a solution, remove the feature. Don’t ship things which freaking DDoS the internet. … I just had a brief outage caused by 915 requests from mastodon and pleroma servers on an expensive endpoint over the course of 2 minutes. – Mastodon can be used as a DDOS tool
But it is what it is.
When I look at my followers on my profile, I get 35 pages of 40 accounts each, so a bit under 1400 followers. I am trying to guess how many distinct servers that would be. I looked at the last three pages, counted the unique server names, and got 82 out of 120. Assume that roughly ½ of all the servers are unique. That means any site I link to will get at least 700 hits in a minute.
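If I wanted to count instead of guess, something like the following would do. This is a sketch: followers.txt is a hypothetical file with one @user@instance address per line.

    # count distinct instances among the followers:
    # strip everything up to the last @ to get the instance name
    sed 's/.*@//' followers.txt | sort -u | wc -l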
And Mastodon used to send two requests per server! Perhaps Pleroma still does, or so somebody said in that issue discussion a long time ago.
700 hits in a minute is plenty to bring down an old school website without caching, without DDoS protection, with a CGI script starting up Perl to render a page. Think about it. 700 Perl processes starting on your server. Each of them taking 50 MiB or more, that’s over 34 GiB of RAM…
And why not? If your site gets very few visitors it makes sense to just start it up when required. And sometimes you just want a little application to handle something for you. Or sometimes you’re hosting cgit or something bigger, who knows. Not everything is static files.
Anyway, now I’m blocking them all based on their user agent.
    # fediverse instances asking for previews
    RewriteCond "%{HTTP_USER_AGENT}" "Mastodon|Friendica|Pleroma" [nocase]
    RewriteRule ^(.*)$ - [forbidden,last]
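To check that the rule works, something like this should do. Here, example.org is a stand-in for your own site, and real instances send a longer user agent string, something like “http.rb/… (Mastodon/…; +https://instance/)”, but anything containing “Mastodon” matches the rule:

    # a fediverse-style user agent should now get a 403
    curl -s -o /dev/null -w "%{http_code}\n" -A "Mastodon/3.4.4" https://example.org/
    # a regular request should still get a 200
    curl -s -o /dev/null -w "%{http_code}\n" https://example.org/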
I was looking at the logs for my own site. Perhaps I had posted a link in the last 24h?
    root@sibirocobombus:~# egrep 'Mastodon|Pleroma|Friendica' /var/log/apache2/access.log.1 | /home/alex/bin/time-grouping
    23/Jan/2022:23      97   5%  403 (98%), 200 (1%)
    23/Jan/2022:22     227  13%  403 (99%), 200 (0%)
    23/Jan/2022:21      31   1%  403 (100%)
    23/Jan/2022:20      38   2%  403 (100%)
    23/Jan/2022:19      46   2%  403 (97%), 200 (2%)
    23/Jan/2022:18      91   5%  403 (100%)
    23/Jan/2022:17      34   1%  403 (100%)
    23/Jan/2022:16     506  29%  403 (99%), 503 (0%)
    23/Jan/2022:15      17   0%  403 (100%)
    23/Jan/2022:14       3   0%  403 (66%), 200 (33%)
    23/Jan/2022:13      18   1%  403 (100%)
    23/Jan/2022:12      48   2%  403 (100%)
    23/Jan/2022:11      33   1%  403 (100%)
    23/Jan/2022:10     107   6%  403 (99%), 200 (0%)
    23/Jan/2022:09     172  10%  403 (100%)
    23/Jan/2022:08     113   6%  403 (98%), 200 (1%)
    23/Jan/2022:07      44   2%  403 (95%), 200 (4%)
    23/Jan/2022:06       7   0%  403 (100%)
    23/Jan/2022:05       7   0%  403 (100%)
    23/Jan/2022:04      15   0%  403 (93%), 200 (6%)
    23/Jan/2022:03       8   0%  403 (75%), 200 (25%)
    23/Jan/2022:02      16   0%  403 (100%)
    23/Jan/2022:01      12   0%  403 (100%)
    23/Jan/2022:00      17   0%  403 (64%), 200 (29%), 503 (5%)
Maybe I did! And some of them still got a 503 response.
Here is the data in 10min buckets:
    23/Jan/2022:16:50     6   0%  403 (100%)
    23/Jan/2022:16:40    18   1%  403 (100%)
    23/Jan/2022:16:30    48   2%  403 (100%)
    23/Jan/2022:16:20   428  25%  403 (100%)
    23/Jan/2022:16:10     1   0%  403 (100%)
    23/Jan/2022:16:00     5   0%  503 (60%), 403 (40%)
Thanks, fediverse.
And why am I on my Butlerian Jihad again? Because these previews aren’t fetched for humans! If only 400 humans had followed that link. No, it’s bots, serving their programmers. The programmers program badly, so the servers collect previews just to be safe, just so that it will be fast when a local user finally, maybe, scrolls past it.
This makes me very unhappy.
In any case, the name I was looking for is Greta Goetz. Recommended reading.
#Mastodon #Pleroma #Fediverse #Butlerian Jihad
(Please contact me if you want to remove your comment.)
⁂
You could try publicly shaming the authors of such bots [1] and piss them off enough to get them to stop crawling your site [2]. But that’s a lot of bots to deal with.
[1] http://boston.conman.org/2019/07/09.1
[2] http://boston.conman.org/2019/07/12.1
– Sean Conner 2022-01-31 02:24 UTC
---
Well, in this case we are not really talking about crawlers: every fediverse server that receives a message containing a link fetches data from that link in order to have a nice preview ready in case a human looks at it. If you look at the issue, you’ll see that Eugen, the creator of Mastodon, thinks that the current system is the correct kind of compromise.
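To illustrate the idea (this is just a sketch, not Mastodon’s actual code): each receiving server downloads the page and pulls out the OpenGraph metadata that the preview card is built from.

    # roughly what every receiving instance does for a link it has never seen:
    # fetch the page and extract the OpenGraph tags used for the preview card
    curl -sL https://example.org/some-post | grep -io '<meta property="og:[^>]*>'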
– Alex 2022-01-31 14:08 UTC
---
@jwz writes:
Every time I do a new blog post, within a second I have over a thousand simultaneous hits of that URL on my web server from unique IPs. Load goes over 100, and mariadb stops responding. The server is basically unusable for 30 to 60 seconds until the stampede of Mastodons slows down. Presumably each of those IPs is an instance, none of which share any caching infrastructure with each other, and this problem is going to scale with my number of followers (followers’ instances). This system is not a good system. – Mastodon stampede
@crschmidt writes:
Fun fact: sharing this link on Mastodon caused my server to serve 112,772,802 bytes of data, in 430 requests, over the 60 seconds after I posted it (>7 r/s). Not because humans wanted them, but because of the LinkFetchWorker, which kicks off 1-60 seconds after Mastodon indexes a post (and possibly before it’s ever seen by a human). Every Mastodon instance fetches and stores their own local copy of my 750kb preview image. – Thread
@syskill summarized the situation:
Everyone who replied with “use a CDN,” is really saying, “I expect all web sites to be run by skilled and dedicated professionals, who deploy future-proofed technology stacks, so that my social network can be run by amateur hobbyists, and developed by those who fear what the future might bring.”
I actually learned about a new issue:
Even with the random delay, there’s still a significant amount of load that can come from requests for page metadata for posts from users with lots of followers from different instances. Misleading metadata is a non-issue as it’s already possible to create a redirect page with misleading information on it for Mastodon to consume. – Fetch link metadata on sender instance rather than receiver instances #12738
– Alex 2022-11-28 10:56 UTC
---
Anyway, I think it’s important to point out that the people complaining about this often notice it because they post a link to their own site to the fediverse and bring their own site down; but as my original example shows, me posting a link to Greta’s site brings her site down, too. Thus, everybody who says we should just add caching is basically saying: *every* dynamic website must add caching. Which is a valid position: let the web be for the professionals, it’s too dangerous out there. I’d argue that the reverse should be true, however: designing our programs such that amateurs can host dynamic sites on the web should be something to strive for, and disregarding this goal should invite criticism. Support the future you want to see.
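For the record, this is roughly what “just add caching” amounts to on a Debian-style Apache setup. A sketch only, and it still leaves the amateur to think about cache invalidation and which responses are safe to cache:

    # enable mod_cache with a disk cache for the whole site
    a2enmod cache cache_disk
    cat > /etc/apache2/conf-available/preview-cache.conf <<'EOF'
    CacheQuickHandler on
    CacheEnable disk /
    CacheIgnoreNoLastMod On
    CacheDefaultExpire 60
    EOF
    a2enconf preview-cache
    systemctl reload apache2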
– Alex 2022-11-29 06:37 UTC