Thinking about web mentions because of a post by @hbuchel@hachyderm.io I just read. I remember having added them to my wiki so that an incoming web mention would basically just post a comment. Then I started seeing spam, so I started checking whether the source actually linked to the page it claimed to link to. Then I realized that practically nobody was using them. I had one person posting articles on their blog and using web mentions to ping my site. I guess they used a WordPress plugin. Sadly, that plugin pinged my site no matter how insignificant the link. A few times the web mention didn't really add value and eventually I disabled it. 😿
What seemed more useful, eventually, was using referrals. Browsers still send them along even though it's a privacy issue, and I reused the ideas I had already developed: check that the URL is readable and that it actually contains a link to my site, plus a blocklist that removes search engines, some pattern matching to try and get canonical URLs (with and without www subdomain, with and without https scheme, with and without certain path info and the like – a lot of Blogspot-specific trickery), and on and on… so it was still a lot of work. Ugh!
But the *signal* was more interesting, I felt: a referral usually meant an actual person followed a link because it seemed interesting. I discovered Reddit threads and blogs linking to my pages. There was joy on a regular basis.
The only stain on the endeavour was that I still feel like this is a privacy violation. The signal I am using shouldn’t even be there. Plus the canonicalisation of URLs was annoying.
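A minimal sketch of those referral checks in shell, not the code that ran on the wiki: `canonicalize`, `drop_search_engines` and `links_back` are hypothetical names, the blocklist is a made-up sample, and none of the Blogspot-specific rules are attempted.

```shell
# Reduce referrer URLs (one per line on stdin) to a rough canonical form:
# drop the scheme, a leading "www." and any trailing slashes.
canonicalize() {
  sed -E -e 's#^https?://##' -e 's#^www\.##' -e 's#/+$##'
}

# Drop canonicalized referrers whose host looks like a search engine.
# The list of names is illustrative only.
drop_search_engines() {
  grep -Ev '^([a-z0-9-]+\.)*(google|bing|duckduckgo|baidu)\.[a-z.]+(/|$)'
}

# Succeed if the page at URL $1 can be fetched and mentions the string $2
# (for example the site's host name). Needs curl and network access.
links_back() {
  curl -fsL --max-time 10 "$1" | grep -qF "$2"
}
```

For example, `printf '%s\n' https://www.google.com/ https://planet.emacslife.com/ | canonicalize | drop_search_engines` leaves only `planet.emacslife.com`. Even this small version shows why the real thing was annoying: every site with its own URL quirks needs another rule.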
I don't think I will be adding either of the two to this site.
The last time I thought about it:
If you want to look into your own referrals, I have a Perl script that may help. Here's one way to use it:
    # grep ^alexschr /var/log/apache2/access.log | bin/referrers | grep -v alexschroeder | head
    18459 -
    149 https://planet.emacslife.com/
    99 https://www.emacswiki.org/alex?action=journal;full=1;search=tag:Podcast
    46 http://feeds.feedburner.com/rsp-blogs
    22 https://www.google.com/
    9 https://www.reddit.com/
    8 https://178.209.50.237:443/users/sign_in
    8 http://178.209.50.237:80/users/sign_in
    6 http://www.shuct.net/?q=PB%E5%8F%8D%E7%BC%96%E8%AF%91
    5 http://baidu.com/
And now you're ready to investigate. I see search engines. I see Reddit isn't telling me about the actual URL any more in order to protect the privacy of their users. I see attempts at intrusion. I see Chinese stuff. And I see two feed aggregators, Planet Emacs Life and RSP Blogs. And a weird one: a very old URL from back in the days when my blog was still hosted on Emacs Wiki. Now I can go check those:
    # grep "action=journal;full=1;search=tag:Podcast" /var/log/apache2/access.log | head -n 3
    www.emacswiki.org:80 34.141.219.254 - - [20/Sep/2023:00:03:06 +0200] "GET /alex?action=journal;full=1;search=tag:Podcast HTTP/1.1" 301 594 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    www.emacswiki.org:443 34.141.219.254 - - [20/Sep/2023:00:03:06 +0200] "GET /alex?action=journal;full=1;search=tag:Podcast HTTP/1.1" 301 5305 "http://www.emacswiki.org/alex?action=journal;full=1;search=tag:Podcast" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    alexschroeder.ch:443 34.141.219.254 - - [20/Sep/2023:00:03:07 +0200] "GET /wiki?action=journal;full=1;search=tag:Podcast HTTP/1.1" 410 5250 "https://www.emacswiki.org/alex?action=journal;full=1;search=tag:Podcast" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
It's my friend 😝 Google Bot! The old URL is redirected to HTTPS, and then it's redirected to the correct path, and then it's rejected with 410 GONE. But of course Google Bot doesn't stop requesting it.
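For anybody without the Perl script, the two steps above can be approximated in plain shell. This is a rough sketch, not a port of the script: the function names `count_referrers` and `status_trail` are made up, and it assumes the log format seen in the output above, where the referrer is the second double-quoted field on each line.

```shell
# Read access log lines on stdin and print "count referrer",
# most frequent referrer first. Splitting on double quotes makes the
# referrer field number 4. Lines with escaped quotes inside the request
# would confuse this; good enough for a first look.
count_referrers() {
  awk -F'"' 'NF >= 6 { print $4 }' | sort | uniq -c | sort -rn
}

# Read access log lines on stdin and print "vhost status referrer"
# per request, which makes a chain of redirects easy to follow.
status_trail() {
  awk -F'"' 'NF >= 6 {
    split($1, pre, " ")    # pre[1] is the virtual host and port
    split($3, post, " ")   # post[1] is the HTTP status code
    print pre[1], post[1], $4
  }'
}
```

Used like the commands above: `grep ^alexschr /var/log/apache2/access.log | count_referrers | head` for the overview, and `grep "action=journal;full=1;search=tag:Podcast" /var/log/apache2/access.log | status_trail` to see the 301, 301, 410 sequence with each request's referrer next to it.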
#Web #Administration #Bots #Butlerian Jihad