2023-10-04 Search engines, the deal is off!

No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. – Block the Bots that Feed “AI” Models by Scraping Your Website

I use something like the following for my web server to try to deny all bots, spiders and crawlers access to my site. This is the second level of defense – defense in depth is good. Level 1 is robots.txt; level 2 is user agent filtering; level 3 is fail2ban monitoring the access log and banning anybody who requests pages faster than I think people can read.
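
Level 1, for the record, is just a plain robots.txt at the web root. A minimal sketch that asks every crawler to stay away entirely looks something like this – well-behaved bots honor it, and the bots that don't are what levels 2 and 3 are for:

# robots.txt – ask all crawlers to stay away
User-agent: *
Disallow: /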

Anyway: the user agent filter. All three conditions in the rules below need to be true. The first condition makes sure that only the sites listed are affected (because some sites are exempt…). The second condition makes an exception for ArchiveBot and Gwene. The third condition matches all self-identified bots, crawlers and spiders. The actual rule tells them that the page is gone and should be deleted (status 410), and it also adds a Location header, just in case a human is curious, which leads them to nobots.

The reason is this: for a while, it seemed that we all benefited from search engines – authors and readers both. These days, you'll find that search results are full of garbage sites: big sites with the most flatulent of pages explaining in great detail why the thing you're looking for is important and how to do it, clearly optimized for an ad company and not for a reader. Big sites with a gazillion answers are preferred over small and individual sites. Perhaps that's easier. Perhaps it allows them to diffuse responsibility for the garbage, I don't know. The effect, in any case, is that search engines no longer benefit small site authors, either. I was unable to find my own pages on the search engines. If you are a small site owner and you think you can find your own pages on Google and Bing, I suspect that's because they track you. Try it on a different computer, anonymously. Perhaps you won't find yourself, either.

In any case, if I can't get anything in return, both as a reader and as an author, I feel that the deal is off. Why let them feed on my words for free? Nay, at a cost, since they are keeping my website busy, producing CO₂ and heating the planet for no benefit at all.

Better to block them all.

# Only the sites listed here are affected (some sites are exempt)
RewriteCond "%{HTTP_HOST}" "^(alexschroeder\.ch|…)$" [nocase]
# Exception: let ArchiveBot and Gwene through
RewriteCond "%{HTTP_USER_AGENT}" "!archivebot|^gwene" [nocase]
# Match all self-identified bots, crawlers and spiders
RewriteCond "%{HTTP_USER_AGENT}" "bot|crawler|spider" [nocase]
# Tell them the page is gone (410); the URL leads curious humans to nobots
RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]
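
To check that the rules work, you can pretend to be a bot using curl; a request like the following should come back with 410 Gone. The user agent string here is made up – anything containing "bot", "crawler" or "spider" will do:

curl -sI -A "ExampleBot/1.0" https://alexschroeder.ch/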
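
Level 3 is fail2ban watching the access log. A minimal sketch of what that can look like follows; the filter name, log path and thresholds are illustrative rather than my exact setup, and they assume the common Apache combined log format:

# /etc/fail2ban/filter.d/too-fast.conf – match every request in the access log
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD)
ignoreregex =

# /etc/fail2ban/jail.d/too-fast.local – ban anybody making 30 requests
# within 20 seconds for ten minutes
[too-fast]
enabled  = true
port     = http,https
filter   = too-fast
logpath  = /var/log/apache2/access.log
findtime = 20
maxretry = 30
bantime  = 600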

#Web #Butlerian Jihad #Search

Clew maintains an independent index and is aiming to be a copyleft (AGPLv3), self-hostable, privacy-respecting, customizable search engine which prioritizes independent creators/bloggers/writers and penalizes sites with ads and trackers. – Clew

See also the comparison table on Wikipedia:

🔴 Google = AI / Ads / Tracking

🔴 Bing = AI / Ads / Tracking

🔴 Brave = Cryptocurrency / AI / Anti-LGBT*

🟠 DuckDuckGo = Ads / AI

🟠 Kagi = AI

🟠 Mojeek = Ads / AI

🟠 You.com = Ads / AI

🟡 Qwant = Ads

🟡 Startpage = Ads

🟡 Searx = Complicated

🟢 ???

Between the absolute blasé attitude towards privacy, the 100% dedication to AI being the future of search, and the completely misguided use of the company's limited funds, I honestly can't see Kagi as something I could ever recommend to people. Is the search good? I mean… it's not really much better than any other search; it heavily leverages Bing like DDG and the other indie search platforms do. The only real killer feature it has, to me, is the ability to block domains from your results, which I can currently only do in other search engines via a user script that doesn't help me on mobile. – Why I Lost Faith in Kagi

@llimllib@hachyderm.io and @jnv@fosstodon.org reminded me of niche search engines:

🟢 Marginalia https://search.marginalia.nu/

🟢 Wiby http://wiby.me/

🟢 Clew https://clew.se/

I fear that "niche" is going to be the new gold standard. The dark net is the future for our coming decades, I suspect.

@albertcardona@mathstodon.xyz adds "specialist search engines that only search within their own data":

🟢 Wikipedia https://wikipedia.org

🟢 Web Archive Wayback Machine https://web.archive.org/ (which is essentially the whole internet)

Specifically for academic work:

🟢 OpenAlex https://openalex.org

🟢 Scholar Archive https://scholar.archive.org

🟢 BASE https://www.base-search.net (Bielefeld Academic Search Engine)

🟢 Semantic Scholar https://www.semanticscholar.org

Nice!

It matches my own experience with Lieu for niche search.

I also discovered a whole thread of blog posts on the topic:

Should I remove this blog from Google Search?

Fighting bots
