Archived view for seirdy.one › posts › 2021 › 03 › 10 › search-engines-with-own-indexes › index.gm, captured on 2023-09-28 at 15:49:24. Gemini links have been rewritten to link to archived content.


A look at search engines with their own indexes

Originally posted 2021-03-10. Last updated 2023-09-02.

This is a cursory review of all the indexing search engines I have been able to find. Gemini engines are at the bottom; the rest of this post is about Web search engines.

The three dominant English search engines with their own indexes¹ are Google, Bing, and Yandex (GBY). Many alternatives to GBY exist, but almost none of them have their own results; instead, they just source their results from GBY.

With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they’d like more information.

This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I’d especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.

I plan on updating the engines in the top two categories with more info comparing the structured/linked data the engines leverage (RDFa vocabularies, microdata, microformats, JSON-LD, etc.) to help authors determine which formats to use.

About the list

I discuss my motivation for making this page in the "Rationale" section.

I primarily evaluated English-language search engines because English is my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn't able to find many Spanish-language engines powered by their own crawlers.

I mention details like "allows site submissions" and structured data support where I can only to inform authors about their options, not as points in engines' favor.

See the "Methodology" section at the bottom to learn how I evaluated each one.

General indexing search-engines

Large indexes, good results

These are large engines that pass all my standard tests and more.

1. Google: the biggest index. Allows submitting pages and sitemaps for crawling, and even supports WebSub to automate the process. Powers a few other engines:

Programmable Search Engine

2. Bing: the runner-up. Allows submitting pages and sitemaps for crawling without login using the IndexNow API. Its index powers many other engines:

3. Yandex: originally a Russian search engine, it now has an English version. Some Russian results bleed into its English site. Like Bing, it allows submitting pages and sitemaps for crawling using the IndexNow API. Powers:

4. Mojeek: Seems privacy-oriented with a large index containing billions of pages. Quality isn’t at Google/Bing/Yandex’s level, but it’s not bad either. If I had to use Mojeek as my default general search engine, I’d live. Partially powers eTools.ch. At this moment, I think that Mojeek is the best alternative to GBY for general web search.
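For authors curious about what an IndexNow submission involves, here is a minimal sketch. As I understand the published protocol, a site sends a JSON POST listing its changed URLs along with a key that it also hosts as a text file. The host, key, and URL below are placeholders, and the helper name is my own; the snippet only builds the request and does not send it.

```python
import json
from urllib.request import Request


def build_indexnow_request(endpoint, host, key, urls):
    """Build (but do not send) an IndexNow-style POST request.

    All argument values used in the demo below are placeholders.
    """
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file is served from the site root as <key>.txt
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


req = build_indexnow_request(
    "https://api.indexnow.org/indexnow",
    host="example.com",
    key="0123456789abcdef",
    urls=["https://example.com/new-post/"],
)
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would perform the actual submission; check the engine's own documentation for its endpoint and rate limits before doing so.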

Google, Bing, and Yandex support structured data such as microformats1, microdata, RDFa, Open Graph markup, and JSON-LD. Yandex's support for microformats1 is limited; for instance, it can parse h-card metadata for organizations but not people. Open Graph and Schema.org are the only supported vocabularies I'm aware of. Mojeek is evaluating structured data; it's interested in Open Graph and Schema.org vocabularies.
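To make two of these formats concrete, here is a small sketch that emits a Schema.org JSON-LD block and a few Open Graph meta tags for a blog post. The function names, the choice of the BlogPosting type, and every value in the demo are illustrative assumptions, not a statement of what any particular engine requires.

```python
import json
from html import escape


def json_ld_script(title, author, date_published, url):
    """Return a JSON-LD <script> element using Schema.org's BlogPosting type."""
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": title,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "url": url,
    }
    # Escape "<" so the JSON cannot prematurely close the script element.
    body = json.dumps(data, indent=2).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{body}\n</script>'


def open_graph_meta(title, url, og_type="article"):
    """Return Open Graph <meta> tags, one per line."""
    props = {"og:title": title, "og:type": og_type, "og:url": url}
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}">'
        for name, value in props.items()
    )


# Placeholder values for illustration only.
print(json_ld_script("Cats & dogs", "Jane Doe", "2021-03-10", "https://example.com/post/"))
print(open_graph_meta("Cats & dogs", "https://example.com/post/"))
```

Both snippets belong in a page's head element. Emitting both is a common hedge when you don't know which vocabulary a given crawler reads.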

Smaller indexes or less relevant results

These engines pass most of the tests listed in the "methodology" section. All of them seem relatively privacy-friendly. I wouldn't recommend using these engines to find specific answers; they're better for learning about a topic by finding interesting pages related to a set of keywords.

Stract

Stract source code (GitHub)

Right Dao

Alexandria

Alexandria engine source code

Yep

SeSe Engine

SeSe back-end Python code

SeSe-UI Vue-based front-end

Yep supports Open Graph and some JSON-LD at the moment. A look through the source code for Alexandria and Gigablast didn't seem to reveal the use of any structured data. The surprising quality of results from SeSe and Right Dao seems influenced by the crawlers' high-quality starting locations (e.g. Wikipedia).

Smaller indexes, hit-and-miss

These engines fail badly at a few important tests. Otherwise, they seem to work well enough.

Infotiger

Infotiger hidden service

seekport (HTTP only)

Exalead

Curlie

ExactSeek

Burf.co

Entfer

Siik

inetdex.com

ChatNoir

Common Crawl

ChatNoir source code (GitHub)

ChatNoir Announcement

Secret Search Engine Labs

CashRank Algorithm

Unusable engines, irrelevant results

Results from these search engines don’t seem at all useful.

Yessle

Bloopish

MetaGer

Artado Search

Active Search Results

Crawlson

Anoox

Plumb CPO

Yioop!

Spyda search engine

Blog post introducing Spyda

Spyda source code

Slzii.com: A new web portal with a search engine. Has a tiny index dominated by SEO spam. Discovered in the seirdy.one access logs.

Slzii.com

Semi-independent indexes

Engines in this category fall back to GBY when their own indexes don't have enough results. As their own indexes grow, some claim that this should happen less often.

I can't in good conscience recommend using Brave Search, as the company runs cryptocurrency, has held payments to creators without disclosing that creators couldn't receive rewards, has made dangerously misleading claims about fingerprinting resistance (will update with a link to my thoughts on the matter), is run by a CEO who spent thousands of dollars opposing gay marriage, and has rewritten typed URLs with affiliate links.

Brave Search

Plumb

Qwant

Kagi Search

Kagi.ai

TinyGem

Non-generalist search

These indexing search engines don’t have a Google-like “ask me anything” endgame; they’re trying to do something different. You aren't supposed to use these engines the same way you use GBY.

Small/non-commercial Web

search.marginalia.nu

Announcement: marginalia.nu goes open source

Ichido search engine

Blog post documenting how Ichido works.

Teclis

Teclis free version shutdown notice

Site finders

These engines try to find a website, typically at the domain-name level. They don't focus on capturing particular pages within websites.

Kozmonavt

search.tl

Thunderstone

sengine.info

Gnomit

Other

High Browse

Keybot Translation Search Machine.

Semantic Scholar

Bonzamate

Blog post about Bonzamate: "Abuzing AWS to make a search engine".

searchcode

Lixia Labs Search

Other languages

I’m unable to evaluate these engines properly since I don’t speak the necessary languages. English searches on these are hit-or-miss. I might have made a few mistakes in this category.

Big indexes

Daum (Korean)

Naver

Seznam

Cốc Cốc

go.mail.ru

Smaller indexes

ALibw.com (Chinese)

Vuhuv

Yuhuv (alternate domain)

Parsijoo

search.ch

fastbot

Moose.at

SOLOFIELD

kaz.kz

Almost qualified

These engines come close enough to passing my inclusion criteria that I felt I had to mention them. They all display original organic results that you can't find on other engines, and maintain their own indexes. Unfortunately, they don't quite pass.

wiby.me

Mwmbl

Search My site

Blog Surf

Misc

uk.ask.com

Infinity Search

Infinity Decentralized

Search engines without a web interface

Some search engines are integrated into other appliances, but don’t have a web portal.

Gemini search engines

Time for my first Gemini-exclusive content! A Gemini page about search engines wouldn't be complete without a few search engines for the Gemini space.

geminispace.info

AuraGem Search Engine

Ponix source code

Ponix devlog

Graveyard

These engines were originally included in the article, but have since been discontinued.

petalsearch.com

Neeva shutdown announcement

Gigablast

Private.sh

gus.guru

The Wbsrch Experiment

Gowiki

Meorca Search Engine (Wayback Machine snapshot)

Ninfex

Marlo

Exclusions

Two engines were excluded from this list for having a far-right focus.

One engine was excluded because it seems to be built using cryptocurrency in a way I'd rather not support.

Some fascinating little engines seem like hobbyist proofs-of-concept. I decided not to include them in this list, but watch them with interest to see if they can become something viable.

Rationale

Why bother using non-mainstream search engines?

Conflicts of interest

Google, Microsoft (the company behind Bing), and Yandex aren't just search engine companies; they're content and ad companies as well. For example, Google hosts video content on YouTube and Microsoft hosts social media content on LinkedIn. This gives these companies a powerful incentive to prioritize their own content. They are able to do so even if they claim that they treat their own content the same as any other: since they have complete access to their search engines' inner workings, they can tailor their content pages to better fit their algorithms and tailor their algorithms to work well on their own content. They can also index their own content without limitations but throttle indexing for other crawlers.²

One way to avoid this conflict of interest is to *use search engines that aren't linked to major content providers;* i.e., use engines with their own independent indexes.

Information diversity

There's also a practical, less-ideological reason to try other engines: different providers have different results. Websites that are hard to find on one search engine might be easy to find on another, so using more indexes and ranking algorithms results in access to more content.

No search engine is truly unbiased. Most engines' ranking algorithms incorporate a method similar to PageRank, which biases them towards sites with many backlinks.

PageRank (Wikipedia)
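To illustrate the backlink bias, here is a toy power-iteration version of PageRank. It is a textbook-style sketch, not any engine's actual ranking code; the damping factor of 0.85, the iteration count, and the five-page "web" are all assumptions chosen for the demo.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of backlinks.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: everything links to "popular", nothing links to "obscure".
toy_web = {
    "popular": ["blog-a"],
    "blog-a": ["popular"],
    "blog-b": ["popular"],
    "blog-c": ["popular"],
    "obscure": ["popular"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this graph, the page with the most backlinks ends up on top while pages nobody links to stay at the baseline score, no matter how good their content might be.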

Search engines have to deal with unwanted results occupying the confusing overlap between SEO spam, shock content, and duplicate content. When this content’s manipulation of ranking algos causes it to rank high, engines have to address it through manual action or algorithm refinement. Choosing to address it through either option, or choosing to leave it there for popular queries after receiving user reports, reflects bias. The best solution is to mix different ranking algorithms and indexes instead of using one engine for everything.

Methodology

Discovery

I find new engines by:

Criteria for inclusion

Engines in this list should have their own indexes powered by web crawlers. Original results should not be limited to a set of websites hand-picked by the engine creators; indexes should be built from sites from across the Web. An engine should discover new interesting places around the Web.

Here's an oversimplified example to illustrate what I'm looking for: imagine someone self-hosts their own personal or interest-specific website and happens to get some recognition. Could they get *automatically* discovered by your crawler, indexed, and included in the first page of results for a certain query?

I'm willing to make two exceptions:

1. Engines in the "semi-independent" section may mix results that do meet the aforementioned criteria with results that do not.

2. Engines in the "almost qualified" section may use indexes primarily made of user-submitted or hand-picked sites, rather than focusing primarily on sites discovered organically through crawling.

The reason the second exception exists is that while user submissions don't represent automatic crawling, they do at least inform the engine of new interesting websites that it had not previously discovered; these websites can then be shown to other users. That's fundamentally what an alternative web index needs to achieve.

I'm not usually willing to budge on my "no hand-picked websites" rule. Hand-picked sites will be ignored, whether your engine fetches content through their APIs or crawls and scrapes their content. It's fine to use hand-picked websites as starting points for your crawler (Wikipedia is a popular option).

I only consider search engines that focus on link results for webpages. Image search engines are out of scope, though I *might* consider some other engines for non-generalist search (e.g., Semantic Scholar finds PDFs rather than webpages).

Evaluation

I focused almost entirely on "organic results" (the classic link results), and didn't focus too much on (often glaring) privacy issues, "enhanced" or "instant" results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements.

I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine’s results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
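A side-by-side comparison like this can be roughly approximated in code. The sketch below scores how much two result lists overlap in their first 20 entries, ignoring order. The URL normalization rules and the helper names are my own assumptions; I did these comparisons by eye, so treat this as an illustration rather than my actual tooling.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so trivial differences don't hide matches.

    Scheme, a leading "www.", trailing slashes, and query strings are ignored.
    """
    parts = urlsplit(url.strip())
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def result_overlap(results_a, results_b, top_n=20):
    """Return the Jaccard similarity of the top `top_n` results, ignoring order."""
    a = {normalize(u) for u in results_a[:top_n]}
    b = {normalize(u) for u in results_b[:top_n]}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Placeholder result lists for two hypothetical engines.
engine_one = ["https://www.example.com/a/", "http://example.org/b"]
engine_two = ["http://example.com/a", "https://example.net/c"]
print(result_overlap(engine_one, engine_two))
```

A score near 1.0 across several esoteric queries would suggest a shared upstream index; a single query proves little either way.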

I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:

Some less-mainstream engines have noticed this article, which is great! I've had excellent discussions with people who work on several of these engines. Unfortunately, this article's visibility also incentivizes some engines to optimize specifically for any methodology I describe. I've addressed this by keeping a long list of test queries to myself. The simple queries above are a decent starting point for simple quick evaluations, but I also test for common search operators, keyword length, and types of domain-specific jargon. I also use queries designed to pull up specific pages with varying levels of popularity and recency to gauge the size, scope, and growth of an index.

Professional critics often work anonymously because personalization can damage the integrity of their reviews. For similar reasons, I attempt to try each engine anonymously at least once by using a VPN and/or my standard anonymous setup: an amnesiac Whonix VM with the Tor Browser. I also often test using a fresh profile when travelling, or via a Searx instance if it supports a given engine. When avoiding personalization, I use "varied" queries that I don't repeat verbatim across search engines; this reduces the likelihood of identifying me. I also attempt to spread these tests out over time so admins won't notice an unusual uptick in unpredictable and esoteric searches. This might seem overkill, but I already regularly employ similar methods for a variety of different scenarios.

Caveats

I didn't try to avoid personalization when testing engines that require account creation. Entries in the "hit-and-miss" and "unusable" sections got less attention: for instance, I didn't spend a lot of effort tracking results over time to see how new entries got added to them.

I avoided "natural language" queries like questions, focusing instead on keyword searches and search operators. I also mostly ignored infoboxes (also known as "instant answers").

Findings

What I learned by building this list has profoundly changed how I surf.

Using one engine for everything ignores the fact that different engines have different strengths. For example: while Google is focused on being an "answer engine", other engines are better than Google at discovering new websites related to a broad topic. Fortunately, browsers like Chromium and Firefox make it easy to add many search engine shortcuts for easy switching.

When talking to search engine founders, I found that the biggest obstacle to growing an index is getting blocked by sites. Cloudflare is one of the worst offenders. Too many sites block perfectly well-behaved crawlers, only allowing major players like Googlebot, BingBot, and TwitterBot; this cements the current duopoly over English search and is harmful to the health of the Web as a whole.
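Blocking often happens at the firewall or CDN level, which is hard to demonstrate in a few lines. The robots.txt variant of the same pattern is easy to show, though. The sketch below uses Python's standard robots.txt parser on a made-up policy that admits two large crawlers and turns everyone else away; "ExampleIndieBot" is a hypothetical user agent.

```python
from urllib.robotparser import RobotFileParser

# A made-up policy: the big crawlers are welcome, everyone else is not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Report whether `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("Googlebot", "bingbot", "ExampleIndieBot"):
    print(bot, allowed(bot, "https://example.com/post/"))
```

A well-behaved new crawler that honors this file can never index the site, however polite it is.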

Too many people optimize sites specifically for Google without considering the long-term consequences of their actions. One of many examples is how Google's JavaScript support rendered the practice of testing a website without JavaScript or images "obsolete": almost no non-GBY engines on this list are JavaScript-aware.
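Here is a rough sketch of what a crawler without JavaScript support "sees". It extracts text from raw HTML using only Python's standard library, skipping script and style contents. The two sample pages are invented for the demo, and real crawlers are more sophisticated than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect page text the way a simple non-JavaScript crawler might."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


static_page = "<html><body><h1>My blog</h1><p>Hello, Web.</p></body></html>"
js_page = (
    "<html><body><div id='root'></div>"
    "<script>document.getElementById('root').textContent = 'Hello, Web.';</script>"
    "</body></html>"
)
print(repr(indexable_text(static_page)))
print(repr(indexable_text(js_page)))
```

The static page yields its heading and paragraph, while the script-rendered page yields nothing to index.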

When building webpages, authors need to consider the barriers to entry for a new search engine. The best engines we can build today shouldn't replace Google. They should try to be different. We want to see the Web that Google won't show us, and search engine diversity is an important step in that direction.

Try a "bad" engine from lower in the list. It might show you utter crap. But every garbage heap has an undiscovered treasure. I'm sure that some hidden gems you'll find will be worth your while. Let's add some serendipity to the SEO-filled Web.

Acknowledgements

Some of this content came from the Search Engine Map and Search Engine Party. A few web directories also proved useful.

Search Engine Map

Search Engine Party

Matt from Gigablast also gave me some helpful information about GBY which I included in the "Rationale" section. He's written more about big tech in the Gigablast blog:

Gigablast blog

Nicholas A. Ferrell of The New Leaf Journal wrote a great post on alternative search engines.

A 2021 List of Alternative Search Engines and Search Resources

N.A. Ferrell's Gemlog

He also gave me some useful details about Seznam, Naver, Baidu, and Goo:

Re: Editor of The New Leaf Journal - Added Your Guestbook Comment Info to My Post + Feedback

Notes

¹ Yes, “indexes” is an acceptable plural form of the word “index”. The word “indices” sounds weird to me outside a math class.

² Matt from Gigablast told me that indexing YouTube or LinkedIn will get you blocked if you aren't Google or Microsoft. I imagine that you could do so by getting special permission if you're a megacorporation.

³ DuckDuckGo has a crawler called DuckDuckBot. This crawler doesn't impact the linked results displayed; it just grabs favicons and scrapes data for a few instant answers. DuckDuckGo's help pages claim that the engine uses over 400 sources; my interpretation is that at least 398 sources don't impact organic results. I don't think DuckDuckGo is transparent enough about the fact that their organic results are proxied. Compare DuckDuckGo side-by-side with Bing and you'll see it's sourcing organic results from one of them (probably Bing). Update 2022: DuckDuckGo has the ability to downrank results on its own; it was previously working with Bing to get Bing to remove misinformation and spam:

Gabriel Weinberg on Twitter

DuckDuckGo's prior approach to moderation

⁴ Qwant claims to also use its own crawler for results, but it’s still mostly Bing in my experience. See the "semi-independent" section.

⁵ Disconnect Search allows users to have results proxied from Bing or Yahoo, but Yahoo sources its results from Bing.

⁶ Yippy claims to be powered by a certain IBM brand (a brand that could correspond to any number of products) and annotates results with the phrase “Yippy Index”, but a side-by-side comparison with Bing and other Bing-based engines revealed results to be nearly identical.

⁸ This is based on a statement Right Dao made on Reddit:

Right Dao on Reddit

Archive of the Reddit thread

âč Some search engines support the "site:" search operator to limit searches to subpages/subdomains of a single site or TLD. "site:.one", for instance, limits searches to websites with the ".one" TLD.

¹⁰ More information can be found in a HN subthread and the Cliqz tech blog:

HN comment thread for "Introducing Brave Search Beta"

Tech @ Cliqz: Building a search engine from scratch

Tech @ Cliqz: Search quality at Cliqz

¹¹ I'm in the process of re-evaluating You.com. It claims to operate a crawler and index. As of right now, it seems very much like DuckDuckGo to me: organic results look like they're from Bing, while infoboxes ("apps") seem to be scraped or queried from hand-picked websites; I'm not currently seeing results from "around the web" like the other engines that do pass my inclusion criteria. I might be wrong! I'm re-evaluating it to see if this isn't actually the case. (Update: You.com seems to source organic link results from Bing, and only interleaves those results with its own curated infoboxes.)

---

Article changelog

Homepage

View “A look at search engines with their own indexes” on the WWW

Gemini capsule source code

Copyright © 2021 Rohan Kumar