💾 Archived View for seirdy.one › 2021 › 03 › 10 › search-engines-with-own-indexes.gmi captured on 2022-04-28 at 17:55:03. Gemini links have been rewritten to link to archived content


A look at search engines with their own indexes

Originally posted 2021-03-10. Last updated 2022-04-23.

This is a cursory review of all the indexing search engines I have been able to find. Gemini engines are at the bottom; the rest of this post is about Web search engines.

The three dominant English search engines with their own indexes¹ are Google, Bing, and Yandex (GBY). Many alternatives to GBY exist, but almost none of them have their own results; instead, they just source their results from GBY.

With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they’d like more information.

This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I’d especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.

I plan on updating the engines in the top two categories with more info comparing the structured/linked data the engines leverage (RDFa vocabularies, microdata, microformats, JSON-LD, etc.) to help authors determine which formats to use.

About the list

I discuss my motivation for making this page in the "Rationale" section.

I primarily evaluated English-speaking search engines because that’s my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn't able to find many Spanish-language engines powered by their own crawlers.

I mention details like "allows site submissions" and structured data support where I can, but only to inform authors about their options, not as points in the engines' favor.

See the "Methodology" section at the bottom to learn how I evaluated each one.

General indexing search-engines

Large indexes, good results

These are large engines that pass all my standard tests and more.

1. Google: the biggest index. Allows submitting pages and sitemaps for crawling, and even supports WebSub to automate the process. Powers a few other engines:

2. Bing: the runner-up. Allows submitting pages and sitemaps for crawling without login using the IndexNow API. Its index powers many other engines:

3. Yandex: originally a Russian search engine, it now has an English version. Some Russian results bleed into its English site. Like Bing, it allows submitting pages and sitemaps for crawling using the IndexNow API. Powers:

4. Mojeek: Seems privacy-oriented with a large index containing billions of pages. Quality isn’t at Google/Bing/Yandex’s level, but it’s not bad either. If I had to use Mojeek as my default general search engine, I’d live. Partially powers eTools.ch. At this moment, I think that Mojeek is the best alternative to GBY for general web search.

5. Petal search: A search engine by Huawei that recently switched from searching for Android apps to general search. Despite its surprisingly good results, I wouldn't recommend it due to privacy concerns. Requires an account to submit sites. I discovered this via my access logs. Be aware that in some jurisdictions, it doesn't use its own index: in Russia and some EU regions it uses Yandex and Qwant, respectively.

petalsearch.com
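For authors who want to try the IndexNow submissions mentioned for Bing and Yandex, here is a minimal sketch of building a submission URL. It assumes the publicly documented IndexNow convention of a GET request carrying `url` and `key` query parameters, where the key matches a key file hosted on the submitting site. The endpoint, page URL, and key below are placeholders, and the function name `indexnow_url` is my own. The sketch only builds the URL; it does not send anything.

```python
from urllib.parse import urlencode


def indexnow_url(page_url: str, key: str,
                 endpoint: str = "https://www.bing.com/indexnow") -> str:
    """Build an IndexNow-style submission URL (assumed GET form).

    `endpoint` is an assumption; each participating engine publishes its own.
    `key` is a placeholder for the key the site owner hosts on their site.
    """
    query = urlencode({"url": page_url, "key": key})
    return f"{endpoint}?{query}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(indexnow_url("https://example.org/new-post", "0123456789abcdef"))
```

Fetching the resulting URL with any HTTP client would perform the submission; check each engine's own IndexNow documentation for the endpoint it expects before relying on this.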

Google, Bing, and Yandex support structured data such as microformats1, microdata, RDFa, Open Graph markup, and JSON-LD. Yandex's support for microformats1 is limited; for instance, it can parse h-card metadata for organizations but not people. Open Graph and Schema.org are the only supported vocabularies I'm aware of. Mojeek is evaluating structured data; it's interested in Open Graph and Schema.org vocabularies.
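As a rough illustration of two of the formats named above, the following sketch generates a JSON-LD block using the Schema.org vocabulary and a few Open Graph meta tags. The helper names and all field values are hypothetical, and the chosen properties are only a small subset picked for the example; which properties a given engine actually reads is not something this sketch can tell you.

```python
import json
from html import escape


def json_ld_article(headline: str, author: str,
                    date_published: str, url: str) -> str:
    """Return a <script> element holding Schema.org Article data as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "url": url,
    }
    # Escape "</" so the payload cannot close the script element early.
    payload = json.dumps(data).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


def open_graph_tags(title: str, url: str, og_type: str = "article") -> str:
    """Return Open Graph <meta> tags, one per line."""
    props = {"og:title": title, "og:type": og_type, "og:url": url}
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}">'
        for name, value in props.items()
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json_ld_article("Example post", "Jane Doe",
                          "2021-03-10", "https://example.org/post"))
    print(open_graph_tags("Example post", "https://example.org/post"))
```

Both outputs belong in a page's `<head>`. Emitting the two formats side by side is a hedge for authors who do not know which one a given engine parses.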

Smaller indexes, relevant results

These engines pass most of the tests listed in the "methodology" section. All of them seem relatively privacy-friendly.

Right Dao

Gigablast

Private.sh

Alexandria

Alexandria engine source code

Fairsearch

FairSearch supports Open Graph and some JSON-LD at the moment. A look through the source code for Alexandria and Gigablast didn't seem to reveal the use of any structured data.

Smaller indexes, hit-and-miss

These engines fail badly at a few important tests. Otherwise, they seem to work well enough.

seekport (HTTP only)

Exalead

Curlie

ExactSeek

Infotiger

Burf.co

Entfer

Siik

inetdex.com

Meorca Search Engine

ChatNoir

Common Crawl

ChatNoir source code (GitHub)

ChatNoir Announcement

Secret Search Engine Labs

CashRank Algorithm

Unusable engines, irrelevant results

Results from these search engines don’t seem at all useful.

Bloopish

MetaGer

Artado Search

Active Search Results

Crawlson

Anoox

Plumb CPO

Yioop!

Semi-independent indexes

Engines in this category fall back to GBY when their own indexes don't have enough results. As their own indexes grow, some claim that this should happen less often.

Brave Search

Plumb

Neeva

Qwant

Kagi Search

Kagi.ai

TinyGem

Non-generalist search

These indexing search engines don’t have a Google-like “ask me anything” endgame; they’re trying to do something different. You aren't supposed to use these engines the same way you use GBY.

Small/non-commercial Web

wiby.me

search.marginalia.nu

Search My site

Teclis

Site finders

These engines try to find a website, typically at the domain-name level. They don't focus on capturing particular pages within websites.

Kozmonavt

search.tl

Thunderstone

sengine.info

Gnomit

Other

Keybot Translation Search Machine.

Ninfex

Semantic Scholar

Other languages

I’m unable to evaluate these engines properly since I don’t speak the necessary languages. English searches on these are hit-or-miss. I might have made a few mistakes in this category.

Big indexes

Naver

Seznam

Cốc Cốc

go.mail.ru

Smaller indexes

Vuhuv

Yuhuv (alternate domain)

Parsijoo

search.ch

fastbot

Moose.at

SOLOFIELD

kaz.kz

Misc

uk.ask.com

Infinity Search

Infinity Decentralized

Upcoming engines

These engines aren’t ready yet; their indexes are either in a proof-of-concept phase with a handful of sites or aren’t available yet.

Gemini search engines

Time for my first Gemini-exclusive content! A Gemini page about search engines wouldn't be complete without a few search engines for the Gemini space.

geminispace.info

AuraGem Search Engine

Ponix source code

Ponix devlog

Graveyard

These engines were originally included in the article, but have since been discontinued.

gus.guru

The Wbsrch Experiment

Gowiki

Exclusions

Two engines were excluded from this list for having a far-right focus.

One engine was excluded because it seems to be built using cryptocurrency in a way I'd rather not support.

Some fascinating little engines seem like hobbyist proofs-of-concept. I decided not to include them in this list, but watch them with interest to see if they can become something viable.

Rationale

Why bother using non-mainstream search engines?

Conflicts of interest

Google, Microsoft (the company behind Bing), and Yandex aren't just search engine companies; they're content and ad companies as well. For example, Google hosts video content on YouTube and Microsoft hosts social media content on LinkedIn. This gives these companies a powerful incentive to prioritize their own content. They are able to do so even if they claim that they treat their own content the same as any other: since they have complete access to their search engines' inner workings, they can tailor their content pages to better fit their algorithms and tailor their algorithms to work well on their own content. They can also index their own content without limitations but throttle indexing for other crawlers.²

One way to avoid this conflict of interest is to *use search engines that aren't linked to major content providers;* i.e., use engines with their own independent indexes.

Information diversity

There's also a practical, less-ideological reason to try other engines: different providers have different results. Websites that are hard to find on one search engine might be easy to find on another, so using more indexes and ranking algorithms results in access to more content.

No search engine is truly unbiased. Most engines' ranking algorithms incorporate a method similar to PageRank, which biases them towards sites with many backlinks.

PageRank (Wikipedia)
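To make the backlink bias concrete, here is a minimal sketch of the textbook PageRank power iteration on a toy link graph. It is not any engine's actual ranking code; the graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Textbook PageRank via power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph: three pages link to "hub", which links back to "a" only.
    toy = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
    for page, score in sorted(pagerank(toy).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

In this toy graph the page with the most backlinks ends up with the highest score, and pages nobody links to stay near the minimum regardless of their content.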

Search engines have to deal with unwanted results occupying the confusing overlap between SEO spam, shock content, and duplicate content. When this content’s manipulation of ranking algos causes it to rank high, engines have to address it through manual action or algorithm refinement. Choosing to address it through either option, or choosing to leave it there for popular queries after receiving user reports, reflects bias. The best solution is to mix different ranking algorithms and indexes instead of using one engine for everything.

Methodology

Discovery

I find new engines by:

Criteria for inclusion

Engines in this list should have their own indexes built primarily by web spiders. They should not be limited to a set of domains hand-picked by the engine creators.

I'm willing to make one exception: engines in the "non-generalist" section may use indexes primarily made of user-submitted sites, rather than focusing primarily on sites discovered organically through crawling. I'm not willing to budge on the "no hand-picked domains" rule.

I only consider search engines that focus on link results for webpages. Image search engines are out of scope, though I *might* consider some other engines for non-generalist search (e.g., Semantic Scholar finds PDFs rather than webpages).

Evaluation

I focused almost entirely on "organic results" (the classic link results), and didn't focus too much on (often glaring) privacy issues, "enhanced" or "instant" results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements.

I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine’s results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
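The side-by-side comparison described above was done by eye, but the same idea can be sketched in code: take the top results from two engines, ignore their order, and measure how much the two sets overlap. The URL normalization rules and the function names here are my own assumptions, and any cutoff for "nearly identical" would be a judgment call.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivial differences don't hide a match."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def result_overlap(results_a: list, results_b: list, top_n: int = 20) -> float:
    """Jaccard similarity of the top-N result sets, ignoring order."""
    a = {normalize(u) for u in results_a[:top_n]}
    b = {normalize(u) for u in results_b[:top_n]}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up result lists for illustration only.
    engine_a = ["https://example.org/a", "https://www.example.org/b/", "http://example.com/c"]
    engine_b = ["http://example.org/b", "https://example.org/a", "https://example.net/d"]
    print(result_overlap(engine_a, engine_b))
```

A score near 1.0 across several esoteric queries suggests the two engines share a source; a low score on a single query proves little either way.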

I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:

Some less-mainstream engines have noticed this article, which is great! I've had excellent discussions with people who work on several of these engines. Unfortunately, this article's visibility also incentivizes some engines to optimize specifically for any methodology I describe. I've addressed this by keeping a long list of test queries to myself. The simple queries above are a decent starting point for simple quick evaluations, but I also test for common search operators, keyword length, and types of domain-specific jargon. I also use queries designed to pull up specific pages with varying levels of popularity and recency to gauge the size, scope, and growth of an index.

Professional critics often work anonymously because personalization can damage the integrity of their reviews. For similar reasons, I attempt to try each engine anonymously at least once by using a VPN and/or my standard anonymous setup: an amnesiac Whonix VM with the Tor Browser. I also often test using a fresh profile when travelling, or via a Searx instance if it supports a given engine. When avoiding personalization, I use "varied" queries that I don't repeat verbatim across search engines; this reduces the likelihood of identifying me. I also attempt to spread these tests out over time so admins won't notice an unusual uptick in unpredictable and esoteric searches. This might seem overkill, but I already regularly employ similar methods for a variety of different scenarios.

Caveats

I didn't try to avoid personalization when testing engines that require account creation. Entries in the "hit-and-miss" and "unusable" sections got less attention: for instance, I didn't spend a lot of effort tracking results over time to see how new entries got added to them.

I avoided "natural language" queries like questions, focusing instead on keyword searches and search operators. I also mostly ignored infoboxes (also known as "instant answers").

Findings

What I learned by building this list has profoundly changed how I surf.

Using one engine for everything ignores the fact that different engines have different strengths. For example: while Google is focused on being an "answer engine", other engines are better than Google at discovering new websites related to a broad topic. Fortunately, browsers like Chromium and Firefox make it easy to add many search engine shortcuts for easy switching.

When talking to search engine founders, I found that the biggest obstacle to growing an index is getting blocked by sites. Cloudflare is one of the worst offenders. Too many sites block perfectly well-behaved crawlers, only allowing major players like Googlebot, BingBot, and TwitterBot; this cements the current duopoly over English search and is harmful to the health of the Web as a whole.
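Blocking can happen at several layers, and the founders I spoke with did not single out one mechanism. As one assumed illustration, the sketch below uses Python's standard `urllib.robotparser` to show how a robots.txt file that allowlists only the big crawlers treats everyone else. The robots.txt contents and the `ExampleIndieBot` user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that only welcomes two major crawlers.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/page"
    for agent in ("Googlebot", "bingbot", "ExampleIndieBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

A well-behaved crawler from a small engine honors the final `Disallow` rule and never sees the site, which is exactly how an allowlist like this entrenches the incumbents.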

Too many people optimize sites specifically for Google without considering the long-term consequences of their actions. One of many examples is how Google's JavaScript support rendered the practice of testing a website without JavaScript or images "obsolete": almost no non-GBY engines on this list are JavaScript-aware.

When building webpages, authors need to consider the barriers to entry for a new search engine. The best engines we can build today shouldn't replace Google. They should try to be different. We want to see the Web that Google won't show us, and search engine diversity is an important step in that direction.

Try a "bad" engine from lower in the list. It might show you utter crap. But every garbage heap has an undiscovered treasure. I'm sure that some hidden gems you'll find will be worth your while. Let's add some serendipity to the SEO-filled Web.

Acknowledgements

Some of this content came from the Search Engine Map and Search Engine Party. A few web directories also proved useful.

Search Engine Map

Search Engine Party

Matt from Gigablast also gave me some helpful information about GBY which I included in the "Rationale" section. He's written more about big tech in the Gigablast blog:

Gigablast blog

Nicholas A. Ferrell of The New Leaf Journal wrote a great post on alternative search engines.

A 2021 List of Alternative Search Engines and Search Resources

N.A. Ferrell's Gemlog

He also gave me some useful details about Seznam, Naver, Baidu, and Goo:

Re: Editor of The New Leaf Journal - Added Your Guestbook Comment Info to My Post + Feedback

Notes

¹ Yes, “indexes” is an acceptable plural form of the word “index”. The word “indices” sounds weird to me outside a math class.

² Matt from Gigablast told me that indexing YouTube or LinkedIn will get you blocked if you aren't Google or Microsoft. I imagine that you could do so by getting special permission if you're a megacorporation.

³ DuckDuckGo has a crawler called DuckDuckBot. This crawler doesn't impact the linked results displayed; it just grabs favicons and scrapes data for a few instant answers. DuckDuckGo's help pages claim that the engine uses over 400 sources; my interpretation is that at least 398 sources don't impact organic results. I don't think DuckDuckGo is transparent enough about the fact that their organic results are proxied. Compare DuckDuckGo side-by-side with Bing and Yandex and you'll see it's sourcing organic results from one of them (probably Bing). Update 2022: DuckDuckGo has the ability to downrank results on its own; it was previously working with Bing to get Bing to remove misinformation and spam:

Gabriel Weinberg on Twitter

DuckDuckGo's prior approach to moderation

⁴ Qwant claims to also use its own crawler for results, but it’s still mostly Bing in my experience. See the "semi-independent" section.

⁵ Disconnect Search allows users to have results proxied from Bing or Yahoo, but Yahoo sources its results from Bing.

⁶ Yippy claims to be powered by a certain IBM brand (a brand that could correspond to any number of products) and annotates results with the phrase “Yippy Index”, but a side-by-side comparison with Bing and other Bing-based engines revealed results to be nearly identical.

⁸ This is based on a statement Right Dao made on Reddit:

Right Dao on Reddit

Archive of the Reddit thread

âč Some search engines support the "site:" search operator to limit searches to subpages/subdomains of a single site or TLD. "site:.one", for instance, limits searches to websites with the ".one" TLD.

¹⁰ More information can be found in a HN subthread and the Cliqz tech blog:

HN comment thread for "Introducing Brave Search Beta"

Tech @ Cliqz: Building a search engine from scratch

Tech @ Cliqz: Search quality at Cliqz

---

Article changelog

Homepage

View “A look at search engines with their own indexes” on the WWW

Gemini capsule source code

Copyright © 2021 Rohan Kumar