💾 Archived View for envs.net › ~seirdy › 2021 › 03 › 10 › search-engines-with-own-indexes.gmi captured on 2022-04-29 at 12:40:14. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

A look at search engines with their own indexes

Originally posted 2021-03-10. Last updated 2022-04-23.

This is a cursory review of all the indexing search engines I have been able to find. Gemini engines are at the bottom; the rest of this post is about Web search engines.

The three dominant English search engines with their own indexes¹ are Google, Bing, and Yandex (GBY). Many alternatives to GBY exist, but almost none of them have their own results; instead, they just source their results from GBY.

With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they’d like more information.

This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I’d especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.

I plan on updating the engines in the top two categories with more info comparing the structured/linked data the engines leverage (RDFa vocabularies, microdata, microformats, JSON-LD, etc.) to help authors determine which formats to use.

About the list

I discuss my motivation for making this page in the "Rationale" section.

I primarily evaluated English-speaking search engines because that’s my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn't able to find many Spanish-language engines powered by their own crawlers.

I mention details like "allows site submissions" and structured data support where I can only to inform authors about their options, not as points in engines' favor.

See the "Methodology" section at the bottom to learn how I evaluated each one.

General indexing search-engines

Large indexes, good results

These are large engines that pass all my standard tests and more.

1. Google: the biggest index. Allows submitting pages and sitemaps for crawling, and even supports WebSub to automate the process. Powers a few other engines:

Startpage
GMX search
(discontinued) Runnaroo
SAPO (Portuguese interface, can work with English results)

2. Bing: the runner-up. Allows submitting pages and sitemaps for crawling without login using the IndexNow API. Its index powers many other engines:

Yahoo (and its sibling engine, OneSearch)
DuckDuckGo³
AOL
Qwant (partial)⁴
Ecosia
Ekoru
Privado
Findx
Disconnect Search⁵
PrivacyWall
Lilo
SearchScene
Peekier
Oscobo
Million Short
Yippy search⁶
Lycos
Givero
Swisscows
Fireball
You.com
Partially powers MetaGer by default; this can be turned off
At this point, I mostly stopped adding Bing-based search engines. There are just too many.

3. Yandex: originally a Russian search engine, it now has an English version. Some Russian results bleed into its English site. Like Bing, it allows submitting pages and sitemaps for crawling using the IndexNow API. Powers:

Epic Search (went paid-only by June 2021)
Occasionally powers DuckDuckGo’s link results instead of Bing.

4. Mojeek: Seems privacy-oriented with a large index containing billions of pages. Quality isn’t at Google/Bing/Yandex’s level, but it’s not bad either. If I had to use Mojeek as my default general search engine, I’d live. Partially powers eTools.ch. At this moment, I think that Mojeek is the best alternative to GBY for general web search.

5. Petal search: A search engine by Huawei that recently switched from searching for Android apps to general search. Despite its surprisingly good results, I wouldn't recommend it due to privacy concerns. Requires an account to submit sites. I discovered this via my access logs. Be aware that in some jurisdictions, it doesn't use its own index: in Russia and some EU regions it uses Yandex and Qwant, respectively.

petalsearch.com

Google, Bing, and Yandex support structured data such as microformats1, microdata, RDFa, Open Graph markup, and JSON-LD. Yandex's support for microformats1 is limited; for instance, it can parse h-card metadata for organizations but not people. Open Graph and Schema.org are the only supported vocabularies I'm aware of. Mojeek is evaluating structured data; it's interested in Open Graph and Schema.org vocabularies.

Smaller indexes, relevant results

These engines pass most of the tests listed in the "methodology" section. All of them seem relatively privacy-friendly.

Right Dao: very fast, good results. Passes the tests fairly well. It plans on including query-based ads if/when its userbase grows.⁸

Right Dao

Gigablast: It’s been around for a while and also sports a classic web directory. Searches are a bit slow, and it charges to submit sites for crawling. It powers Private.sh. Gigablast is tied with Right Dao for quality.

Gigablast

Private.sh

Alexandria: A pretty new "non-profit, ad free" engine, with freely-licensed code. Surprisingly good at finding recent pages. Its index is built from the Common Crawl; it isn't as big as Gigablast or Right Dao but its ranking is great.

Alexandria

Alexandria engine source code

Fairsearch: an ambitious engine from Ahrefs, an SEO/backlink-finder company, that "shares ad profit with creators and protects your privacy". Most engines show results that include keywords from or related to the query; Fairsearch also shows results linked by pages containing the query. In other words, not all results contain relevant keywords. This makes it excellent for less precise searches and discovery of "related sites", especially with its index of *hundreds of billions of pages.* It's far worse at finding very specific information or recent events for now, but it will probably improve: while this version is live, it's not officially launched yet. When it officially launches, *it will be under a different name*. I expect Fairsearch to graduate from this section as result relevancy improves.

Fairsearch

FairSearch supports Open Graph and some JSON-LD at the moment. A look through the source code for Alexandria and Gigablast didn't seem to reveal the use of any structured data

Smaller indexes, hit-and-miss

These engines fail badly at a few important tests. Otherwise, they seem to work well enough.

seekport: The interface is in German but it supports searching in English just fine. The default language is selected by your locale. It’s really good considering its small index; it hasn’t heard of less common terms (e.g. “Seirdy”), but it’s able to find relevant results in other tests. The server does not support TLS.
Exalead: slow, quality is hit-and-miss. Its indexer claims to crawl the DMOZ directory, which has since shut down and been replaced by the Curlie directory. No relevant results for “Oppenheimer” and some other history-related queries. Allows submitting individual URLs for indexing, but requires solving a Google reCAPTCHA and entering an email address.
ExactSeek: small index, disproportionately dominated by big sites. Failed multiple tests. Allows submitting individual URLs for crawling, but requires entering an email address and receiving a newsletter. Webmaster tools seem to heavily push for paid SEO options. It also powers SitesOnDisplay and Blog-search.com.
Infotiger: A small index that seems to find relevant results. It allows site submission for English and German pages. It also features a "similarity" search to query pages similar to a given link, with mixed results.

Burf.co: Very small index, but seems fine at ranking more relevant results higher. Allows site submission without any extra steps.
Entfer: a newcomer that lets registered users upvote/downvote search results to customize ranking. Doesn't offer much information about who made it. Its index is small, but it does seem to return results related to the query.
Siik: Lacks contact info, and the ToS and Privacy Policy links are dead. Seems to have PHP errors in the backend for some of its instant-answer widgets. If you scroll past all that, it does have web results powered by what seems to be its own index. These results do tend to be somewhat relevant, but the index seems too small for more specific queries.
websearchengine.org and tuxdex.com: Both are run by the same people, powered by their inetdex.com index. Searches are fast, but crawls are a bit shallow. Claims to have an index of 10 million domains, and not to use cookies.

Meorca: A UK-based search engine that claims not to "index pornography or illegal content websites". It also features an optional social network ("blog"). Discovered in the seirdy.one access logs.
ChatNoir: An experimental engine by researchers that uses the Common Crawl index. The engine is open source. There's more information in its announcement on the Common Crawl mailing list (Google Groups).
Secret Search Engine Labs: Very small index with very little SEO spam; it toes the line between a "search engine" and a "surf engine". It's best for reading about broad topics that would otherwise be dominated by SEO spam, thanks to its CashRank algorithm. Allows site submission.

Meorca Search Engine

ChatNoir

Common Crawl

ChatNoir source code (GitHub)

ChatNoir Announcement

Secret Search Engine Labs

CashRank Algorithm

Unusable engines, irrelevant results

Results from these search engines don’t seem at all useful.

Bloopish: extremely quick to update its index; site submissions show up in seconds. Unfortunately, its index only contains a few thousand documents (under 100 thousand at the time of writing). It's growing fast: if you search for a term, it'll start crawling related pages and grow its index.
YaCy: community-made index; slow. Results are awful/irrelevant, but can be useful for intranet or custom search.
Scopia: only seems to be available via the MetaGer metasearch engine after turning off Bing and news results. Tiny index, very low-quality.
Artado Search: Primarily Turkish, but it also seems to support English results. Like Plumb, it uses client-side JS to fetch results from existing engines (Google, Bing, Yahoo, Petal, and others); like MetaGer, it has an option to use its own independent index. Results from its index are almost always empty. Very simple queries ("twitter", "wikipedia", "reddit") give some answers. Supports site submission and crowdsourced instant answers.
Active Search Results: very poor quality

Bloopish

MetaGer

Artado Search

Active Search Results

Crawlson: young, slow. In this category because its index has a cap of 10 URLs per domain. I initially discovered Crawlson in the seirdy.one access logs.
Anoox: Results are few and irrelevant; fails to find any results for basic terms. Allows site submission. It's also a lightweight social network and claims to be powered by its users, letting members vote on listings to alter rankings.
Yioop!: A FLOSS search engine that boasts a very impressive feature-set: it can parse sitemaps, feeds, and a variety of markup formats; it can import pre-curated data in forms such as access logs, Usenet posts, and WARC archives; it also supports feed-based news search. Despite the impressive feature set, Yioop's results are few and irrelevant due to its small index. It allows submitting sites for crawling. Like Meorca, Yioop has social features such as blogs, wikis, and a chat bot API.

Semi-independent indexes

Engines in this category fall back to GBY when their own indexes don't have enough results. As their own indexes grow, some claim that this should happen less often.

Brave Search: Many tests (including all the tests I listed in the "Methodology" section) resulted results identical to Google, revealed by a side-by-side comparison with Google, Startpage, and a Searx instance with only Google enabled. Brave claims that this is due to how Cliqz (the discontinued engine acquired by Brave) used query logs to build its page models and was optimized to match Google.¹⁰ The index is independent, but optimizing against Google resulted in too much similarity for the real benefit of an independent index to show. Furthermore, many queries have Bing results mixed in; users can click an "info" button to see the percentage of results that came from its own index. The independent percentage is typically quite high (often close to 100%) but can drop for advanced queries.

Brave Search

Plumb: Almost all queries return no results; when this happens, it falls back to Google. It's fairly transparent about the fallback process, but I'm concerned about *how* it does this: it loads Google's Custom Search scripts from "cse.google.com" onto the page to do a client-side Google search. This can be mitigated by using a browser addon to block "cse.google.com" from loading any scripts. Plumb claims that this is a temporary measure while its index grows, and they're planning on getting rid of this. Allows submitting URLs, but requires solving an hCaptcha. This engine is very new; hopefully as it improves, it could graduate from this section. Its Chief Product Officer previously founded the Gibiru search engine which shares the same affiliates and (for now) the same index; the indexes will diverge with time.

Plumb

Neeva: Combines Bing results with results from its own index. Bing normally isn't okay with this, but Neeva is one of few exceptions. As of right now, results are mostly identical to Bing but original links not found by Bing frequently pop up. Long and esoteric queries are less likely to feature original results. Requires signing up with an email address or OAuth to use, and offers a paid tier with additional benefits.

Neeva

Qwant: Qwant claims to use its own index, but it still relies on Bing for most results. It seems to be in a position similar to Neeva. Try a side-by-side comparison to see if or how it compares with Bing.

Qwant

Kagi Search: The most interesting entry in this category, IMO. Like Neeva, it requires an account; it will eventually require payment. It's powered by its own Teclis index (Teclis can be used independently; see the non-commercial section below), and claims to also use results from Google and Bing. The result seems somewhat unique: I'm able to recognize some results from the Teclis index mixed in with the mainstream ones. In addition to Teclis, Kagi's other products include the Kagi.ai intelligent answer service and the TinyGem social bookmarking service, both of which play a role in Kagi.com in the present or future.

Kagi Search

Kagi.ai

TinyGem

Non-generalist search

These indexing search engines don’t have a Google-like “ask me anything” endgame; they’re trying to do something different. You aren't supposed to use these engines the same way you use GBY.

Small/non-commercial Web

Wiby: I love this one. It focuses on smaller independent sites that capture the spirit of the “early” web. It’s more focused on “discovering” new interesting