
Ask HN: Why doesn't anyone create a search engine comparable to 2005 Google?

Author: syedkarim

Score: 394

Comments: 474

Date: 2021-12-02 15:10:03

________________________________________________________________________________

gbmatt wrote at 2021-12-02 15:29:27:

Ha, yes, I've done that at

https://gigablast.com/

.

The biggest problems now are the following:

1) Too hard to spider the web. Gatekeeper companies like Cloudflare (owned in part by Google) and Cloudfront make it really difficult for upstart search engines to download web pages.

2) Hardware costs are too high. It's much more expensive now to build a large index (50B+ pages) to be competitive.

I believe my algorithms are decent, but the biggest problem for Gigablast now is the index size. You do a search on Gigablast and say, well, why didn't it get this result that Google got? And that's because the index isn't big enough, because I don't have the cash for the hardware. By the way, I've been working on this engine for over 20 years and have probably written 1-2M lines of code for it.

easton wrote at 2021-12-02 19:37:32:

You can be whitelisted so Cloudflare doesn't slow you down (or block you):

https://support.cloudflare.com/hc/en-us/articles/36003538743...

gbmatt wrote at 2021-12-03 01:29:35:

It's not quite that easy. Have you ever tried it? See my post below. Basically, yes, I've done it, but I had to go through a lot and was lucky enough to even get them to listen to me. I just happened to know the right person to get me through. So, super lucky there.

Furthermore, they have an AI that takes you off the whitelist if it sees your bot 'misbehave'. So if you have a certain kind of bug in your spider, or your bot 'misbehaves' (whatever that means is anyone's guess), then you're going to get kicked off the list. So then what? You have to try to get on the whitelist again? They have Bing and Google on special short lists so those guys don't have to sweat all these hurdles.

Lastly, their UI and documentation are heavily centered around Google and Bing, so upstart search engines aren't getting the same treatment.

gbmatt wrote at 2021-12-03 01:38:18:

Cloudflare is not the only gatekeeper, either. Keep that in mind. There are many others and, as an upstart search engine operator, it's quite overwhelming to have to deal with them all. Some of them have contempt for you when you approach them. I've had one gatekeeper actually list my bot as a bad actor in an example in some of their documentation. So, don't get me wrong, this is about gatekeepers in general, not just Cloudflare and Cloudfront.

ipaddr wrote at 2021-12-03 02:12:35:

But given your treatment, one could say sites fronted by Cloudflare are part of a closed web.

Lhiw wrote at 2021-12-03 01:53:03:

I dunno if y'all realise this but I'd pay for a search engine that black holes CloudFlare and any other sites that think bots shouldn't read their sites.

S5yDyAk3XoQH5 wrote at 2021-12-03 13:34:32:

rip the internet if you do that =/

SamBam wrote at 2021-12-02 20:57:07:

> You do a search on Gigablast and say, well, why didn't it get this result that Google got? And that's because the index isn't big enough

I wonder how much this is true, and how much (despite all our rhetoric to the contrary) it's because we have actually come to expect Google's modern proprietary page ranking, which weighs not just inbound links but all sorts of other signals (freshness, relevance to our previous queries, etc.).

We dislike the additional signals when it feels like Google is trying to second-guess our intentions, but we probably don't notice how well they work when they give us the result we expect in the first three links.

eitland wrote at 2021-12-03 06:50:07:

>but we probably don't notice how well they work when they give us the result we expect in the first three links.

For me the experienced quality of Google search results has dropped massively since 2008, despite (and maybe even because of) all their new parameters.

When someone says this someone else usually immediately says it is because of web spam and black hat SEO.

But black hat SEO doesn't explain why verbatim doesn't work for many of us.

Black hat SEO doesn't explain why double quotes don't work.

Black hat SEO doesn't explain why there are no personal blacklists, so that all those who hate Pinterest can blacklist it.

Black hat SEO probably also doesn't explain why I cannot find a unique string in open source repos and instead get pages of, not exactly webspam, but answers to questions I didn't ask.

JacobThreeThree wrote at 2021-12-02 21:14:58:

I think people also have an inflated recollection of how good Google actually was back in 2005.

Back then Google was only going up against indexes and link-rings, not 2021 Google/Bing/DDG/etc.

eitland wrote at 2021-12-03 07:11:08:

> I think people also have an inflated recollection of how good Google actually was back in 2005.

I've been pointing this out for at least close to a decade.

I know since I bothered to screenshot and blog about it in 2012.

I'll admit mistakes happened back then too, but they were more forgivable, like keyword stuffing on unrelated pages. Back then Google was on our side and removed those as fast as possible.

Today, however, the problem isn't that someone has stuffed the keyword into an unrelated page but that Google themselves mix a whole lot of completely irrelevant pages into the results, probably because some metrics go up when they do that.

Thinking about it, it seems logical that for a search engine that practically speaking has a monopoly both on users and, as gbmatt points out, to some degree also on indexing, serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra, that means one, two or five times more ad impressions.

Note that I'm not necessarily suggesting a grand evil master plan here, only that end-to-end metrics will improve as long as there is no realistic competition.

remus wrote at 2021-12-03 09:46:01:

> Thinking about it, it seems logical that for a search engine that practically speaking has a monopoly both on users and, as gbmatt points out, to some degree also on indexing, serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra, that means one, two or five times more ad impressions.

This would mean that Google were measuring the quality of their search results by the number of ad impressions, which seems unlikely to me. Maybe in some big, woolly sense this is sort of true, but it seems pretty unlikely that anyone interested in search quality (i.e. the search team at Google) is looking at ad impressions.

pronik wrote at 2021-12-02 22:49:52:

I was using Altavista at that time, every now and then switching to Northern Light. Everything else was abysmal. Google blew them out of the water in terms of speed, quality, simplicity, lack of clutter and everything else. I can't remember ever retraining muscle memory so fast as when switching to Google. So, no, Google was great then and, apart from people actively working against the algorithm, is still good now, but obviously a completely different beast.

remus wrote at 2021-12-03 07:01:13:

I think the parent's point was that people say Google 2005 >> Google 2021, but it's pretty hard to make this comparison in an objective way. No doubt Google 2005 was way better than other offerings around at the time.

pbhjpbhj wrote at 2021-12-02 22:04:19:

2005? There were loads of other search engines (SE), and many meta-SE: hotbot, dogpile, metacrawler, ... (IIRC), plenty more.

There were also indexes, which Yahoo and AOL (remember them!) had, but there was also, what was it called, dmoz?, the open web directory. When Google started, being in the right web directory gave you a boost in SERPs, as it was used as a domain trust indicator, and the categories were used for keywords. Of course it got gamed hard.

Google was good, but I used it as an alt for maybe 6 months before it won over my main SE at the time. I've tried but can't remember what SE that was, Omni-something??

One of the main things Google had was all the extra operators like link:, inurl:, etc., but they had Boolean logic operators too at one point, I think.

nitrogen wrote at 2021-12-03 00:58:08:

_I've tried but can't remember what SE that was, Omni-something??_

Google replaced Altavista in my usage, which in turn was usually better than its predecessors.

ipaddr wrote at 2021-12-03 02:17:25:

I used them all and kept using the ones that gave me unique results. Google was hands down better because of pagerank and boosts to dmoz listed sites and because they scanned the whole page ignoring keywords.

selcuka wrote at 2021-12-03 02:35:57:

Google was good, actually very good, back in the 2000s. Their PageRank algorithm practically eliminated spam pages that were simply a list of keywords. Before Google, those pages came up on the first page of Altavista.

I don't specifically remember 2005, but the quality went down with more modern but still shady SEO practices.

more_corn wrote at 2021-12-04 05:22:03:

No, quality went down because google shat the bed. All the changes have been deliberate.

more_corn wrote at 2021-12-04 05:19:56:

I hate google now. Every time I use it by accident I’m reminded how infuriating it is. I know DuckDuckGo is just bing in a Halloween mask, but I’ll gladly use something that’s not awesome as long as it’s also not infuriating. I’d take 2005 google any day.

romwell wrote at 2021-12-02 21:18:21:

Well if the result didn't appear in the first 5-10 pages, it's probably not in the index.

You can see it with other search engines. I challenge you to come up with a Google query for which a first-page result won't be seen within the first 10 pages of Bing results for the same query.

(Bonus points if that result is relevant).

There's only so much tweaking that personalization and other heuristics can do.

But if something is missing from the index, that's it.

salawat wrote at 2021-12-03 02:58:39:

I would like to see the least relevant search result Google comes up with. :)

Yes, I realize this is probably trivial with an API call, but I always found it interesting there isn't a way to see what the site with the lowest pagerank in the index is.

jefftk wrote at 2021-12-03 00:58:20:

It sounds to me like your challenge includes anything which is in Google's index but not Bing's? Is that intentional?

jldugger wrote at 2021-12-02 21:25:35:

I assume the author has the ability to search the index to see if your preferred Google result is even indexed.

indymike wrote at 2021-12-02 15:42:02:

I've used Gigablast off and on for a long time (I think I first discovered it in 2006 or so). It would be cool to have a registration service for legitimate spiders. I used to run a team that scraped jobs and delivered them (by fax, email, or US mail, as required by law) to local veterans' employment staffers for compliance. We were contracted by huge companies (at one point about 700 of the Fortune 1000) to do so, and often our spiders would be blocked by the employer's IT department even though the HR team was paying us big bucks to do it.

betwixthewires wrote at 2021-12-02 23:54:26:

Dude, I use your engine regularly, it is spectacular. The amount of work you put into this takes some dedication.

I was curious if you ever intend to implement the OpenSearch API, so that we could use it as the default in a browser or embed it in applications?

Also how can people contribute to help you maintain a larger index and/or keep the service going?

sockaddr wrote at 2021-12-02 16:47:46:

Nice.

I'd pay $5-10/mo for a search engine that didn't just funnel me into the revenue-extracting regions of the web like Google does.

RhysU wrote at 2021-12-02 18:49:49:

A subscriber-supported search engine sounds cool to me. Any precedent?

xtracto wrote at 2021-12-02 19:12:36:

Copernic (

https://copernic.com/

) had Copernic Agent Professional, a for-pay desktop application with really good search features, a while ago. Not sure if they discontinued it.

gompertz wrote at 2021-12-02 22:53:46:

Wow blast from the past. I think I was using Copernic all the way back in 2003... Forgot all about them. Thanks!

KarlKemp wrote at 2021-12-03 00:25:27:

As a general rule, nobody is willing to pay what they are worth to advertisers. Facebook makes $70/year/user in the US. You would pay $70 for an ad-free Facebook? Congratulations, you must be an above-average earner. Also: your value to advertisers just tripled. If you are willing to pay $210, it will immediately triple again.

derekjdanserl wrote at 2021-12-03 01:42:14:

Great point! So simple, but as someone who has never worked on this side of things I never thought about it.

How would legal limitations on data collection, like GDPR, influence the ratio? None? Only an insignificant degree? Or enough to actually influence business decisions?

gianthockey495 wrote at 2021-12-02 20:02:42:

You'll like

https://neeva.com/

fxtentacle wrote at 2021-12-02 23:29:14:

How do they pay for it?

yellow_postit wrote at 2021-12-02 23:49:51:

From the FAQ:

> …Eventually, we plan to charge our members $4.95/month.

samcrawford wrote at 2021-12-02 21:45:15:

Kagi.com does this. In closed beta at the moment, but you can email and request access.

eitland wrote at 2021-12-03 07:23:35:

I've tested Kagi a bit. It nicely gave me exactly what I wanted, even in cases where names could have different meanings in different contexts (I tested with Kotlin).

The basic results are good, with some nice touches here and there, like a "blast from the past" section with older results (which is actually what I sometimes want) and another section where it widens up a bit (i.e. what Google does by default?).

Furthermore, you can apply predefined search "lenses" that focus your search, or even make your own, and you can boost or de-rank sites.

I had not expected this to happen so quickly, but I'm going to move from DDG to Kagi as my default search engine for at least a couple of days, because I am fed up with both Google's and DDG's inability to actually respect my queries.

If it continues to work as well as it does today, I'll happily pay $10 a month, and I might also buy six-month gift cards for close friends and family for next Christmas.

Think about it: unlike with an ad-financed engine, incentives are extremely closely aligned here: the smartest thing Kagi can do is to get me my results as fast as possible to conserve server resources (and delight their customer).

For an ad-financed engine, and especially one that also serves ads on the search results pages themselves, the obvious thing to do is to keep me bouncing between tweaking my search query and various pages that almost answer my question, but not quite.

(That said, if one is going to stay mainstream I recommend DDG over Google, since 1. for me at least Google's results are just as bad, 2. with DDG it is at least extremely easy to check with Google as well to see if they have a better result, and 3. competition is good.)

thoughtstheseus wrote at 2021-12-02 15:40:47:

Perhaps trolling the entire web is not useful today? I’d love a search engine where I can whitelist sites or take an existing whitelist from trusted curators.

GordonS wrote at 2021-12-02 15:56:06:

Heh, I guess you mean "trawling" - trolling the entire web is something very different :)

hdjjhhvvhga wrote at 2021-12-02 16:03:52:

Then again, if you look at today's search results, where everything above the fold belongs to Google, maybe we have been trolled indeed.

rodiger wrote at 2021-12-02 16:04:20:

Depending on the intended metaphor, trolling could work too :)

https://en.wikipedia.org/wiki/Trolling_(fishing)

xwdv wrote at 2021-12-02 22:37:04:

What would trolling the entire web look like?

TechBro8615 wrote at 2021-12-03 05:15:39:

It would look like a modern search engine with innovative technology offerings like Accelerated Mobile Pages.

xwdv wrote at 2021-12-03 06:00:30:

Wow, you’re right. Trolling the entire web would involve an organization that carries considerable authority whose decisions can impact every member of the web.

AMP is the perfect way to troll websites into making shitty versions of their content, for no real reason other than just because you feel like it. And then when you’re satisfied with your trolling you just abandon the standard.

wbillingsley wrote at 2021-12-02 23:14:32:

reddit

giardini wrote at 2021-12-02 16:15:06:

"Trolling" is fine, see e.g.

https://grammarist.com/usage/trawl-troll/#:~:text=Troll%20fo...

.

GrinningFool wrote at 2021-12-02 20:28:16:

Not in this context - "trolling" as described there would apply to targeted indexing of a specific site; while "trawling" would refer to a wide net that attempts to catch all the sites.

romwell wrote at 2021-12-02 21:21:37:

Well, no, it's not fine.

See e.g. _the source you linked_, which explains the difference.

giardini wrote at 2021-12-03 17:11:12:

Did you read to the end? Methinks not!

romwell wrote at 2021-12-04 21:55:02:

>Did you read to the end? Methinks not!

Methink harder.

>Troll for means to patrol or wander about an area in search of something. Trawl for means to search through or gather from a variety of sources.

We were talking about _gathering_ information from a _variety of sources_ to build a search engine index.

erhk wrote at 2021-12-02 15:42:35:

Trusted curators is a dangerous dependency

klankoo wrote at 2021-12-02 15:54:32:

Trusted consumers are better. The original PageRank algo was organic and bottom-up. But now it's the person, not the page. Businesses compete for interaction, not inbound links. So if you can make a modern PageRank that follows interaction instead of links and isn't a walled garden, then I'd invest.

politician wrote at 2021-12-02 16:25:19:

I could make that work, but what do you mean by "walled garden" in this context?

hazza_n_dazza wrote at 2021-12-03 03:46:11:

the business and allies of google - those entrenched interests that limit the current visibility of the web to themselves

dcow wrote at 2021-12-02 21:17:24:

That's why you don't make it a hard dependency and let people curate their own list of taste makers. They can share and exchange info about who the good taste makers are, and good ones might even charge for access to exclusive flavors.

thoughtstheseus wrote at 2021-12-02 15:47:09:

It is. The alternative is scooping everything and using algos to curate. That seems worse imo.

sdfjkl wrote at 2021-12-02 15:54:35:

Perhaps vote on results like on Reddit posts? Gets the junk sites down (and out of the index eventually).

Retric wrote at 2021-12-02 16:17:20:

Any open voting system is going to be under serious SEO pressure.

That's the real issue: Google has indirectly infected the web with junk sites optimized for it. Any new search engine now has a huge hurdle to sort through all the junk, and if it succeeds the SEO industry is just going to target it.

A more robust approach is simply to pay people to evaluate websites. Assuming it costs, say, $2 per domain to either whitelist or block, that's ~$300 million for the current web, and you need to repeat that effort over time. Of course, it's a clear cost vs. accuracy tradeoff. Delist sites that have copies and suddenly people will try to poison the well to delist competitors, etc.
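For scale, here is the back-of-envelope arithmetic behind that figure as a tiny Python sketch (the domain count is an assumed order of magnitude, not a measured number):

  # Back-of-envelope behind the ~$300M estimate above.
  domains = 150_000_000    # assumption: domains worth reviewing
  cost_per_review = 2      # dollars to whitelist or block one domain
  print(f"${domains * cost_per_review:,}")  # -> $300,000,000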

Nasrudith wrote at 2021-12-02 22:56:07:

Adding a gatekeeper collecting rent isn't a solution - the people using SEO are already spending money to get their name up high on the list.

Retric wrote at 2021-12-03 01:23:49:

This is money spent by a search engine not money collected from websites. People don’t ever want to be sent to a domain parking landing page for example.

More abstractly, SEO is inherently a problem for search engines. Algorithms have no inherent way to separate clusters of websites set up to fake relevance from actually relevant websites. Personally I would exclude Quora from all search results, but even getting to the point where you're able to make that kind of assessment is extremely difficult on the modern web. Essentially, the minimum threshold for usefulness has become quite high, which is a problem as Google continues to degenerate into uselessness.

marginalia_nu wrote at 2021-12-02 16:46:46:

Given Reddit is notorious for its problems with astroturfing and vote bots, I don't think this is a particularly promising approach.

arein3 wrote at 2021-12-02 16:06:36:

Reddit is a community heavily gatekept by the mods with regard to specific topics.

1024core wrote at 2021-12-02 20:39:32:

Reddit is an extreme example of group think. Try posting something pro-Trump (I mean, surely even that guy has a positive thing or two to be said about him) and you'll get banned in some subs. Or you may get banned simply because the mod doesn't like the fact that you don't toe the party line.

notriddle wrote at 2021-12-02 16:08:45:

Also, vote bots

pessimizer wrote at 2021-12-02 17:23:09:

That just means that you have to curate the people allowed to vote. Otherwise, it would be rule by the obsessed and the search engine optimizers, and the junk sites will dominate the index.

I'm not convinced that Google's recursive AI algos aren't a functional equivalent. They let you vote by tracking your clicks.

dragonwriter wrote at 2021-12-02 15:52:00:

Plus, it scales less well than pure algorithmic search. This fight already happened, with a much smaller internet.

shituonui wrote at 2021-12-02 16:27:32:

It works really, really well for libraries. Research libraries (and research librarians) are phenomenally valuable. I've missed them any time I'm not at a university.

Both curators and algorithms are valuable. This goes for finding books, for finding facts and figures, for finding clothes, for finding dishwashers, and for pretty much everything else.

I love the fact that I have search engines and online shopping, but that shouldn't displace libraries and brick-and-mortar. Curation and the ability to talk to a person are complementary to the algorithmic approach.

dragonwriter wrote at 2021-12-02 19:26:17:

> It works really, really well for libraries

It scales extremely poorly. It works very well in situations where customers/sponsors are willing to spend lots of money for quality, because then the cost scaling doesn't matter as much; research libraries, LexisNexis, Westlaw, etc. all do this, but it's not cheap, and the cost scaling with the size of the corpus _sucks_ compared to algorithmic search.

It is among the approaches to internet search that lost to more purely algorithmic search, because it scales poorly in cost.

thoughtstheseus wrote at 2021-12-02 21:23:07:

+book stores. Curators can use algorithms to help them curate… Google’s SE is taking signals from poor curators imo.

Zamicol wrote at 2021-12-02 15:54:07:

How about just a meritocratic rating? Even here on HN I would appreciate some sort of weight on expert/experienced opinion. Although in theory I like the idea that every thought is judged on its own, the context of the author is more relevant the deeper the subject. That's one of the reasons I still read

https://lobste.rs

. It has a niche audience with industry experience.

Karrot_Kream wrote at 2021-12-03 00:54:24:

Lobsters is a great example of the benefits _and dangers_ of expert/experienced opinion. Lobsters is highly oriented around programming languages and security and leaves out large swaths of what's out there in computing. That's fine of course, but it creates a pretty big distortion bubble that's largely driven by the opinions of the gatekeepers on the site rather than a more wide computing audience.

skinnymuch wrote at 2021-12-03 01:11:12:

Nothing is meritocratic. I think the term came into our lexicons because of a sociologist satirizing society and writing about how awful a “true” meritocracy would be.

marginalia_nu wrote at 2021-12-02 16:13:40:

> meritocratic rating

That is literally PageRank.

skinnymuch wrote at 2021-12-03 01:13:06:

PageRank was mostly based on inbound links. A popularity contest with some nuance is still just that. Nothing is meritocratic, including any Google algo.

marginalia_nu wrote at 2021-12-03 13:23:37:

It's not merely a democratic vote, where the most links win; what the algorithm does is evaluate the links based on the popularity of the originating domain. In other words, a meritocratic rating.

You can apply the algorithm to any graph, and what it does is find the most influential nodes.
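For readers who want to see the idea concretely, here is a minimal power-iteration sketch of PageRank in Python. It illustrates the general algorithm marginalia_nu describes, not Google's production ranking; the toy graph and damping factor are made up.

  # Each node's score flows along its outgoing links, so a link from a
  # popular node carries more weight than a link from an obscure one.
  def pagerank(graph, damping=0.85, iterations=50):
      """graph: dict mapping each node to the list of nodes it links to."""
      nodes = list(graph)
      rank = {n: 1.0 / len(nodes) for n in nodes}
      for _ in range(iterations):
          new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
          for node, outlinks in graph.items():
              targets = outlinks or nodes      # dangling nodes spread evenly
              for target in targets:
                  new_rank[target] += damping * rank[node] / len(targets)
          rank = new_rank
      return rank

  # Toy graph: "hub" is linked from everywhere, so it ends up most influential.
  toy = {"a": ["hub"], "b": ["hub", "a"], "c": ["hub"], "hub": ["a"]}
  print(sorted(pagerank(toy).items(), key=lambda kv: -kv[1]))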

hawthornio wrote at 2021-12-02 15:56:55:

I’m really interested in this as well. I use DDG and whenever I’m doing research I tend to add “.edu” because there are so many spam sites.

zn44 wrote at 2021-12-03 13:03:02:

Ha, nice to hear this idea. I'm planning to work on this as a side project; just started recently.

laurent92 wrote at 2021-12-04 09:33:50:

If the user requests a website, you could at least crawl on request, which would be an excuse to bypass the rules in robots.txt. It would be a loophole, let’s say.

technobabbler wrote at 2021-12-02 15:48:03:

That's a great idea.

DavidCole1 wrote at 2021-12-02 16:45:48:

Interesting. I had some interest in building a search engine myself (for playing around, of course). I had read a blog post by Michael Nielsen [1] which sparked my interest. Do you have any written material about your architecture and stuff like that? Would love to read up.

[1]:

https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billio...

gbmatt wrote at 2021-12-02 16:55:22:

there's some stuff here:

https://github.com/gigablast/open-source-search-engine

entropie wrote at 2021-12-02 22:36:51:

Holy, that's a huge codebase. GitHub even shows no syntax highlighting for many .cpp files because they are so big.

I fiddled around and searched for some not-so-well-known sites in Germany, and the results were surprisingly good. But it looks really... aged.

kingcharles wrote at 2021-12-03 01:50:54:

Holy shit. Click on random .cpp file. Browser hangs. O_O

DavidCole1 wrote at 2021-12-02 16:58:09:

Thank you.

yumraj wrote at 2021-12-02 21:14:53:

> Cloudflare (owned in part by Google)

Please elaborate. Is there a special relationship between Cloudflare and Google?

spullara wrote at 2021-12-02 21:36:46:

Google Capital is an investor:

https://www.forbes.com/sites/katevinton/2015/09/22/google-mi...

yumraj wrote at 2021-12-02 23:00:58:

That is not the same as being owned by Google.

vitus wrote at 2021-12-02 23:19:47:

Especially since Cloudflare went public back in 2019, at which point any investors cashed out.

- Sincerely,

a Google employee who has nothing to do with the investment branch of the company

yumraj wrote at 2021-12-02 23:36:49:

> at which point any investors cashed out.

Well, actually that is also not true. At IPO, preferred stock converts to common, but the investors can keep their ownership; they can, but don't have to, cash out, or they can cash out only partially.

Investors can also keep board seats in many (or most?) cases.

mbreese wrote at 2021-12-03 01:54:51:

In this example, I don’t think it matters if Google Ventures kept their shares or not. So long as they are treated as any other stock holder, I don’t see an issue. If they still maintain a board seat, then there might be an issue, but I don’t see a problem with simply holding shares.

jefftk wrote at 2021-12-03 01:03:31:

I don't know anything about this particular case, but it's very common for VCs to cash out at IPO or not long after. VCs identify good investments among early stage companies; they don't want to keep their money tied up in investments outside of their specialty.

kragen wrote at 2021-12-04 08:57:40:

Actually, being an investor in a company _is_ the same as owning that company in part.

mrlinx wrote at 2021-12-02 15:57:20:

Where did you read that google/alphabet owns part of Cloudflare?

bloudermilk wrote at 2021-12-02 16:06:10:

Assuming OP is referring to Google Ventures' participation in at least one of Cloudflare's rounds.

https://www.crunchbase.com/funding_round/cloudflare-series-d...

collin128 wrote at 2021-12-02 16:15:11:

Have you ever looked at the Amazon file?

I'll see if I can track down the link but I remember somebody sharing a dump with me from Amazon that apparently was a recent scrape.

Edit:

https://registry.opendata.aws/commoncrawl/

web007 wrote at 2021-12-02 16:32:39:

That's Common Crawl. They do the spidering of some billions of webpages, but that's still a tiny percentage of the web versus Google or Bing.
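For anyone curious how to poke at that data, Common Crawl also publishes a queryable URL index. A rough sketch (the snapshot label CC-MAIN-2021-43 is just an example; each index record points at a WARC file plus byte offset in the public bucket):

  import json
  import requests

  # Query the public Common Crawl URL index for captures of one page.
  resp = requests.get(
      "https://index.commoncrawl.org/CC-MAIN-2021-43-index",
      params={"url": "example.com", "output": "json"},
      timeout=30,
  )
  for line in resp.text.splitlines():
      record = json.loads(line)
      # Each record names a WARC file plus offset/length; the raw capture
      # is fetchable with an HTTP Range request from
      # https://data.commoncrawl.org/<filename>.
      print(record["timestamp"], record["url"], record["filename"])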

visarga wrote at 2021-12-02 22:45:38:

Common Crawl is being used to train the likes of GPT-3 and to mine image-text pairs for CLIP. I wonder how much useful content is missing; we're going to use all the web's text, images and video soon, and then what do we do? We run out of natural content. No more scaling laws.

cschmidt wrote at 2021-12-02 21:11:04:

Do you have any stats on that? I've always wondered about the coverage of Common Crawl, if you include all the historical crawl files too.

collin128 wrote at 2021-12-03 04:08:49:

Oh interesting. I've played with it a little, but I'm not a dev and I've always wondered what the coverage was like.

1_player wrote at 2021-12-02 21:40:53:

If you're serious about this, add a paid tier. As long as it's free, I don't trust that you won't ever sell my data to make bank.

Nasrudith wrote at 2021-12-02 22:46:51:

Why do people think a paid tier will prevent their data from being sold on top of pocketing their money? Aside from that, if they go bankrupt, the data isn't theirs to withhold anymore, for one.

jermaustin1 wrote at 2021-12-02 21:45:51:

You are going to pay for generalized web search when DDG/Google/Bing/etc. are free?

1_player wrote at 2021-12-02 21:54:54:

Yes. I use Brave Search and I hope they add a paid tier, which I think they have confirmed they'll add at a later date.

If you don't pay, you are the product. Simple as that.

diamondage wrote at 2021-12-03 12:28:40:

Telegram, Signal and Mozilla are counterexamples... Have a large charitably donated cash balance sitting in your account, and your organisational motivation is all different.

BrendanEich wrote at 2021-12-04 19:35:35:

Mozilla Foundation does not fund Firefox; that's in an arms-length, wholly owned for-profit subsidiary, and Google is the main source of funding via the search deal.

pythux wrote at 2021-12-02 22:34:53:

https://twitter.com/brave/status/1466510541128548362?s=20

olyjohn wrote at 2021-12-03 01:14:39:

There are a lot of products you pay for where you are still the product.

duckmysick wrote at 2021-12-02 22:16:35:

> If you don't pay, you are the product.

If not enough people pay, there's no product.

1_player wrote at 2021-12-02 22:38:12:

If nobody pays, there's even less of it. Not sure what your point is.

Closi wrote at 2021-12-02 21:52:52:

I would - the problem with those services is that they prioritise the results that generate the most money for the search engine rather than giving me the best results, and then they index my searches to track and advertise to me throughout the web.

A clear pricing transaction sounds much nicer to me. Should generate better results too.

_HMCB_ wrote at 2021-12-05 04:58:27:

The Internet is such a fabric of society that I think all nations should contribute to a one-truth index. Not owned by a corporate entity. Tell me I’m wrong and we can consider the alternative: startups of all types with a more even playing field.

carlesfe wrote at 2021-12-04 09:57:56:

Great job. I didn't know about Gigablast and it looks very interesting. Can I give you a small piece of feedback? I just tried searching for myself on Gigablast, and the first results are profile pages which haven't been updated since about 2005. Meanwhile, my own personal page appears at the very bottom of the results.

So my suggestion would be to lower the weight of the domain in the ranking, and to promote sites which have a more recent update date.

Send me an email (contact in profile) if you want to follow up on this feedback!

melony wrote at 2021-12-02 16:36:41:

What we need is a net neutrality doctrine on the server side. Bandwidth is hardly scarce outside of AWS's business model. Ban the crawler user-agent dominance of the big search engine players. "Good behaviour" should be enforced via rate limiting that applies equally to all crawlers, without exemptions for certain big players.
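A minimal sketch of what "the same limit for everyone" could look like server-side, assuming a token bucket per client with identical parameters for every crawler, Googlebot included (the rates are illustrative):

  import time

  class TokenBucket:
      def __init__(self, rate=1.0, burst=5):
          self.rate, self.burst = rate, burst      # tokens/sec, bucket size
          self.tokens, self.stamp = float(burst), time.monotonic()

      def allow(self):
          # Refill proportionally to elapsed time, then spend one token.
          now = time.monotonic()
          self.tokens = min(self.burst, self.tokens + (now - self.stamp) * self.rate)
          self.stamp = now
          if self.tokens >= 1.0:
              self.tokens -= 1.0
              return True
          return False

  buckets = {}   # one bucket per client; no special-casing of big crawlers

  def allow_request(client_ip):
      return buckets.setdefault(client_ip, TokenBucket()).allow()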

ColinHayhurst wrote at 2021-12-02 16:48:02:

https://knuckleheads.club/

Thespian2 wrote at 2021-12-03 01:28:12:

I hadn't used Gigablast before, but a quick test had it find some very old, obscure stuff as the top hit. Well done. However, the link on the front page explaining privacy.sh comes up as "Not Private" in Chrome. The root Cisco Umbrella CA cert isn't trusted. Oops.

lloydatkinson wrote at 2021-12-02 15:46:18:

With a slightly fresher coat of paint this could be very popular. For example, no grey background.

Archelaos wrote at 2021-12-02 17:53:13:

I tried out four search words with your search engine, and I am not convinced that it is mainly the index size and not the algorithm that is to blame for bad search results. There are way too many high-ranking false positives. Here is what I tried:

  a) "Berlin": 

  1. The movie festival "Berlinale"
  2. The Wikipedia entry about Berlin
  3. Something about a venue "Little Berlin", but the link resolves to an online gaming site from Singapore
  4. "Visit Berlin", the official tourism site of Berlin
  5. The hash tag "#Berlin" on Twitter
  6. "1011 Now" a local news site for Lincoln, Nebraska
  7. "Freie Universität Berlin"
  8. Some random "Berlin" videos on Youtube
  9. The Berlin Declaration of the Open Access Initiative
  10. Some random "Berlin" entries on IMDb
  11. A "Berlin" Nightclub from Chicago
  12. Some random "Berlin" books on Amazon
  13. The town of Berlin, Maryland
  14. Some random "Berlin" entries on Facebook
  15. The BMW Berlin Marathon
  
  b) "philosophy"

  1. The Wikipedia entry about philosophy
  2. "Skin Care, Fragrances, and Bath & Body Gifts" from philosophy.com
  3. "Unconditional Love Shampoo, Bath & Shower Gel" from philosophy.com
  4. Definition of Philosophy at Dictionary.com
  5. The Stanford Encyclopedia of Philosophy
  6. PhilPapers, an index and bibliography of philosophy
  7. The University of Science and Philosophy, a rather insignificant institution that happens to use the domain philosophy.org
  8. "What Can I Do With This Major?" section about philosophy
  9. Pages on "philosophy" from "Psychology Today". I looked at the first and found it to be too short and eclectic to be useful.  
  10. The Department of philosophy of Tufts University
  
  c) "history"

  1. Some random pages from history.com
  2. "Watch Full Episodes of Your Favorite Shows" from history.com
  3. Some random pages from history.org
  4. "Battle of Bunker Hill begins" from history.com
  5. Some random "History" pages from bbc.co.uk
  6. Some random pages from historyplace.com
  7. The hash tag "#history" on Twitter
  8. The Missouri Historical Society (mohistory.com)
  9. Some random pages from History Channel
  10. Some random pages from the U.S. Census Bureau (www.census.gov/history/)
  
  d) "Caesar"

  1. The Wikipedia entry about Caesar
  2. Little Caesars Pizza 
  3. "CAESAR", a source for body measurement data. But the link is dead and resolves to SAE International, a professional association for engineering
  4. The Caesar Stiftung, a neuroethology institute
  5. Some random "Caesar" books on Amazon
  6. Hotels and Casinos of a Caesars group
  7. A very short bio of Julius Caesar on livius.org
  8. Texts on and from Caesar provided by a University of Chicago scholar
  9. (Extremely short) articles related to Caesar from britannica.com
  10. "Syria: Stories Behind Photos of Killed Detainees | Human Rights Watch". The photos were by an organization called the Caesar Files Group

So what I can see are some high-ranking false positives that somehow use the search term, but not in its basic meaning (a3, a11, b2, b3, d2, d3, d4, d6), or not even that (a6). Some results rank prominently although they are of minor importance for the (general) search term (a9, a13, b7, b8 -- perhaps a15 and d10). Then there are the links to the usual suspects such as Wikipedia, Twitter, Amazon, etc. (a2, a5, a8, a10, a12, a14, b7, c5, d1, d5); I understand that Wikipedia articles feature prominently, but for the others I would rather go directly to, e.g., Amazon when I am interested in finding a book (or use a search term like "Caesar amazon" or "Caesar books"). Well, and then there are the search results that are not completely off, but either contain almost no information, at least compared to the corresponding Wikipedia article and its summary (b4, b9, d7, d9), or are too specific for the general search term (c1, c2, c3, c4, c6, c9, c10).

That leaves me with the following more or less high-quality results (outside of the Wikipedia pages): a1, a4, a7, b5, b6, b10, and d8. The a15 and d10 results I could tolerate if there had been more high-quality results in front of them; but as a fourth and second good result, respectively, they seem to me to be too prominent. Also, in the case of "Berlin", a4 should have been more prominent than a1, and a7 is somewhat arbitrary, because Humboldt University and the Technical University of Berlin are likewise important; what is completely missing is the official website of the city of Berlin (English version at www.berlin.de/en/).

All in all, I would say that your ranking algorithm lacks semantic context. It seems the prominence of an entry is mainly determined either by just being from the big players like Twitter, Youtube, Amazon, Facebook, etc., or by the search term appearing in the domain name or the path of the resource, regardless of the quality of the content.

kilburn wrote at 2021-12-02 20:32:27:

I don't know about others, but when I think of the "good old google days" I'm _not_ expecting the results for your example queries to be any good.

In those days querying took some effort, but the effort paid off. The results for "history" just couldn't matter less in this mindset. You search for "USA history" or "house of commons history" or "lake whatever history" instead. If the results come up with unexpected things mixed in, you refine the query.

It was almost like a dialog. As a user, you brought in some context. The engine showed you its results, with a healthy mix of examples of everything it thought was in scope. Then you narrowed the scope by adding keywords (or forcing keywords out). Rinse and repeat. As a user, you were in command and the results reflected that.

The idea that the engine should "understand what you mean" is what took us to the current state. Now it feels like queries don't matter anymore. Google thinks it knows the semantics better than you, and steering it off its chosen path is sometimes obnoxiously hard.

1024core wrote at 2021-12-02 20:46:06:

> The idea that the engine should "understand what you mean" is what took us to the current state. Now it feels like queries don't matter anymore. Google thinks it knows the semantics better than you, and steering it off its chosen path is sometimes obnoxiously hard.

Bingo! If you cede control to Google, it _will_ do what it's optimized to do, and not what _you_ are looking for.

melenaboija wrote at 2021-12-03 16:06:15:

What it is optimized to do says nothing.

Optimizing for open text queries means dealing with a massive search space; the trick is choosing a subspace in which to search, and that is the part engines have to refine. How that is done is a different story. Some people may agree to let their location, search history and visits to online stores be used for this, but some may not.

Archelaos wrote at 2021-12-03 00:19:50:

This is why, in the good old days, my favourite search engine was AltaVista. In its left margin it had keywords arranged like a directory tree that could be used to further refine the search. So my ideal search engine should do something like this when I type in a generic term: provide me with relevant information about the general topic and then help me refine my search. The way Wikipedia provides a principal article and a structured disambiguation page is the way I would prefer.

I admit, my evaluation of the search engine was just a simple test of how much I can get out of the results for some generic keywords in the first place. A more detailed evaluation should, of course, look deeper. It was more of a trial balloon to see if this search engine raises any hope that it could be better than Google with regard to my own (subjective) expectations of a decent result set.

yellowstuff wrote at 2021-12-02 21:32:42:

I get what you mean, but part of the whole initial appeal of Google was that it gave much more relevant results than Altavista or the other options. That was why Google put in the audacious "I'm feeling lucky" button.

1_player wrote at 2021-12-02 21:39:07:

Yeah, but it's from that same philosophy that Google Search became useless, as it optimises for the first result.

There is no search engine that searches literally for what you asked and nothing else. Search is shit in 2021 because it tries to be too clever. I'm more clever than it; let me do the refining.

BbzzbB wrote at 2021-12-03 04:59:13:

>"I'm feeling lucky" button

My brain got so used to ignoring it that I completely forgot it's a thing. I'm also unclear on what it does: on an empty request, it takes me to their doodles page, and with text in the box, it takes me to my account history landing page.

Too wrote at 2021-12-03 06:13:59:

It automatically redirects to the first search result.

BbzzbB wrote at 2021-12-03 20:16:33:

Right, not sure why it wasn't working yesterday as opposed to now; I swear I wasn't doing it wrong (nor can I see how I could have been).

DarkmSparks wrote at 2021-12-02 22:24:42:

This was the result of two things: MapReduce, and using links to rank the pages.

Using links to rank the pages is not really possible any longer because of SEO spam links.

dgivney wrote at 2021-12-02 20:15:33:

I think you have some great feedback here, but for me it also highlights how subjective search results can be for individuals. For example, the false positives that you mention (b2, b3) appear as the top result on Google for me for that query.

It makes me think there must be some fairly large segment of the population that wants that domain returned as a result for their query, no?

Archelaos wrote at 2021-12-02 23:46:09:

I would not deny that a large part of subjectivity is involved. This is why I used several markers of subjectivity in my evaluation ("what I can see", "that leaves me", "they seem to me", "I would say", etc.). And related to that: I also agree with other responses that a search often needs to be refined. So my four examples were in no way an exhaustive evaluation, but an explorative experiment, where I just used two proper names, one for a city and one for a historical person, and two general disciplines as search words, in order to see what happens and what is noteworthy (to me). So much for the subjective side.

But what can be said about ideal search results for these terms beyond subjectivity? I do not think that we can arrive at an objective search result, but we are nevertheless allowed to criticise search results with respect to their (hidden or obvious) agenda.

Let me give an example from the good old days: when I was searching for my surname on Google in the early 2000s, the search results contained a lot of university papers or personal websites (then called "homepages") from other people of that name. But suddenly, I can't remember when exactly this was, the search results contained almost exclusively companies that had that surname in their company name. The shift was not gradual, as if it were representing a slow shift in the contents of the Internet itself, but abrupt. It was apparently due to an intentional modification of the ranking algorithm that put business far above anything else on the Internet.

My explanation for this is the following: the objective metric for Google search results is the stream of revenue they generate for Google. But not only for Google: the fundamental monetary incentive for Bing (and its derivative Ecosia) is more or less the same. And how different the impact of the somewhat different business model of Duck Duck Go is, is open for debate.

If maximum revenue is the goal, the aim is to provide the best search results according to the business model (advertising, market research and whatever else) without driving the users away. But the best search results according to the business model are not necessarily the optimal search results for the typical user. And as long as all relevant competitors are subject to the same economic pressure of maximizing revenue, the basic situation, and thus the quality of the search results for the user, will not improve above a certain level. If we want this situation to change, we need competitors with a different, non-commercial agenda: either from the public sector (an analogue to the excellent information services about physical books provided by libraries) or from non-profit organizations (an analogue to Wikipedia or OpenStreetMap).

To answer your question about b2 and b3: I checked with other search engines; besides Google, they appear for me also on Bing (as #8, the same product but on a different website) and Duck Duck Go (as #10); Bing also has a reference to them in the right margin as a suggestion for a refined search (this time exactly b2 and b3). Although I do not think that the results from those search engines should be considered a general benchmark for good search results, for the reasons given above, we may speculate why they appear on the first page of search results. I would guess that it is a combination of gaming the search engines by using a generic term as a product and domain name to get free advertising, and search engine algorithms making this possible by generally ranking products and companies high in their search results.

dgivney wrote at 2021-12-03 02:25:21:

Oh, of course you can criticise the result; I found it more interesting that a billion-dollar, optimized search experience thought your false positive was actually a top result. A huge variance in subjectivity between your experience and their invested reasoning.

But while we're speculating on how that domain appears at the top of the list, let me hazard a guess...

Philosophy.com was registered in 1999 and, according to the Wayback Machine, has been selling cosmetics on the site since 2000 (20+ years). The company sold in 2010 for ~$1B to a holding company with revenues of $10B+ today (unfortunately I couldn't find how much it contributes to that revenue). According to Wikipedia, the Philosophy brand has been endorsed by celebrities, including "long-time endorser" Oprah Winfrey, possibly the biggest endorsement you could get for their industry/demographic.

I think it is a long-established business with strong revenues, and there are more people online searching for cosmetic brands than for philosophers.

In the same way (admittedly in the extreme), when I'm researching deforestation and I query to see how things are going for the 'amazon', the top result is another successful company registered pre-2000, with strong revenues, that most likely attracts more visitors.

Archelaos wrote at 2021-12-03 12:19:25:

Okay, you convinced me that it should not (inter-subjectively) count as a real false positive, as I first thought.

Nevertheless, when I try to analyze what is going on here, I would rather use the word "context" instead of "subjectivity", since I think (or at least hope) that my surprise at finding this brand at place #2 in my Google results for "philosophy" is shared by quite a lot of people who lack the context to give it meaning, because the brand is unknown to them. I have the excuse that it is a North American brand irrelevant in my German context. Interestingly, when I search for "philosophy" on amazon.com (without refining the search), I get almost exclusively beauty products and related items as results, but when I search for "philosophy" on amazon.de it is only books. Google nevertheless has the beauty brand as #2 in Germany. Can we agree that Amazon is better at considering the context of the search for "philosophy" than Google?

As an aside: your "amazon" example reminds me of when I was searching for "Davidson" expecting to find information about Donald Davidson, but received a lot of results about Harley-Davidson. (But since I was aware of the importance of that brand, it was understandable to me.)

dgivney wrote at 2021-12-03 23:50:10:

We can agree on that, yes =)

I was thinking about this, and when you look at the top keyword searches on Google, they're dominated by people searching for brands each year, so I think Google is just naturally optimised for this. I think any search engine designed for the masses would probably have to behave like this too.

https://www.siegemedia.com/seo/most-popular-keywords

I agree. I think the early web was used more for general information rather than specific brand information (and was more useful for people like myself). I'm not sure what is needed to surface more results such as university papers or personal websites; I think people use the internet differently now and the link structure reflects that.

It's interesting that Google isn't used to search for people anymore (I couldn't see any people in the recent top-100 keyword search data).

Archelaos wrote at 2021-12-04 05:49:37:

Some observations:

Most of the "brands" in the top 100, especially at the top, are really Internet services. These search terms seem to have been entered not with the intention to "search" in the sense of finding new information, but as a substitute for a bookmark to the respective service. Whoever searches for #1 "youtube" does not want information about YouTube, but wants to use the YouTube website as a portal to find videos there.

I would also guess that most of these searches weren't initiated through the Google website, but directly from the browser's address/search bar or a smartphone app. They exhibit a specific usage pattern, but do not show what the people who entered them were really searching for, if they were searching at all. What are those people who search for "youtube" doing next? Either searching again on YouTube, or logging into their YouTube account and browsing their YouTube bookmarks.

The early Internet did not have so many different services people used on a daily basis, and those that existed were more diverse (think of the many different online email providers in those days), so the search terms spread out more. Also, browsers had no direct integration with a search engine. The incentive was higher to use bookmarks for your favourite service, since otherwise you had to use a bookmark to a search engine anyway.

Perhaps it would be more appropriate to compare the use of the early Google not with the current Google, but with the current Google Scholar?

Terretta wrote at 2021-12-03 00:12:29:

You inspired me to try an even less specific search: thing

Subjectively, the Gigablast results felt like a relative delight.

Archelaos wrote at 2021-12-03 00:40:33:

Not a bad idea. At the risk of being sidetracked: "philosophy" was not so bad a term either. Start with an arbitrary Wikipedia article and click on the first keyword of the summary after the linguistic annotations (or other annotations in brackets), and repeat the process until you reach a loop. You will almost always end at "philosophy" -> "metaphysics" -> "philosophy" -> ... This works for "Berlin", "history" and "Caesar" as well as for "thing". For the latter, very fast: "thing" -> "object" -> "philosophy".
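For anyone who wants to reproduce the walk, a rough Python sketch (it naively takes the first article link in the lead paragraphs; a faithful version of the game also skips links inside parentheses and italics, which this one ignores):

  import requests
  from bs4 import BeautifulSoup

  def first_link(title):
      """Return the title of the first article linked from the page body."""
      html = requests.get(f"https://en.wikipedia.org/wiki/{title}", timeout=30).text
      soup = BeautifulSoup(html, "html.parser")
      for p in soup.select("div.mw-parser-output > p"):
          for a in p.find_all("a", href=True):
              href = a["href"]
              if href.startswith("/wiki/") and ":" not in href:
                  return href[len("/wiki/"):]
      return None

  title, seen = "Thing", []
  while title and title not in seen:
      seen.append(title)
      title = first_link(title)
  print(" -> ".join(seen))   # usually ends at Philosophy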

gbmatt wrote at 2021-12-03 02:08:49:

that's tripped out. where did you hear about that?

Archelaos wrote at 2021-12-03 12:36:41:

I can't remember. Probably on Hacker News.

gbmatt wrote at 2021-12-03 01:51:14:

I'll admit I had not been working on the quality of single-term queries as much as I should have lately. However, especially for such simple queries, having a database of link text (inbound hyperlinks and the associated hypertext) is very, very important. And you don't get the necessary corpus of link text if you have a small index. So in this particular case the index size is, indeed, quite likely a factor.

And thank you for the elaborate breakdown. It is quite useful and very informative, and it was nice of you to present it.

And I'm not saying that index size is the only obstacle here. I just feel it's the single biggest issue holding Gigablast's quality back. Certainly, there are other quality issues in the algorithm, and you might have touched on some there.
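To make the link-text point concrete, here is a toy sketch of such a database (the URLs and scoring are invented for the example): every inbound link's anchor text is recorded against its target, and a single-term query is scored by how often the rest of the web describes a page with that word.

  from collections import Counter, defaultdict

  anchor_index = defaultdict(Counter)   # target URL -> anchor-word counts

  def record_link(target_url, anchor_text):
      # Called for every hyperlink seen while spidering.
      anchor_index[target_url].update(anchor_text.lower().split())

  def score(query):
      word = query.lower()
      return sorted(((url, c[word]) for url, c in anchor_index.items() if c[word]),
                    key=lambda kv: -kv[1])

  record_link("https://www.berlin.de/en/", "Official website of Berlin")
  record_link("https://www.berlin.de/en/", "Berlin city portal")
  record_link("https://en.wikipedia.org/wiki/Berlin", "Berlin")
  print(score("berlin"))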

Archelaos wrote at 2021-12-03 12:30:37:

Let me add just one thought on the single-term searches: I do not think that a good search result for terms such as "philosophy" should focus on the primary meaning of the term alone. As someone else has pointed out, the beauty brand can be quite important for some people. If we look at a search engine as a tool that needn't present me with near-perfect results from the outset, but as something I can have a dialogue with to find something, then it is best that results for single terms present me with a variety of different special meanings (and probably some useful suggestions on how to refine my search). Perhaps you can scrape the Wikipedia disambiguation pages and use them somehow to refine your search results.

odonnellryan wrote at 2021-12-02 23:34:10:

Let's compare with google:

  - Berlin:
  Wiki
  Berlin travel site (visit Berlin)
  website for Berlin
  Youtube videos
  Britannica for Berlin
  Bunch of US town sites named Berlin

  - Philosophy:
  Same skincare website is first result
  Wiki is second
  Britannica is third
  Stanford
  News stories
  Other dictionaries and encyclopedias

  - History:
  history.com is first result
  Then the "my activity" google site, maybe this is actually relevant
  Youtube, lots of history channel stuff
  Twitter history tag
  Wikipedia for "History"
  How to delete your Chrome browser history
  Dictionary definitions

  - Caesar:
  Wiki for Julius Caesar
  Britannica
  BBC for JC
  Google maps telling me how to get to Little Caesar's Pizza
  Dictionary
  Apparently some uni has a system called CAESAR
  biography.com
  Caesar salad recipe
  history.com
  images for Caesar

1024core wrote at 2021-12-02 20:53:22:

OK, I'll bite. How would _you_ rank the results for each of those queries?

cphoover wrote at 2021-12-03 04:08:01:

What heuristics or AI are being used to block your spider? If your spider appears human or organic, it will not be blocked, right?

Is this an issue of rate limiting, or request cadence? Could you add randomness to the intervals at which you request pages?

Is it more complicated? Do they use other signals to ascertain whether you are a script, like checking data from the browser (similar signals to the kind of things browser fingerprinting uses, e.g. screen resolution, user agent, cache availability, etc.)? Would it be possible for the browser to spoof this information?

I imagine rate limiting by IP address is the major issue... but could you not bounce the requests through a proxy network? I've tried this with the Tor network before when writing web scrapers and had mixed success... it seems like Google knows when a request is being made through Tor.

Perhaps you could use the users of your search engine as a proxy network through which to bounce the requests for scraping/indexing. This way the requests would look like they were coming from any of your users instead of one spider's IP address. I'm not sure how Cloudflare or any other reverse proxy could determine whether those requests were organic or not...

I'd be OK with contributing to a distributed search service so long as my CPU was not making requests to illegal content, and there were constraints put on the resource usage of my machine.

Sorry if this came off as all over the place; I don't know too much about the offense vs. defense of scraping. These are just some thoughts...
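On the cadence question, a hedged sketch of the simplest knob mentioned above, jittered delays so the fetch pattern isn't machine-regular (the delay values are arbitrary):

  import random
  import time

  import requests

  def polite_fetch(urls, base_delay=5.0, jitter=4.0):
      # Fetch each URL, sleeping a randomized interval between requests.
      for url in urls:
          yield requests.get(url, timeout=30)
          time.sleep(base_delay + random.uniform(0, jitter))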

rnotaro wrote at 2021-12-06 02:52:45:

> I've tried this with the Tor network before when writing web scrapers and had mixed success... it seems like Google knows when a request is being made through Tor.

That's because all the Tor entry/exit node and relay IP addresses are public [1].

[1]

https://metrics.torproject.org/rs.html#toprelays

hdjjhhvvhga wrote at 2021-12-02 16:06:48:

Regarding the gatekeeper problem: it's a wild guess, but if there were a way to involve users by organizing distributed scraping just for the sake of building a decent index, I'm sure many of them would help.

gbmatt wrote at 2021-12-02 16:24:40:

yes, large proxy networks are potential solutions. but they cost money, and you are playing a cat and mouse game with turing tests, and some sites require a login. furthermore, people have tried to use these to spider linkedin (sometimes creating fake accounts to login) only to be sued by microsoft who swings the CFAA at them. so you start off with an intellectual desire to make a nice search engine and end up getting sidetracked into this pit of muck and having microsoft try to put you in jail. and, no, i'm not the one microsoft was suing.

InfiniteRand wrote at 2021-12-02 16:24:48:

Not sure if you're looking for feedback, but the News search could use some work, I searched for "Ethiopia" and almost all of the articles were unrelated to Ethiopia except for the existence of some link somewhere on the page.

Your general web search seems pretty good, although I've just given it a casual glance. I think your News search could be improved by just filtering the general search results for News-related content, since the "Ethiopia" content I get there is certainly Ethiopia-related.

In any case, an interesting product, I'll try to keep an eye on it.

ma2rten wrote at 2021-12-02 20:13:17:

_It's much more expensive now to build a large index (50B+ pages)_

Do you have a cost estimate? Also, could you be more selective in indexing, e.g. by having users request sites to be crawled?

ampersandy wrote at 2021-12-02 20:16:21:

Requiring users to know what sites they want in advance somewhat defeats the purpose of a search engine, no?

robbomacrae wrote at 2021-12-02 20:20:43:

Not at all. You only have to fail the first request. It's an approach I took with my own attempt at a search engine way back! In fact, I know personally that there is at least one patent out there that suggests asking first-time requesters to provide the appropriate response, as an efficient way to teach systems for future users.

Obviously failing first requests isn't ideal, but for popular requests it quickly becomes insignificant. Wikipedia might want to make a similar suggestion (if they don't already) for users to contribute when they find a low-content or missing page.

lowwave wrote at 2021-12-03 01:48:35:

> Obviously failing first requests isn't ideal but for popular requests it quickly becomes insignificant.

The first request can also be handled asynchronously, displaying a message to the user that it is 'processing...'.
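
A sketch of that asynchronous flow, assuming an in-memory index and a hypothetical discover_and_fetch() crawl step (Python):

  import queue
  import threading

  crawl_queue = queue.Queue()
  index = {}  # term -> list of URLs; stands in for a real inverted index

  def discover_and_fetch(term):
      # Stub: a real version would crawl seed sites for pages about `term`.
      return []

  def search(term):
      """Serve from the index if possible; otherwise enqueue a crawl."""
      if term in index:
          return index[term]
      crawl_queue.put(term)
      return "processing... try again shortly"

  def crawl_worker():
      # Background worker resolves queued first-time queries into indexed pages.
      while True:
          term = crawl_queue.get()
          index[term] = discover_and_fetch(term)
          crawl_queue.task_done()

  threading.Thread(target=crawl_worker, daemon=True).start()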

ma2rten wrote at 2021-12-03 03:58:22:

More often than not I have an idea which site a result might be on when I issue a query:

If I search for a news event it's a news site.

If I search an error message, I know the result is going to likely be stackoverflow, github issues or the forum of the library.

etc.

I don't think this strategy will get you all the way there, but it could be combined with other ways of curating sites to crawl.

convolvatron wrote at 2021-12-02 21:43:42:

since sites are so desperate to be indexed, doesn't it seem better to put the onus on them to announce themselves? it would be great if dns registries published public keys .. maybe they do in newer schemes?

ma2rten wrote at 2021-12-03 03:42:18:

That works once your search engine is more widely used, but not a lot of sites are going to register with a niche search engine. Many users, on the other hand, really want a search engine like this and would be willing to invest some time.

fragmede wrote at 2021-12-03 05:20:36:

Certificate Transparency (CT) Logs are this.

kazinator wrote at 2021-12-03 01:31:04:

Is there a way to get the results to be formatted for desktop?

It looks like the layout is hard-coded for a mobile browser, in portrait mode.

znpy wrote at 2021-12-03 01:54:28:

I just looked myself up in your search engine and I can confirm that it finds stuff old enough that Google wouldn't find it (e.g. an old patch I submitted on GNU Savannah).

I tried looking up a game I'm interested in and the second results cluster from your search engine is a reddit thread about linux support for that game... I love this.

Great job!

kf6nux wrote at 2021-12-03 04:07:21:

What are your sources for hostnames to crawl?

I looked into it a long time ago and seem to remember there was a way to get access to registration records, but I imagine combining that with HTTP certificate transparency records would significantly increase your hostname list. Anything else?

justinzollars wrote at 2021-12-02 20:19:21:

This is great! I found something other engines do not pick up! Apparently I signed the Agile Manifesto in 2010:

https://agilemanifesto.org/display/000000190.html

tasogare wrote at 2021-12-03 02:36:52:

I just tried it, and the UI is kinda old and not mobile friendly, but the English results I got were satisfying. Not the case for French, though. I'll try again in the future; diversity in this landscape is important.

agencies wrote at 2021-12-03 00:52:48:

Re: crawling being too hard

Have you contributed your crawl data to common crawl?

kingcharles wrote at 2021-12-03 02:12:28:

I tried searching for an answer, but how do you get a site added to your directory? Who maintains it? Directories are a real PITA to maintain with any quality.

fsflover wrote at 2021-12-02 19:54:44:

> 2) Hardware costs are too high.

Which is why the next big search engine should be distributed:

https://yacy.net

.

fcantournet wrote at 2021-12-02 23:13:29:

"distributed" doesn't make things more hardware efficient...

It literally always make them less efficient.

If e.g : mastodon had the same number of users as Twitter it would use 10x the ressources for the same traffic.

betwixthewires wrote at 2021-12-03 00:02:12:

Sure, but it does spread the costs among users and make them more manageable. One guy shouldering the cost of a search index is less viable than letting users shoulder the costs. Some charge customers as a solution, and that works, but then they need a minimum revenue to continue, or have to take money from investors, which usually means changing direction and goals. The other option, letting people host portions of the index, spreads the cost out, and the product gets about as good (best case scenario) as its utility to people.

wruza wrote at 2021-12-02 20:08:46:

No way to test it right away; the demo peer returns 502s.

fsflover wrote at 2021-12-03 09:37:58:

You could search for other public-facing instances, e.g.,

http://sokrates.homeunix.net:6060

.

JPKab wrote at 2021-12-02 15:34:22:

Regarding the Gatekeeper companies like Cloudflare, it sounds like anti-competitive behavior that could potentially be targeted with anti-trust legislation, correct?

technobabbler wrote at 2021-12-02 15:46:26:

Cloudflare functions kinda like a private security company. They don't go around blocking sites willy-nilly; site owners have to specifically choose to use their service (and maybe pay for it), configuring the bot-blocking rules themselves.

That's not really Cloudflare's fault. Someone has to do it, whether it's them or a competitor or sys admins manually making firewall rules. Cloudflare just happens to be good enough and darned affordable, so many choose to use them.

Hosting costs for small site owners would be much more expensive without Cloudflare shielding and caching.

gbmatt wrote at 2021-12-02 15:55:20:

I've had extensive dealings with Cloudflare. They have a complex whitelisting system that is difficult to get on, and they also have an 'AI' system that determines whether you should be kicked off that whitelist, for whatever reason.

Furthermore, they give Google preferred treatment in their UIs and backend algos because it is the incumbent and nobody cares about other smaller search engines. So there's a lot of detail to how they work in this domain.

It's 100% Cloudflare's fault, and it's up to them to give everyone a fair shot. They just don't care. Also, you are overlooking the fact that Google is a major investor (as are Bing and Baidu). So really this exacerbates the issue. Should Google be allowed (either directly or indirectly) to block competing crawlers from downloading web pages?

danielmarkbruce wrote at 2021-12-02 22:50:08:

It isn't up to them to give everyone a fair shot. That isn't what their customers actually want. Cloudflare aren't in the "fair shots for all search engines" business. They are in the "stop requests you don't want hitting your servers" business.

gbmatt wrote at 2021-12-03 02:17:25:

I'd argue that a level playing field and more competition in the search space is a good thing.

technobabbler wrote at 2021-12-02 16:14:57:

These are all great points.

foxfluff wrote at 2021-12-02 16:14:56:

No, I think it is partially Cloudflare's fault because they offer this service and make it easy to deploy. This shit has exploded with Cloudflare's popularity.

Nobody _has_ to do it, but a lot of people will do it when they notice there's an easy way to do it. Cloudflare is very much an enabler of bad behavior here. Now a lot of sites just toggle that on without even thinking about collateral.

AtlasBarfed wrote at 2021-12-02 19:29:17:

"targeted with anti-trust legislation"

Um, this is America. Every market is basically a trust, cartel, or monopoly.

And I don't know if you can hear that, but there is literally laughter in the halls of power. All the show hearings by Congress on social media and tech companies have to do with only two things:

1) one political party thinking the other is getting an advantage by them

2) shaking them down for more lobbying and campaign donations

No one in the halls of power gives two shits about competition. Larger companies mean larger campaign donations, and more powerful people to hobnob with if/when you leave or lose your political office.

Of course I think that breaking up the cartels in every major sector would lead to massive improvements: more companies means more employment, more domestic employment, more people trying to innovate in management and product development, more product choice, lower prices, more competition, more redundancy/tolerance to supply chain disruption, less corruption in government, and possibly better regulation.

Every large company brazenly abuses its market up to one and only one limiter: the "bad PR" line. So I guess we have that.

astrange wrote at 2021-12-03 05:13:04:

Companies don't make campaign donations. The people "exposing" them are showing their employees making donations, and employees don't have the same interests as their employer.

gbmatt wrote at 2021-12-02 15:45:05:

it should be. there should be some sort of 'bots rights' to level the playing field. perhaps this is something the FTC can look into. but, as it is right now big tech continues to keep their iron grip on the web and i don't see that changing any time soon. big tech has all the money and controls access to all the data and supply chains to prevent anyone else from being a competitive threat.

look at linkedin (owned by microsoft, unspiderable by all but google/bing).

github (now microsoft using this to fuel its AI coding buddy, but if you try to spider this at capacity your IP is banned)

facebook (unspiderable)

.. the list goes on and on ..

and as you can see, data is required to train advanced AI systems, too. So big tech has the advantage there as well, especially when they can swoop in and corrupt once-non-profit companies like openai, and make them [partially] for-profit.

and to rant on (yes, this is what i do :)) it's very difficult to buy a computer now. have you tried to buy a raspberry pi or even a jetson nano lately? Who is getting preferred access to the chip factories? Does anyone know? Is big tech getting dibs on all the microchips now too?

danielmarkbruce wrote at 2021-12-02 22:46:57:

No, it is not.

Cloudflare is giving its customers what they want. They don't want all kinds of bots claiming to be search engines crawling their sites. Cloudflare isn't hurting Cloudflare's competitors by doing this. Cloudflare isn't hurting their customers by doing this. To repeat: most websites don't want lots and lots of crawlers. They want the 2 or 3 that matter and no more, because at some point it's difficult to tell what a crawler is doing (is it really a search engine?). They aren't obliged to help search engines. Even if Cloudflare weren't offering this, bigger customers would roll their own and do more or less the same thing.

adolph wrote at 2021-12-02 16:12:09:

At a theoretical level it looks like Cloudflare won't block search engine crawlers. The docs are very Google and Bing oriented and also oriented towards supporting their customers, not random new search engine crawler.

_Cloudflare allows search engine crawlers and bots. If you observe crawl issues or Cloudflare challenges presented to the search engine crawler or bot, contact Cloudflare support with the information you gather when troubleshooting the crawl errors via the methods outlined in this guide._

https://support.cloudflare.com/hc/en-us/articles/200169806

shashashasha___ wrote at 2021-12-02 15:42:36:

i would assume it's mostly anti-scraping protection, which is mostly for privacy.

you don't want to allow everyone to scrape your website and pull and use your info: for example from fb, ig, LinkedIn, github, ...

you can build a really big profiling db on people that way.

so websites need to know you are a legit search engine first

karmanyaahm wrote at 2021-12-02 15:52:04:

people can still be targeted if that information is public. anti-scraping sounds like security by obscurity

mandeepj wrote at 2021-12-03 03:54:55:

> Hardware costs are too high

I want to say you don't know what you are talking about. But that would be rude.

Hardware is much cheaper and powerful now compared to 2005.

BbzzbB wrote at 2021-12-03 04:27:44:

You've said it, and it is rude; what's the point of that first sentence except to spite him? I'm sure he's well aware of the price-per-capability trend since 2005; you don't code a search engine without knowing that. It could be the cost of servicing his free users and/or maintaining an ever-growing database/index that is expensive, in spite of relatively cheaper hardware.

gbmatt wrote at 2021-12-03 05:43:37:

the complexity of the search algorithm has also increased substantially since 2005. And, in 2005, a billion-page index was pretty big. Now it's closer to 100 billion.

systemBuilder wrote at 2021-12-03 08:23:46:

There were ~60B pages on Facebook in 2015; I think your numbers are outdated. - Google search SRE

skyde wrote at 2021-12-02 22:28:13:

what kind of index is Gigablast using?

traditional inverted index like Lucene or something more esoteric?

I know Google and Bing both use weird data-structure like BitFunnel

https://www.microsoft.com/en-us/research/publication/bitfunn...

gbmatt wrote at 2021-12-03 02:15:45:

100% custom.

alok-g wrote at 2021-12-03 22:30:22:

Oh my god! This works so much better than every Internet search engine I have tried.

mirker wrote at 2021-12-02 15:47:54:

If you have customers, does that mean the incremental gain from an improved index costs too much to store? Or are you talking about computational costs?

gbmatt wrote at 2021-12-02 16:13:48:

it's both storage and computational. they go hand in hand.

smt88 wrote at 2021-12-02 16:09:28:

What if you allowed trusted contributors to "donate" their browsing to your index?

AltaVista and Yahoo did that with browser plugins in the 90s.

varispeed wrote at 2021-12-03 10:45:38:

Make sure to file complaints to any competition market authority you have in your country.

readonthegoapp wrote at 2021-12-04 00:20:13:

did you ever try to raise funds? why/not? not accusing, just curious.

did you ever think, let me just focus on Italy-relevant results? or job search only? or some slice of search.

Minor49er wrote at 2021-12-02 21:33:19:

I really love how the results organize multiple matching pages from the same domain. This is really cool.

zandorg wrote at 2021-12-03 00:04:06:

I wanted to add my site to Gigablast, but it said it would cost 25 cents. How is this a good thing?

subsubzero wrote at 2021-12-02 21:40:44:

curious how you implemented the index: memory-based or disk-based? Either way you are right, hardware costs are extremely high, and you would need a lot of high-RAM/high-core-count machines to serve such a large index to end users with low latency.

beamatronic wrote at 2021-12-03 01:15:35:

Storing information about the pages you can't index, is also useful

1vuio0pswjnm7 wrote at 2021-12-02 21:53:42:

I really like GigaBlast.

I wrote a "meta" search utility for myself that can query multiple search engines from the command line.^1 It mixes the results into a simplified SERP ("metaSERP"), optimised for a text-only browser, with indicators to show which search engine each result came from. The key feature is that it allows for what I might call "continuation searches". Each metaSERPs contains timestamps in its page source to indicate when searches were executed, as well as preformatted HTTP paths. The next search can thus pick up where the previous one left off. Thus I can, if desired, build a maximum-sized metaSERP for each query.

The reason I wrote this is because search engines (not GigaBlast) funded by ads are increasingly trying to keep users on page one, where the "top ads" are, and they want to keep the number of results small. That's one change from 2005 and earlier. With AltaVista I used to dig deep into SERPs and there was a feeling of comprehensiveness; leave no stone unturned. Google has gradually ruined the ability to perform this type of searching with their now secretive and obviously biased behind-the-scenes ranking procedures.

Why is there no way to re-order results according to objective criteria, e.g., alphabetically? The user must accept the search engines' ordering, giving them the ability to "hide" results on pages the user will never view, or simply not return them. That design is more favorable to advertising and less favorable to intellectual curiosity.

Each metaSERP, OTOH, is a file and is saved in a search directory for future reference; I will often go back to previous queries. I can later add more results to a metaSERP if desired. I actually like that GigaBlast's results are different from other search engines'. The variety of results I get from different sources arguably improves the quality of the metaSERP. And, of course, metaSERPs can be sorted according to objective criteria.

This is, AFAIK, a different way of searching. The "meta-search engines" of yesteryear did not do "continuations", probably because it was not necessary. Nor was there an expectation that users would want to save meta-searches to local files. Users were not trying to minimise their usage of a website; they were not trying to "un-google".

Today's world of web search is different, IMO. There seems to be a belief that the operator of a search engine can guess what a user is searching for, that a user who sends a query is only searching for one specific thing, and that the website has an ad to match with that query. At least, those are the only searches that really matter for advertising purposes. Serendipitous discovery while perusing results is not contemplated in the design. By serendipitous discovery I do not mean sending a random query, e.g., adding an "I'm feeling lucky" button, which to me always seemed like a bad joke.

The only downside so far is that I occasionally have to prune "one-off" searches that I do not want to keep from the search directory. I am going to add an indicator at search time that a search is to be considered "ephemeral" and not meant to be saved. Periodically, these ephemeral searches can then be pruned from the search directory automatically.

1. Of course this is not limited to web search engines. I also include various individual site search engines, e.g., Github.
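
The utility itself is unpublished (see below), but something in its spirit could be sketched like this: merge source-tagged results, record a timestamp and a continuation offset, and save each metaSERP to disk (Python; fetch_results() is a hypothetical stand-in for the per-engine query-and-parse step, and the engine names are placeholders):

  import json
  import time
  from pathlib import Path

  def fetch_results(engine, query, offset):
      # Stub: a real version would query the engine and parse its result
      # page, returning a list of (title, url) pairs.
      return []

  def meta_search(query, engines=("gigablast", "mojeek"), offset=0, step=20):
      """Build a metaSERP: merged, source-tagged, objectively ordered, saved."""
      entries = [{"engine": e, "title": t, "url": u}
                 for e in engines
                 for t, u in fetch_results(e, query, offset)]
      entries.sort(key=lambda r: r["title"].lower())  # engine-independent order
      serp = {"query": query,
              "fetched_at": int(time.time()),  # timestamp for continuations
              "next_offset": offset + step,    # where the next search resumes
              "results": entries}
      out = Path("searches") / (query.replace(" ", "_") + ".json")
      out.parent.mkdir(exist_ok=True)
      out.write_text(json.dumps(serp, indent=2))
      return serp

A continuation search would reload the saved file, re-query each engine from next_offset, and extend the stored results.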

betwixthewires wrote at 2021-12-03 00:04:50:

Wow, do you happen to have published your utility so that other people can play with it?

1vuio0pswjnm7 wrote at 2021-12-03 01:08:03:

The problem is that (1) I am a minimalist and dislike lots of "features" and (2) I prefer extremely simple HTML that targets the links browser. Most users are probably using graphical, Javascript- and CSS-enabled browsers, so while this may work great for me, it may be of little interest to others who have higher aesthetic expectations. Another problem is that I prefer to write tiny shell scripts and small programs in C that can be used in such scripts. To be interesting to a wider audience, I would likely have to re-write this in some popular language I do not care for.

If I see people on HN complain about how few results they get from search engines, then that could provide some motivation to publish. I am just not sure this is a problem for others besides me.

Many results I get from search engines are garbage. By creating a metaSERP with a much higher number of results overall, from a variety of sources, I believe I get a higher number of quality ones.

betwixthewires wrote at 2021-12-03 01:30:44:

Well something like that would be interesting to a particular demographic. I prefer minimal aesthetic cruft as well, and like terminal stuff like links.

If you ever do decide to publish, be sure to post it here!

xwdv wrote at 2021-12-02 22:35:26:

How much cash do you need?

ramboldio wrote at 2021-12-02 16:31:33:

maybe just add small webpages to your index; don't bother to execute JS and don't download any images.

The content quality will be higher and it's a lot cheaper.

woutr_be wrote at 2021-12-02 17:52:30:

Out of curiosity, why would not executing JavaScript or not downloading images equal higher content quality?

1cvmask wrote at 2021-12-02 15:53:04:

Why do you have a user account with a login?

bullen wrote at 2021-12-02 20:48:50:

Do you have some sort of PageRank?

afrcnc wrote at 2021-12-02 15:48:36:

how recent are your results? 1-2h? 1 day?

gbmatt wrote at 2021-12-02 15:58:47:

it's continually spidering. just not at a high rate. actually, back in the day i had real time updates while google was doing the 'google dance'. that caused quite a stir in the web dev community because people could see their pages in the index being updated in real time whereas google took up to 30 days to do it.

garaetjjte wrote at 2021-12-02 16:01:44:

>Gigablast has teamed up with Imperial Family Companies

Associating with that crank (responsible for recent freenode drama) is very off-putting.

loo wrote at 2021-12-02 16:13:04:

Oh no, you see he isn't responsible, it's everyone else! /s

djbusby wrote at 2021-12-02 16:26:37:

I don't get it, what's the fuss here?

meepmorp wrote at 2021-12-02 19:38:17:

The guy who took over Freenode styles himself as the crown prince of korea; IFC is his company.

mrkramer wrote at 2021-12-02 15:39:39:

I'm sorry to say, but your project is 20 years old and it has had no impact at all. You are doing something wrong. Innovation and initiative are needed, a la Bitcoin and DeFi, not hobby projects which are not picking up in popularity and utility.

ErrrNoMate wrote at 2021-12-02 15:44:32:

Bitcoin and DeFi don't have utility outside of gambling and pump and dumps. Not everything (tbh not really anything) needs crypto.

jquery wrote at 2021-12-02 15:58:22:

Crypto’s biggest achievement is being the financial equivalent of the gulf war oil fires. Just massive pollution. Think of all the good things that computing could be used for… we used to have all kinds of interesting collaboration projects. Instead we are setting those CPU cycles on fire for short term profit.

Sohcahtoa82 wrote at 2021-12-02 16:31:09:

Imagine if all that processing power was used for Folding@Home.

The problem is that cryptocurrencies do not inherently need tons of processing power to operate. You could theoretically run the entire Bitcoin network on a Raspberry Pi. But the PoW algorithm was designed to always produce a block every 10 minutes, no matter how much hashing power was dedicated to the network. Everyone wanted a piece of the block reward pie, so the arms race was created.

Proof-of-stake algorithms would eliminate this problem entirely, but PoS is a shitty "rich get richer" method. Granted, with how expensive mining power is, even PoW results in the rich getting richer, but at least PoS doesn't result in the wasting of gigawatts of electricity.
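
The 10-minute cadence comes from periodic retargeting: every 2016 blocks, Bitcoin scales the difficulty by how far the actual timespan deviated from the expected one. A simplified sketch (the real code works on integer compact targets, but the clamp to a 4x move matches Bitcoin's rule):

  TARGET_SPACING = 10 * 60   # seconds Bitcoin aims for between blocks
  RETARGET_BLOCKS = 2016     # blocks between difficulty adjustments

  def retarget(old_difficulty, actual_timespan):
      """If the last 2016 blocks came too fast, raise difficulty in proportion."""
      expected = TARGET_SPACING * RETARGET_BLOCKS
      # Clamp the adjustment to a factor of 4 in either direction.
      actual = min(max(actual_timespan, expected // 4), expected * 4)
      return old_difficulty * expected / actual

  # Doubling the hashpower halves the timespan, so difficulty doubles:
  print(retarget(1.0, (TARGET_SPACING * RETARGET_BLOCKS) // 2))  # -> 2.0

This is why adding hardware never raises throughput; it only raises the difficulty, which is the arms-race dynamic described above.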

ilammy wrote at 2021-12-02 23:44:55:

> _Everyone wanted a piece of the block reward pie, so the arms race was created._

And that's intentional: getting people to pursue the goal for their own egoistic reasons, because that's bound to succeed. As a result, they all increase the security and stability of the network whether they want to or not; the only way not to do this is to not participate. If the network were running on a single Raspberry Pi, someone bringing two Raspberry Pis could effectively outcompete the other person on block rewards.

I'm not sure how this can be avoided without fundamental changes in society with regards to competition and adversity.

mrkramer wrote at 2021-12-02 15:52:42:

Read the Bitcoin whitepaper. Bitcoin was meant to decentralize trust and to eradicate fraud through a transparent decentralized database called the blockchain. It is certainly more impactful than hobby search engines, taking into consideration that Bitcoin was also a hobby project, but a really revolutionary one.

Go search what Larry Page said 20 years ago: If innovation is commercially successful it can have more widespread impact.

spiderice wrote at 2021-12-02 16:04:10:

So your response to the author saying "I'm trying to be commercially successful, but it's really hard for these reasons" is "You should try being commercially successful"?

Ok...

mrkramer wrote at 2021-12-02 16:20:50:

I respect his effort, but the project is 20 years old and still not commercially successful? There must be a reason behind it. The project is not good enough. Like I said, only innovation can displace Google. Innovation is not something new and different; innovation is something better.

_jal wrote at 2021-12-02 16:30:26:

The bitcoin brainworms do bad things to people.

I suggest you update your patter some, though. A good coin scam needs to sound a lot less dated.

S5yDyAk3XoQH5 wrote at 2021-12-03 01:01:03:

<div id=content style=padding-left:40px;>

</div id=box>

lmao, hopefully the C code isn't nearly as bad as your html

dang wrote at 2021-12-03 01:53:30:

Please make your substantive points without snark or swipes. We ban accounts that do the latter, because it's poisonous to the culture we're trying to develop here.

If you wouldn't mind reviewing

https://news.ycombinator.com/newsguidelines.html

and taking the intended spirit of the site more to heart, we'd be grateful.

S5yDyAk3XoQH5 wrote at 2021-12-03 13:13:28:

Probably been here longer than you, so that's really irrelevant. Anyway, every single page of his site has HTML errors; not pointing that out is more poisonous than doing so.

dang wrote at 2021-12-03 20:40:51:

HN users need to follow the site guidelines regardless of how long they've been here or how strongly they feel about HTML errors.

lolinder wrote at 2021-12-02 22:09:49:

The consistent theme every time this comes up is that dealing with the sheer weight of the internet is almost impossible today. SEO spam is hard to fight and the index gets too heavy. However, I wonder if this is a sign that we're looking at the problem wrong.

What if instead of even _trying_ to index the entire web, we moved one step back towards the curated directories of the early web? Give users a search engine and indexer that they control and host. Allow them to "follow" domains (or any partial URLs, like subreddits) that they trust.

Make it so that you can configure how many hops it is allowed to take from those trusted sources, similar to LinkedIn's levels of connections. If I'm hosting on my laptop, I might set it at 1 step removed, but if I've got an S3 bucket for my index I might go as far as 3 or 4 steps removed.

There are further optimizations that you could do, such as having your instance _not_ index Wikipedia or Stack Overflow or whatever (instead using the built-in search and aggregating results).

I'm sure there are technical challenges I'm not thinking of, and this would absolutely be a tool that would best serve power users and programmers rather than average internet users. Such an engine wouldn't ever replace Google, but I'd think it would go a long way to making a better search engine for a single user's (or a certain subset of users') everyday web experience.
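
A rough sketch of the hop-limited crawl described above, assuming the requests and beautifulsoup4 packages are available (the hop and page limits are placeholders):

  from collections import deque
  from urllib.parse import urljoin, urlparse

  import requests
  from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

  def crawl_from_trusted(seeds, max_hops=2, limit=500):
      """BFS outward from followed sites, stopping max_hops links from a seed."""
      frontier = deque((url, 0) for url in seeds)
      seen = set(seeds)
      pages = {}
      while frontier and len(pages) < limit:
          url, hops = frontier.popleft()
          try:
              html = requests.get(url, timeout=10).text
          except requests.RequestException:
              continue
          pages[url] = (hops, html)  # hand this off to the indexer
          if hops == max_hops:
              continue               # don't wander further from the trust set
          for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
              link = urljoin(url, a["href"])
              if urlparse(link).scheme in ("http", "https") and link not in seen:
                  seen.add(link)
                  frontier.append((link, hops + 1))
      return pages

The max_hops knob is the "levels of connections" dial: 1 for a laptop-hosted index, 3 or 4 with cheap bulk storage.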

djwayne35 wrote at 2021-12-02 22:49:02:

I agree, I think we are looking at the problem wrong. And this is a very insightful comparison with the linkedin levels of connections idea. I am working on something with this.

One thing to point out is that when we think of searching through information, we are searching through an information structure, aka a graph of knowledge. Whatever idea or search term we are thinking of is connected to a bunch of other ideas. All those connected ideas represent the search space, or knowledge graph, we are trying to parse.

So one way people have tried to approach this in the past is to make a predefined knowledge graph, or an ontology, around a domain. They set up the structure of how the information should be, and then they fill in the data. The goal is to dynamically create an ontology. I don't know if anyone has really figured this out, but Palantir with Foundry does something related: they sort of dynamically create an ontology on top of a company's data. This lets people find relationships between data and more easily search through their data. Check this out to learn more:

https://sudonull.com/post/89367-Dynamic-ontology-As-Palantir...

lessname wrote at 2021-12-02 22:21:46:

This might work well in some situations (e.g. research, development), but I think it would also increase the effect of echo chambers.

lolinder wrote at 2021-12-02 22:33:25:

Possibly, but I'm not convinced.

Google's not exactly working against the echo chamber problem, and I think that's because to do so would be to work against its own reason for existing. There are two goals here that are fundamentally at odds with each other:

1) Finding what you're looking for.

2) Finding a new perspective on something.

A search engine's job is to address the first challenge: finding something that the user is looking for. The search engine might end up serving both needs if they're looking for a new perspective on something, but if these two goals ever come into conflict with each other the search engine does (and I would argue it _should_) choose the first goal. Failing to do so will just lead to people ignoring the results.

jonathankoren wrote at 2021-12-02 22:54:32:

Part of the thing with echo chambers is that the search terms themselves can be indicative of a particular bubble. For example, there's a difference between the people that refer to the Bureau of Alcohol, Tobacco, and Firearms by the official initialism, "ATF", and those that use "BATF". There's a strong anti-gun-control bent to the `"BATF" guns` query, compared to the `"ATF" guns` query.

If you're indexing forums or social media, the same site is going to give back the bubbled responses, possibly without the person even being aware they're in a bubble.

https://www.google.com/search?q=%22BATF%22+guns&client=safar...

https://www.google.com/search?q=%22ATF%22+guns&client=safari...

GuB-42 wrote at 2021-12-02 23:20:33:

Kind of like when searching for "jew" on Google led to antisemitic websites; that's because Jews usually prefer the term "Jewish".

Interestingly, back then Google was big on neutrality and refused to do anything, stating that the results reflected the way people used the word. It was finally addressed using "Google bombing" techniques, something Google didn't care much about back then because of its low impact.

theduder99 wrote at 2021-12-02 22:45:22:

echo chambers are what most people want :)

dougSF70 wrote at 2021-12-03 00:43:39:

echo chambers are what most people want :)

Nasrudith wrote at 2021-12-02 22:31:54:

The retro idea of curation seems popular here, but everybody forgets why it lost out in the first place: it just doesn't scale. Not to mention demand: people usually want tools which lower mental effort and are intuitive, as opposed to ones which are precise but obtuse. Most would not find a hardware mouse that consisted of two keypads for X and Y coordinates plus left- and right-click buttons very useful.

Similarly, everyone maintaining their own index is cumbersome overkill in redundancy, processing power, and human effort, in return for a stunted network graph which is worse on all the metrics people usually actually care about. In terms of catching on, even "antipattern search engines" that attempt to create an ideological echo chamber would probably do better.

Short of search-engine experiments and startup attempts, the only other useful application I can see is "rude web-spidering" which deliberately disrespects all requests not to index pages left publicly accessible; search engines generally try not to be good tools for cracker wardriving, for PR and liability reasons. It would be a good whitehat or greyhat tool, as doors secured only by politeness are not secure.

hyperpallium2 wrote at 2021-12-02 23:04:02:

I like the idea of a subset of the web, and for a niche purpose. Not sure about user-hosting.

Capital is the huge barrier to entry today:

Larry Page's genius was to extend Google's tech, consumer-habit, and PR barriers to entry into a capital-based advantage: massive geo-distributed server farms giving faster responses. Consumers have a demonstrated, huge preference for faster responses.

le-mark wrote at 2021-12-02 23:40:10:

I've often thought the Alexa.com top-N sites would be a good starting point.

throwaddzuzxd wrote at 2021-12-03 10:58:59:

I wonder if we could use some kind of federation (ActivityPub?) to build an aggregate of the search indexes of a curated community. Something like a giant federated whitelist of domains to index.

mclightning wrote at 2021-12-02 22:35:34:

That's basically what I'm doing with my searches: "site:reddit.com". I wonder if anyone at Google is aware of this trend and taking notes.

copperx wrote at 2021-12-02 22:43:24:

I estimate that about half of my searches have either site:reddit.com or site:news.ycombinator.com at the end. In fact, I have an autocomplete snippet on my Mac so I don't have to type all that.

marksbrown wrote at 2021-12-02 22:55:54:

FYI this is exactly what the hashbangs in DDG do!

a-r-t wrote at 2021-12-02 22:41:28:

Reddit is missing a huge opportunity by not improving their crappy search functionality.

kokanee wrote at 2021-12-02 23:09:32:

What if we allowed users to upvote and downvote search results? Too many downvotes and you get dropped from the index.

krapp wrote at 2021-12-02 23:11:11:

Companies will simply hire people or purchase bots to downvote their competitors and upvote themselves, and then an entire economy will develop around gaming search engine algorithms, so that eventually search results will be completely useless.

Basically, SEO. SEO is the real problem, not search engine algorithms. Those algorithms are a result of the arms race between Google and black-hat SEO BS. Remove SEO and search engines work just fine.

skyde wrote at 2021-12-02 22:20:25:

what you are suggesting would make the problem of echo chamber (bubble) worse than it is today!

Nasrudith wrote at 2021-12-02 22:39:11:

Awkwardly, complaints about echo chambers as a problem tend to refer not to feedback dynamics (crudely but unambiguously referred to as circle-jerking) so much as to "People disagree with me, the nerve of them!" It is not viable to have parties A through Z sharing the same world and all having absolute control over all the others. We see this same complaint every time moderation comes up, let alone the fundamentals of democracy.

loonster wrote at 2021-12-02 23:01:02:

Bubbles are great if you are on the outside looking in at how a specific group thinks. Bubbles are horrible if you are on the inside trying to explore your thoughts.

supernovae wrote at 2021-12-02 22:46:42:

It's flawed from the get go if reddit is the basis.

loonster wrote at 2021-12-02 22:51:37:

As much as I like to hate on reddit (I'm a permanently suspended user), not every sub there is trash. There are some great subs there on very specific niche topics.

myridium wrote at 2021-12-03 02:12:45:

Badge of honour I'd say. What was your transgression?

loonster wrote at 2021-12-03 13:58:23:

Someone asked about the Hunter Biden files. I responded with g n e w s . c o m . It took a few weeks, but they finally found it and suspended me for it. Others they suspended for mentioning the news organization that mentioned gnews.

hardlianotion wrote at 2021-12-02 23:09:49:

I'm a permanently suspended member too (permanent for technical reasons), and I have never posted on there.

lawwantsin17 wrote at 2021-12-02 22:24:59:

I'm sure the algorithms are making echo chambers worse. Curating news opinion sites based on a prediction score of how often Chicken Little was right about the sky falling after the fact would surface reliable journalists and actual psychics!

BitwiseFool wrote at 2021-12-02 15:22:54:

Natural Language Processing is a pox on modern search engines. I suspect that Google et al. wanted to transform their product into an answer engine that powers voice assistants like Siri, and just assumed everyone would naturally like the new way better. I can't stand how Google is always trying to guess what I want, rather than simply returning non-personalized results based solely on exactly what I typed in the textbox.

While that may be good for most people, there is still a lot of power and utility in simple keyword-driven searches. Sadly, it seems like every major search engine _has_ to follow Google's lead.

marginalia_nu wrote at 2021-12-02 15:32:51:

I think _some_ NLP is strictly beneficial for a search engine. You may think "grep for the web" sounds like a good idea, but let me tell you, having tried this: manually going through every permutation of plural forms and manually iterating the order of words to find a result is a chore and a half.

Like, instead of trying

  PDP11 emulator
  PDP-11 emulator
  "PDP 11" emulator
  PDP11 emulators
  PDP-11 emulators
  "PDP 11" emulators
  PDP11 emulation
  PDP-11 emulation
  "PDP 11" emulation

Basic NLP can do that a lot faster without introducing a lot of problems.
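
A sketch of what "basic NLP" buys here, using a Porter stemmer to fold those variants into one set of index keys (assumes the nltk package; the normalization rules are illustrative):

  from nltk.stem import PorterStemmer  # assumes nltk is installed

  stemmer = PorterStemmer()

  def normalize(term):
      # Strip hyphens so "PDP11"/"PDP-11" collide, then stem so
      # "emulator"/"emulators"/"emulation" fold toward one index key.
      return stemmer.stem(term.replace("-", "").lower())

  def query_keys(query):
      """Turn a raw query into an order-independent set of stemmed index keys."""
      return frozenset(normalize(t) for t in query.split())

  # These all map to the same key set, so one lookup covers every variant:
  for q in ("PDP11 emulator", "PDP-11 emulators", "pdp11 emulation"):
      print(q, "->", sorted(query_keys(q)))

Lower-casing, hyphen-stripping, and stemming make word order and inflection non-issues, with far less machinery than full semantic search.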

I do think Google currently goes way overboard with the NLP. It often feels like the query parser is an adversary you need to outsmart to get to the good results, rather than something that's actually helpful. That's not a great vibe. However, I think the big problem isn't what they are doing, but how little control you have over the process.

kenny11 wrote at 2021-12-02 15:56:37:

I get that for general-purpose searches this is a good idea, but it would be nice if there was an easy way to disable this when you know you don't want it - for example, for most programming searches, if I type SomeAPINameHere the most relevant results will always be those that include my search term verbatim. I don't need Google to helpfully suggest "Did you mean Some API Name Here?", which will virtually always return lower-quality search results.

Early Google was a breath of fresh air compared to the stemming that its competitors at the time did, but nowadays even putting search terms in quotes doesn't seem to return the same quality of results for these types of queries that Google used to have.

thisisnotatest wrote at 2021-12-02 18:07:39:

I feel your pain. Two workarounds when Google gets it wrong are to put the term in quotation marks, or to enable Verbatim mode in the toolbelt. (I know various people have come up with ways to add "Google Verbatim" as a search engine option in their browser, or use a browser extension to make Verbatim enabled by default.)

Disclaimer: I work on Google search.

Y_Y wrote at 2021-12-02 21:13:02:

Both of these options are disappointing, in my experience. Verbatim mode seems weirdly broken sometimes (maybe it's overly strict), and quoting things is rarely enough to convince Google that you really want to search for exactly that thing and not some totally different thing that it considers to be a synonym.

One porridge is too hot and the other is too cold. I know Google could find a happy compromise here if it wanted to. In fact, I bet there's some internal-only hacked-together version that works this way and actually gives an acceptable user experience for the kind of people who have shown up to this thread to show their dissatisfaction.

vdqtp3 wrote at 2021-12-02 23:02:10:

Try this, go to Google and type in "eggzackly this".

Two results not containing "eggz" at all.

Two results containing "eggzackly<punctuation>this"

Two results containing "eggzackly" but missing "this".

Google Search is broken. It no longer does what it's directed to do; it just takes a guess. I suspect part of this is because someone decided that "no results found" was the worst possible result a search engine could give.

BbzzbB wrote at 2021-12-03 05:40:46:

Googling that in quotes, I get results containing "eggzackly this" ranked 3, 4, 6 (your comment) and 7, whereas the others contain just "eggzackly" (or with the 'this' preceded by punctuation, as you mention).

Therefore I don't see how your last sentence is the explanation (there _are_ results). I've also sometimes landed on "no results found" with overly precise quoted queries (for coding errors, mostly, IIRC). But it is annoying that quoting doesn't seem strictly enforced even when you want it to be.

KennyBlanken wrote at 2021-12-02 15:43:18:

Google does go way overboard with "NLP". Starting at least 5 years ago there was a trend toward "similar" matching and search result quality nose-dived.

You can search for, say, "cycling (insert product category here)" and get motorcycle related results. Why? Because to google "cycling" = "biking" and "motorcycles" are "bikes", bob's your uncle, now you're getting hits for motorcycle products.

Every time I try to do a very specific search I can see from the search results how google tries to "help", especially if the topic is esoteric. The pages actually about the esoteric thing I'm searching for get drowned in a sea of SEO'd bullshit about a word/topic that is 1-2 degrees of separation from each other in a thesaurus. I'm sure someone at google is very, very proud of this because it increases their measure for search user satisfaction X percent.

It does this thesaurus crap even with words in quotes, which is especially infuriating.

marginalia_nu wrote at 2021-12-02 15:50:22:

Yeah. It's one of those things that's invisible when it works and enraging when it doesn't. That's generally not a desirable failure mode; it should at least require an extremely low failure rate to be justified.

JohnHaugeland wrote at 2021-12-02 15:50:31:

"Basic NLP can do that a lot faster without introducing a lot of problems."

This is called "stemming" and is not sensibly approached with machine learning.

marginalia_nu wrote at 2021-12-02 15:52:43:

Of course, but stemming is a fairly basic technique in NLP, as is POS-tagging. NLP is not machine learning.

brokensegue wrote at 2021-12-02 16:16:22:

Modern NLP basically is machine learning

marginalia_nu wrote at 2021-12-02 16:19:24:

You can still do NLP without machine learning though, and a lot of the sorts of computational linguistics a search engine needs for keyword extraction and query parsing doesn't require particularly fancy algorithms. What it needs is fast algorithms, and that's not something you're gonna get with ML.

JohnHaugeland wrote at 2021-12-02 16:16:59:

Stemming is not meaningfully a natural language processing technique, any more than arithmetic is a technique of linear equations.

necovek wrote at 2021-12-02 16:51:58:

At the very least,

https://en.wikipedia.org/wiki/Natural_language_processing

seems to disagree.

(So do I: NLP does not have to be machine learning/AI based)

marginalia_nu wrote at 2021-12-02 16:24:43:

Is it not the processing of natural language?

JohnHaugeland wrote at 2021-12-02 16:37:07:

Would you call addition a system of linear equations?

No, you don't use the college-senior label for the high-school-freshman topic. You use the smallest label that fits.

It's string processing.

NLP is actually understanding the language. Stemming is simple string matching.

Playing the technicality game to stretch fields to encompass everything you think is even marginally related isn't being thorough or inclusive; it's being bloated, and losing track of the meaning of the term.

Splitting on spaces also isn't NLP.

marginalia_nu wrote at 2021-12-02 16:43:36:

Stemming is a task specific to a natural language. You can't run an English stemmer on French and get good results, for example.

All NLP is, strictly speaking, more or less elaborate string matching.

> Splitting on spaces also isn't NLP.

String splitting can be, but it's a bit borderline. I'll argue you're in NLP territory if it doesn't split "That FBI guy i.e. J. Edgar Hoover." into four "sentences".
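
To make the borderline concrete: a purely mechanical splitter gets that sentence wrong, and fixing it requires knowledge of the language, however crude (a toy illustration; the abbreviation list is obviously incomplete):

  import re

  text = "That FBI guy i.e. J. Edgar Hoover."

  # Purely mechanical: split after every period followed by whitespace.
  print(re.split(r"(?<=\.)\s+", text))
  # -> ['That FBI guy i.e.', 'J.', 'Edgar Hoover.']

  ABBREVS = {"i.e.", "e.g.", "j.", "dr.", "mr."}  # toy list, far from complete

  def split_sentences(text):
      """Abbreviation-aware splitting: the point where string processing
      starts needing knowledge of the language."""
      parts, current = [], []
      for tok in text.split():
          current.append(tok)
          if tok.endswith(".") and tok.lower() not in ABBREVS:
              parts.append(" ".join(current))
              current = []
      if current:
          parts.append(" ".join(current))
      return parts

  print(split_sentences(text))  # -> ['That FBI guy i.e. J. Edgar Hoover.']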

necovek wrote at 2021-12-02 17:01:27:

> NLP is actually understanding the language.

That's actually not an accepted terminology. There's, indeed, this:

              https://en.wikipedia.org/wiki/Natural-language_understanding

Not sure why you are so adamant that yours is the "true meaning", when NLP existed long before machine learning and AI were used for it. And even if not, every term can be defined differently, so it should be normal for different institutions/people to define NLP differently.

JPKab wrote at 2021-12-02 15:35:50:

Semantic search requires NLP. So does the Q&A format the OP is complaining about. People conflate all things NLP to the latter, and forget about the former.

BitwiseFool wrote at 2021-12-02 15:37:41:

Maybe I'm not using the right qualifiers around the term NLP. The kind of NLP I was referring to is something like "Hey google, what is natural language processing?" and orienting the search around people asking questions in standard(ish) English like they would to another person.

gk1 wrote at 2021-12-02 16:35:27:

That's known as Open Domain Question Answering[1] and is only a subset of NLP.

[1]

https://www.pinecone.io/learn/question-answering/

marginalia_nu wrote at 2021-12-02 15:48:09:

NLP is very heavily integrated into search, so I don't think it's really possible to decouple them. But I agree the whole BonziBuddy thing they've got going now is annoying and it's especially unfortunate how it's replaced the search functionality. I'd have a lot more patience with it if I could choose this functionality when I wanted to ask a question.

wpietri wrote at 2021-12-02 16:26:28:

I doubt they assumed it was better. I expect they did a ton of user testing and found that it was better for most people. And I'm sure it is. HN users are very much a niche audience these days.

gk1 wrote at 2021-12-02 16:37:07:

Right. Bing switched to this method as well, as did Facebook, Twitter, Amazon, and pretty much every other company that has the ML resources to do this. They obviously had a good reason to do so, beyond assumptions.

maxlamb wrote at 2021-12-02 15:43:11:

What’s a pox?

rocqua wrote at 2021-12-02 16:07:35:

Saying X is a pox on Y means saying X is bad for Y.

It originates from the disease 'the pox'.

mattanimation wrote at 2021-12-02 16:27:36:

a disease or plague

vincent_s wrote at 2021-12-02 15:26:16:

Some people try:

https://www.mojeek.com/

https://fireball.com/

https://search.brave.com/

ColinHayhurst wrote at 2021-12-02 15:53:08:

Mojeek founder story here:

https://blog.mojeek.com/2021/03/to-track-or-not-to-track.htm...

No tracking and independent from the start. Now at 4.6 billion pages, with our own infrastructure and IP. Went to market in 2020 with contextual ads and an API. Self-disclosure: CEO.

snovv_crash wrote at 2021-12-02 16:22:00:

HN is wild: 30m after something is mentioned, the CEO chimes in.

kingcharles wrote at 2021-12-03 02:36:55:

Now, if we could just get that on the Facebook thread... ;)

prox wrote at 2021-12-02 16:32:15:

Never heard of Mojeek. I will try it for a month and see how it works. Currently using DDG 99% of the time.

no_time wrote at 2021-12-03 08:47:58:

For a fully independent indexer, probably the best results I have seen so far. For me the minimum baseline is searching for "444": if it doesn't return 444.hu as the first result, it's a no-go.

bullen wrote at 2021-12-02 20:51:41:

Do you use some sort of PageRank?

ColinHayhurst wrote at 2021-12-03 08:54:07:

Yes, something conceptually similar to PageRank but our own thing which we call Gravity.

gompertz wrote at 2021-12-03 02:17:40:

Mojeek is returning great results I'm not seeing from any other search engine!

7373737373 wrote at 2021-12-02 15:52:11:

Time for an

https://github.com/sindresorhus/awesome

search engines?

abhaynayar wrote at 2021-12-02 16:27:28:

I'm probably the only person who doesn't think Google search has deteriorated. I play security CTFs, so a lot of the time I have to search for peculiar technical details of various software. Like any other human being, I also make generic queries. In both cases, I feel like I almost always get to the desired webpage within the top few results.

Ellipse0934 wrote at 2021-12-02 17:12:36:

It honestly depends on what you are searching for.

Case 1: You just want the name of a website or an article, e.g. "Facebook" -> fb.com, "Gordon Ramsay" -> wiki/official website/celeb gossip site. You are good; not much competition here.

Case 2: You are looking for something technical like "GNU rnano CVE-abcde" or "OpenBSD ARM64 Qualcomm WiFi driver not working". You are again in fine territory; there's not much, if any, money to be made here, so much less competition. The official forums, websites, and maybe some conference sites will be in this category.

Case 3: "Chicken potpie recipe", "How to be more organised": This is the category where people are trying to game the SEO algos. How the hell do recipe websites with 27 popups, 12000 word essay on the secret family history ends up on top ? There are a huge number of passionately made simple recipe websites but they have to be "found" by us. For the second query I mention about being more organized I think most people are looking for some sort of a review article which looks at some various schools of thoughts regarding discipline, cleanliness pointing to further resources and exploring the why and what to do for this. Here the search engine needs to determine the context of the query which is fairly abstract and then the internal heuristics it uses are supposed to drive it to a meaningful list of websites. Maybe the average joe would like to click cosmopolitan's article but I would never do that. Based on my previous click history maybe google should determine what I kind of links am I looking for. But when they figure that out they'd much faster use this behavioral insight for advertisers. A great search engine is basically a primitive personal librarian, I'd pay a yearly subscription for one.

The internet is vast, and it has stuff that I don't know about. How my 7-word abstract query is going to get me there is the question. Also, for a lot of queries the top results can be plagued by spammy/fraudulent results which are on top because they managed to trick the SEO algos. These bad actors were not as prevalent for 2005 Google.

jeffbee wrote at 2021-12-02 16:30:45:

Well no, it's you and me and the whole Google search quality evaluation team and everyone who works on Google search, and like 99% of the general public as well. The meme of falling search quality infects only HN. Mostly what people are complaining about is that the quality of the web itself is in free fall.

tsian2 wrote at 2021-12-03 00:56:21:

It's always worked fine for me when it comes to finding simple things with a search. I think it's deteriorated in some ways though. I don't find the advanced search operators reliable anymore (eg. give me all the news about a topic published between certain dates) and I think it caps collections of things very early now, rather than returning the "billions" of results it says it has (eg. give me more than the 1000 most popular cat memes that I've seen before, or all the books about beaches).

pydry wrote at 2021-12-02 15:32:09:

Early 2000s google index ran in a garage. The current google index has dedicated power stations.

It's a bit like the car industry - you could run a startup from your garage in the early days but you need titanic amounts of capital to compete now thanks to vertical integration.

Major governments and billionaires can compete but everybody else is locked out of the market (most "startups" use bings index).

mleonhard wrote at 2021-12-03 08:28:28:

Google's datacenters are huge because they save user behavior data, not because their web search index is particularly big. Also, Google Search wastes a lot of resources on the "search as you type" feature.

Running a search engine in your garage is feasible today because hardware and connectivity have improved much faster than the size of the WWW.

pydry wrote at 2021-12-03 10:55:40:

It's the frequency of updates that chews power.

Also, that user data is used to improve search results and mitigate webspam that didn't exist in 2005.

R0b0t1 wrote at 2021-12-02 21:31:55:

I was thinking about exactly that. If they used a simpler index, would they get better results? There's not a lot of selective pressure, so they just keep adding to the index algorithm.

abdel_nasser wrote at 2021-12-03 00:48:01:

whatsapp was run out of a single cabinet.

rovingEngine wrote at 2021-12-02 15:38:27:

I think Google was “better” from a users point of view in 2005 because it wasn’t that good at selling ads yet. I still remember the epiphany of the first time I used Google in 1999. It was amazing.

I’ve thought the same about pre-ad Twitter and Facebook.

Early on, startups with free services look a lot like non-profits and just maximize user benefit to grow. The problem is they’re not non-profits, and have to make money at some point. That has tended to mean ads.

I’d easily pay, say, $9/mo to have access to an ad-free search engine that made me feel the way 1999 Google did.

mmmmmbop wrote at 2021-12-02 16:05:52:

$9/mo is not going to cut it. Google's domestic annual revenue per user in 2019 was $256. [0] That's $21.33 per month. Not all of Google's revenue is from Ads, of course, but the vast majority is. (Let's ignore for now the valid counterpoint that Ads are increasingly served on other Google properties than Search.)

But even charging users $21.33/mo for an ad-free search experience most likely wouldn't be enough. By providing such an option, you'd greatly reduce the value of the remaining Ads pool.

The optimistic perspective on this is that if you are one of the users with disposable income, you're essentially subsidizing a great search engine and a suite of other tools for the less well-off ones.

[0]

https://miro.medium.com/max/6545/0*YTqXb-F5UiVhtlIS

rovingEngine wrote at 2021-12-02 16:51:56:

Let’s say ads will always make more money (I have no reason to believe they won’t), and that’s required to be the dominant search engine because the web is big and expensive to organize.

I’d bet there’s some way to characterize what I and others liked about the earlier web and create a search engine that just worries about that stuff. I’d pay $9/mo for whatever 1/3 of Google’s spend per user would get me. That’s not to say this thing would “beat” Google, but it could profitably exist.

themacguffinman wrote at 2021-12-03 00:02:07:

I doubt it. 1/3 of Google's spend per user isn't enough when you can't attract many paying users in the first place: you would have to charge much more than $9/mo, because almost no one wants to pay for a search engine, so your revenue has to cover those people too. And then even fewer people are willing to pay more than $9/mo for 1/3 of the quality.

And then I'd guess the 20 remaining users will still complain, because 1999 Google is a nostalgic memory that is impossible to recreate without a 1999 internet for a 1999 self to live in, and has little to do with raw search quality.

wodenokoto wrote at 2021-12-02 15:24:20:

The web has changed drastically. I’d imagine 2005-google engine today would be nothing but abandoned Wordpress blogs with comment spam.

warning26 wrote at 2021-12-02 15:58:57:

I suspect this is exactly it—a lot of what made 2005-era Google good wasn't necessarily Google's own doing. It was that SEO people hadn't yet fully figured out how to game the system.

If you took an exact copy of Google circa-2005 and had it crawl today's web, you'd probably get mostly "SEO optimized" irrelevant blogspam.

ghaff wrote at 2021-12-02 15:31:34:

And even more copy-pasted spam than already exists.

The early Google (and other even earlier search engines) were invented for an Internet world which, if not pristine and pure, was at least mostly fairly legit content. Today's Internet is probably 90% deliberate spammers and scammers.

nfriedly wrote at 2021-12-02 15:50:11:

I think DuckDuckGo is closer to what you want. Same results for everyone, better privacy, and they're proactive about improving their results.

https://duckduckgo.com/

Part of the problem is that there's a lot more low-quality content to wade through now than there was in 2005. I think the Google of 2005 would have trouble delivering quality results today also.

DavideNL wrote at 2021-12-02 16:02:35:

> a lot more low-quality content

I wish there was an easy way to filter ALL search results, by permanently excluding specific websites, and/or keywords.

Surely there has to be some browser extension that does this...
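
For what it's worth, this kind of personal blocklist is easy to prototype as a post-filter over whatever backend you already query. A minimal sketch in Python, where `raw_search` and the blocklist contents are hypothetical placeholders:

    # Sketch: filter any search backend's results through a personal blocklist.
    # `raw_search` stands in for whatever engine or API you already use.

    from urllib.parse import urlparse

    BLOCKED_SITES = {"pinterest.com", "quora.com"}       # domains to always exclude
    BLOCKED_WORDS = {"top 10", "you won't believe"}      # title keywords to exclude

    def filtered_search(query, raw_search):
        for result in raw_search(query):                 # result: dict with "url", "title"
            host = urlparse(result["url"]).hostname or ""
            if any(host == d or host.endswith("." + d) for d in BLOCKED_SITES):
                continue
            if any(w in result["title"].lower() for w in BLOCKED_WORDS):
                continue
            yield result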

MayeulC wrote at 2021-12-02 16:34:17:

Excluding, or penalizing for, advertising and trackers could do wonders against perverse incentives and SEO, IMO. It would also be a better experience for the reader.

BoxOfRain wrote at 2021-12-02 16:18:36:

https://news.ycombinator.com/item?id=29404860

Not got round to trying it yet though.

DavideNL wrote at 2021-12-02 16:51:12:

Great and it even supports iOS…!

Kiro wrote at 2021-12-02 15:54:23:

Try searching for the same thing from your computer and your phone and you will get different results. Also, their results come from Bing so any improvement happens at Microsoft.

JohnFen wrote at 2021-12-02 16:02:02:

They do use Bing, but not solely Bing. DDG isn't just a frontend to a different search engine.

bla3 wrote at 2021-12-02 16:39:51:

It's a bing frontend with a few special cases handled differently. For most queries, you get bing results. Easy to check by comparing results.

kevin_thibedeau wrote at 2021-12-03 00:49:23:

I see Russian sites from Yandex all the time on DDG.

Sunspark wrote at 2021-12-02 15:55:09:

This. DDG is my primary search engine now and has been for a while.

I don't use Google anymore to search unless I really need to. The algorithms they use today are not the classic ones that actually returned results.

JohnFen wrote at 2021-12-02 16:03:24:

Same. For the sorts of searching I do, anyway, the results I get from DDG tend to be better than what I get from Google. Google tries to infer what I want rather than take me at my word, and is very bad at it.

kspacewalk2 wrote at 2021-12-02 16:02:03:

And if you really need to, DDG !bangs[0] make a search as simple as "!g mother google help me". The keyword thing is also available in Firefox as a browser feature, and elsewhere I'm sure, but nevertheless, it makes switching to DDG easier.

(Plus I can directly go to the wiki page by using "!w", "!gm" for google maps, etc.)

[0]

https://duckduckgo.com/bang

eevilspock wrote at 2021-12-02 16:57:15:

The only bang I use is !gvb since DDG doesn't support verbatim searches.

jay3ss wrote at 2021-12-02 21:15:57:

Is this the same as enclosing the terms in quotes and using the !g bang?

eevilspock wrote at 2021-12-03 00:31:35:

It's a Google "verbatim" search. I don't know if enclosing each term separately in quotes does the same thing, but this is easier anyway.

jay3ss wrote at 2021-12-03 00:59:04:

I didn't know about the verbatim search. I'm going to give it a try, thanks

jpswade wrote at 2021-12-02 21:06:09:

DuckDuckGo isn’t really a search engine; it’s a website that uses Bing’s API.

cyberbanjo wrote at 2021-12-02 21:46:56:

Not just Bing, but nearly every search engine you've ever used

https://duckduckgo.com/bang?q=

ricardo81 wrote at 2021-12-02 16:02:13:

Does DDG have any of its own organic results yet, or is it still entirely Bing/Yandex?

MuffinFlavored wrote at 2021-12-02 16:01:03:

> I think the Google of 2005 would have trouble delivering quality results today also.

What would you attribute their modern 2021 success to, then? Just throwing a ton of money at amazing engineers to hone their complex algorithm so it still returns what we humans judge to be "good" results? Especially if they are wading through a sea of low-quality content, as you say.

gbmatt wrote at 2021-12-02 16:02:13:

both ddg and brave are bing (microsoft) in disguise.

pythux wrote at 2021-12-02 16:35:28:

This is not correct. Brave Search owns its own (growing) index and relies on third-parties like Bing for some fraction of the requests. Which is not the same thing as relying fully on Bing or third-parties for results like so many meta-search engines. More detailed answer here:

https://search.brave.com/help/independence

Edit: Forgot to say that I work on Brave Search.

gbmatt wrote at 2021-12-02 16:47:14:

brave 'falls back' to bing. which in my experience is most of the time. in fact, out of all the queries i did a while back, they all seemed to come directly from bing. is there a way to disable the reliance on bing and get pure 'brave only' results? and can you be more specific as to what this fraction is? do you blend at all?

pythux wrote at 2021-12-02 17:02:50:

You can check exactly which fraction of the results were fetched from Brave's index vs. third parties using the "independence score" found in the settings drawer (opened via the cog icon at the top right of any page on search.brave.com). There you will find a global and a personalized independence score (aggregated over all users and over your queries only, respectively).

Explanation is also found here with screenshots:

https://search.brave.com/help/independence

gbmatt wrote at 2021-12-02 17:14:38:

So Brave is still dependent on Google and Bing it seems.

Also is this Brave's CEO:

https://www.bbc.com/news/technology-26868536

https://www.nytimes.com/2020/12/22/business/brave-brendan-ei...

?

"Brendan Eich's opposition to same-sex marriage cost him his job at Mozilla."

"Covid comments get a tech C.E.O. in hot water, again."

BrendanEich wrote at 2021-12-04 19:34:25:

What independence percentage do you see when you click on the gear in upper right of the Brave Search results page?

I get 84% personal (browser-based), 87% global (which means we hit Bing only 13% of the time from our server side).

hunterb123 wrote at 2021-12-02 15:59:28:

DDG never worked great for me, and it doesn't have its own index.

Brave Search has been my daily driver and it works wonderfully.

adolph wrote at 2021-12-02 16:15:05:

I'll give it a try, somehow I missed the announcement even though I'm a Brave user...

https://brave.com/search/

moralestapia wrote at 2021-12-02 16:47:56:

Please do it! Google is now complete trash.

Also Gmail: it used to have the best spam filters out there; now it's utter crap. Emails from my Google Analytics account, for whatever reason and regardless of how many times I have clicked "Not Spam", go to spam, and it's their own service; while messages that are textbook spam ("Hi, I just got some inheritance ...") go to my inbox.

AI (in its current state) is crap. When is the industry going to accept that these are the emperor's new clothes?

mrkramer wrote at 2021-12-02 15:35:11:

They do[0], but nobody cares anymore. Google controls web distribution through Google Chrome. I think we are at the point of no return. There won't be any competition anytime soon, no matter what the US government does. Only innovation can displace Google.

[0]

https://search.marginalia.nu/

BbzzbB wrote at 2021-12-02 15:38:00:

Marginalia is great to find blog posts, personal sites and other long form content, but it's not a replacement for Google nor intends to.

marginalia_nu wrote at 2021-12-02 15:40:47:

It does operate on a scale and principle fairly similar to early 2000s google, so the comparison isn't that far off, but yeah, it's quite some way before it's viable for general search. Dunno if I'll ever get there, but it does consistently seem to get better so who knows.

BbzzbB wrote at 2021-12-02 16:12:25:

Isn't its familiarity to early Google a side effect of the early Internet being text-heavy sites in the first place, rather than a similarity in the search engine? Unless I am misunderstanding your site's intent, even if you reach the dream engine you are trying to achieve, I won't be using it to search for answers to coding questions on SO, how-tos for car repair, sites to stream movies, the governmental page for X need, transcripts of earnings calls, etc.

In my experience it is better than Google at what it does if I'm looking for long-form texts (exception being scientific/peer-reviewed articles, Google tends to shoot me those for the type of queries I make on Marginalia), but is very much complementary rather than a replacement.

marginalia_nu wrote at 2021-12-02 16:23:15:

It depends on what you are looking for on the Internet, I guess.

Right now the biggest problem with Marginalia is that it has a fairly uneven quality level. For some queries it's absolutely incredible. For others, it doesn't really provide many useful results at all. I do think it's possible to even that out a considerable bit, to make it more viable for general queries. It's never going to be able to answer every query, but it probably could answer a lot more than it does.

BbzzbB wrote at 2021-12-03 06:01:21:

Basically I understand Marginalia's proposition as a search engine focused on retrieving text-heavy/long-form content. Unless I misunderstand its intent, that can't replace a generalist engine (nor does it have to), as not every search request lends itself to long-form texts. I guess that's the only point I was going for (I do feel the old-Google sentiment has more to do with the state of the web than with the engine, but am out of my league for a proper opinion), and it certainly wasn't a jab at it. I'm thankful for your neat website and will be looking forward to seeing it get even better over time! Maybe it is somewhat uneven, but it is nonetheless great at finding thoughtful pieces written on subjects XYZ and surfacing more obscure/personal websites.

mrkramer wrote at 2021-12-02 15:41:14:

But it is a good start and foundation for something bigger and better.

egberts1 wrote at 2021-12-02 20:20:42:

Funny. Marginalia has an option for No JavaScript, but I cannot even do an HTTP “POST” with JavaScript disabled in my web browser.

Disclaimer: I study malicious JS stuff.

nickpp wrote at 2021-12-02 15:23:11:

Because we don't have a 2005 web anymore. More to the point, SEO and Google have evolved together. To get even barely relevant results today you need to be _good_. That takes stellar talent, which costs huge amounts of money.

Thus, the Google of today, which is optimized to extract that money from us.

Const-me wrote at 2021-12-02 19:40:02:

> To have barely relevant results today you need to be good

An easy way to become way better than Google: detect Google ads on pages, and penalize those pages in the index. For obvious reasons, Google Search is incapable of doing so.
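
As a sketch of how that detection could work: scan fetched HTML for common ad-tag hosts and apply a multiplicative penalty. The marker strings are real ad-serving domains; the halving scheme is invented for illustration.

    # Sketch: demote pages that serve ads, per the idea above.
    # The markers are real ad-serving hosts; the 0.5 factor is arbitrary.

    AD_MARKERS = ("adsbygoogle", "googlesyndication.com", "doubleclick.net")

    def ad_penalty(html: str) -> float:
        """Return a multiplier in (0, 1]; 1.0 means no penalty."""
        hits = sum(marker in html for marker in AD_MARKERS)
        return 0.5 ** hits        # halve the score per distinct marker present

    def adjusted_score(base_score: float, html: str) -> float:
        return base_score * ad_penalty(html)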

thisisnotatest wrote at 2021-12-02 16:02:15:

Yes, I think you'd call it a Red Queen Problem:

“Here, you see, it takes all the running you can do to keep in the same place.”

-Lewis Carroll's Through the Looking Glass

ginko wrote at 2021-12-02 15:30:15:

But shouldn't all the blogspam be so hyperoptimized for Google's algorithm that it should be straightforward to detect and ignore/downrank it?

nickpp wrote at 2021-12-02 15:38:46:

I have _read_ auto-generated pages almost to the end before realizing they were SEO spam. (I am not a native English speaker, though.)

With content copying, shuffling and AI generating, I am afraid we are on the cusp of auto content generators passing some restricted Turing test where readers really think it's an actual human that wrote it.

As for me, I learned that for certain "hot topics", simply doing a generic search on Google is not a good idea anymore.

marginalia_nu wrote at 2021-12-02 16:01:08:

Yeah, I do this with my search engine. Works pretty well. A complementary approach that works well is to look at where blogs written by humans link. Very few spam blogs get links from humans.
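
This is essentially the TrustRank idea: hand-pick a seed set of sites known to be written by humans and let trust flow along their outlinks. A toy version, with the damping factor and iteration count chosen arbitrarily:

    # Sketch: propagate trust from a seed set of human-written blogs
    # over the link graph, TrustRank-style.

    def trust_rank(outlinks, seeds, iterations=20, damping=0.85):
        """outlinks: {site: [sites it links to]}; seeds: set of trusted sites."""
        sites = set(outlinks) | {t for ts in outlinks.values() for t in ts}
        trust = {s: (1.0 if s in seeds else 0.0) for s in sites}
        for _ in range(iterations):
            nxt = {s: (1 - damping) * (1.0 if s in seeds else 0.0) for s in sites}
            for s, targets in outlinks.items():
                if targets:
                    share = damping * trust[s] / len(targets)
                    for t in targets:
                        nxt[t] += share
            trust = nxt
        return trust    # spam blogs, rarely linked from the seeds, stay near zero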

beingflo wrote at 2021-12-02 15:37:41:

No, because Google's algorithm is not publicly well known. Also, if it were straightforward to detect, then Google could downrank it as well.

kbelder wrote at 2021-12-02 17:22:21:

I wonder if you could evaluate a page using your own algorithm, which is probably not gamed as much as Google's (because who cares about your search engine?)

Then, check Google's ranking of the page. If it is much higher than it seems the page should be, assume the page is being SEO hyper-optimized and penalize the page proportionately.

Basically, using the variance between Google's model and your model as an indicator of an SEO spam page.
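
A minimal sketch of that heuristic, assuming you can obtain both rankings for a query (how you scrape Google's ranking is hand-waved here):

    # Sketch: flag pages that Google ranks far higher than your own model does.
    # Both inputs are ordered lists of URLs for the same query (position 1 = best).

    def seo_suspicion(google_ranking, my_ranking):
        """Return {url: score}; large positive scores suggest SEO gaming."""
        my_pos = {url: i for i, url in enumerate(my_ranking, start=1)}
        unranked = len(my_ranking) + 1        # URLs my model didn't rank at all
        return {
            url: my_pos.get(url, unranked) - g_pos    # Google's "over-ranking"
            for g_pos, url in enumerate(google_ranking, start=1)
        }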

ginko wrote at 2021-12-02 15:39:43:

The point is that SEO would just immediately adapt to Google's changes. If a smaller search engine filtered these out then it would likely stay under the radar.

thefreeman wrote at 2021-12-02 16:02:54:

you know that legitimate sites perform SEO as well, right?

marginalia_nu wrote at 2021-12-02 16:11:19:

SEO often seems to be compensation for the fact that a site doesn't have particularly worthwhile content. So punishing SEO, surprisingly, does promote higher-quality search results.

all2 wrote at 2021-12-02 16:48:32:

Yes and no. A lot of those sites are small local businesses trying to get found. A front page listing can be the difference between surviving and going under. Much of the time the blog spam is what floats hours, contact info, and services provided to the first page.

marginalia_nu wrote at 2021-12-02 17:05:06:

Be that as it may, search ranking is a zero sum game. The unfair advantage SEO gives this particular struggling business means another goes under. I'd rather punish the guy trying to game the system than the one with enough principles not to.

pessimizer wrote at 2021-12-02 17:29:16:

The difference is far more likely to be in capability or expertise than principles.

marginalia_nu wrote at 2021-12-02 17:42:09:

Either way, capability for fuckery is not something I'd want to encourage.

Nasrudith wrote at 2021-12-02 23:08:02:

Would you rather have a surgeon who knows how to kill you with a narrow slice to the right artery, or one who doesn't even know where your kidneys or appendix are located? Selecting for incompetence doesn't work well.

marginalia_nu wrote at 2021-12-03 13:25:56:

Eschewing SEO isn't incompetence, it's moral principle and good character. I'd much rather have a surgeon who doesn't moonlight harvesting organs from OD'ing junkies.

elcomet wrote at 2021-12-02 15:33:59:

It's not that easy; they are optimized for many metrics...

pkamb wrote at 2021-12-02 16:42:58:

I would use a search engine that only indexed Reddit, Stack Exchange, Wikipedia, and a small number of other sites.

And that specifically blocked Pinterest, Quora, most non-personal “blogs”, etc.

People suggest DDG ! operators, but I don’t want to use a site’s (bad, single-site) search box. I want a multi-site SERP that only displays results from known good sites, which are customizable.
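
One cheap way to prototype such a SERP is to rewrite each query with site: filters for the allowlist before handing it to an existing backend (whether a given backend honors long OR-chains of site: operators varies). A sketch:

    # Sketch: an allowlist-only SERP via query rewriting with site: operators.

    GOOD_SITES = ["reddit.com", "stackexchange.com", "wikipedia.org"]  # user-editable

    def allowlisted_query(query, sites=GOOD_SITES):
        site_filter = " OR ".join(f"site:{s}" for s in sites)
        return f"{query} ({site_filter})"

    # allowlisted_query("python asyncio gotchas") ->
    # 'python asyncio gotchas (site:reddit.com OR site:stackexchange.com OR site:wikipedia.org)'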

guynamedloren wrote at 2021-12-02 23:19:07:

I've been thinking about this as well. As Google search results get increasingly worse, I find myself subconsciously filtering out all the garbage and gravitating towards a small number of known sites; and, as many other HNers do, I frequently skip this filtering step altogether by adding "reddit" to any search in which I'm seeking out real human sentiment.

I've done similar optimizations elsewhere to counter Google's trash results, e.g., I've been beefing up my personal recipe database, with the goal being that I can avoid a Google search altogether whenever possible, only hitting Google as a last resort.

More and more I wonder, with the modern internet, is it even a _feature_ that the whole web is indexed? Might be a bug.

all2 wrote at 2021-12-02 16:52:43:

If I could add sites I liked to the index that'd be great. Find a blogger/hacker I like? Add to the index. Can I share my index with others? Can I include their indices in my searches?

Search engine as a social media platform? If I follow you, now I can search in your indices?

betwixthewires wrote at 2021-12-03 00:28:57:

Yacy might serve your needs well. It is a sort of distributed engine where users run their own index and "neighbors" share their indices with one another.

monkeybutton wrote at 2021-12-02 22:27:52:

Too bad they whitelist which bots can access their sitemaps!

pkamb wrote at 2021-12-02 17:21:14:

Even rules such as “if there is a Wikipedia result in the top 10, display it first”.
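
Rules like that are one-liners on top of any result list. A sketch, assuming results are dicts with a "url" field:

    # Sketch: if a Wikipedia result appears in the top 10, hoist it to position 1.

    def wikipedia_first(results):
        wiki = next((r for r in results[:10] if "wikipedia.org" in r["url"]), None)
        if wiki is None:
            return results
        return [wiki] + [r for r in results if r is not wiki]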

indymike wrote at 2021-12-02 15:44:13:

Brave's new search engine seems to work pretty well. Have been using it as my primary for about 10 days, and so far, I've only had to revert to Google once, and when I did the results were chock full of spam.

concinds wrote at 2021-12-02 16:21:21:

The nice thing about Brave Search is that they're trying to create an index completely independent from Bing/Google, and they seem to be trying to innovate on ways to get there as well with their Web Discovery Project[0], unlike DuckDuckGo. They've announced Brave Search will get ads soon, with a premium version without ads, which I think is acceptable given the costs of running an independent index sustainably.

[0]:

https://brave.com/privacy/browser/#web-discovery-project

travisgriggs wrote at 2021-12-02 15:52:43:

Can echo this. About 30 days going on all devices. I’d say about once a day I do a !g, and rarely do I actually find something there; it usually just ends up being a confirmation search.

freediver wrote at 2021-12-02 16:52:13:

We are building one [1], as are a few other people that I am aware of, with different approaches and business models.

We also need to be aware that when we remember past times, it usually carries a romantic, nostalgic note. The web is very different from what it was 15 years ago, and the problem of search has evolved.

What you are looking for is basically 'grep for the web', but that is just one facet of search as we use it today. 15 years ago you would not get an instant answer to a question like you do today, and many users would not be able to live without that now. There are also maps and location-based answers, and all sorts of widgets like translation. The world has also become more polarized, so an objectively best search result has become more difficult to produce, especially for events covered in the news, which means bias inevitably starts to creep in.

This is not to say that Google is good or bad today; it is what it is, and they are doing the best they can. Startups like ours see an opportunity in the market, in large part to help savvy users find what they want.

[1]

https://kagi.com

ColinHayhurst wrote at 2021-12-02 16:59:36:

You might call this a search engine based on the principle of Information Neutrality.

“Information Neutrality is the principle to treat all information provided (by a service) equally. The information provided, after being processed by an information-neutral service, is the same for every user requesting it, independent of the user’s attributes, including, e.g., origin, history or personal preferences and independent of the financial or influential interest of the service provider, as well as independent of the timeliness of information."

I wrote about this in relation to search [0]. We need to be allowed more freedom to choose search engines and services. One (default or selected) choice for search is unhealthy. We shouldn't have to choose between Google or Bing; DuckDuckGo or Startpage; Brave or Ecosia; Mojeek or Gigablast ..... Personally I use all 8 of these and more, as also explained [0].

[0]

https://blog.mojeek.com/2021/09/multiple-choice-in-search.ht...

betwixthewires wrote at 2021-12-03 00:51:02:

I'm with you (you run a great engine BTW) and I've considered the UX of some attempts to help users do this.

I like Firefox's UI when searching, where you can select the search engine of choice while typing a query.

I like customizable metasearch engines like searx, I think it is a phenomenal idea. I wish more niche engines would implement OpenSearch so that they could easily be added.

I have considered just making a simple web page with search boxes for multiple engines for personal use as a default home page, but there's friction and again, lots of engines I'd like to use don't implement OpenSearch.

I wonder if there's some novel UX approaches to this out there. Meta search engines seem to be the best way so far to do it but there's the problem of customizing ranking, relevance of results and the like that just compounds the problems users experience.

ravenstine wrote at 2021-12-02 21:04:59:

I think what [some] people actually want isn't the Google of 2005 but to have a search engine where they don't feel like they're being manipulated.

ab_testing wrote at 2021-12-02 22:21:14:

I think a lot of people are ignoring the fact that the web has changed considerably since 2005. It is approximately 10 times larger in terms of the number of websites and web pages. And a lot of it is SEO junk that is designed to be easy for search engines to parse and to shove ads in your face.

Also, user preferences have changed in the last decade or so. I know millennials and users in their late 30s or early 40s still yearn for the old web, where they would type a search term and correct results would astonish them. However, younger users tend to gravitate to videos, and that is why a large portion of Google results are now video results.

jakub_g wrote at 2021-12-02 15:42:13:

Cliqz wanted to build a new search engine but failed. It's just too difficult to operate at that scale and break the existing monopoly of big G.

https://www.burda.com/en/news/cliqz-closes-areas-browser-and...

https://news.ycombinator.com/item?id=23031520

https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...

hunterb123 wrote at 2021-12-02 16:05:35:

And then Brave bought them and it succeeded.

Cliqz is now Brave Search, I use it for all my devices, it's great.

Works better than DDG and sometimes better than Google.

I only hash-bang every 100 searches or so; most of the time Google doesn't have it either. It's just to make sure.

http://search.brave.com/

arthur_sav wrote at 2021-12-02 16:06:30:

What if we didn't try to replicate Google? Smaller, niche search engines would probably work better in this new world of vast information.

dave333 wrote at 2021-12-04 19:54:08:

Does Gigablast ignore or downrate stuff on .info domains?

Seems to like

https://www.fiendishsudoku.com/

for "fiendish sudoku" search but doesn't know about

https://www.extremesudoku.info/

for "extreme sudoku" search.

gorgoiler wrote at 2021-12-02 22:06:30:

Random thought, based especially on using DuckDuckGo for two years:

Search engine isn’t singular, it’s plural.

(1) Search engine for something I know exists.

(2) Search engine for finding something new.

There’s a market for both, but you don’t have to solve both problems with the same product.

Sometimes I switch to Google for the former, but the latter works well enough for me that I don’t care what else Google would’ve shown me.

More often than not, my feeling is Google would only have shown me more ads in addition to whatever I could already find elsewhere.

ineedasername wrote at 2021-12-03 01:27:35:

SEO wars are at least part of it. Google's algorithm has evolved over time not just to optimize advertising views/clicks and take over more screen space, but also to battle the constant gaming of their algorithm by SEO, which, once you eventually get to the real results, will surface less relevant, spammy, or scammy results if Google doesn't constantly push back against the worst SEO abusers.

greyman wrote at 2021-12-02 16:00:00:

1) Google is better at AI, for example let's take this sloppy search: "some joke where you can't tell if it is serious or joke"

It is called Poe's law, and Google returned it at #4. Bing or Duckduckgo don't have a clue...

2) They have years of user data: for a specific term, they can see what users clicked on most, and therefore which results were perceived as most relevant. It is hard to catch up if you don't have such data.

3) They developed anti-spamming tools during the years of fighting against SEO-spammers.

wmil wrote at 2021-12-02 20:39:24:

> Google is better at AI, for example let's take this sloppy search: "some joke where you can't tell if it is serious or joke"

My problem there is that I don't expect or want my search engine to do that. The counter case is where I remember a quote from an article and want to find the article. Old Google would help me find matching text and I could quickly find the original article. Current Google will try to interpret the text and give me some nonsense based on that.

AI has ruined other Google features... the "search by image" feature now analyzes the image, returns a generic tag like "woman", and shows me the Wikipedia article on women as the first result.

Old search by image had TinEye-like functionality and you could find the source of images.

lolpython wrote at 2021-12-02 16:51:13:

> 1) Google is better at AI, for example let's take this sloppy search: "some joke where you can't tell if it is serious or joke"

> It is called Poe's law, and Google returned it at #4. Bing or Duckduckgo don't have a clue...

Interesting, I was looking for a good benchmark like this. For me, Google returned it at #5 with an image/related-terms carousel before it, which places it physically around #7 on the page. Brave Search (never tried it before today) puts Poe's Law at #8. So Google is still better.

But the other results are mostly worse (IMO) on Google. Here are the first 8 results:

- 175 Bad Jokes That You Can't Help But Laugh At - Reader's (rd.com)

- 57 Hilarious, Silly Jokes No One Is Too Old to Laugh At (bestlifeonline.com)

- 145 Best Dad Jokes That Will Have the Whole Family Laughing (countryliving.com)

- Sarcasm, Self-Deprecation, and Inside Jokes: A User's Guide (hbr.org)

- Poe's law - Wikipedia (wikipedia.org)

- Managing Conflict with Humor - HelpGuide.org (helpguide.org)

- 175 Bad Jokes That Are So Cringeworthy, You Can't ... - Parade (parade.com)

- Encouraging Your Child's Sense of Humor (for Parents) - Kids ... (kidshealth.org)

And here are the first 8 results from Brave Search:

- phrase requests - Is there a word for "pretending to joke when ... (english.stackexchange.com)

- Joke - Wikipedia (wikipedia.org)

- “Are you joking or serious?” – The Caffeinated Autistic (thecaffeinatedautistic.wordpress.com)

- How do I tell when people are joking or being serious? (reddit.com/r/socialskills)

- be a joke | meaning of be a joke in Longman Dictionary of (ldoceonline.com)

- Quote by Ricky Gervais: “If you can't joke about the most (goodreads.com)

- How can you tell if someone is joking with you or not? (quora.com)

- Poe's law - Wikipedia (wikipedia.org)

-----

edit: I did not count to 8 correctly the first time. Fixed that.

pkamb wrote at 2021-12-02 17:19:08:

The Brave results though seem to contain “good sites” whereas the Google results are content mill blogspam. The exact placement of Poe’s Law is somewhat less important.

lolpython wrote at 2021-12-02 18:46:18:

I agree. I switched to Brave Search after running this test.

keddad wrote at 2021-12-02 15:24:03:

While I feel that Google has become worse in the last couple of years, I'm pretty sure it is still better now than 15 years ago. Maybe it is just some kind of nostalgia?

micromacrofoot wrote at 2021-12-02 15:41:43:

the internet has changed, partially due to google's influence

instead of discussion forums and Q&A sites, everyone's on facebook/twitter/discord/slack/snapchat/tiktok/etc... none of that is really very google friendly

online marketing and SEO is a _much_ larger industry now, so with a smaller share of searchable content being generated by people (much of it now locked up in social media), a lot of the high-ranking content that appears in search is highly optimized marketing

then you have other kind of weird things like... half of all internet traffic being bots

phendrenad2 wrote at 2021-12-02 23:12:58:

The 2005 Google model only made sense in the 2005 internet. Google had the luck to become a search monopoly, and they quickly created Chrome to ensure that no one would ever switch away from Google search, so they could maintain the monopoly.

Now that Google exists, you can't create another one. There's only room for one.

Another thing is the rise of "content sites", like this one (Hacker News). I'm sure YCombinator doesn't like getting hit by dozens of crawlers. The impulse to ban everything that crawls except (Google|Bing|Baidu|VK) is too great.

A lot of alternative suggestions are being thrown into this discussion. Let me throw in mine: Reverse the concept of the "crawler". Instead of following links around the internet randomly, require sites to register with you and request to be crawled and/or submit a sitemap. It would be hard to get started, but once something like this gained momentum, I believe that there's room for several of these reverse-search-engines to compete.
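
A sketch of that registration-driven model: the engine fetches only what registered sites declare in their sitemaps, and nothing else. The `register` entry point and storage are hypothetical; the sitemap XML namespace is the standard one:

    # Sketch: a "reverse crawler" that fetches only the pages declared in
    # sitemaps submitted by registered sites. Names here are hypothetical.

    import urllib.request
    import xml.etree.ElementTree as ET

    registered_sitemaps = []        # filled by a registration form or endpoint

    def register(sitemap_url):
        registered_sitemaps.append(sitemap_url)

    def urls_to_index():
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for sitemap_url in registered_sitemaps:
            with urllib.request.urlopen(sitemap_url) as resp:
                tree = ET.parse(resp)
            for loc in tree.findall(".//sm:loc", ns):
                yield loc.text      # fetch and index each declared URL, nothing else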

boyter wrote at 2021-12-03 00:30:19:

I had a brief stab at this with

https://bonzamate.com.au

although it's Australia-specific, to reduce the crawling and indexing requirements. Its main twist is that it runs entirely in AWS Lambdas, meaning it costs nothing when it's not being used.

causi wrote at 2021-12-03 08:51:21:

Lately I've noticed Google has just started ignoring search operators. Search results are missing terms in quotes and include terms with a leading - sign on them. It's like they've decided we're too stupid to know what we're looking for.

chilling wrote at 2021-12-02 15:32:02:

Yesterday there was a discussion[1] about it, and someone suggested yandex.com. I've been using it since then and really love it. It's like going back to 2003, when everything was just plain and simple.

[1]:

https://news.ycombinator.com/item?id=29393467

jerhewet wrote at 2021-12-03 00:52:47:

Do not force me into autocomplete mode when I'm typing in my search terms. I don't care what anyone's "reasons" are for forcing me to put up with flashing, irrelevant bullshit when I'm searching for something. I don't care how "fast" it is.

Just let me type stuff into the search box -- including typo corrections and modifications to what I'm searching for -- and hit ENTER to start the actual search.

When I'm ready to start my search I'll hit the fucking ENTER key. Stop annoying me with your stupid assumptions about what I'm looking for.

This ONE THING is why I switched to Webcrawler.com two years ago. I type in five or ten words with ZERO craptastic guesses flashing around on my screen, hit ENTER, and THEN it returns what I'm looking for.

erpellan wrote at 2021-12-02 22:58:25:

Even if Google dusted off their 2005 codebase and ran it on today's web it wouldn't come close to the results quality of Google in 2005. The SEO industry has been in an arms race with the search engines for 16 years. 2005 Google would be like a goldfish in a piranha tank.

8bitsrule wrote at 2021-12-03 05:03:27:

Looks like millionshort.com (which I learned of on HN) died recently. For me, its results were more useful than most others' (even without the 'leave out the top nnn sources' feature). Hoping it was an experiment that will bear fruit.

flipdot wrote at 2021-12-02 22:01:54:

Not sure if this is anywhere close to what you’re trying to find, but there’s

https://github.com/benbusby/whoogle-search

drcongo wrote at 2021-12-02 16:47:43:

I've been using kagi.com for a month or so now, and it consistently beats DDG and Ecosia for result quality. I'd guess it beats Google too, since last time I used Google it was nothing but ads and spam which is why I stopped.

freediver wrote at 2021-12-02 16:59:05:

Thank you for the vote of confidence! Better than Google is our goal, glad you perceive it that way.

drcongo wrote at 2021-12-02 17:22:03:

You're welcome. I'm really impressed with it most of the time. Still not made it on to the Orion beta though ;)

BbzzbB wrote at 2021-12-02 15:36:10:

No mention of DDG in the comments? Is there a reason I'm not seeing, or is it just not the preferred alt-search on HN? It seems to have been working fine for me when I struggle to get past the funnels and content mills on Google.

Kiro wrote at 2021-12-02 15:55:27:

DDG doesn't have their own index (they're getting their results from Bing) so not really relevant to this question.

BbzzbB wrote at 2021-12-02 16:25:36:

I... didn't know that. However, trying it just now in incognito, I don't get the same results[0] (some different links, and most re-ordered). Is Duck repurposing Bing's results? I've tested with "how to get rich", a great bait for bad content (try it on Google without an adblocker, if you dare).

[0]:

https://pastebin.com/xC45hL1i

Kiro wrote at 2021-12-02 22:13:30:

I don't know what DDG is doing but I'm imagining that they send in the raw queries while you can't get around Bing's personalisation even in incognito. I get very similar results for "how to get rich", but only after setting "All regions" on DDG.

Bing:

1. How to Get Rich: 10 Things Wise and Rich People Do

2. 5 Ways to Get Rich - wikiHow

3. 16 Proven Ways On How To Get Rich Quick (2021 Edition) - TPS

4. How to Get Rich - NerdWallet

5. How to Get Rich: Follow our Step by Step Plan to Build ...

DDG:

1. How to Get Rich: 10 Things Wise and Rich People Do

2. 5 Ways to Get Rich - wikiHow

3. How to Get Rich - NerdWallet

4. 16 Proven Ways On How To Get Rich Quick (2021 Edition) - TPS

5. How to Get Rich: 8 Steps to Make Your First Million ...

It's no secret that DDG is using Bing so they're not trying to hide it. An easy way to verify it is to search for "what is my ip" on DDG and look for results where the IP number has been cached in the snippet, e.g.:

www.myipnumber.com

What is my IP number - my IP address - MyIpNumber.com

What is my IP Number? The IP Number of this machine is: 157.55.39.192. This number can also be represented as a 32-bit decimal number 2637637568, or as a 32-bit hexadecimal number 0x9D3727C0 . (Note that if you are part of an internal network then this is the IP number of your local server, the machine which is connected to the external ...

If you do an IP lookup on 157.55.39.192 you will see that it's in fact "Microsoft bingbot".
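
The standard way to confirm that is a reverse DNS lookup; genuine Bing crawler IPs reverse-resolve to hostnames under search.msn.com. A sketch (a stricter check would also forward-resolve the hostname and confirm it maps back to the same IP):

    # Sketch: verify a claimed bingbot IP by reverse DNS, the way Microsoft
    # documents it.

    import socket

    def is_bingbot(ip):
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        return hostname.endswith(".search.msn.com")

    # is_bingbot("157.55.39.192") should be True: the IP reverse-resolves to
    # something like msnbot-157-55-39-192.search.msn.com.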

KennyBlanken wrote at 2021-12-02 15:46:16:

For me, DDG results are even worse than Google. It's set as my default and I'd say at least half of my searches in DDG generate completely useless results...pages of obviously SEO'd garbage.

DDG also doesn't support showing a site's basic structure in the search results (ie, the card of a company's website with Products, Contact Us, Support, etc) and the preview text is garbo as well...it reminds me of 1990's era electronic card catalog search excerpts.

I look at the first page or two, give up, search google. While I have to hunt a bit in the results, I do eventually get what I wanted.

infinitezest wrote at 2021-12-02 15:57:00:

Every time this comes up I'll see a few people talk about how the results aren't relevant, but that has not been my experience. I've been using DDG as my main search engine for a few years and never have to go beyond the first page. I'm really curious why that is.

JohnFen wrote at 2021-12-02 16:10:22:

My experience is like yours -- DDG is legitimately better than Google. My hypothesis is that it's related to how you construct searches. I expect Google probably does better if you learn how to talk to it, since it seems to want to interpret your query rather than take it literally.

My searches tend to be keyword-oriented rather than natural language. I think DDG does better with those.

pantulis wrote at 2021-12-02 15:41:36:

I don't find the search results to be too relevant (at least for me; also, Spaniard here). It is my default search engine only for the bang commands.

RDaneel0livaw wrote at 2021-12-02 15:49:28:

I was looking for this as well! I use it daily and have for years. Love it.

not2b wrote at 2021-12-02 22:13:30:

I think you're being nostalgic for something you don't remember very well.

In that era, Google would return a match based on words that appear in the links to a URL but not in the article itself, meaning that it was easy to produce "Googlebombs". For example, from 2005-2007 the top hit for "miserable failure" was the Wikipedia article for George W. Bush.

See

https://www.screamingfrog.co.uk/google-bombs/

for some of the "better" ones.

karmasimida wrote at 2021-12-03 00:47:28:

Google does its job.

I hear HN constantly crying over its deteriorating quality, but I am not noticing it that much. Not better, not worse; it just does its job.

Creating an '05 Google would easily take billions of dollars and years of investment before people treat you seriously.

The only reason we didn't get an '05 Google is that it is not profitable. Some nation-state attempt to de-monopolize the search engine business might work, but I wouldn't expect any for-profit organization to attempt this lightly, let alone individual hobbyists.

llaolleh wrote at 2021-12-02 15:27:06:

Everyone runs in the other direction anytime a search engine is mentioned. The thought of competing with Google turns people off.

Even in 2021, despite how bad it's become, it's still miles ahead of other competitors.

prox wrote at 2021-12-02 16:39:26:

I disagree. A lot of people I know already switched to Duckduckgo. Google’s ability to get relevant results is dropping like a brick, while the quality of DDG has been improving slowly but steadily.

datenarsch wrote at 2021-12-02 17:25:19:

I wish I could agree but from my experience, DDG's search results aren't really that great. Often even worse than Google's.

And another private company is not the answer I believe. We need something more drastic, an open-source search engine organized as a genuine non-profit organization. Something like that. Otherwise, whatever replaces Google will just turn into another Google as soon as it gets any momentum.

llaolleh wrote at 2021-12-03 20:57:28:

I think open source will be tough because you're going to need a lot of saints to work on a search engine of Google's caliber.

Maybe an alternative revenue model instead of ads.

onecommentman wrote at 2021-12-05 05:48:23:

Consortium of universities, perhaps? Every top school (globally) kicks in some design and development time. It seems odd that the most critical link to access information on the planet is _not_ the product of academia. With a country’s skin in the global game, there may be better leverage to keep it free and open for their citizens.

jacquesm wrote at 2021-12-03 06:47:48:

I'd love a much simpler kind of search engine: one that I can give a long list of websites to crawl, and that completely ignores the rest.
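
A crawler scoped like that is barely more than a breadth-first fetch loop with a host check. A sketch, with the actual fetching and link extraction left to caller-supplied functions:

    # Sketch: crawl only an explicit list of sites and drop every off-list link.

    from urllib.parse import urlparse, urljoin

    ALLOWED_HOSTS = {"example.org", "blog.example.net"}   # your personal list

    def in_scope(url):
        host = urlparse(url).hostname or ""
        return any(host == h or host.endswith("." + h) for h in ALLOWED_HOSTS)

    def crawl(start_urls, fetch, extract_links):
        seen, queue = set(), [u for u in start_urls if in_scope(u)]
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            page = fetch(url)                 # fetching/parsing left to the caller
            for link in extract_links(page):
                link = urljoin(url, link)
                if in_scope(link) and link not in seen:
                    queue.append(link)
            yield url, page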

simonebrunozzi wrote at 2021-12-02 18:15:47:

These guys [0] have built something really close to 2005-Google, and possibly slightly better.

The parent company, Tiscali, was a huge hit in the 1990s, as it provided internet access to millions of Italians. It went through some struggles for several years, but lately the original founder, Renato Soru, came back to run the company.

The company is based in Cagliari, the capital of Sardinia, Italy.

[0]:

https://www.istella.it/en/

criddell wrote at 2021-12-02 15:48:55:

Why don't you want personalized results? If I search for "subaru service" I want to find Austin Subaru, not Thorp Subaru in Cape Town.

vikingerik wrote at 2021-12-02 15:54:01:

Why didn't you just search "austin subaru service"? If you want a query narrowed down by location, that's your job to say so.

Sure, it feels great when the engine guesses something like that correctly -- but it comes out worse overall for the plentiful cases where you have to try to compensate for it guessing wrong.

criddell wrote at 2021-12-02 16:24:17:

Why should I have to do all that work? I want the machine to do it for me.

I can only think of examples where I want personalization. What's an example query where it interferes?

jeffbee wrote at 2021-12-02 16:39:30:

Amazing that the same site that thinks copilot will just generate programs for us also thinks it is literally a crime for a search engine to infer anything.

arthur_sav wrote at 2021-12-02 16:16:10:

I pretty much hate "personalized" search recommendations. If i'm looking for something it's usually not in relation to me but in relation to the world.

If i wanted something more relevant to me, then i would specify what aspect of relevance (country, gender, age etc...) i would like instead of playing the guessing game.

criddell wrote at 2021-12-02 16:27:43:

> If i'm looking for something it's usually not in relation to me but in relation to the world.

If that's true, then I don't think you are a typical search engine user.

The personalization should just be used for defaults. You can always make a more specific query to focus on aspects you are interested in.

est wrote at 2021-12-04 05:24:06:

Because today's web is full of walled gardens, and most content is going mobile, streaming, and SPA-rendered, which is no longer plain-text based.

Hakashiro wrote at 2021-12-05 11:43:41:

What is your gripe with DuckDuckGo?

emodendroket wrote at 2021-12-03 01:53:27:

Well, Cuil had a lot of money and couldn't do it. I don't know how you quantify your assertions but I suspect that if you brought back 2005 Google it would be easily gamed and struggle to deal with social media sites where a lot of content people are looking for is now found.

willcipriano wrote at 2021-12-02 16:26:15:

I'd like to see a "just search" engine: all it does is search for a specific string, case-insensitively, across the entire web. No curation or anything; results just sorted in lexicographic order, closest match first, maybe falling back to page age if there is more than one exact match. Perhaps give me some regular expressions as well.
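
The core of such an engine is trivially expressible; scale is the only hard part. A toy in-memory version of the matching and ordering described above, where pages are dicts with hypothetical "text" and "first_seen" fields:

    # Sketch: "just search": case-insensitive exact substring match,
    # no curation, ties broken by page age (oldest first).

    def just_search(query, pages):
        """pages: dicts with "url", "text" and a "first_seen" timestamp."""
        q = query.lower()
        hits = [p for p in pages if q in p["text"].lower()]
        return sorted(hits, key=lambda p: p["first_seen"])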

jeffbee wrote at 2021-12-02 16:34:12:

That would be easily the worst search engine ever deployed. Imagine just returning all docs containing the word “bicycle” in chronological order. Useless.

willcipriano wrote at 2021-12-02 16:38:20:

For "Bicycle" it would suck but I don't often use search engines that way, for "High Timber ALX 29" you'd probably get something like this:

https://www.schwinnbikes.com/products/high-timber-alx-29?var...

I wouldn't use it for everything but sometimes that is the exact behavior that I want. I'd use duck duck go for more general searches.

jeffbee wrote at 2021-12-02 16:42:11:

That is the top hit on google for that search, so what’s your complaint?

willcipriano wrote at 2021-12-02 16:52:59:

Take a random part number off your car, or a portion of an error message, and try finding that. It's annoying to have to scroll down over a page or two of autogenerated SEO answers to get to something useful. The first result to appear on the internet is less likely to be SEO and more likely to be the manufacturer's documentation or the git commit that spawned your error. It isn't always, but that's why you have more than one search engine.

Secondarily, I think a search engine that is very simple in its model and operation is useful for more general free-speech purposes. If the major search engines decide they don't like a site like The Pirate Bay, a search for '"Pirate Bay" AND "Torrents"' on a search engine that does not curate could still get you there. I guess the point is that without curation you have to work harder to find what you want, but nobody is actively preventing you from finding anything. It would help keep everybody honest.

prox wrote at 2021-12-02 16:37:10:

Maybe a “stability factor” could be calculated. Whereas earlier new content was king, I now value a stable long term source of information. So domain age + page age + content variability + dependency on ads. That might give more honest sources a go.
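
Those signals combine naturally into a weighted score. A sketch with made-up weights, assuming variability and ad dependency are normalized to [0, 1]:

    # Sketch: a "stability factor" built from the signals listed above.
    # Weights are arbitrary; variability and ad dependency count against a page.

    def stability_score(domain_age_years, page_age_years,
                        content_variability, ad_dependency):
        """content_variability and ad_dependency assumed normalized to [0, 1]."""
        return (1.0 * domain_age_years
                + 0.5 * page_age_years
                - 2.0 * content_variability
                - 2.0 * ad_dependency)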

willcipriano wrote at 2021-12-02 17:01:40:

That's a good idea; I'd make it an option. Do you want newest first, oldest first, or by stability?

hvasilev wrote at 2021-12-03 07:35:31:

All big tech businesses at their core are monopolies. Once a significant field has been figured out, it is very difficult to compete with the market standard, unless they screw up so hard that THE AVERAGE user starts searching for an alternative.

swframe2 wrote at 2021-12-02 15:47:35:

Have a look at GPT-3 if you want to see what the future dominant search engine will be. It will not find relevant results; it will write them on the fly, customized for exactly what you want to read. (Maybe products will just ship to your door and be auto-paid, because the future ad-targeting AIs will know you so well.)

marginalia_nu wrote at 2021-12-02 15:51:18:

What if you are looking for something written by a human?

swframe2 wrote at 2021-12-02 16:07:12:

You can always go to a library or bookstore.

marginalia_nu wrote at 2021-12-02 16:08:50:

Let's imagine I want to talk to the author of the content. How can I do that if it's just a souped up markov chain?

throwawayffffas wrote at 2021-12-02 16:29:28:

The markov chain can also power a chat bot.

marginalia_nu wrote at 2021-12-02 16:37:30:

But then they would need to know that the person sending the email is the same person that read a specific article.

kumarsw wrote at 2021-12-02 16:21:24:

I feel like we are at the low point, or even losing the battle, between search engines and SEO spam. Maybe it is time for the Yahoo-style curated directory to return? We seem to be getting a microcosm of this with the awesome-* GitHub lists and Gemini with its near-nonexistent search.

WalterBright wrote at 2021-12-02 22:37:07:

I'd like to see categories like travel, science, history, art, etc. The web pages could pick which categories their page falls into using meta tags. The user has the option of selecting which category they are interested in searching within.
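
This would be straightforward to consume at index time: parse a (hypothetical) category meta tag out of each page and store it as a filterable facet. A sketch using only the standard library:

    # Sketch: read a hypothetical <meta name="category" content="travel, history">
    # tag at index time, so searches can be filtered by category later.

    from html.parser import HTMLParser

    class CategoryExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.categories = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and a.get("name") == "category":
                cats = (a.get("content") or "").split(",")
                self.categories += [c.strip() for c in cats if c.strip()]

    def page_categories(html):
        parser = CategoryExtractor()
        parser.feed(html)
        return parser.categories    # e.g. ["travel", "history"]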

motohagiography wrote at 2021-12-02 16:37:10:

I do like the idea that instead of crawling and indexing, the next generation of search will likely be more like a federated community search app that indexes the stuff members actually read. Google search isn't so much a repository as a consensus about what's important, which is why it's so politicized to the point of becoming unreliable, but also why it too is vulnerable to disruption.

Imo, 2005 Google got initial traction because of its tech forum post indexing; as I remember, my switch to it was because it became an extension of, and then a replacement for, manpages. In that sense, what made it good was that it reflected the consensus of what its incredibly influential userbase thought was important, and it just managed that really well. The demographic impact of the US Gen X all using it at once didn't hurt either.

The equivalent today, as a lot of us say, is that blockchains are in the 1997 internet phase, and the service that makes the content of those as navigable as the 90's internet will likely grow in a similar way.

Search that provides young people with privacy and freedom to pursue their true interests will be the dominant strategy. Its success will be because it's a product that rides growth, and not because it "solved a problem." Imo, we all index too much on the privacy pattern because the freedom pattern is too risky.

What's changed since that time are the maturity of things like Bloom and other probabilistic filters, Apple's private set intersection, differential privacy, zksnarks, and everybody you'd ask an opinion from now gets their content through mobile devices. Apple's ecosystem is equipped to do this kind of search, but they're too exposed politically to get into it. Meta will likely go there, but nobody's going to trust them willingly.

A protocol that generated a cryptographically strong anonymous index from your browsing - one that, instead of putting it on Google's servers, stored it on a chain, or included the content index information and its evolving consensus score in something like a DNS record - may still unseat these ensconced interests. IPFS and other P2P systems or torrents might do something like that as well. Blockchains may be good for that consensus/desire score.

It's not something you architect and design top down that has to solve all cases, it will be just another useful product that grows while riding a demographic change. It would be on the level of inventing HTML/HTTP again, which, when you think about it, was just another dude making a thing he needed.

baggachipz wrote at 2021-12-02 16:47:37:

https://kagi.com/

is a new engine (and Orion Browser) which seems like what you're talking about. I've been using it some and like it so far. The browser is fantastic.

dang wrote at 2021-12-03 01:56:24:

Ongoing related thread:

_Gigablast Search Engine_ -

https://news.ycombinator.com/item?id=29421898

- Dec 2021 (10 comments)

lgrialn wrote at 2021-12-02 22:07:39:

What I miss most of all from the Good Old Days was getting as many hits back as I could read.

Rather than being told "No, there are only eight pages of results on anything in the goddamned world. Really. Would I lie to you?"

marksbrown wrote at 2021-12-02 23:00:08:

I'd like a way of automatically filtering for websites that:

That would be a place to begin.

XCSme wrote at 2021-12-03 09:23:25:

Almost all websites use JavaScript, including this one.

betwixthewires wrote at 2021-12-03 00:37:13:

Wiby.me might work for you.

mrfusion wrote at 2021-12-02 22:20:01:

I’ve always wondered why you can’t use SEO optimizations for GOOGLE as a negative weight and penalize those pages.

For example, if my search term appears in the URL, I can almost guarantee I don't want that page.
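
As a ranking heuristic that is one line of scoring. A sketch, with the penalty size invented:

    # Sketch: demote results whose URL contains the query terms, on the
    # intuition that keyword-stuffed URLs are usually SEO bait.

    def url_keyword_penalty(query, url, penalty_per_term=0.2):
        url_l = url.lower()
        terms = [t for t in query.lower().split() if len(t) > 2]
        hits = sum(t in url_l for t in terms)
        return max(0.0, 1.0 - penalty_per_term * hits)   # multiplier for base score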

dragonwriter wrote at 2021-12-02 15:50:55:

Why doesn't anyone create a search engine comparable to 2005-Google?

Because the universe being searched isn't the internet of 2005 and earlier, and because user expectations have moved on, too.

Plus the index expense.

hermann123 wrote at 2021-12-03 18:43:21:

I use

https://swisscows.com/

gompertz wrote at 2021-12-02 23:16:34:

I've been having a lot of good luck with Lycos (yeah, that Lycos, from 1995!)... It never returns paywalled or "opinion"-based results (e.g., Medium).

tigerlily wrote at 2021-12-02 16:07:20:

Surely there must be some way to have distributed search compute a la folding/seti@home or those mersenne prime guys.

I'd gladly pool in some of my CPU time if it helps build a better search.

teddyh wrote at 2021-12-02 16:16:40:

https://yacy.net/

tigerlily wrote at 2021-12-02 16:32:01:

Thanks!

s1k3s wrote at 2021-12-02 21:43:14:

I don’t know how Google was in 2005, but in ~2010 I was able to get a website to #1 with $0 spent, just by manipulating PR. That doesn’t seem great to me.

vangelis wrote at 2021-12-02 21:09:58:

They have, sort of:

https://search.marginalia.nu/

beefman wrote at 2021-12-02 16:48:44:

Can you also create a web comparable to the 2005 web?

Well, it's Wikipedia. So just create a search engine for that, since their search sucks rocks.

anotheraccount9 wrote at 2021-12-02 22:40:49:

Check out the dead internet theory. If most people browse 1% of the web, what's up with popular search engines?

amelius wrote at 2021-12-02 15:43:50:

Also, where are the books about writing a search engine?

Knuth's "Sorting and Searching" volume desperately needs an update.

mindcrime wrote at 2021-12-02 16:37:52:

I don't even know if anybody has written a book specifically about search at "web scale" (no MongoDB jokes here, please). But the closest things I know of would be something like:

https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...

https://www.amazon.com/Information-Retrieval-Implementing-Ev...

https://www.amazon.com/Introduction-Information-Retrieval-Ch...

axegon_ wrote at 2021-12-02 16:05:32:

Two major reasons: the cost to build and maintain, and the manpower needed. Both are practically impossible to come by.

hereforphone wrote at 2021-12-02 23:02:59:

Because the money lies in modulating your product according to the whims of the highest bidders.

ChrisArchitect wrote at 2021-12-02 16:54:16:

related 2 days ago:

_Ask HN: Has Google search become quantitatively worse?_

https://news.ycombinator.com/item?id=29392702

Inviting all the paranoid/speculative/hearsay/personal experience responses. Lame Ask HNs!!!!!

mkbkn wrote at 2021-12-02 17:17:16:

I am a non-dev, and Ecosia and DuckDuckGo are perfect for me. I haven't used Google in more than 3 years now.

chrisgoman wrote at 2021-12-02 20:23:45:

Too many crappy websites. It probably needs a "committee" to whitelist domains (only good-quality ones), but that's probably too much work for not enough money, or it needs some monetization strategy.

rasengan wrote at 2021-12-02 15:11:36:

This is how Private Search [1] works since it decouples the search from the user. This means nobody knows both who searched and what they searched for. This is a huge leap for privacy in search.

[1]

https://private.sh

jaywalk wrote at 2021-12-02 15:24:07:

Looks like your comment here caused enough curiosity to take the service down.

Lucasoato wrote at 2021-12-02 15:21:58:

Tried it but it just says: "Something went wrong. Please try again."

ZetaZero wrote at 2021-12-02 15:27:50:

same here

rasengan wrote at 2021-12-02 15:48:50:

It should be working now! Thanks for the heads up. There was a traffic issue.

snarkypixel wrote at 2021-12-02 16:05:14:

Is it a proxy to other search engines or are they building their own?

rasengan wrote at 2021-12-02 16:32:27:

It's a multi part partnership with Gigablast. Gigablast sees the searches, but not who searches. Private.sh sees who searches, but not what they search for.

gbmatt wrote at 2021-12-02 15:36:57:

and i work with rasengan on private.sh so yes there's some issue there. one of the back end servers is returning a max capacity error of sorts... we are checking into it.

gadrev wrote at 2021-12-02 17:06:24:

Just tried it and it worked for me.

peanut_worm wrote at 2021-12-02 23:04:40:

Isn’t that what DuckduckGo is?

DDG is pretty useless though unfortunately.

richardsocher wrote at 2021-12-03 09:33:30:

you.com supports many of the standard operators and has specific Reddit, Stack Overflow, and MDN apps for developers.

fnord77 wrote at 2021-12-02 21:39:42:

Information-dense pages of yore have been replaced by really wordy, probably generated, SEO-optimized blog junk.

aaron695 wrote at 2021-12-03 00:03:24:

> I seem to recall that Google consistently produced relevant results and strictly respected search operators in 2005 (?), unlike the modern Google.

You recall wrong.

Probably because you were a child and searching for "reddit".

Now as an adult, Google can't just hand you adult results by magic.

Search operators have changed, but that's because the internet is thousands of times bigger than in 2005, whereas the number of people went from ~1 billion to ~5 billion.

> I think search results were the same for everyone, rather than being customized for each user.

You are not a baby; turn off the customisation. The same issue existed in ~2005: Google customised, and we had to work to turn it off. Also, my idea that we were becoming one world was totally wrong; Google customising for my location was more correct than my idealism. It also helped local businesses get online. That Google is 'evil' by default is a shitty assumption.

andrewclunn wrote at 2021-12-02 15:48:38:

What about a search engine that only indexed information and technology "alternative" sites, specifically to give you the results most likely to be purged or demoted from Google's results? Would be simple enough in scope and have a built in market and use.

grouphugs wrote at 2021-12-02 23:43:05:

the nazis don't really believe in markets, freedom, fairness, competition, democracy, or even capitalism for that matter, it's all just old school oligarchical authoritarianism