Ha, yes, I've done that at
.
The biggest problems now are the following:
1) Too hard to spider the web. Gatekeeper companies like Cloudflare (owned in part by Google) and CloudFront make it really difficult for upstart search engines to download web pages.
2) Hardware costs are too high. It's much more expensive now to build a large index (50B+ pages) to be competitive.
I believe my algorithms are decent, but the biggest problem for Gigablast is now the index size. You do a search on Gigablast and ask, well, why didn't it get this result that Google got? And that's because the index isn't big enough, because I don't have the cash for the hardware. BTW, I've been working on this engine for over 20 years and have coded probably 1-2M lines of code on it.
You can be whitelisted so Cloudflare doesn't slow you down (or block you):
https://support.cloudflare.com/hc/en-us/articles/36003538743...
It's not quite that easy. Have you ever tried it? See my post below. Basically, yes, I've done it, but I had to go through a lot and was lucky enough to even get them to listen to me. I just happened to know the right person to get me through. So, super lucky there.
Furthermore, they have an AI that takes you off the whitelist if it sees your bot 'misbehave', whatever that is. So if you have a certain kind of bug in your spider, or your bot 'misbehaves' (whatever that means is anyone's guess), then you're going to get kicked off the list. So then what? You have to try to get on the whitelist again? They have Bing and Google on some special short lists so those guys don't have to sweat all these hurdles.
Lastly, their UI and documentation are heavily centered around Google and Bing, so upstart search engines aren't getting the same treatment.
Cloudflare is not the only gatekeeper, either. Keep that in mind. There are many others and, as an upstart search engine operator, it's quite overwhelming to have to deal with them all. Some of them have contempt for you when you approach them. I've had one gatekeeper actually list my bot as a bad actor in an example in some of their documentation. So, don't get me wrong, this is about gatekeepers in general, not just Cloudflare and CloudFront.
But given your treatment, one could say sites fronted by Cloudflare are part of a closed web.
I dunno if y'all realise this but I'd pay for a search engine that black holes CloudFlare and any other sites that think bots shouldn't read their sites.
rip the internet if you do that =/
> You do a search on Gigablast and say, well, why didn't it get this result that Google got. And that's because the index isn't big enough
I wonder how much this is true, and how much (despite all our rhetoric to the contrary) it's because we have actually come to expect Google's modern proprietary page ranking, which weighs not just inbound links but all sorts of other signals (freshness, relevance to our previous queries, etc.).
We dislike the additional signals when it feels like Google is trying to second-guess our intentions, but we probably don't notice how well they work when they give us the result we expect in the first three links.
>but we probably don't notice how well they work when they give us the result we expect in the first three links.
For me the experienced quality of Google search results has dropped massively since 2008, despite (and maybe even because of) all their new parameters.
When someone says this someone else usually immediately says it is because of web spam and black hat SEO.
But black hat SEO doesn't explain why verbatim doesn't work for many of us.
Black hat SEO doesn't explain why double quotes don't work.
Black hat SEO doesn't explain why there are no personal blacklists, so all those who hate Pinterest could blacklist it.
Black hat SEO probably also doesn't explain why I cannot find unique strings in open source repos and instead get pages of, not exactly webspam, but answers to questions I didn't ask.
I think people also have an inflated recollection of how good Google actually was back in 2005.
Back then Google was only going up against indexes and link-rings, not 2021 Google/Bing/DDG/etc.
> I think people also have an inflated recollection of how good Google actually was back in 2005.
I've been pointing this out for at least close to a decade.
I know since I bothered to screenshot and blog about it in 2012.
I'll admit mistakes happened back then too, but they were more forgivable, like keyword stuffing on unrelated pages. Back then Google was on our side and removed those as fast as possible.
Today, however, the problem isn't that someone has stuffed a keyword into an unrelated page, but that Google themselves mix a whole lot of completely irrelevant pages into the results, probably because some metrics go up when they do that.
Thinking about it, it seems logical that for a search engine that practically speaking has a monopoly both on users and, as mattgb points out, to some degree also on indexing, serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five extra times, that means one, two or five times more ad impressions.
Note that I'm not necessarily suggesting a grand evil master plan here, only that end-to-end metrics will improve as long as there is no realistic competition.
> Thinking about it, it seems logical that for a search engine that practically speaking has a monopoly both on users and, as mattgb points out, to some degree also on indexing, serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five extra times, that means one, two or five times more ad impressions.
This would mean that Google were measuring the quality of their search results by the number of ad impressions, which seems unlikely to me. Maybe in some big, woolly sense this is sort of true, but it seems pretty unlikely that anyone interested in search quality (i.e. the search team at Google) is looking at ad impressions.
I was using Altavista at that time, every now and then switching to Northern Light. Everything else was abysmal. Google blew them out of the water in terms of speed, quality, simplicity, lack of clutter and everything else. I can't remember ever retraining muscle memory so fast as when switching to Google. So, no, Google was great then and, apart from people actively working against the algorithm, is still good now, but obviously a completely different beast.
I think the parent's point was that people say Google 2005 >> Google 2021, but it's pretty hard to make this comparison in an objective way. No doubt Google 2005 was way better than other offerings around at the time.
2005? There were loads of other search engines (SE), and many meta-SE: hotbot, dogpile, metacrawler, ... (IIRC), plenty more.
There were also indexes, which Yahoo and AOL (remember them!) had, but there was also, what was it called, dmoz?, the open web directory. When Google started, being in the right web directory gave you a boost in SERPs as it was used as a domain trust indicator, and the categories were used for keywords. Of course it got gamed hard.
Google was good, but I used it as an alt for maybe 6 months before it won over my main SE at the time. I've tried but can't remember what SE that was, Omni-something??
One of the main things Google had was all the extra operators like link:, inurl:, etc., but they had Boolean logic operators too at one point, I think.
_I've tried but can't remember what SE that was, Omni-something??_
Google replaced Altavista in my usage, who in turn were usually better than their predecessors.
I used them all and kept using the ones that gave me unique results. Google was hands down better because of pagerank and boosts to dmoz listed sites and because they scanned the whole page ignoring keywords.
Google was good, actually very good back in 2000s. Their PageRank algorithm practically eliminated spam pages that were simply a list of keywords. Before Google, those pages came up on the first page of Altavista.
I don't specifically remember 2005, but the quality went down with more modern but still shady SEO practices.
No, quality went down because google shat the bed. All the changes have been deliberate.
I hate google now. Every time I use it by accident I’m reminded how infuriating it is. I know DuckDuckGo is just bing in a Halloween mask, but I’ll gladly use something that’s not awesome as long as it’s also not infuriating. I’d take 2005 google any day.
Well if the result didn't appear in the first 5-10 pages, it's probably not in the index.
You can see it with other search engines. I challenge you to come up with a Google query for which a first-page result won't be seen within the first 10 pages of Bing results for the same query.
(Bonus points if that result is relevant).
There's only so much tweaking that personalization and other heuristics can do.
But if something is missing from the index, that's it.
I would like to see the least relevant search result Google comes up with. :)
Yes, I realize this is probably trivial with an API call, but I always found it interesting there isn't a way to see what the site with the lowest pagerank in the index is.
It sounds to me like your challenge includes anything which is in Google's index but not Bing's? Is that intentional?
I assume the author has the ability to search the index to see if your preferred Google result is even indexed.
I've used Gigablast off and on for a long time (I think I first discovered Gigablast in 2006 or so). Would be cool to have a registration service for legitimate spiders. I used to run a team that scraped jobs and delivered them (by fax, email, US mail as required by law) to local veterans' employment staffers for compliance. We were contracted by huge companies (at one point about 700 of the Fortune 1000) to do so, and often our spiders would be blocked by the employer's IT department even though the HR team was paying us big bucks to do so.
Dude, I use your engine regularly, it is spectacular. The amount of work you put into this takes some dedication.
I was curious if you ever intend to implement the OpenSearch API so that we could use it as the default in a browser or embed it in applications?
Also how can people contribute to help you maintain a larger index and/or keep the service going?
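(For reference, OpenSearch support mostly comes down to serving a small XML description document and linking to it from the homepage. A minimal sketch that writes one out; the Gigablast search URL template here is a guess for illustration, not a documented endpoint.)

    # Minimal sketch of an OpenSearch description document; the search URL
    # template is an assumption for illustration, not Gigablast's documented API.
    OPENSEARCH_XML = """<?xml version="1.0" encoding="UTF-8"?>
    <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
      <ShortName>Gigablast</ShortName>
      <Description>Gigablast web search</Description>
      <Url type="text/html" template="https://www.gigablast.com/search?q={searchTerms}"/>
      <InputEncoding>UTF-8</InputEncoding>
    </OpenSearchDescription>
    """

    with open("opensearch.xml", "w", encoding="utf-8") as f:
        f.write(OPENSEARCH_XML)

    # The site then advertises it with a tag like:
    # <link rel="search" type="application/opensearchdescription+xml"
    #       title="Gigablast" href="/opensearch.xml">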
Nice.
I'd pay $5-10/mo for a search engine that didn't just funnel me into the revenue-extracting regions of the web like Google does.
A subscriber-supported search engine sounds cool to me. Any precedent?
Copernic (
) had Copernic Agent Professional a while ago, a for-pay desktop application that had really good search features. Not sure if they discontinued it.
Wow blast from the past. I think I was using Copernic all the way back in 2003... Forgot all about them. Thanks!
As a general rule, nobody is willing to pay what they are worth to advertisers. Facebook makes $70/year/user in the US. You would pay $70 for an ad-free Facebook? Congratulations, you must be an above-average earner. Also: your value to advertisers just tripled. If you are willing to pay $210, it will immediately triple again.
Great point! So simple, but as someone who has never worked on this side of things I never thought about it.
How would legal limitations on data collection, like GDPR, influence the ratio? None? Only an insignificant degree? Or enough to actually influence business decisions?
You'll like
How do they pay for it?
From the FAQ:
> …Eventually, we plan to charge our members $4.95/month.
Kagi.com does this. In closed beta at the moment, but you can email and request access.
I've tested Kagi a bit. It nicely gave me exactly what I wanted, even in cases where names could have different meanings in different contexts (I tested with Kotlin).
The basic results are good, with some nice touches here and there, like a "blast from the past" section with older results (which is actually what I sometimes want) and another section where it widens the search up a bit (i.e. what Google does by default?).
Furthermore you can apply pre-defined search "lenses" that focus your search, or even make your own, and you can boost or de-rank sites.
I had not expected this to happen so quickly, but I'm going to move from DDG to Kagi as my default search engine for at least a couple of days, because I am fed up with both Google's and DDG's inability to actually respect my queries.
If it continues to work as well as it does today I'll happily pay $10 a month, and I might also buy six-month gift cards for close friends and family for next Christmas.
Think about it: unlike with an ad-financed engine, incentives are extremely closely aligned here: the smartest thing Kagi can do is to get me my results as fast as possible, to conserve server resources (and delight their customer).
For an ad-financed engine, and especially one that also serves ads on the search results pages, the obvious thing to do is to keep me bouncing between tweaking my search query and various pages that almost answer my question, but not quite.
(That said, if one is going to stay mainstream I recommend DDG over Google since 1. for me at least Google's results are just as bad, 2. with DDG it is at least extremely easy to check with Google as well to see if they have a better result, and 3. competition is good.)
Perhaps trolling the entire web is not useful today? I’d love a search engine where I can whitelist sites or take an existing whitelist from trusted curators.
Heh, I guess you mean "trawling" - trolling the entire web is something very different :)
Then again, if you look at today's search results, where everything above the fold belongs to Google, maybe we have been trolled indeed.
Depending on the intended metaphor, trolling could work too :)
https://en.wikipedia.org/wiki/Trolling_(fishing)
What would trolling the entire web look like?
It would look like a modern search engine with innovative technology offerings like Accelerated Mobile Pages.
Wow, you’re right. Trolling the entire web would involve an organization that carries considerable authority whose decisions can impact every member of the web.
AMP is the perfect way to troll websites into making shitty versions of their content, for no real reason other than just because you feel like it. And then when you’re satisfied with your trolling you just abandon the standard.
"Trolling" is fine, see e.g.
https://grammarist.com/usage/trawl-troll/#:~:text=Troll%20fo...
.
Not in this context - "trolling" as described there would apply to targeted indexing of a specific site; while "trawling" would refer to a wide net that attempts to catch all the sites.
Well, no, it's not fine.
See e.g. _the source you linked_, which explains the difference.
Did you read to the end? Methinks not!
>Did you read to the end? Methinks not!
Methink harder.
>Troll for means to patrol or wander about an area in search of something. Trawl for means to search through or gather from a variety of sources.
We were talking about _gathering_ information from a _variety of sources_ to build a search engine index.
Trusted curators is a dangerous dependency
Trusted consumers are better. The original page-rank algo was organic and bottom-up. But now it's the person not the page. Businesses compete for interaction not inbound links. So if you can make a modern page-rank that follows interaction instead of links and isn't a walled garden then I'd invest.
I could make that work, but what do you mean by "walled garden" in this context?
the business and allies of google - those entrenched interests that limit the current visibility of the web to themselves
That’s why you don’t make it a hard dependency and let people curate their own list of taste makers. They can share and exchange info about who the good taste makers are, and good ones might even charge for access to exclusive flavors.
It is. The alternative is scooping everything and using algos to curate. That seems worse imo.
Perhaps vote on results like on Reddit posts? Gets the junk sites down (and out of the index eventually).
Any open voting system is going to be under serious SEO pressure.
That’s the real issue, Google has indirectly infected the web with junk sites optimized for it. Any new search engine now has a huge hurdle to sort through all the junk and if it succeeds the SEO industry is just going to target them.
A more robust approach is simply to pay people to evaluate websites. Assuming it costs, say, $2 per domain to either whitelist or block, that's ~$300 million for the current web, and you need to repeat that effort over time. Of course it's a clear cost vs accuracy tradeoff. Delist sites that have copies and suddenly people will try to poison the well to delist competitors, etc. etc.
Adding a gatekeeper collecting rent isn't a solution - the people using SEO are already spending money to get their name up high on the list.
This is money spent by a search engine not money collected from websites. People don’t ever want to be sent to a domain parking landing page for example.
More abstractly, SEO is inherently a problem for search engines. Algorithms have no inherent way to separate clusters of websites set up to fake relevance from actually relevant websites. Personally I would exclude Quora from all search results, but even getting to the point where you're able to make that kind of assessment is extremely difficult on the modern web. Essentially the minimum threshold for usefulness has become quite high, which is a problem as Google continues to degenerate into uselessness.
Given Reddit is notorious for its problems with astroturfing and vote bots, I don't think this is a particularly promising approach.
Reddit is a community heavily gatekept by the mods in regards to specific topics.
Reddit is an extreme example of group think. Try posting something pro-Trump (I mean, surely even that guy has a positive thing or two to be said about him) and you'll get banned in some subs. Or you may get banned simply because the mod doesn't like the fact that you don't toe the party line.
Also, vote bots
That just means that you have to curate the people allowed to vote. Otherwise, it would be rule by the obsessed and the search engine optimizers, and the junk sites will dominate the index.
I'm not convinced that Google's recursive AI algos aren't a functional equivalent. They let you vote by tracking your clicks.
Plus, it scales less well than pure algorithmic search. This fight already happened, with a much smaller internet.
It works really, really well for libraries. Research libraries (and research librarians) are phenomenally valuable. I've missed them any time I'm not at a university.
Both curators and algorithms are valuable. This goes for finding books, for finding facts and figures, for finding clothes, for finding dishwashers, and for pretty much everything else.
I love the fact that I have search engines and online shopping, but that shouldn't displace libraries and brick-and-mortar. Curation and the ability to talk to a person are complementary to the algorithmic approach.
> It works really, really well for libraries
It scales extremely poorly. It works very well for situations where there are customers/sponsors willing to spend lots of money for quality, because then the cost scaling doesn't matter as much; research libraries, LexisNexis, Westlaw, etc. all do this, but it's not cheap, and the cost scaling with the size of the corpus _sucks_ compared to algorithmic search.
It is among the approaches to internet search that lost to more purely algorithmic search, because it scales poorly in cost.
+book stores. Curators can use algorithms to help them curate… Google’s SE is taking signals from poor curators imo.
How about just a meritocratic rating? Even here on HN I would appreciate some sort of weight on expert/experienced opinion. Although in theory I like the idea that every thought is judged on its own, the context of the author is more relevant the deeper the subject. That's one of the reasons I still read
. It has a niche audience with industry experience.
Lobsters is a great example of the benefits _and dangers_ of expert/experienced opinion. Lobsters is highly oriented around programming languages and security and leaves out large swaths of what's out there in computing. That's fine of course, but it creates a pretty big distortion bubble that's largely driven by the opinions of the gatekeepers on the site rather than a more wide computing audience.
Nothing is meritocratic. I think the term came into our lexicons because of a sociologist satirizing society and writing about how awful a “true” meritocracy would be.
> meritocratic rating
That is literally PageRank.
Pagerank was mostly based on inbound links. A popularity contest with some nuance is just that. Nothing is meritocratic including any Google algo.
It's not merely a democratic vote, where the page with the most links wins; what the algorithm does is weight the links based on the popularity of the originating domain. In other words, meritocratic rating.
You can apply the algorithm to any graph, and what it does is find the most influential nodes.
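(For anyone who hasn't looked at it, the core of the algorithm is short enough to sketch. A toy power-iteration version over an arbitrary link graph, with the usual 0.85 damping factor; purely illustrative, not Google's production ranking.)

    # Toy PageRank by power iteration over an arbitrary directed graph,
    # illustrating the point above: it simply surfaces the most influential nodes.
    def pagerank(graph, damping=0.85, iterations=50):
        # graph: node -> list of nodes it links to
        nodes = set(graph) | {t for targets in graph.values() for t in targets}
        outlinks = {n: graph.get(n, []) for n in nodes}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new = {n: (1 - damping) / len(nodes) for n in nodes}
            for node in nodes:
                targets = outlinks[node] or list(nodes)  # dangling node: spread evenly
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))  # "c" comes out on top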
I’m really interested in this as well. I use DDG and whenever I’m doing research I tend to add “.edu” because there are so many spam sites.
ha nice to hear this idea. I'm planning to work on this as a side project, just started recently
If the user requests a website, you could at least crawl on request, which would be an excuse to bypass the rules in robots.txt. It would be a loophole, let’s say.
That's a great idea.
Interesting. I had some interest in building a search engine myself (for playing around, of course). I had read a blog post by Michael Nielsen [1] which had sparked my interest. Do you have any written material about your architecture and stuff like that? Would love to read up.
[1]:
https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billio...
There's some stuff here:
https://github.com/gigablast/open-source-search-engine
Holy, that's a huge codebase. GitHub even shows no syntax highlighting for many .cpp files because they are so big.
I fiddled around and searched for some not-so-well-known sites in Germany and the results were surprisingly good. But it looks really... aged.
Holy shit. Click on random .cpp file. Browser hangs. O_O
Thank you.
> Cloudflare (owned in part by Google)
Please elaborate. Is there a special relationship between Cloudflare and Google?
Google Capital is an investor:
https://www.forbes.com/sites/katevinton/2015/09/22/google-mi...
That is not the same as being owned by Google.
Especially since Cloudflare went public back in 2019, at which point any investors cashed out.
- Sincerely,
a Google employee who has nothing to do with the investment branch of the company
> at which point any investors cashed out.
Well, actually that is also not true. At IPO, preferred stock converts to common, but the investors can keep their ownership; they can, but don't have to, cash out, or can cash out only partially.
Investors can also keep board seats in many (or most?) cases.
In this example, I don’t think it matters if Google Ventures kept their shares or not. So long as they are treated as any other stock holder, I don’t see an issue. If they still maintain a board seat, then there might be an issue, but I don’t see a problem with simply holding shares.
I don't know anything about this particular case, but it's very common for VCs to cash out at IPO or not long after. VCs identify good investments among early stage companies; they don't want to keep their money tied up in investments outside of their specialty.
Actually, being an investor in a company _is_ the same as owning that company in part.
Where did you read that google/alphabet owns part of Cloudflare?
Assuming OP is referring to Google Ventures' participation in at least one of Cloudflare's rounds.
https://www.crunchbase.com/funding_round/cloudflare-series-d...
Have you ever looked at the Amazon file?
I'll see if I can track down the link but I remember somebody sharing a dump with me from Amazon that apparently was a recent scrape.
Edit:
https://registry.opendata.aws/commoncrawl/
That's Common Crawl; they spider some billions of webpages, but that's still a tiny percentage of the web versus Google or Bing.
Common Crawl is being used to train the likes of GPT-3 and to mine image-text pairs for CLIP. I wonder how much useful content is missing. We're going to use all the web text, images and video soon, and then what do we do? We run out of natural content. No more scaling laws.
Do you have any stats on that? I've always wondered about the coverage of Common Crawl, if you include all the historical crawl files too.
Oh interesting, I've played with it a little but not a dev and I've always wondered what the coverage was like.
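(For what it's worth, per-domain coverage can be checked against the Common Crawl CDX index directly. A rough sketch; the collection label is just an example and the query parameters follow the public index server's conventions, so treat it as illustrative rather than authoritative.)

    # Rough coverage check against one Common Crawl collection. The label
    # "CC-MAIN-2021-43" is an example; current labels are listed at
    # https://index.commoncrawl.org/.
    import requests

    def cc_records(domain, collection="CC-MAIN-2021-43", limit=10):
        resp = requests.get(
            f"https://index.commoncrawl.org/{collection}-index",
            params={"url": f"*.{domain}", "output": "json", "limit": limit},
            timeout=60,
        )
        # one JSON record per line; an empty body means no captures in this crawl
        return [line for line in resp.text.splitlines() if line.strip()]

    print(len(cc_records("example.org")), "captures found (capped at 10)")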
If you're serious about this, add a paid tier. As long as it's free, I don't trust that you won't ever sell my data to make bank.
Why do people think a paid tier will prevent their data from being sold after the company pockets the money? Aside from that, if they go bankrupt, then the data isn't theirs to withhold anymore, for one.
You are going to pay for a generalized web search when DDG/Google/Bing/etc are free?
Yes. I use Brave Search and I hope they add a paid tier, which I think they have confirmed they'll add at a later date.
If you don't pay, you are the product. Simple as that.
Telegram, Signal, Mozilla are counterexamples... Have a large charitably donated cash balance sitting in your account, and your organisational motivation is all different.
The Mozilla Foundation does not fund Firefox; that's in an arm's-length, wholly owned for-profit subsidiary, and Google is the main source of funding via the search deal.
https://twitter.com/brave/status/1466510541128548362?s=20
There are a lot of products you pay for, and still are the product.
> If you don't pay, you are the product.
If not enough people pay, there's no product.
If nobody pays, there's even less of it. Not sure what your point is.
I would - the problem with those services is that they prioritise the results that generate the most money for the search engine rather than give me the best results, and then they index my searches to track and advertise to me throughout the web.
A clear pricing transaction sounds much nicer to me. Should generate better results too.
The Internet is such a fabric of society that I think all nations should contribute to a one-truth index. Not owned by a corporate entity. Tell me I’m wrong and we can consider the alternative: startups of all types with a more even playing field.
Great job, I didn't know about Gigablast and it looks very interesting. Can I give you a small piece of feedback? I just tried searching for myself on Gigablast, and the first results are profile pages which haven't been updated since like 2005. Meanwhile, my own personal page appears at the very bottom of the results.
So my suggestion would be to lower the weight of the ranking of the domain, and promote sites which have a more recent update date.
Send me an email (contact in profile) if you want to follow up on this feedback!
What we need is a net neutrality doctrine on the server side. Bandwidth is hardly scarce outside of AWS's business model. Ban the crawler user-agent dominance by the big search engine players. "Good behaviour" should be enforced via rate limiting that equally applies to all crawlers, without exemption for certain big players.
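(A sketch of what "rate limiting that equally applies to all crawlers" could look like on the server side: one token bucket per requesting identity, with the same parameters for everyone and no special-casing of Googlebot. Illustrative only; verifying crawler identity is its own problem.)

    # Equal-treatment rate limiting: every crawler identity gets the same bucket.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec=1.0, burst=10):
            self.rate, self.burst = rate_per_sec, burst
            self.tokens, self.last = float(burst), time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    buckets = {}  # crawler identity (verified bot name, IP block, ...) -> bucket

    def crawler_allowed(identity):
        # same limits whether the identity is Googlebot or a one-person engine
        return buckets.setdefault(identity, TokenBucket()).allow()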
I hadn't used gigablast before, but a quick test had it find some very old, obscure stuff, as the top hit. Well done. However, the link on the front page to explain privacy.sh comes up with "Not Private" in Chrome. The root Cisco Umbrella CA cert isn't trusted. Oops.
With a slightly fresher coat of paint this could be very popular. For example, no grey background.
I tried out four search words with your search engine, and I am not convinced that it is mainly the index size and not the algorithm that is to blame for bad search results. There are way too many high-ranking false positives. Here is what I tried:
a) "Berlin": 1. The movie festival "Berlinale" 2. The Wikipedia entry about Berlin 3. Something about a venue "Little Berlin", but the link resolves to an online gaming site from Singapure 4. "Visit Berlin", the official tourism site of Berlin 5. The hash tag "#Berlin" on Twitter 6. "1011 Now" a local news site for Lincoln, Nebraska 7. "Freie Universität Berlin" 8. Some random "Berlin" videos on Youtube 9. The Berlin Declaration of the Open Access Initiative 10. Some random "Berlin" entries on IMDb 11. A "Berlin" Nightclub from Chicago 12. Some random "Berlin" books on Amazon 13. The town of Berlin, Maryland 14. Some random "Berlin" entries on Facebook 15. The BMW Berlin Marathon b) "philosophy" 1. The Wikipedia entry about philosophy 2. "Skin Care, Fragrances, and Bath & Body Gifts" from philosophy.com 3. "Unconditional Love Shampoo, Bath & Shower Gel" from philosophy.com 4. Definition of Philosophy at Dictionary.com 5. The Stanford Encyclopedia of Philosophy 6. PhilPapers, an index and bibliography of philosophy 7. The University of Science and Philosophy, a rather insignificant institution that happens to use the domain philosophy.org 8. "What Can I Do With This Major?" section about philosophy 9. Pages on "philosophy" from "Psychology Today". I looked at the first and found it to be too short and eclectic to be useful. 10. The Department of philosophy of Tufts University c) "history" 1. Some random pages from history.com 2. "Watch Full Episodes of Your Favorite Shows" from history.com 3. Some random pages from history.org 4. "Battle of Bunker Hill begins" from history.com 5. Some random "History" pages from bbc.co.uk 6. Some random pages from historyplace.com 7. The hash tag "#history" on Twitter 8. The Missouri Historical Society (mohistory.com) 9. Some random pages from History Channel 10. Some random pages from the U.S. Census Bureau (www.census.gov/history/) d) "Caesar" 1. The Wikipedia entry about Caesar 2. Little Caesars Pizza 3. "CAESAR", a source for body measurement data. But the link is dead and resolves to SAE International, a professional association for engineering 4. The Caesar Stiftung, a neuroethology institute 5. Some random "Caesar" books on Amazon 6. Hotels and Casinos of a Caesars group 7. A very short bio of Julius Ceasar on livius.org 8. Texts on and from Caesar provided by a University of Chicago scholar 9. (Extremely short) articles related to Caesar from britannica.com 10. "Syria: Stories Behind Photos of Killed Detainees | Human Rights Watch". The photos were by an organization called the Caesar Files Group
So what I can see are some high-ranked false positives that are somehow using the search term, but not in its basic meaning (a3, a11, b2, b3, d2, d3, d4, d6) or not even that (a6). Some results rank prominently although they are of minor importance for the (general) search term (a9, a13, b7, b8 -- perhaps a15 and d10). Then there are the links to the usual suspects such as Wikipedia, Twitter, Amazon, etc. (a2, a5, a8, a10, a12, a14, b7, c5, d1, d5); I understand that Wikipedia articles feature prominently, but for the others I would rather go directly to e.g. Amazon when I am interested in finding a book (or use a search term like "Caesar amazon" or "Caesar books"). Well, and then there are the search results that are not completely off, but either contain almost no information, at least compared to the corresponding Wikipedia article and its summary (b4, b9, d7, d9), or are too specific for the general search term (c1, c2, c3, c4, c6, c9, c10).
That leaves me with the following more or less high-quality results (outside of the Wikipedia pages): a1, a4, a7, b5, b6, b10, and d8. The a15 and d10 results I could tolerate if there had been more high-quality results in front of them; but as a fourth and second, respectively, good result they seem to me to be too prominent. Also, in the case of "Berlin", a4 should have been more prominent than a1, and a7 is somewhat arbitrary, because Humboldt University and the Technical University of Berlin are likewise important; what is completely missing is the official website of the city of Berlin (English version at www.berlin.de/en/).
All in all, I would say that your ranking algorithm lacks semantic context. It seems the prominence of an entry is mainly determined by either just being from the big players like Twitter, Youtube, Amazon, Facebook, etc. or by the search term appearing in the domain name or the path of the resource, regardless of the quality of the content.
I don't know about others, but when I think of the "good old google days" I'm _not_ expecting the results for your example queries to be any good.
In those days querying took some effort but the effort paid off. The results for "history" just couldn't matter less in this mindset. You search for "USA history" or "house commons history" or "lake whatever history" instead. If the results come up with unexpected things mixed in, you refine the query.
It was almost like a dialog. As a user, you brought in some context. The engine showed you its results, with a healthy mix of examples of everything it thought was in scope. Then you narrowed the scope by adding keywords (or forcing keywords out). Rinse and repeat. As a user, you were in command and the results reflected that.
The idea that the engine should "understand what you mean" is what took us to the current state. Now it feels like queries don't matter anymore. Google thinks it knows the semantics better than you, and steering it off its chosen path is sometimes obnoxiously hard.
> The idea that the engine should "understand what you mean" is what took us to the current state. Now it feels like queries don't matter anymore. Google thinks it knows the semantics better than you, and steering it off its chosen path is sometimes obnoxiously hard.
Bingo! If you cede control to Google, it _will_ do what it's optimized to do, and not what _you_ are looking for.
What it is optimized to do says nothing.
Optimizing for open text queries means dealing with a massive search space; the thing is choosing a subspace in which to search, and that is the part engines have to refine. How that is done is a different story. Some people may agree to their location, search history and visits to online stores being used to do so, but some may not.
This is why, in the good old days, my favourite search engine was Alta Vista. In its left margin it had key words arranged like a directory tree that could be used to further refine the search. So my ideal search engine should do something like this if I type in a generic term: provide me with relevant information about the general topic and then help me to refine my search. The way Wikipedia provides a principal article and a structured disambiguation page is the way I would prefer.
I admit, my evaluation of the search engine was just a simple test of how much I could get out of the results for some generic key words in the first place. A more detailed evaluation should, of course, look deeper. It was more of a trial balloon to see if this search engine raises any hope that it could be better than Google with regard to my own (subjective) expectations of a decent result set.
I get what you mean, but part of the whole initial appeal of Google was that it gave much more relevant results initially than Altavista or the other options. That was why Google put in the audacious "I'm feeling lucky" button.
Yeah but it's from that same philosophy that Google Search is useless as it optimises for the first result.
There is no search engine that searches literally for what you asked and nothing else. Search is shit in 2021 because it tries to be too clever. I'm more clever than it, let me do the refining.
>"I'm feeling lucky" button
My brain got so used to ignoring it I completely forgot it's a thing. I'm also unclear what it does? On an empty request, it gets me to their doodles page and with text in the box, gets me to my account history landing page.
It automatically redirects to the first search result.
Right, not sure why it wasn't working yesterday as opposed to now, I swear I wasn't doing it wrong (or how I could've).
This was the result of two things:
1. MapReduce.
2. Using links to rank the pages.
Using links to rank the pages is not really possible any longer because of SEO spam links.
I think you have some great feedback here but for me it also highlights how subjective search results can be for individuals - for example, these false positives that you mention (b2, b3) appear as the top result on Google for me for that query.
It makes me think there must be some fairly large segment of the population that want that domain returned as a result for their query, no?
I would not deny that a large part of subjectivity is involved. This is why I used several markers of subjectivity in my evaluation ("what I can see", "that leaves me", "they seem to me", "I would say", etc.). And related to that: I also agree with other responses that a search often needs to be refined. So my four examples were in no way an exhaustive evaluation, but an explorative experiment, where I just used two proper names, one for a city and one for a historical person, and two general disciplines as search words, in order to see what happens and what is noteworthy (to me). So much for the subjective side.
But what can be said about ideal search results for these terms beyond subjectivity? I do not think that we can arrive at an objective search result, but are nevertheless allowed to criticise search results with respect to their (hidden or obvious) agenda.
Let me give an example of the good old days: when I was searching for my surname on Google in the early 2000s, the search results contained a lot of university papers or personal Web-sites (then called "homepages") from other people of that name. But suddenly, I can't remember when exactly this was, the search results contained almost exclusively companies that had that surname in their company name. The shift was not gradual, as if it were representing a slow shift in the contents of the Internet itself, but abrupt. It was apparently due to an intentional modification of the ranking algorithm that put business far above anything else on the Internet.
My explanation for this is the following: the objective metric for Google search results is the stream of revenue they generate for Google. But not only for Google. The fundamental monetary incentive for Bing (and its derivative Ecosia) is more or less the same. And how different the impact of the somewhat different business model of Duck Duck Go is, is open for debate.
If maximum revenue is the goal, the aim is to provide the best search results according to the business model (advertisement, market research and whatever else) without driving the users away. But the best search results according to the business model are not necessarily the optimal search results for the typical user. And as long as all relevant competitors are following the same economic pressure of maximizing revenue, the basic situation and thus the quality of the search results for the user will not improve above a certain level. If we want this situation to change, we need competitors with a different, non-commercial agenda. Either from the public sector (an analogon to the excellent information services about physical books provided by libraries) or from non-profit organizations (an analogon to Wikipedia or Open Street Map).
To answer your question about b2 and b3: I checked with other search engines; besides Google they appear for me also on Bing (as #8, same product but on a different Web-site) and Duck Duck Go (as #10); Bing also has a reference to them in the right margin as a suggestion for a refined search (this time exactly b2 and b3). Although I do not think that the results from those search engines should be considered as a general benchmark for good search results for the reasons given above, we may speculate why they appear on the first page of search results. I would guess that it is a combination of gaming the search engines by using a generic term as a product and domain name to get free advertising, and search engine algorithms making this possible by generally ranking products and companies high in their search results.
Oh of course you can criticise the result, I more found it interesting that a billion dollar, optimized search experience thought your false positive was actually a top result. A huge variance in the subjectivity between your experience and their invested reasoning.
But while we're speculating on why the domain appears at the top of the list, let me hazard a guess...
Philosophy.com was registered in 1999 and according to waybackmachine, has been selling cosmetics on the site since 2000 (20+ years). The company sold in 2010 for ~$1B to a holding company with revenues of $10B+ today (Unfortunately I couldn't find how much it contributes to that revenue). According to Wikipedia, the Philosophy brand has been endorsed by celebrities, including "long-time endorser" Oprah Winfrey, possibly the biggest endorsement you could get for their industry/demographic.
I think it is a long established business, with strong revenues and there's more people online searching for cosmetic brands than for philosophers.
In the same way (admittedly in the extreme) when I'm researching deforestation and I query to see how things are going for the 'amazon', the top result is another successful company registered pre 2000, with strong revenues that most likely attracts more visitors..
Okay, you convinced me that it should (inter-subjectively) not count as a real false positive, as I first thought.
Nevertheless, when I try to analyze what is going on here, I would rather use the word "context" instead of "subjectivity", since I think (or at least hope) that my surprise to find this brand on place #2 in my Google results for "philosophy" is shared by quite a lot of people who lack the context to give it meaning, because this brand is unknown to them. I have the excuse that it is a North American brand irrelevant in my German context. Interestingly, when I search for "philosophy" on amazon.com (without refining the search), I get almost exclusively beauty products and related items as a result, but when I search for "philosophy" on amazon.de it is only books. Google nevertheless has the beauty brand as #2 in Germany. Can we agree that Amazon is better at considering the context of the search for "philosophy" than Google?
As an aside: Your "amazon" example reminds me when I was searching for "Davidson" expecting to find information about Donald Davidson, but received a lot of results about Harley-Davidson. (But since I was aware of the importance of this brand, it was understandable to me.)
We can agree on that, yes =)
I was thinking about this and when you look at the top keyword searches on Google, it's dominated by people searching brands each year, so I think Google is just naturally optimised for this. I think any Search Engine designed for the masses would probably have to behave like this too.
https://www.siegemedia.com/seo/most-popular-keywords
I agree, I think the early web was used more for general information rather than specific brand information (and was more useful for people like myself). I'm not sure what is needed to get more results such as university papers or personal web-sites as I think that people use the internet differently now and that the link structure reflects that.
It's interesting that Google isn't used to search for people anymore (I couldn't see any people in the recent top 100 keyword search data).
Some observations:
Most of the "brands" in the top 100, especially at the beginning, are rather Internet services. These search terms seem to have been entered not with the intention to "search" in the sense to find some new information, but as a substitute for a bookmark to the respectice service. Who searches for #1 "youtube" does not want information about youtube, but wants to use the youtube Web-site as a portal to find videos there.
I would also guess that most of these searches haven't been initiated through the Google Web-site, but directly from the browser's address/search bar or a smartphone app. They exhibit a specific usage pattern, but do not show what the people who entered them were really searching for, if they were searching at all. What are those people who search for "youtube" doing next: either searching again on youtube or logging into their youtube account and browsing their youtube bookmarks.
The early Internet did not have so many different services people used on a daily basis, and those that existed were more diverse (think of the many different online email providers in those days), so the search terms spread out more. Also, browsers had no direct integration with a search engine. The incentive was higher to use bookmarks for your favourite service, since otherwise you had to use a bookmark to a search engine anyway.
Perhaps it would be more appropriate to compare the use of the early Google not with the current Google, but with the current Google Scholar?
You inspired me to try an even less specific search: thing
Subjectively felt the gigablast results were a relative delight.
Not a bad idea. At the risk of being sidelined: "philosophy" was not so bad a term either. Start with an arbitrary Wikipedia link and click on the first keyword of the summary after the linguistic annotations (or other annotations in brackets) and repeat the process until you reach a loop. You will almost always end with "philosophy" -> "metaphysics" -> "philosophy" -> ... This works for "Berlin", "history" and "Caesar" as well as for "thing". For the latter very fast: "thing" -> "object" -> "philosophy".
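(The walk is easy to try programmatically. A rough sketch against the current English Wikipedia HTML; stripping parenthesised text is only an approximation of "skip the annotations in brackets", so expect it to stumble on some pages.)

    # Follow the first body link of each Wikipedia article until a loop appears.
    import re
    import requests
    from bs4 import BeautifulSoup

    def first_link(title):
        html = requests.get(f"https://en.wikipedia.org/wiki/{title}", timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for p in soup.select("#mw-content-text p"):
            # crude: drop parenthesised spans so "(from Greek ...)" links are skipped
            cleaned = BeautifulSoup(re.sub(r"\([^)]*\)", "", str(p)), "html.parser")
            for a in cleaned.find_all("a", href=True):
                href = a["href"]
                if href.startswith("/wiki/") and ":" not in href:
                    return href.split("/wiki/", 1)[1]
        return None

    def walk(title, max_steps=30):
        seen = []
        while title and title not in seen and len(seen) < max_steps:
            seen.append(title)
            title = first_link(title)
        return seen

    print(" -> ".join(walk("Berlin")))  # usually ends up cycling around Philosophy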
that's tripped out. where did you hear about that?
I can't remember. Probably on Hacker News.
I'll admit I had not been working on the quality of single-term queries as much as I should have lately. However, especially for such simple queries, having a database of link text (inbound hyperlinks and the associated hypertext) is very, very important. And you don't get the necessary corpus of link text if you have a small index. So in this particular case the index size is, indeed, quite likely a factor.
And thank you for the elaborate breakdown. It is quite useful and very informative, and was nice of you to present.
And I'm not saying that index size is the only obstacle here. I just feel it's the biggest single issue holding Gigablast's quality back. Certainly, there are other quality issues in the algorithm and you might have touched on some there.
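(For readers wondering why link text matters so much for short queries, here is a toy illustration of the idea; the data structures are made up for the example, not Gigablast's actual schema.)

    # Toy anchor-text ("link text") index: the words other pages use when linking
    # to a page are a strong relevance signal, especially for one-word queries.
    from collections import defaultdict
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    anchor_index = defaultdict(list)  # target URL -> list of inbound anchor strings

    def index_page(page_url, html):
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            text = a.get_text(" ", strip=True)
            if text:
                anchor_index[urljoin(page_url, a["href"])].append(text.lower())

    def anchor_score(target_url, query):
        terms = query.lower().split()
        # count inbound anchors that mention every query term
        return sum(all(t in txt for t in terms) for txt in anchor_index.get(target_url, []))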
Let me add just one thought on the single-term searches: I do not think that a good search result for such terms as "philosophy" should focus on the primary meaning of the term alone. As someone else has pointed out, the beauty brand can be quite important for some people. If we look at a search engine as a tool that needn't present me with near-perfect results from the outset, but as something I can have a dialogue with to find something, then it is best that results for single terms present me with a variety of different special meanings (and probably some useful suggestions on how to refine my search). Perhaps you can scrape the Wikipedia disambiguation pages and use them somehow to refine your search results.
Let's compare with google:
- Berlin:
Wiki
Berlin travel site (visit Berlin)
website for Berlin
Youtube videos
Britannica for Berlin
Bunch of US town sites named Berlin
- Philosophy:
Same skincare website is first result
Wiki is second
Britannica is third
Stanford
News stories
Other dictionaries and encyclopedias
- History"
history.com is first result
Then is the "my activity" google site, maybe this is
actually relevant
Youtube, lots of history channel stuff
Twitter history tag
Wikipedia for "History"
How to delete your Chrome browser history
Dictionary definitions
- Caesar:
Wiki for Julius Caesar
Britannica
BBC for JC
Google maps telling me how to get to Little Caesar's Pizza
Dictionary
Apparently some uni has a system called CAESAR
biography.com
Caesar salad recipe
history.com
images for Caesar
OK, I'll bite. How would _you_ rank the results for each of those queries?
What heuristics or AI are being used to block your spider? If your spider appears human or organic, it will not be blocked, right?
Is this an issue of rate limiting, or request cadence? Could you add randomness to the intervals at which you request pages?
Is it more complicated? Do they use other signals to ascertain whether you are a script or not, like checking data from the browser (similar signals to the kind of things browser fingerprinting uses, e.g. screen res, user agent, cache availability, etc.)? Would it be possible for the browser to spoof this information?
I imagine rate limiting the IP address is the major issue... but could you not bounce the request through a proxy network? I've tried this with the TOR network before when writing web scrapers and had mixed success... seems like Google knows when a request is being made through Tor.
Perhaps you could use the users of your search engine as a proxy network through which to bounce the requests for the scrape/indexing... This way the requests would look like they were coming from any of your users instead of one spider's IP address... I'm not sure how Cloudflare or any other reverse proxy could determine whether those requests were organic or not...
I'd be OK with contributing to a distributed search service so long as my CPU was not making requests for illegal content, and there were constraints put on the resource usage of my machine.
Sorry if this came off as all over the place, I do not know too much about the offense vs defense of scraping. These are just some thoughts...
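(On the randomised-interval question above: the knobs a small crawler actually has are an honest User-Agent, robots.txt checks, and per-host delays with jitter. A minimal sketch, with a hypothetical bot name; it illustrates etiquette, not a way past dedicated bot detection.)

    # Per-host politeness delay with jitter, robots.txt checks, honest User-Agent.
    import random
    import time
    import urllib.robotparser
    from urllib.parse import urlparse
    import requests

    USER_AGENT = "ExampleBot/0.1 (+https://example.org/bot)"  # hypothetical bot
    last_hit = {}      # host -> timestamp of last request
    robots_cache = {}  # host -> RobotFileParser

    def allowed(url):
        host = urlparse(url).netloc
        if host not in robots_cache:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            try:
                rp.read()
            except Exception:
                pass  # if robots.txt can't be fetched, can_fetch() stays conservative
            robots_cache[host] = rp
        return robots_cache[host].can_fetch(USER_AGENT, url)

    def polite_get(url, base_delay=5.0, jitter=3.0):
        if not allowed(url):
            return None
        host = urlparse(url).netloc
        wait = base_delay + random.uniform(0, jitter) - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)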
> I've tried this with the TOR network before when writing web scrapers and had mixed success... seems like Google knows when a request is being made through Tor.
That's because all the Tor entry/exit nodes' and relays' IP addresses are public [1].
[1]
https://metrics.torproject.org/rs.html#toprelays
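(Concretely: the Tor Project publishes a bulk exit list, so checking whether a requesting IP is a Tor exit is a one-liner for any site operator. Sketch below; the URL is the one the project has used for this list, but verify it before relying on it.)

    # Check an IP against the public Tor exit list.
    import requests

    def tor_exit_ips():
        resp = requests.get("https://check.torproject.org/torbulkexitlist", timeout=30)
        return {line.strip() for line in resp.text.splitlines()
                if line.strip() and not line.startswith("#")}

    exits = tor_exit_ips()
    ip = "203.0.113.7"  # example address
    print(f"{ip} {'is' if ip in exits else 'is not'} a known Tor exit")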
Regarding the gatekeeper problem: it's a wild guess, but if there were a way to involve users by organizing distributed scraping just for the sake of building a decent index, I'm sure many of them would help.
Yes, large proxy networks are potential solutions. But they cost money, and you are playing a cat and mouse game with Turing tests, and some sites require a login. Furthermore, people have tried to use these to spider LinkedIn (sometimes creating fake accounts to log in) only to be sued by Microsoft, who swings the CFAA at them. So you start off with an intellectual desire to make a nice search engine and end up getting sidetracked into this pit of muck and having Microsoft try to put you in jail. And, no, I'm not the one Microsoft was suing.
Not sure if you're looking for feedback, but the News search could use some work, I searched for "Ethiopia" and almost all of the articles were unrelated to Ethiopia except for the existence of some link somewhere on the page.
Your general web search seems pretty good, although I've just given it a casual glance. I think your News search could be improved by just filtering the general search results for News-related content, since the "Ethiopia" content I get there is certainly Ethiopia-related.
In any case, an interesting product, I'll try to keep an eye on it.
_It's much more expensive now to build a large index (50B+ pages)_
Do you have a cost estimate? Also, could you be more selective in indexing, e.g. by having users request sites to be crawled?
Requiring users to know what sites they want in advance somewhat defeats the purpose of a search engine, no?
Not at all. You only have to fail the first request. It is an approach I took with my own attempt at a search engine way back! In fact, I know personally that there is at least one patent out there that suggests asking first-time requesters to provide the appropriate response, as an efficient way to teach systems for future users.
Obviously failing first requests isn't ideal but for popular requests it quickly becomes insignificant. Wikipedia might (if they don't already) want to make a similar suggestion for users to contribute when finding a low content/missing page.
> Obviously failing first requests isn't ideal but for popular requests it quickly becomes insignificant.
The first request can also be handled asynchronously, displaying a message to the user that it is 'processing...'.
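(A minimal sketch of that "fail the first request, crawl in the background" flow; the queue, the naive URL guess and the in-memory index are all stand-ins for illustration.)

    # Fail the first query for unknown content, but queue a background crawl so
    # a later identical query can succeed.
    import queue
    import threading
    import requests

    crawl_queue = queue.Queue()
    index = {}  # url -> page text (stand-in for a real inverted index)

    def crawler_worker():
        while True:
            url = crawl_queue.get()
            try:
                index[url] = requests.get(url, timeout=30).text
            except Exception:
                pass
            finally:
                crawl_queue.task_done()

    threading.Thread(target=crawler_worker, daemon=True).start()

    def search(query):
        hits = [u for u, text in index.items() if query.lower() in text.lower()]
        if hits:
            return {"status": "ok", "results": hits}
        # no hits: answer "processing" now, and crawl a guessed source for next time
        crawl_queue.put(f"https://en.wikipedia.org/wiki/{query}")  # naive guess
        return {"status": "processing", "results": []}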
More often than not I have an idea which site a result might be on when I issue a query:
If I search for a news event it's a news site.
If I search for an error message, I know the result is likely going to be Stack Overflow, GitHub issues or the forum of the library.
etc.
I don't think this strategy will get you all the way there, but it could be combined with other ways of curating sites to crawl.
Since sites are so desperate to be indexed, doesn't it seem better to put the onus on them to announce themselves? It would be great if DNS registries published public keys... maybe they do in newer schemes?
That works once your search engine is more widely used, but not a lot of sites are going to register with a niche search engine. Many users on the other hand really want a search engine like this and would be willing to invest some time.
Certificate Transparency (CT) Logs are this.
Is there a way to get the results to be formatted for desktop?
It looks like the layout is hard-coded for a mobile browser, in portrait mode.
I just looked myself up in your search engine and I can confirm that it finds stuff old enough that Google wouldn't find it (e.g. an old patch I submitted on GNU Savannah).
I tried looking up a game I'm interested in and the second results cluster from your search engine is a reddit thread about linux support for that game... I love this.
Great job!
What are your sources for hostnames to crawl?
I looked into it a long time ago and seem to remember there was a way to get access to registration records, but I imagine combining that with Certificate Transparency records would significantly increase your hostname list. Anything else?
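(As a concrete example of mining CT logs for hostnames, the crt.sh front end exposes a JSON output mode; its format and rate limits are informal, so this is illustrative only.)

    # Pull hostnames seen in Certificate Transparency logs for a domain via crt.sh.
    import requests

    def ct_hostnames(domain):
        resp = requests.get(
            "https://crt.sh/",
            params={"q": f"%.{domain}", "output": "json"},
            timeout=60,
        )
        names = set()
        for entry in resp.json():
            # name_value may contain several SAN entries separated by newlines
            for name in entry.get("name_value", "").splitlines():
                names.add(name.strip().lstrip("*.").lower())
        return names

    print(sorted(ct_hostnames("example.org"))[:20])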
This is great! I found something other engines do not pick up! Apparently I signed the Agile Manifesto in 2010:
https://agilemanifesto.org/display/000000190.html
I just tried it and the UI is kinda old and not mobile friendly but the English results I got were satisfying. Not the case for French though. I'll try again in the future, diversity in this landscape is important.
Re: crawling being too hard
Have you contributed your crawl data to Common Crawl?
I tried searching for an answer, but how do you get a site added to your directory? Who maintains it? Directories are a real PITA to maintain with any quality.
> 2) Hardware costs are too high.
Which is why the next big search engine should be distributed:
.
"distributed" doesn't make things more hardware efficient...
It literally always makes them less efficient.
If, e.g., Mastodon had the same number of users as Twitter, it would use 10x the resources for the same traffic.
Sure, but it does spread the costs among users and makes them more manageable. One guy shouldering the cost of a search index is less viable than letting users shoulder the costs. Some charge customers as a solution to this, and that works, but then they need a minimum revenue to continue, or have to monetize with investors, which usually means changing direction and goals. The other option, letting people host portions of the index, spreads the cost out, and the product gets about as good (best-case scenario) as its utility to people.
No way to test it right away, demo peer 502-es.
You could search for other public-facing instances, e.g.,
http://sokrates.homeunix.net:6060
.
Regarding the Gatekeeper companies like Cloudflare, it sounds like anti-competitive behavior that could potentially be targeted with anti-trust legislation, correct?
Cloudflare functions kinda like a private security company. They don't go around blocking sites willy-nilly; site owners have to specifically choose to use their service (and maybe pay for it), configuring the bot-blocking rules themselves.
That's not really Cloudflare's fault. Someone has to do it, whether it's them or a competitor or sys admins manually making firewall rules. Cloudflare just happens to be good enough and darned affordable, so many choose to use them.
Hosting costs for small site owners would be much more expensive without Cloudflare shielding and caching.
I've had extensive dealings with Cloudflare. They have a complex whitelisting system that is difficult to get on, and they also have an 'AI' system that determines if you should be kicked off that whitelist for whatever reason.
Furthermore, they give Google preferred treatment in their UIs and backend algos because it is the incumbent and nobody cares about other smaller search engines. So there's a lot of detail to how they work in this domain.
It's 100% Cloudflare's fault, and it's up to them to give everyone a fair shot. They just don't care. Also, you are overlooking the fact that Google is a major investor (and so are Bing and Baidu). So really this exacerbates the issue. Should Google be allowed (either directly or indirectly) to block competing crawlers from downloading web pages?
It isn't up to them to give everyone a fair shot. That isn't what their customers actually want. Cloudflare aren't in the "fair shots for all search engines" business. They are in the "stop requests you don't want hitting your servers" business.
I'd argue that a level playing field and more competition in the search space is a good thing.
These are all great points.
No, I think it is partially Cloudflare's fault because they offer this service and make it easy to deploy. This shit has exploded with Cloudflare's popularity.
Nobody _has_ to do it, but a lot of people will do it when they notice there's an easy way to do it. Cloudflare is very much an enabler of bad behavior here. Now a lot of sites just toggle that on without even thinking about collateral.
"targeted with anti-trust legislation"
Um, this is America. Every market is basically a trust, cartel, or monopoly.
And I don't know if you can hear that, but there is literally laughter in the halls of power. All the show hearings by congress on social media and tech companies only have to do with two things:
1) one political party thinking the other is getting an advantage from them
2) shaking them down for more lobbying and campaign donations
No one in the halls of power gives two shits about competition. Larger companies mean larger campaign donations, and more powerful people to hobnob with if/when you leave or lose your political office.
Of course I think that breaking up the cartels in every major sector would lead to massive improvements: more companies is more employment, more domestic employment, more people trying to innovate in management and product development, more product choice, lower prices, more competition, more redundancy/tolerance to supply chain disruption, less corruption in government and possibly better regulation.
Every large company brazenly does market abuse up to the point of one and only one limiter: the "bad PR" line. So I guess we have that.
Companies don't make campaign donations. The people "exposing" them are showing their employees making donations, and employees don't have the same interests as their employer.
it should be. there should be some sort of 'bots rights' to level the playing field. perhaps this is something the FTC can look into. but, as it is right now big tech continues to keep their iron grip on the web and i don't see that changing any time soon. big tech has all the money and controls access to all the data and supply chains to prevent anyone else from being a competitive threat.
look at linkedin (owned by microsoft, unspiderable by all but google/bing).
github (now microsoft, which uses it to fuel its AI coding buddy, but if you try to spider it at capacity your IP is banned)
facebook (unspiderable)
.. the list goes on and on ..
and as you can see, data is required to train advanced AI systems, too. So big tech has the advantage there as well. especially when they can swoop in and corrupt once non-profit companies like openai, and make them [partially] for-profit.
and to rant on (yes, this is what i do :)) it's very difficult to buy a computer now. have you tried to buy a raspberry pi or even a jetson nano lately? Who is getting preferred access to the chip factories? Does anyone know? Is big tech getting dibs on all the microchips now too?
No, it is not.
Cloudflare is giving its customers what they want. They don't want all kinds of bots claiming to be search engines crawling their sites. Cloudflare isn't hurting Cloudflare competitors by doing this. Cloudflare isn't hurting their customers by doing this. To repeat - most websites don't want lots and lots of crawlers. They want the 2 or 3 which matter and no more, because at some point it's difficult to tell what the crawler is doing... (is it a search engine???). They aren't obliged to help search engines. Even if Cloudflare wasn't offering this, bigger customers would roll their own and do... more or less the same thing.
At a theoretical level it looks like Cloudflare won't block search engine crawlers. The docs are very Google and Bing oriented, and also oriented towards supporting their customers, not a random new search engine crawler.
_Cloudflare allows search engine crawlers and bots. If you observe crawl issues or Cloudflare challenges presented to the search engine crawler or bot, contact Cloudflare support with the information you gather when troubleshooting the crawl errors via the methods outlined in this guide._
https://support.cloudflare.com/hc/en-us/articles/200169806
i would assume it's mostly anti-scraping protection, which is mostly for privacy.
you don't want to allow everyone to scrape your website and pull and use your info, for example from fb, ig, LinkedIn, github, ....
you can build a really big profiling db on people that way.
so websites need to know you are a legit search engine first
people can still be targeted if that information is public. anti scraping sounds like security by obscurity
> Hardware costs are too high
I want to say - you don't know what you are talking about. But, it'll be rude.
Hardware is much cheaper and more powerful now compared to 2005.
You've said it and it is rude, what's the point of that first sentence except to spite him? I'm sure he's well aware of the price per capability trend since 2005, you don't code a search engine without knowing that. Could be the costs of servicing his free users and/or maintaining an ever-growing database/index that is costly - in spite of cheaper hardware on a relative basis.
The complexity of the search algorithm has also increased substantially since 2005. And, in 2005, a billion-page index was pretty big. Now it's closer to 100 billion.
There were ~60B pages on Facebook in 2015, so I think your numbers are outdated. - Google search SRE
what kind of index is Gigablast using?
A traditional inverted index like Lucene's, or something more esoteric?
I know Google and Bing both use weird data structures like BitFunnel
https://www.microsoft.com/en-us/research/publication/bitfunn...
100% custom.
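Since the custom format isn't public, here is only a toy illustration of the "traditional inverted index" the question refers to: a map from term to a postings list of document IDs, with AND queries answered by intersecting postings.

    # Toy inverted index for illustration only -- nothing to do with
    # Gigablast's custom format or BitFunnel.
    from collections import defaultdict

    class TinyIndex:
        def __init__(self):
            self.postings = defaultdict(list)   # term -> [doc_id, ...]

        def add(self, doc_id, text):
            for term in set(text.lower().split()):
                self.postings[term].append(doc_id)

        def search(self, query):
            # AND query: intersect the postings lists of every term
            lists = [set(self.postings.get(t, [])) for t in query.lower().split()]
            return sorted(set.intersection(*lists)) if lists else []

    idx = TinyIndex()
    idx.add(1, "PDP-11 emulator written in C")
    idx.add(2, "history of the PDP-11")
    print(idx.search("pdp-11 emulator"))   # -> [1]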
Oh my god! This works so much better than every Internet search engine I have tried.
If you have customers, does that mean the incremental gain from an improved index costs too much to store? Or are you talking about computational costs?
it's both storage and computational. they go hand in hand.
What if you allowed trusted contributors to "donate" their browsing to your index?
AltaVista and Yahoo did that with browser plugins in the 90s.
Make sure to file complaints to any competition market authority you have in your country.
did you ever try to raise funds? why/not? not accusing, just curious.
did you ever think, let me just focus on Italy-relevant results? or job search only? or some slice of search.
I really love how the results organize multiple matching pages from the same domain. This is really cool.
I wanted to add my site to Gigablast, but it said it would cost 25 cents. How is this a good thing?
Curious how you implemented the index - memory based or disk based? Either way you are right, HW costs are extremely high and you would need a lot of high-RAM/high-core-count machines to serve such a large index to end users with low latency.
Storing information about the pages you can't index is also useful
I really like GigaBlast.
I wrote a "meta" search utility for myself that can query multiple search engines from the command line.^1 It mixes the results into a simplified SERP ("metaSERP"), optimised for a text-only browser, with indicators to show which search engine each result came from. The key feature is that it allows for what I might call "continuation searches". Each metaSERPs contains timestamps in its page source to indicate when searches were executed, as well as preformatted HTTP paths. The next search can thus pick up where the previous one left off. Thus I can, if desired, build a maximum-sized metaSERP for each query.
The reason I wrote this is because search engines (not GigaBlast) funded by ads are increasingly trying to keep users on page one, where the "top ads" are, and they want to keep the number of results small. That's one change from 2005 and earlier. With AltaVista I used to dig deep into SERPs and there was a feeling of comprehensiveness; leave no stone unturned. Google has gradually ruined the ability to perform this type of searching with their now secretive and obviously biased behind-the-scenes ranking procedures.
Why is there no way to re-order results according to objective criteria, e.g., alphabetical; the user must accept the search engines' ordering, giving them the ability to "hide" results on pages the user will never view or simply not return them. That design is more favorable to advertising and less favorable to intellectual curiosity.
Each metaSERP, OTOH, is a file and is saved in a search directory for future reference; I will often go back to previous queries. I can later add more results to a metaSERP if desired. I actually like that GigaBlast's results are different than other search engines. The variety of results I get from different sources arguably improves the quality of the metaSERP. And, of course, metSERPs can be sorted according to objective criteria.
This is, AFAIK, a different way of searching. The "meta-search engines" of yesteryear did not do "continuations", probably because it was not necessary. Nor was there an expectation that users would want to save meta-searches to local files. Users were not trying to minimise their usage of a website, they were not trying to "un-google".
Today's world of web search is different, IMO. There seems to be a belief that the operator of a search engine can guess what a user is searching for, that a user who sends a query is only searching for one specific thing, and that the website has an ad to match with that query. At least, those are the only searches that really matter for advertising purposes. Serendipitous discovery while perusing results is not contemplated in the design. By serendipitous discovery I do not mean sending a random query, e.g., adding an "I'm feeling lucky" button, which to me always seemed like a bad joke.
The only downside so far is I occasionally have to prune "one-off" searches that I do not want to save from the search directory. I am going to add an indicator at search time that a search is to be considered "ephemeral" and not meant to be saved. Periodically these ephemeral searches can then be pruned from the search directory automatically.
1. Of course this is not limited to web search engines. I also include various individual site search engines, e.g., Github.
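Not the author's code, but a minimal sketch of the continuation bookkeeping described above: each engine's next offset is carried in a small state object that a later search can resume from. The engine names and fetch_page are placeholders, not real APIs.

    # Sketch of the "metaSERP with continuations" idea.
    # fetch_page() is a stub -- each engine's query URL / parser is left out.
    import json, time

    ENGINES = ["gigablast", "mojeek", "ddg"]           # hypothetical identifiers

    def fetch_page(engine, query, offset):
        """Stub: return a list of (title, url) for `query` starting at `offset`."""
        return []                                      # real code would do HTTP + parse

    def meta_search(query, state=None):
        state = state or {"query": query, "offsets": {e: 0 for e in ENGINES}}
        serp = []
        for engine, offset in state["offsets"].items():
            results = fetch_page(engine, query, offset)
            serp += [(engine, title, url) for title, url in results]
            state["offsets"][engine] = offset + len(results)   # continuation point
        state["timestamp"] = time.time()
        return serp, state

    # first search, then a "continuation" that picks up where the last one stopped
    serp1, st = meta_search("pdp-11 emulator")
    serp2, st = meta_search("pdp-11 emulator", state=st)
    print(json.dumps(st, indent=2))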
Wow, do you happen to have published your utility so that other people can play with it?
The problem is that (1) I am a minimalist and dislike lots of "features" and (2) I prefer extremely simple HTML that targets the links browser. Most users are probably using graphical, Javascript- and CSS-enabled browsers, so while this may work great for me, it may be of little interest to others who have higher aesthetic expectations. Another problem is I prefer to write tiny shell scripts and small programs in C that can be used in such scripts. To be interesting to a wider audience, I would likely have to re-write this in some popular language I do not care for.
If I see people on HN complain about how few results they get from search engines, then that could provide some motivation to publish. I am just not sure this is a problem for others besides me.
Many results I get from search engines are garbage. By creating a metaSERP with a much higher number of results overall, from a variety of sources, I believe I get a higher number of quality ones.
Well something like that would be interesting to a particular demographic. I prefer minimal aesthetic cruft as well, and like terminal stuff like links.
If you ever do decide to publish, be sure to post it here!
How much cash do you need?
maybe just add small webpages to your index, don't bother to execute JS and don't download any images.
The content quality will be higher and it's a lot cheaper.
Out of curiosity, why would not executing JavaScript or not downloading images equal higher content quality?
Why do you have a user account with a login?
Do you have some sort of PageRank?
how recent are your results? 1-2h? 1 day?
it's continually spidering. just not at a high rate. actually, back in the day i had real time updates while google was doing the 'google dance'. that caused quite a stir in the web dev community because people could see their pages in the index being updated in real time whereas google took up to 30 days to do it.
>Gigablast has teamed up with Imperial Family Companies
Associating with that crank (responsible for recent freenode drama) is very off-putting.
Oh no, you see he isn't responsible, it's everyone else! /s
I don't get it, what's the fuss here?
The guy who took over Freenode styles himself as the crown prince of korea; IFC is his company.
I'm sorry to say but your project is 20 years old and it has had no impact at all. You are doing something wrong. Innovation and initiative are needed, à la Bitcoin and DeFi, not hobby projects which are not picking up in popularity and utility.
Bitcoin and DeFi don't have utility outside of gambling and pump and dumps. Not everything (tbh not really anything) needs crypto.
Crypto’s biggest achievement is being the financial equivalent of the gulf war oil fires. Just massive pollution. Think of all the good things that computing could be used for… we used to have all kinds of interesting collaboration projects. Instead we are setting those CPU cycles on fire for short term profit.
Imagine if all that processing power was used for Folding@Home.
The problem is that cryptocurrencies do not inherently need tons of processing power to operate. You could theoretically run the entire Bitcoin network on a Raspberry Pi. But the PoW algorithm was designed to always produce a block every 10 minutes, no matter how much hashing power was dedicated to the network. Everyone wanted a piece of the block reward pie, so the arms race was created.
Proof-of-stake algorithms would eliminate this problem entirely, but PoS is a shitty "rich get richer" method. Granted, with how expensive mining power is, even PoW results in the rich getting richer, but at least it doesn't result in the wasting of gigawatts of electricity.
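For reference, the retargeting rule being described works roughly like this - a simplified sketch using Bitcoin's numbers (2016-block periods, 10-minute target, 4x clamp), with the real consensus details left out:

    # Minimal sketch of Bitcoin-style difficulty retargeting: every 2016 blocks
    # the difficulty is scaled so that, at the observed hashrate, blocks keep
    # coming roughly every 10 minutes. Adjustment is clamped to a factor of 4.
    TARGET_SPACING = 10 * 60          # seconds per block
    RETARGET_BLOCKS = 2016

    def retarget(old_difficulty, actual_timespan):
        expected = TARGET_SPACING * RETARGET_BLOCKS
        # clamp so one period can change difficulty by at most 4x either way
        actual = min(max(actual_timespan, expected // 4), expected * 4)
        return old_difficulty * expected / actual

    # if the last 2016 blocks took only 5 minutes each, difficulty doubles
    print(retarget(1.0, 5 * 60 * RETARGET_BLOCKS))    # -> 2.0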
> _Everyone wanted a piece of the block reward pie, so the arms race was created._
And that's intentional – getting people to pursue the goal for their own egoistic reasons, because that's bound to succeed. As a result, they all increase the security and stability of the network whether they want to or not; the only way to not do this is to not participate. If the network were running on a single Raspberry, someone bringing two Raspberries could effectively outcompete the other person on block rewards.
I'm not sure how this can be avoided without fundamental changes in society with regards to competition and adversity.
Read the Bitcoin whitepaper. Bitcoin was meant to decentralize trust and to eradicate fraud through a transparent decentralized database called a blockchain. It is certainly more impactful than hobby search engines, considering Bitcoin was also a hobby project, but a really revolutionary one.
Go search what Larry Page said 20 years ago: If innovation is commercially successful it can have more widespread impact.
So your response to the author saying "I'm trying to be commercially successful, but it's really hard for these reasons" is "You should try being commercially successful"?
Ok...
I respect his effort, but the project is 20 years old and yet not commercially successful? There must be a reason behind it. The project is not good enough. Like I said, only innovation can displace Google. Innovation is not something new and different; innovation is something better.
The bitcoin brainworms do bad things to people.
I suggest you update your patter some, though. A good coin scam needs to sound a lot less dated.
<div id=content style=padding-left:40px;>
</div id=box>
lmao, hopefully the C code isn't nearly as bad as your html
Please make your substantive points without snark or swipes. We ban accounts that do the latter, because it's poisonous to the culture we're trying to develop here.
If you wouldn't mind reviewing
https://news.ycombinator.com/newsguidelines.html
and taking the intended spirit of the site more to heart, we'd be grateful.
Probably been here longer than you, so really irrelevant. Anyway every single page of his site has html errors, not pointing it out is more poisonous than doing so.
HN users need to follow the site guidelines regardless of how long they've been here or how strongly they feel about HTML errors.
The consistent theme every time this comes up is that dealing with the sheer weight of the internet is almost impossible today. SEO spam is hard to fight and the index gets too heavy. However, I wonder if this is a sign that we're looking at the problem wrong.
What if instead of even _trying_ to index the entire web, we moved one step back towards the curated directories of the early web? Give users a search engine and indexer that they control and host. Allow them to "follow" domains (or any partial URLs, like subreddits) that they trust.
Make it so that you can configure how many hops it is allowed to take from those trusted sources, similar to LinkedIn's levels of connections. If I'm hosting on my laptop, I might set it at 1 step removed, but if I've got an S3 bucket for my index I might go as far as 3 or 4 steps removed.
There are further optimizations that you could do, such as having your instance _not_ index Wikipedia or Stack Overflow or whatever (instead using the built-in search and aggregating results).
I'm sure there are technical challenges I'm not thinking of, and this would absolutely be a tool that would best serve power users and programmers rather than average internet users. Such an engine wouldn't ever replace Google, but I'd think it would go a long way to making a better search engine for a single user's (or a certain subset of users') everyday web experience.
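A minimal sketch of the "N hops from sites I trust" idea: a breadth-first walk out from seed domains that stops following links after max_hops. The link graph here is a toy stand-in; a real crawler would fetch pages and extract hrefs.

    # Hop-limited crawl frontier, seeded with trusted domains.
    from collections import deque

    LINKS = {                                   # hypothetical web graph
        "blog.example": ["wiki.example", "forum.example"],
        "wiki.example": ["obscure.example"],
        "forum.example": [],
        "obscure.example": ["spam.example"],
    }

    def crawl_frontier(seeds, max_hops):
        seen = {s: 0 for s in seeds}
        queue = deque(seeds)
        while queue:
            site = queue.popleft()
            if seen[site] == max_hops:
                continue                         # don't follow links any further
            for neighbour in LINKS.get(site, []):
                if neighbour not in seen:
                    seen[neighbour] = seen[site] + 1
                    queue.append(neighbour)
        return seen                              # site -> hops from nearest seed

    print(crawl_frontier(["blog.example"], max_hops=2))
    # {'blog.example': 0, 'wiki.example': 1, 'forum.example': 1, 'obscure.example': 2}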
I agree, I think we are looking at the problem wrong. And this is a very insightful comparison with the linkedin levels of connections idea. I am working on something with this.
One thing to point out is that when we think of searching through information, we are searching through an information structure, aka a graph of knowledge. Whatever idea or search term we are thinking of is connected to a bunch of other ideas. All those connected ideas represent the search space, or the knowledge graph we are trying to parse.
So one way people have tried to approach this in the past is to make a predefined knowledge graph or an ontology around a domain. They try to set up the structure of how the information should be and then they fill in the data. The goal is to dynamically create an ontology. Idk if anyone has really figured this out. But Palantir with Foundry does something related. They sorta dynamically create an ontology on top of a company's data. This lets people find relationships between data and more easily search through their data. Check this out to learn more
https://sudonull.com/post/89367-Dynamic-ontology-As-Palantir...
This might work well in some situations (e.g. research, development), however it would also increase the effect of echo chambers I think.
Possibly, but I'm not convinced.
Google's not exactly working against the echo chamber problem, and I think that's because to do so would be to work against its own reason for existing. There are two goals here that are fundamentally at odds with each other:
1) Finding what you're looking for.
2) Finding a new perspective on something.
A search engine's job is to address the first challenge: finding something that the user is looking for. The search engine might end up serving both needs if they're looking for a new perspective on something, but if these two goals ever come into conflict with each other the search engine does (and I would argue it _should_) choose the first goal. Failing to do so will just lead to people ignoring the results.
Part of the thing with echo chambers is that the search terms themselves can be indicative of a particular bubble. For example, there's a difference between the people who refer to the Bureau of Alcohol, Tobacco, and Firearms by the official initialism, "ATF", and those who use "BATF". There's a strong anti-gun-control bent to the `"BATF" guns` query, compared to the `"ATF" guns` query.
If you're indexing forums or social media, the same site is going to give back the bubbled responses, possibly without the person even being aware they're in a bubble.
https://www.google.com/search?q=%22BATF%22+guns&client=safar...
https://www.google.com/search?q=%22ATF%22+guns&client=safari...
Kind of like when searching for "jew" on Google led to antisemitic websites, that's because jews usually prefer the term "jewish".
Interestingly, back then, Google was big on neutrality and refused to do anything, stating that it reflected the way people used the word. It was finally addressed using "Google bombing" techniques. Something that Google didn't care much about back then because of its low impact.
echo chambers are what most people want :)
The retro idea of curation seems popular here, but everybody forgets why it lost out in the first place: it just doesn't scale. Not to mention demand - people usually want tools which lower mental effort and are intuitive, as opposed to ones which are precise but obtuse. Most would not find a hardware mouse that consisted of two keypads for X and Y coordinates plus a left-click and right-click button very useful.
Similarly, everyone maintaining their own index is cumbersome overkill in redundancy, processing power, and human effort, in return for a stunted network graph which is worse on all the metrics people usually actually care about. In terms of catching on, even "antipattern search engines" that attempt to create an ideological echo chamber would probably do better.
Short of search engine experiments/startup attempts, the only other useful application I can see is "rude web-spidering" which deliberately disrespects all requests not to index pages left publicly accessible; search engines generally try not to be good tools for cracker wardriving, for PR and liability reasons. It would be a good whitehat or greyhat tool, as doors secured by politeness only are not secure.
I like the idea of a subset of the web, and for a niche purpose. Not sure about user-hosting.
Capital is the huge barrier to entry today:
Larry Page's genius was to extend google's tech, consumer-habit and PR barriers-to-entry into a capital-based advantage: massive geo server farms, giving faster responses. Consumers have a demonstrated huge preference for faster response.
I've often thought the Alexa.com top-n sites would be a good starting point.
I wonder if we could use some kind of federation (ActivityPub?) to build an aggregate of the search indexes of a curated community. Something like a giant federated whitelist of domains to index.
That's basically what I'm doing with my search "site:reddit.com". I wonder if anyone at Google is aware of this trend and taking notes.
I estimate that about half of my searches have either site:reddit.com or site:news.ycombinator.com at the end. In fact, I have an autocomplete snippet on my Mac so I don't have to type all that.
FYI this is exactly what the hashbangs in DDG do!
Reddit is missing a huge opportunity by not improving their crappy search functionality.
What if we allow users to upvote and downvote search results. Too many downvotes and you get dropped from the index.
Companies will simply hire people or purchase bots to downvote their competitors and upvote themselves, and then an entire economy will develop around gaming search engine algorithms, so that eventually search results will be completely useless.
Basically, SEO. SEO is the real problem, not search engine algorithms. Those algorithms are a result of the arms race between Google and black-hat SEO BS. Remove SEO and search engines work just fine.
What you are suggesting would make the echo chamber (bubble) problem worse than it is today!
Awkwardly, complaints about echo chambers as a problem tend not to refer to feedback dynamics (crudely but unambiguously referred to as a circle jerk) so much as "People disagree with me, the nerve of them!". It is not viable to have parties A through Z sharing the same world and all having absolute control over all the others. We see this same complaint every time moderation comes up, let alone the fundamentals of democracy.
Bubbles are great if you are on the outside looking in at how a specific group thinks. Bubbles are horrible if you are on the inside trying to explore your thoughts.
It's flawed from the get go if reddit is the basis.
As much as I like to hate on reddit (I'm a permanently suspended user), not every sub there is trash. There are some great subs there on very specific niche topics.
Badge of honour I'd say. What was your transgression?
Someone asked about the Hunter Biden files. I responded with g n e w s . c o m . It took a few weeks, but they finally found it and suspended me for it. Others they suspended for mentioning the news organization that mentioned gnews.
I'm a permanently suspended member too (permanent for technical reasons), and I have never posted on there.
I'm sure the algorithms are making echo chambers worse. Curating news opinion sites based on a prediction score of how often Chicken Little was right about the sky falling after the fact would surface reliable journalists and actual psychics!
Natural Language Processing is a pox on modern search engines. I suspect that Google et al. wanted to transform their product into an answer engine that powers voice assistants like Siri and just assumed everyone would naturally like the new way better. I can't stand how Google is always trying to guess what I want, rather than simply returning non-personalized results based solely on exactly what I typed in the textbox.
While that may be good for most people, there is still a lot of power and utility in simple keyword-driven searches. Sadly, it seems like every major search engine _has_ to follow Google's lead.
I think _some_ NLP is strictly beneficial for a search engine. You may think "grep for the web" sounds like a good idea, but let me tell you, having tried this, manually going through every permutation of plural forms of words and manually iterating the order of words to find a result is a chore and a half.
Like, instead of trying
    PDP11 emulator     PDP-11 emulator     "PDP 11" emulator
    PDP11 emulators    PDP-11 emulators    "PDP 11" emulators
    PDP11 emulation    PDP-11 emulation    "PDP 11" emulation
Basic NLP can do that a lot faster without introducing a lot of problems.
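To make that concrete, the permutation chore above can be automated with a few lines. This is a crude sketch, nothing like a real stemmer such as Porter's, and the emulator-to-emulation suffix rule is just for this example:

    # Crude query-variant expansion: separator variants of the term crossed
    # with a few mechanically derived noun forms.
    from itertools import product

    def variants(term, noun):
        joined = term.replace(" ", "")          # PDP11
        hyphen = term.replace(" ", "-")         # PDP-11
        quoted = f'"{term}"'                    # "PDP 11"
        nouns = {noun, noun + "s"}              # emulator, emulators
        if noun.endswith("or"):                 # emulator -> emulation
            nouns.add(noun[:-2] + "ion")
        return [f"{t} {n}" for t, n in product((joined, hyphen, quoted), nouns)]

    for q in variants("PDP 11", "emulator"):
        print(q)                                # the nine queries listed above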
I do think Google currently goes way overboard with the NLP. It often feels like the query parser is an adversary you need to outsmart to get to the good results, rather than something that's actually helpful. That's not a great vibe. However, I think the big problem isn't what they are doing, but how little control you have over the process.
I get that for general-purpose searches this is a good idea, but it would be nice if there was an easy way to disable this when you know you don't want it - for example, for most programming searches, if I type SomeAPINameHere the most relevant results will always be those that include my search term verbatim. I don't need Google to helpfully suggest "Did you mean Some API Name Here?", which will virtually always return lower-quality search results.
Early Google was a breath of fresh air compared to the stemming that its competitors at the time did, but nowadays even putting search terms in quotes doesn't seem to return the same quality of results for these types of queries that Google used to have.
I feel your pain. Two workarounds when Google gets it wrong are to put the term in quotation marks, or to enable Verbatim mode in the toolbelt. (I know various people have come up with ways to add "Google Verbatim" as a search engine option in their browser, or use a browser extension to make Verbatim enabled by default.)
Disclaimer: I work on Google search.
Both of these options are disappointing, in my experience. Verbatim mode seems weirdly broken sometimes (maybe it's overly strict), and quoting things is rarely enough to convince Google that you really want to search for exactly that thing and not some totally different thing that it considers to be a synonym.
One porridge is too hot and the other is too cold. I know Google could find a happy compromise here if it wanted to. In fact, I bet there's some internal-only hacked-together version that works this way and actually gives an acceptable user experience for the kind of people who have shown up to this thread to show their dissatisfaction.
Try this, go to Google and type in "eggzackly this".
Two results not containing "eggz" at all.
Two results containing "eggzackly<punctuation>this"
Two results containing "eggzackly" but missing "this".
Google Search is broken. It no longer does what it's directed, it just takes a guess. I suspect part of this is because someone decided that "no results found" was the worst possible result a search engine could give.
Googling that with the quotes, I get results containing "eggzackly this" ranked 3, 4, 6 (your comment) and 7, whereas the others contain just "eggzackly" (or with the 'this' preceded by punctuation as you mention).
Therefore I don't see how your last sentence is the explanation (there _are_ results). I've also happened to land on "no results found" sometimes with overly precise quoted queries (for coding errors mostly, IIRC). But it is annoying that it doesn't seem strictly enforced even when you want it to be.
Google does go way overboard with "NLP". Starting at least 5 years ago there was a trend toward "similar" matching and search result quality nose-dived.
You can search for, say, "cycling (insert product category here)" and get motorcycle related results. Why? Because to google "cycling" = "biking" and "motorcycles" are "bikes", bob's your uncle, now you're getting hits for motorcycle products.
Every time I try to do a very specific search I can see from the search results how google tries to "help", especially if the topic is esoteric. The pages actually about the esoteric thing I'm searching for get drowned in a sea of SEO'd bullshit about words/topics that are 1-2 degrees of separation away in a thesaurus. I'm sure someone at google is very, very proud of this because it increases their measure of search user satisfaction by X percent.
It does this thesaurus crap even with words in quotes, which is especially infuriating.
Yeah. It's one of those things that is invisible when it works and enraging when it doesn't. That's generally not a desirable failure mode. At the very least it should require extremely low failure rates to justify.
"Basic NLP can do that a lot faster without introducing a lot of problems."
This is called "stemming" and is not sensibly approached with machine learning.
Of course, but stemming is a fairly basic technique in NLP, as is POS-tagging. NLP is not machine learning.
Modern NLP basically is machine learning
You can still do NLP without machine learning though, and a lot of the sorts of computational linguistics a search engine needs for keyword extraction and query parsing doesn't require particularly fancy algorithms. What it needs is fast algorithms, and that's not something you're gonna get with ML.
Stemming is not meaningfully a natural language processing technique, any more than arithmetic is a technique of linear equations.
At the very least,
https://en.wikipedia.org/wiki/Natural_language_processing
seems to disagree.
(So do I: NLP does not have to be machine learning/AI based)
Is it not the processing of natural language?
Would you call addition a system of linear equations?
No, you don't use the college senior label for the highschool freshman topic. You use the smallest label that fits.
It's string processing.
NLP is actually understanding the language. Stemming is simple string matching.
Playing the technicality game to stretch fields to encompass everything you think even marginally related isn't being thorough or inclusive; it's being bloated, and losing track of the meaning of the term.
Splitting on spaces also isn't NLP.
Stemming is a task specific to a natural language. You can't run an English stemmer on French and get good results, for example.
All NLP is, strictly speaking, more or less elaborate string matching.
> Splitting on spaces also isn't NLP.
String splitting can be, but it's a bit borderline. I'll argue you're in NLP territory if it doesn't split "That FBI guy i.e. J. Edgar Hoover." into four "sentences".
> NLP is actually understanding the language.
That's actually not an accepted terminology. There's, indeed, this:
https://en.wikipedia.org/wiki/Natural-language_understanding
Not sure why you are so adamant that yours is the "true meaning", when NLP existed long before machine learning and AI were used for it. And even if not, every term can be defined differently, so it should be normal to have different institutions/people define NLP differently.
Semantic search requires NLP. So does the Q&A format the OP is complaining about. People conflate all things NLP to the latter, and forget about the former.
Maybe I'm not using the right qualifiers around the term NLP. The kind of NLP I was referring to is something like "Hey google, what is natural language processing?" and orienting the search around people asking questions in standard(ish) English like they would to another person.
That's known as Open Domain Question Answering[1] and is only a subset of NLP.
[1]
https://www.pinecone.io/learn/question-answering/
NLP is very heavily integrated into search, so I don't think it's really possible to decouple them. But I agree the whole BonziBuddy thing they've got going now is annoying and it's especially unfortunate how it's replaced the search functionality. I'd have a lot more patience with it if I could choose this functionality when I wanted to ask a question.
I doubt they assumed it was better. I expect they did a ton of user testing and found that it was better for most people. And I'm sure it is. HN users are very much a niche audience these days.
Right. Bing switched to this method as well, as did Facebook, Twitter, Amazon, and pretty much every other company that has the ML resources to do this. They obviously had a good reason to do so, beyond assumptions.
What’s a pox?
Saying X is a pox on Y means saying X is bad for Y.
It originates from the disease 'the pox'.
a disease or plague
Some people try:
Mojeek founder story here:
https://blog.mojeek.com/2021/03/to-track-or-not-to-track.htm...
No-tracking and independent from the start. Now at 4.6 billion pages with own infrastructure and IP. Went to market in 2020 with contextual ads and API. Self-disclosure: CEO
HN is wild: 30m after something is mentioned, the CEO chimes in.
Now, if we could just get that on the Facebook thread... ;)
Never heard of Mojeek. I will try it for a month and see how it works. Currently using DDG 99% of the time.
For a fully independent indexer, probably the best results I have seen so far. For me the minimum baseline is searching for "444", and if it doesn't return 444.hu as the first result, it's a no-go.
Do you use some sort of PageRank?
Yes, something conceptually similar to PageRank but our own thing which we call Gravity.
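Gravity's details aren't public, so purely for reference: textbook PageRank is a power iteration over the link graph with a damping factor (commonly 0.85). A minimal sketch:

    # Classic PageRank power iteration -- illustration only, not Mojeek's Gravity.
    def pagerank(links, d=0.85, iters=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = {p: (1 - d) / len(pages) for p in pages}
            for p, outs in links.items():
                share = rank[p] / len(outs) if outs else 0.0
                for q in outs:
                    new[q] += d * share        # each page passes rank to its outlinks
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy link graph
    print(pagerank(links))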
Mojeek is returning great results I'm not seeing from any other search engine!
Time for an
https://github.com/sindresorhus/awesome
search engines?
I'm probably the only person who doesn't think Google search has deteriorated. I play security CTFs, so a lot of times I have to search for peculiar technical details on various software. Also, like any other human being, I also make generic queries. In both cases, I feel like I almost always get to the desired webpage within the top few results.
It honestly depends on what you are searching for.
Case 1: You just want the name of a website or an article, e.g. "Facebook" -> fb.com, "Gordon Ramsay" -> wiki/official website/celeb gossip website; you are good. Not much competition here.
Case 2: You are looking for something technical like "GNU rnano CVE-abcde"/"OpenBSD ARM64 Qualcomm Wifi driver not working"; you are again in fine territory, not much if any money to be made here, so much less competition. There will be the official forums, websites, maybe some conference websites in this category.
Case 3: "Chicken potpie recipe", "How to be more organised": This is the category where people are trying to game the SEO algos. How the hell do recipe websites with 27 popups and a 12,000-word essay on the secret family history end up on top? There are a huge number of passionately made, simple recipe websites, but they have to be "found" by us. For the second query, about being more organised, I think most people are looking for some sort of review article which looks at various schools of thought regarding discipline and cleanliness, pointing to further resources and exploring the why and the what-to-do. Here the search engine needs to determine the context of the query, which is fairly abstract, and then the internal heuristics it uses are supposed to drive it to a meaningful list of websites. Maybe the average joe would like to click Cosmopolitan's article, but I would never do that. Based on my previous click history, maybe Google could determine what kind of links I am looking for. But once they figure that out, they'd much sooner use this behavioral insight for advertisers. A great search engine is basically a primitive personal librarian; I'd pay a yearly subscription for one.
The internet is vast and it has stuff that I don't know about. How my 7-word abstract query is gonna get me there is the question. Also, for a lot of queries the top results can be plagued by spammy/fraudulent results which are on top because they managed to trick the SEO algos. These bad actors were not as prevalent in 2005 Google's day.
Well no, it’s you and me and the whole google search quality evaluation team and everyone who works on google search and like 99% of the general public as well. The meme of falling search quality infects only HN. Mostly what people are complaining about is that the quality of the web itself is in free-fall.
It's always worked fine for me when it comes to finding simple things with a search. I think it's deteriorated in some ways though. I don't find the advanced search operators reliable anymore (eg. give me all the news about a topic published between certain dates) and I think it caps collections of things very early now, rather than returning the "billion" of results it says it has (eg. give me more than the 1000 most popular cat memes that I've seen before, or all the books about beaches).
Early 2000s google index ran in a garage. The current google index has dedicated power stations.
It's a bit like the car industry - you could run a startup from your garage in the early days but you need titanic amounts of capital to compete now thanks to vertical integration.
Major governments and billionaires can compete but everybody else is locked out of the market (most "startups" use bings index).
Google's datacenters are huge because they save user behavior data, not because their web search index is particularly big. Also, Google Search wastes a lot of resources on the "search as you type" feature.
Running a search engine in your garage is feasible today because hardware and connectivity have improved much faster than the size of the WWW.
It's the frequency of updates that chews power.
Also, that user data is used to improve search results and mitigate webspam that didn't exist in 2005.
I was thinking about exactly that. If they used a simpler index, would they be getting better results? There's not a lot of selective pressure, so they just keep adding to the index algorithm.
whatsapp was run out of a single cabinet.
I think Google was “better” from a users point of view in 2005 because it wasn’t that good at selling ads yet. I still remember the epiphany of the first time I used Google in 1999. It was amazing.
I’ve thought the same about pre-ad Twitter and Facebook.
Early on, startups with free services look a lot like non-profits and just maximize user benefit to grow. The problem is they’re not non-profits, and have to make money at some point. That has tended to mean ads.
I’d easily pay, say, $9/mo to have access to an ad-free search engine that made me feel the way 1999 Google did.
$9/mo is not going to cut it. Google's domestic annual revenue per user in 2019 was $256. [0] That's $21.33 per month. Not all of Google's revenue is from Ads, of course, but the vast majority is. (Let's ignore for now the valid counterpoint that Ads are increasingly served on other Google properties than Search.)
But even charging users $21.33/mo for an ad-free search experience most likely wouldn't be enough. By providing such an option, you'd greatly reduce the value of the remaining Ads pool.
The optimistic perspective on this is that if you are one of the users with disposable income, you're essentially subsidizing a great search engine and a suite of other tools for the less well-off ones.
[0]
https://miro.medium.com/max/6545/0*YTqXb-F5UiVhtlIS
Let’s say ads will always make more money (I have no reason to believe they won’t), and that’s required to be the dominant search engine because the web is big and expensive to organize.
I’d bet there’s some way to characterize what I and others liked about the earlier web and create a search engine that just worries about that stuff. I’d pay $9/mo for whatever 1/3 of Google’s spend per user would get me. That’s not to say this thing would “beat” Google, but it could profitably exist.
I doubt it, because 1/3 of Google's spend per user isn't enough when you can't attract many paying users in the first place, because you would charge much more than $9/mo, because almost no one wants to pay for a search engine so your revenue will have to make up for those people too, and then even fewer people are willing to pay more than $9/mo for 1/3 of the quality.
And then I'd guess the 20 remaining users will still complain because 1999 Google is a nostalgic memory impossible to recreate without a 1999 internet for a 1999 self to live in and has little to do with raw search quality.
The web has changed drastically. I’d imagine 2005-google engine today would be nothing but abandoned Wordpress blogs with comment spam.
I suspect this is exactly it—a lot of what made 2005-era-Google good wasn't necessarily Google's own doing. It was that SEO people hadn't yet fully figured out how to game the system yet.
If you took an exact copy of Google circa-2005 and had it crawl today's web, you'd probably get mostly "SEO optimized" irrelevant blogspam.
And even more copy-pasted spam than already exists.
The early Google (and other even earlier search engines) were invented for an Internet world which, if not pristine and pure, was at least mostly fairly legit content. Today's Internet is probably 90% deliberate spammers and scammers.
I think DuckDuckGo is closer to what you want. Same results for everyone, better privacy, and they're proactive about improving their results.
Part of the problem is that there's a lot more low-quality content to wade through now than there was in 2005. I think the Google of 2005 would have trouble delivering quality results today also.
> a lot more low-quality content
I wish there was an easy way to filter ALL search results, by permanently excluding specific websites, and/or keywords.
Surely there has to be some browser extension that does this...
Excluding, or penalizing for, advertising and trackers could do wonders against perverse incentives and SEO, IMO. It would also be a better experience for the reader.
https://news.ycombinator.com/item?id=29404860
Not got round to trying it yet though.
Great and it even supports iOS…!
Try searching for the same thing from your computer and your phone and you will get different results. Also, their results come from Bing so any improvement happens at Microsoft.
They do use Bing, but not solely Bing. DDG isn't just a frontend to a different search engine.
It's a bing frontend with a few special cases handled differently. For most queries, you get bing results. Easy to check by comparing results.
I see Russian sites from Yandex all the time on DDG.
This. DDG is my primary search engine now and has been for awhile.
I don't use Google anymore to search unless I really need to. The algos they use today are not the same classic ones that actually returned results.
Same. For the sorts of searching I do, anyway, the results I get from DDG tend to be better than what I get from Google. Google tries to infer what I want rather than take me at my word, and is very bad at it.
And if you really need to, DDG !bangs[0] make a search as simple as "!g mother google help me". The keyword thing is also available in Firefox as a browser feature, and elsewhere I'm sure, but nevertheless, it makes switching to DDG easier.
(Plus I can directly go to the wiki page by using "!w", "!gm" for google maps, etc.)
[0]
The only bang I use is !gvb since DDG doesn't support verbatim searches.
Is this the same as enclosing the terms in quotes and using the !g bang?
it's a google "verbatim" search. I don't know if enclosing each term separately in quotes does the same thing, but this is easier anyway.
I didn't know about the verbatim search. I'm going to give it a try, thanks
DuckDuckGo isn’t really a search engine, it’s a website that uses bing’s api.
Not just Bing, but nearly every search engine you've ever used
https://duckduckgo.com/bang?q=
Does DDG have any of its own organic results yet, or is it still entirely Bing/Yandex?
> I think the Google of 2005 would have trouble delivering quality results today also.
What would you attribute to their modern 2021 success then? Just throwing a ton of money at amazing engineers to hone in their complex algorithm to tweak it to still return what us humans quantify as "good" results? Especially if they are waning through a sea of low-quality content as you say.
both ddg and brave are bing (microsoft) in disguise.
This is not correct. Brave Search owns its own (growing) index and relies on third-parties like Bing for some fraction of the requests. Which is not the same thing as relying fully on Bing or third-parties for results like so many meta-search engines. More detailed answer here:
https://search.brave.com/help/independence
Edit: Forgot to say that I work on Brave Search.
brave 'falls back' to bing. which in my experience is most of the time. in fact, out of all the queries i did a while back, they all seemed to come directly from bing. is there a way to disable the reliance on bing and get pure 'brave only' results? and can you be more specific as to what this fraction is? do you blend at all?
You can check exactly which fraction of the results were fetched from Brave's index vs. third-parties using the "independence score" found in the setting drawers (opening can be done with the cog icon at the top right of any page on search.brave.com). There is there a global and personalized score of independence (respectively aggregated on all user's and for your queries only).
Explanation is also found here with screenshots:
https://search.brave.com/help/independence
So Brave is still dependent on Google and Bing it seems.
Also is this Brave's CEO:
https://www.bbc.com/news/technology-26868536
https://www.nytimes.com/2020/12/22/business/brave-brendan-ei...
?
"Brendan Eich's opposition to same-sex marriage cost him his job at Mozilla."
"Covid comments get a tech C.E.O. in hot water, again."
What independence percentage do you see when you click on the gear in upper right of the Brave Search results page?
I get 84% personal (browser-based), 87% global (which means we hit Bing only 13% of the time from our server side).
DDG never worked great for me, and it doesn't have its own index.
Brave Search has been my daily driver and it works wonderfully.
I'll give it a try, somehow I missed the announcement even though I'm a Brave user...
Please do it! Google is now complete trash.
Also Gmail: it used to have the best spam filters out there, now it's utter crap. Emails from my Google Analytics account, for whatever reason and disregarding how many times I have clicked "Not Spam", go to spam, and it's their own service; while messages that are textbook spam ("Hi, I just got some inheritance ...") go to my inbox.
AI (in its current state) is crap; when is the industry going to accept that these are the emperor's new clothes?
They do[0] but nobody cares anymore. Google controls web distribution through Google Chrome. I think we are at the point of no return. There won't be any competition anytime soon no matter what US government does. Only innovation can displace Google.
[0]
Marginalia is great to find blog posts, personal sites and other long form content, but it's not a replacement for Google nor intends to.
It does operate on a scale and principle fairly similar to early 2000s google, so the comparison isn't that far off, but yeah, it's quite some way before it's viable for general search. Dunno if I'll ever get there, but it does consistently seem to get better so who knows.
Isn't its similarity to early Google a side effect of the early Internet being text-heavy sites in the first place, rather than a similarity in the search engine? Unless I am misunderstanding your site's intent, even if you reach the dream engine you are trying to achieve, I won't be using it to search for answers to coding questions on SO, how-tos for car repair, sites to stream movies, the governmental page for X need, transcripts for earnings calls, etc.
In my experience it is better than Google at what it does if I'm looking for long-form texts (exception being scientific/peer-reviewed articles, Google tends to shoot me those for the type of queries I make on Marginalia), but is very much complementary rather than a replacement.
I guess it depends on what you are looking for on the Internet.
Right now the biggest problem with Marginalia is that it has a fairly uneven quality level. For some queries it's absolutely incredible. For others, it doesn't really provide many useful results at all. I do think it's possible to even that out a considerable bit, to make it more viable for general queries. It's never going to be able to answer every query, but it probably could answer a lot more than it does.
Basically I understand Marginalia's proposition as a search engine focused on retrieving text-heavy/long-form content. Unless I misunderstand its intent, that can't replace a generalist engine (nor does it have to), as not every search request will lend itself to long-form texts. I guess that's the only point I was going for (I do feel the old-Google sentiment has more to do with the state of the web than the engine, but am out of my league for a proper opinion), and it certainly wasn't a jab at it - I'm thankful for your neat website and will be looking forward to seeing it get even better over time! Maybe it is somewhat uneven, but it is nonetheless great at finding thoughtful pieces written on subjects XYZ and surfacing more obscure/personal websites.
But it is a good start and foundation for something bigger and better.
Funny. Marginalia has an option for No JavaScript, but I cannot even do an HTTP “POST” with JavaScript disabled in my web browser.
Disclaimer: I study malicious JS stuff.
Because we don't have the 2005 Web anymore. More to the point, SEO and Google have evolved together. To return even barely relevant results today you need to be _good_. That takes stellar talent, which costs huge amounts of money.
Thus, the Google of today, which is optimized to extract that money from us.
> To have barely relevant results today you need to be good
An easy way to become way better than google — detect google ads on pages, and penalize these pages in the index. For obvious reasons, google search is incapable of doing so.
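A rough sketch of how that detection could look: scan the fetched HTML for well-known ad-network markers and multiply the page's score by a penalty. The marker list and weights here are made up for illustration.

    # Penalize pages that carry known ad-network scripts/domains.
    AD_MARKERS = {
        "adsbygoogle": 0.5,
        "googlesyndication.com": 0.5,
        "doubleclick.net": 0.3,
        "taboola.com": 0.7,
    }

    def ad_penalty(html):
        penalty = 1.0
        for marker, factor in AD_MARKERS.items():
            if marker in html:
                penalty *= factor
        return penalty           # multiply the page's base score by this

    def adjusted_score(base_score, html):
        return base_score * ad_penalty(html)

    print(adjusted_score(10.0, '<script src="https://pagead2.googlesyndication.com/..."></script>'))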
Yes, I think you'd call it a Red Queen Problem:
“Here, you see, it takes all the running you can do to keep in the same place.”
-Lewis Carroll's Through the Looking Glass
But shouldn't all the blogspam be so hyperoptimized for Google's algorithm that it should be straightforward to detect and ignore/downrank it?
I have _read_ auto-generated pages almost to the end before realizing they were SEO spam. (I am not a native English speaker, though.)
With content copying, shuffling and AI generation, I am afraid we are on the cusp of auto content generators passing some restricted Turing test where readers really think an actual human wrote it.
As for me, I learnt that for certain "hot topics", simply doing a generic search on Google is not a good idea anymore.
Yeah, I do this with my search engine. Works pretty well. A complementary approach that works well is to look at where blogs written by humans link. Very few spam blogs get links from humans.
No because google's algorithm is not well known publicly. Also, if it was straightforward to detect then google could downrank it as well.
I wonder if you could evaluate a page using your own algorithm, which is probably not gamed as much as Google's (because who cares about your search engine?)
Then, check Google's ranking of the page. If it is much higher than it seems the page should be, assume the page is being SEO hyper-optimized and penalize the page proportionately.
Basically, using the variance between Google's model and your model as an indicator of an SEO spam page.
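As a sketch, that divergence signal could be as simple as comparing ranks and penalising proportionally to the gap (all numbers here are hypothetical):

    # If Google ranks a page far higher than our own (presumably less-gamed)
    # scorer would, treat the gap as an SEO-spam signal.
    def seo_suspicion(google_rank, own_rank):
        # rank 1 is best; positive gap means Google likes it much more than we do
        gap = own_rank - google_rank
        return max(gap, 0)

    def penalised_score(base_score, google_rank, own_rank, weight=0.05):
        return base_score / (1 + weight * seo_suspicion(google_rank, own_rank))

    # Google puts the page at #2, our scorer would put it at #40
    print(penalised_score(10.0, google_rank=2, own_rank=40))   # ~3.4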
The point is that SEO would just immediately adapt to Google's changes. If a smaller search engine filtered these out then it would likely stay under the radar.
you know that legitimate sites perform SEO as well, right?
SEO often seems to be a compensation for the fact that a site doesn't have particularly worthwhile content. So punishing SEO surprisingly does promote higher quality of search results.
Yes and no. A lot of those sites are small local businesses trying to get found. A front page listing can be the difference between surviving and going under. Much of the time the blog spam is what floats hours, contact info, and services provided to the first page.
Be that as it may, search ranking is a zero sum game. The unfair advantage SEO gives this particular struggling business means another goes under. I'd rather punish the guy trying to game the system than the one with enough principles not to.
The difference is far more likely to be in capability or expertise than principles.
Either way, capability for fuckery is not something I'd want to encourage.
Would you rather have a surgeon who knows how to kill you with a narrow slice to the right artery, or one who doesn't even know where your kidneys or appendix are located? Selecting for incompetence doesn't work well.
Eschewing SEO isn't incompetence, it's moral principle and good character. I'd much rather have a surgeon who doesn't moonlight harvesting organs from OD'ing junkies.
It's not that easy, they are optimized for many metrics..
I would use a search engine that only indexed Reddit, Stack Exchange, Wikipedia, and a small number of other sites.
And that specifically blocked Pinterest, Quora, most non-personal “blogs”, etc.
People suggest DDG ! operators, but I don’t want to use a site’s (bad, single-site) search box. I want a multi-site SERP that only displays results from known good sites, which are customizable.
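A minimal sketch of that customizable allow/block-list SERP filter, with fetch_results stubbed out and the domain lists as examples only:

    # Keep only results whose domain is on a user-editable allow list.
    from urllib.parse import urlparse

    ALLOW = {"reddit.com", "stackexchange.com", "stackoverflow.com", "wikipedia.org"}
    BLOCK = {"pinterest.com", "quora.com"}

    def domain(url):
        host = urlparse(url).hostname or ""
        return ".".join(host.split(".")[-2:])         # crude eTLD+1

    def keep(url):
        d = domain(url)
        return d in ALLOW and d not in BLOCK

    def fetch_results(query):
        """Stub: would query one or more upstream engines."""
        return [
            ("https://old.reddit.com/r/emulation/...", "PDP-11 emulators?"),
            ("https://www.pinterest.com/pin/123", "PDP-11 aesthetic"),
        ]

    def filtered_serp(query):
        return [(url, title) for url, title in fetch_results(query) if keep(url)]

    print(filtered_serp("pdp-11 emulator"))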
I've been thinking about this as well. As Google search results get increasingly worse, I find myself subconsciously filtering out all the garbage and gravitating towards a small number of known sites; and, as many other HNers do, I frequently sidestep this filtering step altogether by adding "reddit" to any search in which I'm seeking out real human sentiment.
I've done similar optimizations elsewhere to counter Google's trash results, e.g., I've been beefing up my personal recipe database, with the goal being that I can avoid a google search altogether whenever possible, only hitting google as a last resort.
More and more I wonder, with the modern internet, is it even a _feature_ that the whole web is indexed? Might be a bug.
If I could add sites I liked to the index that'd be great. Find a blogger/hacker I like? Add to the index. Can I share my index with others? Can I include their indices in my searches?
Search engine as a social media platform? If I follow you, now I can search in your indices?
Yacy might serve your needs well. It is a sort of distributed engine where users run their own index and "neighbors" share their indices with one another.
Too bad they whitelist which bots can access their sitemaps!
Even rules such as “if there is a Wikipedia result in the top 10, display it first”.
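A rule like that is just a post-processing pass over whatever result list comes back; a toy sketch (the result format is assumed):

    # Sketch: move the first Wikipedia result, if it's in the top 10, to position 1.
    def wikipedia_first(results):
        """results: ordered list of dicts with a 'host' key."""
        for i, r in enumerate(results[:10]):
            if r["host"].endswith("wikipedia.org"):
                return [r] + results[:i] + results[i + 1:]
        return results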
Brave's new search engine seems to work pretty well. Have been using it as my primary for about 10 days, and so far, I've only had to revert to Google once, and when I did the results were chock full of spam.
The nice thing about Brave Search is that they're trying to create an index completely independent from Bing/Google, and they seem to be trying to innovate on ways to get there as well with their Web Discovery Project[0], unlike DuckDuckGo. They've announced Brave Search will get ads soon, with a premium version without ads, which I think is acceptable given the costs of running an independent index sustainably.
[0]:
https://brave.com/privacy/browser/#web-discovery-project
Can echo this. About 30 days in, on all devices. I'd say about once a day I do a !g, and rarely do I actually find something there; it usually just ends up being a confirmation search.
We are building one [1], as are a few other people I am aware of, with different approaches and business models.
We also need to be aware that when we remember past times, it usually carries a romantic, nostalgic note. The web is very different from what it was 15 years ago, and the problem of search has evolved.
What you are looking for is basically 'grep for the web', but that is just one facet of search as we use it today. 15 years ago you would not get an instant answer to a question like you do today, and many users could not live without that now. There are also maps and location-based answers, and all sorts of widgets like translation. The world has also become more polarized, so an objectively best search result has become more difficult to produce, especially for events covered in the news, which means bias inevitably starts to creep in.
This is not to say that Google is good or bad today; it is what it is, and they are doing the best they can. Startups like ours see an opportunity in the market, in large part to help savvy users find what they want.
[1]
You might call this a search engine based on the principle of Information Neutrality.
“Information Neutrality is the principle to treat all information provided (by a service) equally. The information provided, after being processed by an information-neutral service, is the same for every user requesting it, independent of the user’s attributes, including, e.g., origin, history or personal preferences and independent of the financial or influential interest of the service provider, as well as independent of the timeliness of information."
I wrote about this in relation to search [0]. We need to be allowed more freedom to choose search engines and services. One (default or selected) choice for search is unhealthy. We shouldn't have to choose between Google or Bing; DuckDuckGo or Startpage; Brave or Ecosia; Mojeek or Gigablast ..... Personally I use all 8 of these and more, as also explained [0].
[0]
https://blog.mojeek.com/2021/09/multiple-choice-in-search.ht...
I'm with you (you run a great engine BTW) and I've considered the UX of some attempts to help users do this.
I like Firefox's UI when searching, where you can select the search engine of choice while typing a query.
I like customizable metasearch engines like searx, I think it is a phenomenal idea. I wish more niche engines would implement OpenSearch so that they could easily be added.
I have considered just making a simple web page with search boxes for multiple engines for personal use as a default home page, but there's friction and again, lots of engines I'd like to use don't implement OpenSearch.
I wonder if there's some novel UX approaches to this out there. Meta search engines seem to be the best way so far to do it but there's the problem of customizing ranking, relevance of results and the like that just compounds the problems users experience.
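Since OpenSearch came up a couple of times above: the description document an engine needs to publish is tiny. A minimal sketch that writes one out (the engine name and URL are placeholders):

    # Sketch: emit a minimal OpenSearch description document for a hypothetical engine.
    OPENSEARCH_XML = """<?xml version="1.0" encoding="UTF-8"?>
    <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
      <ShortName>Example Engine</ShortName>
      <Description>Search example-engine.test</Description>
      <Url type="text/html" template="https://example-engine.test/search?q={searchTerms}"/>
    </OpenSearchDescription>
    """

    with open("opensearch.xml", "w", encoding="utf-8") as f:
        f.write(OPENSEARCH_XML)
    # The site then references it from its HTML head with something like:
    # <link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml">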
I think what [some] people actually want isn't the Google of 2005 but to have a search engine where they don't feel like they're being manipulated.
I think a lot of people are ignoring the issue that the web has changed considerably since 2005. It is approximately 10 times larger in terms of number of websites and web pages. And a lot of it is SEO junk that is just designed for search engines to be easier to parse and show ads in your face.
Also, user preferences have changed in the last decade or so. I know millennials and users in their late 30s or early 40s still yearn for the old web, where they would type a search term and the correct results would astonish them. However, younger users tend to gravitate to videos, and that is why a large portion of Google results are now video results.
Cliqz wanted to build new search engine but failed. It's just too difficult to operate at that scale and break the existing monopoly of big G.
https://www.burda.com/en/news/cliqz-closes-areas-browser-and...
https://news.ycombinator.com/item?id=23031520
https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...
And then Brave bought them and it succeeded.
Cliqz is now Brave Search, I use it for all my devices, it's great.
Works better than DDG and sometimes better than Google.
I only use the !g bang every 100 searches or so; most of the time Google doesn't have it either. It's just to make sure.
What if we didn't try to replicate Google? Smaller and niche search engines would probably work better in this new world of vast information.
Does Gigablast ignore or downrate stuff on .info domains?
Seems to like
https://www.fiendishsudoku.com/
for "fiendish sudoku" search but doesn't know about
https://www.extremesudoku.info/
for "extreme sudoku" search.
Random thought, based especially on using DuckDuckGo for two years:
Search engine isn’t singular, it’s plural.
(1) Search engine for something I know exists.
(2) Search engine for finding something new.
There’s a market for both, but you don’t have to solve both problems with the same product.
Sometimes I switch to Google for the former, but the latter works well enough for me that I don’t care what else Google would’ve shown me.
More often than not, my feeling is Google would only have shown me more ads in addition to whatever I could already find elsewhere.
SEO wars are at least part of it. Google's algorithm has evolved over time not just to optimize advertising views/clicks and take over more screen space, but also to battle the constant gaming of their algorithm by SEO, which, once you eventually get to the real results, would surface less relevant, spammy, or scammy results if Google didn't constantly push back against the worst SEO abusers.
1) Google is better at AI, for example let's take this sloppy search: "some joke where you can't tell if it is serious or joke"
It is called Poe's law, and Google returned it at #4. Bing or Duckduckgo don't have a clue...
2) They have years of user data: for a specific term, they see what users clicked most, so they see which results were perceived as most relevant. It is hard to catch up if you don't have such data.
3) They developed anti-spam tools over years of fighting SEO spammers.
> Google is better at AI, for example let's take this sloppy search: "some joke where you can't tell if it is serious or joke"
My problem there is that I don't expect or want my search engine to do that. The counter case is when I remember a quote from an article and want to find the article. Old Google would help me find the matching text so I could quickly find the original article. Current Google will try to interpret the text and give me some nonsense based on that.
AI has ruined other Google features... the "search by image" feature now analyzes the image, returns a generic tag like "woman", and shows me the wikipedia article on women as the first result.
Old search by image had tineye like functionality and you could find the source of images.
> 1) Google is better at AI, for example let's take this sloppy search: "some joke where you can't tell if it is serious or joke"
> It is called Poe's law, and Google returned it at #4. Bing or Duckduckgo don't have a clue...
Interesting, I was looking for a good benchmark like this. For me Google returned it at #5 with an image/related terms carousel before it which places it physically more around #7 on the page. Brave Search (never tried it before today) puts Poe's Law at #8. So Google is still better.
But the other results are mostly worse (IMO) on Google. Here are the first 8 results:
- 175 Bad Jokes That You Can't Help But Laugh At - Reader's (rd.com)
- 57 Hilarious, Silly Jokes No One Is Too Old to Laugh At (bestlifeonline.com)
- 145 Best Dad Jokes That Will Have the Whole Family Laughing (countryliving.com)
- Sarcasm, Self-Deprecation, and Inside Jokes: A User's Guide (hbr.org)
- Poe's law - Wikipedia (wikipedia.org)
- Managing Conflict with Humor - HelpGuide.org (helpguide.org)
- 175 Bad Jokes That Are So Cringeworthy, You Can't ... - Parade (parade.com)
- Encouraging Your Child's Sense of Humor (for Parents) - Kids ... (kidshealth.org)
And here are the first 8 results from Brave Search:
- phrase requests - Is there a word for "pretending to joke when ... (english.stackexchange.com)
- Joke - Wikipedia (wikipedia.org)
- “Are you joking or serious?” – The Caffeinated Autistic (thecaffeinatedautistic.wordpress.com)
- How do I tell when people are joking or being serious? (reddit.com/r/socialskills)
- be a joke | meaning of be a joke in Longman Dictionary of (ldoceonline.com)
- Quote by Ricky Gervais: “If you can't joke about the most (goodreads.com)
- How can you tell if someone is joking with you or not? (quora.com)
- Poe's law - Wikipedia (wikipedia.org)
-----
edit: I did not count to 8 correctly the first time. Fixed that.
The Brave results though seem to contain “good sites” whereas the Google results are content mill blogspam. The exact placement of Poe’s Law is somewhat less important.
I agree. I switched to Brave Search after running this test.
While I feel that Google has become worse in the last couple of years, I'm pretty sure it is still better now than 15 years ago. Maybe it is just some kind of nostalgia?
the internet has changed, partially due to google's influence
instead of discussion forums and Q&A sites, everyone's on facebook/twitter/discord/slack/snapchat/tiktok/etc... none of that is really very google friendly
online marketing and SEO are a _much_ larger industry now, so with a smaller share of searchable content being generated by people (that content now lives on social media), a lot of the high-ranking content that appears in search is highly optimized marketing
then you have other kind of weird things like... half of all internet traffic being bots
The 2005 Google model only made sense in the 2005 internet. Google had the luck to become a search monopoly, and they quickly created Chrome to ensure that no one would ever switch away from Google search, so they could maintain the monopoly.
Now that Google exists, you can't create another one. There's only room for one.
Another thing is the rise of "content sites", like this one (Hacker News). I'm sure YCombinator doesn't like getting hit by dozens of crawlers. The impulse to ban everything that crawls except (Google|Bing|Baidu|VK) is too great.
A lot of alternative suggestions are being thrown into this discussion. Let me throw in mine: Reverse the concept of the "crawler". Instead of following links around the internet randomly, require sites to register with you and request to be crawled and/or submit a sitemap. It would be hard to get started, but once something like this gained momentum, I believe that there's room for several of these reverse-search-engines to compete.
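A minimal sketch of that reverse-crawler flow, under the assumption that site owners register a sitemap URL through some endpoint and nothing outside those sitemaps is ever fetched (the register() call and in-memory storage are invented for illustration):

    # Sketch: only crawl URLs that site owners explicitly registered via a sitemap.
    import urllib.request
    import xml.etree.ElementTree as ET

    registered_sitemaps = set()  # filled by a hypothetical registration endpoint

    def register(sitemap_url):
        registered_sitemaps.add(sitemap_url)

    def urls_to_crawl():
        ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
        for sitemap_url in registered_sitemaps:
            with urllib.request.urlopen(sitemap_url) as resp:
                tree = ET.parse(resp)
            for loc in tree.iter(ns + "loc"):
                yield loc.text  # nothing outside registered sitemaps is ever fetched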
I had a brief stab at this with
although it's Australia-specific to reduce the crawling and indexing requirements. Its main twist is that it runs entirely in AWS Lambda, meaning it costs nothing when it's not being used.
Lately I've noticed Google has just started ignoring search operators. Search results are missing terms in quotes and include terms with a leading - sign on them. It's like they've decided we're too stupid to know what we're looking for.
Yesterday there was a discussion[1] about it and someone suggested yandex.com. I've been using it since then and really love it. It's like going back to 2003, where everything was just plain and simple.
[1]:
https://news.ycombinator.com/item?id=29393467
Do not force me into autocomplete mode when I'm typing in my search terms. I don't care what anyone's "reasons" are for forcing me to put up with flashing, irrelevant bullshit when I'm searching for something. I don't care how "fast" it is.
Just let me type stuff into the search box -- including typo corrections and modifications to what I'm searching for -- and hit ENTER to start the actual search.
When I'm ready to start my search I'll hit the fucking ENTER key. Stop annoying me with your stupid assumptions about what I'm looking for.
This ONE THING is why I switched to Webcrawler.com two years ago. I type in five or ten words with ZERO craptastic guesses flashing around on my screen, hit ENTER, and THEN it returns what I'm looking for.
Even if Google dusted off their 2005 codebase and ran it on today's web it wouldn't come close to the results quality of Google in 2005. The SEO industry has been in an arms race with the search engines for 16 years. 2005 Google would be like a goldfish in a piranha tank.
Looks like millionshort.com (which I learned of on HN) died recently. For me, its results were more useful than most others' (even without the 'leave out the top nnn sources' feature). Hoping it was an experiment that will bear fruit.
Not sure if this is any close to what you’re trying to find, but there’s
https://github.com/benbusby/whoogle-search
I've been using kagi.com for a month or so now, and it consistently beats DDG and Ecosia for result quality. I'd guess it beats Google too, since last time I used Google it was nothing but ads and spam which is why I stopped.
Thank you for the vote of confidence! Better than Google is our goal, glad you perceive it that way.
You're welcome. I'm really impressed with it most of the time. Still not made it on to the Orion beta though ;)
No mention of DDG in the comments? Is there a reason I'm not seeing or it's just not the preferred alt-search on HN? Seems to have been working fine for me when I struggle to get past the funnels and content mills on Google.
DDG doesn't have their own index (they're getting their results from Bing) so not really relevant to this question.
I.. didn't know that. However, trying it just now in incognito I don't get the same results[0] (some different links, and most re-ordered). Is Duck repurposing Bing's results? I've tested with "how to get rich", a great bait for bad content (try it on Google without an adblocker, if you dare).
[0]:
I don't know what DDG is doing but I'm imagining that they send in the raw queries while you can't get around Bing's personalisation even in incognito. I get very similar results for "how to get rich", but only after setting "All regions" on DDG.
Bing:
1. How to Get Rich: 10 Things Wise and Rich People Do
2. 5 Ways to Get Rich - wikiHow
3. 16 Proven Ways On How To Get Rich Quick (2021 Edition) - TPS
4. How to Get Rich - NerdWallet
5. How to Get Rich: Follow our Step by Step Plan to Build ...
DDG:
1. How to Get Rich: 10 Things Wise and Rich People Do
2. 5 Ways to Get Rich - wikiHow
3. How to Get Rich - NerdWallet
4. 16 Proven Ways On How To Get Rich Quick (2021 Edition) - TPS
5. How to Get Rich: 8 Steps to Make Your First Million ...
It's no secret that DDG is using Bing so they're not trying to hide it. An easy way to verify it is to search for "what is my ip" on DDG and look for results where the IP number has been cached in the snippet, e.g.:
www.myipnumber.com
What is my IP number - my IP address - MyIpNumber.com
What is my IP Number? The IP Number of this machine is: 157.55.39.192. This number can also be represented as a 32-bit decimal number 2637637568, or as a 32-bit hexadecimal number 0x9D3727C0 . (Note that if you are part of an internal network then this is the IP number of your local server, the machine which is connected to the external ...
If you do an IP lookup on 157.55.39.192 you will see that it's in fact "Microsoft bingbot".
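You can reproduce that check with a plain reverse DNS lookup on the cached IP; a quick sketch using the address from the snippet above:

    # Sketch: reverse-DNS the IP cached in the DDG snippet to see whose crawler hit the page.
    import socket

    hostname, _, _ = socket.gethostbyaddr("157.55.39.192")
    print(hostname)  # should come back as a *.search.msn.com name, i.e. Microsoft's bingbot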
For me, DDG results are even worse than Google. It's set as my default and I'd say at least half of my searches in DDG generate completely useless results...pages of obviously SEO'd garbage.
DDG also doesn't support showing a site's basic structure in the search results (ie, the card of a company's website with Products, Contact Us, Support, etc) and the preview text is garbo as well...it reminds me of 1990's era electronic card catalog search excerpts.
I look at the first page or two, give up, search google. While I have to hunt a bit in the results, I do eventually get what I wanted.
Every time this comes up I'll see a few people talk about how the results aren't relevant, but that has not been my experience. I've been using DDG as my main search engine for a few years and never have to go beyond the first page. I'm really curious why that is.
My experience is like yours -- DDG is legitimately better than Google. My hypothesis is that it's related to how you construct searches. I expect Google probably does better if you learn how to talk to it, since it seems to want to interpret your query rather than take it literally.
My searches tend to be keyword-oriented rather than natural language. I think DDG does better with those.
I don't find the search results to be too relevant (at least for me; also a Spaniard here). It is my default search engine only for the bang commands.
I was looking for this as well! I use it daily and have for years. Love it.
I think you're being nostalgic for something you don't remember very well.
In that era, Google would return a match based on words that appear in the links to a URL but not in the article itself, meaning that it was easy to produce "Googlebombs". For example, from 2005-2007 the top hit for "miserable failure" was the Wikipedia article for George W. Bush.
See
https://www.screamingfrog.co.uk/google-bombs/
for some of the "better" ones.
Google does its job.
I hear HN constantly complaining about its deteriorating quality, but I am not noticing it that much; not better, not worse, it just does its job.
Recreating '05 Google would easily take billions of dollars and years of investment before people treat you seriously.
The reason we didn't get an '05 Google again can only be that it is not profitable. Some nation-state attempt to demonopolize the search engine business might work, but I don't expect any for-profit organization to easily attempt this, let alone individual hobbyists.
Everyone runs in the other direction anytime a search engine is mentioned. The thought of competing with Google turns people off.
Even in 2021, despite how bad it's become, it's still miles ahead of other competitors.
I disagree. A lot of people I know already switched to Duckduckgo. Google’s ability to get relevant results is dropping like a brick, while the quality of DDG has been improving slowly but steadily.
I wish I could agree but from my experience, DDG's search results aren't really that great. Often even worse than Google's.
And another private company is not the answer I believe. We need something more drastic, an open-source search engine organized as a genuine non-profit organization. Something like that. Otherwise, whatever replaces Google will just turn into another Google as soon as it gets any momentum.
I think open source will be tough because you're going to need a lot of saints to work on a search engine of Google's caliber.
Maybe an alternative revenue model instead of ads.
Consortium of universities, perhaps? Every top school (globally) kicks in some design and development time. It seems odd that the most critical link to access information on the planet is _not_ the product of academia. With a country’s skin in the global game, there may be better leverage to keep it free and open for their citizens.
I'd love a much simpler version of search engines: an engine that I can give a long list of websites to crawl, and to completely ignore the rest.
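That's essentially a crawler with an allowlist in front of it; a tiny sketch (the allowlist contents are obviously yours to supply):

    # Sketch: skip any URL whose host isn't on a personal allowlist before crawling it.
    from urllib.parse import urlparse

    allowlist = {"example-blog.org", "docs.example.com"}  # your long list of sites

    def should_crawl(url):
        host = urlparse(url).hostname or ""
        return host in allowlist or any(host.endswith("." + d) for d in allowlist)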
These guys [0] have built something really close to 2005-Google, and possibly slightly better.
The parent company, Tiscali, was a huge hit in the 1990s, as it provided internet access to millions of Italians. It went through some struggle for several years, but lately the original founder, Renato Soru, came back to run the company.
The company is based in Cagliari, the capital of Sardinia, Italy.
[0]:
Why don't you want personalized results? If I search for "subaru service" I want to find Austin Subaru, not Thorp Subaru in Cape Town.
Why didn't you just search "austin subaru service"? If you want a query narrowed down by location, that's your job to say so.
Sure, it feels great when the engine guesses something like that correctly -- but it comes out worse overall for the plentiful cases where you have to try to compensate for it guessing wrong.
Why should I have to do all that work? I want the machine to do it for me.
I can only think of examples where I want personalization. What's an example query where it interferes?
Amazing that the same site that thinks copilot will just generate programs for us also thinks it is literally a crime for a search engine to infer anything.
I pretty much hate "personalized" search recommendations. If I'm looking for something, it's usually not in relation to me but in relation to the world.
If I wanted something more relevant to me, then I would specify what aspect of relevance (country, gender, age, etc.) I'd like, instead of playing the guessing game.
> If I'm looking for something, it's usually not in relation to me but in relation to the world.
If that's true, then I don't think you are a typical search engine user.
The personalization should just be used for defaults. You can always make a more specific query to focus on aspects you are interested in.
Because today's web is full of walled gardens, and most content is going mobile, streaming, and SPA-rendered, which is no longer plain-text based.
What is your gripe with DuckDuckGo?
Well, Cuil had a lot of money and couldn't do it. I don't know how you quantify your assertions but I suspect that if you brought back 2005 Google it would be easily gamed and struggle to deal with social media sites where a lot of content people are looking for is now found.
I'd like to see a "just search" engine: all it does is search for a specific string, case-insensitively, across the entire web. No curation or anything, just sorted in lexicographical order, closest match first, maybe falling back to page age if there is more than one exact match. Perhaps give me some regular expressions as well.
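A toy version of that, assuming you already have pages stored with a first-seen date (the storage format here is made up, and I've used page age as the tie-break mentioned above):

    # Sketch: case-insensitive exact substring search over stored pages, oldest first.
    def just_search(pages, query):
        """pages: iterable of (url, first_seen_timestamp, text) tuples."""
        q = query.lower()
        hits = [(first_seen, url) for url, first_seen, text in pages if q in text.lower()]
        return [url for first_seen, url in sorted(hits)]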
That would be easily the worst search engine ever deployed. Imagine just returning all docs containing the word “bicycle” in chronological order. Useless.
For "Bicycle" it would suck but I don't often use search engines that way, for "High Timber ALX 29" you'd probably get something like this:
https://www.schwinnbikes.com/products/high-timber-alx-29?var...
I wouldn't use it for everything but sometimes that is the exact behavior that I want. I'd use duck duck go for more general searches.
That is the top hit on google for that search, so what’s your complaint?
Take a random part number off your car, or a portion of an error message, and try finding that. It's annoying to have to scroll down over a page or two of autogenerated SEO answers to get to something useful. The first result to appear on the internet is less likely to be SEO and more likely to be the manufacturer's documentation or the git commit that spawned your error. It isn't always, but that's why you have more than one search engine.
Secondarily, I think a search engine that is very simple in its model and operation is useful for more general free-speech purposes. If the major search engines decide they don't like a site like The Pirate Bay, a search for '"Pirate Bay" AND "Torrents"' on a search engine that does not curate could still get you there. I guess the point is that without curation you have to work harder to find what you want, but nobody is actively preventing you from finding anything. It would help keep everybody honest.
Maybe a “stability factor” could be calculated. Whereas earlier new content was king, I now value a stable long term source of information. So domain age + page age + content variability + dependency on ads. That might give more honest sources a go.
That's a good idea; I'd make it an option. Do you want newest first, oldest first, or by stability?
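A back-of-the-envelope version of that stability factor, with invented weights just to show the shape of it:

    # Sketch: combine longevity signals into one "stability" score (weights are arbitrary).
    def stability_score(domain_age_years, page_age_years, content_churn, ad_density):
        """content_churn and ad_density are assumed to be normalized to 0..1."""
        return (1.0 * domain_age_years
                + 0.5 * page_age_years
                - 2.0 * content_churn   # pages rewritten constantly score lower
                - 2.0 * ad_density)     # heavy dependency on ads scores lower

    # Results could then be offered sorted newest first, oldest first, or by stability.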
All big tech businesses at their core are monopolies. Once a significant field has been figured out, it is very difficult to compete with the market standard, unless they screw up so hard that THE AVERAGE user starts searching for an alternative.
Have a look at gpt-3 if you want to see what the future dominant search engine will be. It will not find relevant results, it will write it on the fly customized for exactly what you want to read. (Maybe products will just ship to your door and be auto paid because the future ad targeting AIs will know you so well.)
What if you are looking for something written by a human?
You can always go to a library or bookstore.
Let's imagine I want to talk to the author of the content. How can I do that if it's just a souped up markov chain?
The markov chain can also power a chat bot.
But then they would need to know that the person sending the email is the same person that read a specific article.
I feel like we are at a low point, or even losing the battle, between search engines and SEO spam. Maybe it is time for the Yahoo-style curated directory to return? We seem to be getting a microcosm of this with the awesome-* GitHub lists and Gemini with its near-nonexistent search.
I'd like to see categories like travel, science, history, art, etc. The web pages could pick which categories their page falls into using meta tags. The user has the option of selecting which category they are interested in searching within.
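A sketch of how that might look, assuming pages declare themselves with a meta tag (the tag name "search-category" is hypothetical) and the engine filters on the user's selection:

    # Sketch: pages self-declare categories, e.g.
    #   <meta name="search-category" content="travel, history">
    # and the engine keeps only results matching the category the user picked.
    def filter_by_category(results, wanted):
        """results: list of dicts with a 'categories' set parsed from that meta tag."""
        return [r for r in results if wanted in r["categories"]]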
I do like the idea that, instead of crawling and indexing, next-generation search will likely be more like a federated community search app that indexes the stuff members actually read. Google search isn't so much a repository as a consensus about what's important, hence why it's so politicized to the point of becoming unreliable, but also why it too is vulnerable to disruption.
Imo, 2005 Google got initial traction because of its tech forum post indexing; as I remember it, my switch was because it became an extension of, and then a replacement for, manpages. In that sense, what made it good was that it reflected the consensus of what its incredibly influential userbase thought was important, and it just managed that really well. The demographic impact of the U.S. Gen X all using it at once didn't hurt either.
The equivalent today, as a lot of us say, is that blockchains are in the 1997 internet phase, and the service that makes the content of those as navigable as the 90's internet, will likely grow in a similar way.
Search that provides young people with privacy and freedom to pursue their true interests will be the dominant strategy. Its success will be because it's a product that rides growth, and not because it "solved a problem." Imo, we all index too much on the privacy pattern because the freedom pattern is too risky.
What's changed since that time are the maturity of things like Bloom and other probabilistic filters, Apple's private set intersection, differential privacy, zksnarks, and everybody you'd ask an opinion from now gets their content through mobile devices. Apple's ecosystem is equipped to do this kind of search, but they're too exposed politically to get into it. Meta will likely go there, but nobody's going to trust them willingly.
A protocol that generated a cryptographically strong anonymous index from your browsing - and, instead of putting it on Google's servers, stored it on a chain, or included the content index information and its evolving consensus score in something like a DNS record - may still unseat these ensconced interests. IPFS and other P2P systems or torrents might do something like that as well. Blockchains may be good for that consensus/desire score.
It's not something you architect and design top down that has to solve all cases, it will be just another useful product that grows while riding a demographic change. It would be on the level of inventing HTML/HTTP again, which, when you think about it, was just another dude making a thing he needed.
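Hedging heavily, one way the probabilistic-filter piece could work: each member publishes a Bloom filter built from terms on pages they actually read, and peers test query terms against it without ever seeing the browsing history itself. A tiny sketch with arbitrary parameters:

    # Sketch: a small Bloom filter a member could publish for the pages they read.
    # Peers can ask "might this member's index contain this term?" without
    # receiving the history itself; false positives add a little deniability.
    import hashlib

    SIZE, HASHES = 1 << 16, 4

    def _positions(term):
        for i in range(HASHES):
            h = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % SIZE

    def add(bits, term):
        for p in _positions(term):
            bits[p] = 1

    def might_contain(bits, term):
        return all(bits[p] for p in _positions(term))

    bits = bytearray(SIZE)
    add(bits, "poe's law")
    print(might_contain(bits, "poe's law"), might_contain(bits, "bicycle"))  # True, almost certainly False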
is a new engine (and Orion Browser) which seems like what you're talking about. I've been using it some and like it so far. The browser is fantastic.
Ongoing related thread:
_Gigablast Search Engine_ -
https://news.ycombinator.com/item?id=29421898
- Dec 2021 (10 comments)
What I miss most of all from the Good Old Days was getting as many hits back as I could read.
Rather than being told "No, there are only eight pages of results on anything in the goddamned world. Really. Would I lie to you?"
I'd like a way of automatically filtering for websites that:
That would be a place to begin.
Almost all websites use JavaScript, including this one.
Wiby.me might work for you.
I’ve always wondered why you can’t use SEO optimizations for GOOGLE as a negative weight and penalize those pages.
For example if my search term appears in the URL I can almost guarantee I don’t want that page.
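As a thought experiment, a sketch of treating those on-page signals as negative weight (the signals and weights are invented; this is not any engine's real scoring):

    # Sketch: subtract a penalty for pages that look hyper-optimized for the exact query.
    def seo_downrank(base_score, url, title, query):
        q = query.lower()
        penalty = 0.0
        if q.replace(" ", "-") in url.lower():   # query stuffed into the URL slug
            penalty += 0.3
        if title.lower().count(q) > 1:           # query repeated in the title
            penalty += 0.2
        return base_score - penalty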
Why doesn't anyone create a search engine comparable to 2005-Google?
Because the universe being searched isn't the internet of 2005 and earlier, and because user expectations have moved on, too.
Plus the index expense.
I use
I've been having a lot of good luck with Lycos (yeah, that Lycos, from 1995!)... It never returns paywalled or "opinion"-based results (i.e. Medium).
Surely there must be some way to have distributed search compute a la folding/seti@home or those mersenne prime guys.
I'd gladly pool in some of my CPU time if it helps build a better search.
Thanks!
I don't know how Google was in 2005, but in ~2010 I was able to get a website to #1 with zero cash spent, just by manipulating PR. That doesn't seem great to me.
They have, sort of:
Can you also create a web comparable to the 2005 web?
Well, it's wikipedia. So just create a search engine for that, since their search sucks rocks.
Check out the dead internet theory. If most people browse 1% of the web, what's up with popular search engines?
Also, where are the books about writing a search engine?
Knuth's "Searching and Sorting" volume desperately needs an update.
I don't even know if anybody has written a book specifically about search at "web scale" (no MongoDB jokes here, please). But the closest things I know of would be something like:
https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...
https://www.amazon.com/Information-Retrieval-Implementing-Ev...
https://www.amazon.com/Introduction-Information-Retrieval-Ch...
Two major reasons: the costs to build and maintain it, and the manpower needed. Both are practically impossible to come by.
Because the money lies in modulating your product according to the whims of the highest bidders.
related 2 days ago:
_Ask HN: Has Google search become quantitatively worse?_
https://news.ycombinator.com/item?id=29392702
Inviting all the paranoid/speculative/hearsay/personal experience responses. Lame Ask HNs!!!!!
I am a non-dev, and Ecosia and DuckDuckGo are perfect for me. I haven't used Google in more than 3 years now.
Too many crappy websites; it probably needs a "committee" to whitelist domains (only good-quality ones), but that's probably too much work for not enough money, or it needs some monetization strategy.
This is how Private Search [1] works since it decouples the search from the user. This means nobody knows both who searched and what they searched for. This is a huge leap for privacy in search.
[1]
Looks like your comment here caused enough curiosity to take the service down.
Tried it but it just says: "Something went wrong. Please try again."
same here
It should be working now! Thanks for the heads up. There was a traffic issue.
Is it a proxy to other search engines or are they building their own?
It's a multi part partnership with Gigablast. Gigablast sees the searches, but not who searches. Private.sh sees who searches, but not what they search for.
And I work with rasengan on private.sh, so yes, there's some issue there. One of the back-end servers is returning a max-capacity error of sorts... we are checking into it.
Just tried it and it worked for me.
Isn’t that what DuckduckGo is?
DDG is pretty useless though unfortunately.
you.com supports many of the standard operators and has specific reddit, stackoverflow, MDN apps for developers.
Information-dense pages of yore have been replaced by really wordy, probably generated, SEO-optimized blog junk.
I seem to recall that Google consistently produced relevant results and strictly respected search operators in 2005 (?), unlike the modern Google.
You recall wrong.
Probably because you were a child and searching for "reddit".
Now as an adult, Google can't just hand you adult results by magic.
Search operators have changed, but that's because the internet is thousands of times bigger than it was in 2005, whereas the number of people online only went from ~1 billion to ~5 billion.
I think search results were the same for everyone, rather than being customized for each user.
You are not a baby; turn off the customisation. The same issue existed around 2005: Google customised, and we had to work to turn it off. Also, my idea that we were becoming one world was totally wrong. Google customising for my location was more correct than my idealism. It also helped local businesses get online. That Google is 'evil' by default is a shitty assumption.
What about a search engine that only indexed information and technology "alternative" sites, specifically to give you the results most likely to be purged or demoted from Google's results? Would be simple enough in scope and have a built in market and use.
the nazis don't really believe in markets, freedom, fairness, competition, democracy, or even capitalism for that matter, it's all just old school oligarchical authoritarianism