I came from a third world country and internet was pretty expensive. For some reason, my provider made Facebook completely free. So in my free college days I used the Facebook developer echo API to make an HTTP proxy so I could browse the internet for free. It was terrible: HTTP/1 only, so no WebSockets, videos stopped randomly, etc. But hey, I could read Reddit.
This is exactly the reason I love reading this site
What API is this? I am curious as to what this API was meant to be used for.
Same story here. Data was so expensive, and Airtel let users open airtel.in etc. at zero balance. We used to use all kinds of Opera and UC "handler mods" with custom HTTP headers like Host or X-Online-Host to fool the ISP. First on Nokia S40 and Symbian, and later on Android. Someone made a handler mod of the Psiphon VPN, and man, it was slow but so cool. And then Jio happened!
Any chance you could share that? I think some flights still don't charge for Facebook messenger, so HTTP over messenger might still be useful.
Less technically knowledgeable people would probably do something similar, using Facebook as an even slower (and lossier) "layer 8 proxy" as opposed to your "layer 7 proxy".
It would also be a decent anti-tracking mechanism.
May I ask which country?
There are a lot of countries where Facebook is "free", for example India and the Philippines.
I'm guessing that in these countries the cost of Internet traffic is dominated by the undersea cables at their borders.
Facebook has been paying to have new undersea cables laid. This is done as part of a consortium, but those cables only have 6-12 strands in them (the repeaters are bulky) so owning even just one whole strand of fiber in an undersea cable is still an obscene amount of bandwidth for a single company that isn't in the business of reselling bandwidth.
In The Philippines, my understanding is that they have ample bandwidth via Korea and other countries in the region. But the reason they have such expensive terrible internet is because of a lack of net neutrality and deregulation.
The cellphone duopoly sells "YouTube passes", that entitle you to get unthrottled YouTube for brief periods of time.
Net neutrality isn't related to Internet speeds. Good speeds are just driven by having competition.
Comcast was suddenly able to provide 1gbps for the same price as an 80mbps package when a fiber competitor entered the market.
Even with net neutrality, there is no incentive to make the internet better as an operator if you're operating in a government-granted monopoly/duopoly market.
Net neutrality eliminates the ability of an operator to discriminate and offer uncapped data or higher speed passes to _just_ YouTube.
I wonder if a proxy could be made to encode data as video to put in a YouTube livestream. You'd still need an uplink but the upload bandwidth usage is a fraction of the download one for typical Internet usage.
Yes, that is definitely possible and might make for a fun project. Use stego to make the livestream look innocuous and apply heavy ECC, so you resist censorship without arousing suspicion. I think this is the closest I've seen to a public implementation of that idea:
https://news.ycombinator.com/item?id=12166332
_You'd still need an uplink but the upload bandwidth usage is a fraction of the download one for typical Internet usage._
Perhaps the chat/comments (once again with heavy stego/encryption) would work?
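And for the downlink side, here's a minimal sketch of the frame-encoding idea (hypothetical; a real system would add framing, real ECC and the stego layer on top). Each bit becomes an 8x8 black/white block, which survives video compression far better than single pixels:

# Hypothetical sketch: pack bytes into a video frame as 8x8 black/white blocks.
import numpy as np

BLOCK = 8          # pixels per bit cell; big cells survive lossy compression
W, H = 1280, 720   # frame size

def bytes_to_frame(data: bytes) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    cols, rows = W // BLOCK, H // BLOCK
    assert len(bits) <= cols * rows, "payload too large for one frame"
    grid = np.zeros(cols * rows, dtype=np.uint8)
    grid[:len(bits)] = bits
    grid = grid.reshape(rows, cols)
    # expand each bit cell into a BLOCK x BLOCK patch of 0 or 255
    return np.kron(grid, np.ones((BLOCK, BLOCK), dtype=np.uint8)) * 255

def frame_to_bytes(frame: np.ndarray, n: int) -> bytes:
    # sample the centre of each cell and threshold back to bits
    cells = frame[BLOCK // 2::BLOCK, BLOCK // 2::BLOCK]
    bits = (cells > 127).astype(np.uint8).ravel()[:n * 8]
    return np.packbits(bits).tobytes()

payload = b"GET /r/all HTTP/1.1\r\nHost: reddit.com\r\n\r\n"
frame = bytes_to_frame(payload)
assert frame_to_bytes(frame, len(payload)) == payload

The decoder only needs whatever video the platform serves back, so the asymmetry works in your favour: even at this extremely conservative density, one 720p frame carries about 1.8 KB.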
Related and older idea:
https://en.wikipedia.org/wiki/ArVid
> the cost of Internet traffic is dominated by the undersea cables at their borders.
This is rarely the case. It's usually monopoly providers and/or people speaking historically when it _was_ expensive because it was rarer.
India and the Philippines both have more than adequate international bandwidth.
Free Basics was not allowed in India by the Telecom Regulatory Authority of India[0]. The list of countries where Free Basics is currently operated is listed on Internet.org website[1].
[0]
https://www.theverge.com/2016/2/8/10913398/free-basics-india...
[1]
https://info.internet.org/en/story/where-weve-launched/
Indeed, I must have confused it with Bangladesh, I visited both in the same month. Thanks for linking to the actual list.
You can't create your own link previewer; Cloudflare will put a captcha in front of every website. All I want is a freaking <title> tag. They don't seem eager to fix it either: their proposed solution is to contact every website owner (seriously) to ask them to whitelist you [1].
Frankly, I wish Facebook or Cloudflare offered their previewer as a free service, since most websites have them whitelisted.
1.
https://community.cloudflare.com/t/attention-required-messag...
I've long said Cloudflare is a dangerous threat to the open internet, as well as to some privacy tools like Tor.
But it doesn't always get much traction on here, because both the founder and employees of Cloudflare are quite popular users on HN. Some have given me brief, half-assed counter-answers that conveniently skip the harder questions, like a good PR person does (and which you seem to have gotten in your reply).
I hope every web admin gives it a serious second thought before adopting Cloudflare. Just as with cellphone OSes/operators, the one thing I'd dream of is a tool that offers a limited subset of what Cloudflare does (DDoS protection, a hosting privacy layer) but is pro-internet and pro-privacy. They seem hostile to that in many ways, likely because it directly affects their bottom line.
The bigger question is whether such a tool could be created without all the downsides. For the two features I listed, I think yes. But their web app security system is overly strict and bad for the internet, IMO.
And I say that knowing they protect some serious defenders of human rights and face a lot of abuse from the "bad guys". I just wish there were a better middle ground.
> But it doesn't always get much traction on here, because both the founder and employees of Cloudflare are quite popular users on HN.
I don't think it gets much traction because you're barking up the wrong tree. Also, suggesting that YC is out to silence you and that nobody actually has a counter argument isn't very good for traction, either.
Until my website can't get taken offline by a $5 rental of an internet-of-shit botnet, Cloudflare gives me and my users recourse against the bad actors of the world. (I also enjoy its host cloaking, for my privacy.)
You simply gloss over bad actors and attack one of the only solutions that works. The biggest threat to the open internet was its naive "there are no bad actors" design, not the people giving us one of the only bulwarks against bad design.
I agree with your last sentence that it would be nice to have a better middle ground, but notice that's not the "cloudflare bad" thesis of your comment.
The internet needs to be improved so that Cloudflare is redundant. It's not Cloudflare's fault that fundamental design oversights (like optional ISP egress filtering) have created a lucrative niche. And things like faster, unlimited data plans accessible to smart toasters and smart doorbells on top of the internet's naive architecture only entrench Cloudflare further.
I hosted a server over a Comcast connection that was attacked all the time, and I was always able to figure it out without Cloudflare's proxy blocking things for me.
Your tone suggests that perhaps you are one of those Cloudflare employees/fanboys who will drown out a warning.
The parent poster had a point, and your reply is reinforcing it.
Cloudflare even puts multiple captcha challenges for any request from the default browser on the Samsung S7 Edge. Granted it's an old phone at this point, and most users install Chrome on their phones, but I end up skipping a lot of websites on my phone rather than participate in furthering the misconception that "Chrome is the only browser".
_because both the founder and employees of cloudflare are quite popular users on HN._
It seems a lot more likely that people aren't finding your argument as convincing as you'd like. There are plenty of well-known users here (including ones who identify their employer) whose companies' HN-perception fortunes change quite a bit over time.
I normally skip sites that ask for a Cloudflare captcha if the site isn't too important. Luckily this is the case most of the time.
It would be annoying if online banking or government sites started asking for them.
Hey, try
if you want an ethical DDoS protection service.
Edit: my bad. Misinterpreted your comment.
Can you elaborate on how Tor is a threat to the open internet? That's a non-obvious statement to me. I'm aware that it's compromisable via controlling exit nodes (NSA, various nations) but that's not really the threat profile for the average person. Are there any other reasons?
Because despite its flaws, AFAIK Tor is an attempt to make the internet _more_ open to those who are being surveilled.
What am I missing?
I think OP is suggesting that Cloudflare is a threat to TOR, not that TOR is a threat to the internet.
Website owners can actually whitelist Tor traffic as a "country", but not a lot of them know/care/want to do that.
My read of it was that CloudFlare is a threat to both the open internet and TOR, not that TOR was also a threat to the open internet.
I read that as Cloudflare is a threat to tools like Tor
Any company through which a high percentage of web traffic is not just routed but fully reverse-proxied should of course be a significant concern and subject to extreme scrutiny. But why exactly do you think they're anti-internet and anti-privacy? To me it seems like being pro-internet and pro-privacy aligns with both their general incentives and their monetary incentives.
I genuinely think they're a net positive for, and supporter of, Tor users. Before, site owners and security providers who faced issues with abusive/malicious traffic behind Tor connections (spam, illicit content, security scanning, password stuffing) nearly always resorted to outright blocking all Tor exit node IPs, because they had no other feasible option. I've been in that position. Cloudflare at least gives any site owner the ability to easily allow the traffic, with just a fairly quick occasional bot check.
Additionally, as of 2018 they now have an "Onion Routing" option which site owners can enable, which results in Tor users being able to access your site 100% through the Tor network. As a result, Tor users no longer experience any captchas, load your site faster, and never have to touch the clearnet.
>But their web app security system is overly strict and bad for the internet IMO.
Their WAF seems to have a pretty low false positive rate, compared to others I've seen. (Though the flipside of that is it also has a pretty high false negative rate and isn't very helpful against a dedicated non-automated attacker, like many other WAFs.)
>But it doesn't always get much traction on here because both the founder and employees of Cloudflare are quite popular users on HN.
They do post a lot here, but I doubt that's really responsible for defensive responses from other HN users. The most common criticism I see here (presenting a captcha for people using Tor, which site owners can now disable) makes me think the majority of people making the criticism have never run large websites or worked infosec for any organization with a large website.
Tor is of course not a threat itself, but anecdotally I'd estimate 90 - 95% of traffic that the average website owner receives from Tor is highly abusive/malicious, and Cloudflare empirically estimated 94% as of 2016 (
https://blog.cloudflare.com/the-trouble-with-tor/
). And anecdotally, not only is a high percentage of Tor traffic malicious, in many cases a significant percentage of all malicious traffic is Tor traffic. Naturally, due to Tor by design making it impossible to distinguish the ~94% connections from the ~6%, it's extremely difficult to mitigate this without just blocking 100% of Tor traffic. This is obviously not Tor or anyone's fault; it's just a practical reality for website owners. This sort of situation will always be the case for any kind of robust privacy-protecting application.
Cloudflare is possibly the first free service that actually enables anyone to easily allow normal traffic from Tor without much increase in security/abuse risk. They seem explicitly pro-Tor, especially with the explicit Onion Routing feature that lets Tor users access your site 100% through the Tor network without ever experiencing captchas, and statements like in
https://blog.cloudflare.com/the-trouble-with-tor/
and
https://blog.cloudflare.com/cloudflare-onion-service/
One may certainly have lots of other justified, legitimate concerns regarding the company and their disproportionate control of a huge chunk of the internet and web, but I'm not sure how someone could read those, see how the traffic is handled in practice, and conclude they're anti-Tor or a dangerous threat to Tor.
And unfortunately, cloudflare is everywhere. This trend will make it even harder for projects like a new search engine to enter the game.
Because if you don't have it, some a-hole will go and DDoS your site, or you want to prevent a hug of death, because of reasons.
It seems a lot of issues happen because bad actors are continually allowed to thrive. For example: everybody uses a big provider because they're the only ones that have solved the spam issue.
Cloudflare could just allow a fair crawl rate instead of a captcha on the first request.
The problem is that bad actors can masquerade as a lot of independent clients (The first D in DDoS stands for "distributed").
Figuring out whether a site is under a DDoS attack or getting legitimate requests from many sources is a very hard problem, and can just be worded "telling good actors from bad actors" -- no simple solution works; also, who YOU consider a good actor and who the website owner considers a good actor may be at odds.
Most people (and Cloudflare by default) consider Facebook a good actor; but as far as I'm concerned, Facebook is as evil an actor as one can be.
> sources is a very hard problem
We're talking about virtually unknown blogs that get one HTTP request from my server's IP, which is not blacklisted anywhere. It's not hard at all; I just think Cloudflare's tech is not that good.
You're really pulling a "how hard could it really be??" to DDoS prevention?
You should at least be humbled by how few services can even offer DDoS protection that works against volumetric attacks and isn't just based on null-routing. The people with skin and money in the game might know something you don't.
Here's how simple it is:
if (!website.underDDoS && website.requestedTimesToday[ip] < 10) showCaptcha = 0;
How do you implement "website.underDDoS"?
Through a proxy, mind you: Cloudflare makes its decision without access to your CPU or DB metrics, and doesn't know which page load times are legitimately slow and which aren't supposed to be.
How about "hasn't had requests for the past 2 minutes"? Again, I'm talking about links to obscure blogs that barely anyone reads, let alone DDoSes.
I think another comment here may be closer to the truth, CF may only be running heuristics on the user agent
If hardly anyone reads or DDoSes them, why did they go to the trouble of setting up Cloudflare? It's free for those obscure blogs, but it's definitely a non-trivial hassle. Usually people set it up only after they experienced their first attack.
I get that you are upset Google gets to scrape them and you don't. But bad actors really are making it difficult for everyone to just "be" on the internet.
I don't know! But they do it; everyone does it because everyone else does it. It's not unusual.
I got around it by just making sure the user agent is set to the latest version of Chrome, rather than a version from a few years ago that I had hardcoded before. It seems Cloudflare's protection is pretty much "is your user agent in the top 10 user agents?".
Did you try that?
I have; IIRC it worked sometimes, but not always. Is it a reliable solution for you?
It's at least a 95% reliable solution, which seems to be about the same as a real user sees.
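Concretely it's nothing more than a header swap. A rough sketch in Python (the requests library is just one way to do it, and the UA string below is only an example; keep it current):

# Minimal sketch: fetch a page title while presenting a modern browser UA.
import re
import requests

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36")

def fetch_title(url):
    resp = requests.get(url, headers={"User-Agent": UA}, timeout=10)
    resp.raise_for_status()
    match = re.search(r"<title[^>]*>(.*?)</title>", resp.text,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

print(fetch_title("https://example.com/"))

Whether this keeps working is entirely up to whatever heuristics sit in front of the site, of course.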
Well if you have an easy solution that you think would work, why don't you put up a website, commission a DDOS attack from a skilled actor and try to demonstrate mitigation?
Companies pay big money to CloudFlare. If a simpler and cheaper solution is workable, they'll pay you instead.
Just like telling whether it's raining is easy but stopping rain once it has started is hard, the claim is that it's not hard to detect whether a site is being DDoSed.
I use Zoho.com and I rarely get spam, if ever.
Zoho isn't Google-size, but it isn't irrelevant, either. Sending mail from a self-hosted email server is far harder since the big providers might put it in spam or drop it even earlier.
To add to the sibling comment: running your own mail server is the only way to ensure your email is not read by someone else, which is so messed up.
> running your own mail server is the only way to ensure your email is not read by someone else
But any mail you send to someone else probably ends up read by Google/Microsoft anyway, since that's where their mailbox is.
Also, email security is a joke. It's 2020, and even TLS encrypted SMTP connections tend not to check for a valid certificate, making them trivial to MITM.
Practically speaking how does one MITM an SMTP connection? For example, from Google to Microsoft. They connect directly to the IP addresses they get from MX records + lookup. What's the actual threat vector/execution here?
Anyone with hardware on the network path can do it... Or anyone who can inject BGP routes can do it too.
I use it as well, and I get sooo much more spam than I get on Gmail.
At host.io we scrape every registered domain once a month and make the metadata available freely over an API. You could use that to get a title for a domain (although not for a URL that isn't the main domain), e.g.:
$ curl https://host.io/api/web/facebook.com?token=$TOKEN
{
  "domain": "facebook.com",
  "rank": 2,
  "url": "https://www.facebook.com/",
  "ip": "157.240.11.35",
  "date": "2020-08-26T17:39:17.981Z",
  "length": 160817,
  "encoding": "utf8",
  "copyright": "Facebook © 2020",
  "title": "Facebook - Log In or Sign Up",
  "description": "Create an account or log into Facebook. Connect with friends, family and other people you know. Share photos and videos, send messages and get updates.",
  "links": [
    "messenger.com",
    "oculus.com"
  ]
}
See the docs for more details about the API and what else you can do with it (e.g. finding backlinks to domains, domains with the same AdSense ID, etc.).
Long term, a new HTTP META method would be interesting. I wonder if something like that has ever been considered. Providers like Cloudflare would hopefully be more lenient with these requests.
Huh. It's certainly an interesting idea! Strictly speaking, individual people could implement this today, since nonstandard HTTP verbs don't break anything that doesn't know to request with them. (It wouldn't be of much use, because clients wouldn't know to use it, but still -- something that could easily be prototyped).
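It really can be prototyped in a few lines. A toy sketch (the META verb and the response shape here are entirely made up, just to show that nonstandard verbs work today):

# Hypothetical sketch of a server answering a nonstandard META verb with
# only the metadata a link previewer would want.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetaHandler(BaseHTTPRequestHandler):
    def do_META(self):  # BaseHTTPRequestHandler dispatches on do_<VERB>
        body = json.dumps({
            "title": "Example page",
            "description": "What a link preview would show",
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8080), MetaHandler).serve_forever()

Then "curl -X META http://localhost:8080/" returns just the preview JSON. The hard part isn't the mechanics, it's getting clients and intermediaries to agree on it.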
I don't think FAANG (or any other big players) would have much interest in making it happen in the standard, though, since it would undercut their big-player advantage.
Doesn't the oEmbed spec [1] already solve this? I think the OP could solve their problem by simply creating an oEmbed endpoint with all the necessary metadata.
[1]
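For a plain link the oEmbed response is tiny; something like this (illustrative values, following the oEmbed 1.0 fields):

{
  "version": "1.0",
  "type": "link",
  "title": "My Post Title",
  "provider_name": "example.com",
  "provider_url": "https://example.com/"
}

The catch is discovery: consumers find the endpoint via a <link rel="alternate" type="application/json+oembed" ...> tag in the page head, so unless they already know the endpoint they still have to fetch the HTML once.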
Yea but when your request to fetch the oembed data is blocked by a CAPTCHA...
This is a real problem, we experience it in the Fediverse
I wonder if "Accept: application/json" would be a reasonable alternative? Wasn't this supposed to be the point of content negotiation?
Maybe, but not really; this thread seems to be more about intent ("I just want a preview") while content type is more about representation ("I want the content as JSON"). I can imagine there are websites that actively use the Accept header to distinguish "regular visitors" and serve their APIs at the same paths (didn't Reddit do this at some point?), so your approach would break in those cases.
I guess what this is really about is, I hate to say it, something in the direction of the semantic web, where web servers (and in this case, Cloudflare et al.) actually gain a deeper understanding of the content they serve, and a web browser or crawler can query that content directly.
It seems to me that what "previews" really want is an API for the page's content in a structured format: OpenGraph tags and other microformats are one representation, but it's annoying to have to load _all_ the HTML just to grab title and the OG tags.
Accept: text/preview
In what content type? JSON? XML? HTML?
> Frankly, I wish Facebook or Cloudflare offered their previewer as a free service, since most websites have them whitelisted.
Yup, and exposing just a few key pieces of information (the title and some of the meta/OG tags) without the body would limit the potential for abuse, while still being fairly useful for legitimate uses.
There are hardly any "illegitimate" uses. The web is meant to be machine-readable (we wouldn't have Google or anything nearly as convenient in the first place if it wasn't). Whatever has been published is public and should not come with artificial limitations on how you read and process it. Blocking crawling should be outlawed, as it is clearly a monopolistic practice. E.g. I want to build my own crawler to index and categorize the subset of the web I choose, for myself. I believe this is a perfectly legitimate use. But they will probably try to stop me.
> Blocking crawling should be outlawed
That's overly broad. But maybe it should be illegal to have exceptions only for major monopolies.
Turn it around at least for a few minutes. Does a website operator _have to handle_ whatever arbitrary traffic you want to throw at them from your crawler?
_They're_ the ones choosing to use tech that's blocking you. Proposing to make it illegal for them to make that choice or to speak to you differently than they speak to other users of their site may give you some idea of the resistance you're likely to face to this proposal.
I don't get what value link previews add. Someone shares a link with me (on Skype, Slack, Teams... whatever) and I care about the content because the person sharing it thinks I could/should care about it; or someone shares a link on an aggregator, and then I don't think it's too much to ask for that someone to write a summary. If the link is worth sharing, writing one sentence to explain why isn't too much to ask.
What is the value a link preview adds? And why should I, as a content provider care about the value you add? Cloudflare does something for me, what is your service doing for me and why should I whitelist you (or care about you)?
They're sending you traffic.
Imagine Twitter or Facebook without link previews: it's much harder to use and overall reduces the chance I'll click on a link. Do you think only Twitter and Facebook should be allowed to publish previews?
Half the time the link preview picks the wrong picture and sometimes even the quote. Twitter and Facebook would both be improved by disabling it. Hell, it might even stop people from thinking they need a hero image for their 2 paragraph medium shitpost.
I'd place that blame on website owners. Both Facebook and Twitter are pretty open about where they read that info from, and an owner can pretty easily set those fields (it's just some <meta> tags in the <head> element).
They also have their own validators:
https://cards-dev.twitter.com/validator
and
https://developers.facebook.com/tools/debug/
The only issue I'm aware of is that Facebook's crawler breaks about every two months or so.
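For reference, the tags being read look roughly like this (values are illustrative; og:* is Open Graph, twitter:* is Twitter's card markup):

<head>
  <meta property="og:title" content="My Post Title">
  <meta property="og:description" content="One-sentence summary shown in the preview.">
  <meta property="og:image" content="https://example.com/preview.png">
  <meta property="og:url" content="https://example.com/my-post">
  <meta name="twitter:card" content="summary_large_image">
</head>

When og:image is missing, previewers tend to guess an image from the page, which is usually where the wrong-picture complaints come from.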
What meta tags do I have to fill in, and why is Twitter's/FB's preview suddenly my problem?
>
https://developer.twitter.com/en/docs/twitter-for-websites/c...
So I should have to include Twitter-specific meta tags even though I personally don't care about Twitter? Maybe Twitter should make it clear which tags they read? Maybe it's SEO bullshit I don't care about? Maybe even the og: tags don't work all the time and result in dumb previews?
If you don't want to fill them out, don't... Filling them out lets you customize your link preview on twitter. If you don't care about Twitter, why would this affect you at all?
They're used by instant messengers too: slack, iMessage, WhatsApp, telegram, signal, ...
> it's much harder to use
Actually it's easier to use, in that the preview doesn't take up screen real estate. Perhaps you mean the experience is less pleasant?
>They're sending you traffic.
Irrelevant traffic for every metric I care about.
>Imagine Twitter or Facebook without link preview
That's exactly what I'm saying. Either I care about what that person thinks might interest me or I don't. The link preview abstract is shit anyway. Does the site title and the 2 sentence abstract really sway you? If someone wants to send traffic my way, writing an interesting abstract is not too much to ask.
>it's much harder to use and overall reduces the change I'll click on a link
Maybe you should re-evaluate who you follow on Twitter? I frankly couldn't care less about Facebook.
>Do you think only Twitter and Facebook should be allowed publish previews?
I think previews are worthless regardless, I thought I made that clear. Either you care about me linking it to you or you do not.
*EDIT: And just for fun, here is the link preview stuff from my latest skype call with my brother:
Look at all the value those previews added.
When you paste a link on Reddit and it autocompletes the title.
Updating a bookmark title, or checking if it still exists.
Is it not self-evident that a link being crawlable is useful?
>when you paste a link on reddit and it autocompletes the title
Oh no, you have to copy/paste the title?
>update a bookmark title, or check if it exists.
I can access the site without a captcha, my browser can fetch the title.
>is it not self-evident that a link being crawlable is useful?
No, it is not. Maybe a site owner does not want crawlers to index the site?
Me being able to access the title and any HTML meta tags is not the same as some crawler being able to access them. It seems like your beef is with Cloudflare, and that is fine, but please state that that is your issue and don't try to frame it as something else. What I don't get is how everybody places the blame at Cloudflare's feet. It is my choice as a host to use Cloudflare and its protection features.
I'm not sure if you're being serious.
CF is so widespread that it breaks a significant part of the web for simple things like getting the page title. That's all. The End.
'I' can get the page title, though. That's all. The End.
I don't care about your crawler, or your ability to post the link to my site to Twitter/FB; and if I did, maybe I'd revise my Cloudflare settings.
The Cloudflare and Google captchas are terrible. It's so bad that at this point I just close the tab if they challenge me with one. I use Brave and always have Shields up; it seems having them up makes the captchas extremely difficult. Mission accomplished, I guess.
Is this the case for any web crawler?
Not sure, but it's a very common problem:
https://www.google.com/search?q=cloudflare+attention+require...
So this was a very long web page to say: Facebook forgot to rate limit their web scraper on a per user basis, but we told them and they fixed it.
Also it's not really a scraper. Just a JSON response with the website preview stuff like meta tags, <title/>, and other basic information which could be useful for some bots.
But it does not give you the whole HTML. Or anything close.
Exactly, they literally didn't do anything but file a Facebook bug report.
In the same way your comment is a very long way to say: something happened.
I've used Yahoo's YQL. While I would hit rate limits and other crap when trying to scrape data off some sites directly, YQL would give me nicely structured data without those stupid limits, as many sites don't see Yahoo's bot as a scraper.
That's pretty interesting: Facebook as a "web scale / hundreds of pages per second" batch web page summarizer. I imagine you could build a pretty decent general-purpose search engine that way... a free crawler.
As long as they are using Open Graph meta tags.
Why can't you just make your web crawler look like FB or Googlebot (via user agent)?
Do website owners actually check the ip?
Sounds like this company is checking that the IP is from Facebook. That would probably work on less secure sites, though.
In the best-case scenario, Google has a monopoly on scraping. Imagine trying to create a global search engine: how can you possibly even crawl sites that are behind Cloudflare or that only allow Google/FB/Bing bots?
Can you crawl Twitter in real time? Pretty sure they have a special deal with Google to ping it instantly on new tweets.
How many websites actually ping Google on new content?
And don't you dare scrape Google results. That's against their TOS! Rules for thee, not for me.
Isn't it weird there is no machine-readable API to Google search results?
I thought this is exactly how DuckDuckGo worked?
No, DuckDuckGo purchases search results from Bing:
https://azure.microsoft.com/en-us/services/cognitive-service...
You mean startpage?
This is so fucked. We've encrusted ourselves into this walled fiefdom, and there's no way to break free.
It's only going to get worse.
Chrome will be the only browser. AMP the only delivery mechanism. Video will require DRM. Eventually, text content will too. Binary blobs with no ad blocking.
Well, I think you're being overly alarmist, at least in the short term: DRM on Video has not really caught on, at least on-line; Non-Chrome browsers continue to have a significant share (mostly on Desktops); and ad blocking remains rather effective.
In the long run, I'm definitely worried: Capitalist economies tend to see a concentration of capital, generally and in most sectors individually. And this seems to be a real danger with computing technology. Coupled with mass surveillance and the pushing of people to have their personal information held by those large tech companies, a dystopia is not inconceivable.
PS - By AMP, do you mean Amazon Prime?
> DRM on Video has not really caught on, at least on-line
This seems like a weird statement. All of the paid streaming services use DRM on Video, so all major browsers include the requisite black-box DRM modules. I'm actually surprised YouTube has not added Widevine DRM for all videos yet, but I'm sure it'll happen if RIAA/MPAA get annoyed enough with youtube-dl and the like.
> PS - By AMP, do you mean Amazon Prime?
I think he means Google AMP[1], which is slowly infecting more of the top search results on Google.
[1]
https://developers.google.com/amp/
I've never heard of this AMP thing. Is it really that popular?
Valid point about paid streaming services - which I don't use.
AMP is incredibly popular. Every news site has enabled it. You have a 100% chance of seeing an AMP page in the top results for anything.
Google had two options: make websites faster the normal way (remove bloat), or make websites faster by introducing AMP. AMP is controlled by Google. What do you think they did? They said they would reduce a site's ranking if it didn't use AMP. Within weeks, everybody except Wikipedia was introducing AMP.
You don't crawl unless you have to; Twitter [1], Facebook, WordPress.com [2] and other big services have a firehose you can apply for and get real-time changes from. If you're crawling the web, you're probably doing it wrong or only servicing a particular niche.
[1]
https://developer.twitter.com/en/docs/twitter-api/v1/tweets/...
[2]
https://developer.wordpress.com/docs/firehose/
You missed the pricing. This is what you're doing wrong.
And you missed the cloudflare part too.
Someone who created a scraping API that site owners could embed in their projects and get paid for feeding crawlers their data could make billions.
Would just like to give an honorable mention to Google Translate, the most accessible HTTP proxy of all time. It's especially good for bypassing corporate access controls. I've used it many times for accessing solution threads on technical subreddits at work.
datadome.co is blocked for me:
_datadome.co is being blocked by AdGuard DNS filter, AdGuard Tracking Protection filter, EasyPrivacy, Goodbye Ads and oisd._
Dunno what they do, but it can't be good.
How is DataDome different from Cloudflare? The latter offers bot protection for free if you are already a Cloudflare customer
They're a French/EU company; that alone can be an advantage for some businesses.
Pretty sure they have an office in New York as well. Site shows POPs all over the map: datadome.co
My understanding is that it's more "advanced". Take that at face value, I've not used the service.
The website doesn't open for me.
Check your DNS; if you are running a Pi-hole or something similar, you will need to disable it or allow this site.
Enable spamware? Big red flag nokthx
interesting!