A few months ago, a fascinating snippet crossed my radar – for some stupid reason, as a screenshot, as usual, so it was difficult to identify the author or point at the origin, let alone quote precisely. Which is evil on par with sending screenshots by pasting them into Word documents, but apparently, lots and lots of people don't know better.
I did eventually find the source, though, so I'm linking straight to the original:
We've got this dimension right next to ours, that extends across the entire planet, and it is just brimming with nightmares. We have spambots, viruses, ransomware, this endless legion of malevolent entities that are blindly probing us for weaknesses, seeking only to corrupt, to thieve, to destroy.
Add onto that the corrupted ones themselves, humans who've abandoned morality and given up faces to hunt other people, jeering them, lashing out, seeing how easy it is to kill something you can't touch or see or smell. They'll corrupt anything they think could be a vessel for their message and they'll jabber madly at any who question them. Their chittering haunts every corner of the internet. They are not unlike the spambots in some ways.
Add on top of that the arcane magisters, who are forever working at the cracks between our world and the world we made. Some of them do it for fun, some of them do it for wealth, others do it for the power of nations unwise enough to trust them. There are mages who work to defend against this particular evil, but they are mad prophets, and their advice is almost never heeded, even by those who keep them as protection.
_[astercrash@tumblr]_
Which I filed away as something man was not meant to know (fhtagn!), until more recently, another piece crossed my radar, a little closer to home:
The TLDR of that idea is simple:
Large proportions of the supposedly human-produced content on the internet are actually generated by artificial intelligence networks in conjunction with paid secret media influencers in order to manufacture consumers for an increasing range of newly-normalised cultural products.
Now, every engineer who actually has an idea how things work behind the screen will tell you this is not true… _yet._ Never mind that whether influencing people in this manner is actually even possible is more of an open question than many people – the ones whose livelihood depends on the advertising industry, in particular – would like to admit.
But that reminded me of my old essay about the origins of the Internet as we know it now:
Pseudoscience: Transmutation of Water
Go read that first – I know you haven't yet, because nobody has, even though it's from 2014 – because the rest of this post will effectively be a continuation of that one, even if it is a continuation veering sharply to the side. I will certainly be referring to terms and ideas introduced therein.
Finished? Go on reading from here, then.
While the notion of an eldritch internet populated by unknowable hostile alien deities is emotionally appealing – I spend far more time babysitting my various networked devices trying to make sure nothing gets in than I would like, and yet I can't just take them off the network – it's more of an exaggeration. But there is an important qualitative shift that did occur at some point after the Social Network Age went into full swing.
In prior ages of the water cooler's development, the culture of the Net was driven almost exclusively by the people that made it up. In the Ancient World era, only educated – or in education – literate people with certain shared values were even admitted, and this defined the Net. By the time the Atomic Age rolled in, it was a much wider selection, and the character of the Net changed to reflect that. Still, it wasn't just anyone: the barriers to entry were higher than they are today.
And then, giant companies spent billions to remove the barriers almost entirely, driving the development of everything, from software to hardware, in an effort to increase their user base.
It's not that the Internet got worse, or even that people on the Internet got worse, not quite. It's that enough people in the general population were always _actually_ that bad, but when interaction is put in digital writing, the "people horizon" – the set of people one interacts with daily at a deeper level than simply seeing their faces¹ – greatly expands. As the set of people who had access to the Internet expanded, the population in it came to approximate the general population closer and closer, and today, there is hardly any social group that _isn't_ represented on the Net, except, possibly, very radical Luddites. And I'm not entirely sure about those either.
——
1. Reading what they say _is_ much deeper than seeing their faces, despite not seeing the actual faces, and you get a lot more information from skimming through a short sentence than from a single look at someone's face.
It's just that now you see all of them in focus equally.
But before a certain time, the only population of the Net was _people._ The Post-Information Age – I can't put a finger on the specific date I would say was its starting point,² but I'm calling it that – is marked by the emergence of _native fauna_ as a major player on the Net.
——
2. The "2010" that I want to say is rather arbitrary, and owes more to the round date in the calendar than to anything specific.
A portion of it is actually _megafauna,_ vast disjointed organisms, whose true breadth cannot be easily seen even by the people who have created them, and now they lurk in the depths of the water cooler that has become a seemingly endless ocean. And while it might be more technically correct to call them "flora," if only because what passes for their bodies doesn't actually move, the feel of that word is wrong. In fact, they tend to behave more like unusually active, vicious fungi.
Whether to call it _intelligent_ fauna is a matter of how you define the term "artificial intelligence," but it certainly has more complex behavior than the earliest [attempts at artificially imitating life], which date back to the 1950s.
Let me attempt to substantiate that claim, or at least explain what it is, precisely, that I mean.
———— ▶ ———— ▶ ———— ▶ ————
The increasing volume of users³ in large corporate-owned systems eventually necessitated relegating as many management decisions about them to automation as possible, so we have policing algorithms which determine whether you're allowed to keep your internet identity or not, what you are allowed to say and where, or whether an actual human will take notice of you and drag you to court. Since the user is not a client, but a commodity, fighting against such a policing decision is typically futile. I don't even need to cite examples: YouTube's Content ID and automatic copyright claiming system should be instantly recognizable to anyone who actually watches YouTube, and stories where someone lost their Google account with no recourse whatsoever and for no explicable reason were likewise widely enough publicized to reach anyone's ears.
——
3. While it is a widely quoted adage that "you're not the consumer, you're the product!" this is, strictly speaking, not true. If people were the product, mistreating the product would reflect on the business's reputation. People are a commodity required to manufacture the actual product, while the product being sold is their attention.
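Content ID itself is proprietary, and real systems of its kind work on perceptual fingerprints that survive re-encoding, not exact hashes. But the general shape of automated matching-and-claiming is easy to convey with a toy Python sketch, and the decision structure is the point: a match fraction goes in, a verdict comes out, and no human is ever consulted.

```python
# Toy illustration only -- NOT YouTube's algorithm. Real matching uses
# perceptual fingerprints robust to re-encoding; exact chunk hashing here
# merely demonstrates the unattended match-and-claim decision structure.
import hashlib

def chunk_hashes(data: bytes, size: int = 4096) -> set:
    """Hash fixed-size chunks of a work."""
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

def auto_claim(upload: bytes, registered: bytes, threshold: float = 0.3) -> bool:
    """True means the upload gets claimed. Nobody gets asked."""
    reference = chunk_hashes(registered)
    overlap = chunk_hashes(upload) & reference
    return len(overlap) / max(len(reference), 1) >= threshold
```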
This, all by itself, has a major influence on the Net culture, which ended up concentrating in a relatively small number of enclaves run by major corporations primarily due to the influence of Metcalfe's Law. But that's more about the behavior of people, which is not what I'm going to concentrate on today.
Right alongside people, a vast economic ecosystem exists, where advertising providers package and sell attention harvested from hypothetical users, who may or may not actually exist, while a huge number of parasitic robots rehash and generate content, attempting to acquire attention to sell thereby, or to produce fake attention, all of them clustering around the few gigantic systems that trade in it. I don't need to cite examples here either: advertising has been the core of economic activity on the Net since at least the dotcom crash.
As a side note, Internet ads are in some ways very much like torture.
I am not entirely sure when the idea of computer-generated, seemingly human content meant primarily for a search engine to consume got started, but I clearly remember seeing one of the earliest examples sometime between 1994 and 1996, as a piece of software meant more as a discussion topic than an actual thing to use. The idea behind it was that instead of seeking to delist something from the then-emerging search engines, you flood them with multiple versions of the same webpage that you progressively mutate – very similar, but each slightly distinct, edging further and further from the truth. It wasn't advanced enough to actually generate anything, but it didn't need to. This early example was remarkably prescient, but I have seen nothing quite so similar since, or at least, nothing meant for the same purpose.
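I no longer have the original program, so the Python sketch below is purely my reconstruction of the concept – every name in it is mine, nothing is recovered code. The mechanism is the whole idea: keep copying the page while randomly perturbing a few words per generation, so the copies drift ever further from the source.

```python
# A reconstruction of the "flood the index with mutated copies" concept.
# Purely illustrative; the word list and rates are arbitrary.
import random

FILLER = ["reportedly", "allegedly", "famously", "supposedly", "notably"]

def mutate(text: str, rate: float = 0.05) -> str:
    out = []
    for word in text.split():
        roll = random.random()
        if roll < rate / 2:
            out.append(random.choice(FILLER))          # replace a word outright
        elif roll < rate:
            out.extend([word, random.choice(FILLER)])  # insert noise after it
        else:
            out.append(word)
    return " ".join(out)

def generations(text: str, n: int = 10):
    """Yield n progressively mutated copies of the same page text."""
    for _ in range(n):
        text = mutate(text)
        yield text
```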
Well, out of a million random ideas, at least one absolutely had to end up remarkably prescient.
The robotic ecosystem was born to take advantage of that advertising economy. As people started working on getting more advertising income, that meant trying to get more visitors – and the primary source of those at the time was the search engines. An effective strategy was to concentrate on getting new visitors rather than keeping the ones you got, and in many cases it was the preferable one.
It was quickly established that the pages linked to more often – _cited_ more often, like scientific papers – are usually the ones worth reading. That's the fundamental idea of the PageRank algorithm, so the usual method of increasing your search engine ranking was getting other sites to link to yours. The obvious method of abuse here relied on flooding sites that allow people to post things with links to the targets whose ranking was meant to be inflated.
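That fundamental idea is small enough to fit in a toy sketch. Here is a minimal power-iteration PageRank in Python; the production algorithm layered many more signals on top even in the early days, but the "links are citations" core is just this:

```python
# Minimal power-iteration PageRank. links: dict of page -> pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages  # a dangling page spreads rank evenly
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Everyone links to B, so B ends up ranked highest.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["B"]}))
```

Which also makes the abuse obvious: every extra incoming link is a direct deposit into `rank`.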
By the early 2010s, this technique was dying out, as `nofollow` on links submitted by users became the norm, so something new was required, and variations on the idea that annoyed people less – the so-called "link farms" – proliferated. Link farms were purported "online directories" whose only purpose was to link to other sites and thereby inflate their search rankings; the pretense of normality quickly wore off when there were suddenly thousands of them, and Google announced that it would be ranking them into the ground. Instead, a site mentioned in an article similar in content to the site itself would now be ranked above a site simply linked to.
The earliest "content farms" – companies making it their business to generate large quantities of low-quality content specifically with an eye to inflating its perceived quality to search engines – seem to have emerged by 2009. Since their very nature is to fool search engines, researching their origins gets more difficult from the moment they come onto the scene. Never mind that unlike blogging platforms, they never had anything particularly _interesting_ on them, so I wouldn't just remember their names. By 2011, once again, Google was announcing that it was rehashing its ranking algorithms to rank them into the ground and promote sites with "higher quality content." Notice that "quality" was a parameter determined automatically, and the published guidelines on what counts as quality were as vague as they come.
Plagiarizing single popular pages and entire sites with as few changes as possible was the norm in this period,⁴ and could, in fact, be automated. A lot of that still goes on today, because it is the cheapest way to "produce" content, and large sites like Wikipedia and Stack Overflow are the major victims. Google, fighting back, started working out where the original source was and ranking the plagiarized copies down.
——
4. The reason the copyright notice here says what it does is precisely because back in 2011, one of my pages suddenly got popular and was plagiarized just like that.
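How exactly Google identifies the original is not public, but a standard technique for spotting near-copies – which I would assume figures in there somewhere – is shingling: reduce each page to its set of overlapping word n-grams and compare the sets. A minimal sketch, with an arbitrary threshold:

```python
# Near-duplicate detection via shingling; a common technique, not
# necessarily Google's. The threshold of 0.5 is an arbitrary choice.
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def looks_plagiarized(original: str, candidate: str,
                      threshold: float = 0.5) -> bool:
    return jaccard(shingles(original), shingles(candidate)) >= threshold
```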
Since the original content farms were actually manned by human writers, they could not effectively compete with the robot on the other end, a robot determined to make them unprofitable – so I presume they took to automation the moment effective methods were developed to employ it. Most of the details of how this was done, and how much exactly was done, were never publicized, and this conclusion is mostly supposition: I've been able to find interesting articles, purporting to be written by owners and creators of content farms, which indicate they were mostly out of the economy by 2016, so if any code still survives, I'm not getting at it. Ironically, the articles claimed they thought they were providing a legitimate service – to their workers if not the Internet as a whole – and increasing the breadth of human knowledge, by giving people opportunities to write on hyper-specific topics and paying them per view of their written content.
This back-and-forth continued for the entire 2010s. A task that is normally the job of a person – selecting things that people search for because they are actually interested in reading them – was relegated to automation, and further automation rose to counter it. Up to a point, it was still algorithms written by humans, many of them actually patented by Google, and attempts to exploit them also came primarily from other humans, supported by automated tooling whose properties are, at least in theory, well understood.
Along the way, this gradually changed.
One notable incident that most readers will have heard about at least tangentially is [Elsagate], the scandal arising from the discovery of numerous YouTube videos with disturbing content, clearly meant for children. It is not clear how much of the actual content itself was computer-generated – probably some, but not a lot – but what I'm fairly convinced of is that the _scripts_ for the videos and their metadata were indeed produced automatically, in an attempt to cheaply manufacture content that would rank highly in searches, and this was not something trivially simple like a template, or something rationally designed like an algorithm, but a neural network. The videos were also processed with as many steps automated away as possible, but that's just details.
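To make "produced automatically" concrete: no actual Elsagate code has leaked that I know of, so what follows is a hypothetical sketch of the observable pattern. The titles and tags of those videos read like recombinations of whatever tokens were ranking in children's searches that week, which takes all of a dozen lines to imitate:

```python
# Hypothetical sketch of keyword-salad metadata generation; the word
# lists are my own illustration, not recovered data from any channel.
import random

CHARACTERS = ["Elsa", "Spiderman", "Peppa Pig"]  # whatever ranks this week
ACTIONS = ["Learn Colors", "Finger Family", "Giant Surprise Eggs"]
HOOKS = ["for Kids", "Nursery Rhymes", "Compilation"]

def video_title() -> str:
    return " ".join([random.choice(CHARACTERS),
                     random.choice(ACTIONS),
                     random.choice(HOOKS)])

print([video_title() for _ in range(3)])
```

The neural network I suspect was actually involved would stand in for the hand-made word lists, not for the overall scheme.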
The entire business model was made economical by the large population of unattended children watching YouTube – with ads. Which was, in turn, made feasible by cheap tablets. While iPads and Surfaces were always marketed as premium devices, Android tablets – getting progressively cheaper through the 2010s, and typically coming with YouTube pre-installed – made it a truly mass phenomenon. And those tablets, in turn, were made possible by Google itself.
What the popular press usually failed to remark on is that children were just a very distinct and easily observable subset of the general population; the exact same techniques could be applied to the entire population, and were. The words "a monster beyond our control" in reference to this also start popping up around 2017, when the story garnered public attention – which is also the year the above-cited snippet describing the Internet as eldritch comes from.
This is also the period when the more legitimate content producers – news agencies especially – started revealing their own AI technology, like [Washington Post's Heliograf and ModBot]. Despite their respectable history, on the Net they were just like everyone else, and had to play by the same rules to compete. In 2019, [GPT-2] was revealed and, more importantly, open-sourced, bringing text generation to previously unprecedented heights.
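For a sense of how low the barrier has been ever since: with the open-sourced weights and Hugging Face's transformers library, generating passable filler text takes a few lines. The model name is the real one; the prompt is whatever your content farm happens to need:

```python
# GPT-2 text generation via the transformers library.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The best budget tablets of the year are",
                   max_new_tokens=50, num_return_sequences=1)
print(result[0]["generated_text"])
```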
When exactly someone had the bright idea that automation could be used to influence politics is also unclear. While the practice of using organized posters to engage in political discussion in the private-discussion-in-public that characterizes the water cooler was first described as early as 2003 – in Russia – I'm not certain it was actually real at the time. That was also the period of the rise of Russian broadband networks, when the scarce phone lines were no longer necessary to access the Net and the audience of the Net swelled again to include groups that previously weren't included, so the fears of this happening might predate the practical implementation. Never mind that the general concept in itself probably dates to Ancient Greece.
The earliest publicized cases with actual evidence behind them date to around 2013, also in Russia, and worked in the same manner as a manually operated content farm, with lots of humans being paid to generate content and butt in on otherwise private discussions that were happening in public. Judging by the clearly machine-generated Twitter login names, at least some automation is involved, and hundreds of otherwise unconnected accounts posting an identical Twitter message is very much not uncommon. I've yet to see any leaked code from an actual _bot,_ however, but a lot of "people" who appear to be at least semi-automated are readily apparent.
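Surfacing that "hundreds of accounts, one identical message" pattern does not take anything sophisticated, which makes the platforms' tolerance of it all the more telling. A crude sketch, assuming you have already scraped (account, text) pairs somehow:

```python
# Group accounts by a normalized message fingerprint; large groups of
# unconnected accounts posting identical text are the suspicious ones.
# The min_accounts cutoff is an arbitrary illustration.
import re
from collections import defaultdict

def fingerprint(text: str) -> str:
    # Strip URLs, @mentions and case so trivial variations still collide.
    text = re.sub(r"https?://\S+|@\w+", "", text.lower())
    return " ".join(text.split())

def suspicious_clusters(tweets, min_accounts: int = 50) -> dict:
    """tweets: iterable of (account, text) pairs."""
    groups = defaultdict(set)
    for account, text in tweets:
        groups[fingerprint(text)].add(account)
    return {t: accs for t, accs in groups.items() if len(accs) >= min_accounts}
```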
How much doing that actually influences readers is a matter of debate, and needs more in-depth research by someone who has the budget to do it. What it undoubtedly _does_ influence is the [perception of the current state of public opinion]. It also prevents any meaningful political discussion, which apparently suffices for their purposes, because the practice seems to be increasing despite more and more public incidents. The Russian government is certainly not alone in doing this, as most major governments have been implicated in similar operations at one point or another – it's just that the Russian government engages in this practice almost to the exclusion of everything else, and dedicates more effort to manipulating public opinion than any other, regardless of its actual effectiveness.
On smaller scales, the ratings on any platform that allows visitors to rate things have been similarly exploited, with varying success, for the entire decade, and I'm sure you have encountered an example at least once. A comment section consisting entirely of template-generated praise for the item being sold⁵ is pretty much the norm today, rather than the exception, and you're more likely to encounter one the further you drift away from specialist niches.
——
5. Or offered for free, for the cost of your attention converted to advertising, like an Android app.
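That template-generated praise is usually plain "spintax": nested {option|option} templates expanded at random. A minimal expander – the template itself is my made-up example:

```python
# Minimal spintax expander: repeatedly replace the innermost {a|b|c}
# group with a random choice until no groups remain.
import random
import re

GROUP = re.compile(r"\{([^{}]*)\}")

def spin(template: str) -> str:
    while True:
        match = GROUP.search(template)
        if not match:
            return template
        choice = random.choice(match.group(1).split("|"))
        template = template[:match.start()] + choice + template[match.end():]

print(spin("{Great|Awesome|Amazing} {product|item}, "
           "{fast|quick} {delivery|shipping}, {five stars|highly recommend}!"))
```

A couple of hundred such templates will fill a comment section with reviews that never quite repeat.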
This was a long, slow process, and 2016-2017 was just the time when it broke out onto the surface. While this era started much earlier, around that time it passed some kind of critical mass and started edging everything else out as the driving force of the Internet.
But _why then?_
To understand it better, I attempted to search for various means of automated content creation, and found what appear to be… automatically created sites, which spent a lot of time rambling on about what automated content creation means for my business, but precious little on how they, themselves, were created, or even where I would go to get some done for me. Further digging eventually revealed a lot of paid software and several services which I'm not inclined to spend money on just to investigate, but the important observation to make here is that just one piece of software sufficiently flexible to rehash a website in the total absence of human input would be enough to generate any number of them, perhaps even in real time. What made "cloud computing" novel back in 2006 wasn't the fact that it is someone else's computer disguised as an abstract object, but the fact that it's paid for by the clock, can be allocated dynamically and very quickly in response to changing circumstances, and all of this can be done automatically.
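To make "allocated dynamically and automatically" concrete: with an API like AWS's boto3, machines are one function call away, and the call can just as easily be made by a program as by a person. A hedged sketch – the image id is a placeholder, and nothing here is specific to any actual spam operation:

```python
# Allocating a fleet of machines programmatically; the AMI id below is a
# hypothetical placeholder, and the counts are arbitrary.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder image id
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=20,  # scale the operation up as far as the budget allows
)
print([instance["InstanceId"] for instance in response["Instances"]])
```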
Still, very little information on how all the innumerable examples of static spam flooding the search engines got created seems to have slipped out into open-source land for me to play with.
From this unavoidably cursory study, it appears that approaches analogous to Amazon's Mechanical Turk remain a major part of the whole endeavor,⁶ so nothing radically _new_ happened in 2017 in that area. Just like in the Elsagate story described above, it seems that automata primarily determine what to do, while the specifics they cannot handle themselves are delegated to what are, in effect, sweatshops. The introduction of effective text generators altered this equation, but not enough to completely displace people from it just yet.
——
6. Ironically, not Mechanical Turk itself, which seems to be in steep decline, with only ~2,000 workers cited as active out of 100,000 registered.
So what is the source of this qualitative shift?
You see, 2017 is _also_ the year Google search started getting progressively less useful.
Widely publicized complaints that it keeps getting worse start appearing around that time, and they get progressively more frequent. In fact, one of the highest-voted such complaints on Hacker News recently advises you to append `site:reddit.com` to the search query. This has the effect of limiting the search to pages on Reddit, which, being more tightly policed against the outright fauna than the Web at large,⁷ primarily contain content written by actual people. This is true in general of other platforms for user-generated content, provided they aren't complete fakes meant to create the illusion of such a platform existing, which has also been known to happen.
——
7. Reddit in particular has its own population of pet fauna, though.
But I still had nothing specific to point a finger at, up until I found this interesting statement:
Amit Singhal, who was Head of Search at Google until 2016, has always emphasized that Google will not use artificial intelligence for ranking search results. The reason he gave was that AI algorithms work like a black box and that it is infeasible to improve them incrementally.
Then in 2016, John Giannandrea, an AI expert, took over. Since then, Google has increasingly relied on AI algorithms, which seem to work well enough for main-stream search queries. For highly specific search queries made by power users, however, these algorithms often fail to deliver useful results. My guess is that it is technically very difficult to adapt these new AI algorithms so that they also work well for that type of search queries.
_[vincent_s@hackernews]_
John Giannandrea subsequently left Google in 2018, replaced by Ben Gomes, also an AI specialist, who was in turn replaced in 2020 by Prabhakar Raghavan. As for that latter gentleman, I'll just quote from an interview:
"This is the Google Duplex robot. We're curious whether you've changed your hours." And there's a little bit of dialog. "What about on Saturdays?" We captured that, and we made millions of updates.
That is, under his direction, Google, instead of indexing things, applied AI directly to obtain information from the primary source, to present it to the users all by itself. This was far from a new idea by the time he said it.
The only reasonable hypothesis I have is that it was this particular shift in strategy that resulted in the Net suddenly turning around and going straight for Leng at full speed.
Machine learning algorithms are typically used to adjust factors to optimize some metric or several, but there is no metric _available_ to Google that measures how satisfied the user was with the results of a search. Even if they wanted to optimize for satisfaction, they literally can't do it without explicitly asking, and asking such a question is generally not practical, because by definition, a web search implies leaving the search engine's pages once it concludes. What they do have a lot of, however, are engagement metrics – how long the user spent searching, and which of the results down the page the user eventually clicked on.
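The resulting mismatch fits in a toy function. The weights below are mine and purely illustrative – nobody outside Google knows the real ones – but optimizing any engagement proxy has the same structural problem: a result that answers the query instantly scores worse than one that keeps you busy.

```python
# Ranking by predicted engagement, because satisfaction is unobservable.
# The fields and weights are illustrative, not anyone's real formula.
def rank_results(results):
    def engagement(r):
        return r["click_rate"] * r["dwell_seconds"]
    return sorted(results, key=engagement, reverse=True)

results = [
    {"url": "exact-answer.example", "click_rate": 0.9, "dwell_seconds": 5},
    {"url": "content-farm.example", "click_rate": 0.4, "dwell_seconds": 240},
]
print(rank_results(results)[0]["url"])  # the content farm wins
```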
Never mind that from an economic point of view, it is simply unprofitable to optimize for user satisfaction while they're the dominant player in search. Is it a wonder it doesn't look like they do?
I couldn't tell you whether it was the actual amount of spam in the world that increased, or we just started seeing more of it at the expense of more useful information – it's nearly impossible to tell for certain. But _seeing_ more of it is indeed something that is happening. From a battle of algorithms guided by people trying, respectively, to create a better product and to take advantage of it, the standoff between Google and the spammers turned into peaceful coexistence between a black-box paperclip optimizer and various smaller predators, who were no longer in direct opposition to its goals – or at least, nowhere near the extent they were previously.
This strategic shift isn't just limited to Google, either, as all the major players announced similar changes at around the same time.
In 2016, Twitter introduced "machine learning algorithms" to personalize the user's feed in an effort to "boost engagement" – which, in an ad-based economy, is the preferable metric for everyone. Twitter was under pressure to start turning a profit, and this announcement was what placated the investors. While the feature had been in development well before that, this was when it turned from opt-in to opt-out.
While Facebook had used machine learning to optimize sponsored posts in feeds since at least 2012, and probably was the one major player that gave everyone else the idea, 2016 was the year they publicly described FBLearner Flow, a company-wide machine learning platform that they do everything with. To quote the post in which they did it:
With FBLearner Flow, AI is becoming an integral part of our engineering fabric and providing Facebook engineers with the power of state-of-the-art AI through simple API calls.
A lot of parallel developments in machine learning came to a head within two years of each other, abruptly reaching critical mass in 2016-2017, and were adopted by everyone nearly simultaneously. Where previously searches were optimized to show you what you asked for, they were now optimized to show you more advertisements, regardless of what you asked for.
The Gates of the Silver Key opened, and we beheld the depths of the water cooler where the eldritch things lurk.
While the Internet is still not _entirely_ flooded with automated fauna, it very much seems that the situation will only be getting worse. Most importantly, where before you could generally assume that any statement of fact found on the Net could be, if not _trusted,_ at least presumed with a reasonable level of confidence to have been made in good faith, this is very definitely no longer so. You never really had any way of knowing whether it was made by a person in the first place, but now you have good reason to believe it wasn't.
Which is the whole reason I call this "The Post-Information Age."
Most of the actually accessible information is now noise.
In my 2014 post, I worried that increased accessibility of permanently stored speech would lead to an increase in social conflict and/or persecution, because speech that is normally assumed to be "private enough" – not _explicitly_ private, but not attracting any attention either – is effectively no longer so.
Ironically, I was both wrong and right. This worry is still well-founded, but the problem is suddenly a lot less severe than I anticipated it would be by now, and it isn't even because someone suddenly came up with a solution.
While the conflict between public and private speech cannot be said to have been resolved, its worst effects have been put off indefinitely by the proliferation of eldritch fauna, which largely destroyed the "accessibility" part of the equation in just a few short years. Your words may still linger on the Net for decades for anyone to see, but if you don't remember where you left them, good luck finding them. Saying something on Twitter can still suddenly result in your tweet trending, which can have various consequences, but looking at Twitter, would you _actually_ want to say anything on there? Actually, what are the chances of your tweet getting read by a person _at all?_
The Internet feels dead, because, despite its population including the majority of humanity, what we _see_ of it is dominated by the native fauna.
Most people are still there, somewhere.
So when a Russian gets a prison sentence for having posted something mildly seditious a decade ago, that's not because that something was found in a search engine – it's because some _other_ Russian pointed it out to eager local authorities. Dedicated search engines can barely find _any_ private-as-public speech anymore, while major social networks never optimized themselves for searching for anything specific.
Meanwhile, we have mindless beasts to deal with _somehow,_ starting with Google and Facebook themselves. Preferably before the subprime attention crisis bubble pops and buries the entire tech sector with it. Eldritch they may well be, but only a few of them are actually vast and incomprehensible.
The rest are just fungi from Yuggoth.
As Rika liked to say, "Being bigger than you does not make one a god."
© 2001-2022 Eugene Medvedev. All rights reserved, not like that ever stopped anyone, or means anything when not backed up by a corporation.