💾 Archived View for tilde.club › ~winter › gemlog › 2024 › 4-02.gmi captured on 2024-08-25 at 00:23:39. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
The Internet May Not Be Big Enough for the LLMs
For Data-Guzzling AI Companies, the Internet is too Small
Via The Verge, it seems that the next iterations of ChatGPT (et al.) require more tokens, the basic units of data used to train the models that emit images, text, etc. In machine learning these are often words or word fragments (at least for text generation), and the usual go-to corpus, Common Crawl, apparently isn't enough. So companies are looking for alternatives: what if we created transcriptions of YouTube videos? What if we could generate synthetic data? Facebook touts its massive, closed platform as an advantage, a source of constantly-growing data to train on. And so on.
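For the curious: a token usually comes from something like byte-pair encoding, which starts from characters and repeatedly merges the most frequent adjacent pair. Here's a toy sketch of that idea in Python. This is an illustration of the general technique, not any particular model's tokenizer; the sample string and merge count are arbitrary.

```python
from collections import Counter

def most_common_pair(tokens):
    # Count every adjacent pair of tokens and return the most frequent one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    # Replace each occurrence of the adjacent pair with a single merged token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def bpe_tokenize(text, num_merges=10):
    # Start from single characters, then greedily merge the most
    # frequent adjacent pair, BPE-style, for num_merges rounds.
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_common_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens
```

Real tokenizers learn their merge table once over an enormous corpus and then reuse it, which is part of why the vocabulary and the training data are so tightly coupled.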
But hang on. The state of the art, trained on decades of scraped internet data, somehow isn't enough? What's going on?
What's going on is that companies are working with a small subset of Artificial Intelligence/machine learning techniques, and are just throwing massive amounts of data and computing power at the problem. What's happened over the last few years is an improved version of the neural networks that were popular in the 70s and 80s - better facilities for short-term memory - but which are otherwise just...larger.
The absolute state of the art requires, basically, all publicly-available data on the internet to generate text and images that have a particular scent and sheen. It's not good enough, but it'll get better, they promise, while at the same time admitting they need more data than perhaps is even available.
It's kind of crazy to think that the best we've got, using the in-vogue neural networks and transformers, trained on the largest dataset you can even think of, is only really good enough if you squint and lower your expectations.
Judea Pearl on Artificial Intelligence (Atlantic)
The smartest thing they've done is to call it Artificial Intelligence, call it AI. But where are the smarts? Even calling it "generative AI", which I've done in the past, ascribes more to it than is really there. Judea Pearl's seen this; as a probabilistic AI guy, who won the Turing Award for his major contributions to Bayesian networks (and specifically polytrees, a restricted class of Bayes nets on which exact inference runs in polynomial time), he's fiercely critical of what he sees as, at base, little more than pattern-matching, curve-fitting algorithms that, while impressive, don't exhibit anything like the intelligence AI researchers have chased for decades.
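For contrast with the brute-force scaling above, here's a toy illustration of why tree-structured networks are tractable. On a chain-shaped Bayes net of binary variables, a marginal can be computed by passing one message per edge - linear in the number of nodes - instead of summing over every joint assignment. This is my sketch of the general idea, not Pearl's actual algorithm, and the probabilities are made-up numbers.

```python
def forward_marginal(prior, transitions):
    """Compute P(last node = 1) on a chain X1 -> X2 -> ... -> Xn.

    prior: P(X1 = 1).
    transitions: one (P(child=1 | parent=0), P(child=1 | parent=1))
    pair per edge of the chain.
    """
    p = prior  # current belief that the node under consideration is 1
    for p_given_0, p_given_1 in transitions:
        # Sum out the parent:
        # P(child=1) = P(child=1|parent=0)*P(parent=0) + P(child=1|parent=1)*P(parent=1)
        p = p_given_0 * (1 - p) + p_given_1 * p
    return p
```

One pass down the chain does it: the loop runs once per edge, so a 100-node chain costs 99 multiplies-and-adds, where naive enumeration of the joint distribution would cost 2^100 terms. That gap is the whole point of the polytree restriction.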
And here we are, in the playground policed by capital. AI is now, at least in the popular imagination, a thing that generates text and images that have the shape of meaning, but not the substance. And while data scientists at Google etc. are looking for larger and larger (free [or stealable]) data sources, there's a big problem: their creations are being used to turn the web into a steaming pile of dogshit. In The Verge article above, Alex Cranz jokes that nobody should be training an LLM on his LiveJournal circa 2003. But that's the problem, isn't it? When the web opened up to corporations and people who wouldn't have given a shit years before, we were told to stop keeping journals, stop making websites. The great, open corpus began to shrink. The companies and consortiums would love to have another big, active, LiveJournal-like site full of user-created data to train on. Why do you think Reddit's been cutting deals for access to its APIs? Love it or hate it, it's used by a lot of people, a lot, every day. They talk about things they're interested in, write back and forth. It's one of a now-small number of sites that can be relied on for human-created text. In capitalist terms, it's a gold mine.
The Internet is Full of AI Dogshit
Since the release of ChatGPT, and Midjourney, and DALL-E (and everything else), we've seen a torrent of machine-generated text and images begin to take over the internet. Now we're at a point where the digital commons is poisoned, probably forever. As long as there's been a web, there's been a certain type of person looking to make a quick buck. The quicker the better. And this shit is _quick_. "Generate me a website that..." "Write me copy that..." Reddit feels like a shitty bar on an endless street full of boarded-up storefronts. But at least it's open, and there's beer on tap. I guess.
I'm not bullish on the future of the web as a medium for people. I think HTTP is good enough that it'll stick around as a way for apps to communicate. But the web as a thing for people, a place to create things, meet people? I don't know.
From "The Internet is Full of AI Dogshit", above:
The internet has been broken in a fundamental way. It is no longer a repository of people communicating with people; increasingly, it is just a series of machines communicating with machines.
A few years ago now, a fake conspiracy theory went around that the internet mostly consists of bots, apps, automatically-created content, all kinds of machine-generated traffic. When the idea was coined it was nonsense because of course people were using the web. We always had been, right?
It was a dumb idea, we laughed. Dead internet: the same idea as, a couple years later, the claim that birds aren't real but are in fact government surveillance drones. Conspiracy theories delighting in the absurd, the way we used to talk in false awe about the Flying Spaghetti Monster in days gone by.
But there we were, going to fewer and fewer unique sites, and typically interacting over our phones, whose affordances encourage us to type less, consume more. Forums seemed to die a while ago, though it was hard to say when; and so we found ourselves aimlessly scrolling reddit (and Twitter, and Instagram), looking for community in our new corporate towns.
We take the easiest approach, always, and reap the results.
It was an absurd idea until it wasn't. It was false, and then it was true; it was gradual, and then it was over. And now the very smart people who've been profiting off our personal data and turning us against each other are finding they have a problem, too: the thing they tell us we want, that they insist is inevitable, can't actually be sated by the data that's available.
In the past, this wouldn't have seemed to be such a problem: more and more people come online, make things, and these become available. Only, we've been encouraged for the better part of two decades not to make things and make them public. It'll be interesting to see how this is resolved, but if I'm making a dystopian guess, I expect that in a few years' time, it'll come to light that our phones have been secretly recording every single conversation we have, and that this too becomes text, and becomes part of the machine. I hope I'm wrong. I guess we'll see.