The Opaque Corpus

I came of age in the dot com boom: Pentium chips, Herman Miller chairs, the first wave of media-darling startups (who remembers pets.com?). It was an era before massive computational power was easily available, and long before "compute" entered the lexicon as a noun. AI was not in a particularly good place - after the promises of the 1970s failed to deliver the general AI its practitioners were sure was on the horizon, funding dried up. For a long, long time.

The state of the art in the 1970s was neural networks; after that, researchers turned to investigating other ideas, such as Bayesian networks, which work from probabilities and priors.

These investigations yielded success in some very early-internet ways, with naive Bayes classifiers showing incredible promise in spam filtering. Before Google clogged up the web with SEO spam (whether written by underpaid writers, or now wholesale by AI), email spam was a real plague. It's hard to convey just how bad it was then, except by pointing out how little spam we get in our inboxes now in comparison.
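
The idea behind those classifiers is simple enough to sketch. What follows is a toy illustration of my own, not code from any real filter: count how often each word appears in known spam and known ham, then score a new message by combining the per-word probabilities (summing logs, in practice) with each class's prior.

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    # Word frequencies per class become the likelihood estimates.
    spam_counts = Counter(w for doc in spam_docs for w in doc.split())
    ham_counts = Counter(w for doc in ham_docs for w in doc.split())
    return spam_counts, ham_counts, len(spam_docs), len(ham_docs)

def is_spam(message, spam_counts, ham_counts, n_spam, n_ham):
    # Naive Bayes: P(class | words) is proportional to
    # P(class) * product over words of P(word | class).
    log_spam = math.log(n_spam / (n_spam + n_ham))  # prior
    log_ham = math.log(n_ham / (n_spam + n_ham))
    total_spam = sum(spam_counts.values())
    total_ham = sum(ham_counts.values())
    vocab = len(set(spam_counts) | set(ham_counts))
    for word in message.split():
        # Laplace smoothing keeps unseen words from zeroing out the score.
        log_spam += math.log((spam_counts[word] + 1) / (total_spam + vocab))
        log_ham += math.log((ham_counts[word] + 1) / (total_ham + vocab))
    return log_spam > log_ham

model = train(["cheap pills now", "win money now"],
              ["lunch at noon?", "meeting notes attached"])
print(is_spam("cheap money now", *model))  # True: spammy words dominate
```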

My undergraduate thesis was in AI (specifically, a very niche corner to do with NP-complete problems). Most of what I knew is long forgotten, but I know this: it's incredible that in the years since, with all the advancements that have gone on, the state of the art isn't something new, but something old, turbocharged. Neural nets are back in vogue, thanks to the Large Language Models that use them as their basis.

Google confirms it's training Bard on scraped web data, too

What's changed? The scale: billions of parameters, instead of dozens, hundreds, or maybe thousands. Training sets consisting of huge chunks of the world wide web, plus every public domain book ever written, plus, lawsuits accuse, all sorts of things that are not public domain at all.

Sarah Silverman's ChatGPT Lawsuit Raises Big Questions for AI

The last fifteen years or so of tech have been a series of startups deciding that they were going to break the law and get too big to deal with. Uber and Lyft decimated taxis; Facebook and Instagram hoover up massive amounts of personal data and sell it off, and any fines they face are nothing in comparison to what the data actually brings in. Politicians rarely take action, and fines are built into the cost of doing business.

Copyright: Content on the Web

So it's with this in mind that I'm very interested in the lawsuit brought by comedian Sarah Silverman and two other authors against OpenAI (the company behind ChatGPT) and Meta Platforms (which owns LLaMA).

The last couple of years have been more of the same when it comes to tech and permission. The companies have built up the training sets for their generative AI using, as far as anyone can tell, every human-written work they can possibly get their hands on. And they've done so operating under the same assumption they always have: that what they were doing was fine, and permissible, because they were the ones doing it.

Silverman and the other plaintiffs claim that the models illegally include their copyrighted works, drawn from what are known as "shadow libraries": collections of copyrighted works that violate the copyright of most of the authors therein.

Among other things, the OpenAI suit notes that a request to ChatGPT to summarize Silverman’s book “The Bedwetter” returns a detailed summary of the book, and asserts that it wouldn’t be possible to provide a summary of that quality without having the full text in the training data.

For the last year, people have been in awe of what generative AI can do, and less has been written about the fact that to plausibly generate an image in the style of Escher, or a precise and faithful summary of Silverman's "The Bedwetter", the models must be trained on a sufficient number of examples to make this possible.

The companies and consortiums know they're in the wrong. Asked to provide their training data, they refuse. So now we have a lawsuit, not the first, and not the last. The plaintiffs have retained lawyers who have brought suits against GitHub (for Copilot, which can emit licensed code) and against Stability AI (for including copyrighted art in its training data without permission).

For the last couple of years, it's felt like we've been living in the fuck-around era of generative AI. People have seen what these models are capable of, and have been wowed; but among the people that actually, you know, make things for a living, there has been a range of emotions. Dread, certainly - is this going to put me out of work? - but also indignation, as they've started to realize just how these behemoths are trained. Up until now, it's been "however the companies want", but I suspect that will start to change. The best case scenario, for the companies involved, is a slap on the wrist, and maybe removing the works of certain people from the corpus. But the worst case is much more interesting: what if a court compelled them to disclose their entire datasets? That could trigger a massive wave of lawsuits, and perhaps restrict training data to public domain works and opt-in contributions from rights holders.

In an ideal world, it might put some consortiums out of business, too.

Not that I'm holding my breath. But in the war between the people who actually make things and those who seek to monetize work that isn't their own, the opening shots have been fired. Who knows how long this will take to resolve. Years, at least. A decade or more? But we're entering an era where copyright, like truth since 2016, is an increasingly wobbly thing.

gemlog