bombs in bottles
I'm reading OpenAI's Memorandum of Law in Support of Defendants' Motion to Dismiss in New York Times v. Microsoft - the case in which the NYT sued OpenAI, et al., for copyright infringement.
https://www.courtlistener.com/docket/68117049/52/the-new-york-times-company-v-microsoft-corporation/
OpenAI argues, inter alia, that its use of a bunch of NYT articles to train ChatGPT is "fair use":
Indeed, it has long been clear that the non-consumptive use of copyrighted material (like large language model training) is protected by fair use—a doctrine as important to the Times itself as it is to the American technology industry.
Okay. But what is OpenAI claiming the "product" is?
I've seen confusion and obfuscation of the definition of "product" in a lot of conversations about generative AI and copyright. So let's break this down.
In the use of any ChatGPT model, there are three elements: the model itself, the user's prompt, and the generative-AI response to that prompt. Any of these could reasonably be described as a "product."
One possibility is that the "product" is the generative AI model itself - GPT-3, GPT-4, GPT-4o, GPT-extrazestynachocheese, etc. Here, OpenAI has a viable argument. (I will not opine on their likelihood of winning; it's been years since I took copyright law, and as every attorney knows, judges be cray.) There is established precedent on which to argue that creating a "new" and "innovative" technology by using works under copyright is a form of fair use.
The user's prompt could also be considered a "product." Since individual users insert prompts, however, these can't really be said to belong to OpenAI. No one is arguing that OpenAI is responsible for what users type into ChatGPT, nor is anyone seriously arguing that prompts as such aren't "fair use." (If prompts aren't fair use, then Google search strings aren't either, and that conclusion seems entirely out of proportion.)
Okay. But what about the outputs of the model?
OpenAI's second potential "product" is the output ChatGPT generates from a given user prompt. Insofar as that output is based on a work under copyright, reproduces it, and fails to cite it, the output is plagiarism - and thus not "fair use." OpenAI could thus be off the hook for copyright infringement for training ChatGPT, but on the hook for the results ChatGPT produces when prompted.
Arguing that the model itself doesn't violate copyright doesn't help OpenAI much when the outputs the model generates do. Without those generated "products," OpenAI has no viable source of income from ChatGPT.
(One could argue, based on their financials, that OpenAI doesn't have a viable income source from ChatGPT even with its responses, but it definitely doesn't have one without those responses.)
An analogy floating around the AASL listserv can be instructive: Imagine you steal a workbook and make twenty photocopies of it, handing one to each of your 20 students, who then use the workbook.
Your students' use of the workbook may constitute "fair use" (here, for educational purposes). But the way you acquired and distributed the workbook definitely does not. The fact that your students' use is "fair use" doesn't absolve you from committing copyright infringement.
Similarly, individual users may be off the hook for infringement when they type prompts into ChatGPT. OpenAI may or may not be responsible for infringement for using that material to train ChatGPT. But OpenAI is still on the hook for the way it distributes material belonging to others.
And that's crucial, because without the ability to distribute that material, OpenAI has no future.