2021-07-01 AI coworkers and copyright

I haven’t seen GitHub Copilot in action, but it sounds weird. Maybe it’s a bit like code completion, except better? Perhaps it’s tailor-made for languages involving a lot of boilerplate code? I don’t know.

GitHub Copilot

It does raise the interesting question of copyright and neural nets. I was chatting with @renatoram about it. The reason I don’t think there will be an easy lawsuit for the Free Software Foundation (FSF) or the Software Freedom Conservancy is that I heard they are filtering out exact matches. Copilot will not be producing verbatim copies. So what about derived copies? Interesting question.

@renatoram
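To make the gap between verbatim and derived copies concrete, here is a minimal sketch of what an exact-match filter could look like. This is an illustration only: I don't know how GitHub actually implements its filter, and the names `normalize`, `is_verbatim_copy` and `min_chars` as well as the tiny corpus are made up for this example.

```python
import re
from typing import Iterable


def normalize(code: str) -> str:
    """Collapse all whitespace runs so formatting differences do not matter."""
    return re.sub(r"\s+", " ", code).strip()


def is_verbatim_copy(suggestion: str, corpus: Iterable[str], min_chars: int = 30) -> bool:
    """Return True if the normalized suggestion occurs inside any corpus file.

    Very short suggestions are never flagged, since a line like
    `return x` appears in countless unrelated programs.
    """
    needle = normalize(suggestion)
    if len(needle) < min_chars:
        return False
    return any(needle in normalize(source) for source in corpus)


# Hypothetical training corpus consisting of a single file.
corpus = [
    "def fast_inv_sqrt(number):\n"
    "    threehalfs = 1.5\n"
    "    x2 = number * 0.5\n"
    "    return x2 * threehalfs\n"
]

# Same statements, different indentation: still a verbatim copy.
copied = "threehalfs = 1.5\n        x2 = number * 0.5\n  return x2 * threehalfs"
# Same statements with the variables renamed: the filter no longer matches.
renamed = copied.replace("threehalfs", "k").replace("x2", "half")

print(is_verbatim_copy(copied, corpus))   # True
print(is_verbatim_copy(renamed, corpus))  # False
```

A filter of this kind only catches character-for-character matches. Renaming two variables is enough to slip past it, and that is exactly where the question of derived copies begins.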

Clean-room design is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design. It’s not strictly necessary, but it might help in court. What matters, I think, is how different the result is from the original. On the Wikipedia page on clean room design, there’s this comment on NEC Corp. v Intel Corp. (1990):

A US judge ruled that while the early, internal revisions of NEC’s microcode were indeed a copyright violation, the later one, which actually went into NEC’s product, although derived from the former, were sufficiently different from the Intel microcode it could be considered free of copyright violations. – Clean room design, on Wikipedia

Clean room design, on Wikipedia

My take is that we have come to the point where it’s difficult to differentiate between learning and copying. If you’re a musician, then you are learning from other pieces of music you hear, and then you write your own music, and you might end up improvising similar tunes, putting notes to paper that remind people of the original you heard. These cases are not clear-cut. We need human judges to decide these questions.

So what I think is going to happen is that some people will be very interested in arguing that the process by which Copilot produces code is transformative. They’ll argue that surely the authors of works used to train neural networks don’t share authorship in every permutation the network then produces. That way lies madness for the current industry, they’ll say. It’s important for the US economy for it to be legal!

Consider Megaface. It’s a public face recognition training dataset:

It includes 4,753,320 faces of 672,057 identities from 3,311,471 photos downloaded from 48,383 Flickr users’ photo albums. All photos included a Creative Commons licenses, but most were not licensed for commercial use. … to advance facial recognition technologies around the world by companies including Alibaba, Amazon, Google, CyberLink, IntelliVision, N-TechLab (FindFace.pro), Mitsubishi, Orion Star Technology, Philips, Samsung, SenseTime, Sogou, Tencent, and Vision Semantics to name only a few. – Megaface, on Exposing.ai

Megaface, on Exposing.ai

Now what?

Disgusting, that’s what.

And this brings me to another link I got via @bob: an article by Bradley M. Kuhn for the Software Freedom Conservancy, arguing that copyright assignments are important. The first problem is that for many contributors the default is that their employers own the copyrights of what they write. These companies are often not interested in enforcing their copyright. They’d have to go against their own clients! Of course they don’t want that. But even if you don’t have that problem, there still remains the problem of who is actually going to go to court in case the license is violated. You? Me? Probably not! That’s why it is important to explicitly assign our copyright to a charity that promises to enforce the license.

@bob

Ultimately, copyleft functions best when leaders of a project have the agency, wherewithal, commitment, and collective consensus to either ensure copyleft functions to protect their users’ rights, or enable another entity to do that job for them in a principled way that serves the public good (usually in contrast to the interests of for-profit industry). – It Matters Who Owns Your Copylefted Copyrights, by Bradley M. Kuhn, for the Software Freedom Conservancy

It Matters Who Owns Your Copylefted Copyrights, by Bradley M. Kuhn, for the Software Freedom Conservancy

Perhaps we should assign copyrights of our pictures to charities that will fight to enforce our licenses. And perhaps we might even want to add licensing terms that make the work of surveillance capitalists harder. Pictures of me and my loved ones, used by “Alibaba, Amazon, Google, CyberLink, IntelliVision, N-TechLab (FindFace.pro), Mitsubishi, Orion Star Technology, Philips, Samsung, SenseTime, Sogou, Tencent, and Vision Semantics to name only a few”? I shudder at the thought!

#Pictures #Programming #Copyright #AI

Comments

(Please contact me if you want to remove your comment.)

⁂

Copilot has been caught with verbatim copies. Aside from it being mentioned in the FAQ, see: https://mobile.twitter.com/mitsuhiko/status/1410886329924194309

https://mobile.twitter.com/mitsuhiko/status/1410886329924194309

– remyabel 2021-07-03 04:30 UTC

---

Is GitHub a derivative work of GPL'd software? by Drew DeVault

Is GitHub a derivative work of GPL'd software?

– Alex 2021-07-04 15:22 UTC

---

GitHub Copilot And Copyright, by Ariadna Vigo

GitHub Copilot And Copyright

– Alex 2021-07-05 05:22 UTC

---

GitHub Copilot is not infringing your copyright, by Julia Reda

GitHub Copilot is not infringing your copyright

– Alex 2021-07-05 14:04 UTC

---

Oh, GitHub is not the first!

“Tabnine is the world’s leading AI code completion tool, trusted by over 1 million developers in all programming languages.” – Code faster with AI completions

Code faster with AI completions

– Alex 2021-07-09 19:43 UTC

---

@jk writes:

@jk

Like an AI playing Chess or Go, perhaps it can ultimately produce output that works brilliantly but is in fact unintelligible — a disastrous mess of spaghetti code and an impossible black box for humans to unravel? – Artificial Programming

Artificial Programming

– Alex 2021-07-10 14:31 UTC

---

It is a complicated topic (licences always are). If there’s no distribution and the code is for private use only, whatever Copilot produces is likely to be fine (I don’t know all the licences), but if the result is distributed, how can you be sure? IMHO the problem here is more GitHub claiming that there’s no issue, because I guess they don’t want to say whether the generated snippet is covered by the original licence or not.

– jjm 2021-07-11 12:04 UTC

---

I think we’re in the middle of a new front in that endless fight. Microsoft is trying to push the boundary. This is why I appreciate art installations like this one:

Upload any song protected by copyright. Our Neural Network will “Learn” your song. You can then download what we “Learned.” Choose File (MP3/FLAC) – Fair use via Machine Learning.

Fair use via Machine Learning.

– Alex 2021-07-11 15:41 UTC

---

In this context I had an interesting conversation on Mastodon. It started with me quoting Bruce Schneier on decentralized autonomous organizations (DAOs).

“In 1996, US District Judge Marilyn Hall Patel ruled that computer code is a language, just like German or French, and that coded programs deserve First Amendment protection. That such code is also functional, instructing a computer to do something, was irrelevant to its expressive capabilities, according to Patel’s ruling.” – Regulating DAOs

Regulating DAOs

It does make code sound like a magic language.

@wim_v12e said:

@wim_v12e

I think many programs work on two levels: the language to express what you are writing, and the underlying language of execution. The former is mainly the naming of functions and variables, the other is what the abstract syntax reflects.
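Wim’s two levels can be made concrete with a small sketch using Python’s standard `ast` module. It replaces every identifier with a placeholder and compares what remains. The helper names `_Anonymize` and `shape` and the three sample functions are made up for this illustration; this is neither a real plagiarism detector nor a legal test.

```python
import ast


class _Anonymize(ast.NodeTransformer):
    """Replace every identifier with a placeholder, keeping the tree shape."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # also visit arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node


def shape(source: str) -> str:
    """Dump the syntax tree of `source` with all names erased.

    Caveat: because every name becomes the same placeholder, this crude
    version cannot tell `x * y` from `x * x`.
    """
    tree = _Anonymize().visit(ast.parse(source))
    return ast.dump(tree)


a = "def area(width, height):\n    return width * height\n"
b = "def f(x, y):\n    return x * y\n"
c = "def f(x, y):\n    return x + y\n"

print(shape(a) == shape(b))  # True: different names, same structure
print(shape(a) == shape(c))  # False: same names as b, different operation
```

On the naming level, `a` and `b` say different things to a human reader. On the level of the abstract syntax, they are the same program.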

I replied:

Sounds reasonable but I’m not sure this distinction helps Americans solve problems such as these (from the Schneier blog linked above).

Now quoting from Bruce Schneier’s blog post again:

Tornado Cash is not a traditional company run by human beings, but instead a series of “smart contracts”: self-executing code that exists only as software. Critics argue that prohibiting Americans from using Tornado Cash is a restraint of free speech, pointing to court rulings in the 1990s that established that computer language is a form of language, and that software programs are a form of speech. They also suggest that the Treasury Department has the authority to sanction only humans and not software.

My argument being that in that respect, copilot is not a free speech problem but a copyright violation problem. 😆

Wim disagreed:

I don’t think so, only for the specific cases where it replicates existing code. Copyright is very narrow. When copilot for example suggests code based on code that you have just written, the question is, is this code written by that machine subject to freedom of speech? If it is, why? If not, then to what degree can generated code still be subject to this ruling? Does the freedom of expression rest with the owner or the creator of the code generator? Of course this is the case for all software-generated text, not just code.

I said:

I suspect that outside of the very narrow confines of copyright (and trademarks and other limits), copilot is about as safe as all those text-to-image generators that people are talking about these days. It’s currently a grey area, but I think in the end, they’ll all be safe. Between humans, paraphrasing is fine, copying style is fine. Maybe people aren’t happy about it, but they can only litigate when actual copying of text or images happens, not when similarities arise, as far as I understand it. So we might dislike it, like we might not respect an unoriginal artist, but it’s not *verboten*.

By “safe” I mean “safe from economically terminal litigation” – in other words: will legal threats force Midjourney, Stable Diffusion or CoHost off the web because they generate pictures or text? No. Because they do style transfers? No. Because they use public works for training? No. Because you can generate porn with Mickey Mouse? Hell yeah. Because you can steal code from big software companies that host some of their code under a free license on GitHub? Maybe…

Which is why the case against smart contracts will be won using the sort of arguments used against the enabling of criminal behaviour (i.e. as tricky as regulating arms, dual use goods, and so on).

At the end of the conversation, Wim said:

I have thought a bit more about the “free speech” issues, and I think the degree to which the programs create new content is irrelevant from that perspective. In legal terms it will all be about ownership and responsibility, not about any perceived agency of the machine.

– Alex 2022-10-20 10:56 UTC

---

In the meantime, The Register reports:

GitHub Copilot – a programming auto-suggestion tool trained from public source code on the internet – has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim. – How GitHub Copilot could steer Microsoft into a copyright storm

How GitHub Copilot could steer Microsoft into a copyright storm

This refers to the following initiative:

Maybe you don’t mind if GitHub Copilot used your open-source code without asking. But how will you feel if Copilot erases your open-source community? – GitHub Copilot investigation, by Matthew Butterick

GitHub Copilot investigation, by Matthew Butterick

– Alex 2022-10-20 10:59 UTC