💾 Archived View for gemini.abiscuola.com › gemlog › 2022 › 10 › 18 › do-not-use-copilot-at-work.gmi captured on 2024-03-21 at 15:02:55. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Do not use Copilot at work

I'm not a lawyer. What's written here is my personal opinion only.

I was reluctant to write anything about GitHub Copilot. But after some time (and some drama), I believe I'm ready to share my thoughts on it. In and of itself, the idea isn't bad. However, the GitHub developers, as shown by the backlash they are facing, are a bit too cavalier in dismissing how the way Copilot works, and the kind of output it emits, intersects with copyright law. This is, IMHO, where things are still unclear.

I'm of the opinion that training the model on code that's under any kind of open source license will never be a problem. In the end, I think Copilot is not a derivative work of all the code it's trained on. However, things start to get problematic when Copilot actively suggests a piece of code for you to use.

Copilot "suggests" copyrighted code

OK, Davis more or less explicitly triggered Copilot into burping back his own code. However, the main problems are:

Copilot literally suggested the whole implementation, slightly tweaked.
The license and author attributions are missing.

And on the second point, Copilot may be violating the copyright of code released under even the most permissive license possible. Let's take rssgoemail, for instance.

rssgoemail repository

It's released under the ISC license, which in essence says:

You can do whatever you want with my code, but please, preserve the copyright notice and give credit to the copyright holders.

If I decided to mirror my Fossil repositories on GitHub, my code would end up in the Copilot training set. Imagine somebody happily hacking at work, using Copilot. At some point Copilot decides to burp out, maybe because of a bug, swaths of rssgoemail code without also writing the copyright notice. This means that if this company, or person, releases their product with my code in it, be it open source or proprietary, I'll have the right to sue them for copyright infringement. Not good.
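
To make the scenario concrete, here is a minimal sketch in Python of what "the notice went missing" means for ISC-licensed code. It is only an illustration under loud assumptions: the function name, the regular expressions and the sample author "Jane Doe" are all hypothetical, and a text match like this is a crude heuristic, not a legal test.

```python
import re

# Heuristic: a line such as "Copyright (c) 2022 Jane Doe".
# The year and the name are placeholders, not taken from any real project.
COPYRIGHT_RE = re.compile(r"copyright\s+(\(c\)|©)?\s*\d{4}", re.IGNORECASE)

# Opening words of the ISC permission notice. Both the "and/or" wording
# and the older "and" wording are accepted.
PERMISSION_RE = re.compile(
    r"permission to use, copy, modify, and(/or)? distribute this software",
    re.IGNORECASE,
)


def keeps_isc_notice(source: str) -> bool:
    """Return True if the text has a copyright line and the permission notice."""
    # Collapse whitespace so a notice wrapped over several lines still matches.
    normalized = " ".join(source.split())
    return bool(COPYRIGHT_RE.search(normalized)) and bool(
        PERMISSION_RE.search(normalized)
    )


original_file = """/*
 * Copyright (c) 2022 Jane Doe
 *
 * Permission to use, copy, modify, and/or distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 */
func main() {}
"""

# What a suggestion would look like if only the code body came back.
suggested_snippet = "func main() {}\n"
```

Here `keeps_isc_notice(original_file)` is true and `keeps_isc_notice(suggested_snippet)` is false: the code is the same, but the one condition the license asks for is gone.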

There are people focusing on the training bit of Copilot, but I consider that safe. However, as the example demonstrates, the problem is the kind of code it emits. It's not just basic code completion, where a function name with its arguments is suggested; in that case, the suggestion would be too small to be a copyright violation. The problem is that Copilot is too stupid and acts, essentially, like a glorified copy-paster from Stack Overflow.

Another issue is, in my opinion, the attitude of programmers and their total disrespect for other people's work.

The simple act of copy-pasting code from Stack Overflow (a favourite of the modern "full-stack" bro) represents a potential copyright violation, but for some reason, as usual, programmers are too busy "creating value" to care about the law. The proof is that a lot of people are using Copilot for their own projects, or even professionally, and they never asked themselves whether it is safe to do so. Even once it is finally clear that Copilot is problematic under current copyright law, those programmers will continue to use it instead of saying:

Wait a minute. We should avoid it, until the matter is clarified in court, or copyright law is amended. We may be sued for that!

But this doesn't cross those people's minds, given they never consider the law while writing code: "It's so useful, it writes the boilerplate for me". Aha, but that boilerplate, as shown, might be somebody else's copyrighted work. It's not like a codegen program, where the algorithm driving the generation is pretty clear and, in general, you are giving it a grammar (or whatever) to generate the code from. There is no "learning" there. A lot of people releasing their code under an open source license are also guilty of this, never considering the ramifications of integrating Copilot's code.
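
For contrast, this is roughly what a classic codegen program looks like. It is a minimal Python sketch, assuming a made-up template and a made-up `generate_getters` helper rather than any specific tool: the template stands in for the "grammar", and every character of the output can be traced back to it and to the field names you pass in.

```python
from string import Template

# The template is the whole "grammar": nothing in the output comes from
# anywhere else. Both the template and the helper name are hypothetical.
GETTER_TEMPLATE = Template(
    "def get_${field}(record):\n"
    "    return record[\"${field}\"]\n"
)


def generate_getters(fields):
    """Generate one accessor function per field name, deterministically."""
    return "\n".join(GETTER_TEMPLATE.substitute(field=name) for name in fields)


generated = generate_getters(["title", "link"])
print(generated)
```

The same input always gives the same output, and the only authors involved are whoever wrote the template and whoever chose the field names.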

Personally, I haven't even tried Copilot, because I don't trust any machine learning model to do a decent job at this kind of thing at the current state of the art, with the legal aspect being my main concern. It seems there is a series of lawsuits being prepared around this and I consider that a good thing, because if GitHub's line of thinking prevails, it will mean, potentially, that any kind of license applied to open source code, be it a BSD license or even the GPL, becomes garbage.

If Copilot-generated code becomes OK under the "fair use" doctrine, I wouldn't be surprised if other machine learning models come out, with their training set being the whole internet's public repositories, or targeted ones, with the aim of circumventing license restrictions. It sounds like a conspiracy theory, but we all know what greedy humans can do to feed their bottom lines. Imagine how happy somebody like Stallman would be if some important code generation algorithms from GCC were included in a proprietary compiler, with everything being well and good because the code was created by a "tool". There is still the argument that you are, as a programmer, ultimately responsible for the code you integrate, but I have the impression that this will not be enough in the case of a machine learning model like Copilot.

For now, I have a request for you all: if you like any of my projects and you feel that mirroring the Fossil repositories to GitHub may be a good idea, do not do that. Until the matter is settled, I'm not comfortable with my code landing there. I know perfectly well that I have no legal recourse, unless I change my projects' licenses to stop you from doing so, but please, if you want to mirror, use Codeberg, use SourceHut, whatever.

But do not use GitHub or any other platform with a similar machine learning artifact.

As I said in the title, IMHO you should not use Copilot for now, in particular in a professional setting. GitHub's position is still risky and you could, potentially, be sued for integrating somebody else's code without attribution. Don't do that.

Someday it might be totally fine, but for now, I feel it's not.