Embrace, Extend, Extinguish.
This is a quote oft used in reference to Microsoft. Yes, that Microsoft: the producer of such bad software as Windows 11, Windows 10, Windows 8, Windows 7, Microsoft Teams and Skype, and the owner of Mojang (the developer of Minecraft) and of Github.
Microsoft has for some time been suspiciously buddy-buddy with Open Source Software, to the point that it had become accepted, but as usual you cannot trust a corporation with anything you value. Please note, I refer here to Open Source Software, which is typically exploited by corporations and typically uses weak licenses with no protections, such as the MIT and BSD Licenses, which simply require that the license be reproduced verbatim. Free Software on the other hand places value in Copyleft ideals: turning Copyright on its head and demanding that derivative works be distributed under the same terms, source included, if the GPL is used.
So recently, Github announced a project they had been working on: Github CoPilot. Initially I paid it little attention. "Huh, it's autocomplete for code? That's neat, and it will render a lot of the easy, mass-produced work of web developers redundant". Dig a little deeper, though, and you'll see this is a big turning point for Copyleft.
Why is it a big turning point? The data that Github used is all[1] public repositories, ignoring licenses. Some licenses are easier to satisfy, like the MIT License or BSD License, but others are by their nature very hard, if not impossible, for a corporation to satisfy. Such licenses include the GPLv3, GPLv2, AGPL and Apache Licenses, as they all impose conditions on use. A study was produced on this by Github (which naturally introduces a possibility of bias)[2]. What strikes me as interesting is this quote taken from the results.
"For most of GitHub Copilot's suggestions, our automatic filter didn’t find any significant overlap with the code used for training. But it did bring 473 cases to our attention. Removing the first bucket (cases that look very similar to other cases) left me with 185 suggestions. Of these, 144 got sorted out in buckets 2 - 4. This left 41 cases in the last bucket, the “recitations”, in the meaning of the term I have in mind."
"That corresponds to 1 recitation event every 10 user weeks (95% confidence interval: 7 - 13 weeks, using a Poisson test)."
This presents two things. Firstly, it does recite the training data frequently enough that the legal question must be raised. Secondly, the recitations span all the various licenses, even those whose conditions a corporation cannot satisfy.
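As a rough sanity check on that quoted interval, the exact Poisson confidence bounds can be reproduced from the 41 recitations. The figure of roughly 410 user weeks below is my own back-calculation from "1 recitation event every 10 user weeks", not a number Github published, so treat this as a sketch of the method rather than a reproduction of their analysis.

```python
# Sketch: exact Poisson confidence interval for the recitation rate.
# 41 recitations comes from Github's study; ~410 user weeks of exposure is an
# assumption back-calculated from "1 recitation every 10 user weeks".
from scipy.stats import chi2

events = 41
user_weeks = 410.0  # assumed exposure, not stated explicitly in the study

# Exact 95% bounds on the expected number of recitation events
lower_events = chi2.ppf(0.025, 2 * events) / 2
upper_events = chi2.ppf(0.975, 2 * (events + 1)) / 2

# Convert to "user weeks per recitation", the form used in the quote
print(f"{user_weeks / upper_events:.0f} to {user_weeks / lower_events:.0f} weeks per recitation")
# Prints roughly "7 to 14 weeks", in line with the quoted 7 - 13 week interval
```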
An additional note, not mentioned in Github's study, is the possibility of both misattribution and incorrect licensing[3]. This means that copyright infringement will occur, and even when CoPilot produces a license that can be satisfied, you have no guarantee that it *actually* is satisfied. The weight of copyright infringement additionally lies with the developer: this is merely a tool, and it is up to the developer to properly license their work.
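By way of illustration only, a developer who wants to take that responsibility seriously could at least compare a suggestion against code they already know to be licensed before committing it. The snippets and the 0.9 threshold below are hypothetical, and this is nothing more than a minimal similarity check, not a substitute for a real clearance process.

```python
# Minimal sketch: flag a suggestion that is near-verbatim to known licensed code.
# The snippets and the threshold are hypothetical examples.
from difflib import SequenceMatcher

def looks_like_recitation(suggestion: str, known_licensed: str, threshold: float = 0.9) -> bool:
    """Return True if the suggestion closely matches known licensed code."""
    # Normalise whitespace so trivial reformatting does not hide a copy
    a = " ".join(suggestion.split())
    b = " ".join(known_licensed.split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

known_gpl = "i = 0x5f3759df - (i >> 1);  // what the..."    # hypothetical known-GPL line
suggested = "i = 0x5f3759df - ( i >> 1 ); // what the..."   # hypothetical CoPilot output

if looks_like_recitation(suggested, known_gpl):
    print("Near-verbatim match: check the original license before using this.")
```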
As such, the legal problem I will lay out is this. The GPL must be applied if a GPL-licensed work is modified. Github CoPilot produces work with GPL-licensed works in its data set. The first argument, then, is that any data set containing GPL-licensed work is both using and modifying that work. The second argument is a question: if Github CoPilot produces a work that is a verbatim copy of another licensed work, is it subject to that licensed work's conditions?
The latter point is complicated by the presence of "generic solutions". These typically occur within algorithm design, where a solution is generic and already well known, for example Dijkstra's algorithm. Even a verbatim copy of such code is very unlikely to hold up as copyright infringement, given how frequently the problem crops up and the number of optimised solutions and implementations for it.
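To make the "generic solution" point concrete, the sketch below is the kind of textbook Dijkstra implementation that thousands of public repositories contain in near-identical form; the graph at the bottom is simply a made-up example.

```python
# A textbook-style Dijkstra's algorithm: the sort of generic solution that
# countless repositories implement almost identically, which is why a verbatim
# match alone says little about where a suggestion was copied from.
import heapq

def dijkstra(graph, source):
    """Shortest distances from source in a graph of {node: [(neighbour, weight), ...]}."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist

# Made-up example graph
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```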
The reason this is a particularly noteworthy event is that there is no prior legal precedent, so the outcome here will set the precedent for future similar cases. If it falls out of Copyleft's favour, we get what I will call "License Laundering": GPL-licensed work could be laundered through a tool like Github CoPilot into a license compatible with, and usable by, industry. This defeats the virality of the GPLv3 and AGPL, makes a new license necessary to handle this new corner case, and renders all prior code under the GPLv3 and AGPL possible for big corporations to launder and use. If instead it holds that Github is in breach of copyright on many licenses, then there is nothing to worry about.
As a side point, I feel I should mention that GPT-3, the AI produced by the not-at-all-open OpenAI, is effectively purchased and owned by Microsoft[4]. This adds some credibility to the idea that this is the extend step. The embrace step was to purchase Github, and with it acquire all the publicly available and licensable code. They then extend by purchasing an exclusive license to GPT-3 from the for-profit and non-open company OpenAI[5]. The extinguish step is to extinguish restrictive licenses and come into possession of a great quantity of code that is as free as public domain, in effect getting a huge amount of free code. It also extinguishes bad programmers, but to me that is no loss at all.
Embrace... Extend... Extinguish. As one could probably expect, I am personally against this form of License Laundering, as it would have disastrous effects on Free Software. But let's not kid ourselves, the problem of Machine Learning algorithms overfitting to their training data was eventually going to crop up sometime... I suppose that time is now.
[2] Research recitation on Github's CoPilot.
[3] Github CoPilot reproduces verbatim the fast inverse square root and incorrectly licenses it.
[4] Microsoft teams up with OpenAI to exclusively license the GPT-3 language model.
[5] OpenAI is a for-profit company that used to be a non-profit.