There has been a lot of talk, and a lot of litigation, about how copyright might affect AI and its users. The lawsuit filed against OpenAI by the Authors Guild and several big-name authors, including George R.R. Martin, John Grisham, and Jodi Picoult (Case 1:23-cv-08292), alleging that ChatGPT infringes the copyright in the authors’ works, is just one example. This is all fascinating, and has generated many column inches in the year since generative AI burst onto the scene, but what about copyleft? Could this bastion of Free and Open-Source Software mean that coders who finish a big project with a little AI help discover that, because they inadvertently used some copyleft code, their own code must be made available for others to use and develop?
Let us start by defining the terms copyright and copyleft. Copyright, which is probably familiar to us all, protects creative works – it gives the owner the exclusive right to copy and use the work. This means, for example, that we can’t produce and distribute copies of a book or a film without authorisation from the owner of the copyright in that book or film. Whilst copyright may be said to restrict the use of a work, a copyleft licence promotes it.
A key aspect of the concept of copyleft is that a work made available under a copyleft licence may be used and modified, but the resulting derivative works must be made available for others to use and modify in the same way. This means one can build upon a piece of copyleft software, for example, but one must then make the resulting derivative work available under the same copyleft conditions. The way that copyleft licences spread to any derivative works, and then to any derivatives of those derivatives, and so on, has earned them the name “viral licences”.
The idea of copyleft has its roots in the field of software development and in the person of Richard Stallman. In 1985 he created the first copyleft licence, the Emacs General Public Licence, after a company modified software he had made freely available and then refused to let him continue developing their updated version. The Emacs General Public Licence later evolved into the GNU General Public Licence (GPL), now a family of some of the best-known and most popular copyleft licences. Stallman went on to launch the GNU Project, a mass collaborative initiative for the development of free software, and to found the Free Software Foundation (FSF), a nonprofit with a worldwide mission to promote free software and computer user freedom.
Copyleft licences are closely related to the idea of Free and Open-Source Software (FOSS), whose principles grew out of the benefits found by early software engineers in sharing and collaborating with others. A core principle of copyleft is that anyone ought to be able to benefit from a work, but that further developments should benefit everyone else, too.
The concept of copyleft seems fairly straightforward when the simple case of modifying some copyleft software is considered – one can do so, but one has to make the modified version available for everyone else to use and develop, too. But sometimes it’s not quite so easy. For example, when exactly is a work a derivative of another piece of work? The picture is muddied further when we look at AI, and particularly the recent advances in Generative AI.
In general, AI tools work because they have been trained on examples – this is the principle of machine learning, that an AI can learn from training data and can generalise the answers or solutions to problems. In this way, an AI tool can complete a job without explicit programming of how exactly that particular job should be executed. For example, an AI tool can be trained to recognise images of dogs, and if trained properly it will be able to recognise a dog in an image it has never seen before because it is able to generalise what it has learned from the training images.
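The learning-and-generalising principle described above can be sketched with a deliberately tiny example. The feature values and labels below are invented purely for illustration – real image recognisers use neural networks over pixels, not two hand-picked numbers – but the key behaviour is the same: the model labels an input it has never seen by generalising from its training examples.

```python
# Toy 1-nearest-neighbour classifier: labels an unseen example by
# finding the closest labelled training example.
# (Hypothetical features: weight in kg, ear length in cm.)
import math

training_data = [
    ((30.0, 10.0), "dog"),
    ((25.0, 9.0), "dog"),
    ((4.0, 4.0), "cat"),
    ((5.0, 5.0), "cat"),
]

def classify(example):
    """Label an unseen example by its nearest training example."""
    nearest = min(training_data, key=lambda item: math.dist(item[0], example))
    return nearest[1]

# This exact input appears nowhere in the training data, yet it is
# still labelled, because the model generalises from what it has seen.
print(classify((28.0, 9.5)))  # -> dog
```

No explicit rule for "what makes a dog" was ever programmed; the behaviour emerges from the examples alone, which is the point the paragraph above makes.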
Many of the current legal issues surrounding AI have to do with the use of copyrighted material for training an AI tool – for example, the lawsuit against OpenAI mentioned above. This lawsuit also concerns the ability of OpenAI software to “copy” authors’ works by generating passages which are the same as or close to the original copyrighted material.
In a similar way, what if material made available under a copyleft licence is used as training data to train an AI tool? Could this lead to the AI tool effectively reproducing some of that copyleft material to the extent that the resulting work is considered a derivative and must therefore be made available under the same terms as the original material?
As indicated above, an AI tool does not generally copy from what it has learned; it generalises from training data in order to carry out tasks without specific instructions on how to do so. But even if an AI tool does not explicitly and deliberately copy solutions from its training data, it is easy to see how it might offer a solution that is substantially the same as one it was trained on – for example, when the problems are identical – particularly given how limited the vocabulary and syntax of code are compared to ‘human’ language.
A specific example of the potential issues at the intersection of AI and copyleft licences can be found in one of the AI tools that offer to autocomplete software code. Similar to how a mobile phone’s keyboard is able to suggest the next word of a message based on what has already been written, or how a word processor suggests the next word or phrase as I type this sentence, this AI tool is able to suggest snippets or lines of code based on the code a user has already typed. The owners of this AI tool maintain that it uses probabilistic reasoning, as opposed to copying code from its training data, but there is the possibility that the suggestion the AI tool provides may be the same as some of the code from its training data – and that may happen to be code which has been made available under a copyleft licence.
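The tension described here – a tool that suggests rather than copies, yet whose suggestions can coincide with its training data – can be illustrated with a toy next-token suggester. The training snippets and function names below are invented, and real tools use vastly larger statistical models, but the mechanism is analogous: each continuation is chosen by frequency, never looked up as a unit, and the completion can still reproduce a training snippet verbatim.

```python
# Toy bigram autocomplete: count which token most often follows each
# token in the "training" code, then greedily extend a prompt with the
# most frequent continuation. (Snippets are hypothetical examples.)
from collections import Counter, defaultdict

training_code = [
    "def gpl_sort ( items ) : return sorted ( items )",
    "def helper ( items ) : return sorted ( items )",
]

# token -> Counter of the tokens that follow it in the training data
follows = defaultdict(Counter)
for snippet in training_code:
    tokens = snippet.split()
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1

def suggest(prompt, length=5):
    """Greedily append the statistically most likely next tokens."""
    tokens = prompt.split()
    for _ in range(length):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)

# Each token was picked by probability, yet the result matches the
# first training snippet word for word.
print(suggest("def gpl_sort ( items ) :"))
# -> def gpl_sort ( items ) : return sorted ( items )
```

If the first training snippet had come from copyleft-licensed code, the user would have reproduced it without ever seeing the original – which is precisely the scenario the lawsuits discussed below turn on.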
An interesting question is this: if a developer uses this tool when writing some code and inadvertently uses some copyleft code, is their software then “infected” with the copyleft code’s viral licence? The answer is not completely clear.
It seems reasonable to assume that the answer would depend, among other factors, on how much copyleft code is used and whether the snippet is specific to the original software or relatively common. For example, using a small piece of code that defines a class or a function in a commonplace way would most likely not make the overall software a derivative work – much as using a ubiquitous phrase like “she swallowed her fear” or “they released a breath they didn’t know they’d been holding” in a novel would not constitute copying for the purposes of copyright. If the piece of copyleft code is longer or more specific, the question is much harder to answer, and it is one of the issues raised in a US lawsuit (case 4:2022cv06823) filed in 2022 against this particular AI tool.
As with many aspects of AI, its intersection with copyleft licences is relatively new, and the legal ramifications are not at all clear at present. We may well have to wait for more guidance from the courts, as the multiple pending lawsuits work their way through the system, before some clarity starts to emerge.
One thing that seems clear, at least, is that the concept of open-source software, and the copyleft licences that support it, have been a boon to the development of AI and of technology in general, and this is surely something that ought to be preserved.