Taming Copyright Issues in Machine Learning for Internet Freedom

by opennet | Feb 20, 2025 | Free Speech, Innovation and Regulation, Open Blog

The Internet has advanced political equality and economic progress by giving powerless individuals the power of mass communication and information. Search engines underpinned this power of information. The mere existence of massive amounts of information is not the same as the power of information; indeed, too much information is a hurdle to it. Search engines solved this problem for everyone, including powerless individuals, thereby contributing to democracy and economic equality.

Search engines have taken their next leap through artificial intelligence. The current generation of AI is based on machine learning. Machine learning trains software on massive amounts of data and leaves within the software the weighting factors (weights) that constitute its cognitive abilities, adjusted during training to minimize an error function. The learning process is similar to how a child learns to speak, sing, and draw better even though the child may not remember the actual example sentences, songs, or objects. However, a copyright dispute has arisen over whether the authors of copyrighted works must be paid when their works are used as training data.

Copyright violations are being discussed at three possible stages. The first is when the training data is copied for the purpose of feeding it into AI. The Korean National Language Institute wanted to digitize (i.e., make digital copies of) existing books and provide them to AI so that it could improve its ability to recognize and understand the Korean language, but authors and publishers opposed and stopped the project. In the US, by contrast, the judiciary decided in Authors Guild v. HathiTrust that libraries’ digitization of books constitutes ‘fair use’ because it enhances the public’s access to the content of books and therefore benefits authors and publishers.

Secondly, a copyright issue is said to arise when machines read, view, or otherwise ingest the training data. However, copyright regulates the right to “copy”. We do not pay a royalty for reading a book, or for borrowing or lending it, but only for buying an extra copy of it. Duplicating, broadcasting, performing, and other acts similar to “copying” multiply the physical media through which an artistic or literary work can be enjoyed, so that people other than the copyright holder can enjoy it. By granting authors the exclusive right to “copy”, copyright law tries to compensate them and give them an incentive to create more works. If even the act of enjoying a creative work, such as reading or viewing it, were made exclusive to or monopolized by authors, copyright law’s original purpose, the promotion of art and culture, would be paralyzed. Who will buy a book if the buyer needs to pay a royalty every time they read it? Authors will no longer receive incentives or compensation for creation if books are not bought. The process of machines reading books or viewing artworks is no different from people doing so. For instance, inside a Large Language Model there is no copy of the training data but only the inter-related weighting factors. An LLM knows that a longer neck is related to the giraffe-likeness of an object, but there is no photo of a giraffe inside the LLM.
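The giraffe point can be illustrated with a toy model. The following is a minimal sketch, not an LLM: it assumes a single made-up feature (neck length in metres), synthetic data, and hypothetical function names chosen only for this illustration. It fits a one-feature logistic regression by gradient descent and then discards the training data, leaving two numbers behind.

```python
import math
import random


def train_giraffe_classifier(examples, epochs=2000, lr=0.1):
    """Fit a one-feature logistic regression by batch gradient descent.

    examples: list of (neck_length_m, is_giraffe) pairs, is_giraffe in {0, 1}.
    Returns only the learned parameters: (weight, bias).
    """
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted giraffe-likeness
            grad_w += (p - y) * x  # gradient of the log-loss (error function) w.r.t. w
            grad_b += (p - y)      # gradient of the log-loss w.r.t. b
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def giraffe_likeness(model, neck_length_m):
    """Score between 0 and 1 computed from the two stored numbers only."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * neck_length_m + b)))


random.seed(0)
# Hypothetical, synthetic training data: neck lengths in metres.
giraffes = [(random.uniform(1.5, 2.4), 1) for _ in range(50)]
others = [(random.uniform(0.1, 0.8), 0) for _ in range(50)]
training_data = giraffes + others

model = train_giraffe_classifier(training_data)
del training_data, giraffes, others  # the examples are gone; only the weights remain

print("stored parameters:", model)
print("likeness of a 2.0 m neck:", giraffe_likeness(model, 2.0))
print("likeness of a 0.3 m neck:", giraffe_likeness(model, 0.3))
```

After training, the model is just a weight and a bias. It can still score a new neck length, yet none of the hundred examples is stored in it. Real language models hold billions of such weights rather than two, so this sketch shows only the principle the paragraph describes.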

Some argue that much training data, such as New York Times articles, sits behind a paywall and that AI’s use of that content should therefore be restricted. However, those who have paid for access have the right to let others “read” those articles. AI’s reading of articles it has paid to access from behind the paywall is still not a copyright issue.

Thirdly, a copyright issue arises where an LLM’s output includes content similar to particular items in the training data. This case is not difficult. Whether you use machine learning or Microsoft Word, creating something similar to pre-existing copyrighted works after having been exposed to them violates copyright. Of course, an LLM does not hold “copies” in memory but accidentally creates something similar to a real work that was included in the training data. But even human songwriters who have subconsciously copied other writers’ melodies will be liable in copyright, unless they can prove that they never learned of the others’ melodies. However, this copyright violation takes place at the output stage.

In sum, copyright law does not prohibit AI from reading and viewing copyrighted works.

Reasonable enforcement of copyright will sustain equal sharing of information.

The Korean original of this article was published at Kyunghyang Shinmun.