Large Language Models, IP Infringement, and the Cost of Doing Business Across Borders: A Multidisciplinary Approach to the Use of Copyright-Protected Material in the Market for Artificial Intelligence Development

Authors

Paul Aubrecht (University of Passau)
Steffen Herbold (University of Passau)
Anamaria Mojica Hanke (University of Passau)

Abstract

Large Language Models (LLMs) have been documented to infringe copyright-protected intellectual property by memorizing training data sourced from the internet and other electronic sources, including news articles, books, social media forums, music, and code. This phenomenon is well supported by the scientific literature and underlies an increasingly common claim in civil litigation: copyright infringement by firms developing LLMs. Depending on how common (number of occurrences) and how severe (scope of occurrence) these infringements are, the costs that LLM providers face from litigation and fines may vary across a wide range of potential sanctions. This variation is also visible across borders, where copyright infringement cases against LLM providers have diverged, and the defenses available against infringement claims likewise vary across states (nations). Since training LLMs in a way that avoids infringement (e.g., by buying licenses or cleaning data) also incurs costs, this raises the question: are the costs of infringement (possible fines or liability) simply the “costs of doing business” that firms must account for, or should firms seek to avoid infringement entirely by securing licenses, cleaning their data, or excluding copyright-protected data altogether? This question also bears on the ability of regulation to enforce copyright law effectively and efficiently. Given the variation across borders in the regulation of IP infringement by LLM providers, the problem can also be examined in a regulatory competition context, in which states compete in the provision of regulation to attract the incorporation of firms. In this project, we provide a framework to estimate the expected legal costs due to copyright violations in several states.
This framework accounts for several parameters, including (1) the likelihood of infringement, (2) the severity of the infringement (based on number and scope), (3) the average costs associated with infringement, and (4) the variation in the average costs of infringement across borders. We adopt a multidisciplinary methodology, drawing on insights from law, economics, data science, and computer engineering, to show how these parameters can be described within the framework of a cyclical process and regulatory competition. In doing so, we underline the need for new and innovative approaches to regulation that address the novel challenges created by LLMs' ongoing infringement of copyright.
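As a rough illustration of how the four parameters listed above could combine, the following sketch assumes a simple multiplicative form. The abstract does not specify the functional form, so every symbol here is a hypothetical choice: $p$ is the likelihood that an infringement occurs and is pursued, $N$ the number of infringing occurrences, $s$ a scope factor, $\bar{c}_j$ the average cost per infringement in state $j$, and $C^{\text{avoid}}$ the cost of avoiding infringement through licenses or data cleaning.

```latex
% Assumed multiplicative form; not taken from the paper.
\[
  \mathbb{E}[C_j] \;=\; p \cdot N \cdot s \cdot \bar{c}_j
\]
% Under this assumption, infringement is a mere "cost of doing business"
% in state j whenever the expected legal cost is below the avoidance cost:
\[
  \mathbb{E}[C_j] \;<\; C^{\text{avoid}}
\]
```

Under this reading, a provider comparing states would look for the state $j$ with the smallest $\mathbb{E}[C_j]$, which is one way the regulatory competition described above could arise.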
The growth of Artificial Intelligence (AI) programs and providers has been significant over the past five years. With widespread implications for society, economies, cultures, and nearly every aspect of human life, the dawn of AI has challenged many aspects of the law, including the ability of intellectual property rights to protect rights holders from unlawful use by third parties. The growth in the use and availability of AI has been closely tied to a specific type of AI known as a large language model (LLM). LLMs (Definition). AI developers need to provide large sets of data so that the LLM can better predict the next word in the sequence that forms the response to a prompt. When the data sets used to train LLMs are compiled, copyrighted material is often incorporated. This poses a specific problem for developers of LLMs, copyright holders, and regulators: when should the use of copyrighted material be allowed? More specifically, under what circumstances are uses of copyright-protected material by LLMs permissible, and under what circumstances are they not? To understand this problem, we look at the potential for copyrighted material to be used by an LLM and at the potential costs of that use (including licensing fees, fines for copyright violations, and criminal penalties for copyright violations). In addition to the costs that arise from using or misusing copyrighted material, we also look at the variation across borders in the civil and criminal enforcement of copyright law. Using a comparative legal methodology, we identify how variation in the enforcement of copyright against LLMs affects the market for LLMs and the regulation of LLMs.
This analysis shows how the interaction of innovation, the protection of IP rights, the availability of enforcement mechanisms for violations of IP rights, and the economics of compliance with IP law has created an environment in which the use of copyright-protected material by LLMs is widespread and in which copyright enforcement diverges across borders. Computer modeling is also used to evaluate the decision-making process of the individual LLM provider when deciding when and how to use copyright-protected material, and when to move jurisdiction in order to benefit from the divergent approaches between states. We use a mathematical model of the costs associated with copyright to consider the hypothetical behavior of LLM providers given variation in copyright enforcement and transaction costs. The model's variables are continuous, which allows us to simulate outcomes under different conditions. We initialize each variable with a random variable that models its expected distribution, e.g., the expected costs for copyright violations. We then estimate the distribution of the expected total costs using Monte Carlo simulation. By combining methodologies from law, economics, and computer engineering, we provide a robust evaluation of the current situation concerning the use of copyright-protected material by LLMs. This contributes to the literature on how AI technology has placed stress on existing legal frameworks for copyright.
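The Monte Carlo step described above can be sketched as follows. This is a minimal illustration rather than the model used in the study: the per-work infringement probability, the log-normal cost distribution, the two states, and all numeric values are hypothetical placeholders chosen only to make the example run.

```python
import random
import statistics

# Hypothetical parameters per state (NOT estimates from the study):
# p_infringe: chance that one protected work leads to a sanctioned infringement
# cost_mu, cost_sigma: parameters of an assumed log-normal cost per infringement
JURISDICTIONS = {
    "State A": {"p_infringe": 0.30, "cost_mu": 13.0, "cost_sigma": 1.0},
    "State B": {"p_infringe": 0.30, "cost_mu": 11.5, "cost_sigma": 1.5},
}


def simulate_total_cost(params, n_works, rng):
    """Draw one total legal cost for a training set with n_works protected works."""
    total = 0.0
    for _ in range(n_works):
        # Each work independently results in an infringement with p_infringe.
        if rng.random() < params["p_infringe"]:
            # The cost of one infringement is drawn from the assumed distribution.
            total += rng.lognormvariate(params["cost_mu"], params["cost_sigma"])
    return total


def monte_carlo(params, n_works=50, n_runs=2000, seed=0):
    """Estimate the distribution of total costs by repeated random draws."""
    rng = random.Random(seed)  # seeded so that results are reproducible
    draws = sorted(simulate_total_cost(params, n_works, rng) for _ in range(n_runs))
    return {
        "mean": statistics.fmean(draws),
        "median": draws[len(draws) // 2],
        "p95": draws[int(0.95 * len(draws))],
    }


if __name__ == "__main__":
    for name, params in JURISDICTIONS.items():
        print(name, monte_carlo(params))
```

Comparing the summary statistics for each state with the cost of licensing or cleaning the data gives one way to ask whether infringement is cheaper than avoidance in that state. Reporting a high percentile alongside the mean matters here because a heavy-tailed cost distribution can make rare, very large sanctions dominate the risk.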