In late September, the News Media Alliance visited Capitol Hill to advocate for updated copyright protections in the age of artificial intelligence (AI). The alliance, which represents publishers ranging from major outlets like Vox Media to smaller regional papers, argues that unauthorized use of their content by generative AI companies violates copyright law. The issue poses a significant threat to the media industry because AI systems could displace news publishers as sources of information: if people can simply ask platforms like ChatGPT about current or historical events, they may not feel the need to visit news publishers' websites. That would undermine the news media business model and likely cost jobs.
The situation exemplifies creators' efforts to push back against how AI companies train large language models (LLMs). These models are typically trained on massive amounts of data scraped from the web and other sources, raising concerns about copyright infringement. Well-known figures, including authors George R.R. Martin and John Grisham and comedian Sarah Silverman, have filed lawsuits over the practice. Software developers are suing over GitHub Copilot, a coding assistant trained on open-source code. Visual artists have brought class actions of their own, and Getty Images has sued Stability AI over the use of its photo library. The outcome of these legal battles will shape how AI companies can train their models in the future, with significant implications for the industry.
The question of whether AI companies violate copyright has created uncertainty in a thriving market. AI-focused firms, valued at billions of dollars, attract significant interest from businesses and consumers. To alleviate concerns, Microsoft has made its Copilot Copyright Commitment, promising to defend its commercial customers against third-party copyright infringement claims arising from their use of Microsoft Copilot, a Microsoft 365 AI feature powered by OpenAI's GPT-4 model.
This new debate over data rights follows the introduction of the General Data Protection Regulation (GDPR) in Europe five years ago. Several U.S. states have since enacted data privacy legislation imposing similar or even stricter protections on user data. The GDPR has already produced substantial fines for tech giants whose businesses rely on amassing consumer data to target ads or for other purposes. One might expect these laws to cover the use of personal or copyrighted data to train AI models, but they largely predate generative AI, an illustration of how quickly technology advances and how legal frameworks struggle to keep pace.
Regardless of the state of the law, users are actively opposing predatory data harvesting practices. Where the law is clear, the Data Rights Protocol (DRP) seeks to automate the resolution of data requests between consumers and compliant companies. Where technology has outpaced the law, users turn to defensive tactics to keep their data out of AI training sets. Glaze, for example, a tool developed by Shawn Shan and his team at the University of Chicago, adds subtle perturbations to an artist's images so that AI models trained on them cannot imitate the artist's visual style. Defending against data-intensive business models will become part of a broader trend of digital sovereignty: consumers and businesses alike will assert control over their data and digital identity through court challenges, political lobbying, and technological countermeasures.
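To make the DRP idea concrete, here is a minimal sketch of what an automated data-rights exchange might look like: a consumer's authorized agent submits a machine-readable request and receives a trackable status in return, rather than exchanging emails and web forms. The endpoint, field names, and values below are illustrative placeholders, not the actual DRP schema.

```python
# Minimal sketch of an automated data-rights request, loosely in the spirit
# of the Data Rights Protocol. All endpoints and field names here are
# hypothetical placeholders, not the real DRP schema.
import json
import urllib.request

request_body = {
    "regime": "ccpa",        # legal regime the request is made under
    "exercise": "deletion",  # the data right being exercised
    "identity": {
        "email": "consumer@example.com",  # how the company verifies the consumer
    },
}

req = urllib.request.Request(
    "https://covered-business.example.com/v1/data-rights-request",  # hypothetical endpoint
    data=json.dumps(request_body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# A compliant company would respond with a request ID and a status the
# consumer's agent can poll, e.g. "open", "fulfilled", or "denied".
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```

The value of such a protocol lies less in any particular payload than in standardization: when requests and status updates are machine-readable, a single agent can exercise a consumer's rights across many companies at once.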