AI's Dirty Little Secret: Models Fueled by Pirated Books

The Atlantic just published a major investigative piece showing that popular large language models, like Meta’s LLaMA, were trained on pirated books.

Why it matters:

This raises serious copyright concerns around how large language models have been trained.

Says the article:

“Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. . . . These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet.”

In an interview for the story, the developer who created Books3 said his intentions were altruistic: he built the dataset to give independent developers “OpenAI-grade training data,” out of fear that large AI companies would otherwise hold a monopoly over generative AI tools.

Connecting the dots:

In Episode 61 of the Marketing AI Show, Marketing AI Institute founder/CEO Paul Roetzer broke down what we can expect to happen next.

  1. AI companies may try to rely on “fair use” arguments to justify this. The fair use doctrine in U.S. copyright law allows copyrighted material to be used without permission in some cases, depending on factors that include how the material is used and how much of it is used. It’s unclear whether that defense would hold up when copyrighted material is used to train AI models.
  2. But AI companies are now on notice. “It seems like, if nothing else, these companies were very aggressive in using stuff that might not have been allowed to be used,” says Roetzer. He doesn’t expect that to continue: AI companies now know they’re being watched closely in this respect, and that future laws and regulations may catch up to them.
  3. The likely way forward is licensing deals. “I assume that the play moving forward is to try and license the best examples of writing possible, including books,” says Roetzer. It’s not viable for AI companies to continue trampling on copyright, especially with lawsuits pending. So it’s possible they’ll move forward more aggressively with licensing content from trusted publishers. For instance, OpenAI and the New York Times are attempting to reach such a licensing deal. This would give models high-quality content on which to train—without breaking the law.
