
AI's Dirty Little Secret: Models Fueled by Pirated Books



The Atlantic just released a major investigative piece showing that popular large language models, including Meta’s LLaMA, were trained on pirated books.

Why it matters:

This raises serious copyright concerns around how large language models have been trained.

Says the article:

“Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. . . . These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet.”

According to an interview in the story with the creator of the Books3 dataset, it appears the dataset was built with altruistic intentions. The developer said he assembled it to give independent developers “OpenAI-grade training data,” for fear that large AI companies would otherwise hold a monopoly over generative AI tools.

Connecting the dots:

In Episode 61 of the Marketing AI Show, Marketing AI Institute founder/CEO Paul Roetzer broke down what we can expect to happen next.

  1. AI companies may try to rely on “fair use” arguments to justify this. The fair use doctrine in U.S. copyright law allows limited use of copyrighted material when certain criteria are met, including the purpose of the use and how much of the material is used. It’s unclear whether that defense holds up when copyrighted material is used to train AI models.
  2. But AI companies are now on notice. “It seems like, if nothing else, these companies were very aggressive in using stuff that might not have been allowed to be used,” says Roetzer. He doesn’t expect that to continue: AI companies now know they’re being watched closely in this respect, and future laws and regulations may catch up to them.
  3. The likely way forward is licensing deals. “I assume that the play moving forward is to try and license the best examples of writing possible, including books,” says Roetzer. It’s not viable for AI companies to continue trampling on copyright, especially with lawsuits pending. So it’s possible they’ll move forward more aggressively with licensing content from trusted publishers. For instance, OpenAI and the New York Times are attempting to reach such a licensing deal. This would give models high-quality content on which to train—without breaking the law.
