AI's Dirty Little Secret: Models Fueled by Pirated Books

Written by Mike Kaput | Aug 29, 2023 1:15:48 PM

The Atlantic just released a major piece of investigative journalism showing that popular large language models, like Meta’s LLaMA, were trained on pirated books.

Why it matters:

This raises serious copyright concerns around how large language models have been trained.

Says the article:

“Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. . . . These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet.”

According to the story's interview with the creator of Books3, the dataset of pirated books appears to have been built with altruistic intentions. The developer said he created it to give independent developers “OpenAI-grade training data,” out of fear that large AI companies would otherwise hold a monopoly over generative AI tools.

Connecting the dots:

In Episode 61 of the Marketing AI Show, Marketing AI Institute founder/CEO Paul Roetzer broke down what we can expect to happen next.

AI companies may try to rely on “fair use” arguments to justify this. The fair use doctrine in U.S. copyright law allows copyrighted material to be used without permission in some cases, depending on criteria that include how the material is used and how much of it is used. It’s unclear whether this is a viable defense for using copyrighted material to train AI models.

But AI companies are now on notice. “It seems like, if nothing else, these companies were very aggressive in using stuff that might not have been allowed to be used,” says Roetzer. He doesn’t see that continuing. AI companies now know they’re being watched closely in this respect, and that future laws and regulations may catch up to them.

The likely way forward is licensing deals. “I assume that the play moving forward is to try and license the best examples of writing possible, including books,” says Roetzer. It’s not viable for AI companies to continue trampling on copyright, especially with lawsuits pending. So it’s possible they’ll move more aggressively to license content from trusted publishers. For instance, OpenAI and the New York Times are attempting to reach such a licensing deal. This would give models high-quality content to train on without breaking the law.
