Two bestselling authors sued OpenAI in federal court in San Francisco on Wednesday, alleging in a proposed class action that the company used copyrighted intellectual property to “train” its artificial intelligence chatbot.
Authors Mona Awad and Paul Tremblay claim that ChatGPT was trained in part on their novels without their consent. Generative AI is based on software programs known as large language models, which eschew traditional programming methods and instead process large amounts of text to generate natural and lifelike responses to user input.
When prompted, ChatGPT produced extremely detailed summaries of Tremblay’s “The Cabin at the End of the World” and Awad’s “Bunny” and “13 Ways of Looking at a Fat Girl.” Both authors claim this is proof that their novels were used to train the chatbot, and the complaint includes ChatGPT’s responses to prompts about their novels.
According to the lawsuit, much of the material OpenAI uses to train its generative chatbots comes from copyrighted works, including books by Awad and Tremblay, “which were copied by OpenAI without consent, without attribution, and without compensation.”
The lawsuit alleges that a variety of materials were used to train the large language models, but that books “have been an important part of training datasets for large language models because books provide the best examples of high-quality long-form writing.”
In June 2018, OpenAI announced that it had trained GPT-1 with BookCorpus, which the lawsuit describes as a “controversial dataset” compiled by artificial intelligence researchers in 2015 and containing a collection of “over 7,000 unique unpublished books” across genres including adventure, fantasy and romance.
“They copied the books from a website called Smashwords.com, which hosts unpublished novels that are available to readers for free. However, these novels are mostly copyrighted.”
According to the complaint, later iterations of the company’s large language models were trained on significantly larger sets of in-copyright books. In a July 2020 paper introducing GPT-3, the company revealed that 15% of the training dataset came from “two internet-based book corpora,” which OpenAI simply called “Books1” and “Books2.”
The lawsuit estimates, based on figures published in OpenAI’s GPT-3 paper, that Books1 contains approximately 63,000 titles and Books2 approximately 294,000 titles.
“Because the OpenAI language models cannot function without the expressive information extracted from and stored in plaintiffs’ (and others’) works, the OpenAI language models themselves constitute derivative works made without the plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act,” the lawsuit states.
Also on Wednesday, Clarkson, a public interest law firm, filed a broader class-action lawsuit on behalf of a dozen anonymous clients, accusing OpenAI of stealing private, sometimes identifying, information from internet users “without their informed consent or knowledge,” according to a report in Rolling Stone. Experts predict that more lawsuits will follow as AI becomes more adept at using information from the internet to generate new content.