The Nonprofit Feeding the Entire Internet to AI Companies

Editor’s note: This work is part of AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.


The Common Crawl Foundation is little known outside of Silicon Valley. For more than a decade, the nonprofit has been scraping billions of webpages to build a massive archive of the internet. This database—large enough to be measured in petabytes—is made freely available for research. In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models. In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this—as well as masking the actual contents of its archives.

Common Crawl has not said much publicly about its support of LLM development. Since the early 2010s, researchers have used Common Crawl’s collections for a variety of purposes: to build machine-translation systems, to track unconventional uses of medicines by analyzing discussions in online forums, and to study book banning in various countries, among other things. In a 2012 interview, Gil Elbaz, the founder of Common Crawl, said of its archive that “we just have to make sure that people use it in the right way. Fair use says you can do certain things with the world’s data, and as long as people honor that and respect the copyright of this data, then everything’s great.”

Common Crawl’s website states that it scrapes the internet for “freely available content” without “going behind any ‘paywalls.’” Yet the organization has taken articles from major news websites that people normally have to pay for—allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl’s executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. “The robots are people too,” he told me, and should therefore be allowed to “read the books” for free. Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not.

I’ve discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, has written, “Generative AI in its current form would probably not be possible without Common Crawl.” In 2020, OpenAI used Common Crawl’s archives to train GPT-3. OpenAI claimed that the program could generate “news articles which human evaluators have difficulty distinguishing from articles written by humans,” and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers’ articles to train models that summarize and paraphrase the news, and are deploying those models in ways that steal readers from writers and publishers.

Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from “Search 2.0”—referring to the generative-AI products now widely being used to find information online—and said that, anyway, it is the publishers that made their work available in the first place. “You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,” he said.

Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles. Thus, by my estimate, the foundation’s archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper’s, and The Atlantic.
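
The general principle is easy to sketch. The snippet below is illustrative only, not Common Crawl’s actual crawler (which is separate software), and the URL is hypothetical: a bare HTTP request receives the article HTML exactly as the server sends it, and the paywall script that would hide the text from non-subscribers never runs, because nothing executes it.

    # Sketch of the general principle only; not Common Crawl's crawler.
    # The URL is hypothetical. A bare HTTP fetch receives the HTML as
    # served; the client-side paywall JavaScript never runs, so any
    # article text in the markup is still there.
    import requests
    from bs4 import BeautifulSoup

    url = "https://news.example.com/2025/some-paywalled-article"  # hypothetical
    html = requests.get(url, headers={"User-Agent": "demo-fetcher/0.1"}, timeout=30).text

    # The article body is still present in the raw markup.
    soup = BeautifulSoup(html, "html.parser")
    print(soup.get_text(separator="\n")[:500])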

Some news publishers have become aware of Common Crawl’s activities, and some have blocked the foundation’s scraper by adding a rule to their website’s robots.txt file. In the past year, Common Crawl’s CCBot has become the scraper most widely blocked by the top 1,000 websites, surpassing even OpenAI’s GPTBot, which collects content for ChatGPT. However, blocking only prevents future content from being scraped. It doesn’t affect the webpages Common Crawl has already collected and stored in its archives.
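
Both crawler names are publicly documented, CCBot by Common Crawl and GPTBot by OpenAI, and both claim to honor the robots.txt convention. A blocking rule of the kind described above looks roughly like this (a sketch of the standard directives; compliance is voluntary on the crawler’s side, which is part of why it affects only future crawls):

    # robots.txt, served at the site's root. CCBot and GPTBot are the
    # documented user-agent tokens for Common Crawl's and OpenAI's crawlers.
    User-agent: CCBot
    Disallow: /

    User-agent: GPTBot
    Disallow: /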

In July 2023, The New York Times sent a notice to Common Crawl asking for the removal of previously scraped Times content. (In its lawsuit against OpenAI, the Times noted that Common Crawl includes “at least 16 million unique records of content” from Times websites.) The nonprofit seemed amenable to the request. In November of that year, a Times spokesperson, Charlie Stadtlander, told Business Insider: “We simply asked that our content be removed, and were pleased that Common Crawl complied.”

But as I explored Common Crawl’s archives, I found that many Times articles appear to still be present. When I mentioned this to the Times, Stadtlander told me: “Our understanding from them is that they have deleted the majority of the Times’s content, and continue to work on full removal.”

The Danish Rights Alliance (DRA), an organization that represents publishers and other rights-holders in Denmark, told me about a similar interaction with Common Crawl. Thomas Heldrup, the organization’s head of content protection and enforcement, showed me a redacted email exchange with the nonprofit that began in July 2024, in which the DRA requested that its members’ content be removed from the archive. In December 2024, months after the DRA had initially requested removal, Common Crawl’s attorney wrote: “I confirm that Common Crawl has initiated work to remove your members’ content from the data archive. Presently, approximately 50% of this content has been removed.” I spoke with other publishers who’d received similar messages from Common Crawl. One was told, after multiple follow-up emails, that removal was 50 percent, 70 percent, and then 80 percent complete.

By writing code to browse the petabytes of data, I was able to see that large quantities of articles from the Times, the DRA, and these other publishers are still present in Common Crawl’s archives. Furthermore, the files are stored in a system that logs the modification times of every file. The foundation adds a new “crawl” to its archive every few weeks, each containing 1 billion to 4 billion webpages, and it has been publishing these regular installments since 2013. None of the content files in Common Crawl’s archives appears to have been modified since 2016, suggesting that no content has been removed in at least nine years.
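
Those modification times are visible to anyone, because the archive is distributed as files in a public cloud-storage bucket that records when each object was last written. The snippet below is a minimal sketch of that kind of check, not the author’s actual code; it assumes the publicly documented “commoncrawl” S3 bucket, its crawl-data/CC-MAIN-<year>-<week>/ naming scheme, and anonymous read access, and the crawl label shown is only an example.

    # Sketch only, not the author's code. Assumes anonymous read access to
    # the public "commoncrawl" S3 bucket and its crawl-data/ prefix layout.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", region_name="us-east-1",
                      config=Config(signature_version=UNSIGNED))

    # Walk one crawl's files and print each object's last-modified time.
    prefix = "crawl-data/CC-MAIN-2016-30/"  # example crawl label
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket="commoncrawl", Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            print(obj["LastModified"].isoformat(), obj["Key"])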

In our first conversation, Skrenta told me that removal requests are “a pain in the ass” but insisted that the foundation complies with them. In our second conversation, Skrenta was more forthcoming. He said that Common Crawl is “making an earnest effort” to remove content but that the file format in which Common Crawl stores its archives is meant “to be immutable. You can’t delete anything from it.” (He did not answer my question about where the 50, 70, and 80 percent removal figures come from.)

Yet the nonprofit appears to be concealing this from visitors to its website, where a search function, the only nontechnical tool for seeing what’s in Common Crawl’s archives, returns misleading results for certain domains. A search for nytimes.com in any crawl from 2013 through 2022 shows a “no captures” result, when in fact there are articles from NYTimes.com in most of these crawls. I also discovered more than 1,000 other domains that produce this incorrect “no captures” result for at least several of the crawls, and most of these domains belong to publishers, including the BBC, Reuters, The New Yorker, Wired, the Financial Times, The Washington Post, and, yes, The Atlantic. According to my research and Common Crawl’s own disclosures, the companies behind each of these publications have sent legal requests to the nonprofit. At least one publisher I spoke with told me that it had used this search tool and concluded that its content had been removed from Common Crawl’s archives.

In the past two years, Common Crawl has been getting cozier with the AI industry. In 2023, after 15 years of near-exclusive financial support from the Elbaz Family Foundation Trust, it received donations from OpenAI ($250,000), Anthropic ($250,000), and other organizations involved in AI development. (Skrenta told me that running Common Crawl costs “millions of dollars.”)

When training AI models, developers such as OpenAI and Google usually filter Common Crawl’s archives to remove material they don’t want, such as racism, profanity, and various forms of low-quality prose. Each developer and company has its own filtering strategy, which has led to a proliferation of Common Crawl–based training data sets: c4 (created by Google), FineWeb, DCLM, and more than 50 others. Together, these data sets have been downloaded tens of millions of times from Hugging Face, an AI-development hub, and other sources.

But Common Crawl doesn’t only supply the raw text; it has also been helping assemble and distribute AI-training data sets itself. Its developers have co-authored multiple papers about LLM-training-data curation, and they sometimes appear at conferences where they show AI developers how to use Common Crawl for training. Common Crawl even hosts several AI-training data sets derived from its crawls, including one for Nvidia, the most valuable company in the world. In its paper on the data set, Nvidia thanks certain Common Crawl developers for their advice.

AI companies have argued that using copyrighted material is fair use, and Skrenta has been framing the issue in terms of robot rights for some time. In 2023, he sent a letter urging the U.S. Copyright Office not “to hinder the development of intelligent machines” and included two illustrations of robots reading books. But this argument obscures who the actors are: not robots but corporations, and their powerful executives, who decide what content to train their models with and who profit from the results.

If it wanted to, Common Crawl could mitigate the damage done by those corporations to authors and publishers without making its data any less accessible to researchers. In his 2024 report, Baack, the ex-Mozilla researcher, pointed out that Common Crawl could require attribution whenever its scraped content is used. This would help publishers track the use of their work, including when it might appear in the training data of AI models that aren’t supposed to have access. This is a common requirement for open data sets and would cost Common Crawl nothing. I asked Skrenta if he had considered this. He told me he had read Baack’s report but didn’t plan on taking the suggestion, because it wasn’t Common Crawl’s responsibility. “We can’t police that whole thing,” he told me. “It’s not our job. We’re just a bunch of dusty bookshelves.”

Skrenta has said that publishers that want to remove their content from Common Crawl will “kill the open web.” Likewise, the AI industry often defends its presumed right to scrape the web by invoking the concept of openness. But others have pointed out that generative-AI companies are the ones killing openness, by motivating publishers to expand and strengthen their paywalls to defend their work (and their business models) from exploitative scrapers.

Promoting another dubious, feel-good idea, Common Crawl has said that the internet is “where information lives free,” echoing the techno-libertarian rallying cry that “information wants to be free.” In popular usage, the phrase is frequently stripped of its context. It comes from a remark made by the tech futurist Stewart Brand in 1984. In a discussion about how computers were accelerating the spread of information, Brand observed that “information sort of wants to be expensive, because it’s so valuable.” But, paradoxically, he said, “information almost wants to be free” because computers make the cost of distributing it so low. In other words, it’s not that information should be free—rather, computers tend to make it seem free. Yet the idea is deployed today by secretive organizations such as Common Crawl that choose which information “lives free” and which doesn’t.

In our conversation, Skrenta downplayed the importance of any particular newspaper or magazine. He told me that The Atlantic is not a crucial part of the internet. “Whatever you’re saying, other people are saying too, on other sites,” he said. Throughout our conversation, Skrenta gave the impression of having little respect for (or understanding of) how original reporting works.

Skrenta did, however, express tremendous reverence for Common Crawl’s archive. He sees it as a record of our civilization’s achievements. He told me he wants to “put it on a crystal cube and stick it on the moon,” so that “if the Earth blows up,” aliens might be able to reconstruct our history. “The Economist and The Atlantic will not be on that cube,” he told me. “Your article will not be on that cube. This article.”

By Alex Reisner, The Atlantic.
