US Publishers Demand Common Crawl Stop Scraping Their Content

Jun 10, 2026 07:45 AM

Digital Content Next, a trade group representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation.

The letter demands Common Crawl stop collecting publisher content and remove material already in its datasets.

DCN CEO Jason Kint announced the legal notice in a blog post, and Press Gazette reported further details from the letter this week.

Common Crawl has crawled several billion new pages each month since 2007 to build a free public archive. That archive has been used to train many of the AI models in use today. OpenAI’s GPT-3 paper listed filtered Common Crawl as 60% of the model’s training mix.

The conflict matters for any site that blocks AI crawlers. Blocking Common Crawl’s crawler, CCBot, stops future collection but doesn’t touch content already in the archive, which anyone can still download.
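For site owners who want to confirm what a CCBot block actually covers, the check can be sketched in a few lines of Python using the standard library's robots.txt parser. This is a minimal sketch under stated assumptions: the robots.txt content, the `is_blocked` helper, and the example.com URL are hypothetical and used only for illustration. It tests future crawling only and says nothing about content already in the archive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks CCBot and allows every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules disallow user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Illustrative URL only; substitute a real page from the site being checked.
ccbot_blocked = is_blocked(ROBOTS_TXT, "CCBot", "https://example.com/article")
other_blocked = is_blocked(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")

print("CCBot blocked:", ccbot_blocked)
print("Other bot blocked:", other_blocked)
```

To check a live site instead, the same parser can load a robots.txt file with `set_url()` and `read()`. A passing check only shows that a compliant crawler is asked to stay away from now on.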

What DCN Demands

The letter calls on Common Crawl to stop “scraping, retaining, or sharing copyrighted, paywalled, subscriber-only, or otherwise protected content from DCN member companies in its datasets,” and to remove member content it has already collected.

DCN claims Common Crawl has “flagrantly infringed” copyrighted content by creating its datasets and sharing them with AI companies.

The letter argues “copyright law is not an opt-out regime.” In other words, DCN’s position is that publishers shouldn’t have to ask to be excluded. Common Crawl should need permission to include them.

Kint wrote that the notice:

“challenges a growing presumption that content created through significant investment can be collected, stored, repurposed, and monetized simply because it is technically accessible.”

Why DCN Doubts The Removal Process

The DCN letter questions whether Common Crawl follows opt-out instructions and whether it removes content when asked. Per Press Gazette, DCN’s lawyers are examining whether Common Crawl’s statements to publishers “may have been inaccurate or misleading.”

Common Crawl publishes a public registry of websites that have asked not to be scraped. It includes entries for the Associated Press, the BBC, and a large News/Media Alliance submission covering hundreds of domains. Press Gazette reports the list also includes other major publishers.

This isn’t the first time the removal process has been questioned. The Atlantic reported in November that content from The New York Times and Danish publishers was still available after Common Crawl agreed to remove it.

Common Crawl’s Response

Common Crawl executive director Rich Skrenta declined to comment on the letter when contacted by Press Gazette.

He has pushed back on similar claims before. In a November blog post responding to The Atlantic, Skrenta denied that the organization lied to publishers or scrapes paywalled material.

He said the archive’s file format can’t be edited after publication without breaking its integrity. Instead, Common Crawl says it removes or filters affected URLs from subsequent crawls and makes them inaccessible through its public tools and indices:

“When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset.”

He added:

“No one at Common Crawl has ever claimed this work was instantaneous or complete; rather, we have been open about its complexity and ongoing nature.”

In a forum post this week, Skrenta said Common Crawl is contributing to open standards work on how websites express AI scraping preferences.

Why This Matters

The DCN letter targets the stored archive, not just future crawling, and argues the burden should not fall on publishers to opt out in the first place.

Most publishers in BuzzStream’s sample have already made the blocking decision, with 79% of the 100 news sites it checked blocking at least one training bot. Cloudflare’s Year in Review data, which we covered in January, found CCBot among the bots with the most full disallow directives across top domains. The question DCN raises is what those blocks accomplish if years of content remain available for training anyway.

Looking Ahead

Whether DCN escalates depends on how Common Crawl responds, and Common Crawl hasn’t said how it will. The two sides want different rules for who acts first.

Skrenta is backing standards work that would let sites state their scraping preferences, which keeps opting out as the model. The UK’s CMA took a similar path when it required Google to let publishers opt out of AI search features.

DCN argues scrapers should need permission first. If more trade groups take up that argument, the pressure moves from individual robots.txt files to the archives themselves.

Featured Image: Andre Boukreev/Shutterstock
