Anthropic’s crawler is ignoring websites’ anti-AI scraping policies

Jul 25, 2024 09:46 PM

The ClaudeBot web crawler that Anthropic uses to scrape training data for AI models like Claude has hammered iFixit’s website nearly a million times in a 24-hour period, seemingly violating the repair company’s Terms of Use in the process.

“If any of those requests accessed our terms of service, they would have told you that use of our content is expressly forbidden. But don’t ask me, ask Claude!” said iFixit CEO Kyle Wiens on X, posting images that show Anthropic’s chatbot acknowledging that iFixit’s content was off limits. “You’re not only taking our content without paying, you’re tying up our devops resources. If you want to have a conversation about licensing our content for commercial use, we’re right here.”

iFixit’s Terms of Use policy states that “reproducing, copying or distributing” any content from the website is “strictly prohibited without the express prior written permission” from the company, with specific inclusion of “training a machine learning or AI model.” When Anthropic was questioned on this by 404 Media, however, the AI company linked back to an FAQ page that says its crawler can only be blocked via a robots.txt file extension.

Wiens says iFixit has since added the crawl-delay extension to its robots.txt. We have asked Wiens and Anthropic for comment and will update this story if we hear back.
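Wiens didn’t publish the exact file, but a crawl-delay rule of this kind typically looks something like the hypothetical robots.txt snippet below. Note that crawl-delay is a non-standard directive, and whether a crawler honors it is entirely up to the crawler:

    # Hypothetical entry; iFixit's actual robots.txt may differ.
    # Asks ClaudeBot to wait (here, 30 seconds) between requests.
    User-agent: ClaudeBot
    Crawl-delay: 30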

iFixit doesn’t appear to be alone, with Read the Docs co-founder Eric Holscher and Freelancer.com CEO Matt Barrie saying in Wiens’ thread that their sites had also been aggressively scraped by Anthropic’s crawler. This also doesn’t appear to be new behavior for ClaudeBot, with several months-old Reddit threads reporting a dramatic increase in Anthropic’s web scraping. In April this year, the Linux Mint web forum attributed a site outage to strain caused by ClaudeBot’s scraping activities.

Disallowing crawlers via robots.txt files is also the opt-out method of choice for many other AI companies like OpenAI, but it doesn’t provide website owners with any flexibility to denote what scraping is and isn’t permitted. Another AI company, Perplexity, has been known to ignore robots.txt exclusions entirely. Still, it is one of the few options available for companies to keep their data out of AI training materials, which Reddit has applied in its recent crackdown on web crawlers.
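That lack of flexibility shows in what an opt-out actually looks like: a blanket Disallow rule per user agent, with no way to distinguish AI training from other uses. A sketch, using the crawler names each vendor documents (ClaudeBot for Anthropic, GPTBot for OpenAI):

    # Illustrative only: blocks these crawlers from the entire site.
    User-agent: ClaudeBot
    Disallow: /

    User-agent: GPTBot
    Disallow: /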
