81.8% Of My ‘AI Assistant’ Traffic Was Fake. The Googlebot Number Was Worse

Jun 25, 2026 08:30 PM

I launched CitationIQ.com recently. Over the past two weeks, my logs claimed 33 AI assistants visited, a little better than two a day. That number is a lie. The real number? Six.

Googlebot looked worse. Of 799 requests carrying its name, only 107 were real, though we all know scammers love to spoof Googlebot. And some of those fake AI visits, while wearing ChatGPT’s name, asked my server to hand over its secrets file.

I run this brand-new platform, and I have spent zero dollars promoting it thus far, so traffic remains modest. I went looking for a quiet, accurate read of who (robots and crawlers, since Google Analytics 4 handles the rest) was visiting, expecting small numbers, and I got them. What I did not expect was that most of even these humble numbers were lies. Here is what happened, how I checked, how I chased the stubborn cases to proof, and why the most useful thing you can do this week is run the same check on your own logs.

The Thing Nobody Checks

When a bot fetches your page, it announces a name. ChatGPT-User. Claude-User. Googlebot. CCBot, or whoever they say they are. Your server writes that name into the log, your analytics counts it, and you draw conclusions from it.

The name is self-reported, just a string in the User-Agent request header, and anyone can put anything they like there. Claiming to be Googlebot costs nothing and proves nothing. It is a stranger at your door in a delivery uniform, and the uniform is easy to fake.

The real check is not complicated. The major operators publish the real IP addresses their bots use, as plain files you can open right now, and a request is legitimate only if the name matches and the address sits inside the published list. The name is the claim. The IP is the proof.

  • ChatGPT-User https://openai.com/chatgpt-user.json
  • Claude (all bots) https://claude.com/crawling/bots.json
  • Perplexity-User https://www.perplexity.com/perplexity-user.json
  • Googlebot https://developers.google.com/static/crawling/ipranges/common-crawlers.json
  • CCBot https://index.commoncrawl.org/ccbot.json

I built my check with three outcomes, not two. Verified means the IP is in the published range. Spoofed means the ranges loaded, and the IP is not in them. Unverifiable means I could not determine it, because a list failed to load or a record was missing. I never call anything fake just because I failed to confirm it, and later that restraint is exactly what kept one investigation honest long enough to reach the truth.
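The three outcomes can be sketched as one small function. This is a minimal illustration, not the script described here: the name `classify` and the convention that `None` stands for "the published list failed to load" are assumptions made for the example, and the sample range is a documentation-reserved block, not a real vendor range.

```python
import ipaddress

def classify(ip, networks):
    """Return 'verified', 'spoofed', or 'unverifiable' for one request.

    networks is a list of ip_network objects for the bot the request
    claims to be, or None if the published list could not be loaded
    (an assumed convention for this sketch).
    """
    if networks is None:
        return "unverifiable"      # no list to check against
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return "unverifiable"      # malformed address: cannot decide
    if any(addr in net for net in networks):
        return "verified"
    return "spoofed"               # list loaded, address not in it

# Documentation-reserved example range, not a real vendor range.
example_nets = [ipaddress.ip_network("192.0.2.0/24")]
```

The point of the third branch is that a missing list never turns into a "spoofed" verdict by accident.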

The check is about 15 lines of Python using only the standard library, because deciding whether an address sits inside a network range is a solved problem.

    import ipaddress, json, urllib.request

    # A vendor's published list of the IPs its bot really uses.
    url = "https://openai.com/chatgpt-user.json"
    data = json.loads(urllib.request.urlopen(url).read())

    # Pull every address range out of the file.
    nets = []

    def collect(node):
        if isinstance(node, dict):
            for v in node.values():
                collect(v)
        elif isinstance(node, list):
            for v in node:
                collect(v)
        elif isinstance(node, str):
            try:
                nets.append(ipaddress.ip_network(node, strict=False))
            except ValueError:
                pass

    collect(data)

    # A request claiming to be ChatGPT-User is only real if its
    # source IP sits inside one of those ranges.
    def is_real(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in nets)

That snippet is the heart of the check, not the whole thing. It is read-only and standard-library, but it is not a finished verifier. As written, it loads one vendor’s list, so on its own, it would wrongly flag every real Claude, Perplexity, and Google request as fake. A working version wraps this core in four things the example leaves out: it reads your real log lines instead of one hardcoded address, maps each bot name to its own published list, adds the unverifiable state for cases a list cannot settle, and falls back to reverse DNS for an operator like Common Crawl that leans on it.
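Two of those missing pieces, reading log lines and mapping each claimed name to its own list, might look something like the sketch below. It assumes the common "combined" access-log format, which your server may not use; the regex, the `claimed_bot` helper, and the sample line (with a documentation-reserved IP) are all invented for illustration. The URLs are the ones listed earlier in this article.

```python
import re

# Published IP lists named in this article, keyed by user-agent token.
BOT_LISTS = {
    "ChatGPT-User": "https://openai.com/chatgpt-user.json",
    "Claude-User": "https://claude.com/crawling/bots.json",
    "Perplexity-User": "https://www.perplexity.com/perplexity-user.json",
    "Googlebot": "https://developers.google.com/static/crawling/ipranges/common-crawlers.json",
    "CCBot": "https://index.commoncrawl.org/ccbot.json",
}

# Assumes the "combined" log format; adjust the pattern for your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def claimed_bot(line):
    """Return (ip, bot_name) if the line claims a known bot, else None."""
    m = LOG_LINE.match(line)
    if not m:
        return None
    ua = m.group("ua").lower()
    for name in BOT_LISTS:
        if name.lower() in ua:
            return m.group("ip"), name
    return None

# Made-up sample line for illustration only.
sample = ('203.0.113.5 - - [25/Jun/2026:20:30:00 +0000] "GET / HTTP/1.1" '
          '200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')
```

The returned name tells you which published list to load; the returned IP is what gets tested against it.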

The Demand Gap

Start with the demand signal, the requests that come not from a scheduled crawl but from an assistant fetching my page live during a real user’s session. That is what these agent names mark: a fetch triggered in real time by someone using the assistant, not the routine background crawling everything else here is doing. What the log cannot tell me is what that person was after, whether they asked about me by name or something broader where my page got pulled in to ground an answer, so I will not claim either. What I can say is that 33 requests carried one of those live-fetch names. Six came from an IP the vendor publishes. Twenty-seven did not. That is an 81.8% spoof rate among the requests I could check.

The fakes gave themselves away by where they went. A real assistant fetch lands on a real page. The spoofed ones, still wearing the assistant’s name, went hunting for .env.production, secrets.yaml, and config.json. Nobody asked an assistant to read my environment variables. Those were credential scanners borrowing a trusted name to slip past filters, and the IP check caught every one.

Hold these numbers loosely. Six verified is only six, on one small new site over 14 days, and you cannot build a theory on a sample that thin. Treat it as my baseline, not a finding about the world. Your numbers will matter far more than mine.
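The requested path is a useful second signal alongside the IP check. A rough sketch, assuming you only want to flag the three file names mentioned above; the helper name `looks_like_secret_probe` is hypothetical, and on some sites a file such as config.json is a legitimate public asset, so treat a hit as a hint rather than a verdict.

```python
# File names the spoofed "assistant" requests in this article probed for.
SECRET_NAMES = (".env.production", "secrets.yaml", "config.json")

def looks_like_secret_probe(path):
    """True if the final segment of the request path is a probed file name."""
    # Drop any query string, then isolate the last path segment.
    last_segment = path.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]
    return last_segment.lower() in SECRET_NAMES
```

A request that claims an assistant name, fails the IP check, and also trips this test is very unlikely to be a real user-driven fetch.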

The Bigger Number, Which Is Not News

Of 799 requests carrying the Googlebot name, only 107 came from a verified Google address. The other 692, about 87%, were not Google.

This is not a discovery. Googlebot has been the most impersonated name on the web for the better part of two decades, which is exactly why Google publishes its ranges and tells you to verify by IP rather than trust the string. What the data does is confirm the pattern and show its scale on a brand-new site with no traffic to speak of. The most trusted crawler name draws the most impersonation, and it draws it immediately. Some fakes even used Googlebot strings tied to products Google retired years ago, a scanner copying an old user-agent off a list and never looking back.

So the reminder holds, old as it is. The Googlebot line in your logs is not a Google number. It is a “claims to be Google” number, and the gap can be enormous.
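Both headline percentages come from the same arithmetic: requests that failed the IP check divided by requests that carried the name. A small sketch using the counts reported in this article (the function name `spoof_rate` is just a label for the example):

```python
def spoof_rate(total_claimed, verified):
    """Percentage of name-claiming requests that failed the IP check."""
    if total_claimed == 0:
        raise ValueError("no requests to rate")
    return 100 * (total_claimed - verified) / total_claimed

demand = spoof_rate(33, 6)         # 27 of 33 live-fetch requests
googlebot = spoof_rate(799, 107)   # 692 of 799 Googlebot requests
print(f"{demand:.1f}% {googlebot:.1f}%")   # prints "81.8% 86.6%"
```

Swap in your own two counts per bot name to get the figures for your logs.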

Two Different Games

First, a clarification, because the numbers are about to get bigger. Everything so far counted demand: live fetches an assistant makes during a real conversation, the agents whose names end in -User. What follows is a separate population, the scheduled crawlers that index and train in the background, and they are different bots. ChatGPT-User is not GPTBot, and Claude-User is not ClaudeBot. So these counts run larger than the six, and they do not overlap with them. Strip the fakes away, and the verified crawl tells a more interesting story than the demand fetches did, because the crawlers themselves play two different games people lump together.

Some do retrieval. They build the index that gets pulled into an answer today. When a person asks an assistant a question, and it reaches for real sources, this is the machinery behind that. Retrieval is about whether you show up this week.

Others do training. They harvest content that may be folded into the weights of the next model. When a training crawler takes your page, that is not a visit you measure in referral traffic. It is a deposit into a corpus used to build models that will answer questions for years, often without ever fetching you again. The payoff is delayed, compounding, and invisible to every dashboard you own.

Here is my verified crawl data (two weeks, one new site, a snapshot, and nothing more). The most active verified crawler on my domain was not Google. It was Anthropic’s ClaudeBot at 166 confirmed crawls, ahead of verified Googlebot at 107, with OpenAI’s GPTBot at 46 and its search crawler at 40 behind. Is that a trend? No, it is 14 days on a site nobody has heard of. But the composition is worth seeing, because who spends crawl budget on a brand-new, unpromoted domain is the kind of signal that turns strategic once the volume is real.

Retrieval is your visibility today. Training is whether the model knows you tomorrow, without having to look you up at all. Most measurement fixates on the first. The second is quieter, arguably matters more, and almost nobody is watching it.

The One I Had To Chase: CCBot

Which brings me to what might be the most consequential training crawler of all, and the best example of why that unverifiable category exists. Common Crawl, fetched by CCBot, produces the open dataset that sits underneath a large share of the models trained in recent years. So when my report showed CCBot at zero verified, four spoofed, and sixteen unverifiable, the 16 bothered me. Unverified cuts both ways. It does not mean fake, and it does not mean real. It means go find out. So I did, and the path is one you can copy.

First, the published list. Common Crawl publishes its crawler IP ranges, and not one of the 20 CCBot-labeled requests fell inside them.

Second, reverse DNS. Real CCBot resolves to a commoncrawl.org hostname. Four of mine resolved to something that was not Common Crawl, and the other sixteen had no reverse record at all, which is exactly why the script would not vouch for them.
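The reverse DNS step can be sketched with the standard library's `socket.gethostbyaddr`. The function name, the three return labels, and the injectable `lookup` parameter (there so the logic can be exercised without touching the network) are assumptions of this example, not a description of the script used here.

```python
import socket

def reverse_dns_verdict(ip, expected_suffix="commoncrawl.org",
                        lookup=socket.gethostbyaddr):
    """Classify an IP by its reverse (PTR) record."""
    try:
        hostname = lookup(ip)[0]
    except (socket.herror, socket.gaierror):
        # No reverse record: an absent record is not evidence of fraud.
        return "unverifiable"
    host = hostname.rstrip(".").lower()
    # Require an exact match or a true subdomain, so a host such as
    # "notcommoncrawl.org" does not pass.
    if host == expected_suffix or host.endswith("." + expected_suffix):
        return "verified"
    return "spoofed"
```

A stricter version would also resolve the returned hostname forward and confirm it maps back to the same IP, since whoever controls an address block also controls its reverse records.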

Third, the corpus itself. Common Crawl runs a public index where you can ask whether a domain has been captured. I checked the three most recent monthly crawls for my domain, with wildcards, so I was not just matching the homepage. Nothing.
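A wildcard query against that index can be built as a plain URL. The sketch below assumes the index server's `<crawl-id>-index` endpoint with `url` and `output` parameters; confirm the endpoint form and the current crawl IDs on index.commoncrawl.org before relying on it. The crawl ID shown is a placeholder, not a real crawl, and `cc_index_query` is a name invented for this example.

```python
from urllib.parse import urlencode

def cc_index_query(crawl_id, domain):
    """Build an index query URL covering every path under a domain."""
    # The trailing /* is the wildcard: match all pages, not just the homepage.
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

# "CC-MAIN-YYYY-WW" is a placeholder; substitute a real crawl ID.
query_url = cc_index_query("CC-MAIN-YYYY-WW", "example.com")
```

Running the same query against each of the most recent crawls, and getting no captures back from any of them, is the "Nothing" result described above.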

Fourth, ownership. I pulled the raw IPs out of my logs and ran a WHOIS lookup on each. Every one traced to commodity hosting across several countries (most in Europe), the cheap rented infrastructure scanners run on.

Four independent angles, one answer. All 20 were impostors. The teaching point is the part an SEO will appreciate. The automated check correctly refused to call those 16 fake, since an absent record is not evidence of fraud, and it took manual digging to close the loop. So when your own report shows unverifiable rows, that is not a dead end. It is an invitation: pull the IPs, check the owner, check the corpus, and the picture resolves.

The One I Could Not Measure: Gemini

There is one major player I could not measure at all, and the reason is the point. Gemini.

OpenAI, Anthropic, and Perplexity each expose distinct, verifiable signals. You can separate their training crawler from their retrieval crawler from their live, user-driven fetch, and confirm each by IP. Google does not work this way. There is one Googlebot crawl. Whether the content it gathers feeds Gemini training is governed by a robots.txt token called Google-Extended, which is not a crawler. It never fetches anything. It is a permission flag on a crawl that already happened. There is no Gemini fetcher in your logs by design, and so no way to measure Gemini demand by name, the way you can for ChatGPT or Claude.

My script looked for it. It found nothing claiming to be Gemini, which tells you even the impersonators have not bothered with that name. It did catch four requests announcing themselves as Google-Extended while fetching pages, and since Google-Extended cannot fetch, those four are fake on their face, disproved by the name alone before any IP check runs.
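To see what "a robots.txt token" means in practice, the sketch below parses a two-line robots.txt with Python's standard `urllib.robotparser`. One caveat on reading it: the parser only models fetch permission, so `can_fetch` here stands in for "does a rule address this token", not for an actual Google-Extended request, which never occurs. The robots.txt content and the example URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that addresses the Google-Extended token and nothing else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/any-page"
extended_allowed = parser.can_fetch("Google-Extended", page)  # rule applies
googlebot_allowed = parser.can_fetch("Googlebot", page)       # no rule applies
```

The Googlebot crawl is unaffected by this rule, which is why nothing in the access log changes when the token is set.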
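That name-only check is the cheapest test in the whole pipeline, and it can run before any list is loaded. A minimal sketch; the tuple below holds only the one token this article names, and the helper name `fake_by_name` is hypothetical.

```python
# robots.txt permission tokens that never make requests themselves.
# Google-Extended is the only one named in this article.
NON_FETCHING_TOKENS = ("google-extended",)

def fake_by_name(user_agent):
    """True if the user-agent claims an identity that cannot fetch pages."""
    ua = user_agent.lower()
    return any(token in ua for token in NON_FETCHING_TOKENS)
```

Anything this returns True for can be counted as spoofed without an IP lookup.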

If you have done this work as long as I have, this is familiar. In 2011, Google encrypted search referrers, and the keyword data we depended on collapsed into “(not provided).” The granularity went away, and we were handed a flag in place of a measurement. The AI era is repeating the pattern. Where its competitors expose training, retrieval, and demand as separate, verifiable events, Google bundles them into a single crawl and an invisible token. You can confirm Googlebot, and nothing past it, and the rest is, once again, not provided.

Two Honest Asterisks

Perplexity is murkier than a clean pass or fail. Its crawler failed my IP check on 24 of 36 requests, but Perplexity has been documented fetching from addresses outside its own published ranges, so some failures may be impersonators, and some may be Perplexity operating off-list. For that one, spoofed is ambiguous in both directions. And again, all of this is two weeks of data on one small site.

Go Make Your Own Baseline

Do not take my numbers; take the method.

My data is thin because my site is new, and yours probably is not. If you have any real traffic, you are sitting on a far better dataset than mine, in your own access logs, right now, and you can run this check this afternoon. Pull a date range, match the names, verify the IPs against the published lists, and find your real fraction. Then look at your Googlebot line and brace yourself.

When you hit unverifiable rows, do what I did with CCBot. Pull the IPs, check the owner, query the corpus, and chase it until the picture resolves. There is nothing an SEO enjoys more than running down proof, and this is a target-rich place to do it.

What You Are Measuring, And What You Are Not

Think about what even a verified number does, and does not, tell you. A confirmed crawl tells you a real bot took your content. It does not tell you what happened next: whether your page ended up in the answer a person saw, whether you were cited, paraphrased without credit, or left out entirely, or whether the model that trained on you will ever surface your name or quietly absorb you and move on. The fetch is the visit. The outcome is a separate question.

That gap, between being fetched and being used, is the question I spend my days on, and it is the reason I built CitationIQ.

If you run this on your own logs, reply and tell me two numbers: your demand spoof rate, and your Googlebot one.


This post was originally published on Duane Forrester Decodes.

