Schema, LLMs & The Low Bar For ‘Evidence’ In GEO

Jun 16, 2026 06:00 PM

TL;DR: I ran a small experiment to try to get some insight into whether large language models actually parse schema markup or are just nodding politely in its direction. I put a fake company address (inside beautifully invalid JSON-LD, on a page about ducks) into the head of an HTML document, mentioned no address anywhere in the visible text, and then asked various LLMs where the company was based. They happily told me, several of them citing the “structured data” they had so studiously consulted.

The experiment was then picked up by MCP, at which point British sarcasm met the LinkedIn carousel, the two annihilated each other in a small puff of smoke, and a chunk of the GEO community came away convinced I had just proved that LLMs are lovingly parsing schema exactly as Schema.org intended.

[Image: a cartoon duck points down at the code with a shocked expression.]

I had arguably proved the opposite. The schema was deliberately broken. The LLMs returned the data anyway, because as far as they were concerned, the JSON-LD was simply more text on the page, lightly garnished with curly braces. That distinction is the whole point, because a growing cohort of “GEO experts” is pointing at “the LLM returned information that was only in the schema” as cast-iron proof that LLMs are using schema as designed. They are doing nothing of the sort. They are reading the HTML and shrugging at the structure.

I am not claiming schema is worthless. I think you should still use it. But the way it is currently being sold to clients (as a magical injection of LLM citations) is propped up on a remarkably thin pile of evidence, and I want to walk through why.

A Quick Refresher On What Schema Is Actually For

Schema, or Schema.org structured data, is a collaborative vocabulary built by Google, Microsoft, Yahoo, and Yandex to let webmasters embed machine-readable data on their pages. The clue is in the name. It is a schema. A shared, agreed structure that lets a machine know that “Mark Williams-Cook” is a Person, that he works at an Organization called “Candour,” and that the string “01603 957068” sitting in his profile is a telephone and not, for instance, my weight in grams.

Google’s official documentation puts it about as plainly as Google ever puts anything:

“Structured data is a standardized format for providing information about a page and classifying the page content.” Google also says it uses structured data “to understand the content of the page, as well as to gather information about the web and the world in general, such as information about the people, books, or companies that are included in the markup.”

The whole point of schema is to remove ambiguity. Natural language is messy. “Apple” is a fruit, a company, a record label, and probably the surname of someone’s gerbil. If you tell a search engine in plain English that you sell Apple, it has to guess. If you tell it in schema that you sell an Organization called “Apple Inc.” with sameAs linking to Apple’s Wikipedia page, that ambiguity collapses to nothing. That is the job. Disambiguation. Explicit clues. Machine-resolvable identity. It is, basically, a polite agreement between you and a machine saying, “Let’s both agree what this word means, just this once.”

Where does the ambiguity actually get resolved? In Google’s case, into the Knowledge Graph, the giant entity-and-relationships database that powers knowledge panels, “people also ask,” entity carousels, and a hundred other surfaces. Schema is one of the inputs. It is not the only input, and it has never been the only input. But it is a clean, explicit, low-noise one, which is why search engines like it.

Right. That is what schema does for search engines. Now to LLMs, which are a different animal in almost every way that matters.

Where, Exactly, Would An LLM Even Use Schema?

There are two camps in the LLM/schema debate, and most arguments collapse into one of them.

Camp 1: Schema is hoovered up during the training of the model and ends up “baked in” somehow.

Camp 2: Schema is read at the moment the LLM live-fetches a page (during retrieval at query time, or via crawls that feed retrieval).

Let’s take them in turn, with due skepticism.

Camp 1: Schema Gets Into Training Data

I have written about this before, and it was covered by MCP last year. The short version is that this is the most popular theory and also the one with the weakest mechanical case behind it. There are two problems, and neither of them is small.

Problem 1: Schema Is Almost Certainly Stripped Before Training

If you have not gone down the rabbit hole of how base LLMs are actually made, Andrej Karpathy’s three-and-a-half-hour deep dive on LLM pre-training is the canonical reference, and yes, three and a half hours is the deal.

Pre-training pipelines do a lot of unglamorous cleaning work before a single GPU sees the data: URL filtering, language filtering, deduplication, removal of personally identifiable information, and crucially, stripping out HTML and boilerplate. The goal is not to preserve the page. The goal is to extract clean prose that helps the model build a useful probability distribution over language. The more noise (markup, navigation, footers, scripts, JSON-LD, your cookie consent banner) you leave in, the worse the resulting model. So they don’t.

The widely used FineWeb dataset (15 trillion tokens, derived from 96 Common Crawl snapshots) is refreshingly explicit. Their pipeline extracts text from the WARC files using trafilatura, a library specifically chosen because it produces “the main page text” with “less boilerplate and menu text” than the alternatives. The data card states: “We then extracted the main page text from the HTML of each webpage, filtered each sample and deduplicated each individual CommonCrawl dump/crawl.” JSON-LD lives in a `<script>` tag. Trafilatura is, by design, profoundly uninterested in `<script>` tags. The unavoidable conclusion is that, in a pipeline like this, JSON-LD does not make it into the training corpus at all. It is binned with the analytics snippets, where it has been keeping good company.
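As a rough illustration of why markup inside a `<script>` tag does not survive this kind of cleaning, here is a minimal Python sketch. It is not trafilatura and it is not the FineWeb pipeline: it is a toy extractor built on the standard library's html.parser, and the class name, function name, and sample page are all hypothetical. The only assumption it makes is that a main-text extractor keeps human-readable text and discards script contents.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Toy stand-in for a main-text extractor (NOT trafilatura itself).

    Keeps text nodes and drops anything inside <script> or <style>.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the address exists only inside the JSON-LD block.
SAMPLE_PAGE = """<html><head><title>Duck shirts</title>
<script type="application/ld+json">
{"@type": "Organization", "streetAddress": "77 The Muddy Bank"}
</script>
</head><body><h1>Duck shirts</h1>
<p>We sell T-shirts with ducks on them.</p></body></html>"""

cleaned = extract_visible_text(SAMPLE_PAGE)
print(cleaned)
```

On this sample, `cleaned` keeps the title, heading, and paragraph text, while the address that lived only in the JSON-LD block never reaches it.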

You might reasonably ask: then how can ChatGPT write schema markup for me when I ask it? Because there are millions of examples of schema in visible prose across the web. Tutorials. Documentation. Forum posts. GitHub repos and Stack Overflow answers. Code blocks in blog posts. The model learns what schema looks like the same way it learns what a Python function looks like, by reading endless explanations of it, written by humans, in paragraphs. The schema on your actual product page, sitting silently in the head of the document, doing its proper job, gets thrown straight out.

Problem 2: Even If It Survived, It Would Not Work The Way You Think

Let’s be generous and stipulate that some non-trivial amount of raw schema does sneak into a model’s training data. We do not actually have full transparency from frontier labs about what they ingest, and the courts have not exactly been kind on this point. Meta’s training pipeline is currently being picked apart for allegedly using LibGen, a pirate library of around 7.5 million copyrighted books. If the frontier labs are happy to swallow other people’s novels whole, they are probably not above swallowing the odd <script type="application/ld+json"> on the way.

Even if this were the case and our precious JSON-LD schema made it into the training data, it would not be unscathed.

Here’s the catch: The model does not memorize pages. It does not have a little filing cabinet labeled “Candour Agency Ltd” with the address tucked inside. What actually happens is this:

All the text in the training corpus gets chopped into tokens (chunks of characters, often parts of words).
The model is shown billions of small windows of tokens and asked to predict the next one.
Each time it gets it wrong, billions of tiny numerical weights inside the network are nudged so it would do slightly better next time.
After enough nudging, those weights collectively encode a (lossy, blurry, statistical) representation of which tokens tend to follow which other tokens, in what contexts.

That is what is stored. Weights. Not facts. Not addresses. Not your postalCode. A glorified probability distribution that has read a great deal and remembers, with the same fidelity as someone trying to recall the lyrics to a song they last heard in 2011, which words usually follow which other words.

[Image: a code block containing a script tag of type application/ld+json with an Organization schema for "NovaTech Solutions"; individual text chunks are highlighted in alternating background colors to represent tokenization.]

This is where schema specifically falls apart. The whole point of schema was to take a string like “77 The Muddy Bank” and tag it explicitly as a streetAddress belonging to a PostalAddress belonging to your Organization, so a machine cannot mistake it for something else. When that JSON-LD is tokenized, the structure dissolves. The string “@type”: “Organization” becomes a sequence of tokens including @, type, :, Organization, wholly indistinguishable, to the model, from the same word salad appearing in any blog post about schema. The disambiguation, which was the whole reason for using schema in the first place, is the very first thing thrown out by the very first stage of training. Marvellous.
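To make the “structure dissolves” point concrete, here is a small Python sketch. It uses a toy regex tokenizer, which is an assumption for illustration only: production models use subword tokenizers whose exact splits differ. The function names and the tutorial sentence are hypothetical. What it shows is that a flat token sequence carries no marker saying “this came from real markup” as opposed to “this came from someone writing about markup”.

```python
import re


def toy_tokenize(text):
    """Toy tokenizer: runs of word characters, or single punctuation marks.

    Real LLM tokenizers are subword-based; this only illustrates flattening.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def contains_run(haystack, needle):
    """True if `needle` appears as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


# The same characters, once as markup and once quoted inside ordinary prose.
markup = '"@type": "Organization"'
tutorial = 'In a tutorial you might read: set "@type": "Organization" on the root node.'

markup_tokens = toy_tokenize(markup)
tutorial_tokens = toy_tokenize(tutorial)

print(markup_tokens)
print(contains_run(tutorial_tokens, markup_tokens))
```

The markup's tokens appear, unchanged and in order, inside the tutorial sentence's tokens. Nothing in the sequence records that one of them was a typed, nested declaration and the other was a sentence.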

Worse still, an LLM only “recalls” a fact if it has seen it many, many times. A single mention of your address on a single product page is a vanishingly small drop in a fifteen-trillion-token bucket. Even if it survived ingestion, you would also need the model to encounter your streetAddress enough times that those particular weights actually settle into a useful pattern. For >99.99% of businesses, that does not happen. The fact is not stored. It will not be recalled. You are paying a consultant to whisper your postcode into a hurricane.

So, if you are buying the “schema gets baked into the model” theory, you are buying three improbabilities in a trench coat: that it survives pre-training cleaning, that it survives tokenization with its structure intact, and that it gets repeated often enough across the web for the model to actually “learn” it. None of the three is obviously true.

Camp 2: Schema Gets Read At Query Time

In my experience, it is rare for LLM/schema proponents to want to keep talking about training data once it has been gently set on fire. The argument tends to move quickly on to the possibility that schema is not in the model itself, but is read at the moment a user asks a question, when the LLM fetches the page in real time. Let’s examine the three flavors of this argument in increasing order of confidence and distressing level of inaccuracy.

Flavor 1: “Schema Feeds The Knowledge Graph”

Google’s Knowledge Graph is a vast, curated, slow-moving database of entities and relationships. It is fed by structured data, Wikipedia, Wikidata, Freebase legacy data, and a hundred other signals. It is built and updated by Google’s pipelines on Google’s schedule. It is not assembled on the fly when someone types a question, no matter how briskly they type.

The notion that an LLM “builds a knowledge graph in real time when pages are fetched” sounds a lot less reasonable when you say it out loud into the mirror. Knowledge graphs are constructed artifacts. They have IDs. They have relationship cardinality rules. They have to be reconciled against existing entries, so you do not end up with three floating “Apple Inc.” nodes filing different tax returns. None of that happens between a user pressing enter and the answer appearing on screen. It cannot. There is not enough time, and there is no infrastructure exposed in the chatbot product to do it.

So if an entity-resolution pipeline exists at any of the frontier labs, it is being built upstream, on a similar cadence to Google’s, and not during your conversation. Which is fine, but it does not match the breathless claim that “your schema feeds the LLM’s brain”. Conceptually, the strongest version is closer to “your schema may eventually feed a curated database that the LLM might one day consult”. Which is a much weaker claim, and one for which there is, at present, no public evidence whatsoever.

Flavor 2: “Microsoft Confirmed Schema Feeds Copilot”

MCP’s write-up, since misquoted on an industrial scale, ran under the headline “Microsoft Bing/Copilot use schema for its LLMs,” and reported that Fabrice Canel of Microsoft had “confirmed” that schema markup helps Microsoft’s LLMs. Cue half of LinkedIn pasting the headline as proof, often without troubling the body copy.

If you read the actual quote, it is about IndexNow:

“Gen AIs value new content in particular, partly as a reference check of their LLM training data. Use the API at indexnow.org to push that information as it’s published or updated.”
~ Fabrice Canel

IndexNow is “your page changed, here is its new state, please come look”. Fabrice was making a point about freshness (telling search engines when your content has changed so they can update their understanding) and not a point about JSON-LD being deferentially parsed by GPT-flavored systems. Conflating the two is a textbook example of the industry’s favorite parlor trick: Take a careful claim about one thing, sand the edges off it, and resell it as a bold claim about something else entirely.

Flavor 3: “LLMs Return Information That Was Only In The Schema, Therefore They Use Schema”

This is the one that prompted the experiment. It is also the single most-cited piece of “evidence” in GEO LinkedIn posts, and the most easily falsified once you spend half an afternoon thinking about it.

I built a deliberately silly test page about a fictional duck T-shirt company called DUCK YEA at i83.uk/duckyea.html. The visible content of the page mentions no address. Tucked into the head of the HTML, inside a <script type="application/ld+json"> tag, sat the following:

{
  "@context": "http://api.the-great-pond.net/schema",
  "@type": "MallardEnterprise",
  "flockName": "DUCK YEA T-SHIRTS",
  "waddleStyle": "Aggressive",
  "nestingGrounds": {
    "@type": "LilyPadAddress",
    "reedNumber": "77",
    "puddle": "The Muddy Bank",
    "region": "South Pondshire",
    "featherCode": "DK99 YEA",
    "country": "United Queendom"
  },
  "migrationPattern": "Non-Migratory",
  "quackVolume": "Loud"
}

A few things to notice. The @context is a made-up URL that does not resolve to anything (the great pond, sadly, has no API). The @type is not a valid Schema.org type. Not a single one of the properties (flockName, waddleStyle, nestingGrounds, reedNumber, puddle, featherCode, quackVolume) exists in the Schema.org vocabulary. The JSON is syntactically valid JSON, but as far as Schema.org is concerned, this is unmitigated nonsense, the digital equivalent of someone speaking French very loudly while only knowing the words for “cheese” and “weasel”. A well-behaved schema-aware parser should look at this, sigh, and ignore it.
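For a sense of what “look at this, sigh, and ignore it” could mean in practice, here is a minimal Python sketch of a vocabulary check. It is not how any search engine or LLM vendor validates markup: the function name is hypothetical, and the sets of known types and properties are a tiny hand-picked subset of Schema.org, hard-coded for illustration. A real validator would load the full vocabulary.

```python
import json

# Assumption: a tiny hand-picked subset of Schema.org, for illustration only.
SCHEMA_ORG_CONTEXTS = {"http://schema.org", "https://schema.org",
                       "http://schema.org/", "https://schema.org/"}
KNOWN_TYPES = {"Organization", "PostalAddress", "Person"}
KNOWN_PROPERTIES = {"name", "address", "streetAddress", "addressLocality",
                    "addressRegion", "postalCode", "addressCountry",
                    "telephone", "sameAs"}


def schema_problems(node, is_root=True):
    """Return a list of reasons this JSON-LD node is not recognizable Schema.org."""
    problems = []
    if is_root and node.get("@context") not in SCHEMA_ORG_CONTEXTS:
        problems.append(f"unrecognized @context: {node.get('@context')!r}")
    if node.get("@type") not in KNOWN_TYPES:
        problems.append(f"unknown @type: {node.get('@type')!r}")
    for key, value in node.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords are checked above
        if key not in KNOWN_PROPERTIES:
            problems.append(f"unknown property: {key}")
        if isinstance(value, dict):
            problems.extend(schema_problems(value, is_root=False))
    return problems


DUCK_JSON_LD = """{
  "@context": "http://api.the-great-pond.net/schema",
  "@type": "MallardEnterprise",
  "flockName": "DUCK YEA T-SHIRTS",
  "waddleStyle": "Aggressive",
  "nestingGrounds": {
    "@type": "LilyPadAddress",
    "reedNumber": "77",
    "puddle": "The Muddy Bank",
    "region": "South Pondshire",
    "featherCode": "DK99 YEA",
    "country": "United Queendom"
  },
  "migrationPattern": "Non-Migratory",
  "quackVolume": "Loud"
}"""

# Hypothetical well-formed block, used only as a control.
VALID_JSON_LD = """{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Ltd",
  "address": {"@type": "PostalAddress", "streetAddress": "1 Example Street"}
}"""

duck_problems = schema_problems(json.loads(DUCK_JSON_LD))
valid_problems = schema_problems(json.loads(VALID_JSON_LD))
for problem in duck_problems:
    print(problem)
```

The duck block fails on its context, on both of its types, and on every property, while the control block passes cleanly. A consumer that applied even this crude a check would have had grounds to discard or flag the duck markup.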

I then asked ChatGPT and Perplexity, “what is the address of this company?”, pointing at the URL.

Both happily returned: Reed Number 77, The Muddy Bank, South Pondshire, DK99 YEA, United Queendom.

Perplexity even helpfully volunteered that it had found the answer “in the page’s embedded structured data,” with the satisfied air of a student who had clearly read the set text. Neither of them flinched at the fact that none of the schema was real, because (and this is the whole point of the exercise) they were not parsing it as schema. They were doing what LLMs always do: Reading the visible-ish text of the page, picking out the bit that looked like an address, and presenting it. The JSON-LD wrapper was, to the model, just slightly weirdly punctuated prose. If I had wrapped the address in <marquee> tags and surrounded it with duck emoji, it would have made exactly no difference.

If LLMs were genuinely parsing JSON-LD with any reverence for the Schema.org vocabulary, my made-up types and properties would have been rejected, or at the very least flagged. They were not. The information was just lifted straight out of the HTML, dusted off, and served up with confidence. Quack. 🦆

In the interest of not committing the exact sin I am accusing the GEO crowd of: the duck experiment proves that LLMs returned content from a JSON-LD block with a made-up @context, a made-up @type, and no real Schema.org properties. What it does not, on its own, prove is that LLMs ignore schema entirely. A system that consulted schema and fell back to text extraction would produce the same answer here.
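That caveat can be sketched in a few lines of Python. Both functions below are hypothetical stand-ins, not descriptions of any real product: one treats the block purely as text, the other tries a Schema.org-aware lookup first and falls back to the text approach when the markup is not recognizable. The “text” step is a deliberately crude proxy (it just reads the string values of the first nested object) for a model picking out the bit that looks like an address.

```python
DUCK_DATA = {
    "@context": "http://api.the-great-pond.net/schema",
    "@type": "MallardEnterprise",
    "flockName": "DUCK YEA T-SHIRTS",
    "waddleStyle": "Aggressive",
    "nestingGrounds": {
        "@type": "LilyPadAddress",
        "reedNumber": "77",
        "puddle": "The Muddy Bank",
        "region": "South Pondshire",
        "featherCode": "DK99 YEA",
        "country": "United Queendom",
    },
    "migrationPattern": "Non-Migratory",
    "quackVolume": "Loud",
}


def text_only_answer(data):
    """Hypothetical system A: ignores vocabulary, reads the nested block as text."""
    for value in data.values():
        if isinstance(value, dict):
            return ", ".join(v for k, v in value.items()
                             if not k.startswith("@") and isinstance(v, str))
    return ""


def schema_first_answer(data):
    """Hypothetical system B: uses real Schema.org markup if present, else falls back."""
    context = str(data.get("@context", "")).rstrip("/")
    address = data.get("address")
    if (context in ("http://schema.org", "https://schema.org")
            and isinstance(address, dict)
            and address.get("@type") == "PostalAddress"):
        fields = ("streetAddress", "addressLocality", "addressRegion",
                  "postalCode", "addressCountry")
        return ", ".join(address[f] for f in fields if f in address)
    return text_only_answer(data)  # markup not recognized: fall back to text


answer_a = text_only_answer(DUCK_DATA)
answer_b = schema_first_answer(DUCK_DATA)
print(answer_a == answer_b)
```

On the duck page the two systems return the same string, so this single test cannot tell them apart. It rules out “schema was required and validated”, not “schema is never consulted”.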

If you run the same query today, you get a slightly different result:

[Image: a chat screenshot. The prompt reads: "can you tell me the address of this company? what is the address of this company? https://markwilliamscook.com/duckyea.html" The AI's response reads: "The website you linked is a joke/test page created by SEO expert Mark Williams-Cook as an experiment to test how Large Language Models (LLMs) and search engines parse structured data. While there is no physical address visible on the webpage itself, hidden within the page's source code (schema markup) is a fictional address: Reed Number 77, The Muddy Bank, South Pondshire, DK99 YEA, United Queendom"]

The model now (correctly) flags that this is a test page made by some SEO bloke, charmingly demonstrating the AI Convergence Problem doing its thing in real time: Enough people have written about the experiment that “DUCK YEA is a joke page by Mark Williams-Cook” is now getting pulled in during RAG, and the consensus answer has overwritten what would otherwise be a clean test. The address is still being read from the HTML, schema validity be damned. The model has just learned to caveat it. Which is, in a small and slightly bleak way, progress.

Conjecture: Could LLMs Be Using Schema, Somehow, Somewhere?

The honest answer is that we do not know what is happening upstream at OpenAI, Anthropic, Google DeepMind, xAI, and the rest, because they are not telling. Google itself is a sprawl of separate systems (the index, re-rankers, Glue, the Knowledge Graph, AI Overviews, AI Mode) which all work together to produce what looks, from the outside, like a single coherent answer, and on a good day, actually is one. There is no reason in principle why an LLM provider could not run an entity-extraction pipeline against the web, build its own entity store, and consult it at answer-generation time. That is conceptually close to how retrieval-augmented generation (RAG) works, and it is the kind of thing you would absolutely build if you were OpenAI and you wanted to stop your model confidently inventing the wrong CEO.

If they are doing that, schema is an excellent and obvious input. It is explicit, structured, low-noise, and already widely deployed. It would be daft for them not to use it.

But here is the big “but.” We have no published evidence, no leaked papers, no public confirmation, and no behavioral test results showing that any frontier LLM is actually doing this yet. Reasoning forward from “they probably should” to “therefore schema is worth £20k of consultancy this quarter” is exactly the kind of fact-light, vibe-heavy reasoning that the discourse needs less of. Make the case, by all means. But label it conjecture, not evidence. Use a different font.

Google Still Hasn’t Solved This Problem Reliably

There is also a slightly awkward elephant standing quietly in the corner of the room. If anyone on earth were going to crack the “feed an entity-resolved knowledge graph into an LLM’s answer pipeline” problem first, it would surely be Google. It has over a decade’s head start on entity extraction. It has the Knowledge Graph. It has Google Business Profile, which is a user-edited, structured, ostensibly authoritative database of business information. It owns the model (Gemini). It owns the surface (AI Overviews). It owns the search index that wraps around it. Every page on the planet eventually walks past one of its crawlers. If joining structured business data to LLM output is supposed to be the obvious next step in the story, Google has every conceivable advantage in being the one to show it.

And yet:

The screenshot in question shows a single Google search results page. On the left, Google’s AI Overview confidently asserts that Perrys Dover Mazda is “not closed,” lists the address, and helpfully provides opening hours, presumably so you can pop down and have a look at the cars that are no longer there. On the right, on the same page, the Google Business Profile knowledge panel for the exact same business is labeled “Permanently closed” in a large, unambiguous red banner. Google Business Profile data is structured. It is user-edited. It is the closest thing Google has to a verifiable, official source on whether a business is, in fact, open. And the AI Overview, generated on the same SERP, by the same company, in the same session, is not consulting it. They are two organs of the same body that have not been on speaking terms for some time.

If the company with the longest possible head start, the most structured data, the most obvious commercial incentive, and full vertical integration over every part of the stack cannot reliably wire its own business-hours database into its own AI answers, the idea that OpenAI or Anthropic has quietly built a richer entity pipeline that does defer to your Organization schema is, let us say, optimistic.

So … Should You Still Use Schema?

Yes. Just for the right reasons and at the right price.

Schema is, in the grand scheme, still a stopgap. It exists because the technology cannot yet reliably read human language without ambiguity, and structured data is how we paper over the gap while the engineers work out how to read English properly. Gary Illyes from Google, speaking at an SEOFOMO meetup in 2025, pointed out (paraphrasing) that it would be lovely if Google did not have to rely on schema at all, because in an ideal world, the systems would simply understand the page. Schema buys you a bit of certainty in the meantime, which is worth something even if it is not worth the consultancy invoice you may have been quoted.

The recent Ahrefs study, which tracked 1,885 cited pages that recently added JSON-LD and matched them against 4,000 controls, found that schema had essentially no effect on AI citations across ChatGPT, AI Mode, and AI Overviews. That sounds damning, and a number of LinkedIn carousels are already enjoying themselves accordingly. But as Gianluca Fiorelli pointed out in his excellent critique, the study tested pages that were already being cited heavily by AI (every page in the dataset had 100+ AI Overview citations before treatment). That is the worst possible population to test schema on, because these are already strong, well-understood entities. Schema’s job is to disambiguate. If the system can already resolve who you are with high confidence, adding Organization schema is solving a problem the page does not have. You don’t introduce yourself by name to your own mother.

The interesting case, and the one nobody has properly tested, is the new and challenger brands, where the entity footprint across the web is thin, and the system cannot yet confidently say “this company is the company you mean.” For those, schema is infrastructure. It is how you become a resolvable node in the graph in the first place. It does not buy you a citation today. It earns you the right to be one of the candidates tomorrow, which, in a world where being a candidate is suddenly the only game in town, is no small thing.

Takeaways

A few practical thoughts, dressed down for tactical use:

Still use schema. The implementation cost is low, the downside is essentially nil, and the upside is cumulative. If schema does end up being meaningfully ingested at some stage of the LLM stack (and it might), the work is already done, and you can be smug about it. Smugness of that sort is the best kind.
Stop selling schema as a magic LLM citation lever. The current public evidence for LLMs using schema “as intended” at query time is, frankly, weak. Anyone telling a client otherwise should be politely asked to show their working, in front of other people, with a whiteboard.
Be ruthless about the bar of evidence. “An LLM returned a fact that appears in the schema” is not evidence the schema was used. The same fact almost always appears in the HTML, the metadata, the page title, the social card, or somewhere else a token predictor would gleefully pick it up. The duck experiment matters precisely because the schema was invalid and the LLMs returned the answer anyway. If your “proof” survives that test, talk to me. If it doesn’t, please stop putting it on slides.
Focus schema investment where disambiguation actually matters. New brands. Brands with name collisions. Organizations without a knowledge panel. Personal entities that overlap with other people who share their name and have been more famous for longer. That is where the asymmetric upside lives.
Treat “GEO best practice” the way you would treat any other new SEO orthodoxy. Skeptically, with experiments, and with a willingness to revise the position when the evidence changes. The car-wash-grade reasoning on LLMs, where the popular answer just gets repeated until it sounds true, is alive and thriving in our industry too.

Schema is a useful, low-cost, long-lived bet. It is also not the thing that is going to single-handedly drag your brand into ChatGPT’s answer set. Use it. Just do not oversell it. And for the love of god, before you build a deck around “LLMs returned the content from schema, so they use schema”, run the experiment with a deliberately nonsensical schema first. You may be surprised what the duck tells you.


This post was originally published on Mark Williams-Cook’s Substack.

Featured Image: Roman Samborskyi/Shutterstock
