
Two and a half years ago, I wrote an article for MCP about how retrieval-augmented generation (RAG) was the future of search. That piece argued that RAG was not Google’s reactive answer to ChatGPT. It was the architecture they had been building since the REALM paper in August 2020. SGE (now AI Overviews) was the production manifestation. Everything that has happened since has confirmed it.
The single-shot RAG pipeline I described in that article, query → retriever → top-k chunks → LLM → answer with citations, is already the past. Every major AI search platform has moved on. Google AI Mode, ChatGPT Search, Perplexity Pro Search, Claude with Computer Use, Gemini Deep Research, even the Microsoft Copilot Researcher and Analyst agents: they all run a different architecture now. They plan. They route between tools. They retrieve, read, then retrieve again. They grade their own first drafts and decide whether to go back for more. The retrieve-once-then-generate pattern that defined the first wave is obsolete.
This is agentic RAG, and it is now the default.
If your GEO program is still optimized for single-shot retrieval, you are optimizing for a system that no longer exists. Worse: in agentic RAG, you cannot see the gatekeepers rejecting you. You only see whether you ended up in the final answer. The traditional reverse-engineering playbook (rank checking, citation counting, even prompt-by-prompt sampling) only sees the last stage of a multi-stage pipeline. Everything that happens upstream is a black box.
By the time you get to the bottom of this page you will have a working mental model of agentic RAG, the patent evidence that Google has productized this architecture, what each major platform is actually doing, the six concrete shifts it forces in content engineering, and a reproducible audit you can run against your own brand this week. You will also have the strongest opinion I have published all year: the only honest way forward is model distillation.
What the MCP article got right and what’s changed
The October 2023 thesis still holds. Passage-level retrieval is the unit of relevance. Knowledge graphs are symbiotic with LLMs, not a checkbox you tick once and forget. Static IR scores are obsolete. The job of a search system is to reduce Delphic costs, the costs a user pays to get to an answer, and Google’s organizing principle has always been that traffic is a necessary evil, not a goal. That part of the argument needs no revision.
What has changed is the shape of the retrieval pipeline.
In 2023, RAG was a linear assembly line. A query came in, an embedding model encoded it, a vector index returned the top-k passages, those passages were stuffed into the LLM’s context window, and the model generated an answer. Citation tracking was straightforward because the citation set was the retrieval set. If your content was in the top-k, you had a chance. If it wasn’t, you didn’t. This is the model I described in that piece, and it was accurate at the time.
But things have changed.
The pipelines now have four properties that the linear architecture lacks: planning, tool use, multi-hop iteration, and reflection. The implication is that retrieval is not a single event anymore. A single user query triggers somewhere between 5 and 20 internal sub-retrievals. The agent orchestrates them, evaluates the intermediate results, and only synthesizes a final answer once it has decided the evidence base is sufficient.
This is the upgrade my piece foreshadowed but did not name.
Why naive RAG broke

Retrieval quality determines output quality, and naive RAG has four failure modes that yield lower-quality results.
- Classic, single-pass RAG cannot serve compound questions – A prompt like “How does a 1031 exchange interact with a SEP IRA for an LLC owner under 50?” needs five retrievals, not one. A single embedding query against a vector index will land on documents about 1031 exchanges or SEP IRAs, and the synthesis will be incoherent because the model is forced to bridge two retrievals it never made.
- Classic RAG can’t recover from a bad first pull – If the first retrieval misses the canonical source because the embedding distance was off, or because the chunk boundaries split the relevant passage in half, or because a more aggressive piece of competing content scored higher on a query the user did not literally ask, then the model has nothing to lean on apart from its parametric knowledge. That’s when hallucinations cascade.
- Classic RAG didn’t route between retrieval tools – Vector search is the right answer for some sub-questions and precisely wrong for others. “What is today’s mortgage rate?” needs a structured-data API call, not a passage search. “What does the IRS say about Section 179?” needs an authoritative-source filter, not similarity. “Calculate the depreciation schedule on a $50,000 vehicle placed in service in March” needs a code interpreter or a calculator tool. A single retriever cannot make those choices.
- Classic RAG can’t grade its own work – Once the answer is generated, naive RAG ships it. There is no critic. No second pass. No “wait, this contradicts the source I cited two paragraphs up.” If the model gets it wrong, the user sees the wrong answer.
These four failure modes are why every serious deployment moved to a different architecture. Each one has a corresponding fix, and the fixes together are agentic RAG.
What ‘agentic’ means in agentic RAG

The word “agentic” gets used loosely. Let’s nail it down structurally. There are four properties that turn RAG into agentic RAG, and a system needs all four to deserve the label.
1. Planning
Before any retrieval happens, the system decomposes the user query into a research plan. Sub-queries get generated, tools get pre-selected, retrieval order gets determined. In the AI Mode piece I called this “a latent multi-query event” when discussing query fan-out.
Agentic RAG goes a step further: the system does not just fan out, it plans the fan-out. The foundational paper is ReAct (Yao et al., 2022), which framed the move directly: “we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans… while actions allow it to interface with external sources, such as knowledge bases or environments.”
That interleaving is the planner. The production version is in every frontier model now, plus the planner-executor patterns that LangGraph and LlamaIndex have made standard.
2. Tool use, also called function calling
Retrieval is one tool among many. The agent can choose to query a vector index, hit a BM25 index, hit a structured-data API, run code, browse a live web page, call an MCP server, or call another agent. Each tool has a schema, and the agent picks the right one for the right sub-query.
Toolformer (Schick et al., 2023) made the case bluntly: “language models can teach themselves to use external tools via simple APIs and achieve the best of both worlds… a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.” That sentence is the spec for every router we’ll discuss later.
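To make tool selection concrete, here is a minimal sketch of a router that maps a sub-query to a tool. Everything in it is an assumption for illustration: the tool names, the keyword rules, and the fallback are hypothetical, and a production agent would let the LLM choose from tool schemas rather than match keywords.

```python
import re
from typing import Callable

def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a sub-query."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical routing rules, checked in order. Real routers are model-driven;
# keyword rules only stand in for that decision here.
RULES: list[tuple[str, Callable[[set[str]], bool]]] = [
    ("calculator", lambda w: "calculate" in w),
    ("structured_api", lambda w: bool(w & {"today", "current", "rate"})),
    ("authoritative_filter", lambda w: "irs" in w),
]
FALLBACK = "vector_search"  # assumed default when no rule fires

def route(sub_query: str) -> str:
    """Return the name of the first tool whose rule matches, else the fallback."""
    tokens = words(sub_query)
    for tool_name, rule in RULES:
        if rule(tokens):
            return tool_name
    return FALLBACK
```

Run against the three example sub-questions from the failure-mode list, this sketch sends the mortgage-rate question to the structured API, the IRS question to the authoritative-source filter, and the depreciation question to the calculator. Anything else falls through to vector search.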
3. Iteration, sometimes called multi-hop retrieval
The agent retrieves, reads what came back, and then retrieves again based on what it learned. Bridge entities, the entities the first retrieval surfaced that the second retrieval needs to investigate, become first-class behavior, not edge cases.
IRCoT (Trivedi et al., 2022) defined the loop as “interleaving retrieval with steps (sentences) in a chain of thought, guiding the retrieval with CoT and in turn using retrieved results to improve CoT.” The same paper reported retrieval improvements of up to 21 points on multi-hop QA datasets when the loop was applied.
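The retrieve, read, retrieve-again loop can be sketched on a toy corpus. This is not IRCoT itself, which interleaves retrieval with chain-of-thought sentences. It is a simplified stand-in where "reading" means collecting the entities a passage mentions. The corpus, passage ids, and entity lists are all made up.

```python
# Toy corpus: passage id -> entities it mentions. All data is invented.
CORPUS = {
    "p1": ["1031 exchange", "capital gains"],
    "p2": ["capital gains", "LLC"],
    "p3": ["LLC", "SEP IRA"],
}

def retrieve(entity: str) -> list[str]:
    """Return ids of passages that mention the entity (stand-in for a retriever)."""
    return [pid for pid, entities in CORPUS.items() if entity in entities]

def multi_hop(seed: str, max_hops: int = 3) -> list[str]:
    """Retrieve, read the results, then retrieve again on newly surfaced entities."""
    seen_entities = {seed}
    evidence: list[str] = []
    frontier = [seed]
    for _ in range(max_hops):
        next_frontier = []
        for entity in frontier:
            for pid in retrieve(entity):
                if pid not in evidence:
                    evidence.append(pid)
                # Bridge entities: mentioned in a result but not yet investigated.
                for bridge in CORPUS[pid]:
                    if bridge not in seen_entities:
                        seen_entities.add(bridge)
                        next_frontier.append(bridge)
        if not next_frontier:
            break
        frontier = next_frontier
    return evidence
```

With a single hop, a query seeded on "1031 exchange" only ever sees `p1`. With three hops it reaches `p3` through the bridge entities "capital gains" and "LLC", which is the behavior single-pass retrieval cannot reproduce.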
4. Reflection, also called self-critique
After drafting an answer, the agent grades it. Sufficiency, contradiction, freshness, source diversity. If the critic flags a problem, the agent goes back and retrieves more.
Self-RAG (Asai et al., 2023) is the most-cited paper in this lineage and the cleanest articulation: “a new framework called Self-Reflective Retrieval-Augmented Generation that enhances a language model’s quality and factuality through retrieval and self-reflection… the framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using reflection tokens.”
CRAG, Reflexion, and Self-Refine extend the same pattern in different directions, but the core mechanism is right there.
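A reflection pass can be sketched as a critic function plus a loop that keeps retrieving until the critic stops complaining. This is a rule-based stand-in, not Self-RAG's reflection tokens. The `Source` fields, the thresholds, and the two checks (fresh-evidence count and domain diversity) are assumptions chosen to mirror the criteria listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Source:
    domain: str
    year: int

def critique(sources: list[Source], min_sources: int = 2, min_year: int = 2024) -> list[str]:
    """Return a list of problems; an empty list means the draft can ship.

    The thresholds are illustrative assumptions, not values from any vendor.
    """
    problems = []
    fresh = [s for s in sources if s.year >= min_year]
    if len(fresh) < min_sources:
        problems.append("insufficient fresh evidence")
    if len({s.domain for s in fresh}) < 2:
        problems.append("low source diversity")
    return problems

def answer_with_reflection(
    batches: list[list[Source]], max_rounds: int = 3
) -> tuple[Optional[int], list[Source]]:
    """Pull one more batch of sources per round until the critic signs off.

    Returns the round in which the critic passed (None if it never did)
    together with the evidence gathered so far.
    """
    sources: list[Source] = []
    for round_no, batch in enumerate(batches[:max_rounds], start=1):
        sources.extend(batch)
        if not critique(sources):
            return round_no, sources
    return None, sources
```

In this sketch a stale source never counts toward sufficiency, so a draft backed only by old pages keeps triggering another retrieval round.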
Anthropic’s December 2024 essay “Building effective agents” defines the same four properties under cleaner terminology, and one of its lines belongs in every GEO deck this year: “Agents are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.” With so much confusion about what an agent is or what agentic means, let’s use that as the working definition. Ultimately, the terminology varies by vendor; the four properties do not.
A picture is worth more than the definition list above. Imagine the classical RAG architecture as a single arrow pointing right: query enters one end, answer comes out the other. Now imagine agentic RAG as a loop with five labeled stops (planner, router, retrieval tools, critic, synthesizer) and bidirectional arrows that allow the agent to revisit any stop until the critic signs off. That loop is what your content has to survive.

The agentic RAG reference architecture

Let’s walk through the canonical components, because you cannot reverse-engineer a system you cannot draw.
- Planner / orchestrator – Reads the user query, generates a research plan. Same LLM as the rest of the system, run with a planner-specific prompt. Outputs a list of sub-queries and a tool assignment for each.
- Router – Decides which retrieval tool fits each sub-query. Vector search? Lexical? A hybrid retriever? A live web fetch? A SQL query against a structured database? A function call into a calculator? An MCP server exposing a domain-specific API? An agent-to-agent call? The router is the most underrated component in the whole stack because it determines whether your content even gets a chance to be retrieved. If your domain has a tool surface and you do not expose one, the router skips you.
- Retrieval tools – Each tool is its own subsystem. Vector retrievers run cosine similarity over dense embeddings. Lexical retrievers run BM25 or rank-modified TF-IDF. Structured tools call APIs and return rows. Code interpreters execute scripts. Web browsers fetch live URLs. The agent treats them all uniformly: input goes in, evidence comes out.
- Memory – There are typically two layers of memory. The first is a short-term scratchpad for the current research thread: what sub-queries have run, what evidence has come back, what the critic has flagged. Then there’s long-term memory for user
- Critic / reflection module – Judges sufficiency and quality of the draft answer. This is sometimes a separate model, but often the same model with a critic-specific prompt. The reflection module decides whether to ship or to re-query. The critic is the gatekeeper that nobody talks about, and it is the gatekeeper that drops the most content from final answers.
- Synthesizer – Composes the final answer with inline citations, often after a final pairwise re-rank against the surviving candidates.
A clarification before we move on. Most production systems are not literal multi-agent constellations. They are a single LLM running tight loops with different prompts at each stage, plus tool calling. Do not conflate “agentic” with “multi-agent.”
Multi-agent setups exist. Anthropic’s research stack uses them, and so does Microsoft’s Researcher / Analyst pair, but the dominant production pattern is single-LLM, multi-prompt, multi-tool. When the marketing team tells you their AI is “multi-agent,” 9 times out of 10 what they mean is “we have a planner prompt and a critic prompt.”
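The single-LLM, multi-prompt pattern is easy to see in code. The sketch below passes one `llm` callable through planner, synthesizer, and critic stages by changing only the prompt. The prompt wording, the `OK` / `RETRY` protocol, and the `search` callable are hypothetical stand-ins, not any vendor's implementation.

```python
from typing import Callable

# One model, several stage prompts. The prompt wording is illustrative only.
PROMPTS = {
    "plan": "PLAN: list sub-queries, one per line, for:\n{input}",
    "synthesize": "SYNTHESIZE: write a cited answer from:\n{input}",
    "critic": "CRITIC: reply OK if the draft is supported, else RETRY:\n{input}",
}

def run_agent(
    question: str,
    llm: Callable[[str], str],
    search: Callable[[str], list[str]],
    max_loops: int = 2,
) -> str:
    """Plan, retrieve, draft, and critique with a single `llm` callable."""
    plan_input = question
    evidence: list[str] = []
    draft = ""
    for _ in range(max_loops):
        # Planner stage: same model, planner-specific prompt.
        sub_queries = llm(PROMPTS["plan"].format(input=plan_input)).splitlines()
        for sub_query in sub_queries:
            evidence.extend(search(sub_query))
        # Synthesizer stage: same model, different prompt.
        draft = llm(PROMPTS["synthesize"].format(input="\n".join(evidence)))
        # Critic stage: same model again, grading its own draft.
        verdict = llm(PROMPTS["critic"].format(input=draft))
        if verdict.strip() == "OK":
            break
        # Critic asked for more: re-plan around the weak draft.
        plan_input = f"{question}\nGaps to fill in this draft:\n{draft}"
    return draft
```

Nothing here is a second agent. The "planner" and the "critic" are the same function called with different strings, which is usually what sits behind a "multi-agent" label.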
Patent evidence: How Google is actually doing agentic RAG
Google has been quietly building toward this architecture for years, and the patent evidence maps almost cleanly onto the four-property definition from §3. Five Google LLC patents do the heavy lifting. Read them in this order and you can watch the agentic loop assemble in IP filings, one component at a time.
- Planning – query decomposition and fan-out. US11663201B2, Generating Query Variants Using a Trained Generative Model, was filed in April 2018 and issued in May 2023. It describes systems that use a trained generative model to produce query variants at runtime from a single submitted query. The patent enumerates eight variant types (equivalent, follow-up, generalization, canonicalization, language-translation, entailment, specification, and clarification queries) and explicitly handles “tail” queries with low submission frequency. This is the planner. When AI Mode receives one query and decomposes it into five to twenty sub-queries, the mechanism the patent describes is what is running. The companion filing, WO2024064249A1, Systems and Methods for Prompt-Based Query Generation for Diverse Retrieval, is the Google Research version of the same idea: “Promptagator,” which uses few-shot LLM prompting to generate synthetic queries for training dual-encoder retrievers across diverse domains. Plan-then-fan-out, productized.
- Tool use – routing among retrieval sources. US20240362093A1, Query Response Using a Custom Corpus, assigned to Google LLC and published October 31, 2024, is the cleanest router patent in the stack. The system has the LLM process a user query and generate API calls to external applications, each of which has access to a respective custom corpus. The external applications return documents, which the LLM uses as context for generation. Tool selection. API calls. Multiple corpora. The behavior every frontier vendor now ships under the label “function calling” was filed by Google in this patent.
- Memory – stateful, multi-turn orchestration. US20240289407A1, Search with Stateful Chat, assigned to Google LLC in March 2024, describes augmenting traditional search with a “generative companion” that maintains and updates user context across multiple chat turns. The patent explicitly handles synthetic query generation tailored to that ongoing state. This is the long-term memory layer of the architecture in §4, the same layer that ChatGPT calls Memory and Gemini calls Saved Info. Google patented the mechanism before any of them shipped a UI for it.
- Reflection – pairwise ranking inside the loop. US20250124067A1, Method for Text Ranking with Pairwise Ranking Prompting, assigned to Google LLC in October 2024, is the patent I covered in How AI Mode Works. The system ranks passages by having an LLM perform pairwise comparisons (“of these two passages, which is better for this query?”) and aggregates the comparisons into a final ranked list. This is relative, model-mediated, probabilistic ranking, and it is the inner loop that runs inside the agent’s reflection and synthesis stages. Your content is not competing in isolation. It is being compared head-to-head against every other surviving candidate, by an LLM that reads both passages and picks a winner.
- Synthesis – generative answers grounded in retrieved evidence. US11769017B1, Generative Summaries for Search Results, was filed in March 2023 and issued by September of the same year. The patent describes generating natural-language summaries of search results using LLMs, with explicit provisions for processing additional content to mitigate inaccuracies and improve summary quality. Industry analysts have correctly identified this as the patent foundation underneath SGE and the AI Overviews product. The “process additional content to mitigate inaccuracies” language is reflection in early form: the synthesizer is checking its own work before shipping the answer.
Five patents. One planner mechanism. One router mechanism. One memory mechanism. One reflection mechanism. One synthesis mechanism. Lay them on top of the four-property definition and it’s clear that Google has filed IP on every component of the agentic loop. The agentic stack is not a startup-vendor framing borrowed from the open-source agent ecosystem. It is a production architecture that Google has been building toward in its patent filings since 2018.
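The pairwise ranking idea in the reflection bullet reduces to a small aggregation step: compare every pair, count wins, sort by wins. The sketch below shows only that aggregation. The `prefer` callable is where an LLM judge would sit, and `prefer_specific`, which favors the passage with more digits, is a crude hypothetical stand-in for "evidence density," not the patented method.

```python
from itertools import combinations
from typing import Callable

def pairwise_rank(passages: list[str], prefer: Callable[[str, str], str]) -> list[str]:
    """Rank passages by number of head-to-head wins over all pairs.

    Passages are assumed to be unique strings, since they are used as dict keys.
    """
    wins = {p: 0 for p in passages}
    for a, b in combinations(passages, 2):
        wins[prefer(a, b)] += 1
    # sorted() is stable, so passages with equal wins keep their input order.
    return sorted(passages, key=lambda p: wins[p], reverse=True)

def prefer_specific(a: str, b: str) -> str:
    """Hypothetical judge: the passage with more digits wins; ties go to `a`."""
    def digit_count(text: str) -> int:
        return sum(ch.isdigit() for ch in text)
    return a if digit_count(a) >= digit_count(b) else b

ranked = pairwise_rank(
    [
        "Widget X is great.",
        "Widget X weighs 12 kg and ships in 3 days.",
        "Widget X weighs 12 kg.",
    ],
    prefer_specific,
)
```

All-pairs comparison costs n(n-1)/2 judge calls, which is one reason it tends to run over a short list of surviving candidates rather than the whole retrieval set.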
The other major platforms do not have the same patent footprint, but they have the same architecture. Patents are evidence, not boundaries. The fact that Google has chosen to file IP on these specific subsystems tells you which subsystems they consider strategic and which subsystems your content has to win at if you want to be cited in AI Mode.
How each major platform actually uses agentic RAG
Different platforms emphasize different pieces of the loop. The platform-by-platform read matters because the same content can win in one system and lose in another based on which gatekeeper does the heaviest lifting.
- Google AI Mode – The most aggressive agentic implementation in production. Planner-driven fan-out. Multi-pass retrieval into Search. Pairwise re-ranking per US20250124067A1. A reflection module that drops sources that fail the critic. The visible “expansion” UI shows you a fraction of the sub-queries, but the real fan-out is wider. This is the platform where breadth and pairwise survivability matter most.
- Google AI Overviews – A lighter agentic pattern. Shorter loops. Less iteration than AI Mode. AIO is closer to classical fan-out than full agentic RAG, but the trajectory is clear: each AIO update adds more reflection and more router intelligence.
- ChatGPT Search and Deep Research – Deep Research is the cleanest user-facing demonstration of the pattern. It literally exposes its planning, sub-queries, and reflection in the visible UI. You watch the agent decompose your question, route to tools, and grade its own progress. Standard ChatGPT Search runs a smaller version of the same pipeline without the visible plan. If you want to study agentic RAG empirically, run 10 queries through Deep Research and read the trace.
- Perplexity Pro Search and Deep Research – Agentic from the start. Multi-step retrieval, source diversification by design, draft critique. Perplexity tends to be the most generous about source attribution, which makes it the best canary for whether your content is making it into intermediate retrievals.
- Claude with Computer Use, Projects, and Skills – Tool use as a first-class primitive. Claude supports long-running multi-step tasks where retrieval is interleaved with action. The system can read a page, decide to fetch a different page, decide to run code, decide to query an API, all within the same task. Claude is overrepresented in enterprise deployments where the action layer matters as much as the retrieval layer.
- Gemini Deep Research – Explicit research-plan-then-execute loop. Multi-source aggregation. Draft critique. The visible plan in Gemini Deep Research is a useful diagnostic. If your content does not show up in any of the planned sub-queries, you are not just losing the citation, you are losing the consideration set.
- Grok DeepSearch – An emerging real-time agentic pattern leaning on X data. The retrieval surface is fundamentally different in that it uses recent social signals over a structured public corpus, but the loop architecture is the same.
- Microsoft Copilot Researcher and Analyst agents – Enterprise agentic RAG over SharePoint, Microsoft Graph, and the open web. The Researcher and Analyst pair is closer to a true multi-agent setup than the others on this list. Two specialized agents, each with their own tool stack, coordinating on a single research goal.
Here is the comparison across the eight major platforms. Iteration depth is rated on a five-point scale from minimal (single-pass with light reranking) to deep (10+ sub-queries with multiple critic loops). Visibility ratings reflect what is exposed in the user-facing UI as of mid-2026.
| Platform | Planner visibility | Router strategy | Iteration depth | Reflection visibility | Citation surfacing |
| --- | --- | --- | --- | --- | --- |
| Google AI Mode | Partial (expansion view shows some sub-queries) | Internal Search index + structured data tools + Knowledge Graph | Deep (5–20 sub-queries) | Hidden (pairwise rerank + critic both internal) | Inline links, often per-claim |
| Google AI Overviews | Hidden | Search index, lighter than AI Mode | Medium (3–8 sub-queries) | Hidden | Inline links, less granular |
| ChatGPT Search | Hidden | Bing index + first-party tools | Medium | Hidden | Inline links, sometimes a sources panel |
| ChatGPT Deep Research | Fully exposed (live plan + sub-queries + reasoning) | Bing index + browse + code interpreter | Deep (often 20+ sub-queries) | Partially exposed (you see the agent reflect mid-task) | Numbered references with full source list |
| Perplexity Pro Search | Partial (sub-question list rendered) | Multi-source web + structured tools | Medium-to-deep | Hidden but generous on sourcing | Inline numbered links, full source panel |
| Perplexity Deep Research | Fully exposed | Multi-source web + browse + structured tools | Deep | Partially exposed | Inline + broad source panel |
| Claude (Computer Use, Projects, Skills) | Hidden | Tool use as first-class primitive (search, code, browse, MCP) | Variable, can be very deep | Hidden | Inline citations when tools return them |
| Gemini Deep Research | Fully exposed (research plan rendered before execution) | Google Search + structured tools | Deep | Partially exposed | Inline + structured source list |
| Grok DeepSearch | Partial | X data + open web | Medium | Hidden | Inline links, X-weighted |
| Microsoft Copilot Researcher / Analyst | Partial (multi-agent traces in some surfaces) | SharePoint + Microsoft Graph + open web | Deep | Partially exposed | Inline citations, enterprise-doc weighted |
The honest summary: every major AI search system is now agentic. The differences are about which gatekeepers they expose and which ones they hide. None of them expose all five. The Deep Research surfaces (across ChatGPT, Gemini, and Perplexity Pro) are the most useful diagnostics you have for studying agentic-RAG behavior in production, because they show the planner and partial reflection in the UI. The non-Deep surfaces are what most users actually run, and those hide almost everything.
What this changes for Relevance Engineering
You know I’m not going to leave you without something actionable. Here are the six concrete shifts that follow from everything above.
- You have to win across many sub-retrievals, not one. A single “good ranking” page is no longer enough. Agentic systems decompose your topic into 5 to 20 sub-queries and retrieve against each one independently. Coverage breadth and topical depth are not nice-to-haves anymore, they are structural requirements. Pages that exist as standalone pillars without depth in the surrounding subtopic graph get cited once, maybe, and then dropped from the consideration set on the next sub-query. Pages that anchor a dense, well-linked topical neighborhood get cited five times in the same answer.
- Atomic, scoped passages beat monolithic articles, and now they have to win pairwise. Each agent sub-query retrieves chunks, not pages. Then those chunks get pairwise-ranked against competing chunks from competing sources, by an LLM that reads both. The line I used in the AI Mode piece holds: your passages have to survive pairwise scrutiny. That means you need self-contained logic, named entities up front, explicit scope conditions (“for businesses with under 500 employees”). You also need evidence density, tables, and lists that an LLM can quote without ambiguity. Anything that requires a human to scroll up two paragraphs for context will lose pairwise to a passage that does not.
- Bridge entities determine multi-hop inclusion. When the agent’s first retrieval lands on Entity A, the second retrieval is about A’s relationships. If your content is the canonical bridge between A and B, you get cited in answers where the user never typed your brand. This is the most underexploited GEO surface in the industry today. I’ll talk more about it in another article.
- Reflection cycles reward source diversity and contradiction-handling. When the critic grades the draft, it looks for corroboration and contradiction. Content that explicitly addresses counterarguments, edge cases, and “when this doesn’t apply” survives reflection passes that strip out one-sided sources. Salesy content with no acknowledgment of failure modes is a tell to the critic that the source is biased, and biased sources get filtered.
- Tool-callable content is a new content type. Calculators. Structured-data endpoints. APIs. Comparison engines. When a tool exists, the router calls the tool instead of citing prose. If you are in a domain where a tool is more useful than an article (mortgage rates, drug interactions, tax brackets, product specs, ETF performance, fund characteristics), you should build the tool and expose it through an MCP server, an API, and structured data. The brands that ignore this and keep writing 2,500-word “ultimate guide” articles will be replaced in the answer by a function call.
- Freshness is a reflection-stage gate. The critic checks freshness explicitly. dateModified in your schema. Version numbers in body copy. Explicit “as of [date]” framing in the prose. None of this is cosmetic. All of it directly affects whether your content survives the reflection pass when the agent is grading source quality. Stale content gets dropped at the critic, even if it won the pairwise re-rank, because the critic decides it cannot trust it.
The unifying point under all six: classical SEO content engineering optimized for one moment of judgment, the SERP. Agentic RAG content engineering has to win at five different moments for every sub-query in the fan-out: planner, router, retrieval, pairwise, critic. That is about an order of magnitude more surface area, and the brands that build for it will see citation gravity that compounds.
The opacity problem, and why distillation is the smart way forward
Here is the part nobody else is willing to write yet, because saying it out loud has uncomfortable implications for the whole GEO measurement category.
In single-shot RAG, you could at least observe inputs and outputs. Your page either showed up in the retrieval set or it didn’t. You could reverse-engineer the retriever by sampling enough queries. You could correlate content changes with citation changes. The system was a black box, but it was a black box with measurable inputs and measurable outputs.
In agentic RAG, every gatekeeper between the user query and the final answer is opaque.
You don’t know which sub-queries the planner generated. You don’t know which tool the router picked for each sub-query. You don’t know which corpus was searched, which passages were returned, or which competitor passages your content lost to in the pairwise re-rank. You don’t know what the critic flagged. You don’t know which sources the critic dropped before synthesis. You only know whether you ended up in the final answer.
The implication is uncomfortable. Traditional reverse-engineering (“rank checking,” “citation tracking,” even prompt-by-prompt sampling at scale) only sees the final stage. Every citation tracker watches what shows up in the published answer. They are all measuring the survivors of a five-stage filter without watching the filter. You are optimizing against a black box behind a black box behind a black box.
The honest way forward is model distillation.

Distillation, in plain English: training a smaller, observable model to imitate the behavior of a larger, opaque one. You cannot see inside Google’s planner, but you can stand up your own planner-router-critic stack on inputs and observed outputs, calibrate it against the citations you actually see in production, and use that as the diagnostic harness. When your local agent’s planner generates 10 sub-queries that closely match the visible Deep Research plan for the same prompt, you have a calibrated proxy for the upstream gatekeepers in production systems. The proxy is not the production system, but it is observable, and observable beats invisible.
What this looks like in practice for a GEO program:
Stand up a local reference agent on Google Gemma 4: the 31B Dense variant for the planner and critic loops where reasoning fidelity matters, or the 26B A4B MoE variant when latency and cost dominate. Pair it with LangGraph or LlamaIndex for the agent framework, a hosted embedding model, and a small custom index over the open web for your topic. There is a thematic point worth making out loud here: Google ships the open-weights model that powers the local distillation harness used to reverse-engineer Google’s own production stack. That is not a coincidence. That is a category opening up that the smart agencies and software companies will own.
Feed the harness the prompts you care about ranking for. Observe its planner output. Log every sub-query the router generates. Capture the retrieval candidates at each stage. Score the pairwise comparisons. Read the critic’s notes. Where your local agent’s behavior matches the production system’s visible behavior (the Deep Research plan, the Perplexity sub-question list, the AI Mode expansion), you have a calibrated harness. Where it diverges, you have a calibration target. When your content fails to make it past the router or the critic in your distilled local agent, that is a strong signal it is failing in production.
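One way to put a number on "matches the production system’s visible behavior" is to score how many of the observed sub-queries your local planner approximates. The sketch below uses Jaccard overlap on word tokens as a crude stand-in for semantic similarity. The 0.5 threshold, the function names, and the sample sub-queries are all assumptions, and an embedding-based comparison would likely be a better fit in a real harness.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sub-queries, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def plan_match_rate(local: list[str], observed: list[str], threshold: float = 0.5) -> float:
    """Share of observed sub-queries that some local sub-query approximates."""
    if not observed:
        return 0.0
    hits = sum(
        1 for obs in observed
        if any(jaccard(obs, loc) >= threshold for loc in local)
    )
    return hits / len(observed)

# Made-up example: sub-queries read off a visible plan vs. the local planner's output.
observed_plan = ["sep ira contribution limits", "1031 exchange rules", "llc tax treatment"]
local_plan = ["SEP IRA contribution limits 2025", "rules for a 1031 exchange", "mortgage rates"]
match_rate = plan_match_rate(local_plan, observed_plan)
```

Here the local planner covers two of the three observed sub-queries, so the third ("llc tax treatment") is the calibration target.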
This is preferable to the existent ascendant playbook of “spam much prompts astatine ChatGPT and count citations” for 1 reason: distillation gives you a causal story for why contented fails astatine each stage. Citation counting only gives you a correlational story for what survived. When a customer asks “why are we losing to Competitor X successful AI Mode,” the reply “your passages support losing pairwise comparisons successful the calculator-ratio sub-query” is defensible. The reply “our citation count went down 12 percent this month” is not.
The candid caveat: distillation is not free. It requires engineering investment, an information harness, and continuous calibration against production-system behavior. The agencies and in-house GEO teams that build this capacity now will person a measurement moat that compounds. The ones that hold will beryllium moving the aforesaid dashboard their competitors are moving and wondering why their reports cannot reply the questions executives are asking.
You cannot optimize what you cannot observe. Reverse-engineering the accumulation achromatic container is simply a dormant end. Distilling your ain type of it is the only way to durable GEO performance.
What this changes for measurement
The measurement category is going to fragment, and the brands that pick the right side of the fragmentation will have a significant advantage for the next two years.
Citation counts under-report your real footprint by a factor of 3 to 10 in agentic systems. If you appear in 4 of 12 sub-retrievals but get cited once in the final answer, classic citation tracking misses 75 percent of your actual impact. Worse, it misses the why. You can have a citation rate that looks healthy and a sub-query coverage rate that is collapsing, and a year from now the collapse shows up in citations with no warning.
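The arithmetic behind that example is simple enough to keep in a helper. This is a minimal sketch, not part of any tool mentioned in this article; the function name and its two inputs are hypothetical, and the numbers are the ones from the paragraph above.

```python
def citation_blind_spot(sub_retrievals_with_brand: int, final_citations: int) -> dict:
    """Compare how often a brand is retrieved with how often it is cited.

    Both inputs are hypothetical counts you would collect yourself:
    the number of sub-retrievals that surfaced your content, and the
    number of citations that made it into the final answer.
    """
    if sub_retrievals_with_brand <= 0:
        raise ValueError("brand must appear in at least one sub-retrieval")
    captured_share = final_citations / sub_retrievals_with_brand
    factor = (
        sub_retrievals_with_brand / final_citations
        if final_citations
        else float("inf")
    )
    return {
        "missed_share": 1 - captured_share,   # impact invisible to citation counts
        "under_report_factor": factor,        # appearances per visible citation
    }


# The example from the text: surfaced in 4 sub-retrievals, cited once.
example = citation_blind_spot(sub_retrievals_with_brand=4, final_citations=1)
print(example)
```

With 4 appearances and 1 citation, the missed share is 0.75 and the under-report factor is 4, which sits inside the 3 to 10 range quoted above.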
The new metric layer needs:
- Sub-query coverage: what percent of the agent’s planned fan-out includes at least one of your sources.
- Retrieval-to-citation ratio: for sub-queries where your content is in the retrieval set, how often it survives to citation.
- Reflection survival rate: for content that makes the synthesis pool, how often the critic keeps it rather than dropping it.
- Bridge-entity centrality: whether your content is positioned as the canonical link between key entities in your topical graph.
- Tool-call inclusion: whether the router is calling your endpoints when a tool fits the sub-query.
- Distillation stage-failure rate: from the local agent, where in the loop your content most often gets dropped.
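
Three of these metrics fall straight out of per-sub-query trace records. The sketch below assumes a hypothetical record shape (`SubQueryTrace` and its field names are illustrative, not the schema of any real tool) and treats reflection survival as the share of pooled content the critic keeps.

```python
from dataclasses import dataclass


@dataclass
class SubQueryTrace:
    """Hypothetical per-sub-query record; field names are assumptions."""
    sub_query: str
    brand_retrieved: bool        # brand URL appeared in the retrieval set
    brand_in_pool: bool          # brand passage reached the synthesis pool
    brand_survived_critic: bool  # critic kept it after reflection
    brand_cited: bool            # brand URL cited in the final answer


def ratio(numerator: int, denominator: int):
    """Return a rate, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def metric_layer(traces: list) -> dict:
    retrieved = [t for t in traces if t.brand_retrieved]
    pooled = [t for t in traces if t.brand_in_pool]
    return {
        "sub_query_coverage": ratio(len(retrieved), len(traces)),
        "retrieval_to_citation": ratio(sum(t.brand_cited for t in retrieved), len(retrieved)),
        "reflection_survival": ratio(sum(t.brand_survived_critic for t in pooled), len(pooled)),
    }


# Made-up traces for one prompt with a four-way fan-out.
traces = [
    SubQueryTrace("pricing comparison", True, True, True, True),
    SubQueryTrace("integration options", True, True, False, False),
    SubQueryTrace("security certifications", True, False, False, False),
    SubQueryTrace("migration steps", False, False, False, False),
]
metrics = metric_layer(traces)
print(metrics)
```

Bridge-entity centrality and tool-call inclusion need a topic graph and router logs respectively, so they are left out of this sketch.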

Existing tools watch the survivors of a five-stage filter. The next generation of GEO measurement infrastructure will sit underneath them and watch the filter itself, partly through the visible UI of Deep Research and AI Mode, and partly through a distilled local agent that fills in everything the production systems hide.
A reproducible test you can run this week
You know I always want to leave you with something actionable. So, I’ve got two things you can do to improve your AI Search performance. The first requires no engineering. The second is engineering-light, a single-engineer effort.
Part A — The Observable Agentic RAG Audit.
The first one is a workbook for you to collect data and see how you are being interpreted by agentic RAG systems. Here are the steps:
- Pick five high-value queries. Pick the ones where citation actually moves your business: the queries your sales team wishes you ranked for, the queries that drive demos, the queries that show up in customer support tickets. I understand that these are difficult to measure, so use your traditional search queries as a proxy if you need to.
- Run each query through ChatGPT Deep Research, Gemini Deep Research, and Perplexity Pro with research mode enabled.
- Capture the visible research plan for each. Deep Research and Perplexity show this directly; AI Mode partially exposes it through the expansion view.
- Log every sub-query the agent issues. Save them in a spreadsheet, one row per sub-query, three columns for the three platforms.
- For each sub-query, run it as a standalone search and check whether your content appears in the top retrieval set. If yes, mark hit. If no, mark miss.
- Compare your sub-query coverage to your final-citation rate on the original five queries. The gap is your reflection-loss problem: the places where your content makes it into retrieval and then loses pairwise or fails the critic.
- For each sub-query you miss entirely, categorize why: no content on the topic, content too broad, poor chunking, missing schema, missing tool surface, freshness gap. The categorization is the input to your content roadmap for the next quarter.
This will give you a sense of where you’re falling out of the pipeline and what improvements you need to make to your content.
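Once the workbook is filled in, the comparison in the last two steps can be scripted. The following is a rough sketch under stated assumptions: the column names (`sub_query`, the three platform columns, `result`, `miss_reason`), the sample rows, and the final-citation rate are all made up for illustration, and the sheet is assumed to be exported as CSV.

```python
import csv
import io
from collections import Counter

# Hypothetical export of the Part A sheet: one row per sub-query, an "x"
# under each platform that issued it, plus the hit/miss mark and miss reason.
SHEET = """\
sub_query,chatgpt,gemini,perplexity,result,miss_reason
best crm for startups,x,x,,hit,
crm pricing comparison,x,,x,miss,content too broad
crm data migration steps,,x,,hit,
crm security certifications,,x,x,miss,missing schema
crm api rate limits,x,,,miss,missing tool surface
"""

PLATFORMS = ("chatgpt", "gemini", "perplexity")


def summarize(sheet_text: str):
    rows = list(csv.DictReader(io.StringIO(sheet_text)))
    hits = sum(row["result"] == "hit" for row in rows)
    coverage = hits / len(rows) if rows else 0.0
    # Coverage per platform, counting only the sub-queries that platform issued.
    per_platform = {}
    for platform in PLATFORMS:
        issued = [row for row in rows if row[platform].strip()]
        per_platform[platform] = (
            sum(row["result"] == "hit" for row in issued) / len(issued)
            if issued
            else None
        )
    reasons = Counter(row["miss_reason"] for row in rows if row["result"] == "miss")
    return coverage, per_platform, reasons


coverage, per_platform, reasons = summarize(SHEET)
final_citation_rate = 1 / 5  # assumed: cited in 1 of the 5 original queries
gap = coverage - final_citation_rate
print(f"coverage={coverage:.2f} gap={gap:.2f}")
print(per_platform)
print(reasons.most_common())
```

The miss-reason counts are the part that feeds the roadmap; the gap between coverage and final-citation rate tells you how much is being lost after retrieval.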
Part B — The Distillation Audit.
This approach is more technical. Part A told you what the production agents publicly admitted. Part B tells you what they didn’t: the planner sub-queries you couldn’t read, the reranker verdicts you couldn’t see, the specific stage where your content fell out.
I built the harness so you wouldn’t have to: https://github.com/iPullRank-dev/agentic-rag-audit. It’s a local, observable version of the agentic-RAG loop the production systems run, with the same five-node shape (planner, router, retriever, synthesizer with pairwise reranker, critic with reflection) on Google Gemma 4 via Ollama, with SerpAPI seeds, Scrapling fetching, Trafilatura extraction, and an opt-in LangExtract chunker. Strictly speaking it’s structural distillation, not model distillation. The point is diagnostic: observable end-to-end.
- Install. Python 3.10+, Ollama running on a workstation GPU (8GB+ VRAM is fine), a SerpAPI key, your brand domain.

Set OLLAMA_CONTEXT_LENGTH=8192 in your system environment variables and restart Ollama; the 2048 default silently truncates prompts. Verify with ollama ps that the model lands at 100% GPU.
- Run the same five queries from Part A. One at a time:

It’ll take about 90–120 seconds per query. You get eight diagnostic sections in your terminal (plan & routing, retrieval funnel, pairwise verdicts, brand journey, critic verdict, pipeline timing, final answer, citations) plus a trace JSON and a log file.
Here’s an example terminal output:

- Read the brand journey. This is the section you came for. For each of your URLs that was surfaced, it shows which sub-queries found it, what the chunker actually extracted, whether it made the reranker pool, the head-to-head verdicts that named it, and whether it ended up cited. When your content falls out, you see your URL’s actual opening passage side-by-side with the URLs that did make the pool, with targeted recommendations based on the observable diff (opening sentence, query-term overlap, passage density).
- Roll up the metrics across the query set. After running all five Part A queries:

You’ll get five metrics: sub-query coverage, retrieval-to-citation ratio, reflection survival rate, tool-call inclusion, and stage-failure rate by stage. Here’s an example:

The stage-failure rate is what drives the content roadmap. Failing at retrieval is one kind of work: traditional SEO for the specific sub-queries the planner is generating. Failing at the reranker is another: passage-level content density and directness. Failing at synthesis selection is a third: unique-signal coverage. Each demands different work.
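To make the stage-failure idea concrete, here is a small sketch of how such a rate could be tallied. It is not the repo’s implementation: the stage labels, the `dropped_at` mapping, and the URLs are hypothetical, and the only assumption is that you know, for each brand URL, the stage where it fell out (or that it was cited).

```python
from collections import Counter

# Hypothetical stage labels, ordered from earliest to latest drop point.
FAILURE_STAGES = ["retrieval", "reranker", "synthesis_selection", "critic"]


def stage_failure_rates(dropped_at: dict) -> dict:
    """Share of brand URLs dropped at each stage.

    `dropped_at` maps URL -> stage where it fell out, or "cited" if it
    survived the whole loop. The labels are assumptions for illustration.
    """
    total = len(dropped_at)
    counts = Counter(stage for stage in dropped_at.values() if stage != "cited")
    return {stage: (counts[stage] / total if total else 0.0) for stage in FAILURE_STAGES}


# Made-up journeys for five brand URLs across a query set.
journeys = {
    "https://example.com/pricing": "cited",
    "https://example.com/guide": "reranker",
    "https://example.com/faq": "reranker",
    "https://example.com/blog/post": "retrieval",
    "https://example.com/docs": "synthesis_selection",
}
rates = stage_failure_rates(journeys)
worst_stage = max(rates, key=rates.get)
print(rates, worst_stage)
```

In this made-up data the reranker is the dominant drop point, which would point the roadmap at passage density and directness rather than at ranking for more sub-queries.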
- Calibrate against Part A. Capture each production Deep Research plan as YAML (template at examples/production-template.yaml) and diff:

Where the two converge, you have a calibrated harness. Where they diverge sharply, your planner prompt or your seed-page provider needs work. Re-calibrate quarterly or after any major prompt change.
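One way to put a number on "converge" is a token-overlap comparison between the two plans. This is an illustrative sketch only, not how the repo performs its diff: Jaccard overlap is a swapped-in, deliberately crude similarity measure, the 0.5 threshold is an arbitrary assumption, and the sub-queries are invented.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a sub-query."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def plan_convergence(local_plan: list, production_plan: list, threshold: float = 0.5):
    """Share of production sub-queries with a close-enough local counterpart.

    The threshold is an assumption to tune against your own data.
    """
    unmatched = []
    for prod in production_plan:
        best = max((jaccard(loc, prod) for loc in local_plan), default=0.0)
        if best < threshold:
            unmatched.append(prod)
    matched = len(production_plan) - len(unmatched)
    score = matched / len(production_plan) if production_plan else 0.0
    return score, unmatched


# Made-up plans for the same prompt.
production_plan = [
    "mortgage calculator ratio explained",
    "debt to income ratio limits 2025",
    "best mortgage lenders for first time buyers",
]
local_plan = [
    "how mortgage calculator ratio works explained",
    "first time buyers best mortgage lenders",
    "mortgage closing costs",
]
score, unmatched = plan_convergence(local_plan, production_plan)
print(score, unmatched)
```

The unmatched production sub-queries are the calibration targets: they show what the production planner asks that your local planner does not. An embedding-based similarity would likely be a better fit than token overlap once paraphrases matter.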
Note: The local agent isn’t the production system. Gemma 4 E2B is the smallest variant; reranker quality and critic decisions improve materially with E4B (one-line model switch in .env). The retriever depends on SerpAPI, so brand visibility upstream is still a hard prerequisite. Pairwise verdicts on small models are directional, not authoritative. You should read the actual reasoning in section 3 of each run to judge confidence.
What this gives you that Part A can’t: the specific stage where your content falls out, your URL’s actual extracted passage compared to the winners, the reranker’s stated reasoning when you lost a head-to-head, and the specific sub-queries your topic neighborhood doesn’t yet cover. That’s the diagnostic baseline you turn into a content roadmap.
Finally, as with any open source code I share, we likely have an internal version that is more robust. You should look at this as a starting point, build your own solutions on top, and share them back with the community.
Get the audit pack and let’s talk
Classic SEO playbooks are obsolete. Single-shot RAG playbooks are obsolete. The brands that win in 2026 and beyond will run agentic-RAG-aware content engineering on top of distilled measurement infrastructure, and they will lock in citation gravity that compounds for years. The brands that don’t will spend the next two years arguing about why it’s just SEO and watching their citation counts keep going down.
Download the Part A Audit Sheet and, if you’re more technical, clone (and contribute to) the Part B distillation starter repo. And if you have not already, check out the AI Search Manual for the longer-form reference for much of what we’ve discussed in this article.
The retrieval-once playbook is over. The agentic loop is the new default. It’s time to build and analyze for it if we want to be serious about driving results.
This article was originally published on the iPullRank blog and is republished with permission.