Inside Googlebot: demystifying crawling, fetching, and the bytes we process

Tuesday, March 31, 2026

If you tuned into episode 105 of the Search Off the Record podcast, you might have heard us diving deep into a topic that is near to our hearts (and our servers): the inner workings of Googlebot.

For a long time, the name "Googlebot" has conjured up the image of a single, tireless robot systematically reading the internet. But the reality is a bit more complex, and a lot more interesting. Today, we want to pop the hood on our crawling infrastructure, with a special focus on the very thing that makes our own heads spin: byte size limits.

First, Googlebot isn't a single program

Let's clear up a historical misnomer first. Back in the early 2000s, Google had one product, so we had one crawler. The name "Googlebot" stuck. But today, Googlebot is just a user of something that resembles a centralized crawling platform.

When you see Googlebot in your server logs, you are just looking at Google Search. Dozens of other clients (Google Shopping, AdSense, and more) all route their crawl requests through this same underlying infrastructure under different crawler names, the larger ones documented on the Google Crawler infrastructure site.

The 2MB limit: what happens to your bytes?

This is where things get slightly confusing. Every client of the crawler infrastructure needs to configure some settings for its fetches. These settings include the user agent string, which user agent tokens it will look for in robots.txt, and how many bytes it will fetch from a single URL.

Googlebot currently fetches up to 2MB for any individual URL (excluding PDFs). This means it crawls only the first 2MB of a resource, including the HTTP header. For PDF files, the limit is 64MB.

Image and video crawlers typically have a wide range of threshold values, and it largely depends on the product that they're fetching for. For example, fetching a favicon might have a very low limit, unlike Image Search.

For any other crawler that doesn't specify a limit, the default is 15MB regardless of content type.
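To make the limits above easier to compare, here is a minimal sketch of how a per-client limit lookup could look. It is an illustration only, not how the crawling infrastructure is implemented: the function name `fetch_limit`, the client label `"googlebot"`, the `client_limit` parameter, and the reading of "MB" as 2**20 bytes are all assumptions made for this example.

```python
from typing import Optional

# Assumption: "MB" is read as 2**20 bytes; the post does not define the unit.
MB = 1024 * 1024

# Limits taken from the text above.
GOOGLEBOT_DEFAULT_LIMIT = 2 * MB        # any individual URL, excluding PDFs
GOOGLEBOT_PDF_LIMIT = 64 * MB           # PDF files
INFRASTRUCTURE_DEFAULT_LIMIT = 15 * MB  # crawlers that do not specify a limit


def fetch_limit(client: str, content_type: str,
                client_limit: Optional[int] = None) -> int:
    """Return the byte limit for a single fetch (hypothetical helper)."""
    if client == "googlebot":
        if content_type == "application/pdf":
            return GOOGLEBOT_PDF_LIMIT
        return GOOGLEBOT_DEFAULT_LIMIT
    # Other clients may set their own limit (for example, a low one for favicons).
    if client_limit is not None:
        return client_limit
    # No limit specified: the default applies regardless of content type.
    return INFRASTRUCTURE_DEFAULT_LIMIT
```

For example, `fetch_limit("googlebot", "text/html")` returns the 2MB value, while an unnamed client with no `client_limit` falls through to the 15MB default.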

What does this mean for the bytes your server sends over the wire?

Partial fetching: If your HTML file is larger than 2MB, Googlebot doesn't reject the page. Instead, it stops the fetch exactly at the 2MB cutoff. Note that the limit includes HTTP request headers.
Processing the cutoff: That downloaded portion (the first 2MB of bytes) is passed on to our indexing systems and the Web Rendering Service (WRS) as if it were the complete file.
The unseen bytes: Any bytes that exist after that 2MB threshold are entirely ignored. They aren't fetched, they aren't rendered, and they aren't indexed.
Bringing in resources: Every referenced resource in the HTML (excluding media, fonts, and a few exotic files) will be fetched by WRS with Googlebot like the parent HTML. They have their own, separate, per-URL byte counter and don't count towards the size of the parent page.
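The cutoff behavior described in the list above can be approximated locally. The following is a rough sketch under stated assumptions, not Googlebot's actual code: `simulate_fetch` is a hypothetical helper, header bytes are passed in explicitly because the post says they count against the limit, and "2MB" is read as 2 * 2**20 bytes.

```python
# Assumption: "2MB" is read as 2 * 2**20 bytes; the post does not define the unit.
MB = 1024 * 1024
GOOGLEBOT_LIMIT = 2 * MB


def simulate_fetch(header_bytes: bytes, body: bytes,
                   limit: int = GOOGLEBOT_LIMIT) -> bytes:
    """Return the part of the body that fits once headers are counted.

    Hypothetical helper: anything past the cutoff is dropped, and the
    remainder is treated as if it were the complete file.
    """
    remaining = max(limit - len(header_bytes), 0)
    return body[:remaining]


# Example: content placed after 2MB of filler does not survive the cutoff.
headers = b"x" * 100  # stand-in for 100 bytes of headers
body = b"a" * GOOGLEBOT_LIMIT + b"<p>late content</p>"
kept = simulate_fetch(headers, body)
late_content_survives = b"late content" in kept
```

Each referenced resource would get its own call with a fresh budget, which mirrors the separate per-URL byte counter described above.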

For the vast majority of the web, a 2MB HTML payload is massive, and you will never hit this limit. However, if your page includes bloated inline base64 images, massive blocks of inline CSS/JavaScript, or starts with megabytes of menus, you could accidentally push your actual textual content or critical structured data past the 2MB mark. If those important bytes aren't fetched, to Googlebot, they simply don't exist.
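To see why inline base64 images are a common way to use up the budget, it helps to do the arithmetic: standard base64 turns every 3 raw bytes into 4 text characters, so inlining an image costs roughly a third more bytes than the image itself. The helper below is a small sketch; `base64_encoded_size` is a name chosen for this example, and "MB" is again assumed to mean 2**20 bytes.

```python
import base64

# Assumption: "MB" is read as 2**20 bytes; the post does not define the unit.
MB = 1024 * 1024


def base64_encoded_size(raw_bytes: int) -> int:
    """Length in characters of standard, padded base64 for raw_bytes bytes."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


# Sanity check against the standard library encoder.
sample = b"x" * 1000
matches_stdlib = len(base64.b64encode(sample)) == base64_encoded_size(len(sample))

# A 1.5MB image, once inlined, occupies the full 2MB on its own.
inlined_size = base64_encoded_size(3 * MB // 2)
```

Under these assumptions a single 1.5MB inline image already fills the whole 2MB budget before any markup or text is counted.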

Rendering the bytes

Once the crawler has successfully retrieved the bytes (up to the limit), it passes the baton to the WRS. The WRS processes JavaScript and executes client-side code similar to a modern browser to understand the final visual and textual state of the page. Rendering pulls in and executes JavaScript and CSS files, and processes XHR requests to better understand the page's textual content and structure (it doesn't request images or videos). For each requested resource, the 2MB limit also applies.

However, remember that the WRS can only execute the code that the crawler actually retrieved. Furthermore, the WRS operates statelessly: it clears local storage and session data between requests. This may have particular implications for how dynamic, JavaScript-dependent elements are interpreted by our systems.

Best practices for your bytes

To ensure Googlebot can efficiently fetch and understand your content, keep these byte-level best practices in mind:

Keep your HTML lean: Move heavy CSS and JavaScript to external files. While the initial HTML document is capped at 2MB, external scripts and stylesheets are fetched separately (subject to their own limits).
Order matters: Place your most critical elements (like meta tags, <title> elements, <link> elements, canonicals, and essential structured data) higher up in the HTML document. This ensures they are unlikely to be found below the cutoff.
Monitor your server logs: Keep an eye on your server response times. If your server is struggling to serve bytes, our crawlers will automatically back off to avoid overloading your infrastructure, which will drop your crawl frequency.
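One way to act on the "order matters" advice is to check where critical elements sit in the raw HTML bytes. The sketch below is an illustrative audit, not an official tool: `audit_markers`, the marker list, and the `header_allowance` parameter are hypothetical, the search is a plain byte match rather than real HTML parsing, and "2MB" is assumed to mean 2 * 2**20 bytes.

```python
from typing import Dict, Sequence

# Assumption: "2MB" is read as 2 * 2**20 bytes; the post does not define the unit.
MB = 1024 * 1024
GOOGLEBOT_LIMIT = 2 * MB

# Hypothetical byte patterns standing in for critical elements.
DEFAULT_MARKERS = (b"<title", b'rel="canonical"', b"application/ld+json")


def audit_markers(html: bytes,
                  markers: Sequence[bytes] = DEFAULT_MARKERS,
                  limit: int = GOOGLEBOT_LIMIT,
                  header_allowance: int = 0) -> Dict[str, str]:
    """Report whether the first occurrence of each marker fits in the budget."""
    budget = limit - header_allowance  # bytes left for the body
    report = {}
    for marker in markers:
        offset = html.find(marker)
        if offset == -1:
            report[marker.decode()] = "missing"
        elif offset + len(marker) <= budget:
            report[marker.decode()] = f"ok at byte {offset}"
        else:
            report[marker.decode()] = f"past cutoff at byte {offset}"
    return report


# Example: a canonical link pushed below 2MB of filler.
page = (b"<html><head><title>t</title>"
        + b" " * GOOGLEBOT_LIMIT
        + b'<link rel="canonical" href="/x">')
report = audit_markers(page)
```

To run it on a saved page, read the file in binary mode so that offsets are counted in bytes rather than characters, then pass the result to `audit_markers`.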

Note that this limit is not set in stone and may change over time as the web evolves and HTML pages grow in size. (Or shrink. Hopefully shrink.)

Crawling isn't magic; it's a highly orchestrated, scaled exchange of bytes. By understanding how our central fetching infrastructure retrieves and limits those bytes, you can ensure your site's most important content always makes the cut.

Happy optimizing!

Want to hear more behind-the-scenes details? Check out Episode 105 of the Search Off the Record podcast on YouTube or wherever you get your podcasts!

Posted by Gary.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
