Friday, March 14, 2025
With the robots.txt file, site owners have a simple way to control which parts of a website are accessible by crawlers. To help site owners further express how search engines and web crawlers can use their pages, the community involved in developing web standards came up with robots meta tags in 1996, just a few months after meta tags were proposed for HTML (and anecdotally, also before Google was founded). Later, X-Robots-Tag HTTP response headers were added. These instructions are sent together with a URL, so crawlers can only take them into account if they're not disallowed from crawling the URL through the robots.txt file. Together, they form the Robots Exclusion Protocol (REP).
Meta tags (or elements) are a way to include machine-readable metadata. Robots meta tags are one "kind" of meta tag, and apply to crawlers, including search engine crawlers. They signal: Is the content blocked from indexing? Should links on the page not be followed for crawling? It's easy to give this information on the page directly with robots meta tags.
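For instance, a minimal sketch of a page whose head carries a robots meta tag might look like this (the page itself is illustrative, not from the original post):

<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
    <!-- Illustrative: block indexing of this page and don't follow its links -->
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    ...
  </body>
</html>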
A Robots Exclusion Protocol for any URL
To give the same level of control to non-HTML content, the "X-Robots-Tag" HTTP response header was created. These HTTP headers are considered a part of the REP as well. The header supports the same values as the robots meta tag, and can be added to any piece of content served online. Besides HTML, Google supports it for content like PDFs, document files, and even images. Most of these file formats don't have an equivalent to meta tags, so an HTTP response header is helpful.
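As a sketch of how this might be set up, assuming an Apache server with mod_headers enabled (the file pattern and header value here are illustrative, not from the original post), a site owner could attach the header to all PDFs:

# Assumes Apache with mod_headers; matches any file ending in .pdf
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex"
</Files>

Other servers have their own equivalents; the point is that the header travels with the response, so no change to the file itself is needed.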
The syntax is simple and extensible. The rules are generally either implemented by the web developer or through a Content Management System (CMS), where site owners may have checkboxes or drop-down menus to select their preferences. These controls can address a specific crawler, such as Googlebot, or, by omitting a specific name, address all crawlers that support these values.
For example, the following rules tell all crawlers not to use the associated page for indexing:
- In form of an HTML meta tag, on a web page:
<meta name="robots" content="noindex">
- In form of an HTTP response header:

X-Robots-Tag: noindex

Looking at existing meta tags or response headers is a bit more involved, and requires direct inspection of page content or headers. You can view the HTML meta tags on any page either by looking at the page source in your browser, or by using Chrome's developer tools to inspect the page. You can check the HTTP response headers for individual URLs with Chrome's developer tools, in the network panel.
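If you prefer the command line, one way to check is to fetch just the headers with curl (a quick sketch; the URL is a placeholder):

# Fetch only the response headers and filter for the X-Robots-Tag header
curl -sI "https://www.example.com/report.pdf" | grep -i "x-robots-tag"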
Other examples of what you can do:
| Goal | How to specify it |
|---|---|
| Don't show a snippet for this page or document. | In HTTP header: X-Robots-Tag: nosnippet or in HTML: <meta name="robots" content="nosnippet"> |
| Don't index this page in ExampleBot-News, without specifying a preference for others. These controls explicitly specify one crawler. | X-Robots-Tag: examplebot-news: noindex or <meta name="examplebot-news" content="noindex"> |
| ExampleBot should not show a snippet, and additionally, all crawlers should not follow the links on this page. Note that the most restrictive, valid directives apply, so for ExampleBot the directives would be combined as "nosnippet, nofollow". | X-Robots-Tag: examplebot: nosnippet and X-Robots-Tag: nofollow or <meta name="examplebot" content="nosnippet"> and <meta name="robots" content="nofollow"> |
Choosing a REP mechanism
How do you choose which one to use? Fundamentally, robots.txt and page-level controls are similar, but not fully interchangeable. Sometimes there's a specific action that's only possible with one of the mechanisms, for example, if it's desired to stop the act of crawling (such as for endless search results pages, possible with robots.txt), if you need a control for an FTP server (possible with robots.txt), or if it's desired not to have a snippet shown for a page (which is only possible with page-level elements). If you don't need to differentiate between blocking crawling and blocking indexing, one approach is to use robots.txt for broader controls (to block large parts of a website), and page-level controls for blocking individual pages, as in the sketch below.
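To make the contrast concrete, here's a minimal sketch (the /search path is a placeholder, not from the original post). A robots.txt rule stops the act of crawling for a whole section:

# robots.txt: stop all crawlers from crawling endless search results pages
User-agent: *
Disallow: /search

While a page-level rule lets the page be crawled, but keeps it out of the index:

<meta name="robots" content="noindex">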
Robots Exclusion Protocol: a powerful, living standard
All of these controls are extensible by nature. Over the years, site owners, crawler operators, and search engines have worked together to evolve them. Historically, it started with a handful of values, including noindex and nofollow; later on, more values like nosnippet, noarchive, and max-snippet: were adopted. And sometimes values are deprecated, as was the case with noodp, which used snippets from DMOZ / Open Directory Project before the directory was closed. There's a plethora of values supported by Google for site owners, and a similar amount from other large crawler operators.
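For example, max-snippet: takes a number; a rule like the following (a brief sketch) asks crawlers that support it to show at most 20 characters of the page's content as a snippet:

<meta name="robots" content="max-snippet:20">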
Under the REP umbrella, site owners have control over what is crawled and how the crawled data is used in search engines. They can do this on a broad level for bigger parts of websites, or at a very granular level, for individual pages, even for images within pages. These controls are well-known, available in all common content management systems, broadly supported by commercial operators, and used on billions of hosts on the internet today.
Posted by John Mueller, Search Relations team