Robots.txt just turned 30 – cue the existential crisis! Like many hitting the big 3-0, it’s wondering whether it’s still relevant in today’s world of AI and advanced search algorithms.
Spoiler alert: It definitely is!
Let’s take a look at how this file still plays a key role in managing how search engines crawl your site, how to leverage it correctly, and common pitfalls to avoid.
What Is A Robots.txt File?
A robots.txt file provides crawlers like Googlebot and Bingbot with guidelines for crawling your site. Like a map or directory at the entrance of a museum, it acts as a set of instructions at the entrance of the website, including details on:
- What crawlers are/aren’t allowed to enter?
- Any restricted areas (pages) that shouldn’t be crawled.
- Priority pages to crawl – via the XML sitemap declaration.
Its primary role is to manage crawler access to certain areas of a website by specifying which parts of the site are “off-limits.” This helps ensure that crawlers focus on the most relevant content rather than wasting the crawl budget on low-value content.
While a robots.txt file guides crawlers, it’s important to note that not all bots follow its instructions, especially malicious ones. But for most legitimate search engines, adhering to the robots.txt directives is standard practice.
What Is Included In A Robots.txt File?
Robots.txt files consist of lines of directives for search engine crawlers and other bots.
Valid lines in a robots.txt file consist of a field, a colon, and a value.
Robots.txt files also commonly include blank lines to improve readability and comments to help site owners keep track of directives.
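For illustration, a minimal sketch of these building blocks might look like the following (the path and sitemap URL are hypothetical placeholders):
#Comments start with a hash and are ignored by crawlers
user-agent: *
disallow: /example-path

sitemap: https://www.example.com/sitemap.xml
Each directive line follows the field, colon, value pattern, while the comment and blank line are there purely for human readers.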
To get a better understanding of what is typically included in a robots.txt file and how different sites leverage it, I looked at the robots.txt files for 60 domains with a high share of voice across health, financial services, retail, and high-tech.
Excluding comments and blank lines, the average number of lines across the 60 robots.txt files was 152.
Large publishers and aggregators, such as hotels.com, forbes.com, and nytimes.com, typically had longer files, while hospitals like pennmedicine.org and hopkinsmedicine.com typically had shorter files. Retail sites’ robots.txt files usually fall close to the average of 152.
All sites analyzed include the user-agent and disallow fields within their robots.txt files, and 77% of sites included a sitemap declaration using the sitemap field.
Fields leveraged less frequently were allow (used by 60% of sites) and crawl-delay (used by 20% of sites).
| Field | % Of Sites Leveraging |
|---|---|
| user-agent | 100% |
| disallow | 100% |
| sitemap | 77% |
| allow | 60% |
| crawl-delay | 20% |
Robots.txt Syntax
Now that we’ve covered what types of fields are typically included in a robots.txt file, we can dive deeper into what each one means and how to use it.
For more information on robots.txt syntax and how it is interpreted by Google, check out Google’s robots.txt documentation.
User-Agent
The user-agent field specifies what crawler the directives (disallow, allow) apply to. You can use the user-agent field to create rules that apply to specific bots/crawlers, or use a wildcard to indicate rules that apply to all crawlers.
For example, the below syntax indicates that any of the following directives only apply to Googlebot.
user-agent: Googlebot
If you want to create rules that apply to all crawlers, you can use a wildcard instead of naming a specific crawler.
user-agent: *
You can include multiple user-agent fields within your robots.txt to provide specific rules for different crawlers or groups of crawlers, for example:
user-agent: *
#Rules here would apply to all crawlers
user-agent: Googlebot
#Rules here would only apply to Googlebot
user-agent: otherbot1
user-agent: otherbot2
user-agent: otherbot3
#Rules here would apply to otherbot1, otherbot2, and otherbot3
Disallow And Allow
The disallow field specifies paths that designated crawlers should not access. The allow field specifies paths that designated crawlers can access.
Because Googlebot and other crawlers will assume they can access any URLs that aren’t specifically disallowed, many sites keep it simple and only specify which paths should not be accessed using the disallow field.
For example, the below syntax would tell all crawlers not to access URLs matching the path /do-not-enter.
user-agent: *
disallow: /do-not-enter
#All crawlers are blocked from crawling pages with the path /do-not-enter
If you’re using both allow and disallow fields within your robots.txt, make sure to read the section on order of precedence for rules in Google’s documentation.
Generally, in the case of conflicting rules, Google will use the more specific rule.
For example, in the below case, Google won’t crawl pages with the path /do-not-enter because the disallow rule is more specific than the allow rule.
user-agent: *
allow: /
disallow: /do-not-enter
If neither rule is more specific, Google will default to using the less restrictive rule.
In the example below, Google would crawl pages with the path /do-not-enter because the allow rule is less restrictive than the disallow rule.
user-agent: *
allow: /do-not-enter
disallow: /do-not-enter
Note that if no path is specified for the allow or disallow fields, the rule will be ignored.
user-agent: *
disallow:
This is very different from only including a forward slash (/) as the value for the disallow field, which would match the root domain and any lower-level URL (translation: every page on your site).
If you want your website to show up in search results, make sure you don’t have the following code. It will block all search engines from crawling all pages on your site.
user-agent: *
disallow: /
This might seem obvious, but believe me, I’ve seen it happen.
URL Paths
URL paths are the portion of the URL after the protocol, subdomain, and domain, beginning with a forward slash (/). For the example URL https://www.example.com/guides/technical/robots-txt, the path would be /guides/technical/robots-txt.

URL paths are case-sensitive, so be sure to double-check that the use of capital and lower-case letters in the robots.txt aligns with the intended URL path.
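As a quick, hypothetical illustration, the rule below matches /Guides but not /guides – so if the live URLs use lowercase, they would still be crawled:
user-agent: *
disallow: /Guides
#Matches paths starting with /Guides (capital G) only – the lowercase path /guides is unaffected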
Special Characters
Google, Bing, and other major search engines also support a limited number of special characters to help match URL paths.
A special character is a symbol that has a unique function or meaning instead of just representing a regular letter or number. Special characters supported by Google in robots.txt are:
- Asterisk (*) – matches 0 or more instances of any character.
- Dollar sign ($) – designates the end of the URL.
To illustrate how these special characters work, assume we have a small site with the following URLs:
- https://www.example.com/
- https://www.example.com/search
- https://www.example.com/guides
- https://www.example.com/guides/technical
- https://www.example.com/guides/technical/robots-txt
- https://www.example.com/guides/technical/robots-txt.pdf
- https://www.example.com/guides/technical/xml-sitemaps
- https://www.example.com/guides/technical/xml-sitemaps.pdf
- https://www.example.com/guides/content
- https://www.example.com/guides/content/on-page-optimization
- https://www.example.com/guides/content/on-page-optimization.pdf
Example Scenario 1: Block Site Search Results
A common use of robots.txt is to block internal site search results, as these pages typically aren’t valuable for organic search results.
For this example, assume that when users conduct a search on https://www.example.com/search, their query is appended to the URL.
If a user searched “xml sitemap guide,” the new URL for the search results page would be https://www.example.com/search?search-query=xml-sitemap-guide.
When you specify a URL path in the robots.txt, it matches any URLs with that path, not just the exact URL. So, to block both of the URLs above, using a wildcard isn’t necessary.
The following rule would match both https://www.example.com/search and https://www.example.com/search?search-query=xml-sitemap-guide.
user-agent: *
disallow: /search
#All crawlers are blocked from crawling pages with the path /search
If a wildcard (*) were added, the results would be the same.
user-agent: *
disallow: /search*
#All crawlers are blocked from crawling pages with the path /search
Example Scenario 2: Block PDF Files
In some cases, you may want to use the robots.txt file to block specific types of files.
Imagine the site decided to create PDF versions of each guide to make it easy for users to print. The result is two URLs with exactly the same content, so the site owner may want to block search engines from crawling the PDF versions of each guide.
In this case, using a wildcard (*) would be helpful to match URLs where the path starts with /guides/ and ends with .pdf, but the characters in between vary.
user-agent: *
disallow: /guides/*.pdf
#All crawlers are blocked from crawling pages with URL paths that contain: /guides/, 0 or more instances of any character, and .pdf
The above directive would prevent search engines from crawling the following URLs:
- https://www.example.com/guides/technical/robots-txt.pdf
- https://www.example.com/guides/technical/xml-sitemaps.pdf
- https://www.example.com/guides/content/on-page-optimization.pdf
Example Scenario 3: Block Category Pages
For the last example, assume the site created category pages for technical and content guides to make it easier for users to browse content in the future.
However, since the site only has three guides published right now, these pages aren’t providing much value to users or search engines.
The site owner may want to temporarily prevent search engines from crawling the category pages only (e.g., https://www.example.com/guides/technical), not the guides within the category (e.g., https://www.example.com/guides/technical/robots-txt).
To accomplish this, we can leverage “$” to designate the end of the URL path.
user-agent: *
disallow: /guides/technical$
disallow: /guides/content$
#All crawlers are blocked from crawling pages with URL paths that end with /guides/technical and /guides/content
The above syntax would prevent the following URLs from being crawled:
- https://www.example.com/guides/technical
- https://www.example.com/guides/content
While allowing search engines to crawl:
- https://www.example.com/guides/technical/robots-txt
- https://www.example.com/guides/content/on-page-optimization
Sitemap
The sitemap field is used to provide search engines with a link to one or more XML sitemaps.
While not required, it’s a best practice to include XML sitemaps within the robots.txt file to provide search engines with a list of priority URLs to crawl.
The value of the sitemap field should be an absolute URL (e.g., https://www.example.com/sitemap.xml), not a relative URL (e.g., /sitemap.xml). If you have multiple XML sitemaps, you can include multiple sitemap fields.
Example robots.txt with a single XML sitemap:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap.xml
Example robots.txt with multiple XML sitemaps:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap-1.xml
sitemap: https://www.example.com/sitemap-2.xml
sitemap: https://www.example.com/sitemap-3.xml
Crawl-Delay
As mentioned above, 20% of sites also include the crawl-delay field within their robots.txt file.
The crawl-delay field tells bots how fast they can crawl the site and is typically used to slow down crawling to avoid overloading servers.
The value for crawl-delay is the number of seconds crawlers should wait before requesting a new page. The below rule would tell the specified crawler to wait five seconds after each request before requesting another URL.
user-agent: FastCrawlingBot
crawl-delay: 5
Google has stated that it does not support the crawl-delay field, and it will be ignored.
Other major search engines like Bing and Yahoo respect crawl-delay directives for their web crawlers.
| Search Engine | Primary user-agent for search | Respects crawl-delay? |
|---|---|---|
| Google | Googlebot | No |
| Bing | Bingbot | Yes |
| Yahoo | Slurp | Yes |
| Yandex | YandexBot | Yes |
| Baidu | Baiduspider | No |
Sites most commonly include crawl-delay directives for all user agents (using user-agent: *), for the search engine crawlers mentioned above that respect crawl-delay, and for crawlers from SEO tools like AhrefsBot and SemrushBot.
The number of seconds crawlers were instructed to wait before requesting another URL ranged from one second to 20 seconds, but crawl-delay values of five seconds and 10 seconds were the most common across the 60 sites analyzed.
Testing Robots.txt Files
Any time you’re creating or updating a robots.txt file, make sure to test directives, syntax, and structure before publishing.
This robots.txt Validator and Testing Tool makes this easy to do (thanks, Max Prin!).
To test a live robots.txt file, simply:
- Add the URL you want to test.
- Select your user agent.
- Choose “live.”
- Click “test.”
The below example shows that Googlebot smartphone is allowed to crawl the tested URL.

If the tested URL is blocked, the tool will highlight the specific rule that prevents the selected user agent from crawling it.

To test new rules before they are published, switch to “Editor” and paste your rules into the text box before testing.
Common Uses Of A Robots.txt File
While what is included in a robots.txt file varies greatly by website, analyzing 60 robots.txt files revealed some commonalities in how it is leveraged and what types of content webmasters commonly block search engines from crawling.
Preventing Search Engines From Crawling Low-Value Content
Many websites, especially large ones like ecommerce or content-heavy platforms, often generate “low-value pages” as a byproduct of features designed to improve the user experience.
For example, internal search pages and faceted navigation options (filters and sorts) help users find what they’re looking for quickly and easily.
While these features are essential for usability, they can result in duplicate or low-value URLs that aren’t valuable for search.
The robots.txt is commonly leveraged to block these low-value pages from being crawled.
Common types of content blocked via the robots.txt include (an illustrative set of rules follows this list):
- Parameterized URLs: URLs with tracking parameters, session IDs, or other dynamic variables are blocked because they often lead to the same content, which can create duplicate content issues and waste the crawl budget. Blocking these URLs ensures search engines only index the primary, clean URL.
- Filters and sorts: Blocking filter and sort URLs (e.g., product pages sorted by price or filtered by category) helps avoid indexing multiple versions of the same page. This reduces the risk of duplicate content and keeps search engines focused on the most important version of the page.
- Internal search results: Internal search result pages are often blocked because they generate content that doesn’t offer unique value. If a user’s search query is injected into the URL, page content, and meta elements, sites might even risk some inappropriate, user-generated content getting crawled and indexed (see the sample screenshot in this post by Matt Tutt). Blocking them prevents this low-quality – and potentially inappropriate – content from appearing in search.
- User profiles: Profile pages may be blocked to protect privacy, reduce the crawling of low-value pages, or keep the focus on more important content, like product pages or blog posts.
- Testing, staging, or development environments: Staging, development, or test environments are often blocked to ensure that private content isn’t crawled by search engines.
- Campaign sub-folders: Landing pages created for paid media campaigns are often blocked when they aren’t relevant to a broader search audience (i.e., a direct mail landing page that prompts users to enter a redemption code).
- Checkout and confirmation pages: Checkout pages are blocked to prevent users from landing on them directly via search engines, enhancing user experience and protecting sensitive information during the transaction process.
- User-generated and sponsored content: Sponsored content or user-generated content created via reviews, questions, comments, etc., is often blocked from being crawled by search engines.
- Media files (images, videos): Media files are sometimes blocked from being crawled to conserve bandwidth and reduce the visibility of proprietary content in search engines. This ensures that only relevant web pages, not standalone files, appear in search results.
- APIs: APIs are often blocked to prevent them from being crawled or indexed because they’re designed for machine-to-machine communication, not for end-user search results. Blocking APIs protects their usage and reduces unnecessary server load from bots trying to access them.
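To make these patterns concrete, the sketch below shows what a few of these rules might look like. The paths and the parameter name are hypothetical placeholders – the actual values depend on how a given site structures its URLs:
user-agent: *
#Block URLs containing a hypothetical session ID parameter
disallow: /*?*sessionid=
#Block internal search results
disallow: /search
#Block checkout and confirmation pages
disallow: /checkout
#Block API endpoints
disallow: /api/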
Blocking “Bad” Bots
Bad bots are web crawlers that engage in unwanted or malicious activities such as scraping content and, in extreme cases, looking for vulnerabilities to steal sensitive information.
Other bots without any malicious intent may still be considered “bad” if they flood websites with too many requests, overloading servers.
Additionally, webmasters may simply not want certain crawlers accessing their site because they don’t stand to gain anything from it.
For example, you may choose to block Baidu if you don’t serve customers in China and don’t want to risk requests from Baidu impacting your server.
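If a site chose to do that, the rule would be a simple disallow for Baidu’s primary crawler (Baiduspider, per the table above):
user-agent: Baiduspider
disallow: /
#Blocks Baidu's crawler from accessing any content on the site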
Though some of these “bad” bots may disregard the instructions outlined in a robots.txt file, websites still commonly include rules to disallow them.
Out of the 60 robots.txt files analyzed, 100% disallowed at least one user agent from accessing all content on the site (via disallow: /).
Blocking AI Crawlers
Across the sites analyzed, the most blocked crawler was GPTBot, with 23% of sites blocking GPTBot from crawling any content on the site.
Originality.ai’s live dashboard, which tracks how many of the top 1,000 websites are blocking specific AI web crawlers, found similar results, with 27% of the top 1,000 sites blocking GPTBot as of November 2024.
Reasons for blocking AI web crawlers may vary – from concerns over data control and privacy to simply not wanting your data used in AI training models without compensation.
The decision on whether or not to block AI bots via the robots.txt should be evaluated on a case-by-case basis.
If you don’t want your site’s content to be used to train AI but also want to maximize visibility, you’re in luck. OpenAI is transparent about how it uses GPTBot and other web crawlers.
At a minimum, sites should consider allowing OAI-SearchBot, which is used to feature and link to websites in SearchGPT – ChatGPT’s recently launched real-time search feature.
Blocking OAI-SearchBot is far less common than blocking GPTBot, with only 2.9% of the top 1,000 sites blocking the SearchGPT-focused crawler.
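As a rough sketch, a site taking that approach might pair the two rules like this – blocking GPTBot from all content while explicitly leaving OAI-SearchBot unrestricted:
user-agent: GPTBot
disallow: /
#Blocks OpenAI's training-focused crawler from all content

user-agent: OAI-SearchBot
allow: /
#Explicitly allows the crawler that surfaces and links to sites in SearchGPT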
Getting Creative
In addition to being an important tool for controlling how web crawlers access your site, the robots.txt file can also be an opportunity for sites to show their “creative” side.
While sifting through files from over 60 sites, I also came across some delightful surprises, like the playful illustrations hidden in the comments of Marriott’s and Cloudflare’s robots.txt files.


Multiple companies are even turning these files into unique recruitment tools.
TripAdvisor’s robots.txt doubles as a job posting with a clever message included in the comments:
“If you’re sniffing around this file, and you’re not a robot, we’re looking to meet curious people such as yourself…
Run – don’t crawl – to apply to join TripAdvisor’s elite SEO team[.]”
If you’re looking for a new career opportunity, you might want to consider browsing robots.txt files in addition to LinkedIn.
How To Audit Robots.txt
Auditing your robots.txt file is an essential part of most technical SEO audits.
Conducting a thorough robots.txt audit ensures that your file is optimized to enhance site visibility without inadvertently restricting important pages.
To audit your robots.txt file:
- Crawl the site using your preferred crawler. (I typically use Screaming Frog, but any web crawler should do the trick.)
- Filter the crawl for any pages flagged as “blocked by robots.txt.” In Screaming Frog, you can find this information by going to the Response Codes tab and filtering by “blocked by robots.txt.”
- Review the list of URLs blocked by the robots.txt to determine whether they should be blocked. Refer to the above list of common types of content blocked by robots.txt to help you determine whether the blocked URLs should be accessible to search engines.
- Open your robots.txt file and conduct additional checks to make sure it follows the SEO best practices (and avoids the common pitfalls) detailed below.

Robots.txt Best Practices (And Pitfalls To Avoid)
The robots.txt is a powerful tool when used effectively, but there are some common pitfalls to steer clear of if you don’t want to harm the site unintentionally.
The following best practices will help set you up for success and avoid unintentionally blocking search engines from crawling important content (a sketch pulling several of these together follows this list):
- Create a robots.txt file for each subdomain. Each subdomain on your site (e.g., blog.yoursite.com, shop.yoursite.com) should have its own robots.txt file to manage crawling rules specific to that subdomain. Search engines treat subdomains as separate sites, so a unique file ensures proper control over what content is crawled or indexed.
- Don’t block important pages on the site. Make sure priority content, such as product and service pages, contact information, and blog content, is accessible to search engines. Additionally, make sure that blocked pages aren’t preventing search engines from accessing links to content you want crawled and indexed.
- Don’t block essential resources. Blocking JavaScript (JS), CSS, or image files can prevent search engines from rendering your site correctly. Ensure that important resources required for the site to display properly are not disallowed.
- Include a sitemap reference. Always include a reference to your sitemap in the robots.txt file. This makes it easier for search engines to locate and crawl your important pages more efficiently.
- Don’t only allow specific bots to access your site. If you disallow all bots from crawling your site, except for specific search engines like Googlebot and Bingbot, you may unintentionally block bots that could benefit your site. Example bots include:
- FacebookExternalHit – used to get Open Graph protocol data.
- GooglebotNews – used for the News tab in Google Search and the Google News app.
- AdsBot-Google – used to check webpage ad quality.
- Don’t block URLs that you want removed from the index. Blocking a URL in robots.txt only prevents search engines from crawling it, not from indexing it if the URL is already known. To remove pages from the index, use other methods like the “noindex” tag or URL removal tools, ensuring they’re properly excluded from search results.
- Don’t block Google and other major search engines from crawling your entire site. Just don’t do it.
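Pulling several of these practices together, a minimal, hypothetical robots.txt might look something like the below – the disallowed paths and sitemap URL are placeholders, not recommendations for any specific site:
user-agent: *
#Block only low-value sections; leave key pages, JS, CSS, and images crawlable
disallow: /search
disallow: /checkout

#Reference the XML sitemap with an absolute URL
sitemap: https://www.example.com/sitemap.xml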
TL;DR
- A robots.txt file guides search engine crawlers on which areas of a website to access or avoid, optimizing crawl efficiency by focusing on high-value pages.
- Key fields include “User-agent” to specify the target crawler, “Disallow” for restricted areas, and “Sitemap” for priority pages. The file can also include directives like “Allow” and “Crawl-delay.”
- Websites commonly leverage robots.txt to block internal search results, low-value pages (e.g., filters, sort options), or sensitive areas like checkout pages and APIs.
- An increasing number of websites are blocking AI crawlers like GPTBot, though this might not be the best strategy for sites looking to gain traffic from additional sources. To maximize site visibility, consider allowing OAI-SearchBot at a minimum.
- To set your website up for success, ensure each subdomain has its own robots.txt file, test directives before publishing, include an XML sitemap declaration, and avoid accidentally blocking key content.
Featured Image: Se_vector/Shutterstock