Your robots.txt may be blocking AI crawlers without you knowing. Learn which AI bots exist, what they do, and how to configure access for AI visibility.

Your robots.txt file was written for Googlebot. In 2026, it is also the first thing that GPTBot, ClaudeBot, PerplexityBot, and a growing list of AI crawlers check before deciding whether to read your site. If it blocks them — which many default configurations do — your business is invisible to the AI systems that an increasing number of your potential customers use to make decisions.
This is the second article in our series on AI discoverability. The first covered llms.txt — a file that tells AI systems what your site is about. This one covers the file that determines whether AI systems can access your site at all.
robots.txt is a plain text file at the root of every website (e.g., yourdomain.com/robots.txt) that tells web crawlers which parts of the site they can and cannot access. It has been a de facto web standard since 1994 and was formalised as an IETF standard (RFC 9309) in 2022; it was originally designed for search engine crawlers like Googlebot and Bingbot.
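In its simplest form, the file pairs a User-agent line naming a crawler with Allow and Disallow rules for paths. A minimal sketch (the paths and the admin directory here are illustrative, not a recommendation):

```text
# Default rules: every crawler may read everything except the admin area
User-agent: *
Disallow: /admin/

# A specific crawler can be given its own rule group
User-agent: GPTBot
Allow: /
```

A crawler looks for the most specific group that matches its name and falls back to the wildcard group if none does.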
AI companies now operate their own crawlers that check robots.txt before accessing a site. If your robots.txt blocks these crawlers — either explicitly or through a broad wildcard rule — your content will not be included in AI training data, will not appear in AI-powered search answers, and will not be cited when someone asks ChatGPT, Claude, Perplexity, or Gemini a question that your website could answer.
The critical point: robots.txt is a request, not enforcement. Well-behaved crawlers respect it, but it does not physically prevent access. Its power lies in the fact that the major AI companies have publicly committed to honouring it — making it the primary control mechanism for AI visibility.
The AI crawler landscape has evolved rapidly. Most major AI companies now operate multiple crawlers, each serving a different purpose. Understanding the distinction is essential because blocking or allowing them requires different decisions depending on your goals.
OpenAI operates three crawlers. GPTBot collects content for model training — the data that shapes what ChatGPT knows about the world. OAI-SearchBot indexes content for ChatGPT's search features, determining whether your site appears when ChatGPT searches the web to answer a question. ChatGPT-User fetches pages in real time when a user asks ChatGPT to browse a specific URL.
Blocking GPTBot prevents your content from being used in future training, but does not remove existing knowledge. Blocking OAI-SearchBot means your site will not appear in ChatGPT search answers. OpenAI has noted that ChatGPT-User, the user-initiated fetcher, may not be fully governed by robots.txt.
Anthropic recently expanded to a three-bot system mirroring OpenAI's approach. ClaudeBot handles training data collection. Claude-SearchBot indexes content for Claude's search features. Claude-User fetches pages when a user requests it during a conversation.
Blocking ClaudeBot stops training data collection but does not affect Claude-SearchBot or Claude-User. Anthropic states that blocking Claude-SearchBot "may reduce" visibility in Claude's search-powered answers.
Google-Extended controls whether your content is used for Gemini and Google's AI features. Crucially, it is separate from Googlebot. Blocking Google-Extended does not affect your traditional Google Search rankings — it only controls whether your content feeds Gemini and AI Overviews.
PerplexityBot indexes content for Perplexity's AI search engine. Perplexity-User handles real-time retrieval. Perplexity's compliance with robots.txt has been publicly contested — Cloudflare documented cases where Perplexity used undeclared crawlers to access sites that had blocked PerplexityBot.
Applebot-Extended powers Apple Intelligence features. CCBot crawls for the Common Crawl dataset, an open repository used by multiple AI systems for training. Amazonbot supports Alexa and Amazon's AI features.
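Putting the roster together, an allow-by-default configuration for the crawlers named above could look like the following sketch. Under RFC 9309, consecutive User-agent lines share the rule group that follows them, so the bots can be grouped by purpose:

```text
# AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
Allow: /

# AI search and real-time retrieval crawlers
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Amazonbot
Allow: /
```

The bot names match those published by the respective vendors; check their documentation before copying, as the list changes.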
The most common problem is not a deliberate decision to block AI crawlers. It is an accidental one.
Many websites use a broad rule like User-agent: * / Disallow: / to block all unrecognised crawlers, then add specific Allow rules for Googlebot and Bingbot. Before 2023, this was a reasonable security measure. In 2026, it means every AI crawler that is not explicitly allowed is blocked. If you did not add GPTBot, ClaudeBot, and PerplexityBot to your allow list, they are being turned away — silently, without any notification.
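You can test whether a given set of rules turns a specific crawler away using Python's standard-library robots.txt parser. The rules below reproduce the accidental-block pattern described above: a wildcard Disallow with an explicit carve-out for Googlebot only.

```python
from urllib.robotparser import RobotFileParser

# The common "block everyone, allow the bots we know" pattern
rules = """
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its explicit group; every AI crawler
# falls through to the wildcard group and is blocked.
for bot in ["Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot"]:
    print(bot, parser.can_fetch(bot, "/"))
```

Running this prints True for Googlebot and False for the three AI crawlers, even though no rule mentions them by name.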
Some content management systems and security plugins ship with default robots.txt configurations that block unknown crawlers. These defaults were written before AI crawlers existed and have not been updated. The result is that businesses running on these platforms are invisible to AI systems without anyone having made a conscious decision about it.
Even when robots.txt is correctly configured, Cloudflare, AWS WAF, or other security layers may block AI crawlers at the network level before they ever fetch robots.txt. If you have configured robots.txt to allow AI crawlers but still do not see them in your server logs, check your CDN and firewall rules.
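One way to confirm crawlers are reaching your origin is to scan your access logs for their user-agent strings. A minimal sketch, assuming logs in the common Apache/Nginx format; the helper name and sample lines are illustrative:

```python
# User-agent substrings that identify the major AI crawlers
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-SearchBot", "Claude-User", "Google-Extended",
           "PerplexityBot", "Applebot-Extended", "CCBot", "Amazonbot"]

def ai_crawler_hits(log_lines):
    """Count access-log lines per AI crawler user-agent substring."""
    counts = {bot: 0 for bot in AI_BOTS}
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                counts[bot] += 1
    return {bot: n for bot, n in counts.items() if n}

# Illustrative sample lines in common log format
sample = [
    '203.0.113.5 - - [10/Jan/2026:12:00:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [10/Jan/2026:12:01:00 +0000] "GET / HTTP/1.1" '
    '200 10240 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"',
    '192.0.2.9 - - [10/Jan/2026:12:02:00 +0000] "GET / HTTP/1.1" '
    '200 10240 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]
print(ai_crawler_hits(sample))  # only GPTBot and ClaudeBot appear
```

If a crawler you have allowed in robots.txt never shows up over a reasonable window, something upstream is likely rejecting it.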
The decision about which AI crawlers to allow depends on your business model and how your content relates to your revenue.
For most businesses — including banks, fintechs, SaaS companies, professional services firms, and e-commerce brands — maximum AI visibility is the right default. Your content exists to attract customers and build authority. Being cited in AI-generated answers drives awareness and trust. If your institution is already spending significantly on API-based AI services, being invisible to the same AI systems your customers use is a double cost. Allow all major AI crawlers.
For publishers, research firms, and data providers whose content is their primary revenue source, the calculation is different. Allowing training crawlers means your content becomes part of the model's knowledge — potentially reducing the need for users to visit your site. In this case, a selective approach makes sense: block training crawlers (GPTBot, ClaudeBot) but allow search and retrieval crawlers (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot) so your site still appears in AI search answers with attribution.
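A selective configuration along those lines might look like the following sketch (bot names as published by the vendors; adapt the lists to your own position):

```text
# Block training crawlers: keep content out of future model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

# Allow search and retrieval crawlers: stay citable in AI search answers
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
Allow: /
```

Because each named group overrides the wildcard, this split works regardless of what your default User-agent: * rules say.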
The shift from single-crawler to multi-crawler systems by OpenAI and Anthropic created a new decision layer. You can now separately control whether your content is used for training (shaping what the model knows permanently) and whether it is used for retrieval (appearing in real-time search answers with citations). This distinction is the most important development in AI crawl management in the past year — and for financial institutions evaluating their broader AI strategy, it adds another dimension to the build-vs-buy decision.
robots.txt is the foundation layer. If AI crawlers cannot access your site, nothing else matters — not your llms.txt file, not your schema markup, not your content quality. It is the prerequisite that must be in place before any other AI discoverability effort can take effect. For financial institutions already navigating DORA and EU AI Act compliance, AI crawler configuration is one more layer that requires deliberate attention rather than defaults.
But configuring it correctly is not as simple as copying a template. It requires understanding which crawlers are relevant to your business, how your CMS and hosting infrastructure interact with crawler access, whether your CDN or WAF is interfering, and how your robots.txt rules interact with each other — because specificity, ordering, and wildcard behaviour all affect the outcome.
Getting this wrong does not produce an error message. It produces silence — your business simply does not appear in AI-generated answers, and you have no way of knowing unless you actively check.
If you want to understand whether AI crawlers can currently access your site and what they see when they get there, we can run a quick check.
Stop renting generic models. Start building specialized AI that runs on your infrastructure, knows your business, and stays under your control.