
AI Crawler robots.txt: How I Stopped Being Invisible to ChatGPT

The exact robots.txt configuration I run on this site, the Cloudflare default that's silently blocking over a million websites from AI search, and the difference between training-time bots and retrieval bots that most photographers and small business owners don't realize matters.

If you skip this 30-minute setup, none of the other AI search visibility work matters. Cloudflare blocks AI crawlers by default for over one million customer websites. Every bot needs an explicit allow rule. The difference between training-time bots and retrieval bots is the difference between being read by AI and being invisible. The reason a corporate photographer is the one writing this post is that the photographs I shoot for clients are useless if AI search bots cannot reach the pages those photos live on. This post is the technical foundation for the AI search visibility playbook. You should not write a single FAQ answer or rewrite a single landing page until your bots are configured correctly.

The Cloudflare gotcha that breaks one million sites

In July 2025, Cloudflare rolled out a default setting, originally branded AI Audit and now surfaced in the dashboard as AI Crawl Control, that blocks AI crawlers from reaching customer websites unless the site owner explicitly allows them. Per Cloudflare's own Radar reporting and coverage in major SEO publications across 2025-2026, the default applies to over one million customer sites, most of whose owners have no idea it is there.

If you use Cloudflare, log into your dashboard and navigate to Security > Bots > AI Crawl Control. You will see a list of identified AI crawlers with toggle switches, and many will be set to Block by default: ChatGPT-User, PerplexityBot, ClaudeBot, OAI-SearchBot, GPTBot, Google-Extended, Applebot-Extended, and several others.

If those bots are blocked, your site is invisible to AI search. The robots.txt file does not matter. The schema markup does not matter. The 134-word FAQ answers you wrote do not matter. The bot reaches Cloudflare's edge, gets a 403 response, and the AI never reads a single word of your site. Cloudflare is enforcing the block at the network level before the request ever reaches your server.
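If you want to probe this from outside the dashboard, a rough check is to request your homepage with each AI user-agent string and look at the status code. This is only a sketch: Cloudflare can also verify crawler IP ranges, so a 200 for a spoofed agent does not prove the real bot gets through, but a 403 is a strong signal of an edge-level block. The domain is a placeholder, and it assumes Node 18+ (browsers won't let you set User-Agent).

```
// check-edge-block.ts (run with: npx tsx check-edge-block.ts)
// Requests the homepage as each AI crawler UA and reports the HTTP status.
const agents = ['GPTBot', 'ChatGPT-User', 'OAI-SearchBot', 'ClaudeBot', 'PerplexityBot'];

for (const ua of agents) {
  const res = await fetch('https://yourdomain.com/', {
    headers: { 'User-Agent': ua }, // Node lets you override the UA; browsers do not
  });
  // 403 here usually means a block at the edge, before the request reaches your server.
  console.log(`${ua}: HTTP ${res.status}`);
}
```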

Set every retrieval-based AI crawler to Allow. The training-time crawlers (the ones that scrape content for future model training) are a separate decision discussed below. The retrieval-time crawlers (the ones that fetch your content in real time when someone asks an AI a question) must be allowed if you want AI citations.

If you don't use Cloudflare, you are not affected by this default. But check your hosting provider, your CDN, and any security plugins (Wordfence, Sucuri, MalCare on WordPress sites) for similar AI-blocking defaults. They are increasingly common.

[Image: Confident professional headshot, St. Louis. The kind of real-photograph asset AI retrieval bots crawl when robots.txt allows them in.]

Training bots versus retrieval bots

Most robots.txt advice on the internet treats AI crawlers as a single category. They are not. There are two distinct types, and the trade-offs are different.

Training-time bots crawl your content to feed it into the next version of an AI model. GPTBot, ClaudeBot (in its training capacity), CCBot, Google-Extended, Applebot-Extended, and a handful of others fall into this category. If you allow them, your content potentially becomes training data for the next model release. If you block them, your content does not enter the training corpus, but the trade-off is that future model versions will know less about your business than they otherwise would have.

Retrieval-time bots crawl your content in real time when a user asks the AI a question. ChatGPT-User, OAI-SearchBot (the dedicated ChatGPT search retrieval bot), PerplexityBot (in retrieval mode), Claude-Web, GoogleOther, and similar agents are retrieval bots. They fetch your page when an AI is generating a live answer that may cite your site. If you block them, you are invisible to live AI search. Period. There is no upside to blocking retrieval bots if you want AI search visibility.

The distinction matters because some businesses have legitimate reasons to block training bots (proprietary content, paywall models, competitive intelligence) but no reason to block retrieval bots. The two categories are separate user agents and require separate rules in robots.txt. Blocking GPTBot does not block ChatGPT-User. Allowing ChatGPT-User does not allow GPTBot. Configure them independently.
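To make that independence concrete, here is a minimal robots.txt fragment (same format as the full file in the next section) that keeps your content out of OpenAI's training corpus while staying visible to OpenAI's live retrieval agents:

```
# Training bot: blocked
User-agent: GPTBot
Disallow: /

# Retrieval bots: allowed (these fetch pages for live ChatGPT answers)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
```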

My exact robots.txt for this site

Here is the configuration I run on henrydavidphotography.com. The rules are explicit, comprehensive, and cover every AI bot worth tracking in May 2026. The format below is the standard robots.txt format that any web server can serve. My implementation uses Next.js's `MetadataRoute.Robots` type in `src/app/robots.ts`, but the underlying output is the same plain robots.txt file.

```
User-agent: *
Allow: /
Disallow: /api/
Disallow: /image-review.html

# OpenAI training bot
User-agent: GPTBot
Allow: /

# OpenAI ChatGPT live retrieval (when a user asks ChatGPT a question)
User-agent: ChatGPT-User
Allow: /

# OpenAI ChatGPT search dedicated retrieval bot
User-agent: OAI-SearchBot
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Apple Intelligence
User-agent: Applebot-Extended
Allow: /

# Google's AI training corpus (separate from regular Googlebot)
User-agent: Google-Extended
Allow: /

# Google's other crawlers (including AI Overviews retrieval)
User-agent: GoogleOther
Allow: /

# Common Crawl (training data for many open models)
User-agent: CCBot
Allow: /

# Cohere
User-agent: cohere-ai
Allow: /

# Amazon
User-agent: Amazonbot
Allow: /

# Meta (Llama training)
User-agent: Meta-ExternalAgent
Allow: /

# SEO research crawlers (I want to see my own data)
User-agent: AhrefsBot
Allow: /

User-agent: AhrefsSiteAudit
Allow: /

Sitemap: https://www.henrydavidphotography.com/sitemap.xml
```

A few details worth flagging. The first block (`User-agent: *`) sets the default for every bot not explicitly named. I disallow `/api/` (no public-facing API endpoints worth indexing) and `/image-review.html` (an internal review page for tethered shooting). Every named AI bot below gets its own explicit group with its own Allow rule. Under the robots exclusion protocol, a crawler obeys the most specific group matching its user agent and ignores the `*` defaults entirely, so spelling out each bot removes any ambiguity about what it inherits. (One side effect worth knowing: because a named bot ignores the `*` group, these bots are not bound by the `/api/` disallow either.)
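For anyone else on Next.js, here is a condensed sketch of what a `src/app/robots.ts` emitting this file looks like. `MetadataRoute.Robots` is Next.js's real metadata shape, but the bot list here is abridged and my actual file may differ in detail; extend the `rules` array with one entry per bot from the template above.

```
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Default group: everything except the API and the internal review page
      { userAgent: '*', allow: '/', disallow: ['/api/', '/image-review.html'] },
      // One explicit group per AI bot (abridged; mirror the full robots.txt)
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'ChatGPT-User', allow: '/' },
      { userAgent: 'OAI-SearchBot', allow: '/' },
      { userAgent: 'ClaudeBot', allow: '/' },
      { userAgent: 'PerplexityBot', allow: '/' },
    ],
    sitemap: 'https://www.henrydavidphotography.com/sitemap.xml',
  };
}
```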

The two Ahrefs entries (`AhrefsBot` and `AhrefsSiteAudit`) are not AI bots, but I include them so I can see my own SEO data in Ahrefs. If you do not pay for Ahrefs, you can omit them.

The `Sitemap:` line at the bottom points to my sitemap.xml. AI bots use this as a discovery shortcut. If you have a sitemap and don't reference it from robots.txt, you are making the bots work harder than they need to. They may still find it via standard discovery, but the explicit reference is faster and more reliable.

You can verify my live file at any time at henrydavidphotography.com/robots.txt. The whole point of being citable is that the proof is public.

How to test your robots.txt right now

Three free tools verify whether your robots.txt is configured correctly for AI search.

Direct check. Open `yourdomain.com/robots.txt` in a browser. If you see your rules, the file is being served correctly. If you see a 404 or a redirect, the file is missing or misconfigured.

Google Search Console robots.txt tester. In Search Console, navigate to Settings > Crawling > robots.txt. Paste a URL from your site and select a Googlebot variant (try Google-Extended for the AI training bot). The tool tells you whether that specific URL is allowed or blocked for that specific bot.

Cloudflare AI Crawl Control dashboard. If you use Cloudflare, the AI Crawl Control screen (formerly AI Audit) shows you which AI bots are currently allowed versus blocked at the network level. This is the most important screen in Cloudflare for AI visibility, and most site owners have never opened it.

Do all three checks. The robots.txt itself can be perfect while Cloudflare blocks every bot at the edge. The configurations have to align. If they don't, the most permissive one loses to the most restrictive one.
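If you'd rather script the direct check, the sketch below does a crude presence scan of your live robots.txt. It only confirms each bot is named at all (the most common failure), not that the group's rules are right, and `yourdomain.com` is again a placeholder.

```
// check-robots.ts (run with: npx tsx check-robots.ts)
const bots = [
  'GPTBot', 'ChatGPT-User', 'OAI-SearchBot', 'ClaudeBot', 'Claude-Web',
  'PerplexityBot', 'Google-Extended', 'GoogleOther', 'Applebot-Extended',
];

const res = await fetch('https://yourdomain.com/robots.txt');
if (!res.ok) throw new Error(`robots.txt returned HTTP ${res.status}`);
const txt = await res.text();

for (const bot of bots) {
  // Field names are case-insensitive per the spec, hence the 'i' flag.
  const named = new RegExp(`^user-agent:\\s*${bot}\\s*$`, 'im').test(txt);
  console.log(`${bot}: ${named ? 'named' : 'NOT NAMED (falls back to *)'}`);
}
```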

Want the full AI-Visual Branding Package?

Robots.txt is the foundation. The shoot day, the multi-modal pages, the FAQ schema, and the citation tracking are the rest of the engagement.

See the package

What I changed and when

My robots.ts file went through three iterations over the last two months as I tightened the AI bot allowlist. The Git history is public:

The initial Next.js implementation allowed Googlebot and a generic `*` allow rule. Adequate for traditional SEO, invisible to AI search.

A mid-March commit added GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Applebot-Extended, Google-Extended, CCBot, and a handful of others. This was the first version that actually let AI retrieval bots in.

Commit `e90ca5e` (April 2026) added explicit allows for AhrefsBot and AhrefsSiteAudit so I could see my own backlink and crawl data. Per the commit message, that single change was driven by recognizing that I needed visibility into how my own site was being crawled before I could audit the rest of the work.

The current configuration shipped at the start of May 2026 and adds Claude-Web (Anthropic's separate live-retrieval agent), GoogleOther (Google's catch-all for non-Googlebot crawlers including some AI Overviews retrieval), cohere-ai (Cohere's training and retrieval bot), Amazonbot (Amazon's crawler, relevant if you want visibility in Alexa-related AI), and Meta-ExternalAgent (Meta's Llama-related crawler).

The iteration matters because the AI bot landscape is moving fast. New bots ship every quarter. Existing bots split into multiple agents (the GPTBot / ChatGPT-User / OAI-SearchBot trio is one example). I update my robots.ts every time a credible AI vendor publishes a new bot identification, which is roughly monthly right now.

[Image: AI search crawler activity map illustrating which bots reach a small business website after robots.txt is configured correctly.]

What about the privacy or competitive arguments for blocking AI?

Some service businesses are wary of letting AI crawl their content. The concerns I hear most:

"AI will steal my content." AI systems quote and cite content; they don't replace your site. Per the conversion data in my AI search visibility pillar, AI-referred traffic converts at roughly 5x the rate of traditional Google organic. Being cited drives qualified traffic. Being invisible drives nothing.

"My pricing is in my content. I don't want competitors learning my pricing through AI." Your competitors can already read your website. AI is not a meaningfully different threat than a human competitor browsing your site. If you are uncomfortable having pricing publicly indexed, the issue is your pricing transparency strategy, not the AI bots.

"I don't want my content used for training." This is a legitimate concern, and it is why training bots and retrieval bots are separate categories. You can block GPTBot (training) and allow ChatGPT-User and OAI-SearchBot (retrieval). Same for Anthropic, Google, and Common Crawl. You stay out of the training corpus while remaining visible in live AI answers.

The configuration I run allows both training and retrieval bots because, as a service business doing visible work, I benefit from being part of how the next generation of AI models understands my industry. For a competitive-content business (proprietary research, paid newsletters, paywall content), the training-bot block is reasonable. The retrieval-bot block is rarely the right choice for any business that depends on being found.

Common configuration mistakes

Four patterns show up consistently in robots.txt audits I've done for client sites and competitors.

Blocking everything by accident. A misplaced `Disallow: /` under a `User-agent: *` block hides the entire site from every crawler. Often happens after a migration when a placeholder rule wasn't removed. Check your live robots.txt at `yourdomain.com/robots.txt` and verify the first block doesn't have a top-level `Disallow: /`.
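For clarity, this is the placeholder shape to look for; one stray line under the wildcard group and the whole site disappears from every crawler:

```
# Leftover staging rule: hides the entire site from all bots
User-agent: *
Disallow: /
```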

Treating GPTBot and ChatGPT-User as the same bot. They are different user agents. Allowing one does not allow the other. If you want full ChatGPT visibility, you need both, plus OAI-SearchBot for ChatGPT's dedicated search retrieval.

WordPress "Discourage search engines" toggle. WordPress's Settings > Reading > "Discourage search engines from indexing this site" injects a `Disallow: /` into the auto-generated robots.txt and adds a noindex meta tag site-wide. If you ever toggled this on while your site was in development, double-check it is now off.

Security plugins overriding your rules. Wordfence, Sucuri, iThemes Security, and similar plugins sometimes inject their own robots.txt rules that override what's in your file. Test your live robots.txt and verify the rules you wrote are actually being served.

What to do today

Three actions, roughly 20 minutes total.

One: open `yourdomain.com/robots.txt`. Verify the AI bots you want allowed have explicit Allow rules. If you are using my full template above, copy and paste it. If you want a more conservative setup that allows retrieval bots but blocks training bots, use the same template and change the `Allow: /` to `Disallow: /` for GPTBot, Google-Extended, Applebot-Extended, CCBot, cohere-ai, and Meta-ExternalAgent. Leave ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, and GoogleOther on `Allow`. The resulting AI-bot groups are written out below.
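For reference, here is the conservative variant's AI-bot section in full, applying exactly those swaps to my template (the `User-agent: *` block, Ahrefs entries, and `Sitemap:` line carry over unchanged):

```
# Retrieval bots: allowed (required for AI search visibility)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GoogleOther
Allow: /

# Training bots: blocked (content stays out of training corpora)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
```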

Two: if you use Cloudflare, log into your dashboard and check Security > Bots > AI Crawl Control. Make sure every retrieval bot is set to Allow. This is the step that breaks one million sites silently. Don't skip it.

Three: open Google Search Console's robots.txt tester, paste a real URL from your site, and test it against Google-Extended and GoogleOther. Confirm both come back as Allowed.

If you want me to look at your robots.txt and Cloudflare bot settings as part of a wider engagement, get in touch and we'll talk it through.

Topics

AI crawler robots.txt · GPTBot robots.txt · ClaudeBot allow · PerplexityBot robots.txt · Cloudflare AI bot block · OAI-SearchBot · Google-Extended robots.txt · ChatGPT-User · AI bot allowlist

Want a robots.txt audit for AI search visibility?

We're happy to discuss anything covered in this article, or your specific photography and video needs.

Get a Quote