robots.txt Optimization

5.1 robots.txt Basics

robots.txt is a plain-text file placed at your website’s root (yourdomain.com/robots.txt) that tells crawlers which pages they may or may not access. All well-behaved crawlers — including AI agent crawlers — read this file before crawling your site.
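
The format is simple: each rule group begins with a User-agent line naming a crawler (or * for all crawlers), followed by Allow and Disallow path rules. A minimal sketch (yourdomain.com is a placeholder):

```
# Applies to every crawler without a more specific group
User-agent: *
Disallow: /admin/

# A named group overrides the * group for that crawler
User-agent: GPTBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

Note that a crawler obeys only the most specific group that matches its User-Agent token, not the * group plus its own group combined.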

5.2 Traditional Crawlers vs AI Crawlers

Between 2024 and 2026, a large number of new AI crawlers have emerged. They use different User-Agent strings from traditional search engine crawlers:
| Crawler | Operator | User-Agent | Purpose |
| --- | --- | --- | --- |
| Googlebot | Google | Googlebot | Traditional search indexing |
| Bingbot | Microsoft | bingbot | Traditional search indexing |
| ChatGPT-User | OpenAI | ChatGPT-User | ChatGPT real-time browsing |
| GPTBot | OpenAI | GPTBot | AI training and search |
| Claude-Web | Anthropic | Claude-Web | Claude real-time browsing |
| ClaudeBot | Anthropic | ClaudeBot | AI training |
| PerplexityBot | Perplexity | PerplexityBot | AI search engine |
| Applebot-Extended | Apple | Applebot-Extended | Apple Intelligence |
| Google-Extended | Google | Google-Extended | Gemini AI training |
| cohere-ai | Cohere | cohere-ai | AI training |
5.3 A Recommended Configuration for E-commerce

For e-commerce sites that want to maximize AI visibility:
# Traditional search engines — allow all
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

# AI agent browsing — allow (these agents recommend your products)
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

# AI training crawlers — decide based on your preference
# If you want AI models to learn about your brand (recommended):
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

# If you do not want your content used for training:
# User-agent: GPTBot
# Disallow: /

# Universal rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /api/

# Sitemap location
Sitemap: https://yourdomain.com/sitemap.xml

5.4 AI Training vs AI Browsing: An Important Distinction

| Type | Representative Crawlers | Purpose | Consequence of Blocking |
| --- | --- | --- | --- |
| AI Browsing | ChatGPT-User, Claude-Web | Real-time page fetching when users ask questions | AI agents cannot see your latest content |
| AI Training | GPTBot, Google-Extended | Crawling content to train AI models | AI knowledge bases will not include your information |
Recommendation: always allow AI browsing crawlers; otherwise AI agents cannot see your pages at the moment they would recommend you. Whether to allow AI training crawlers is a business decision, but allowing them generally means AI models build a better understanding of your brand and products.
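
Because browsing and training agents use distinct User-Agent tokens, blocking one does not affect the other. This can be verified with Python's standard-library robotparser (the rules string below is illustrative, not a real site's file):

```python
from urllib import robotparser

# Blocking the training crawler (GPTBot) does not block the
# browsing agent (ChatGPT-User): they match different groups.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://yourdomain.com/products/"))        # False
print(rp.can_fetch("ChatGPT-User", "https://yourdomain.com/products/"))  # True
```

ChatGPT-User falls through to the * group here, so it remains allowed even though GPTBot is fully blocked.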

5.5 robots.txt Management by Platform

Shopify

Shopify controls robots.txt through the theme file robots.txt.liquid:
  1. Online Store → Themes → Edit code
  2. Open robots.txt.liquid (if the template does not exist yet, create it via Add a new template → robots)
  3. Add the AI crawler rules you need

WordPress / WooCommerce

WordPress auto-generates robots.txt. Customize via:
  1. Yoast SEO: SEO → Tools → File editor
  2. Rank Math: General Settings → Edit robots.txt
  3. Manual: create a physical robots.txt file in the WordPress root directory (it overrides the auto-generated version)

Self-Hosted Sites

Simply create or edit the robots.txt file in your website’s root directory.

5.6 Common Mistakes

| Mistake | Consequence | Fix |
| --- | --- | --- |
| No robots.txt at all | All crawlers allowed by default (acceptable but unprofessional) | Create one |
| Disallow: / blocks everything | AI agents cannot see any of your pages | Disallow only private paths (admin, cart, checkout) |
| ChatGPT-User / Claude-Web blocked | AI agents cannot fetch real-time content when recommending you | Remove those rules |
| No Sitemap declaration | Crawlers may miss pages | Add a Sitemap: line |
| Syntax errors in robots.txt | Rules may not take effect | Validate with the robots.txt report in Google Search Console |

5.7 Verification

  1. Visit yourdomain.com/robots.txt and confirm the file exists with correct formatting
  2. Use the robots.txt report in Google Search Console to validate your rules (Google retired its standalone robots.txt Tester in 2023)
  3. Confirm that AI crawler User-Agents do not appear under any Disallow rules
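
Step 3 can also be automated with Python's standard-library urllib.robotparser. This is a sketch: the AI_AGENTS list simply mirrors the table in section 5.2 (it is not an official registry), and the probe URL is a placeholder.

```python
from urllib import robotparser

# Illustrative AI User-Agent tokens from section 5.2 -- extend as needed
AI_AGENTS = ["ChatGPT-User", "GPTBot", "Claude-Web", "ClaudeBot",
             "PerplexityBot", "Applebot-Extended", "Google-Extended"]

def blocked_agents(robots_txt, probe_url="https://yourdomain.com/"):
    """Return the AI agents that may NOT fetch probe_url under these rules."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [ua for ua in AI_AGENTS if not rp.can_fetch(ua, probe_url)]

sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""
print(blocked_agents(sample))  # ['GPTBot']
```

An empty result for your homepage and key product pages means none of the listed AI crawlers are locked out.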

Next chapter: Writing llms.txt — Your company brief for AI agents