AI Crawlers

PerfectSearch gives you complete visibility into how AI systems interact with your website. Track which crawlers visit, how often they request pages, and whether they are scraping data for model training or fetching content in real time for conversational AI responses. Use built-in access control presets to block, allow, or rate-limit each crawler individually.

What AI crawlers does PerfectSearch track?

PerfectSearch identifies and tracks twelve major AI crawlers across training, retrieval, and hybrid categories. Each crawler is detected by its User-Agent string and classified automatically, so you can see exactly which AI systems are accessing your content without any configuration.

Crawler             Operator        Type
GPTBot              OpenAI          Training
anthropic-ai        Anthropic       Training
Google-Extended     Google          Training
Bytespider          ByteDance       Training
CCBot               Common Crawl    Training
cohere-ai           Cohere          Training
meta-externalagent  Meta            Training
ChatGPT-User        OpenAI          Retrieval
ClaudeBot           Anthropic       Retrieval
Claude-Web          Anthropic       Retrieval
PerplexityBot       Perplexity      Retrieval
Applebot-Extended   Apple           Hybrid
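The classification above can be sketched in code. This is an illustrative model only: the crawler names and types come from the table, but the substring match against the User-Agent header is an assumed heuristic, not PerfectSearch's actual detection logic.

```python
# Known AI crawlers and their types, as listed in the table above.
AI_CRAWLERS = {
    "GPTBot": "Training",
    "anthropic-ai": "Training",
    "Google-Extended": "Training",
    "Bytespider": "Training",
    "CCBot": "Training",
    "cohere-ai": "Training",
    "meta-externalagent": "Training",
    "ChatGPT-User": "Retrieval",
    "ClaudeBot": "Retrieval",
    "Claude-Web": "Retrieval",
    "PerplexityBot": "Retrieval",
    "Applebot-Extended": "Hybrid",
}

def classify_user_agent(user_agent: str):
    """Return (crawler_name, type) for a known AI crawler, else None."""
    ua = user_agent.lower()
    for name, kind in AI_CRAWLERS.items():
        if name.lower() in ua:
            return name, kind
    return None  # not a known AI crawler

print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))
# → ('GPTBot', 'Training')
```

A regular search-engine crawler such as Googlebot matches none of the entries and falls through to None, which is why ordinary SEO traffic never shows up in the AI Crawlers view.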

What is the difference between training and retrieval crawlers?

Training crawlers scrape your site to collect data that is used to train or fine-tune large language models. Retrieval crawlers fetch your content in real time when a user asks a conversational AI a question, so your pages can appear as cited sources in AI answers. Hybrid crawlers perform both functions depending on the context.

Training crawlers include GPTBot, anthropic-ai, Google-Extended, Bytespider, CCBot, cohere-ai, and meta-externalagent. These crawlers typically make large, systematic crawls across your entire site. Blocking them prevents your content from being used in future model training, but has no immediate effect on AI-generated answers since the models may have already been trained on previously crawled data.

Retrieval crawlers include ChatGPT-User, ClaudeBot, Claude-Web, and PerplexityBot. These crawlers visit your site when a user asks a question and the AI decides to look up current information. Allowing retrieval crawlers means your content can appear as a cited source in AI chat responses, driving referral traffic back to your site.

Hybrid crawlers like Applebot-Extended collect data both for training Apple's AI models and for powering real-time features like Siri and Apple Intelligence. You may want to evaluate these on a case-by-case basis, weighing the training data contribution against the visibility benefits.

How do I view AI crawler analytics?

Navigate to your site in the PerfectSearch dashboard and open the AI Crawlers tab. This view displays total request volume, request trends over time, your most-crawled pages, and a breakdown of requests by individual crawler. Use the date range picker to focus on specific time periods.

The overview panel at the top shows aggregate metrics: total AI crawler requests in the selected period, the percentage change compared to the previous period, and how many unique pages were visited. Below that, a time-series chart breaks down daily request volume by crawler, color-coded so you can quickly identify which crawlers are most active.

The Top Pages table lists the most frequently crawled URLs, the number of requests each received, and which crawlers visited them. This is useful for understanding which content AI systems find most valuable and ensuring those pages have up-to-date snapshots.
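The overview metrics and the Top Pages table described above are straightforward aggregations over raw request records. The sketch below assumes a simple (date, crawler, path) record shape for illustration; PerfectSearch computes these figures server-side from its own log format.

```python
from collections import Counter
from datetime import date

# Hypothetical request log: (date, crawler, path) per request.
requests = [
    (date(2024, 5, 1), "GPTBot", "/docs/intro"),
    (date(2024, 5, 2), "ClaudeBot", "/docs/intro"),
    (date(2024, 5, 2), "GPTBot", "/pricing"),
    (date(2024, 4, 28), "GPTBot", "/docs/intro"),  # previous period
]

def overview(records, start, end):
    """Aggregate metrics for [start, end], compared to the preceding window of equal length."""
    period = [r for r in records if start <= r[0] <= end]
    span = end - start
    prev = [r for r in records if start - span <= r[0] < start]
    change = (len(period) - len(prev)) / len(prev) * 100 if prev else None
    return {
        "total_requests": len(period),
        "pct_change": change,
        "unique_pages": len({r[2] for r in period}),
        "top_pages": Counter(r[2] for r in period).most_common(5),
    }

print(overview(requests, date(2024, 5, 1), date(2024, 5, 7)))
```

The percentage change compares the selected window against the immediately preceding window of the same length, which is the usual convention for "vs. previous period" deltas.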

Advanced analytics, including per-crawler trend charts, path-level filtering, and historical comparison views, are available on the Growth plan and above.

How do I block AI training crawlers?

Go to your site in the PerfectSearch dashboard, open the Access Control tab, and apply the “block-training” preset. This instantly creates rules that block all known training crawlers while continuing to serve snapshots to retrieval crawlers, so your content stays visible in AI chat responses.

The block-training preset creates individual block rules for GPTBot, anthropic-ai, Google-Extended, Bytespider, CCBot, cohere-ai, and meta-externalagent. Each rule is set to return a 403 Forbidden response. Retrieval crawlers like ChatGPT-User, ClaudeBot, and PerplexityBot are left unaffected.
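The preset's behavior per request reduces to a simple lookup: if the detected crawler has a block rule, serve 403; otherwise serve the page. The rule representation below is an assumption for illustration, not PerfectSearch's internal schema.

```python
# The seven training crawlers targeted by the block-training preset.
TRAINING_CRAWLERS = {
    "GPTBot", "anthropic-ai", "Google-Extended", "Bytespider",
    "CCBot", "cohere-ai", "meta-externalagent",
}

def decide(crawler_name: str, block_rules: set) -> int:
    """Return the HTTP status to serve: 403 if a block rule matches, else 200."""
    return 403 if crawler_name in block_rules else 200

# Applying the preset creates one block rule per training crawler:
rules = set(TRAINING_CRAWLERS)
print(decide("GPTBot", rules))     # → 403 (training crawler blocked)
print(decide("ClaudeBot", rules))  # → 200 (retrieval crawler served)
```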

If you prefer granular control, you can create individual rules instead of using a preset. For example, you might want to block GPTBot and Bytespider but allow anthropic-ai. See the Access Control documentation for details on creating custom rules.

What are crawler presets?

Crawler presets are one-click rule sets that configure access control for common scenarios. PerfectSearch includes three presets: block-training, allow-retrieval-only, and block-all. Each preset creates multiple access control rules at once, saving you from configuring each crawler individually.

  • block-training — Blocks all seven training crawlers while allowing the four retrieval crawlers and Applebot-Extended. This is the most popular preset because it prevents your data from being used for model training while maintaining visibility in AI chat products.
  • allow-retrieval-only — Same behavior as block-training but expressed as explicit allow rules for retrieval crawlers plus a default-deny for everything else. Use this if you prefer a whitelist approach.
  • block-all — Blocks all twelve AI crawlers, both training and retrieval. Your site will not be crawled by any AI system. Use this only if you want complete AI opt-out; note that this means your content will not appear in any AI-generated answers.

Presets are accessible from the Access Control tab in your site dashboard. Applying a preset does not delete existing rules — it adds the preset rules alongside them. You can remove individual preset rules afterward if needed.
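Conceptually, applying a preset expands its name into a list of individual rules and appends them to whatever rules already exist. The rule dictionaries and the default-deny wildcard below are assumed shapes for illustration; the counts follow the preset descriptions above.

```python
TRAINING = ["GPTBot", "anthropic-ai", "Google-Extended", "Bytespider",
            "CCBot", "cohere-ai", "meta-externalagent"]
RETRIEVAL = ["ChatGPT-User", "ClaudeBot", "Claude-Web", "PerplexityBot"]

def expand_preset(name: str) -> list:
    """Expand a preset name into individual access-control rules."""
    if name == "block-training":
        return [{"crawler": c, "action": "block"} for c in TRAINING]
    if name == "allow-retrieval-only":
        # Explicit allows plus a default-deny for everything else.
        return ([{"crawler": c, "action": "allow"} for c in RETRIEVAL]
                + [{"crawler": "*", "action": "block"}])
    if name == "block-all":
        return [{"crawler": c, "action": "block"}
                for c in TRAINING + RETRIEVAL + ["Applebot-Extended"]]
    raise ValueError(f"unknown preset: {name}")

# Presets append to existing rules rather than replacing them:
existing_rules = [{"crawler": "Bytespider", "action": "block"}]
existing_rules += expand_preset("block-training")
print(len(existing_rules))  # → 8 (1 existing + 7 from the preset)
```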

Can I export crawler data?

Yes, you can export AI crawler analytics data as a CSV file directly from the AI Crawlers tab. The export includes all requests in the selected date range with columns for timestamp, crawler name, crawler type, requested path, response status, and format served. CSV export is available on the Growth plan and above.

To export, select your desired date range on the AI Crawlers tab, then click the “Export CSV” button in the top-right corner. The file is generated server-side and downloaded to your browser. For large date ranges with hundreds of thousands of requests, the export may take a few seconds to prepare.

Exported data is useful for custom analysis, compliance reporting, or sharing AI crawler activity with stakeholders. You can import the CSV into spreadsheet tools, data visualization platforms, or your own analytics pipeline.
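The export can be summarized with nothing more than the standard library. The column names below mirror the fields listed above, but the exact header spelling in the real export is an assumption.

```python
import csv
import io
from collections import Counter

# Stand-in for an exported file; headers are assumed.
sample = """timestamp,crawler,type,path,status,format
2024-05-01T09:12:00Z,GPTBot,training,/docs/intro,403,none
2024-05-01T09:14:00Z,ClaudeBot,retrieval,/docs/intro,200,html
2024-05-01T10:02:00Z,ClaudeBot,retrieval,/pricing,200,html
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Requests per crawler, and how many were blocked with a 403.
by_crawler = Counter(row["crawler"] for row in rows)
blocked = sum(1 for row in rows if row["status"] == "403")

print(by_crawler.most_common())  # → [('ClaudeBot', 2), ('GPTBot', 1)]
print(blocked)                   # → 1
```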

Retrieval crawlers boost your AI visibility

Retrieval crawlers like ChatGPT-User, ClaudeBot, and PerplexityBot are how your content gets cited in AI-generated answers. When a user asks an AI assistant a question, these crawlers fetch your pages in real time and the AI can reference your content with a link back to your site. Blocking retrieval crawlers removes your site from AI-powered search results entirely. For most sites, the recommended strategy is to block training crawlers while allowing retrieval crawlers: this prevents your data from training future models while maximizing your visibility in AI answers today.