LLM crawlers from ChatGPT, Claude, Gemini, and Perplexity are visiting your marketing site every day, and most B2B teams cannot see them in GA4 because client-side tagging filters bots aggressively. This guide is the behavioral intelligence implementation playbook for surfacing AI-crawler traffic: user-agent regex, server-side detection, custom dimensions, and the segments that show which pages AI engines are actually consuming.
Why LLM crawler traffic is invisible in default GA4
The AI engines that buyers now query — ChatGPT, Claude, Gemini, Perplexity, and the Google AI Overviews retrieval stack — visit B2B marketing sites in two ways. Some crawl on a schedule to refresh their training or retrieval indexes; others fetch live when a user asks a question that requires fresh content. Both patterns leave fingerprints in your server logs, and both are systematically filtered out of Google Analytics 4 by default.
The default filter is the right call for most analytics use cases. GA4 strips bot traffic so the human-buyer report is not contaminated by automated visits. The problem on a B2B marketing site in 2026 is that the "bots" include the engines that are actively consuming your content to answer prospect queries. A team that cannot see ChatGPT fetching its pricing page cannot answer the question every B2B marketer is now being asked: which of our pages are AI engines reading, and what content are they citing back to buyers?
The framework below is the implementation playbook we ship for AI-crawler visibility on B2B marketing sites: user-agent regex (the front-line detection), server-side capture (the only place ad blockers and ITP do not reach), custom dimensions in GA4 (so the data is filterable in the standard reports), and the segment-and-explore patterns that turn raw crawler events into actionable behavioral intelligence.
The LLM crawlers actually visiting your site
The 2026 B2B marketing site is visited by at least eight distinct AI agents on a weekly basis. Knowing which one is which matters because their behavior is different.
- OpenAI's GPTBot. User agent contains GPTBot. This is the training crawler; it respects robots.txt and indexes pages OpenAI uses to update model knowledge.
- OpenAI's OAI-SearchBot. User agent contains OAI-SearchBot. This is the live search retrieval crawler used when ChatGPT browses to answer a query in real time. Different bot, different respect rules, different freshness implications.
- OpenAI's ChatGPT-User. User agent contains ChatGPT-User. This fires when a user explicitly asks ChatGPT to fetch a specific URL on their behalf. Lower volume, higher intent — the user has named your URL.
- Anthropic's ClaudeBot. User agent contains ClaudeBot. The Anthropic crawler analog to GPTBot.
- Anthropic's anthropic-ai. User agent contains anthropic-ai. Live retrieval for Claude when it browses.
- Google's Google-Extended. Not a separate user-agent string: Google crawls with its existing user agents, and Google-Extended is the robots.txt control token that governs whether those fetches may be used for AI training as opposed to Search indexing. Treat it as a policy lever rather than a string you will reliably match in server logs.
- PerplexityBot. User agent contains PerplexityBot. Perplexity's general crawler.
- Perplexity-User. User agent contains Perplexity-User. Live fetch when a Perplexity user's question requires reading a specific URL.
The list grows. Meta's Meta-ExternalAgent, Apple's Applebot-Extended, and category-specific crawlers (Bytespider for ByteDance, You.com's YouBot) all have their own user-agent strings. The implementation discipline is to maintain the regex as a living artifact, not a one-time configuration, because new crawlers ship faster than the analytics tooling tracks them.
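The training-versus-live distinction above drives the segment design later in the guide, so it is worth capturing in data rather than prose. A minimal sketch (the groupings follow the descriptions above; the classifyUserAgent helper and the "training"/"live" labels are illustrative, not an official taxonomy):

```javascript
// Illustrative lookup: crawler user-agent fragment → provider and crawl type.
// "training" = scheduled index/refresh crawls; "live" = real-time fetches
// made on a user's behalf.
const LLM_CRAWLERS = {
  "GPTBot":          { provider: "OpenAI",     type: "training" },
  "OAI-SearchBot":   { provider: "OpenAI",     type: "live" },
  "ChatGPT-User":    { provider: "OpenAI",     type: "live" },
  "ClaudeBot":       { provider: "Anthropic",  type: "training" },
  "anthropic-ai":    { provider: "Anthropic",  type: "live" },
  "PerplexityBot":   { provider: "Perplexity", type: "training" },
  "Perplexity-User": { provider: "Perplexity", type: "live" },
};

// Classify a raw user-agent string against the lookup.
function classifyUserAgent(ua) {
  for (const fragment of Object.keys(LLM_CRAWLERS)) {
    if (ua.includes(fragment)) {
      return { crawler_name: fragment, ...LLM_CRAWLERS[fragment] };
    }
  }
  return null; // not a known LLM crawler
}
```

A lookup like this can sit beside the detection regex and feed the crawler-name and traffic-type parameters downstream as the list grows.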
Step-by-step: setting up LLM crawler detection in GA4
This is the implementation walkthrough. Six steps, each with the exact action, the location in the tools, the values to configure, and how to verify the step worked before moving to the next one. Allow 24-48 hours after step 6 for GA4 to begin populating the custom-dimension data — GA4's registration of a custom dimension is not retroactive, so events fired before registration will not appear in those reports.
Before you start, you need:
- Edit access to a GA4 property (admin or editor role).
- Edit access to a Google Tag Manager web container, plus permission to provision a server-side container (or an existing server container).
- A subdomain you can point at the tagging server (typically tags.yourdomain.com) and DNS access to set the CNAME.
- A Google Cloud Platform project (or equivalent runtime) for the server container; auto-provisioning during setup is fine.
Step 1 — Build the LLM crawler regex
What to do: assemble the user-agent fragment regex you will use to identify AI crawlers. The minimum-viable pattern as of 2026:
(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|anthropic-ai|Google-Extended|PerplexityBot|Perplexity-User|Applebot-Extended|Meta-ExternalAgent|Bytespider|YouBot)
Why these strings: each fragment matches the official user-agent published by the AI provider. OpenAI's bot documentation lists GPTBot, OAI-SearchBot, and ChatGPT-User; Anthropic's crawler documentation lists ClaudeBot, Claude-User, and Claude-SearchBot.
Where it lives: save the regex as a Tag Manager constant variable named llm_crawler_pattern so it can be referenced from triggers and tag templates without copy-pasting the pattern around. Verify: run a few published crawler user agents through the pattern in a regex tester and confirm each returns the expected fragment.
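A quick way to sanity-check the pattern before it goes into Tag Manager is a short Node script. A sketch (the sample user-agent strings are illustrative, modeled on the published formats, not verbatim copies):

```javascript
// Sanity-check the llm_crawler_pattern regex from step 1 against sample
// user-agent strings. Sample strings are illustrative approximations.
const LLM_CRAWLER_PATTERN = /(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|anthropic-ai|Google-Extended|PerplexityBot|Perplexity-User|Applebot-Extended|Meta-ExternalAgent|Bytespider|YouBot)/;

const samples = [
  "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot",
  "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
  "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36", // human browser
];

for (const ua of samples) {
  const m = ua.match(LLM_CRAWLER_PATTERN);
  console.log(m ? m[1] : "no match", "<-", ua.slice(0, 60));
}
```

The last sample should report "no match"; if a plain browser string matches, a fragment in the pattern is too loose.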
Step 2 — Set up server-side Google Tag Manager
What to do: in your GTM account, create a new container with platform set to Server. Choose the auto-provision option to let Google deploy a Cloud Run instance for you, or skip auto-provision and deploy your own container if you already run server-side tagging.
Where: Tag Manager → Admin → Create container → select Server. Once provisioned, point your tags.yourdomain.com CNAME at the tagging server URL and wait for SSL to provision.
Why server-side: AI crawlers do not run JavaScript, so the user-agent never reaches client-side Tag Manager. Server-side captures the request at the layer where the user-agent is still readable. Google's client-vs-server documentation covers the trade-offs in detail.
Verify: visit https://tags.yourdomain.com/healthz; if the server is up you will get a 200 response. If you get a TLS or DNS error, the CNAME has not propagated yet.
Step 3 — Detect crawlers and emit the event in the server container
What to do: in the server container, add detection logic (a custom variable template) that reads the incoming request's user-agent header, applies llm_crawler_pattern from step 1, and produces three outputs: crawler_matched (boolean), crawler_name (the matched string like GPTBot), and traffic_type (set to ai_crawler when matched, human otherwise).
Where: Tag Manager (Server) → Variables → New. Server containers do not support the web container's Custom JavaScript variable type; the matcher lives in a custom variable template whose sandboxed JavaScript reads the header via getRequestHeader and runs the regex.
Configuration:
// Custom variable template (sandboxed JavaScript): returns matched crawler name or empty string
const getRequestHeader = require('getRequestHeader');
const ua = getRequestHeader('user-agent') || '';
const m = ua.match(/(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|anthropic-ai|Google-Extended|PerplexityBot|Perplexity-User|Applebot-Extended|Meta-ExternalAgent|Bytespider|YouBot)/);
return m ? m[1] : '';
Then add a tag — a Google Analytics: GA4 tag — configured to fire whenever crawler_matched is true. Event name: llm_crawler_visit. Parameters: crawler_name, traffic_type, page_path (from the request URL), response_code.
Verify: open the server container Preview mode, visit your site with a user-agent override (e.g., curl with -A "Mozilla/5.0 GPTBot/1.1"), and confirm the llm_crawler_visit tag fires with crawler_name=GPTBot.
Step 4 — Register the custom dimensions in GA4
What to do: register two custom dimensions in the GA4 admin so the crawler_name and traffic_type parameters become available in reports.
Where: Google Analytics 4 → Admin → Custom definitions (under Data display) → Custom dimensions → Create custom dimension.
Configuration:
- Dimension name: LLM crawler; Scope: Event; Event parameter: crawler_name; Description: "The matched AI crawler user-agent fragment, e.g., GPTBot."
- Dimension name: Traffic type; Scope: Event; Event parameter: traffic_type; Description: "ai_crawler when the crawler regex matched, human otherwise."
Note: custom dimensions are not retroactive. Events fired before registration will not appear in dimension-filtered reports. Plan to register the dimensions before flipping step 3 on, not after.
Verify: in GA4 → Reports → Realtime, look for events named llm_crawler_visit. The dimensions take 24-48 hours to appear in standard reports but will surface in the DebugView immediately if you have GA4 debug mode on.
Step 5 — Exclude crawler events from the bot filter
What to do: GA4's default known-bot filter strips events whose user-agent is on Google's public bot list, which includes most AI crawlers. The implementation uses the traffic_type dimension to keep the events visible in segments designed to read them.
Where: there is no admin toggle that opts specific events out of the bot filter. The mechanism is that the server container emits llm_crawler_visit as its own first-party event, carrying the crawler identity in parameters rather than in a bot-flagged user-agent, and you reference the traffic_type custom dimension in segment definitions: build segments where traffic_type = ai_crawler, and use those segments in the explorations and reports that should show crawler activity. The default human-buyer reports stay clean because those segments are never applied there.
Verify: in Explore, build an exploration with rows = page_path, columns = crawler_name, metric = Event count, and filter to traffic_type = ai_crawler. After 24-48 hours of registered events, the table populates.
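For teams that want the same table outside the Explore UI, that exploration corresponds roughly to a GA4 Data API runReport request. A sketch of the request body (assumes the step 4 custom dimensions are registered; property ID, endpoint, and authentication are omitted):

```json
{
  "dateRanges": [{ "startDate": "28daysAgo", "endDate": "today" }],
  "dimensions": [
    { "name": "pagePath" },
    { "name": "customEvent:crawler_name" }
  ],
  "metrics": [{ "name": "eventCount" }],
  "dimensionFilter": {
    "filter": {
      "fieldName": "customEvent:traffic_type",
      "stringFilter": { "value": "ai_crawler" }
    }
  }
}
```

Event-scoped custom dimensions surface in the Data API under the customEvent: prefix, which is why the step 4 registration has to happen before this query returns rows.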
Step 6 — Build the AI-crawler reports
What to do: create two GA4 explorations that turn the raw events into the operating reports the marketing team will read.
- Page-fetch frequency report. Free-form exploration. Rows: page_path. Columns: crawler_name. Cell metric: Event count. Date range: last 28 days. Filter: traffic_type = ai_crawler. Sort by total event count descending. What it answers: which pages are AI engines reading, segmented by which engine.
- Live-fetch trend report. Free-form exploration. Rows: date. Columns: crawler_name (filtered to ChatGPT-User, Claude-User, Perplexity-User, anthropic-ai — the live-retrieval bots). Metric: Event count. Overlay (separate panel): content publish dates. What it answers: when AI engines fetched a page in real time, what changed about visibility in their answers.
Verify: the reports populate with real data after 28 days of crawler events. In the meantime, the seven-day view will show the events arriving and let you confirm the parameter wiring works end-to-end.
Time to data: server-side capture starts immediately. Custom-dimension reports start populating after 24-48 hours. The 28-day frequency report stabilizes after a month of capture, at which point the team has a working AI-crawler visibility dashboard.
The robots.txt and crawl-policy decisions that pair with detection
Detection without policy is half the implementation. Once the team can see GPTBot, OAI-SearchBot, ClaudeBot, and PerplexityBot fetching specific pages, the next decision is which fetches to allow and which to deny. The artifact that carries the decision is robots.txt, and the well-behaved crawlers respect it.
The decision is not site-wide. It is per-path. The honest framework most B2B marketing sites should run:
- Allow training crawlers on public, ungated content. Blog posts, product pages, public docs, comparison pages — everything that exists to introduce buyers to the category and the brand. These are the pages AI engines are most likely to cite back to prospects researching the space, and citation eligibility starts with crawl access.
- Block training crawlers on gated content. Case studies behind a form, pricing detail behind a request, technical docs reserved for customers, anything you want a human to convert through. The robots.txt directive denies GPTBot, ClaudeBot, Google-Extended on those paths.
- Allow live-retrieval crawlers more permissively. ChatGPT-User, Perplexity-User, and anthropic-ai fetch on a user's explicit request. A user asked the engine to read your URL; blocking that fetch denies the user's explicit intent. Most teams allow live-retrieval bots even on paths they block training bots from.
- Treat the policy as a maintained artifact. New crawlers ship every quarter. The robots.txt file should have an owner, a review cadence, and a changelog. The same telemetry that detected the crawlers is what informs the next robots.txt update.
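Put together, the per-path framework above can be sketched as a robots.txt along these lines (the paths and the crawler grouping are illustrative; check each provider's current documentation for the exact tokens):

```text
# robots.txt sketch: per-path AI crawler policy (paths are illustrative)

# Training crawlers: blocked on gated paths, allowed everywhere else
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /case-studies/
Disallow: /pricing-detail/
Allow: /

# Live-retrieval crawlers: fetch on a user's explicit request, so allow broadly
User-agent: ChatGPT-User
User-agent: Perplexity-User
Allow: /
```

Under Google's robots.txt parsing rules, the most specific (longest) matching rule wins within a group, so the Disallow lines take precedence over the blanket Allow for the gated paths.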
The pairing of detection (the GA4 events) and policy (the robots.txt directives) is what turns AI-crawler visibility from a vanity dashboard into an operational system. The team can see which engines fetched what, decide which fetches were appropriate, and adjust policy — and the next round of crawler events validates whether the policy worked. The behavioral intelligence loop on AI search runs through this pairing.
One operational note worth keeping in view: AI crawlers also re-fetch when content changes. A page that emits an llm_crawler_visit spike right after a content update is a page the engines noticed was updated, and that spike is itself a signal worth instrumenting. The same dashboard that reports first-fetch volume should report re-fetch volume, because re-fetch is what tells the team whether content updates are reaching the AI index in the first place.
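One way to instrument that re-fetch signal: join the exported llm_crawler_visit events to a map of content-update dates and flag pages whose fetch volume jumps in the window after an update. A sketch in plain Node (the event shape, window size, and threshold are assumptions for illustration, not GA4 fields):

```javascript
// Sketch: flag pages whose crawler-fetch volume spiked after a content update.
// Assumes llm_crawler_visit events exported as rows of
// { date: "YYYY-MM-DD", page_path: string } and a map of page → update date.
function refetchSpikes(events, updateDates, windowDays = 7, minRatio = 2) {
  const DAY = 24 * 60 * 60 * 1000;
  const spikes = [];
  for (const [page, updated] of Object.entries(updateDates)) {
    const t0 = new Date(updated).getTime();
    let before = 0, after = 0;
    for (const ev of events) {
      if (ev.page_path !== page) continue;
      const t = new Date(ev.date).getTime();
      if (t >= t0 - windowDays * DAY && t < t0) before++;
      else if (t >= t0 && t < t0 + windowDays * DAY) after++;
    }
    // Flag pages whose post-update volume is at least minRatio times the
    // pre-update volume (floor of 1 so zero-traffic pages can still qualify).
    if (after > 0 && after >= minRatio * Math.max(before, 1)) {
      spikes.push({ page, before, after });
    }
  }
  return spikes;
}
```

The output is the shortlist of pages the engines appear to have re-read after an update, which is the re-fetch report the paragraph above argues for.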
How Pressfit.ai approaches LLM crawler visibility
Our analytics implementation engagement provisions GA4 and Tag Manager, wires conversion events on every CTA and form, ships Consent Mode v2 by default, and adds a server-side event mirror where it matters — the components on the live product page that make AI-crawler detection a deliverable rather than a bolt-on. The custom-event taxonomy includes the LLM-crawler events from layer two of this implementation, so the same telemetry layer that captures buyer behavior captures AI-engine behavior, and the reporting layer can read both.
Behavioral intelligence is the layer above. The AI-crawler signals join the rest of the buyer-response data we run through the system: which content earns AI citations, which pages convert in human-buyer journeys, and which patterns connect the two. The work feeds the messaging system that tests what your buyers actually respond to and the AI visibility system that reads which content is being cited in ChatGPT, Claude, Gemini, Perplexity, and Google AI Overviews. The crawler data is the upstream half of that read; the citation data is the downstream half. Tied to pipeline outcomes, the loop tells the team which content the AI engines are actually consuming and citing back to buyers.
The engagement is scoped against documented best practice (server-side tagging, GA4 Measurement Protocol, robots.txt as the policy artifact for which crawlers are welcome) rather than speculative AEO checklists. We do not promise that AI engines will start citing the brand more after the implementation, because crawler visibility is a measurement layer, not a citation lever. What it does deliver is the read on which pages AI engines are reading and which they are not, which is the data the team needs to make the next round of content decisions.
Common LLM crawler tracking mistakes
- Treating the bot filter as a benefit. GA4's default bot filter strips AI-crawler events before they reach reporting. That is the right default for most analytics; it is the wrong default for AI-search visibility. The implementation has to opt the AI-crawler events out of the bot filter (via the traffic_type custom dimension and segment definitions) before any of this data is visible.
- Relying on client-side tagging alone. Most AI crawlers do not run JavaScript. Client-side Tag Manager fires only when JS executes, which means the GPTBot fetch your origin server saw is invisible to the GA4 client-side stream. Server-side capture is the layer that solves this, and it is layer two for a reason.
- Naming events generically. A single bot_visit event without bot-identity parameters is one step better than nothing and three steps short of useful. The implementation discipline is one event per bot with parameters for the requested URL, response code, and (where available) the user query that triggered the fetch.
- Ignoring robots.txt as the policy artifact. Robots.txt is where you tell ChatGPT, Claude, Gemini, and Perplexity what they may and may not crawl. Most B2B sites ship a robots.txt that has not been updated since 2023 and is silent on the AI crawlers entirely. The honest implementation pairs detection with policy: if you do not want GPTBot crawling your gated content, the robots.txt has to say so.
- Building the report before the warehouse. The AI-crawler data is most useful when joined to content metadata, the internal-link graph, and AI-citation tracking from the visibility tooling. Reports built inside GA4 alone show the events; reports built in the warehouse show the patterns.
Frequently asked questions
Which AI crawlers should I be tracking on a B2B marketing site?
At minimum: GPTBot, OAI-SearchBot, ChatGPT-User (OpenAI), ClaudeBot, anthropic-ai (Anthropic), Google-Extended (Google's AI training crawler), PerplexityBot, Perplexity-User. Worth adding: Applebot-Extended, Meta-ExternalAgent, Bytespider, YouBot. The list grows; the implementation discipline is to maintain it as a living artifact in the Tag Manager regex.
Why do I not see LLM crawlers in default GA4?
GA4 ships a bot filter that strips known-bot user agents before the event reaches reporting. That default is the right call for most analytics use cases but the wrong call for AI-search visibility, because the "bots" include the engines actively consuming your content to answer prospect queries. The implementation opts the AI-crawler events out of the bot filter via a custom dimension so they appear in segments designed to read them.
Do I need server-side tagging to detect LLM crawlers?
Effectively yes. Most AI crawlers do not run JavaScript, so client-side Tag Manager never fires for them. Server-side tagging captures the request at the layer where the user-agent is still readable and the event is reliably emittable. Without server-side, you see only the small fraction of crawlers that do execute JavaScript, which is not the picture you want.
What can I do with LLM crawler data once I have it?
Two reports pay back the implementation immediately: page-by-page count of crawler visits per AI engine over 28 days (which pages are AI engines reading?), and a time-series of live-retrieval events overlaid with content publish dates and AI-citation tracking (when crawlers fetch fresh content, what changes about citation visibility?). Both feed the behavioral intelligence loop on AI search and the next round of content decisions.
How does LLM crawler detection relate to AI search visibility?
Crawler detection is the upstream measurement: which pages AI engines are reading. AI citation tracking is the downstream measurement: which pages AI engines are citing back to buyers in answers. The two reads together close the loop on AI-search visibility — you can correlate crawler activity with citation patterns and identify which content is being consumed and used.
Should I block AI crawlers in robots.txt?
It depends on the page. Gated content (case studies, pricing, anything you want a human to convert through a form) is reasonable to block from training crawlers. Public ungated content (blog posts, product pages, public docs) is usually content you want AI engines to read because it is the content they are most likely to cite back to prospects researching your category. The honest answer is that the robots.txt decision is per-path, not site-wide, and the same telemetry that detects the crawlers should inform the robots.txt policy.
What is next
LLM crawler visibility stops being theoretical the moment the GA4 custom dimension llm_crawler is populated and segmentable. Want to see what AI engines are actually reading on your site? Book a Pressfit.ai discovery call and we will audit your current crawler visibility, install the server-side detection layer, and build the reports that turn raw crawler events into AI-search visibility intelligence.