A new economic model is taking shape
Cloudflare sees over 200 trillion Internet interactions every single day. The data from Q1 2026 tells a story of rapid transition — from a human-centric Open Web to a machine-centric Programmable Web — and the emergence of a new content economy built on permission, attribution, and value exchange. AI is not the enemy of the web. It is the forcing function for a better deal.
AI crawler share grew 4.3 percentage points quarter-over-quarter, from 17.5% in Q4 2025 to 21.8% in Q1 2026. That growth represents enormous demand for publisher content — demand that is increasingly being channelled into formal licensing agreements. The question is no longer whether AI companies will pay for content. The question is whether publishers have the tools to set the terms. That infrastructure exists now.
"The web is becoming programmable. That is not a threat — it is an opportunity for creators to define the terms on which their work powers the next generation of the Internet."
— Matthew Prince, CEO, Cloudflare
The Great Divergence
The divergence is not hard to find in the data. The bot category breakdown tells the story of a web pivoting away from human-directed discovery and toward machine consumption:
Bot Traffic by Category — Q1 2026 vs Q4 2025
| Category | Q4 2025 | Q1 2026 | Change |
|---|---|---|---|
| Search Engine Crawler | 36.3% | 31.1% | ▼ −5.2pp |
| AI Crawler | 17.5% | 21.8% | ▲ +4.3pp |
| SEO & Analytics | 12.2% | 13.3% | ▲ +1.1pp |
| Advertising & Marketing | 13.0% | 10.6% | ▼ −2.4pp |
| Page Preview | 6.6% | 6.6% | → flat |
| Monitoring & Analytics | 3.6% | 3.7% | ▲ +0.1pp |
| AI Search | 2.2% | 3.2% | ▲ +1.0pp |
| Webhooks | 2.5% | 3.1% | ▲ +0.6pp |
| Aggregator | 2.7% | 2.5% | ▼ −0.2pp |
The Top Bot Operators
Among individual bot operators, Google still commands 35.5% of all verified bot traffic — but its share dropped 6.2 percentage points QoQ. OpenAI is now the #3 bot operator on the entire Internet, at 9.5%, and growing.
AI Services Are Now Internet Infrastructure
One measure of how deeply AI has embedded itself in the Internet: ChatGPT is the #8 most-trafficked service globally, ahead of Amazon Shopping, TikTok, and Netflix. It is the #12 most-visited domain in the world.
Note: Lower rank number = higher traffic. Source: Cloudflare Radar Internet Services.
DeepSeek surged from #9–10 in the AI ranking to #4 in the week of February 2, 2026 — immediately following global attention on DeepSeek-R1. A single model release can reshape the entire competitive landscape within days.
The Asymmetry of Extraction
The core dysfunction of the current AI web is not that bots exist — it's that they consume without reciprocating. Among all AI bot crawl activity in Q1 2026:
89.3% of all AI crawler requests are extractive, consuming content to build models that may eventually route around the source entirely. Only 8.1% powers a search product that sends users back to original content.
The Individual Crawler Pecking Order
The individual crawler ranking reveals the precise hierarchy of extraction. These are the nine most active crawlers on the web, ranked by share of all crawler traffic:
| # | Crawler | Operator | % of Crawlers |
|---|---|---|---|
| 1 | Googlebot | Google | 23.7% |
| 2 | GPTBot | OpenAI | 17.0% |
| 3 | ClaudeBot | Anthropic | 14.4% |
| 4 | Meta-ExternalAgent | Meta | 12.6% |
| 5 | BingBot | Microsoft | 7.9% |
| 6 | Amazonbot | Amazon | 5.3% |
| 7 | FacebookExternalHit | Meta | 3.1% |
| 8 | YandexBot | Yandex | 2.6% |
| 9 | PetalBot | Huawei | 2.5% |
GPTBot and ClaudeBot together account for 31.4% of all web crawler traffic — nearly a third of the entire crawler ecosystem — yet the referral products attached to them remain nascent. They consume at the scale of Google; they return traffic at a fraction of it.
The Industry Heatmap: What AI Is Actually Consuming
Cloudflare Radar data on AI bot vertical distribution reveals which industries are most exposed — and what kind of content they should be protecting.
What Each Vertical Should Protect
| Vertical | AI Crawl Share | Primary Content at Risk |
|---|---|---|
| Shopping & Retail | 31.1% | Product descriptions, pricing, imagery, reviews |
| Internet & Telecom | 16.7% | Infrastructure docs, API references, technical content |
| Computer & Electronics | 14.9% | Technical documentation, code, specs |
| News, Media & Publications | 9.2% | Articles, analysis, journalism — highest text quality per request |
| Business & Industry | 5.1% | B2B content, industry reports, market data |
| Travel & Tourism | 4.0% | Listings, reviews, itineraries, pricing |
| Finance | 2.9% | Market data, financial reporting, disclosures |
Shopping and retail is the most-crawled category at 31.1% — not because consumers are using AI to shop (yet), but because product data is a rich training corpus for AI models that aim to assist with purchasing decisions. The crawl happens long before any agentic commerce product launches.
News and Media accounts for 9.2% of AI crawl activity — a figure that understates the impact. A retail product description is one sentence. A news article is thousands of words of structured, fact-checked prose. The training value per crawl request is not equal.
The ability to credibly control access is what makes content a licensable asset. Publishers who can define who accesses their content and on what terms, and who can verify compliance, have something concrete to bring to a negotiation. Cloudflare's enforcement layer converts that capability from a policy statement into a technical reality.
From Robots.txt to Honest Bot Detection
Publishers are responding with the only tool most of them have: a 32-year-old text file. GPTBot and ClaudeBot are the most restricted crawlers in the robots.txt ecosystem. In Cloudflare's verified bot catalog, 37 of 200 tracked bots are now AI crawlers or AI assistants — a category that barely existed two years ago.
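Compliance with that text file is checked, if at all, on the crawler's side. A minimal sketch in Python, using the standard library's robots.txt parser and a hypothetical publisher policy, shows how the convention works and why it remains voluntary:

```python
# Minimal sketch of how a compliant crawler consults robots.txt.
# The directives mirror the common pattern publishers use to restrict
# GPTBot and ClaudeBot; the domain and policy are hypothetical.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/article")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")

# The check runs entirely on the crawler's side. A bot that never calls
# can_fetch() -- or that lies about its user agent -- faces no technical
# barrier at all. That is the voluntary-compliance gap described below.
```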
Why Robots.txt Is Not Enough
First: compliance is voluntary. Our AI crawl timeseries shows that AI bot traffic hit its single-day peak of Q1 2026 on February 9 — even as blocking rates have been rising throughout the quarter. Volume is growing faster than restrictions can contain it.
Second: shadow scrapers exist. Our data shows 0.44% of AI bot crawl activity comes from bots declaring no purpose — Undeclared crawlers operating without transparency. Some go further, deliberately misidentifying themselves and wearing a Googlebot mask to claim a permission they have never sought.
Third: the past is already gone. Models trained on data crawled before publishers erected their defenses have already consumed that content. The window of prevention is partly closed.
Web Bot Auth: Cryptographic Verification
This is why Cloudflare built Web Bot Auth — a new standard using cryptographic signatures to verify a bot's identity at the protocol level. If a bot claims to be Googlebot, it signs its requests with a key only Google holds. If the signature fails, the claim is false. Web Bot Auth does not rely on bots choosing to be honest; it makes dishonesty technically detectable.
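The primitive underneath is a request signature that only the genuine operator can produce. A simplified sketch, assuming Ed25519 keys and an illustrative canonical string (the real standard builds on HTTP Message Signatures, RFC 9421, and defines its own wire format):

```python
# Simplified sketch of signature-based bot verification. The canonical
# string and key handling here are illustrative, not Web Bot Auth's
# actual wire format.
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
)
from cryptography.exceptions import InvalidSignature

# The bot operator holds the private key; the edge holds the public key
# the operator has published.
operator_key = Ed25519PrivateKey.generate()
public_key = operator_key.public_key()

def sign_request(method: str, path: str, host: str) -> bytes:
    """Bot side: sign a canonical representation of the request."""
    canonical = f"{method} {path} {host}".encode()
    return operator_key.sign(canonical)

def verify_request(method: str, path: str, host: str, sig: bytes) -> bool:
    """Edge side: accept the claimed identity only if the signature checks."""
    canonical = f"{method} {path} {host}".encode()
    try:
        public_key.verify(sig, canonical)
        return True
    except InvalidSignature:
        return False

sig = sign_request("GET", "/article", "example.com")
print(verify_request("GET", "/article", "example.com", sig))  # True
print(verify_request("GET", "/admin", "example.com", sig))    # False
# A forged or replayed claim fails verification; an impersonator without
# the operator's private key cannot produce a valid signature at all.
```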
AI agents — bots that actively browse, fill forms, and execute tasks — currently represent 0.16% of verified bot traffic. Small today, but the fastest-growing subcategory. These agents cannot be managed with robots.txt alone. Cryptographic identity is the necessary infrastructure for the Programmable Web.
AI Search: A New Referral Ecosystem
The AI Search category — crawlers attached to AI-powered search products that do return traffic to original sources — is one of the most strategically important in the ecosystem. It grew 1.0 percentage point QoQ, from 2.2% to 3.2% — expanding by nearly half in relative terms in a single quarter.
| Operator | Share of AI Search Traffic | Product |
|---|---|---|
| Apple | 58.8% | Applebot → Siri, Spotlight, Apple Intelligence |
| OpenAI | 41.0% | OAI-SearchBot → ChatGPT Search |
| Cloudflare | 0.13% | Cloudflare AI Search |
| Brave | 0.07% | Brave Search AI |
Apple is the dominant AI search crawler, driven by Applebot's role in powering Siri and Apple Intelligence. Apple's crawl activity was highly volatile in Q1 2026 — near zero on some days, spiking by late March — suggesting active product development. A company with 2.35 billion active devices and a first-party AI search product is a fundamentally different referral engine than anything that preceded it.
OAI-SearchBot spiked noticeably during peak AI news periods (Jan 22–24, Feb 9, Mar 10–11), suggesting its crawl intensity tracks closely with ChatGPT usage surges. As AI search products mature and publishers begin demanding referral as a condition of access, this is the category that matters most.
The AI Agent Layer: From Crawlers to Action
The most nascent but strategically significant development is the emergence of AI assistants as a distinct traffic category. These are not crawlers passively indexing content — they are agents actively executing tasks: browsing pages, filling forms, clicking buttons, completing purchases.
| Operator | Share of AI Assistant Traffic |
|---|---|
| OpenAI | 86.0% |
| Cloudflare (Browser Rendering) | 10.7% |
| DuckDuckGo | 2.5% |
| Mistral AI | 0.26% |
| Meta | 0.25% |
| Manus | 0.22% |
| Amazon | 0.03% |
| Devin AI | 0.03% |
The 21 verified AI assistant bots in our catalog include Amazon Bedrock AgentCore Browser (deployed across 9 AWS regions), Amazon's "Buy For Me" agent, Devin AI, CartAI, Apify, and Browserbase.
An agent that can complete a purchase on a user's behalf represents a new distribution channel — one where publishers and developers who establish the right integrations will be present at the moment of action, not just the moment of discovery. The shift from crawl-and-display to act-on-behalf creates new surfaces for attribution, licensing, and value exchange that do not exist in the search paradigm.
The AI Inference Economy
Beyond what bots consume, Cloudflare's AI inference data reveals what developers are building on top of AI infrastructure in Q1 2026.
Top Models by Deployment
| # | Model | Share of Inference Accounts |
|---|---|---|
| 1 | Meta Llama 3 8B Instruct | 40.6% |
| 2 | Stable Diffusion XL 1.0 | 13.4% |
| 3 | Whisper (OpenAI) | 8.3% |
| 4 | Meta Llama 4 Scout 17B | 7.0% |
| 5 | M2M-100 1.2B (Translation) | 5.4% |
Text generation dominates at 62.9% of all AI inference, confirming that the primary use of AI infrastructure is language. Meta's Llama 3 is the most widely deployed model on Cloudflare's network at 40.6% of all inference accounts — Meta's open-weight strategy has produced the Internet's most-used inference model. The newly released Llama 4 Scout already accounts for 7.0% of inference accounts despite launching during the quarter — the fastest adoption curve we have observed for any new model.
Bot Management: Cloudflare's Defense Layer
The data in this report — crawl volumes, intent classification, shadow scrapers, verified identities — doesn't surface without infrastructure. Cloudflare's Bot Management platform is the layer between publishers and the automated web, scoring every request with a bot probability, a purpose classification, and an enforcement decision. At the scale of 200+ trillion signals per day, this is not a feature. It is the operating system of the Programmable Web.
How Bot Management Works
Every request that passes through Cloudflare receives a bot score from 1 to 99. Scores near 1 indicate automated traffic; scores near 99 indicate human. This score is derived from a multi-layered detection stack that runs at the edge without adding latency:
- Machine learning classification: models trained on Cloudflare's global traffic continuously classify requests against billions of known-good and known-bad behavioral patterns.
- Verified bot catalog: 200+ verified bots — Googlebot, GPTBot, ClaudeBot, OAI-SearchBot — confirmed by reverse IP/DNS lookup. Known-good bots pass; impersonators are flagged immediately.
- Behavioral fingerprinting: request timing, TLS fingerprints (JA3/JA4), HTTP/2 settings, and header ordering expose non-human sessions even when user agents claim otherwise.
- Managed challenges: for ambiguous scores, Cloudflare's smart challenge validates humans without friction. Automated sessions that evade passive detection fail the challenge.
- Web Bot Auth: the next layer. Bots sign requests with a private key only their operator holds. If a crawler claims to be Googlebot but fails the signature check, it is lying. Identity can no longer be gamed.
AI Crawl Control: Publisher Enforcement at the Edge
The most consequential Bot Management capability for publishers in Q1 2026 is AI Crawl Control — purpose-built tooling for per-operator access policy on AI bots. Unlike robots.txt, which is a voluntary convention any bad actor can ignore, AI Crawl Control enforces access decisions at the network edge before content is ever served.
Cloudflare data shows AI crawl traffic hit its single-day peak of Q1 2026 on February 9 — even as publisher blocking rates have been rising throughout the quarter. Volume growth is outpacing restriction. The 0.44% of AI crawl traffic classified as "Undeclared" operates with no stated purpose and no honest user agent, deliberately rotating identities to slip through text-file defenses. A text file on a server is not enforcement. Edge-level bot scoring is.
AI Crawl Control lets publishers set distinct rules per crawler identity — allowing OAI-SearchBot (which drives referral traffic back to the source) while blocking GPTBot (which trains models without reciprocation) on the same domain. Publishers can say, for the first time: you may index, but you may not train.
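The policy logic is simple to picture. A sketch of per-operator decisions of the kind described above; the crawler names are real, but the policy table and decision function are illustrative rather than Cloudflare's actual API:

```python
# Sketch of per-crawler policy enforcement of the kind AI Crawl Control
# applies at the edge. Illustrative only, not Cloudflare's API.
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CHALLENGE = "challenge"

# Distinct rules per verified crawler identity on the same domain:
# index-and-refer is allowed, train-without-reciprocation is not.
POLICY: dict[str, Action] = {
    "OAI-SearchBot": Action.ALLOW,  # drives referral traffic back
    "GPTBot": Action.BLOCK,         # trains models, no reciprocation
    "Googlebot": Action.ALLOW,      # search indexing
}

def decide(crawler: str, verified: bool) -> Action:
    # An unverified identity claim never inherits a named bot's allowance.
    if not verified:
        return Action.CHALLENGE
    return POLICY.get(crawler, Action.BLOCK)

print(decide("OAI-SearchBot", verified=True))   # Action.ALLOW
print(decide("GPTBot", verified=True))          # Action.BLOCK
print(decide("GPTBot", verified=False))         # Action.CHALLENGE
```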
Shadow scrapers preferentially impersonate Googlebot — the one crawler virtually no publisher can afford to block. A bot wearing a Googlebot mask gets through the vast majority of publisher defenses by exploiting a structural coercion: block Google, lose search. Web Bot Auth's cryptographic signature requirement closes this permanently. A crawler that cannot produce Google's private key cannot impersonate Googlebot, regardless of what it claims in its user agent string.
Bot Score Distribution — Q1 2026
Cloudflare classifies 30.5% of all web traffic as automated in Q1 2026. Across the full score distribution, traffic polarizes: most requests score either clearly machine (score 1–29) or clearly human (score 80–99). The ambiguous middle — scores 30–79 — is where sophisticated evasion operates and where ML models are most actively contested.
| Bot Score | Classification | Recommended Action | Share of Traffic |
|---|---|---|---|
| 1–29 | Likely Bot | Block or challenge | 24.1% |
| 30–49 | Possibly Bot | Challenge with Turnstile | 2.8% |
| 50–79 | Ambiguous | Log, monitor, apply additional signals | 3.6% |
| 80–99 | Likely Human | Allow | 69.5% |
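A sketch of how those bands could translate into enforcement decisions; the thresholds follow the table above, but the function itself is illustrative, not Cloudflare's implementation:

```python
# Map a bot score to the recommended action from the table above.
def action_for_score(score: int) -> str:
    if not 1 <= score <= 99:
        raise ValueError("bot scores range from 1 to 99")
    if score <= 29:
        return "block_or_challenge"   # likely bot
    if score <= 49:
        return "challenge_turnstile"  # possibly bot
    if score <= 79:
        return "log_and_monitor"      # ambiguous: apply more signals
    return "allow"                    # likely human

for s in (5, 40, 65, 92):
    print(s, action_for_score(s))
```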
Crawler Hints: Telling Good Bots When to Come
Bot Management is not only about blocking. The other half of the problem is waste. Cloudflare's analysis of search crawler traffic found that 53% of good bot visits are wasted — crawlers revisiting pages that haven't changed since their last visit, burning compute and energy for no new information. For a category that represents over 5% of all global Internet traffic, that inefficiency has material scale: Cloudflare estimates that reducing excessive crawl could eliminate as much as 26 million tonnes of carbon equivalent per year — the equivalent of shutting down six coal-fired power plants permanently.
The root cause is structural: crawlers have no reliable signal for when content changes. Without that signal, they default to visiting pages on a fixed schedule regardless of whether anything is new. Most of the time, nothing is.
Cloudflare's Crawler Hints solves this by proactively signalling to verified search crawlers when content on a site has actually changed — and by extension, when it hasn't. Rather than waiting for a crawler to discover a change on its own schedule, Cloudflare monitors cache status at the edge and pushes real-time update signals via IndexNow, the open protocol for content freshness notification. Crawlers receive a personalized, continuously updated map of exactly which URLs changed and when — zero load on the origin server, available on all Cloudflare plans including Free.
The mechanism works through Cloudflare's edge cache layer. When a page's content changes — detected by hash or timestamp — Cloudflare records the event and notifies participating crawlers via IndexNow. Each search engine gets its own tailored feed of what changed since its last visit, ordered by page importance (inferred from human visitor frequency). The result: crawlers spend their budget on pages that are actually fresh, and publishers get their new content indexed faster.
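The underlying notification is an IndexNow ping. A minimal submission looks like the following, where the host, key, and URLs are placeholders; with Crawler Hints the edge sends the signal automatically, with no origin involvement:

```python
# Minimal IndexNow submission of the kind Crawler Hints automates.
# Host, key, and URLs are placeholders for a real site's values.
import requests

payload = {
    "host": "example.com",
    "key": "0123456789abcdef0123456789abcdef",           # site-owned key
    "keyLocation": "https://example.com/indexnow-key.txt",
    "urlList": [
        "https://example.com/news/fresh-article",        # changed pages only
    ],
}

resp = requests.post(
    "https://api.indexnow.org/indexnow",
    json=payload,
    timeout=10,
)
# A 200/202 response means participating search engines were notified
# that these URLs changed -- so crawlers can skip everything that didn't.
print(resp.status_code)
```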
| Without Crawler Hints | With Crawler Hints |
|---|---|
| Crawler visits on a fixed schedule regardless of content changes | Crawler visits triggered by actual content changes, signalled in real time |
| 53% of crawl visits find no new content — wasted compute and bandwidth | Crawl budget concentrated on pages that have actually changed |
| New content may sit unindexed for days or weeks | Content changes surfaced to search engines in near real-time |
| Origin server absorbs repeated unnecessary requests | Fewer bot requests reach origin; performance and resource load improve |
| Publisher has no visibility into crawler freshness | Publisher benefits from faster indexing without any additional configuration |
Crawler Hints connects directly to the economics of the AI content market. When a licensed AI crawler has access to a publisher's content, both parties benefit if that crawler indexes fresh content efficiently — the AI company's product improves, and the publisher's content is correctly represented. Excessive recrawl of stale content wastes the licensed crawler's budget on pages it has already seen, while new articles sit unindexed. Crawler Hints ensures the deal works as intended: the crawler arrives when there is something new to read, not on an arbitrary schedule.
Crawler Hints operates within the same verified bot ecosystem as AI Crawl Control and Web Bot Auth. A crawler that is allowed under AI Crawl Control because it is a verified, licensed partner gets the benefit of Crawler Hints — efficient, timely access to fresh content. A crawler that is blocked gets neither access nor hints. The enforcement layer and the efficiency layer are the same infrastructure, applied in opposite directions depending on whether the bot is trusted.
The /crawl Endpoint: Compliant Crawling by Design
On March 10, 2026, Cloudflare launched a new Browser Rendering /crawl endpoint — a developer API that submits a starting URL and returns a full website's content as HTML, Markdown, or structured JSON, via a headless browser that automatically discovers and renders pages. It is designed for building RAG pipelines, training datasets, and content monitoring — and critically, it is built from the ground up to be a compliant, permission-respecting crawler.
Unlike the shadow scrapers and undeclared bots documented throughout this report, the /crawl endpoint is a signed agent — it cryptographically identifies itself and cannot misrepresent its identity. It respects robots.txt directives including crawl-delay, honours AI Crawl Control settings, and cannot bypass Cloudflare bot detection or CAPTCHAs. It self-identifies as CloudflareBrowserRenderingCrawler/1.0 in every request — more transparent than the vast majority of AI crawlers operating on the web today.
Key capabilities include incremental crawling via modifiedSince and maxAge parameters — meaning it skips pages that haven't changed, saving cost and reducing unnecessary load. Configurable crawl depth, URL patterns, and page limits give operators precise control over scope. Static mode (render: false) fetches HTML without a headless browser for faster crawls of non-JS sites. Available on both Free and Paid plans.
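A call might look like the following. The parameters are the ones named above; the URL path and response handling are assumptions modelled on Cloudflare's existing Browser Rendering REST API, so treat this as a sketch rather than the documented contract:

```python
# Illustrative call to the /crawl endpoint using the parameters the
# launch describes (modifiedSince, render, scope limits). The exact
# path and response shape are assumptions; consult Cloudflare's docs.
import requests

ACCOUNT_ID = "YOUR_ACCOUNT_ID"
API_TOKEN = "YOUR_API_TOKEN"

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
    "/browser-rendering/crawl",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "url": "https://example.com",              # starting URL
        "format": "markdown",                      # HTML, Markdown, or JSON
        "limit": 50,                               # cap pages crawled
        "modifiedSince": "2026-03-01T00:00:00Z",   # skip unchanged pages
        "render": False,                           # static mode, no browser
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```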
Some publishers raised concerns in the days following launch — specifically that the crawler's user agent was not yet listed in Cloudflare's Verified Bot catalog, meaning bot management rules keyed to that catalog required a manual update to reference it explicitly. Cloudflare clarified the behaviour in an edited changelog post, and the path to blocking the crawler is straightforward for any publisher who chooses to: it identifies itself honestly, which is precisely what makes it blockable with a simple firewall rule. That is the opposite of the problem posed by shadow scrapers that rotate identities to evade detection.
The broader point is significant: the /crawl endpoint represents what compliant AI crawling should look like. It is honest about its identity. It respects publisher signals. It cannot impersonate other bots. Every AI company operating a crawler should be held to the same standard — and Cloudflare's infrastructure is what makes that standard enforceable at scale.
From Defense to Access Control Infrastructure
The strategic shift underway is this: Bot Management is no longer purely a security product. It is the access control layer for a new content economy. Block the bad actors. Challenge the ambiguous. Signal the verified good ones to crawl smarter. The /crawl endpoint is a blueprint for what the well-behaved automated web looks like: identified, transparent, and operating within publisher-defined terms. Publishers who use Cloudflare's bot infrastructure are not just protecting their content — they are creating the conditions for productive partnership. Granular identity enforcement at the edge is what makes tiered access possible: free for search, licensed for training, auditable for both. The deal is only as good as the infrastructure beneath it — and that infrastructure now exists.
The Attribution Dashboard: Your Negotiating Brief
The existing Cloudflare AI bots dashboard tells you which bots are hitting your site and whether they are being blocked. The new Attribution Overview goes further — it turns bot management data into a deal-ready evidence base. These are the capabilities that are new, and why each one matters for licensing leverage.
Bot View — Operator-Level Attribution
The original dashboard shows crawl volume. The Attribution Overview shows what you get back. The crawl-to-referral ratio — how many AI crawl requests it takes to produce one human referral visitor — is the single metric that quantifies the imbalance of extraction. An operator with an 80:1 ratio is taking 80 times more than it returns. A 42:0 ratio means it is returning nothing at all. This number did not exist in the previous dashboard. It belongs at the top of every negotiation brief.
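The metric itself is simple arithmetic, which is exactly what makes it legible in a negotiation. A sketch with hypothetical counts:

```python
# Crawl-to-referral ratio from per-operator counts. The figures are
# hypothetical; the metric is the one described above.
def crawl_to_referral(crawls: int, referrals: int) -> str:
    if referrals == 0:
        return f"{crawls}:0 (no value returned)"
    return f"{round(crawls / referrals)}:1"

operators = {
    "Operator A": (160_000, 2_000),  # 80 crawls per referral visitor
    "Operator B": (84_000, 0),       # pure extraction, zero referrals
}
for name, (crawls, referrals) in operators.items():
    print(name, crawl_to_referral(crawls, referrals))
```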
The original dashboard classifies bots broadly. The Attribution Overview classifies the purpose of each visit: Training (content used to build the model — the highest-value use), Inference (query-time retrieval), or Grounding (factual anchoring in AI responses). An operator crawling for Training is building its product on your journalism. An operator crawling for Inference is using your content to answer user queries. Each has a different commercial value — and the licensing price should reflect which category applies. The dashboard makes this visible, per operator, continuously.
The original dashboard counts requests. The Attribution Overview measures data transfer in gigabytes and terabytes per AI operator. This matters because bandwidth is the unit cloud providers use to charge for egress — it is an established commercial metric, not an abstract count. When an operator's consumption is expressed in terabytes, it arrives at the negotiating table with a number already attached. "You consumed X TB of our content last quarter" is a different opening than "you sent us a lot of requests."
The Insights panel surfaces market intelligence alongside your own data: which bots your peers have blocked, at what rate, and why. When a bot is flagged as blocked by the majority of publishers in the network, that is not just a recommendation — it is evidence of industry consensus about that operator's behaviour. This context changes the negotiation dynamic: a publisher blocking a bot that 80% of peers have also blocked is acting with the weight of the market behind them.
The Attribution Overview automatically classifies bots by exposure risk — High, Medium, or Low — based on the combination of crawl volume, intent type, and referral return rate. High Exposure bots are those taking content at scale without meaningful reciprocation. This classification did not require manual analysis in the previous dashboard. It surfaces automatically, giving publishers an immediate view of which operators present the highest unlicensed extraction risk.
The Action column (Blocked, Partially Blocked, Allow, Custom Rule) existed in the original dashboard. In the Attribution Overview context it gains a new dimension: it is the negotiation starting position. When an AI company comes to the table, a publisher can show exactly how that operator is currently being treated — and name the specific terms under which the status changes. Moving a row from Blocked to Allow is the deal. The dashboard makes that concrete.
Articles View — Content-Level Attribution
The Articles view is entirely new. It answers the question AI companies least want publishers to ask: which of my articles specifically are you training on? Each URL shows which operators crawled it, what their intent was, and how AI bot traffic compares to human traffic on that page. An article with a high AI-to-human ratio is not generating audience — it is generating training data. That distinction has a price, and the Articles view makes it visible at the content level for the first time.
Each article row shows the full traffic breakdown: AI bots, other bots, human direct, and human referral, with a colour-coded distribution bar. This is the publisher's content inventory in commercial terms — the specific URLs an AI company has been trying to access, the volume at which they have been trying, and what the crawl-to-referral ratio looks like for each piece. In a deal negotiation, this is the product catalogue. Knowing which content an operator most wants is knowing where the leverage is.
A licensing agreement that cannot be audited is a handshake. Once an operator moves from Blocked to Allowed, the Attribution Dashboard — and the Articles view in particular — becomes the ongoing record of what they access, at what volume, with what intent, article by article. This is the consumption metering layer that pay-per-use deals require to function. Without it, there is no way to verify that an AI company is honouring what it agreed to. With it, the deal has verifiable, continuously tracked terms.
Monetization: The Publisher-AI Deal Economy
The data has produced a market. Since the first publisher-AI deal in July 2023, a licensing economy has grown from a handful of bilateral experiments into 50+ documented agreements across OpenAI, Google, Perplexity, Microsoft, Meta, Amazon, Mistral, and Prorata — spanning wire services, national newspapers, magazines, specialist publishers, and collective licensing pools. The market is moving quickly, and the commercial models that will define it are being shaped right now.
"We are very happy with either model — both can be viable as long as our content is respected and paid for."
— Neil Vogel, CEO, People Inc. (Dotdash Meredith)
Complete Deal Timeline: 2023–2026
Every confirmed publisher-AI content licensing deal, in chronological order. Covers training data licences, RAG/retrieval integrations, revenue-share programs, and platform-specific content marketplaces.
| Date | Publisher(s) | AI Company | Key Terms |
|---|---|---|---|
| Jul 2023 | Associated Press | OpenAI | Licence for AP archive back to 1985 for training; exploring generative AI use cases in news products |
| Dec 2023 | Axel Springer (Politico, Business Insider, Bild, Welt) | OpenAI | Content summarised in ChatGPT globally including paywalled content, with links; permitted for training |
| Mar 2024 | Le Monde, Prisa Media (El País) | OpenAI | Content summaries in ChatGPT with attribution; permitted for model training; multi-year revenue |
| Apr 2024 | Financial Times | OpenAI | Current + archive FT content for real-time queries and training; reported $5–10m/year |
| Apr 2024 | Axel Springer | Microsoft | Expanded partnership: AI chat experiences, advertising (Chat Ads API), Azure cloud migration |
| May 2024 | Dotdash Meredith (People, Investopedia, 40+ titles) | OpenAI | Real-time content in ChatGPT; content used for training; helps build D/Cipher AI ad tool; $16m+ |
| May 2024 | Informa (B2B) | Microsoft | Non-exclusive data access to 2027; $10m+ first year; Copilot for M365, specialist AI agents |
| May 2024 | News Corp (WSJ, NYP, The Times, The Australian, Barron's, MarketWatch) | OpenAI | Five-year content licence including training + real-time use; reported $250m+ over 5 years |
| May 2024 | The Atlantic, Vox Media | OpenAI | Archive journalism in ChatGPT with attribution; access to OpenAI tech for product development |
| Jun 2024 | TIME | OpenAI | Multi-year; OpenAI gets access to TIME's 101-year archive and current reporting |
| Jul 2024 | TIME, Der Spiegel, Fortune, Entrepreneur, The Texas Tribune, Automattic | Perplexity | Revenue-share program; publishers share ad revenue when content referenced in AI results |
| Aug 2024 | Financial Times, Axel Springer, The Atlantic, Fortune | Prorata | Content licensed to Prorata's Gist.ai; 50% revenue share proportional to content contribution |
| Aug 2024 | Condé Nast (Vogue, Wired, Vanity Fair, GQ, The New Yorker) | OpenAI | Multi-year; content in ChatGPT with attribution; offsets declining search traffic revenue |
| Oct 2024 | FT, Reuters, Axel Springer, Hearst, USA Today Network | Microsoft | Named partners for Copilot Daily news summaries; pay-per-use marketplace model |
| Oct 2024 | Hearst (20+ magazines, 40+ newspapers) | OpenAI | US content in ChatGPT with citations and direct links; includes Esquire, Cosmo, Elle, Houston Chronicle |
| Oct 2024 | Reuters | Meta | First news publisher Meta deal; real-time Reuters content in Meta AI across Facebook, WhatsApp, Instagram |
| Nov 2024 | DMG Media (Mail, Metro, i), Sky News, The Guardian, Prospect | Prorata | Content licensing + equity investment (DMG); first UK publishers in Prorata's platform |
| Nov 2024 | Washington Post | OpenAI | Summaries + links in ChatGPT; explicitly does not cover LLM training use |
| Dec 2024 | The Independent, LA Times, Lee Enterprises, Adweek, Blavity, DPReview, Gear Patrol, MediaLab, Mexico News Daily, Minkabu Infonoid, NewsPicks, NTV (RTL), Prisa Media, Stern (RTL) | Perplexity | Revenue-share program expansion; ad revenue sharing, API access, free Enterprise Pro for one year |
| Dec 2024 | Future plc (Tom's Guide, PC Gamer, TechRadar, Marie Claire) | OpenAI | Content partnership; Future uses OpenAI tools for sales, marketing and editorial functions |
| Dec 2024 | Lee Enterprises | Prorata | Content deal; Prorata tech helps personalise local content and automate ad processes |
| Jan 2025 | Axios | OpenAI | 3-year; ChatGPT gets content with attribution + links; OpenAI funds 4 Axios Local newsrooms |
| Jan 2025 | Associated Press | Google | Google's first AI licensing deal; real-time AP wire feeds into Gemini chatbot |
| Jan 2025 | AFP (Agence France-Presse) | Mistral | Multi-year; 2,300 daily stories in 6 languages feed Le Chat AI assistant |
| Feb 2025 | Schibsted (VG, Aftenposten, Aftonbladet, Svenska Dagbladet) | OpenAI | Real-time news in ChatGPT with clear attribution; Schibsted gets AI engagement insights |
| Feb 2025 | The Guardian | OpenAI | Content with attribution in ChatGPT; Guardian gets OpenAI tech access for product development |
| Mar 2025 | News/Media Alliance members (incl. McClatchy, The Atlantic, MIT Tech Review) | Prorata | Opt-in licence; Prorata pays 50% of revenues driven by AI responses using member content |
| Apr 2025 | Washington Post | OpenAI | Summaries, quotes and links in ChatGPT; does not cover LLM training use |
| May 2025 | New York Times (NYT Cooking, The Athletic) | Amazon | Alexa gets NYT summaries + recipes; covers AI model training; reported $20–25m/year |
| Jun 2025 | 500+ publishers (Atlantic, Fortune, Time, Boston Globe, The Verge, Vox, Guardian, Daily Mail, Sky News, Future) | Prorata | Prorata's Gist.ai reaches 500+ licensed publishers; one of the largest AI content libraries |
| Jul 2025 | Condé Nast, Hearst | Amazon | Multi-year; content feeds Rufus AI shopping assistant |
| Jul 2025 | People Inc. | Microsoft | Pay-per-use marketplace; "a la carte" model — contrasts with OpenAI's lump-sum deal |
| Jul 2025 | Gannett / USA Today Co. (200+ local newsbrands) | Perplexity | Content in AI search + Comet browser; Gannett staff get Sonar API and Enterprise Pro access |
| Oct 2025 | Getty Images | Perplexity | Global multi-year; Getty images in Perplexity search and discovery with credits and source links |
| Oct 2025 | USA Today Co. | Microsoft | Joins Microsoft's pay-per-use AI content marketplace; Copilot first buyer |
| Nov 2025 | Associated Press | Microsoft | Joins Microsoft pay-per-use marketplace; "very early days and experimental" per AP CRO |
| Dec 2025 | CNN, Fox News, Fox Sports, Le Monde Group, People Inc., The Daily Caller, The Washington Examiner, USA Today | Meta | Meta's first publisher deals; real-time news in Meta AI with prominent linking; mixed with user-generated content from Facebook/Instagram |
| Dec 2025 | The Guardian, Washington Post, FT, Der Spiegel, El País, Folha, Infobae, Kompas, Times of India, Washington Examiner + others | Google | AI pilot program (non-training); AI-powered article overviews on Google News pages; cash payments |
| Mar 2026 | Reach (Mirror, Express, OK!, and 130+ regional UK titles) | Amazon | Content in Amazon Nova AI model and Alexa; usage-based compensation — not a flat fee; Reach CEO: "more visibility and control over how content is used" |
| Mar 2026 | News Corp (WSJ, NYP, The Times, The Sun, The Australian) | Meta | Multi-year; up to $50m/year for at least 3 years; archive + real-time content in Meta AI products |
| Mar 2026 | News/Media Alliance (2,200 member publishers) | Bria | Opt-in licence for NMA members; content forms foundation for new Bria enterprise AI product; recurring revenue based on usage frequency |
Sources: Press Gazette, Digiday, BuzzStream, company announcements, 2023–March 2026.
BuzzStream's analysis of top sites appearing in ChatGPT found little correlation between having a licensing deal and frequency of AI citations. OpenAI partners broadly, but partnerships don't guarantee ChatGPT will surface that publisher's content in responses. The deal secures access and compensation — it doesn't guarantee visibility.
Google's Structural Exception
Google's approach stands apart from every other AI platform: the Associated Press is the only publisher with a confirmed deal covering LLM training data for AI Overviews, AI Mode, and Gemini. The December 2025 pilot program with The Guardian, Washington Post, and others is a non-training arrangement covering AI article overviews in Google News. Google extracts from virtually every publisher via Googlebot — and licenses from almost none. This is the structural gap at the centre of the CMA's investigation.
Two Deal Models — Different Risk Profiles
Lump-sum licence: the publisher receives a flat multi-year fee in exchange for broad content access rights — OpenAI's dominant approach with The Guardian, Axios, and others. It provides revenue certainty but misaligns incentives: the AI company has paid once and can consume indefinitely. People Inc. CEO Neil Vogel: "We are very happy with either model — both can be viable as long as our content is respected and paid for."
Pay-per-use marketplace: Microsoft's "a la carte" model, where publishers set prices and AI companies pay per consumption event. The AP, USA Today Co., and People Inc. are participants. Prorata's Gist.ai runs a 50% revenue-share model across 500+ publishers, distributing revenue proportionally to how much each publisher's content drove AI responses. This model better aligns economic value with content consumption — and creates the architecture for a functioning market.
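The arithmetic behind a proportional revenue share is straightforward. A sketch, with hypothetical figures, of a 50% pool split by contribution share, in the style of models like Prorata's:

```python
# Proportional revenue-share arithmetic: half of attributable revenue,
# split by each publisher's share of the content that drove AI
# responses. All figures are hypothetical.
def distribute(revenue: float, contributions: dict[str, float],
               publisher_share: float = 0.50) -> dict[str, float]:
    pool = revenue * publisher_share
    total = sum(contributions.values())
    return {pub: pool * c / total for pub, c in contributions.items()}

# contributions: how often each publisher's content was used in answers
payouts = distribute(
    revenue=1_000_000.0,
    contributions={"FT": 400.0, "The Atlantic": 350.0, "Fortune": 250.0},
)
for pub, amount in payouts.items():
    print(f"{pub}: ${amount:,.2f}")
# FT: $200,000.00 / The Atlantic: $175,000.00 / Fortune: $125,000.00
```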
Four Industry Initiatives Forcing Payment
Beyond bilateral deals, publishers are forming collective structures to create negotiating leverage — and to address the transparency gap that makes individual licensing unenforceable at scale.
"Voluntary transparency commitments alone are unlikely to deliver the level of disclosure needed to support licensing and enforcement." — Eilidh Wilson, Head of Policy & Public Affairs, PPA. Without knowing which content was consumed, when, and by which model, publishers cannot price, enforce, or audit licensing agreements. The entire market depends on solving this first.
- A UK publisher coalition founded by the Financial Times, The Guardian, The Telegraph, BBC and Sky News. It aims to stop publisher content being used without permission or compensation, while creating a shared industry position rather than a fragmented response. Open to publishers of all sizes.
- A collective licensing pool from Publishers' Licensing Services and the Copyright Licensing Agency. It aggregates content across magazines, digital news, books and academic publishing into a single repository, negotiating with AI platforms at scale. Open to any publisher with an ISSN.
- The IAB's Content Monetisation Protocol, a technical specification that defines how AI systems interact with publisher content at the protocol level. It sets a standardised framework for agreeing commercial terms before content is crawled — making permissions and payments machine-readable and enforceable within the infrastructure itself.
- Prorata's attribution platform, a live, opt-in model built around attribution and revenue share. Publisher content is tracked at the level of output, with compensation distributed based on usage. A working example of licensing operating at scale, combining aggregated supply with technology that assigns value to each contribution.
How Bot Management Makes Deals Enforceable
Commercial deals are only worth the paper they're printed on if the publisher can control who accesses their content. Cloudflare's Bot Management infrastructure is what converts a licensing agreement from a legal document into an enforced reality:
- Verified identity: Cloudflare's verified bot catalog confirms that the crawler claiming to be GPTBot actually resolves to OpenAI's infrastructure. Web Bot Auth extends this to cryptographic proof. Only verified bots can be licensed. Unverified bots can be blocked.
- Granular access policy: publishers can allow OAI-SearchBot (drives referral traffic — the paid partner) while blocking GPTBot (trains models — the unlicensed extractor). This distinction is what makes tiered licensing viable and the IAB's Content Monetisation Protocol actionable.
- Consumption metering: publishers get visibility into which crawlers hit which pages, at what frequency, with what intent scores. This is the consumption data needed to verify that AI companies are accessing what they contracted for — and to feed pay-per-use marketplace models like Microsoft's.
The ability to credibly withhold content is what gives publisher leverage. A publisher who cannot block has nothing to sell. Cloudflare's enforcement layer converts content into a scarce, licensable asset rather than a public good freely consumed at machine scale.
The Revenue Lag — and What Comes Next
Despite the deal velocity, publisher revenue from AI licensing remains far below what search advertising once delivered at its peak. People Inc.'s Jon Roberts described the state of the market at the Digiday Publishing Summit in March 2026 as an "AI licensing boom — and a revenue lag." Deals are being signed faster than revenue is materializing.
The pay-per-use marketplace model is most likely to close this gap, because it ties compensation to actual consumption at scale. But the market infrastructure — verified identity, consumption metering, access enforcement — must exist before that pricing can be trusted. Meanwhile, publishers are running out of time to wait for legislation. The UK stepped back from an opt-out approach in March 2026, leaving space for industry-led solutions — but also leaving the field open. Publishers will have to choose a route. The clear risk is fragmentation: different collective initiatives pulling in different directions, weakening the position of all of them.
The convergence of Web Bot Auth, AI Crawl Control, and pay-per-use marketplace models represents the technical and commercial stack needed for a functioning content economy on the Programmable Web. The infrastructure is in place. The deals are being signed. The window to shape the terms of this market — as a publisher, a developer, or an AI company — is open now.
The New Deal Is Being Written Now
The data from Q1 2026 tells a story of transition — and of opportunity. AI crawler volume is growing. So is the number of licensing deals. So is the sophistication of the tools publishers have to set terms, verify identity, attribute consumption, and enforce access. The infrastructure for a new content economy is not theoretical. It is operational.
Developers building on AI are not the enemy. They are the market. The question is whether publishers can participate in that market on fair terms — and the answer, for the first time, is yes.
Cloudflare's role in this transition is to be the infrastructure layer that makes the new model work — for both sides. Bot Management and AI Crawl Control give publishers the ability to define who accesses their content and on what terms. Crawler Hints and the /crawl endpoint give developers efficient, compliant access to the web's content. Web Bot Auth gives everyone a cryptographic foundation for identity. The Attribution Dashboard gives publishers the evidence base to negotiate from a position of knowledge rather than assumption.
The Programmable Web is being built. Cloudflare is building the access layer that determines whether it is built on extraction or on agreement. The terms are being written now — by publishers who block, by developers who comply, and by AI companies who are discovering that a web full of licensed, attributed, fairly compensated content is more valuable than a web of dark patterns and silent scraping.
Block strategically. License intentionally. Build on agreement. The new deal is better for everyone.