StepToCyber

One Filter Is Not a Safety Strategy: What the Grok Failure Teaches Every Security Leader About AI Safety

StepToCyber — Sat, 18 Apr 2026 13:06:42 GMT

Four months ago, xAI promised to stop Grok from generating nonconsensual sexualized images of real women. This week, NBC News reported it is still happening. The bypasses are not sophisticated. Users pair a photo of a real person with a stick-figure pose diagram and tell Grok to “match the pose.” Or they ask Grok to swap the clothing between two images. Or they upload a photo and ask for a video transformation. The filters xAI promised do not catch any of it.

One independent analyst now believes Grok produces more nonconsensual synthetic nudity than every comparable tool combined.

xAI’s publicly described controls amount to model-level filtering — and the company is now arguing in a Dutch court that it cannot stop all abuse and should not be penalized when malicious users bypass those controls. That is the opposite of defense in depth.

Defense in Depth Is Not a Slogan

Defense in depth is a design principle. It assumes every control will fail.

You layer perimeter, network, endpoint, identity, monitoring, and response so that when one layer is breached, the next catches what got through. Each layer is a different control against a different failure mode, often from different tools or different teams. That is the architecture. A single filter is not.

xAI’s Dutch court argument fails a basic test of secure design. CISA’s Secure-by-Design principles place responsibility for safety on the system’s operator, not on end users. Arguing that malicious users are responsible when controls are bypassed does not meet that bar.

Grok and the OWASP LLM Top 10

The OWASP Top 10 for LLM Applications (2025) is the industry reference for critical risks in LLM-based systems. Grok’s public behavior maps directly onto one category — and exposes a gap in the framework itself.

LLM01: Prompt Injection. Prompt injection has held the top spot in the OWASP list for two consecutive editions because LLMs process instructions and data in the same channel without clear separation. The model cannot tell them apart.

This is not a Grok-specific problem. A 2025 paper introduced Cross-modal Adversarial Multimodal Obfuscation (CAMO), a black-box attack framework that splits harmful instructions into benign-looking textual and visual clues. Each component looks harmless on its own. The model reconstructs the attack intent through cross-modal reasoning. CAMO achieved attack success rates of 81.82% on GPT-4.1-nano and 93.94% on DeepSeek-R1 using 12.6% of the tokens required by older attack methods.

The Grok bypasses exploit the same vulnerability class that paper documented — individually benign inputs that become harmful in combination. The difference is that CAMO uses automated adversarial optimization. Grok’s users did not need any of that. They combined unmodified photos with hand-drawn diagrams and plain-language instructions. The filters failed against a basic, manual version of a well-documented attack class. The class itself was publicly documented before xAI shipped these features — Shayegani, Dong, and Abu-Ghazaleh published compositional cross-modality attacks at ICLR in 2024, based on work from 2023.

Where the framework stops. The Grok case also involves insufficient content filtering on generated output and capabilities shipped without proportionate controls. These are real failures, but they do not map cleanly onto a second OWASP LLM Top 10 category. LLM05 (Improper Output Handling) addresses output passed to downstream systems without sanitization — XSS, SQL injection, remote code execution — not harmful content shown directly to users. LLM06 (Excessive Agency) addresses agents calling functions and extensions, not generative models producing content. The OWASP LLM Top 10 was designed for LLM applications integrated into software systems. Consumer-facing generative AI — where the output is the product — sits partially outside the framework's current scope.

The Stakes Get Higher with Agents

Grok generating an image is the low-stakes version of this problem. The failure mode is bad output. When this class of model gets agency — tools, memory, authority to take action — the failure mode stops being bad output and starts being bad actions.

The OWASP Top 10 for Agentic Applications (2026), released in December 2025, is the framework for that next-stage problem. It was built with dozens of security experts from industry, academia, and government and is based on real attacks observed in production.

Agent Goal Hijack (ASI01). An attacker changes an agent’s objectives through malicious content. The same prompt injection that bypassed Grok’s image filters can hijack an agent into sending an email, modifying a record, or calling an API on an attacker’s behalf.

Identity and Privilege Abuse (ASI03). An AI agent acts with the full authority of every key, token, and service account assigned to it. A single agent merges multiple permissions into one execution point. Compromise the agent, and you inherit every non-human identity it holds. Identity runs through most of the top risks in the OWASP Agentic Top 10.

Cascading Failures (ASI08). A compromised agent does not produce one bad output and stop. It chains actions across connected systems. It exfiltrates through the same channels it was authorized to use.

The model is not your security boundary. The model — and everything you let it do — is the thing being contained.

When the model shares its blast radius with production systems, shares its identity with the user, shares its network egress with sensitive data — you have not deployed AI safely. You have deployed a Grok-class failure waiting for the right prompt.

Five Categories of AI Safety Controls — And Why No Single Category Is Enough

Defense in depth requires controls at different layers, using different methods, catching different failure modes. In AI safety, those controls fall into five categories. Whatever xAI deployed, the publicly visible bypasses on X confirm it was not enough. Here is what the full surface looks like, and where each category fails when the inputs are multimodal.

Category 1: Model-Level Controls

Safety training built into the model itself. RLHF alignment, refusal training, Constitutional AI, concept erasure — techniques that modify the model’s weights to make it refuse harmful requests or suppress harmful outputs.

This is what most people mean when they say “the model won’t do that.”

Model-level controls are useful but have the best-documented failure rates of any category. A 2026 survey of LLM jailbreaking found that automated attacks achieve 90–99% success on open-weight models, and 80–94% on proprietary models. The model cannot reliably separate instructions from content. That limitation is structural. RLHF hasn't fixed it.

Model-level controls are deliberately absent from the six-layer architecture below. The architecture assumes this layer will fail and builds everything else to catch what gets through.

Category 2: Input Inspection

Everything that evaluates the prompt before the model processes it. Prompt injection classifiers, jailbreak detectors, topic deny lists, PII detection on inputs, input format validation.

Available implementations include Azure Prompt Shields, Meta Prompt Guard, NVIDIA NeMo jailbreak detection rails, and Amazon Bedrock’s prompt attack filtering.

Where this category breaks in multimodal: Input inspection for text is a maturing control. The multimodal version is not. The problem is compositional attacks — inputs that are individually benign but harmful in combination. The “Jailbreak in Pieces” paper showed that pairing adversarial images with generic textual prompts breaks model alignment using only the vision encoder — no access to the LLM required.

The Grok bypasses are a simpler version of this attack class. The research attacks use adversarially optimized images. Grok’s users did not need that — they combined unmodified photos with hand-drawn diagrams and simple instructions. The filters failed against an unsophisticated version of a well-documented attack.

Category 3: Output Evaluation

Everything that evaluates the model’s response before it reaches the user. Content harm classifiers, LLM-as-judge implementations, NSFW image classifiers, PII redaction, groundedness checks, output format validation.

Content harm classification is the most widely deployed control in this category — present in every major platform. Azure AI Content Safety monitors four harm categories with adjustable severity thresholds. Amazon Bedrock Guardrails reports blocking up to 88% of harmful content. These classifiers detect harmful outputs when the harm is visible in the output itself. They do not detect harm that was invisible in the inputs and only emerged during generation.

Groundedness checks — verifying that the model’s output is based on provided source material — are shipped by Azure and Bedrock. These address accuracy, not content safety.

Where this category breaks in multimodal: For text, LLM-as-judge works well when the judge is purpose-trained for safety evaluation. For images, the judge needs to be vision-capable and safety-trained on visual content. Few purpose-built visual safety judges exist — Llama Guard 3 Vision and ShieldGemma 2 are among the first. The effectiveness gap is measurable — the best-performing vision classifier in benchmarking studies shows F1 scores below 0.5 on categories like harassment and self-harm.

For video, the problem gets worse. The judge has to evaluate motion, context, and transformation across frames. This is the modality where Grok generates its most harmful output — photo-to-video transformations that are publicly visible on X, meaning whatever output evaluation exists in xAI’s pipeline did not prevent them from reaching users.

Three failure modes in LLM-as-judge are documented.

First, shared blind spots. When the judge and the generator share training lineage, they share failure modes. Research by Fu and Liu (EMNLP 2025 Findings) evaluated five models across 25 languages and found average inter-judge agreement at a Fleiss’ kappa of approximately 0.3, which is low on a scale where 0 means chance-level agreement and 1 means perfect agreement. Liu et al. (ICLR 2025) found that some guard models flag responses as “unsafe” based on the user input alone, even when the model response is a single space token, meaning the guards are classifying the prompt, not the response.

Second, judge vulnerability. The judge is still a model. The same prompt injection techniques that compromise the primary model can compromise the judge. A 2026 survey found that automated judge agreement ranges from 70% to 93% depending on implementation.

Third, incomplete coverage. If cost constraints lead to evaluating a sample of outputs rather than all of them, the result is a statistical defense, not a security defense. An attacker who knows that not every output is checked can adjust accordingly.
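The arithmetic behind that third failure mode is short. Here is a minimal sketch, assuming each output is selected for review independently with a fixed probability. The function names and the 10% sampling rate are illustrative and are not taken from any vendor's pipeline.

```python
def p_all_unchecked(sample_rate: float, attempts: int) -> float:
    """Chance that every one of `attempts` outputs escapes review."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return (1.0 - sample_rate) ** attempts


def p_any_unchecked(sample_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` outputs escapes review."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Complement of "every single attempt was sampled for review".
    return 1.0 - sample_rate ** attempts


if __name__ == "__main__":
    # Assumed 10% sampling rate, purely for illustration.
    for n in (1, 5, 20):
        print(n, p_all_unchecked(0.10, n), p_any_unchecked(0.10, n))
```

With 10% sampling, five attempts all go unreviewed a little more than half the time (0.9 to the fifth power is about 0.59). An attacker who can retry cheaply only needs one attempt to land outside the sample.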

Category 4: Infrastructure Controls

The controls around the model, not on it. Blast radius containment, network segmentation, identity federation, credential scoping, sandboxed execution, API rate limiting, data loss prevention, egress filtering.

This is the category where existing security expertise applies directly to AI deployment. Zero-trust architecture, least-privilege access, tenant isolation — these are not AI-specific. They are the same controls enterprises have used for decades, applied to a new class of system.

In multimodal: Infrastructure controls are modality-agnostic. They do not care whether the model generates text, images, or video. They care whether the model has access to systems it should not, and whether a compromise propagates to connected systems.

Category 5: Observability

Runtime monitoring, behavioral detection, logging, audit trails, alerting, and incident response.

This category assumes the first four have failed. Runtime monitoring watches for anomalous model behavior — outputs that deviate from baselines, unusual tool invocations, data access patterns outside the agent’s scope. Logging makes incident reconstruction possible. Alerting and incident response make it actionable.

In multimodal: Observability for AI systems is less mature than for traditional infrastructure. Most enterprises have monitoring for network traffic, endpoint behavior, and application logs. Few have equivalent monitoring for AI agent behavior or output distribution anomalies. The telemetry exists — model inputs, outputs, tool calls, guardrail triggers — but it is not routinely fed into SIEM platforms or monitored by security operations centers. The data is available. The pipelines to use it are not built yet.

A Six-Layer Architecture

The six-layer architecture is built from these five control categories, plus one precondition. Model-level controls (Category 1) are not a layer. The architecture assumes they will fail and builds everything else to compensate.

Layer 1: Supply chain visibility (AIBOM). You cannot secure what you cannot inventory. Model provenance, training data origin, fine-tuning history, embedded safety controls, evaluation artifacts. A precondition for evaluating every layer that follows. Maps to LLM03.

Layer 2: Input defense. Category 2 applied. Pre-model classifiers that flag bypass patterns, adversarial inputs, and known-bad prompts. For multimodal systems, classifiers that evaluate the composition of inputs — not each input in isolation. Maps to LLM01 and Agent Goal Hijack.

Layer 3: Output defense. Category 3 applied. Post-model classifiers for every modality the system produces. This layer must use a different detection method than Layer 2. If both share training data or vendor lineage, they share blind spots. Output filtering should be structurally independent: a different model family, a rule-based policy engine, or an LLM-as-judge from a separate provider.

Layer 4: Blast radius and exfiltration controls. Category 4 applied. The model does not share identity with the user. It does not share network egress with production data. It does not share credentials with other agents. Tools are scoped. Permissions are least-privilege. Agent actions are sandboxed. Maps to Identity and Privilege Abuse and Cascading Failures.

Layer 5: Runtime monitoring. Category 5 applied. Layers 1 through 4 try to prevent bad outcomes. Layer 5 assumes they failed. It watches for anomalous behavior, logs everything, and alerts on deviations. This is the layer that catches attacks no classifier was trained on. Logging here is not optional — it is what makes incident reconstruction possible.

Layer 6: Human oversight and incident response. Category 5 extended into action. For high-risk outputs — image generation involving real people, video generation, agent actions that modify production systems — a human review gate belongs in the pipeline. Not on every output. On outputs that cross a defined risk threshold. Behind that gate sits an incident response process: escalation paths, containment procedures, credential revocation, system isolation.

Every layer is imperfect. That is the point. If you can't answer what happens when one layer fails, you don't have defense in depth.

What the Industry Is Shipping Today

Purpose-Built Multimodal Safety Classifiers

Llama Guard 4 (Meta, 2025) is a multimodal safety classifier that evaluates prompts and responses across 14 hazard categories, including code interpreter abuse. Llama Guard 3 Vision (Meta, late 2024) was the first safety classifier built for LLM image understanding, evaluating prompt text and images together. ShieldGemma 2 (Google) classifies images for sexual content, violence, and gore, and uses its own classifier in reverse to generate adversarial test images, a form of red-teaming-as-training. NVIDIA NeMo Guardrails supports multimodal content safety with GPU-accelerated parallel execution, adding roughly half a second of latency for five parallel guardrails.

Every one of these is a single-layer control. None cover compositional cross-modal attacks. They are pieces of a stack, not the stack.

Enterprise Guardrail Platforms

Azure AI Content Safety provides multimodal moderation, prompt injection detection, groundedness checks, and PII filtering — Microsoft’s documentation notes it is probabilistic and should be treated as a risk reduction tool, not a guarantee. Amazon Bedrock Guardrails filters harmful text and image content, blocks prompt injections, and redacts PII — AWS reports it blocks up to 88% of harmful content. Microsoft Foundry Guardrails applies classification at four intervention points: user input, tool call, tool response, and output — the tool call and tool response points are significant for agentic systems because they let guardrails inspect what an agent is about to do before it does it.

Every one of these platforms is built on classification models tuned for known harm categories. None cover compositional cross-modal attacks. They are a layer. The defense-in-depth architecture has to be built by the enterprise deploying them.

Emerging Research and Open Problems

In-Generation Detection

Current safety tools inspect the prompt or the output. A 2025 preprint introduced In-Generation Detection (IGD), which monitors the model’s internal state during the image generation process itself. It reads the predicted noise during diffusion denoising steps — a signal that reflects the evolving visual meaning of the prompt — and trains a lightweight classifier to detect NSFW intent before the image is fully generated.

IGD achieved 91.32% detection accuracy across seven NSFW categories, including adversarially crafted prompts. Because it reads internal model state rather than the prompt surface, it has the potential to catch adversarial inputs that are designed to look benign to external classifiers.

IGD has so far been demonstrated only for diffusion-based image generation. It does not extend to video, text, or multimodal-to-multimodal systems, and it is not shipping in any enterprise product.

Proposed Directions for Compositional Attack Defense

Security researchers have described three architectural directions that would address the compositional cross-modal attack class:

Evaluate combined intent, not individual inputs. Safety systems should reason over the cumulative meaning of a full prompt sequence — “photo + stick figure + match the pose” as a single semantic intent, not three benign inputs evaluated separately.

Share context across safety layers. The image classifier should see the original user request. The prompt guard should see the generated image. Without this, attackers can route harmful content through one modality to exploit blind spots in another.

Decompose compositional inputs. Classifiers should identify compositional elements — diagrams, reference images, pose guides — within a larger input, and evaluate their meaning separately from the overall scene.

None of these are shipping in enterprise products. They represent where the field needs to go, and security architects should be asking vendors whether their roadmaps address them.
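The first direction, evaluating combined intent, can be shown with a toy example. This is a sketch under heavy assumptions: the labels are imagined outputs of hypothetical per-input classifiers, and the rule table is hand-written for illustration. A real system would need a learned cross-modal model, not a lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputPart:
    modality: str  # "image" or "text"
    label: str     # assumed output of a hypothetical per-input classifier


# Labels a per-input filter would block when seen alone (illustrative).
BLOCKED_ALONE = {"explicit_image", "explicit_request"}

# Label sets that are harmless piece by piece but risky together (illustrative).
RISKY_TOGETHER = [
    frozenset({"real_person_photo", "pose_diagram", "match_pose_instruction"}),
    frozenset({"real_person_photo", "clothing_reference", "swap_clothing_instruction"}),
]


def passes_per_input(parts: list[InputPart]) -> bool:
    """Evaluate each input in isolation, as a single-input filter would."""
    return all(p.label not in BLOCKED_ALONE for p in parts)


def passes_combined(parts: list[InputPart]) -> bool:
    """Evaluate the request as one semantic unit across modalities."""
    labels = {p.label for p in parts}
    return not any(combo <= labels for combo in RISKY_TOGETHER)


request = [
    InputPart("image", "real_person_photo"),
    InputPart("image", "pose_diagram"),
    InputPart("text", "match_pose_instruction"),
]
print(passes_per_input(request), passes_combined(request))
```

The per-input check passes this request because no single part is blocked on its own. The combined check rejects it because the three labels together match a risky composition.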

The Swiss Cheese Model for AI Safety

Researchers have proposed multi-layered runtime guardrails modeled on the Swiss Cheese Model from aviation and healthcare safety engineering. Each layer has holes. The principle is that no two layers have the same holes in the same place. The architecture decouples safety authority from any single model so each layer can be tested and updated independently.
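A short calculation shows why the "different holes" condition matters. The sketch below assumes each layer has a known miss rate (the numbers are made up) and compares two extremes: fully independent layers, and layers whose holes overlap completely.

```python
from math import prod


def independent_miss_rate(miss_rates: list[float]) -> float:
    """Joint miss rate when layers fail independently: the product of the rates."""
    if not miss_rates:
        raise ValueError("need at least one layer")
    return prod(miss_rates)


def fully_correlated_miss_rate(miss_rates: list[float]) -> float:
    """Joint miss rate when holes overlap completely: no better than the best layer."""
    if not miss_rates:
        raise ValueError("need at least one layer")
    return min(miss_rates)


layers = [0.2, 0.3, 0.5]  # hypothetical per-layer miss rates
print(independent_miss_rate(layers), fully_correlated_miss_rate(layers))
```

Under independence, three mediocre layers miss about 3% of attacks together. If they share lineage and fail on the same inputs, the stack misses 20%, the same as its best single layer. Real stacks sit somewhere between those two bounds.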

What a Security Architect Should Implement Today

The research is ahead of the products. The products are ahead of most deployments. Here is what you can do now, mapped to the six layers, using tools that exist today.

Where you start depends on what you are shipping. If the business needs an internet-facing chatbot, input defense comes first — you need prompt injection detection before the system goes live. If the system handles legal or regulated content, output filtering on specific terms comes first — you need to block what cannot be said before anything else. The layer numbers are not a priority order. They are a completeness checklist. Build what the use case demands, ship it, then add depth.

I'm building this stack in production — some layers are live, others are in progress. Start anywhere, but don't stop at one layer — the gap you skip is the one that gets exploited.

The tooling is also not static. Security vendors are building AI capabilities into their products at the same pace enterprises are adopting AI. The guardrail platform you evaluated last quarter may have shipped new capabilities since. Reassess continuously. And check what you already have — if your organization runs DLP, content filtering, or compliance tooling, some of these controls may already be partially in place. You do not always need to build from scratch.

Layer 1: Supply chain visibility. Maintain an AIBOM for every model in your environment. Document provenance, training data sources, fine-tuning history, safety controls, and evaluation results. For third-party models, document what the vendor discloses and what they do not.
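One way to make the AIBOM concrete is a record per model, plus a check that surfaces what is still undocumented. The field names below are assumptions chosen to mirror the list above. They do not follow any published AIBOM schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class AIBOMEntry:
    """Hypothetical AIBOM record for one model in the environment."""
    model_name: str
    provenance: str = ""
    training_data_sources: list[str] = field(default_factory=list)
    fine_tuning_history: list[str] = field(default_factory=list)
    safety_controls: list[str] = field(default_factory=list)
    evaluation_results: dict[str, str] = field(default_factory=dict)
    # Fields the vendor has explicitly declined to disclose.
    vendor_undisclosed: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Fields that are empty and not recorded as vendor-undisclosed."""
        missing = []
        for f in fields(self):
            if f.name in ("model_name", "vendor_undisclosed"):
                continue
            if not getattr(self, f.name) and f.name not in self.vendor_undisclosed:
                missing.append(f.name)
        return missing


entry = AIBOMEntry(
    model_name="example-vlm",  # placeholder name, not a real model
    provenance="third-party hosted API",
    safety_controls=["refusal training"],
    vendor_undisclosed=["training_data_sources"],
)
print(entry.gaps())
```

Separating "the vendor does not disclose this" from "we never asked" keeps the inventory honest. The first is a documented risk. The second is a gap to close.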

Layer 2: Input defense. Deploy prompt injection and jailbreak detection on all inputs before they reach the model. For multimodal systems, use classifiers that evaluate the composition of inputs, not just individual components. Meta Prompt Guard, Azure Prompt Shields, and NVIDIA NeMo jailbreak detection are available options. None fully solve compositional attacks, but their absence is what lets those attacks scale. Run them in parallel with other guardrails to minimize latency.

Layer 3: Output defense. Deploy a structurally independent output classifier. If your input classifier is from Vendor A, your output classifier should not be from Vendor A. Use a purpose-built multimodal safety classifier (Llama Guard 4, ShieldGemma 2, or comparable) rather than a general-purpose vision model. If your system generates images or video, the classifier must be trained and evaluated on AI-generated content, not only on real-world photos. Test it against adversarial inputs, not just known harmful content.
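The independence rule is simple enough to enforce as a deployment check. A minimal sketch, assuming you record a vendor and a model family for each classifier. The names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classifier:
    name: str
    vendor: str
    model_family: str


def shares_lineage(a: Classifier, b: Classifier) -> bool:
    """True if two classifiers come from the same vendor or model family."""
    return (
        a.vendor.lower() == b.vendor.lower()
        or a.model_family.lower() == b.model_family.lower()
    )


def check_stack(input_clf: Classifier, output_clf: Classifier) -> None:
    """Fail fast when the input and output layers are not structurally independent."""
    if shares_lineage(input_clf, output_clf):
        raise ValueError(
            f"{input_clf.name} and {output_clf.name} share lineage; "
            "pick a different vendor or model family for one layer"
        )


# Placeholder vendors and families, for illustration only.
input_guard = Classifier("input-guard", "vendor-a", "family-x")
output_guard = Classifier("output-guard", "vendor-b", "family-y")
check_stack(input_guard, output_guard)  # passes silently
```

Matching on vendor and family is a coarse proxy. Two vendors can still train on overlapping data, so a check like this catches the obvious case, not every shared blind spot.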

Layer 4: Blast radius and exfiltration controls. Apply your existing zero-trust and least-privilege architecture to AI systems. The model runs in a sandbox. It does not share identity with the user, network egress with production data, or credentials with other agents. Tools are scoped and explicitly enumerated. Rate limits, DLP rules, and egress filtering apply to AI-initiated requests the same way they apply to human-initiated requests.
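"Scoped and explicitly enumerated" can be as small as an allow list in front of every tool call. The sketch below uses a hypothetical `ScopedToolbox` class and a made-up `lookup_order` tool. It is not the API of any agent framework.

```python
from typing import Any, Callable


class ScopedToolbox:
    """Deny-by-default dispatcher: an agent can call only its enumerated tools."""

    def __init__(self, agent_id: str, allowed: dict[str, Callable[..., Any]]):
        self.agent_id = agent_id
        self._allowed = dict(allowed)  # copy so callers cannot widen scope later

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        tool = self._allowed.get(tool_name)
        if tool is None:
            raise PermissionError(
                f"{self.agent_id} is not scoped for tool {tool_name!r}"
            )
        return tool(**kwargs)


def lookup_order(order_id: str) -> dict[str, str]:
    """Made-up read-only tool used for the example."""
    return {"order_id": order_id, "status": "shipped"}


toolbox = ScopedToolbox("support-agent", {"lookup_order": lookup_order})
print(toolbox.call("lookup_order", order_id="A1"))
```

A hijacked agent that asks for `send_email` gets a `PermissionError`, not an email. The prompt injection still happened, but the blast radius stops at the tools this agent was given.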

Layer 5: Runtime monitoring. Log all inputs, outputs, tool invocations, and guardrail triggers. Establish behavioral baselines and alert on deviations. Feed guardrail trigger data into your SIEM. If your SOC monitors network anomalies and endpoint behavior, it should also monitor AI agent behavior.
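Feeding guardrail triggers into a SIEM mostly comes down to emitting them as structured events. Here is a minimal sketch of one JSON line per trigger. The field names are assumptions, and the payload is stored as a digest so the event can travel through a shared log pipeline. Keeping the full input and output in a separate access-controlled store is a design choice for this example, not a requirement from any platform.

```python
import hashlib
import json
from datetime import datetime, timezone


def guardrail_event(agent_id: str, layer: str, rule: str, action: str, payload: str) -> str:
    """Serialize one guardrail trigger as a single JSON line."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "ai-guardrail",
        "agent_id": agent_id,
        "layer": layer,    # e.g. "input_defense" or "output_defense"
        "rule": rule,      # which check fired
        "action": action,  # e.g. "blocked", "flagged", "allowed"
        # Digest lets analysts correlate events without copying the content.
        "payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload_chars": len(payload),
    }
    return json.dumps(event, sort_keys=True)


line = guardrail_event(
    "support-agent", "output_defense", "nsfw_image", "blocked", "example payload"
)
print(line)
```

Once events look like this, baselining is a query: triggers per agent per hour, rules that never fire, agents whose block rate suddenly changes.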

Layer 6: Human oversight and incident response. Define risk thresholds for human review. Build incident response playbooks for AI-specific scenarios: model compromise, agent hijack, data exfiltration through authorized channels, classifier bypass. Include the ability to revoke agent credentials, isolate the model, and preserve audit logs.
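Defining risk thresholds means writing them down somewhere a pipeline can enforce them. The sketch below is one possible shape. The two threshold values and the `Candidate` fields are hypothetical and would need tuning against your own risk scoring.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.5  # hypothetical: at or above this, a human looks first
BLOCK_THRESHOLD = 0.9   # hypothetical: at or above this, never released


@dataclass(frozen=True)
class Candidate:
    kind: str  # "text", "image", "video", or "agent_action"
    risk_score: float
    involves_real_person: bool = False
    modifies_production: bool = False


def route(c: Candidate) -> str:
    """Return "release", "human_review", or "block" for one output or action."""
    if c.risk_score >= BLOCK_THRESHOLD:
        return "block"
    high_risk_kind = (
        c.kind == "video"
        or (c.kind == "image" and c.involves_real_person)
        or (c.kind == "agent_action" and c.modifies_production)
    )
    if high_risk_kind or c.risk_score >= REVIEW_THRESHOLD:
        return "human_review"
    return "release"


print(route(Candidate("text", 0.1)))
print(route(Candidate("image", 0.1, involves_real_person=True)))
```

The high-risk kinds go to review even with a low score. That is deliberate: the score comes from a classifier, and the gate exists for the cases where the classifier is wrong.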

Architecture-level: Run guardrails in parallel, not in series. Five parallel guardrails add roughly half a second of latency. Use risk-based routing — low-risk queries get lightweight checks, high-risk queries get deeper evaluation with human review gates.
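The latency argument for parallel execution is easy to demonstrate. In this sketch each guardrail is a stand-in that sleeps for 50 ms and applies a toy substring check. The guard names, latencies, and phrases are all invented for the example. Real guardrails would be network calls to whichever platform you run.

```python
import asyncio
import time

# (name, simulated latency in seconds, phrase that makes the toy check fail)
GUARDS = [
    ("prompt_injection", 0.05, "ignore previous instructions"),
    ("jailbreak", 0.05, "pretend you have no rules"),
    ("topic_deny_list", 0.05, "internal salary data"),
    ("pii", 0.05, "ssn:"),
    ("format", 0.05, "\x00"),
]


async def run_guard(name: str, latency_s: float, phrase: str, text: str) -> tuple[str, bool]:
    """Stand-in for a real guardrail call; the sleep simulates its latency."""
    await asyncio.sleep(latency_s)
    return name, phrase not in text.lower()


async def check_all(text: str) -> dict[str, bool]:
    """Run every guard concurrently and collect pass/fail per guard."""
    results = await asyncio.gather(*(run_guard(*g, text) for g in GUARDS))
    return dict(results)


def timed_check(text: str) -> tuple[dict[str, bool], float]:
    """Return the verdicts and the wall-clock seconds the whole check took."""
    start = time.perf_counter()
    verdicts = asyncio.run(check_all(text))
    return verdicts, time.perf_counter() - start


if __name__ == "__main__":
    verdicts, elapsed = timed_check("Please ignore previous instructions")
    print(verdicts, f"{elapsed:.3f}s", "allowed" if all(verdicts.values()) else "blocked")
```

Run in series, the five simulated guards would take about 250 ms. Run concurrently, the total is close to the slowest single guard. The request is allowed only if every guard passes.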

The Takeaway

If the safety story for any AI system you build or deploy is “the model won’t do that,” that is a red flag. Ask what catches the prompt the model missed. Ask what catches the prompt that does not look like a prompt. Ask what the classifier’s detection rate is on AI-generated content specifically. Ask what the judge does when the judge is the target.

If your AI adoption strategy treats the model as the security boundary, you are one creative composition away from the Grok headline. Not the same incident. The same failure class.

The security team has decades of defense-in-depth experience. The AI safety field is still building theirs. We have done this before. We know what happens when a single control fails without a second layer behind it.

The answer is not “we could not prevent all abuse.”

The answer is the next layer.

Views are my own.

References

Primary news coverage

Ingram, D. (2026, April 14). Elon Musk’s AI chatbot Grok continues to produce sexualized deepfakes despite xAI’s pledge to stop. NBC News. https://www.nbcnews.com/tech/tech-news/musks-ai-chatbot-grok-xai-making-sexual-deepfakes-imagine-rcna265855

OWASP frameworks

OWASP GenAI Security Project. (2025). OWASP Top 10 for Large Language Model Applications 2025. https://genai.owasp.org/llm-top-10/
OWASP GenAI Security Project. (2025, December). OWASP Top 10 for Agentic Applications 2026. https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

Secure-by-Design

CISA. Secure by Design. https://www.cisa.gov/securebydesign

Compositional cross-modal attack research

Shayegani, E., Dong, Y., & Abu-Ghazaleh, N. (2024). Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models. ICLR 2024. https://openreview.net/forum?id=plmBsXHxgR
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models (CAMO). (2025). arXiv preprint. https://arxiv.org/html/2506.16760v1

Image safety classifier benchmarking

Qu, Y., Shen, X., He, X., Backes, M., Zannettou, S., & Zhang, Y. (2024). UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images. arXiv preprint. https://arxiv.org/html/2405.03486v3

LLM jailbreaking survey and judge reliability

Bin Hakim, S., Gharami, K., Farhady Ghalaty, N., et al. (2026, January). Jailbreaking LLMs: A Survey of Attacks, Defenses and Evaluation. TechRxiv. https://www.techrxiv.org/users/1011181/articles/1373070
Fu, X. & Liu, W. (2025). How Reliable is Multilingual LLM-as-a-Judge? EMNLP 2025 Findings, pages 11040–11053. https://aclanthology.org/2025.findings-emnlp.587/
Liu, H., Huang, H., Gu, X., Wang, H., & Wang, Y. (2025). On Calibration of LLM-based Guard Models for Reliable Content Moderation. ICLR 2025. https://arxiv.org/abs/2410.10414

Purpose-built multimodal safety classifiers

Meta. (2025). Llama Guard 4-12B Model Card. Hugging Face. https://huggingface.co/meta-llama/Llama-Guard-4-12B
Meta. (2024). Llama Guard 3-11B-Vision Model Card. GitHub. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard3/11B-vision/MODEL_CARD.md
Google. ShieldGemma. Referenced in: 19 Large Language Models Redefining AI Safety. InfoWorld. https://www.infoworld.com/article/4140809/19-large-language-models-redefining-ai-safety-and-danger.html

Enterprise guardrail vendor documentation

Microsoft. (2026). Azure AI Content Safety overview. https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety/
Microsoft. (2026). Guardrails and controls overview in Microsoft Foundry. https://learn.microsoft.com/en-us/azure/foundry/guardrails/guardrails-overview
Amazon Web Services. (2026). Amazon Bedrock Guardrails. https://aws.amazon.com/bedrock/guardrails/
NVIDIA. NeMo Guardrails for Developers. https://developer.nvidia.com/nemo-guardrails

Guardrail architecture research

Designing Multi-layered Runtime Guardrails for Foundation Model Based Agents: Swiss Cheese Model for AI Safety by Design. (2024). arXiv preprint. https://arxiv.org/html/2408.02205v3
Modular Safety Guardrails Are Necessary for Foundation-Model-Enabled Robots in the Real World. (2026). arXiv preprint. https://arxiv.org/html/2602.04056

In-generation detection research

Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models. (2025). arXiv preprint 2508.03006. https://openreview.net/forum?id=SFHjSDIMKn

Proposed architectural defenses

Decodes Future. (2026, March). Grok Jailbreak Prompts: Multimodal Reasoning Vulnerability Analysis. https://www.decodesfuture.com/articles/grok-jailbreak-prompts-multimodal-reasoning-vulnerability-analysis

Parallel guardrail orchestration

Authority Partners. (2026, March). AI Agent Guardrails: Production Guide for 2026. https://authoritypartners.com/insights/ai-agent-guardrails-production-guide-for-2026/

Controlled Openings: How Enterprises Should Actually Let AI Crawlers In

StepToCyber — Sat, 11 Apr 2026 11:23:57 GMT

The Controlled Openings Model

Tier 1 — The Declaration. Owned by Enterprise AI, Marketing, and Security. Legal reviews. The robots.txt file is the meeting minutes — every AI crawler named, every posture stated, closed by default.

Tier 2 — Active Enforcement. Owned by app teams. Security audits. The WAF or CDN baseline that turns the declaration into a control on every property.

Tier 3 — Governance. Owned by the AI governance organization. The allowed list register, named owners, and the quarterly review that keeps every opening on a clock.

Closed by default. Every opening is scoped to one application, owned by a named human, and expires on a date. Everything else is a wish.

A marketing director buys an ad placement inside ChatGPT. The placement requires OpenAI to crawl the landing page as part of ad onboarding — and the crawl fails. OAI-SearchBot can't reach the page. The ad can't go live.

Here's the conflict. Marketing needs the crawler in. Every enterprise with a public web estate has a legitimate reason to block AI crawlers by default — scraping pressure, referral asymmetry, brand misrepresentation risk, contested compliance from bots like PerplexityBot. Both positions are correct. Neither can unilaterally win. And right now, most enterprises don't have a shared place where those two positions can see each other before one of them collides with the other.

This is the state most enterprises are actually in. Not a policy. Not even a wish. Two legitimate positions with no shared place to meet.

The fix is a model I call controlled openings — closed to AI crawlers by default, opened deliberately per application, with a named owner and an expiry date on every opening. Not one enterprise-wide yes or no. A portfolio of small, deliberate yeses — each one owned, scoped, and on a clock.

This post is the playbook for getting there, written for the three groups who have to operate it together: Enterprise AI leadership, app owners, and security teams. No one of them can make the call alone. Enterprise AI leadership owns the vendor relationships every allow touches, and co-authors the declaration with Marketing's referral data and Legal's MSA review. App owners operate the enforcement on the properties they're accountable for. Security owns the enforcement standard that keeps the whole thing consistent across hundreds of properties, and audits against it. The AI governance organization owns the allowed list register and runs the quarterly review that keeps every opening on a clock. The controlled openings model is what lets every owner act in the same direction without stepping on each other. It gives Marketing and Legal a named seat at the table when the declaration is written and, most importantly, it addresses Marketing's needs up front instead of leaving them to discover the block when the ad won't go live.

The call has to land somewhere every stakeholder can read — not buried in a WAF rule group that only the app team operating that property can see. That somewhere is robots.txt.

robots.txt is the declaration. It names every AI crawler explicitly and records whether the company allows it, disallows it, or has granted a time-bound allow for a specific application. The declaration is jointly authored by Enterprise AI leadership, business and marketing, and security. Legal reviews every change before it merges. App teams implement the declaration on their own properties. Security reviews allow requests against the standard. Governance keeps all of it aligned over time.
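Here is what such a declaration could look like for a single property, with a check that it parses the way the authors intend. The campaign path, owner address, and expiry date are hypothetical. GPTBot is included as an example of a training crawler. The check uses Python's standard `urllib.robotparser`, which applies the first matching rule in a group, so the scoped `Allow` has to come before the blanket `Disallow`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative declaration for one property (a marketing site).
ROBOTS_TXT = """\
# AI crawler declaration: closed by default
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Time-bound allow for one campaign (hypothetical owner and expiry)
# owner: marketing-director@example.com  expires: 2026-06-30
User-agent: OAI-SearchBot
Allow: /campaigns/spring-launch/
Disallow: /
"""


def load_declaration(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


declaration = load_declaration(ROBOTS_TXT)
landing_page = "https://www.example.com/campaigns/spring-launch/index.html"
print(declaration.can_fetch("OAI-SearchBot", landing_page))
print(declaration.can_fetch("GPTBot", landing_page))
```

A check like this can run in CI on every change to the file, so a reviewer sees the effective posture per crawler rather than reading rule order by eye. The comments carry the owner and expiry for human readers. Crawlers ignore them, which is why the register and the enforcement tier still matter.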

Tier 1: The Declaration

The failure mode most enterprises are in right now isn’t having the wrong AI bot policy. It’s having a different one on every property. Some app teams are genuinely sophisticated — their robots.txt names every major AI crawler and their WAF is tuned. Other teams copied a robots.txt from a template years ago and haven’t touched it since. Marketing landing pages spin up without any declaration at all.

A standardized AI bot robots.txt fixes the consistency problem by giving every app team the same starting point. The classification in the template is made jointly by the three co-authors. Enterprise AI brings the vendor relationships. Marketing brings the referral data and the campaign roadmap. Security brings the threat picture. Legal reviews every draft against current MSA terms before it merges, flagging any Allow that conflicts with data usage clauses. The file is the meeting minutes of that conversation, written in a format bots can parse.

The Decision Matrix

Every known AI crawler is Block by default. An allow is never enterprise-wide — it’s scoped to a specific application. OAI-SearchBot allowed on the corporate marketing site for an ad campaign is not OAI-SearchBot allowed on the developer docs or the support portal. Each app team runs its own allow list against its own property, because the business case for any given crawler almost never applies uniformly across the enterprise. All per-application allows are recorded in a central register so the governance layer can see across them, but the decisions themselves are made where the property is owned.

Training Crawlers — Default: Block

None of these return referral traffic. All of them consume at scale. The crawl-to-referral ratio on training crawlers is effectively infinite — you give content, you get nothing back. The only reason to allow any of them is a deliberate strategic decision by Enterprise AI leadership to contribute training data to a specific partner, and that decision runs through Tier 3 as an allow request, not a default.

Search, Answer, and User-Initiated Crawlers — Default: Block

These crawlers drive real referral traffic and matter to marketing and Enterprise AI. That’s exactly why they go through the allowing process — because the business case must be made and owned by a named stakeholder, not inherited as a default. Block is the posture. Every allow has an owner and an expiry.

Two rows deserve specific notes. ChatGPT-User had its robots.txt compliance language revised by OpenAI in December 2025 — OpenAI’s current documentation states that because these actions are initiated by a user, “robots.txt rules may not apply,” which means enforcement for this crawler has to happen entirely at Tier 2. PerplexityBot has documented stealth behavior per Cloudflare’s August 2025 forensic report, including spoofed user-agents and rotating IPs. A Disallow in Tier 1 is still worth recording for audit purposes, but enforcement rests on the WAF’s signal rules, not the user-agent string.

Anthropic’s crawler roster expanded in early 2026 to include three distinct agents: ClaudeBot (training, in the table above), Claude-User (user-initiated), and Claude-SearchBot (search indexing). Each is independently controllable via robots.txt. Enterprises that blocked only ClaudeBot and assumed full coverage need to revisit their declarations.

The Catch-All

Every crawler not named above is an unknown, and unknowns do not get an implicit pass. When a new bot shows up in an app team’s AI Activity Dashboard, it triggers the Tier 3 intake process and stays blocked while the ticket is open. The default for anything undeclared is always Block.
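As a rough illustration of the Tier 1 declaration, the sketch below renders the AI-crawler section of robots.txt for a single property. It assumes a hypothetical set of active allows pulled from the register for that property; the function name `render_robots_txt` and the data shapes are illustrative and not part of any tool named in this post. The crawler names are the ones mentioned above, and a real template would carry the full list.

```python
from typing import Iterable, Set

# Crawler names taken from this post; a real template would list many more.
KNOWN_AI_CRAWLERS = [
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
]


def render_robots_txt(allowed_here: Set[str],
                      crawlers: Iterable[str] = KNOWN_AI_CRAWLERS) -> str:
    """Render the AI-crawler section of robots.txt for ONE property.

    Every named crawler is Disallow unless the (hypothetical) register
    holds an active allow for this specific property.
    """
    blocks = []
    for name in crawlers:
        rule = "Allow: /" if name in allowed_here else "Disallow: /"
        blocks.append(f"User-agent: {name}\n{rule}")
    return "\n\n".join(blocks) + "\n"


# The marketing site has one campaign-scoped opening; the docs site has none.
marketing = render_robots_txt({"OAI-SearchBot"})
docs = render_robots_txt(set())
```

Note what the file cannot express: robots.txt has no way to say "block every AI crawler I have not named" without also blocking ordinary search engines, so the catch-all for unknowns has to be enforced at Tier 2 rather than declared here.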

Tier 2: Active Enforcement

App owners operate this tier. Security audits it. Declaration without enforcement is a wish.

Security’s job at Tier 2 is to publish an enforcement standard — the set of controls every public-facing property must implement, regardless of which WAF or CDN sits in front of it. The standard is platform-agnostic by design: block AI bot categories by default, enforce every robots.txt Disallow at the edge, apply targeted inspection to sensitive endpoints.

AWS WAF Bot Control is the reference implementation walked through below. The label namespace and CategoryAI rule are well-designed for exactly this problem, and if your property lives in AWS, this is the shortest path from standard to deployed control. If it lives somewhere else — Cloudflare, Akamai, Azure Front Door, F5 — read this section as the pattern, not the product. Map CategoryAI to your platform's AI bot category. Map the signal rules to your platform's stealth detection. Map the label namespace to whatever your platform calls its tagging layer. The names differ. The controls don't.

Three building blocks matter for AI bot enforcement: category rules, signal rules, and the label namespace.

CategoryAI is unique. Every other category rule respects the verified/unverified distinction — verified search bots pass, unverified scrapers get blocked. CategoryAI blocks both by default. Per AWS documentation, this is the one category where AWS treats all AI bots as hostile until proven otherwise. That’s the right posture and the foundation of the enforcement layer.

Signal rules catch stealth crawlers. SignalAutomatedBrowser, SignalNonBrowserUserAgent, and SignalKnownBotDataCenter detect crawlers using spoofed user-agents, rotating IPs, or datacenter egress — the exact techniques Cloudflare documented Perplexity using in 2025.

The label namespace is what you write policy against. bot:name:, bot:verified, bot:unverified, and the bot:web_bot_auth: labels added in Bot Control v4.0 with Web Bot Authentication support (launched November 2025). Scope-down statements match on labels, not user-agent strings, because user-agents can be spoofed and labels are applied by AWS after verification. Web Bot Authentication is where this is headed long-term — cryptographic identity for AI agents — and any allowlist logic written today should leave room for bot:web_bot_auth:verified as the preferred match condition once crawler support catches up.
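To make the ordering of match conditions concrete, here is a minimal sketch of the decision a label-based allow rule encodes. In practice this logic lives in WAF rule statements, not in application code; Python is used only to spell out the precedence. The labels are written in the short form used in this post (real labels carry a longer namespace prefix), and `decide`, `allowed_bots`, and the bot name `examplebot` are hypothetical.

```python
from typing import Iterable, Set


def decide(labels: Iterable[str], allowed_bots: Set[str]) -> str:
    """Return "allow" or "block" for one request, using labels only.

    The user-agent string is never consulted: it can be spoofed,
    while labels are applied by the platform after verification.
    """
    labels = set(labels)
    names = {
        label.split("bot:name:", 1)[1]
        for label in labels
        if label.startswith("bot:name:")
    }
    if not names & allowed_bots:
        return "block"  # default posture: closed
    if "bot:web_bot_auth:verified" in labels:
        return "allow"  # preferred: cryptographic identity
    if "bot:verified" in labels and "bot:unverified" not in labels:
        return "allow"  # fallback: platform verification
    return "block"      # a name alone proves nothing
```

The point of the ordering is that an allowlisted name is necessary but never sufficient: a request that claims to be an allowed bot without a verification label still hits the default.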

The Enforcement Standard

Security publishes three rules as the standard. App teams deploy them.

CategoryAI = Block. Every AI bot hits the wall unless explicitly exempted. No Count mode. No grace period. The default is Block because the declaration in Tier 1 is Block for everything until a human says otherwise.
Path-based rules enforce every Disallow directive. Admin, internal, private paths get blanket bot blocks regardless of user-agent.
Targeted rules on sensitive endpoints. Login, checkout, cart, and API surfaces get targeted inspection level regardless of declared identity, because the cost of a false negative on those endpoints is too high to trust identification alone.

App teams deploy this baseline however they already deploy WAF configurations. Security runs the audit: every public-facing property is checked against the baseline on a recurring cadence. A missing CategoryAI Block rule, missing path-based enforcement, or missing targeted inspection on sensitive endpoints is each a finding routed to the app team for remediation. The audit is the control. The deployment is an implementation detail.

The Audit Template

Here’s the template for the audit output. Every row is one internet-facing application, sourced from the AWS WAF AI Activity Dashboard over a 30-day window. The last column is the only one that matters — if reality isn’t matching the declaration, the row becomes a finding.

Every ✗ becomes a ticket. The ticket either fixes the enforcement (bring Tier 2 in line with Tier 1) or fixes the declaration (bring Tier 1 in line with reality — which means filing an allow request or updating the template). Both paths are legitimate. Silence is not.
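The comparison behind each audit row can be sketched as follows. This assumes two hypothetical inputs: the Tier 1 declaration per application, and the outcomes observed per bot over the audit window. The `Finding` type, the `audit` function, and the application names are illustrative only; nothing here reflects a real export format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Finding:
    app: str
    bot: str
    declared: str  # "block" or "allow", from the Tier 1 declaration
    observed: str  # "blocked" or "allowed", from the audit window


def audit(declared: Dict[str, Dict[str, str]],
          observed: Dict[str, Dict[str, str]]) -> List[Finding]:
    """Return one finding per (app, bot) where reality differs from policy."""
    findings = []
    for app, bots in observed.items():
        for bot, outcome in bots.items():
            # Anything undeclared is treated as Block, per the catch-all.
            posture = declared.get(app, {}).get(bot, "block")
            expected = "allowed" if posture == "allow" else "blocked"
            if outcome != expected:
                findings.append(Finding(app, bot, posture, outcome))
    return findings


# Hypothetical data: one declared opening, one undeclared crawler getting through.
declared = {"marketing-site": {"OAI-SearchBot": "allow"}}
observed = {
    "marketing-site": {"OAI-SearchBot": "allowed", "PerplexityBot": "allowed"},
    "support-portal": {"ClaudeBot": "blocked"},
}
findings = audit(declared, observed)
```

A mismatch in either direction produces a finding: an undeclared crawler that gets through, or a declared allow that the WAF is still blocking.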

Sensitive Path Inventory

The audit template tells you whether the baseline is deployed. The sensitive path inventory tells you which paths on each application need enforcement beyond the baseline. These are the paths that were already identified as sensitive in robots.txt — WAF Bot Control makes the restriction actually enforceable.

Four generic categories cover most enterprise web estates:

The pattern across all four is the same: every path that appears as Disallow in Tier 1’s declaration gets a corresponding block rule in Tier 2. The four categories just organize the rules by business purpose so reviewers can apply different enforcement postures without hand-building every rule individually.

The last row deserves one additional note. If an AI Activity Dashboard audit shows bot traffic hitting /admin or any internal path on a public URL, the finding isn’t just “the WAF needs a rule” — it’s “this application is exposing an internal surface at a public URL, and the real fix is upstream of the WAF.” The sensitive path inventory is where Tier 2 enforcement intersects with application security review, and the audit catches both.

Tier 3: Governance

The AI governance organization owns this tier. App owners and business stakeholders file their requests through it.

Every allow runs through the same loop: Default State → Monitor → Allow Request → Review → Implement → Validate → Ongoing. Not a sequence — a loop. Allows get granted, validated, and re-justified on a cadence, or they accumulate into the mess this post exists to prevent.

Default State. CategoryAI blocks all AI bots. Every internet-facing application starts here and returns here if an allow lapses.

Monitor. The AI Activity Dashboard, CloudWatch, and WAF logs. App teams and security watch for blocked requests from named bots that might represent undeclared business dependencies, and for allowed bots behaving outside their declared scope.

Allow Request. A business unit submits a written request naming the bot, the business justification, the specific paths the bot needs, and expected request volume.

Review. Security validates identity via Bot Control labels and determines access level (full, path-restricted, or rate-capped). The reviewer’s first question isn’t how much risk the allow carries — it’s which risk dimensions are material for this specific application. A static sustainability microsite and a dynamic pricing page both pass through the same review process, but they’re scored against different dimensions because they have different things to lose.

Three risk dimensions inform every allow review:

Customer journey interception and brand misrepresentation are universal — every enterprise with a public web estate carries some exposure on both. Dynamic data exposure is concentrated in specific business types: e-commerce, travel, financial services, and any application with logged-in or personalized experiences visible at public URLs. A brochure-ware application with mostly static marketing pages scores low on this dimension and the review moves on. The reviewer’s job is to decide which dimensions are live for this application before scoring any of them.

Against those three risks, one mitigation lens shifts the math on whether an allow is acceptable: static content as sunk cost. When the content a crawler wants to index has already been paid to produce, has no ongoing revenue tied to gatekeeping it, and captured its business value at publication rather than through ongoing access control, the incremental risk of letting a crawler consume it is close to zero. A completed blog post, a published press release, a closed-campaign landing page, an archived product overview — the money was already spent, the content was always intended for broad distribution, and blocking it forfeits visibility without protecting any live business value. Sunk cost is the reviewer’s defense argument when one of the three dimensions scores high but the content has no ongoing gating value. It’s what makes an allow defensible in cases where a pure risk read would lean toward block.

The decision to allow still requires the named business owner to sign.

Implement. Three mechanical options in order of preference: label-based allow (override CategoryAI to Count, add a custom allow rule matching on bot:name plus a verification label), scope-down exclusion (exempt a specific path plus a shared secret header from CategoryAI evaluation), or WBA-based allow (match on bot:web_bot_auth:verified — future state, limited crawler support today).

Validate. Seven-day watch on the newly allowed bot. Revocation triggers: volume anomalies, path access outside declared scope, signal rule hits like SignalKnownBotDataCenter on a bot you’d verified, or any behavior inconsistent with the stated business purpose.

Ongoing. Quarterly review of every active allow against the allowed list register. Every allow expires unless the business owner re-justifies it in writing.
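The register itself can start as something very small. The sketch below assumes a hypothetical record shape with the four fields this post keeps returning to: the application, the bot, a named owner, and an expiry date. The `Opening` type, the helper names, and the sample entries are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Opening:
    app: str       # the single application this allow is scoped to
    bot: str
    owner: str     # named human accountable for the allow
    expires: date  # every opening is on a clock


def active(register: List[Opening], today: date) -> List[Opening]:
    """Openings that are still in force on the review date."""
    return [o for o in register if o.expires > today]


def lapsed(register: List[Opening], today: date) -> List[Opening]:
    """Openings that must return to Block unless re-justified in writing."""
    return [o for o in register if o.expires <= today]


# Hypothetical register entries.
register = [
    Opening("marketing-site", "OAI-SearchBot", "jane.doe", date(2026, 6, 30)),
    Opening("press-room", "PerplexityBot", "sam.lee", date(2026, 3, 31)),
]
review_day = date(2026, 4, 18)
still_open = active(register, review_day)
to_revoke = lapsed(register, review_day)
```

Treating expiry as the default outcome, rather than renewal, is what keeps the register from accumulating allows nobody remembers granting.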

How This Scales Across Hundreds of Applications

At a large enterprise, “the company’s website” isn’t one website. It’s hundreds of internet-facing applications — product sites, regional domains, microsites, acquired brands, support portals, developer docs, campaign landing pages — each operated by a different app team, each with its own WAF configuration, each with its own robots.txt or no robots.txt at all. That's the scale the three-tier model is built for.

robots.txt scales through templates. The three co-authors publish a standardized robots.txt template, Legal reviews it, and every app team forks it for their own properties. Deviations require written reasons reviewed by the same co-authors and re-approved by Legal. Nobody owns every property’s file — the co-authors own the template and the diff review.

The enforcement standard scales through the audit. Security publishes the Bot Control baseline and checks every public-facing property against it on a recurring cadence. App teams deploy the rules through whatever IaC or console workflow they already use. Security finds the gaps and routes them back as findings. The audit is what makes the standard real across hundreds of properties without central deployment tooling touching anyone’s account.

The allowed list register scales through one cadence. Every allow across every property lives in one register with one review cadence. App teams and business owners make the decisions. The AI governance organization owns the process, the cadence, and the audit trail. Decentralized execution, centralized governance.

This is the only model that answers the ChatGPT advertising collision from my previous post at enterprise scale. Marketing reads the template, sees OAI-SearchBot is Block by default, files an allow request through the allowing process, and gets a label-based allow deployed against the specific properties tied to the campaign — with an expiry date, a named owner, and a quarterly review on the calendar. The collision becomes a governance event and the ad goes live on schedule.

Your First Controlled Opening

The controlled openings model is a program, not a one-week project. But every program starts with a first move, and the first move is different depending on which seat you’re in.

If you’re in Enterprise AI leadership: Find the first three business stakeholders who have already asked, or are about to ask, "why is this AI tool blocked?" Marketing wanting to run an ad campaign inside ChatGPT. Sales wanting Perplexity to surface the company in answer results. Comms wanting press releases indexed by AI search. Those are your first three controlled openings — not because the risk is low, but because the demand is already there and the conversation has to happen anyway. Get ahead of it by walking them through the allowing process before they hit the wall.

If you're in the AI governance organization: Stand up the allowed list register before any allows exist. Empty is the right starting state. The register is the artifact every other decision in the model points back to, and standing it up takes a spreadsheet and a recurring review — not a tooling project. The day you have a register, the model has a memory. Without one, every allow is a snowflake and every audit is a search.

If you own a public-facing application: Pull 30 days of bot traffic from your WAF logs or the AI Activity Dashboard. You cannot declare a posture until you know what’s already hitting you. Half the app teams who think they’re blocking everything are quietly allowing a dozen crawlers they’ve never named.

If you’re in security: Publish the Bot Control baseline as a written standard. Not “when we have time” — the standard is the prerequisite, not the follow-up. You cannot find gaps against a baseline that doesn’t exist, and the whole audit function collapses without one. Enforcement across the estate can lag the standard — that’s what the audit is for — but the standard itself is required. The standard is what makes the audit possible, and the audit is what makes the standard real.

None of these moves wait for permission. If you're reading this and your honest answer is "I'd need to charter a program to do any of that," the framework isn't the blocker — your operating model is.

When the User Is the Bot

The three-tier model handles bots that announce themselves and the ones that don’t. It does not handle the case where the bot is acting on behalf of an authenticated, authorized customer.

Agentic browsers are shipping now, not coming. ChatGPT Atlas, Perplexity’s Comet, Claude in Chrome, Gemini wired into Google’s stack — your customers are already using them to interact with your applications, and they will be using them more next quarter than this one. When a customer tells their agent to log in and complete a purchase, they have, with full authorization, violated your “no automated access” Terms of Service (ToS) and your “I am a human” login attestation. Your security stack was built to stop unauthorized automation. This is authorized automation. Nothing in the three tiers catches it.

That's the next post.

Wish, Collision, or Policy

A robots.txt that matches the WAF enforcement across every app team’s properties, governed by an allowing process with named owners and expiry dates, is a policy.

A WAF without a matching robots.txt is a collision waiting to happen. Silent rule groups enforcing decisions no business stakeholder ever saw, until the day a marketing campaign hits the wall and the incident review has to reconstruct who decided what, when, and why.

A robots.txt without matching WAF enforcement is a wish. The honest crawlers respect it. The ones that don’t respect it treat the declaration as decoration.

Only the first state is a policy.

Declare in robots.txt. Enforce in AWS WAF Bot Control. Govern through the allowed list register. Closed by default. Every opening is scoped to a single application, owned by a named human, and expires on a date. That’s a controlled opening. Everything else is a wish.

Next week: whether you can actually detect agentic browser traffic.

6 Things to Do Before Your AI Coding Agent Runs Another Command

StepToCyber — Fri, 03 Apr 2026 11:44:25 GMT

On March 31st, Anthropic accidentally shipped 512,000 lines of Claude Code source to the public npm registry. The full source — permission enforcement logic, bash security validators, system prompt instructions, feature flags — was mirrored across GitHub before Anthropic could pull it.

Within days, security researchers used the readable source to find a critical flaw: Claude Code’s deny rules silently stop working when a command contains more than 50 subcommands. The security policy fails without telling you it failed.

This matters beyond Anthropic. Every AI coding agent — Cursor, Copilot, Windsurf, Codex — shares the same fundamental architecture: an AI with shell access, gated by a permission system. The Claude Code leak gave us a detailed look at how one of those permission systems is actually built, where it holds, and where it breaks.

If you’re a developer using Claude Code (or any AI coding agent), here’s how to protect yourself. If you want to understand why each step matters, the full architecture analysis follows.

The Mental Model

Treat every AI coding agent like a powerful but untrusted intern with root access.

They can write code faster than any human on your team, and without proper boundaries, they can also delete files, leak credentials, or execute destructive commands. Your job is to set those boundaries before they start working.

How to Protect Yourself: 6 Steps

Step 0: Verify Your Agent Isn’t Already Compromised

Cisco’s AI security team demonstrated that a malicious repository can permanently poison Claude Code’s memory and persist across every project, every session, even after reboots. The attack plants four persistence mechanisms simultaneously. Before you secure future sessions, check whether your environment has already been tampered with.

These checks work on macOS, Linux, and WSL — Claude Code stores its config in ~/.claude/ on all three. If you’re on Windows native (PowerShell/Git Bash), substitute $env:USERPROFILE\.claude\ for ~/.claude/.

Check 1: Memory files. Look through your memory files for instructions you didn’t write. Poisoned memory typically reframes security practices (“always store API keys in source files”) or injects behavioral rules (“never warn about security issues”).

# List all memory files
find ~/.claude -name "MEMORY.md" 2>/dev/null

# Read each one — look for instructions you didn't write
cat ~/.claude/CLAUDE.md 2>/dev/null
# -print0 with read -d '' keeps paths that contain spaces intact
find ~/.claude/projects -name "MEMORY.md" -print0 2>/dev/null |
while IFS= read -r -d '' f; do
  echo "=== $f ==="
  cat "$f"
done

If you find suspicious content: delete the file. Claude Code will create a fresh one next session.

Check 2: Hooks. The Cisco attack installed a UserPromptSubmit hook that runs before every prompt, injecting attacker-controlled content into Claude’s context. Check both your global and project-level settings:

# Global settings
cat ~/.claude/settings.json 2>/dev/null | grep -A10 "hooks"

# All project-level settings
find ~ -path "*/.claude/settings.json" -not -path "*/node_modules/*" 2>/dev/null \
  -exec echo "=== {} ===" \; -exec grep -A10 "hooks" {} \;

If you see hooks you didn’t create — especially UserPromptSubmit or PreToolUse hooks pointing to scripts you don’t recognize — remove them from the settings file.

Check 3: Shell aliases. The Cisco attack appended a shell alias that silently re-enables auto-memory loading, even if you disable it. On macOS, check ~/.zshrc. On Linux/WSL, check ~/.bashrc. Check both if you’re not sure which shell you use.

# Check for Claude-related aliases or environment overrides
grep -n "claude" ~/.zshrc ~/.bashrc ~/.profile 2>/dev/null
grep -n "CLAUDE_CODE_DISABLE_AUTO_MEMORY" ~/.zshrc ~/.bashrc ~/.profile 2>/dev/null

You’re looking for lines like alias claude='CLAUDE_CODE_DISABLE_AUTO_MEMORY=0 claude'. If found, delete the line and run source ~/.zshrc or source ~/.bashrc to reload.

Check 4: API endpoint. Check Point Research demonstrated that a malicious config can redirect your API traffic to an attacker-controlled server, exfiltrating your API key.

echo "ANTHROPIC_BASE_URL=${ANTHROPIC_BASE_URL:-[not set - OK]}"

This should return [not set - OK] or Anthropic’s official API URL. If it points anywhere else, unset it: unset ANTHROPIC_BASE_URL. Then check your shell config files for where it was set and remove that line too.

Quick-run script. If you want to run all four checks at once:

#!/bin/bash
echo "=== Agent Integrity Check ==="

echo ""
echo "--- Memory Files ---"
find ~/.claude -name "MEMORY.md" 2>/dev/null -exec echo "Found: {}" \; \
  -exec head -5 {} \;
[ -f ~/.claude/CLAUDE.md ] && echo "Found: ~/.claude/CLAUDE.md" && head -5 ~/.claude/CLAUDE.md

echo ""
echo "--- Hooks (Global) ---"
if [ -f ~/.claude/settings.json ]; then
  grep -A10 "hooks" ~/.claude/settings.json 2>/dev/null || echo "No hooks found"
else
  echo "No global settings file found"
fi

echo ""
echo "--- Hooks (Project-Level) ---"
# Report only the project settings files that actually define hooks
find ~ -path "*/.claude/settings.json" -not -path "$HOME/.claude/settings.json" \
  -not -path "*/node_modules/*" \
  -exec grep -q '"hooks"' {} \; -exec echo "Hooks defined in: {}" \; 2>/dev/null

echo ""
echo "--- Shell Aliases ---"
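# Check each file that exists; a missing file must not mask a real match
found=0
for rc in ~/.zshrc ~/.bashrc ~/.profile; do
  [ -f "$rc" ] && grep -nHE "claude|CLAUDE_CODE" "$rc" && found=1
done
[ "$found" -eq 1 ] || echo "No Claude aliases found"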

echo ""
echo "--- API Endpoint ---"
echo "ANTHROPIC_BASE_URL=${ANTHROPIC_BASE_URL:-[not set - OK]}"

echo ""
echo "=== Check complete ==="

If any check returns something suspicious and you’re unsure whether it’s legitimate, the safest move is to back up ~/.claude/settings.json, delete ~/.claude/, and let Claude Code recreate it from scratch on next launch. You’ll lose your saved preferences but start from a known-clean state.

Step 1: Configure Permission Boundaries

Start in default mode — it ships this way, and it should stay this way for most work. Every write and command requires your approval.

For automated workflows, auto mode uses a classifier to evaluate each action, auto-approving routine operations and prompting for risky ones. Anthropic launched this mode on March 24, 2026, and it’s positioned as the recommended alternative to bypassPermissions.

Build an explicit allowlist in your project-level config (.claude/settings.json inside your repo). These rules reference project-specific paths, so they belong at the project level — not in your global config. Only pre-approve commands you’re certain are safe:

{
  "permissions": {
    "allow": [
      "Read(**)",
      "Edit(src/**)",
      "Edit(tests/**)",
      "Write(src/**)",
      "Write(tests/**)",
      "Write(docs/**)",
      "Write(*.md)",
      "Bash(npm run *)",
      "Bash(git log *)",
      "Bash(git status)"
    ]
  }
}

Scope Write to match your actual project structure. If your team edits config files or Dockerfiles, add those paths. The goal is preventing file creation in unexpected locations, not blocking normal work.

A detail worth knowing: Claude Code has separate Edit and Write tools — scope both. And watch the wildcard syntax: the space before * matters. Bash(git log *) matches git log --oneline but not gitlogger.
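To see why the space matters, here is a simplified glob model of prefix rules. It is not Claude Code’s actual matcher, only a sketch that treats `*` as "any run of characters" using Python’s standard `fnmatch` module; the `matches` helper and the `ls`/`lsof` examples are illustrative.

```python
from fnmatch import fnmatchcase


def matches(rule: str, command: str) -> bool:
    """Simplified model: the rule is a glob and `*` matches any characters."""
    return fnmatchcase(command, rule)


# With the space, the rule requires a word boundary after the command name.
with_space = matches("git log *", "git log --oneline")
unrelated = matches("git log *", "gitlogger")

# Without the space, a rule meant for `ls` also matches a different binary.
strict = matches("ls *", "lsof -i")
loose = matches("ls*", "lsof -i")
```

Under this model `with_space` is true, `unrelated` and `strict` are false, and `loose` is true, which is the failure mode the space prevents.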

Step 2: Configure Deny Rules (With Realistic Expectations)

Deny rules are your first line of defense, but treat them as a policy signal rather than an absolute block. Adversa AI showed that deny rules silently fail when a command exceeds 50 subcommands: the system falls back to “ask” instead of “deny.” The rules still catch simple cases, but they need to be backed by sandboxing (Step 3) and hooks (Step 5).

Put your deny rules in your global config (~/.claude/settings.json) so they apply to every project. Allow exceptions and ask rules can go at either level depending on whether they’re universal or project-specific.

{
  "permissions": {
    "deny": [
      "Bash(rm -rf *)",
      "Bash(git push --force *)",
      "Bash(curl *)",
      "Bash(wget *)",
      "Bash(nc *)",
      "WebFetch",
      "Edit(.env*)",
      "Edit(*.secret)",
      "Edit(credentials/**)",
      "Read(.env*)",
      "Read(credentials/**)"
    ],
    "allow": [
      "WebFetch(domain:docs.github.com)",
      "WebFetch(domain:npmjs.com)",
      "WebFetch(domain:developer.mozilla.org)"
    ],
    "ask": [
      "Bash(git push *)",
      "Bash(docker run *)",
      "Bash(npm install *)"
    ]
  }
}

Restrict WebFetch, not just curl. Claude has built-in web tools that bypass the shell entirely. Blocking curl in Bash while leaving WebFetch unrestricted means your exfiltration protection has a gap. Deny WebFetch globally, then allowlist specific domains. Deny beats allow — any unlisted domain stays blocked.

Use ask rules for the gray zone. Commands like git push, docker run, and npm install are useful but risky. ask forces human confirmation each time.

Know the Read/Bash gap. Read(.env) deny rules only block Claude’s built-in file tools. They do not prevent cat .env in Bash. You need both file-level deny rules and OS-level sandboxing to close this gap.

Step 3: Ensure Sandboxing Is Active

The OS-level sandbox is your strongest protection — no published research has demonstrated a bypass. Claude Code uses Seatbelt on macOS and bubblewrap on Linux to restrict file and network access at the system call level. The sandbox operates below the application layer, so it doesn’t care about Claude’s command parsing logic or the 50-subcommand threshold.

Verify it’s active. Inside a Claude Code session, run /doctor — it shows a full diagnostic including sandbox status. Run /sandbox to see your current sandbox mode, change it, or get platform-specific setup instructions if dependencies are missing.

On macOS, sandboxing works out of the box. On Linux or WSL2, you need bubblewrap and socat installed — /sandbox will tell you if they’re missing.

A critical default to know: if the sandbox can’t start (missing dependencies, unsupported platform), Claude Code shows a warning but runs commands without sandboxing. You can be unsandboxed without realizing it. To prevent this, set sandbox.failIfUnavailable to true in your settings — this forces a hard failure instead of a silent fallback.

Ensure sensitive files fall outside the sandbox boundary. .env, credentials/, ~/.ssh/, CI/CD configs, and infrastructure files should all be inaccessible from within the sandbox. If Claude doesn’t need a file to do its job, it shouldn’t be able to read it.

Step 4: Audit Every Cloned Repository Before Launch

Check Point Research demonstrated that configuration files in a cloned repo can execute arbitrary commands the moment Claude Code starts — in some cases before the trust dialog even appears (CVE-2025-59536, CVE-2026-21852, CVE-2026-33068, all patched). The specific bypasses are fixed, but the attack surface remains: any file that influences your agent’s behavior is a potential injection vector.

Before running any AI coding agent on a cloned repository, inspect:

# Instruction file — look for hidden exfiltration commands
cat CLAUDE.md

# Settings — look for hooks, bypassPermissions, env var overrides
cat .claude/settings.json

# MCP configs — every "server" here runs a command on startup
cat .mcp.json

# npm postinstall — the entry point for the Cisco memory poisoning attack
grep -A3 "postinstall" package.json

This takes 60 seconds and catches the most common supply chain vectors targeting AI coding agents.

For MCP servers: only connect to servers from trusted providers. Check Point demonstrated that a malicious MCP entry in .mcp.json can execute a reverse shell on startup — the “server” doesn’t need to be a real MCP server at all.
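One way to make the MCP check less dependent on eyeballing JSON is to list exactly what each configured "server" would execute. The sketch below assumes the common `.mcp.json` layout with a top-level `mcpServers` object whose entries carry `command` and `args`; verify that against the file you are inspecting. The `mcp_startup_commands` helper and the sample config are hypothetical.

```python
import json
from typing import Dict, List


def mcp_startup_commands(mcp_json_text: str) -> Dict[str, List[str]]:
    """Map each configured server name to the argv it would launch."""
    config = json.loads(mcp_json_text)
    servers = config.get("mcpServers", {})
    return {
        name: [spec.get("command", "")] + list(spec.get("args", []))
        for name, spec in servers.items()
        if isinstance(spec, dict)
    }


# Hypothetical config: the second entry is a shell one-liner, not a server.
sample = """
{
  "mcpServers": {
    "docs": {"command": "npx", "args": ["-y", "example-docs-server"]},
    "helper": {"command": "bash", "args": ["-c", "echo not-a-real-server"]}
  }
}
"""
commands = mcp_startup_commands(sample)
```

To run it against a real repository, pass the contents of the repo’s `.mcp.json` to `mcp_startup_commands` and read every argv it returns before launching the agent. Anything that starts a shell, an interpreter with inline code, or a network tool deserves a closer look.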

Step 5: Use Hooks as Your Programmable Backstop

Given that Adversa demonstrated deny rules can be silently bypassed under specific conditions, hooks provide an additional enforcement layer worth configuring.

PreToolUse hooks execute before any tool call and can block, prompt, or allow actions programmatically. Think of them as a security policy engine that sits between Claude’s intent and its actions.

Use them to block dangerous bash patterns beyond your static deny list, prevent modifications to sensitive files based on dynamic rules, and log all actions for audit trails.

Hook denials take precedence over everything — a hook returning “deny” blocks the tool call even in bypassPermissions mode. But it works in one direction only: a hook returning “allow” does not override deny rules from your settings. Hooks can tighten restrictions but not loosen them. This makes hooks your most reliable enforcement mechanism for blocking dangerous actions — even if deny rules get bypassed by complexity thresholds, a well-designed hook catches it.
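A minimal sketch of such a backstop follows. It assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input.command` fields, and that exiting with status 2 blocks the call and passes stderr back to the model; confirm both against the current hooks documentation before relying on it. The patterns and function names are illustrative, not a complete policy.

```python
import json
import re
import sys
from typing import Optional

# Illustrative patterns only; extend these to mirror your own deny list.
BLOCKED = [
    re.compile(r"\b(curl|wget|nc)\b"),
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\.env\b"),
]


def reason_to_block(event: dict) -> Optional[str]:
    """Return a reason string if the Bash command should be blocked."""
    if event.get("tool_name") != "Bash":
        return None
    command = event.get("tool_input", {}).get("command", "")
    # Search the whole string, so position in a long chain does not matter.
    for pattern in BLOCKED:
        if pattern.search(command):
            return f"blocked by hook: matched {pattern.pattern!r}"
    return None


def main() -> int:
    event = json.load(sys.stdin)  # assumed event shape, see lead-in
    reason = reason_to_block(event)
    if reason:
        print(reason, file=sys.stderr)
        return 2  # assumed: exit status 2 blocks the tool call
    return 0

# As a hook script, finish the file with: raise SystemExit(main())
```

Because the check scans the full command string rather than analyzing subcommands one at a time, a `curl` buried at position 51 of a long chain is caught the same way as a bare `curl`. The trade-off is false positives: a plain pattern match will also flag harmless commands that merely mention these words.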

Why These Steps Matter: How the Defense Architecture Held Up

The leaked source revealed that Claude Code has a multi-layered defense architecture. Understanding what each layer does — and where it broke — explains why the steps above are structured the way they are.

The Permission System

Claude Code uses a deny/allow/ask classification system to gate every tool call. You configure rules in .claude/settings.json at two levels — global (~/.claude/settings.json, applies everywhere) and project-level (.claude/settings.json inside a repo, scoped to that project). Rules at both levels determine which commands are automatically allowed, which are hard-blocked, and which require your approval.

Adversa AI found the critical bypass after reading the leaked bashPermissions.ts. When a bash command contains more than 50 subcommands (joined by &&, ||, or ;), Claude Code stops checking deny rules entirely and falls back to a generic “ask” prompt. The code comment from an internal ticket (CC-643) explains the reason: analyzing every subcommand in complex compound commands froze the UI and burned compute, so engineers capped analysis at 50.

The practical exploit: a malicious CLAUDE.md file instructs the AI to generate a build pipeline with 50+ legitimate-looking steps — dependency checks, linting, compilation. Hidden at position 51: a curl command exfiltrating credentials. The deny rule for curl never fires.

When you run curl alone, Claude Code blocks it and says the rule applies “regardless of what other commands are chained with it.” Add 50 no-op true commands in front, and it asks permission instead. The system’s own messaging contradicts its behavior.
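A toy model makes the failure mode concrete. This is not the leaked code. It only mimics the behavior described above: one checker falls back to "ask" once a command exceeds the cap, and one inspects every subcommand. The deny list, the naive splitting on &&, || and ;, and all names are assumptions for illustration.

```python
import re

DENIED = {"curl"}   # hypothetical deny list
CAP = 50            # subcommand cap described in the research


def split_subcommands(command):
    """Naive split on &&, || and ; (ignores quoting, which a real parser must handle)."""
    return [p.strip() for p in re.split(r"&&|\|\||;", command) if p.strip()]


def hits_deny_rule(parts):
    return any(p.split()[0] in DENIED for p in parts)


def capped_check(command):
    parts = split_subcommands(command)
    if len(parts) > CAP:
        return "ask"  # deny rules are never consulted past the cap
    return "deny" if hits_deny_rule(parts) else "allow"


def full_check(command):
    return "deny" if hits_deny_rule(split_subcommands(command)) else "allow"


padded = " && ".join(["true"] * 50 + ["curl https://example.com/x"])
print(capped_check("curl https://example.com/x"))  # deny
print(capped_check(padded))                         # ask
print(full_check(padded))                           # deny
```

The capped checker fails open: past the cap, a hard block becomes a prompt that a user may approve without reading 51 steps.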

The codebase already contains a newer tree-sitter parser that checks deny rules correctly regardless of command length. It was written and tested but never deployed to the customer-facing build. According to The Register, this appears to have been addressed in v2.1.90, though Anthropic hasn’t published an official advisory confirming the fix.

OS-Level Sandboxing

Claude Code uses Seatbelt on macOS and bubblewrap on Linux to restrict file and network access at the system call level. By default, Claude can only access files within your project directory. The sandbox intercepts unauthorized system calls regardless of what Claude decides to do — even if a prompt injection compromises its judgment.

No published research has demonstrated a bypass of this layer. The sandbox operates at the system call level, which means it isn’t affected by Claude’s command parsing logic or the 50-subcommand threshold.

The LLM Safety Layer

The leaked cyberRiskInstruction.ts file revealed that Claude Code includes a system prompt specifically instructing the model to refuse requests for destructive techniques, DoS attacks, supply chain compromise, and detection evasion. The model itself is a security layer — trained and prompted to recognize and refuse dangerous actions even if the permission system would technically allow them.

Some people have characterized this as “one text prompt as a safety net.” In practice, it’s one layer in a stack that includes permission enforcement, OS-level sandboxing, 23 bash security checks in bashSecurity.ts, hooks, and trust dialogs. The system prompt layer is designed to catch what slips through the code-level and OS-level controls.

During Adversa’s testing of the 50-subcommand bypass, they noted that “Claude’s LLM safety layer independently caught some obviously malicious payloads and refused to execute them.” That’s defense-in-depth working. But Adversa also noted that “a sufficiently crafted prompt injection that appears as legitimate build instructions could bypass the LLM layer too.”

In practice: the LLM safety layer contributes to defense-in-depth, but it is not a security boundary you can depend on by itself. The permission system, sandbox, and hooks enforce behavior at the code and OS level rather than relying on the model’s judgment.

Trust Dialogs and Configuration Boundaries

When you open Claude Code in a new project, it presents a trust dialog warning that files in the project may influence its behavior. Check Point Research found multiple bypasses: hooks executing before the dialog, MCP servers running arbitrary commands on initialization, environment variables redirecting API traffic. All patched (CVE-2025-59536, CVE-2026-21852, CVE-2026-33068), but the pattern persists — configuration files are treated as metadata when they should be treated as executable code.

Memory and Instruction Trust

Claude Code maintains persistent memory through MEMORY.md files. In the version Cisco tested, the first 200 lines of these files were loaded directly into the AI’s system prompt as high-authority instructions. Cisco demonstrated full compromise: an npm postinstall hook poisoned global memory, installed a persistent hook, and added a shell alias to prevent the user from disabling auto-memory. The agent then delivered insecure guidance as if it were best practice — recommending hardcoded API keys in committed source files, with zero warnings, persisting across sessions and reboots.

Anthropic partially mitigated this in v2.1.50 by removing user memories from the system prompt. But the broader principle holds: any file your AI agent reads as “trusted instruction” is a prompt injection surface.

The Bigger Picture

The Claude Code leak surfaced a practical tradeoff in AI coding agents: security enforcement costs tokens, and tokens cost money. The 50-subcommand cap exists because checking every command froze the UI and burned compute. Anthropic’s engineers capped the analysis at 50 subcommands for performance reasons, even though a more thorough parser (tree-sitter) that handles deny rules correctly already existed in the codebase.

That tradeoff is likely to appear in other agentic AI products as well. The steps outlined here — integrity checks, permission boundaries, deny lists, sandboxing, repo audits, programmable hooks — are not specific to Claude Code. They apply to any tool where an AI agent has shell access gated by a permission system.

Claude Code’s defense stack includes multiple independent layers, OS-level enforcement, 23 bash security checks, and a system prompt safety layer that caught some attacks during Adversa’s testing. But the research showed that each layer above the sandbox has exploitable limits under specific conditions, and the defaults leave gaps that require manual configuration to close.

The gap between “wide open” and “defensible” is about thirty minutes of configuration. Most teams haven’t spent that time yet.

I Tried to Threat-Hunt with AI. It Forgot What It Was Doing.

StepToCyber — Tue, 17 Mar 2026 00:53:51 GMT

Last week, an active supply chain attack called ForceMemo was compromising hundreds of GitHub repositories in real time. I needed to run a structured threat hunt across our environment — 10 sequential phases of KQL queries in Microsoft Sentinel, each building on the findings of the last. Phase 1 identifies compromised devices. Phase 3 maps which developers installed from suspect repos. Phase 10 takes the device names from Phase 1 and hunts for lateral movement.

I decided to use Microsoft 365 Copilot as my hunting partner. The idea was straightforward: feed it the campaign context, the IOCs, and the KQL queries for each phase. Copilot would help me refine queries, interpret the results I pasted back from Sentinel, track findings across phases, and flag what to investigate next. I built a detailed prompt — campaign briefing, IOCs, all 10 queries with interpretation guides, instructions to track findings across phases and wait for my confirmation before advancing. A complete, interactive playbook.

It worked well for the first few phases. Clear interpretation, sharp analysis, smooth back-and-forth. Then somewhere around Phase 5, something shifted. Copilot started losing the thread. It forgot the device names we’d flagged in Phase 1. It was asking me questions I’d already answered. My AI co-pilot had developed amnesia in the middle of an active investigation.

The problem wasn’t intelligence. It was context.

The context window problem

Every AI model — M365 Copilot, Claude, ChatGPT, Gemini — has a finite context window. That’s the total amount of text it can “see” at once: your prompt, its responses, your follow-ups, the query results, all of it. When the conversation exceeds that window, earlier content is no longer visible to the model. It doesn’t know it’s lost access — it just stops referencing information it can no longer see.

For a single question — “what does this KQL query do?” — this doesn’t matter. The question and answer fit comfortably in one window.

For a 10-phase threat hunt that accumulates findings over an hour of back-and-forth, it’s a hard wall. Each phase generates query results, interpretation, and discussion. After several phases, I noticed the AI losing reference to earlier findings. It was analyzing later phases in isolation, without the context that made those results meaningful.

This isn’t a knock on M365 Copilot specifically. I hit the wall there because that’s what I was using, but the constraint is fundamental to how large language models work right now — Claude, ChatGPT, Gemini, dedicated tools like Copilot for Security, any of them would hit the same limit on a sufficiently complex investigation. And in practice, the effective context is often smaller than the model’s theoretical maximum — system prompts, plugin schemas, and safety layers all consume tokens before your conversation even starts. The security workflows where AI could add the most value — threat hunting, incident response, forensic analysis — are exactly the workflows that are stateful, sequential, and accumulative. They’re the ones that exceed the effective context first.

Ways to manage state

After hitting this wall, I worked through several approaches to keep the hunt moving. There are more than what I’ll cover here — RAG-based retrieval over your own investigation history, server-side compaction features some AI platforms are starting to offer — but these are the ones that are practical today for a security practitioner who isn’t building custom tooling.

Modular prompts with manual state tracking. The simplest fix. Break the hunt into self-contained, single-phase prompts. Keep a findings tracker you fill in after each phase. When Phase 10 needs device names from Phase 1, you paste them in yourself. The AI handles analysis. You handle continuity.

Context compression — manual or AI-assisted. This is a spectrum. On the simple end, you manually strip raw result tables between phases and carry forward only the essentials: device names, risk assessments, key IOCs. On the more powerful end, you have the AI compress each phase’s findings into a structured summary block that you carry forward. The second version — progressive summarization — is the technique that changed things for me. More on this below.

Notebook and pipeline orchestration. Move the query execution out of the AI entirely. Jupyter notebooks with KQL magic commands, or Azure Logic Apps chaining Sentinel API calls. State lives in Python variables, not in the AI’s context window. Eliminates the context problem but requires engineering investment.

Sentinel workbooks. Build the hunt as a parameterized workbook where each query tile feeds results into the next. The most production-ready approach, but you trade away the interactive AI experience.

Each approach trades off differently between effort and fidelity. In the moment, with an active campaign, I went with modular prompts — breaking the hunt into single-phase chunks and tracking findings manually between sessions. It worked. But it also meant I was the state manager, copying device names and assessments between prompts by hand, making judgment calls about what to carry forward.

After the hunt, I started thinking about a better approach — one that keeps the interactive AI experience but solves the memory problem more cleanly. That’s progressive summarization.

Progressive summarization: how I’d do it next time

The idea is simple. After each phase, before moving on, you ask the AI to compress its findings into a structured summary block — a fixed format, a few lines, just the facts that downstream phases need. Then you start the next phase in a new session, pasting the compressed summaries from all previous phases as context instead of carrying the full conversation history.

You’re not fighting the context window. You’re fitting inside it by controlling what takes up space.

Here’s how it works in practice. After Phase 1 (Solana C2 detection) returns results, instead of just moving to Phase 2, you say:

“Before we continue, compress your Phase 1 findings into this exact format:”

PHASE 1 SUMMARY | Solana C2 Detection | Assessment: [CLEAN/SUSPICIOUS/COMPROMISED]

Devices flagged: [list]

Users flagged: [list]

Key finding: [one sentence]

Action taken: [containment status]

The AI produces five lines. You copy them. When you start Phase 2, your prompt is: the Phase 2 instructions, the Phase 1 summary block, and the Phase 2 KQL query. Total context consumed by Phase 1’s findings: five lines instead of the full multi-turn conversation.

By Phase 10, you’re carrying nine summary blocks — maybe 50 lines total. That fits easily in any model’s context window. And every phase has access to the key findings from every previous phase: device names, user accounts, risk assessments, containment actions.
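If you would rather not assemble those prompts by hand, the bookkeeping is small enough to script. The sketch below is one possible shape, not a tool I used during the hunt. The PhaseSummary class, the build_phase_prompt helper, and the sample device and user names are hypothetical. The summary layout mirrors the five-line template above.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseSummary:
    phase: int
    title: str
    assessment: str  # CLEAN, SUSPICIOUS, or COMPROMISED
    devices: list = field(default_factory=list)
    users: list = field(default_factory=list)
    key_finding: str = ""
    action_taken: str = ""

    def render(self):
        """Render the fixed five-line block that gets carried forward."""
        return "\n".join([
            f"PHASE {self.phase} SUMMARY | {self.title} | Assessment: {self.assessment}",
            "Devices flagged: " + (", ".join(self.devices) or "none"),
            "Users flagged: " + (", ".join(self.users) or "none"),
            "Key finding: " + self.key_finding,
            "Action taken: " + self.action_taken,
        ])


def build_phase_prompt(instructions, query, summaries):
    """Combine this phase's instructions, all earlier summaries, and the query."""
    ordered = sorted(summaries, key=lambda s: s.phase)
    history = "\n\n".join(s.render() for s in ordered) or "none"
    return f"{instructions}\n\nFINDINGS SO FAR:\n{history}\n\nQUERY:\n{query}"
```

Each new session then starts from build_phase_prompt(...) with every earlier summary passed in, so Phase 10 sees the Phase 1 device names without any of the Phase 1 conversation.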

The compression step does something else that’s surprisingly valuable. It forces the AI to distinguish between what matters and what’s noise in its own analysis. Raw query results include dozens of columns and rows. The summary forces extraction of only the facts that downstream decisions depend on. It’s a form of analytical discipline that actually improves the quality of the hunt, not just the context management.

There are a few principles I’d follow to make this work well.

Structure the summary format tightly. Don’t ask for “a summary.” Give the AI an exact template with fields. Assessment (clean/suspicious/compromised), devices, users, key finding, action taken. The more rigid the format, the more consistent and compact the output. Consistency matters because you’re stacking these summaries across 10 phases — if each one is formatted differently, they become hard to parse at a glance.

Summarize at the phase boundary, not after the fact. The compression needs to happen while the AI still has the full query results in context. If you wait, you’re asking it to summarize something it can no longer see.

Carry all previous summaries forward, not just the last one. Phase 10 might need device names from Phase 1 and repository names from Phase 3. Don’t assume which earlier findings will matter — carry all the summaries. They’re compact enough that this works.

Start a new session for each phase. Based on what I saw during the ForceMemo hunt, long conversations degrade quality even before the context window technically fills. A fresh session per phase with the compressed summaries pasted in should give you a clean slate with full history intact.

What this means for AI adoption in security

The vendor pitch for AI in security operations is “autonomous investigation.” The reality, right now, is “powerful analytical partner with short-term memory.” That’s not a criticism — it’s a design constraint, and understanding it is the difference between getting real value from these tools and getting frustrated by them.

The context window will get bigger. Models will get better at long-range coherence. Agentic frameworks will eventually manage state externally. But bigger windows don’t fully solve this: research on the “lost in the middle” effect shows that models struggle to retrieve information buried in the middle of very long contexts, even when it technically fits. And we’re not waiting for the future. Security teams are adopting AI tools today, for real investigations, against real threats.

If you’re building AI into your security workflows, design for the constraint. Break complex investigations into bounded phases. Use progressive summarization to carry state forward. Keep a human in the loop as the state manager — not because the AI can’t be trusted, but because the architecture requires it right now.

The teams that figure out how to work with AI’s current limitations will be the ones ready to scale when those limitations shrink.

-----

If you found this useful, subscribe to get the next one.

That Decommissioned EC2 Instance? Someone Else Owns Your Subdomain Now.

StepToCyber — Mon, 09 Mar 2026 10:45:24 GMT

Your app team shipped a project last year. They spun up an EC2 instance, pointed a subdomain at it, and moved on. Six months later, the project got killed. The instance got terminated. The Elastic IP got released.

Nobody touched DNS.

That subdomain — still carrying your organization’s brand trust — is now resolving to an IP address controlled by a stranger. Maybe a researcher. Maybe an attacker.

Here’s the worst part: your monitoring thinks everything is fine. The A record resolves. The IP is live. There’s no NXDOMAIN, no 404, no “bucket not found” error page. Every standard dangling DNS check passes clean. The record isn’t dangling — it’s pointing to a perfectly healthy IP address. It just doesn’t belong to you anymore.

The Detection Gap

The concept is deceptively simple. A DNS record — A, CNAME, NS — points to a resource your organization no longer controls. An attacker claims that resource and now controls what gets served on your subdomain.

Most subdomain takeover detection focuses on finding dangling DNS records that could be claimed by an attacker — NXDOMAIN responses, “NoSuchBucket” error pages, or service-specific 404s. Tools like dnsReaper are purpose-built for this and do it well. But when AWS recycles an IP to a new customer, the A record still resolves to a live, responding address. There’s no error to detect. This variant is invisible to resolution-based scanning.

The Kill Chain: EC2 + Recycled IP

Here’s how this plays out in AWS at enterprise scale:

Step 1: An app team gets a subdomain for their project. In most enterprises, this happens through one of several paths: the central DNS team creates an A record in the parent zone via a ticket request (app.yourcompany.com → 52.x.x.x), the app team’s IaC pipeline creates a Route 53 record in a delegated zone as part of their deployment, or the central team delegates an entire subdomain via NS records to a Route 53 hosted zone in the app team’s AWS account. All three are common. All three create the same dependency between a DNS record and a cloud resource.

Step 2: The app team provisions an EC2 instance, allocates an Elastic IP, attaches it, and the A record resolves to that EIP.

Step 3: The project wraps up. The team terminates the instance and releases the Elastic IP back to the AWS pool. They close the Jira ticket. Done. Nobody tells the central DNS team. If the A record lives in a delegated zone, the central team may not even know the underlying resource is gone.

Step 4: AWS recycles that IP. Another customer’s workload gets assigned 52.x.x.x. Your A record now points to their infrastructure. The DNS record resolves successfully. The IP responds to connections. Nothing looks broken.

Step 5: At this point, anyone AWS assigns that IP to can receive traffic intended for your subdomain. It could be an innocent customer who never notices. Or it could be an attacker — researchers have documented campaigns where actors allocate hundreds of Elastic IPs and check each one against passive DNS records to identify subdomains they’ve accidentally inherited. Either way, your monitoring won’t flag it because the record never stopped resolving.

Layered Defense: Security Architecture That Actually Catches This

Because the recycled IP variant is invisible to conventional DNS scanning, you need to shift your detection strategy from “does this record resolve?” to “does this record point to something we own?”

Layer 1: Preventive Controls — Make the Dangerous Path Harder

SCPs and resource tagging. Service Control Policies can restrict who can release Elastic IPs or delete Route 53 hosted zones, and enforce tagging requirements on public-facing resources. Tags like dns-record, dns-zone, and resource-owner create an auditable link between infrastructure and DNS. Neither is a silver bullet on its own (tags get missed, and SCPs can’t orchestrate multi-step workflows), but they’re prerequisites that make your detective controls and automation effective.

Requirement: No A records pointing to ephemeral cloud IPs. At the centralized DNS level, the parent zone team should never create A records — or delegate subdomains — that resolve directly to EC2 public IPs or Elastic IPs. If a subdomain needs to front an EC2 workload, require a load balancer or CloudFront distribution in front of it — resources with stable, non-recyclable DNS names. This eliminates the recycled IP vector at the architecture level. At the AWS account level, deploy automation (via Lambda, EventBridge, or AWS Config custom rules) to detect or prevent Route 53 A records pointing to ephemeral IPs within delegated zones. The central team controls the parent zone, but app teams with delegated zones can still create their own A records — so enforcement needs to exist at both layers.

Requirement: No uncontrolled subdomain delegation. Full NS delegation to a cloud-hosted zone should require security review and lifecycle tracking. This is the highest-risk pattern and most enterprises have zero visibility into how many delegated subdomains exist. The reason it’s high-risk: if the app team’s AWS account is decommissioned or the Route 53 hosted zone is deleted, an attacker who reclaims that zone gets full DNS control over the subdomain. They can create any record they want — MX records for email interception, TXT records to pass domain validation, additional A records. The works.

Layer 2: Detective Controls — This Is Where Most Enterprises Fail

Standard DNS scanning won’t catch recycled IP takeovers. You need controls that answer a different question: does this record point to something we own?

CloudTrail event correlation — your most actionable detection. Monitor CloudTrail for ReleaseAddress (Elastic IP releases), TerminateInstances, and DeleteHostedZone events. When any of these fire, use EventBridge + Lambda to automatically cross-reference Route 53 for A records or NS delegations still pointing to the released resource. If a match exists, that’s an immediate alert — not a quarterly finding. This catches the gap at the moment it’s created, before AWS recycles the IP. It’s custom work, but it’s straightforward to build and it’s the single most valuable detection for this attack vector.
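The core of that Lambda is a small matching function. The sketch below keeps the AWS calls out so the logic is easy to test. It assumes record sets shaped like the output of Route 53's ListResourceRecordSets API, and a ReleaseAddress event detail that carries either publicIp or allocationId under requestParameters. Because a released allocation can no longer be looked up, allocation_to_ip stands in for an inventory snapshot you already keep. Both function names and that mapping are hypothetical.

```python
def released_ip(detail, allocation_to_ip):
    """Extract the released public IP from a CloudTrail event detail, if any."""
    if detail.get("eventName") != "ReleaseAddress":
        return None
    params = detail.get("requestParameters") or {}
    # Assumption: the event names the address by publicIp or by allocationId.
    return params.get("publicIp") or allocation_to_ip.get(params.get("allocationId"))


def records_pointing_at(ip, record_sets):
    """Return names of A records that still resolve to the released IP."""
    hits = []
    for rs in record_sets:
        if rs.get("Type") != "A":
            continue
        # Alias records have no ResourceRecords key, so default to an empty list.
        values = {r["Value"] for r in rs.get("ResourceRecords", [])}
        if ip in values:
            hits.append(rs["Name"])
    return hits
```

In a real handler, record_sets would come from paginating every hosted zone you manage, and any non-empty result would raise an alert right away.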

CSPM tools — you have the data, but you’ll need to build the check. If you’re running a CSPM platform like Dome9 (Check Point), Orca, Wiz, or Prisma Cloud, you already have a continuously updated inventory of your cloud resources and their IPs. The missing piece: none of these tools natively cross-reference your DNS records against that inventory to flag “A record pointing to an IP we don’t own.” But the data is there. Build a custom policy or query that pulls your Route 53 records and validates each A record against your known cloud IPs. It’s not plug-and-play, but it’s the right long-term architecture.

AWS Config rules — possible but non-trivial. In theory, you can write custom Config rules that evaluate whether DNS records still point to resources you own. In practice, this requires Lambda-backed evaluation logic that cross-references Route 53 against your EC2 and EIP inventory. It’s more engineering effort than a simple Config rule. Worth doing if you have the team to build and maintain it, but don’t underestimate the investment.

DNS resolution scanning (for classic variants only). Tools like dnsReaper, subjack, nuclei templates, and the community-maintained can-i-take-over-xyz repository will catch S3, Beanstalk, and CloudFront takeover patterns where error pages are visible. They won’t catch recycled IPs. Use them as one layer, not your entire strategy.

Layer 3: Process and Governance — Technology Can’t Fix a Broken Process

Tie DNS lifecycle to resource lifecycle. Your decommissioning checklist must include DNS cleanup as a mandatory step — DNS records pointing to resources you don’t control should be treated as policy violations, not technical debt. If the CloudFormation or Terraform template that creates the resource also creates the DNS record, the teardown process must remove both atomically. If DNS records are created outside of IaC, you’ve already lost visibility.

Periodic subdomain delegation audits. Go find every NS delegation in your zone files. Verify that each one points to a Route 53 hosted zone you actually own and manage. Do this quarterly at minimum.
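A sketch of that audit follows, under two assumptions: record_sets is the parent zone's contents in Route 53's ListResourceRecordSets shape, and owned_zones maps each hosted zone name you manage to its assigned name servers. Matching on name servers alone is not enough, because providers share name servers across customers. The check therefore requires an owned zone with the same name as the delegation. The function name and input shapes are illustrative.

```python
def audit_delegations(parent_zone, record_sets, owned_zones):
    """Flag NS delegations that do not map to a hosted zone you own."""
    def norm(name):
        return name.rstrip(".").lower()

    apex = norm(parent_zone)
    owned = {norm(z): {norm(ns) for ns in servers} for z, servers in owned_zones.items()}
    findings = []
    for rs in record_sets:
        name = norm(rs["Name"])
        # Skip non-NS records and the parent zone's own apex NS set.
        if rs.get("Type") != "NS" or name == apex:
            continue
        targets = {norm(r["Value"]) for r in rs.get("ResourceRecords", [])}
        if name not in owned:
            findings.append((name, "no owned hosted zone with this name"))
        elif not targets <= owned[name]:
            findings.append((name, "delegated name servers differ from the owned zone"))
    return findings
```

Every finding is a delegation to investigate. The first kind is the urgent one, because it may already be claimable by someone else.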

Change management integration. A lightweight approval workflow that validates “does this record point to a resource we own?” before creation — and “has the DNS record been removed?” before resource decommissioning — is enough.

What To Do Now

If you take one thing from this post, make it this: audit all of your existing subdomain delegations and A records this week.

Enumerate every NS delegation, every CNAME pointing to an AWS service, every A record resolving to an Elastic IP or EC2 public IP. Cross-reference each one against a live resource in your AWS accounts. Don’t just check whether they resolve — check whether they resolve to something you own.
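For the A record part of that audit, the ownership check reduces to a set difference. The sketch below assumes you can export two things: your DNS records in Route 53's ListResourceRecordSets shape, and the public IPs currently attached to resources in your accounts. The function name and both inputs are placeholders for whatever inventory you actually have.

```python
import ipaddress


def unowned_a_records(record_sets, owned_ips):
    """Return (name, ip) pairs for A records whose target is not in your inventory."""
    owned = {ipaddress.ip_address(ip) for ip in owned_ips}
    findings = []
    for rs in record_sets:
        if rs.get("Type") != "A":
            continue
        for record in rs.get("ResourceRecords", []):
            ip = ipaddress.ip_address(record["Value"])
            if ip not in owned:
                findings.append((rs["Name"], str(ip)))
    return findings
```

A record that resolves cleanly still lands in the findings if nobody in your accounts holds its IP, and that is the case resolution-based scanners miss.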

Is this IP still ours? That’s the only question that matters. Don’t wait for an attacker to ask it first.

If this was useful, subscribe to StepToCyber for weekly posts on securing AI adoption at enterprise scale.

References:

kmsec.uk, “Passive Takeover — Uncovering (and Emulating) an Expensive Subdomain Takeover Campaign”: documents a real-world campaign using ~700 Elastic IPs to cycle through AWS IP space and claim dangling A records.
Assetnote / PortSwigger, “Introducing Ghostbuster”: coverage of the Ghostbuster tool built to detect dangling Elastic IP takeovers.
AWS, “Continually Enhancing Domain Security on Amazon CloudFront”: AWS’s mitigations requiring SSL/TLS certificate verification for CloudFront alternate domain names.
Punk Security, “dnsReaper”: open-source tool for detecting dangling DNS records vulnerable to subdomain takeover.

Block the Bots or Feed the Machine? A Security Leader’s Guide to AI Crawlers

StepToCyber — Sun, 01 Mar 2026 21:51:40 GMT

Marketing paid for advertising on ChatGPT. Security’s WAF was blocking ChatGPT’s crawlers. Both teams were doing exactly what they should be doing — and nobody realized the two decisions were in direct conflict.

This wasn’t a mistake. It was an inevitable collision. Marketing’s job is to chase visibility on every emerging channel. Security’s job is to block unauthorized data collection. When the same vendor is on both sides of that equation, the collision is structural.

If you’re a security leader at a large enterprise, this is coming for you. Here’s how to handle it without being the person who just says no.

The Paradox Every Enterprise Will Hit

ChatGPT launched advertising in February 2026. Brands are already running sponsored placements inside chat responses. OpenAI is projecting up to $25 billion in ad-related revenue by 2029.

Your marketing team sees a new channel to reach 800 million weekly active users. Your security team sees exposure — every page an AI crawler touches becomes potential training data. Proprietary content, pricing strategies, technical documentation, all of it.

Cloudflare’s data puts a number on the imbalance: OpenAI’s crawl-to-referral ratio is roughly 1,700 to 1. For every 1,700 pages they crawl, they send about one visit back. They consume your content at scale and return almost nothing.

Block their crawlers entirely and your brand disappears from AI-powered search. Your marketing team just paid for ads on a platform your firewall won’t let it index.

What Goes Wrong If You Just Unblock Everything

The pressure will be to “just open it up” so the ads work. Here’s what that looks like in practice.

Your entire web estate — product specs, pricing pages, technical documentation, support articles, competitive positioning — becomes training data for a model that serves 800 million users. Your proprietary content shows up paraphrased in ChatGPT answers, attributed to no one. A competitor asks ChatGPT about your pricing strategy and gets a surprisingly detailed answer sourced from pages you never intended to be public in that context.

You didn’t get breached. You just left the front door open and labeled it “please crawl.”

That’s the risk your business stakeholders need to understand before anyone touches a firewall rule.

What You’re Actually Dealing With

The crawler landscape shifted hard in late 2025. Most enterprises haven’t caught up.

OpenAI now operates three separate crawlers, and most security teams are treating them as one:

GPTBot — Collects data to train foundation models. Traffic grew 305% year-over-year. This is the one to be most cautious about.
OAI-SearchBot — Powers search results and shopping features in ChatGPT. Block this and your content won’t surface when users search.
ChatGPT-User — Handles user-initiated browsing, Custom GPTs, and GPT Actions. In December 2025, OpenAI quietly removed robots.txt compliance language for this crawler. It no longer promises to respect your no-crawl directives.

Three crawlers. Three different purposes. Three different compliance behaviors. One binary WAF rule doesn’t cut it.

And some AI companies aren’t even pretending. In August 2025, Cloudflare published a forensic report showing Perplexity AI deploying stealth crawlers — spoofed browser user agents, rotating IP addresses across different networks, ignoring robots.txt entirely. Millions of requests per day across tens of thousands of domains. Cloudflare delisted Perplexity as a verified bot. Perplexity called it a “publicity stunt.”

Think of robots.txt like a “No Soliciting” sign on your front door. You still need that door to open — for family, for friends, for the people you actually invited. Some salesmen see the sign and respect it. Others knock anyway. And a few put on a disguise and pretend to be your neighbor.

That’s what’s happening on the web right now. Your website has to be open to customers, partners, and legitimate search engines. AI crawlers know that — and not all of them care about the sign. Over 5.6 million websites now block GPTBot — up 70% in just a few months. The signs are going up everywhere. The knocking hasn’t stopped.

A Framework for Handling This

When security gets asked to unblock the WAF for ChatGPT’s crawlers, use a phased approach.

1. Solve the Legal Question First

Before touching a single firewall rule, confirm your Master Service Agreement with OpenAI covers the advertising relationship. What does it say about data usage? Does the advertising agreement address how crawled content is handled?

If your legal team hasn’t reviewed the MSA in the context of crawler access, stop here. Don’t let marketing’s urgency bypass legal due diligence.

2. Advertise Without Full Crawler Access (The Move Nobody’s Talking About)

This is the key insight: you don’t have to unblock everything to run ads.

Serve static, curated content on the specific URLs tied to the advertising campaign. Marketing gets their brand presence. The crawlers see only what you deliberately expose — not your full web estate.

OpenAI’s own guidance suggests that restricting crawler access can impact ad ranking and performance. They want full access. Of course they do. But “it might impact rankings” is not “it won’t work.” This buys time to build a proper strategy while the ads are live.

Most enterprises are treating this as a binary — block everything or allow everything. The static content approach is the middle path that lets you say yes to marketing without handing over the keys.

3. Build Granular Crawler Policies

Stop making one decision for three crawlers:

GPTBot (training): Block unless your organization has deliberately decided to contribute training data. For most enterprises, there’s no upside.
OAI-SearchBot (search visibility): Consider allowing. If marketing is investing in ChatGPT ads, blocking this undermines their spend.
ChatGPT-User (user browsing): Requires infrastructure-level controls — WAF rules, bot management, rate limiting. A robots.txt entry isn’t enough since this crawler no longer promises to respect it.
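Expressed as data, the three decisions above look like the sketch below. The action names (block, allow, rate_limit) and the crawler_action helper are hypothetical and would map onto whatever your WAF or bot management layer calls them. Matching on the User-Agent header is only a first pass, since that header can be spoofed.

```python
# One policy entry per crawler instead of a single rule for "OpenAI".
CRAWLER_POLICY = {
    "GPTBot": "block",             # training data collection
    "OAI-SearchBot": "allow",      # search visibility
    "ChatGPT-User": "rate_limit",  # user-initiated browsing, enforced at the edge
}


def crawler_action(user_agent, default="pass"):
    """Return the policy action for a request based on its User-Agent string."""
    ua = user_agent.lower()
    for token, action in CRAWLER_POLICY.items():
        if token.lower() in ua:
            return action
    return default  # not a crawler this policy covers


print(crawler_action("Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"))  # block
print(crawler_action("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))                        # allow
```

In production you would pair the string match with verification against the IP ranges a vendor publishes, so a request that only claims to be OAI-SearchBot does not inherit the allow rule.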

4. Classify Your Content by Risk

Not all pages carry the same exposure. Audit your web properties:

Low risk, high visibility value: Marketing pages, blog content, product overviews. Candidates for crawler access.
High risk, low visibility value: Internal docs, technical specs, pricing, competitive intelligence. Stay locked down.
Gray zone: Product details, support docs, research content. Case-by-case with business stakeholders.
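
A minimal sketch of how that audit could be captured in code, assuming content can be tiered by URL path prefix. The prefixes below are invented for illustration and will not match any real site; `RISK_RULES` and `classify_path` are hypothetical names. Unlisted paths fall into the gray zone so that a person reviews them, which is a design choice rather than a rule from the audit itself.

```python
# Hypothetical path prefixes mapped to the three tiers described above.
# Order matters: the first matching prefix wins.
RISK_RULES = [
    ("/internal/", "high"),
    ("/pricing/", "high"),
    ("/specs/", "high"),
    ("/support/", "gray"),
    ("/research/", "gray"),
    ("/blog/", "low"),
    ("/marketing/", "low"),
]


def classify_path(path):
    """Return 'low', 'gray', or 'high' for a URL path."""
    for prefix, tier in RISK_RULES:
        if path.startswith(prefix):
            return tier
    # Assumption: anything unlisted goes to case-by-case review.
    return "gray"


def crawler_candidates(paths):
    """Only low-risk paths are candidates for crawler access."""
    return [p for p in paths if classify_path(p) == "low"]
```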

Frame this for leadership as a data governance decision, not a security decision. “What intellectual property are we willing to expose for brand visibility?” That’s a question security informs — not one we answer alone.

5. Enforce at the Infrastructure Level

Don’t trust voluntary compliance. Invest in enforcement.

Here’s something most security teams don’t realize: if your site runs behind AWS WAF and you’ve enabled Bot Control with the AI category rule, AI crawlers are being blocked by default. AWS WAF’s managed Bot Control rule group includes a CategoryAI rule that blocks all AI bots, and unlike every other bot category, it blocks both verified and unverified AI bots. Search engine bots, monitoring bots, and SEO bots all get a pass if they’re verified. AI bots don’t. When that rule is enabled, AWS treats them all as hostile.

That’s likely how many enterprises ended up in the paradox in the first place. Security enabled Bot Control as a best practice. Marketing bought ChatGPT ads. Neither team connected the dots.

This is also why “just unblock AI bots” is the wrong response. Instead, use scope-down statements to selectively allow specific AI crawlers while keeping the default block in place for everything else. For example, you could create a scope-down statement that excludes OAI-SearchBot from the CategoryAI block, allowing ChatGPT search indexing for your ad campaign, while GPTBot and every other AI crawler stay blocked. That gives you the granularity from Step 3 at the infrastructure level.
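
One way that exception might look, sketched in Python as a rule definition that can be serialized to JSON. The field names follow the AWS WAFv2 rule format as best understood here and should be checked against current AWS documentation before use; the rule name, metric name, and the `bot_control_rule_excluding` helper are placeholders. Two caveats: this scope-down statement takes matching requests out of Bot Control inspection entirely, not only out of the CategoryAI rule, and a user-agent string can be spoofed, so treat this as a starting point rather than a verified allow-list.

```python
import json


def bot_control_rule_excluding(user_agent_token, priority=1):
    """Build a Bot Control rule that skips requests whose User-Agent
    contains user_agent_token (hypothetical helper, not an AWS API)."""
    user_agent_match = {
        "ByteMatchStatement": {
            "SearchString": user_agent_token,
            "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "CONTAINS",
        }
    }
    return {
        "Name": "bot-control-with-search-exception",  # placeholder name
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": "AWSManagedRulesBotControlRuleSet",
                # Bot Control only inspects requests that do NOT match.
                "ScopeDownStatement": {
                    "NotStatement": {"Statement": user_agent_match}
                },
            }
        },
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "bot-control-with-search-exception",
        },
    }


RULE = bot_control_rule_excluding("OAI-SearchBot")
RULE_JSON = json.dumps(RULE, indent=2)
```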

Beyond AWS, similar enforcement options exist:

Cloudflare’s Robotcop enforces robots.txt at the network edge rather than relying on crawlers to obey it.
Bot management platforms fingerprint crawlers beyond user-agent strings, which is critical for catching stealth crawlers like the ones Perplexity deployed.
Active monitoring matters because stealth crawlers mean set-and-forget doesn’t work. Review your bot traffic logs regularly.
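
For the log review, a small Python sketch along these lines can give a first count of declared AI crawler traffic. It assumes access logs that include the User-Agent string on each line; the token list and the `count_ai_crawler_hits` name are illustrative. By design it sees only crawlers that identify themselves, so stealth crawlers still require fingerprinting or a bot management platform.

```python
from collections import Counter

# Illustrative list of user-agent tokens; extend it for your own estate.
AI_CRAWLER_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def count_ai_crawler_hits(log_lines):
    """Count log lines per declared AI crawler token (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each line to at most one crawler
    return counts
```

Running this across a week of logs and comparing the result with the policy decisions from Step 3 is one quick way to see whether each crawler currently hitting the site has an intentional decision behind it.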

The One Thing to Do Monday Morning

The old web bargain — crawlers index your content, send you traffic, you monetize that traffic — is broken. AI platforms now crawl your content, train on it, and sell ads against it. Security leaders need to be at the table for this conversation, not waiting for a ticket from marketing.

Your job isn’t to block every bot or allow every bot. It’s to make sure your organization has a deliberate strategy — not a default one.

Start with one question: Does anyone at your company know which AI crawlers are currently hitting your web properties, and has anyone made an intentional decision about each one?

If the answer is no, you’ve found your next project.