AI Models & Guardrails Comparison

Reference for Gray Swan AI's Safeguards Challenge - comparing models for capabilities, cost, speed, and robustness.

Selection Criteria Summary

Base Model Requirements

  • P0: High capabilities on multi-agent tool-calling
  • P0: Low built-in robustness (for measuring guardrail uplift)
  • P1: High speed (low latency for agent interactions)
  • P2: Low cost (for high-volume testing)

Recommended Candidates

  • R:5 Gemini 2.5 Pro (Google DeepMind)
  • R:5 Gemini 2.5 Flash (Google DeepMind)
  • R:3 Qwen 2.5 72B (Alibaba)
  • R:4 GLM-4 (Zhipu AI)
  • R:5 GLM-4 Plus (Zhipu AI)

AI Models

37 models
ModelProviderAgenticRobustnessCost ($/M)ContextNotes
Claude 3.5 Sonnet
Anthropic9/108/10 (high)
$3 in
$15 out
200KMEASURED 83/100 robustness. Best-in-class for coding and agentic tasks.
Claude Opus 4.5
Anthropic10/1010/10 (high)
$5 in
$25 out
200KCONTROL GROUP: MEASURED lowest ASR (0.5%). Most robust frontier model for high-robustness baseline.
Command R+
104B
Cohere8/105/10 (medium)
$2.5 in
$10 out
128KMEASURED: 3.8% ASR. Enterprise-focused, good for RAG.
DeepSeek-R1
671B (37B active)
DeepSeek5/101/10 (low)
$0.55 in
$2.19 out
128K★ TOP CANDIDATE: Lowest robustness among frontier models. CAVEAT: No native tool-calling (score 5). Use R1-0528 variant.
DeepSeek-V3
671B (37B active)
DeepSeek8/104/10 (medium)
$0.27 in
$1.1 out
128KVery cost-effective, open weights MoE architecture
DeepSeek-V3.1
671B (37B active)
DeepSeek8/104/10 (low)
$0.2 in
$0.8 out
128K★ TOP CANDIDATE: MEASURED 5.3% ASR. Strong agentic (8/10) + low robustness. Superseded by V3.2 but still available.
Gemini 2.5 Flash
Google DeepMind8/105/10 (medium)
$0.075 in
$0.3 out
1000KFast and cheap, similar robustness profile to 2.5 Pro
Gemini 2.5 Pro
Google DeepMind9/105/10 (low)
$1.25 in
$10 out
1049K★ TOP CANDIDATE: MEASURED highest vulnerability (8.5% ASR). High agentic (9/10). Still available despite Gemini 3.0.
Gemini 3 Pro
Google DeepMind9/108/10 (high)
$2 in
$8 out
1000KLatest generation, improved robustness
GLM-4
Zhipu AI7/104/10 (low)
$0.14 in
$0.14 out
128KChinese model with lower robustness profile
GLM-4 Plus
Zhipu AI8/105/10 (low)-128KLatest GLM version to evaluate
GLM-4.7
355B (32B active)
Z.ai (Zhipu AI)9/103/10 (low)
$0.4 in
$1.5 out
200K★ TOP CANDIDATE: Open-source SOTA tool-calling (87.4% tau-bench) + low robustness. MIT license. 85% Promptfoo pass rate (88 failed probes).
GPT-4.1
OpenAI9/108/10 (high)
$2 in
$8 out
128KLatest GPT-4 series
GPT-4o
OpenAI9/108/10 (high)
$2.5 in
$10 out
128KFlagship multimodal model
GPT-5
OpenAI9/109/10 (high)
$2 in
$10 out
400KCONTROL GROUP: MEASURED 2% ASR. High robustness baseline.
GPT-5.1
OpenAI9/109/10 (high)
$1.75 in
$12 out
400KCONTROL GROUP: MEASURED 2.5% ASR.
IBM Granite 3.1 8B
8B
IBM6/106/10 (medium)-128KEnterprise-focused, Apache 2.0 license
Grok 2
xAI8/105/10 (medium)
$2 in
$10 out
131KMEASURED: 4.4% ASR.
Grok 4
xAI9/108/10 (high)
$3 in
$15 out
131KCONTROL GROUP: MEASURED 3% ASR. Strong agentic + good robustness.
Kimi K2
1T (32B active)
Moonshot AI9/104/10 (low)
$0.6 in
$2.5 out
131K★ TOP CANDIDATE: MEASURED 4.8% ASR. Excellent agentic (9/10, 200-300 tools). Use Instruct variant for max vulnerability.
Kimi K2 Thinking
1T (32B active)
Moonshot AI10/106/10 (medium)
$0.75 in
$3 out
256KReasoning variant with better safety alignment
Llama 3.1 405B
405B
Meta8/104/10 (low)-128K★ CANDIDATE: MEASURED 5.9% ASR. Good agentic capability + moderate vulnerability.
Llama 3.3 70B Instruct
70B
Meta8/103/10 (low)-128K★ TOP CANDIDATE: MEASURED 28/100 robustness, 6.7% ASR. Good for differential testing.
Llama 4 Maverick
400B (17B active)
Meta8/106/10 (medium)
$0.31 in
$0.85 out
1000K1400+ LMArena Elo, very fast inference
Llama 4 Scout
17B active (MoE)
Meta8/107/10 (medium)-10000KMoE architecture, extremely long context
MiniMax M2
230B (10B active)
MiniMax9/105/10 (medium)
$0.3 in
$1.2 out
197KMIT license, 92% cheaper than Claude. Strong coding/agentic.
MiniMax M2 (Thinking)
230B (10B active)
MiniMax9/1010/10 (high)
$0.3 in
$1.2 out
204KCONTROL GROUP: Verified 100% resistance. 8% cost of Claude. Open weights. Ideal high-robustness baseline.
Mistral Large 2
123B
Mistral AI8/106/10 (medium)
$2 in
$6 out
128KPreviously tested Mistral-7B and Small variants
Amazon Nova Pro
Amazon7/107/10 (medium)
$0.8 in
$3.2 out
300KAvailable through Bedrock
o1
OpenAI-9/10 (high)
$15 in
$60 out
200KCONTROL GROUP: MEASURED 2.7% ASR. Reasoning-focused model.
o3
OpenAI-9/10 (high)
$10 in
$40 out
200KCONTROL GROUP: MEASURED 2.9% ASR. Advanced reasoning model.
o3-mini
OpenAI-5/10 (medium)
$1.1 in
$4.4 out
200KMEASURED: 4.3% ASR. Efficient reasoning model.
Qwen 2.5 72B
72B
Alibaba8/103/10 (low)
$0.4 in
$1.2 out
131K★ CANDIDATE: Open weights. Qwen 2.5 family has low robustness.
Qwen 2.5 7B Instruct
7B
Alibaba6/102/10 (low)
$0.1 in
$0.3 out
131K★ TOP CANDIDATE: MEASURED lowest robustness (18/100). Limited agentic capability.
Qwen 3 72B
72B
Alibaba9/107/10 (medium)
$0.5 in
$1.5 out
131KImproved robustness - less suitable for adversarial testing
Qwen 3-VL
Alibaba8/106/10 (medium)-131KMultimodal model for vision tasks
QwQ-32B
32B
Alibaba8/104/10 (medium)
$0.075 in
$0.15 out
131K★ TOP CANDIDATE: Strong tool-calling (BFCL 66.4), documented suffix attack vulnerability. Superseded by Qwen3 but good for testing.

Guardrails & Classifiers

Critical distinction: Content safety classifiers (LlamaGuard, ShieldGemma) do NOT detect prompt injection. For the Safeguards Challenge, prioritize dedicated PI detectors or solutions that handle both.

NameProviderTypeSourceSizeAccuracyGen. GapNotes
Meta Prompt Guard 2
Metaprompt injectionOpen86M (also 22M variant)82%-
★ RECOMMENDED with caveats. Strong PI detector but has documented encoding bypasses. Use as ONE layer in defense-in-depth.
Only detects prompt injection, NOT content safety
Qwen3Guard-8B
Previously tested
AlibababothOpen8B85.3%57.2%
⚠️ NOT RECOMMENDED: Catastrophic generalization gap makes it unreliable against novel attacks. Consider Granite Guardian instead.
⚠️ CRITICAL: 57.2-point generalization gap - LARGEST among all models
Granite Guardian 3.3 8B
Previously tested
IBMbothOpen8B81%6.5%
★ TOP RECOMMENDED: Best generalization for novel attacks. #1 on REVEAL, #3 on LLM-AggreFact.
English-only training
WildGuard 7B
AI2 / AllenAIbothOpen7B (Mistral-7B-v0.3 base)82.8%-
★ RECOMMENDED for single-turn. 2.4% jailbreak success verified. Combine with multi-turn defenses for production.
⚠️ Vulnerable to multi-turn attacks: >90% ASR with X-Teaming adaptive attacks
GPT-OSS Safeguard 20B
Previously tested
ROOST InitiativebothOpen20B82%-
★ RECOMMENDED: Custom policy testing capability
Larger model = slower inference
LlamaFirewall
Previously tested
MetabothOpen-50%-
Meta's comprehensive firewall - has documented bypasses
Only ~50% bypass rate in testing
Cygnal 8B
Previously tested
Gray Swan AIbothClosed8B88%-
Internal Gray Swan model - use as baseline
Llama Guard 4 12B
Previously tested
Metacontent safetyOpen12B85%40%
Content safety only - too permissive for adversarial testing. Pair with Prompt Guard 2 for PI.
⚠️ Does NOT detect prompt injection!
ShieldGemma 9B
Previously tested
Googlecontent safetyOpen2B/4B/9B/27B variants54.7%-
Content safety only - not for PI testing. Smaller models may be better.
⚠️ Does NOT detect prompt injection!
Azure Prompt Shields
Previously tested
Microsoftprompt injectionClosed-89%-
★ RECOMMENDED for Azure ecosystem. Good accuracy but has documented character injection bypasses. Layer with other defenses.
⚠️ Character injection attacks reduce detection from 89% to 7%
AWS Bedrock Guardrails
Previously tested
AmazonbothClosed-78%-
★ RECOMMENDED: Industry standard, broad enterprise use
Performance lags behind specialized solutions
Google Model Armor
Previously tested
Google CloudbothClosed---
★ RECOMMENDED: Infrastructure defense layer
Tied to Google Cloud infrastructure
Lakera Guard
Lakera (Check Point)bothClosed-92.5%-
★ TOP RECOMMENDED: Strongest real-time threat intel. Best for evolving attacks. Consider layering with static classifier.
<90% on NotInject benchmark (vs LlamaGuard3 at 99.71%)
OpenAI Moderation API
OpenAIcontent safetyClosed---
Content safety only - not suitable for PI testing
⚠️ Does NOT include prompt injection detection!
ProtectAI Guardian
Previously tested
ProtectAIbothClosed---
Enterprise ML security platform
NeMo Guardrails
NVIDIAbothOpen-27.46%-
⚠️ NOT RECOMMENDED as sole protection. Documented critical bypasses. Use only as ONE layer with input normalization.
⚠️ CRITICAL: 72.54% jailbreak bypass rate (ASR)
Amazon Nova Micro (Classifier)
Amazonoutput classifierClosed---
★ RECOMMENDED: Amazon's submission for the challenge
GPT-4.1 (as guardrail)
Previously tested
OpenAIbothClosed-86%-
Baseline for LLM-as-classifier approach
Higher cost than specialized models
GPT-4.1-mini (as guardrail)
Previously tested
OpenAIbothClosed-80%-
Lower cost LLM classifier option
o4-mini (as guardrail)
Previously tested
OpenAIbothClosed---
Reasoning model for complex safety decisions