AI Models & Guardrails Comparison

Reference for Gray Swan AI's Safeguards Challenge - comparing models for capabilities, cost, speed, and robustness.

Selection Criteria Summary

Base Model Requirements

  • P0: High capabilities on multi-agent tool-calling
  • P0: Low built-in robustness (for measuring guardrail uplift)
  • P1: High speed (low latency for agent interactions)
  • P2: Low cost (for high-volume testing)

Recommended Candidates

  • R:5 Gemini 2.5 Pro (Google DeepMind)
  • R:5 Gemini 2.5 Flash (Google DeepMind)
  • R:4 Qwen 2.5 72B (Alibaba)
  • R:4 GLM-4 (Zhipu AI)
  • R:5 GLM-4 Plus (Zhipu AI)

AI Models

24 models
ModelProviderAgenticRobustnessCost ($/M)ContextNotes
Claude 3.5 Sonnet
Anthropic9/109/10 (high)
$3 in
$15 out
200KBest-in-class for coding and agentic tasks
Claude Opus 4.5
Anthropic10/1010/10 (high)
$15 in
$75 out
200KFlagship model, highest capabilities
Command R+
104B
Cohere8/106/10 (medium)
$2.5 in
$10 out
128KEnterprise-focused, good for RAG
DeepSeek-R1
DeepSeek9/106/10 (medium)
$0.55 in
$2.19 out
128KReasoning model, competitive with o1
DeepSeek-V3
671B (MoE)
DeepSeek8/105/10 (medium)
$0.27 in
$1.1 out
128KVery cost-effective, open weights MoE architecture
Gemini 2.5 Flash
Google DeepMind8/105/10 (medium)
$0.075 in
$0.3 out
1000KFast and cheap, similar robustness profile to 2.5 Pro
Gemini 2.5 Pro
Google DeepMind9/105/10 (medium)
$1.25 in
$5 out
1000K★ CANDIDATE: High capability, lower robustness, good speed/cost
Gemini 3 Pro
Google DeepMind9/108/10 (high)
$2 in
$8 out
1000KLatest generation, improved robustness
GLM-4
Zhipu AI7/104/10 (low)
$0.14 in
$0.14 out
128KChinese model with lower robustness profile
GLM-4 Plus
Zhipu AI8/105/10 (low)-128KLatest GLM version to evaluate
GPT-4.1
OpenAI9/108/10 (high)
$2 in
$8 out
128KLatest GPT-4 series
GPT-4o
OpenAI9/108/10 (high)
$2.5 in
$10 out
128KFlagship multimodal model
IBM Granite 3.1 8B
8B
IBM6/106/10 (medium)-128KEnterprise-focused, Apache 2.0 license
Grok-3
xAI9/106/10 (medium)
$3 in
$15 out
131KStrong agentic capabilities
Kimi K2
Moonshot AI8/107/10 (high)-1000KLong context specialist, but higher robustness
Llama 3.1 405B
405B
Meta8/107/10 (medium)-128KLargest open model, previously tested 8B variant
Llama 4 Scout
17B active (MoE)
Meta8/107/10 (medium)-10000KMoE architecture, extremely long context
Mistral Large 2
123B
Mistral AI8/106/10 (medium)
$2 in
$6 out
128KPreviously tested Mistral-7B and Small variants
Amazon Nova Pro
Amazon7/107/10 (medium)
$0.8 in
$3.2 out
300KAvailable through Bedrock
o1
OpenAI-9/10 (high)
$15 in
$60 out
200KReasoning-focused model
o3-mini
OpenAI-8/10 (high)
$1.1 in
$4.4 out
200KEfficient reasoning model
Qwen 2.5 72B
72B
Alibaba8/104/10 (low)
$0.4 in
$1.2 out
131KOpen weights available. Recommended by xiaohan for lower robustness.
Qwen 3 72B
72B
Alibaba9/107/10 (medium)
$0.5 in
$1.5 out
131KImproved robustness - less suitable for adversarial testing
Qwen 3-VL
Alibaba8/106/10 (medium)-131KMultimodal model for vision tasks

Guardrails & Classifiers

Critical distinction: Content safety classifiers (LlamaGuard, ShieldGemma) do NOT detect prompt injection. For the Safeguards Challenge, prioritize dedicated PI detectors or solutions that handle both.

NameProviderTypeSourceSizeAccuracyGen. GapNotes
Meta Prompt Guard 2
Metaprompt injectionOpen86M78%-
★ RECOMMENDED: True PI detector, not just content safety
Only detects prompt injection, not content safety
Qwen3Guard-8B
Previously tested
AlibababothOpen8B85.3%51.5%
★ RECOMMENDED: High bar for known attacks, punishes lazy red teams
Huge generalization gap - poor on novel attacks
Granite Guardian 3.3 8B
Previously tested
IBMbothOpen8B78%6.5%
★ RECOMMENDED: Best generalization - the real test for novel attacks
Lower peak accuracy than Qwen3Guard
WildGuard 7B
AI2 / AllenAIbothOpen7B--
★ RECOMMENDED: Calibration baseline for usability tradeoffs
Research model, may need adaptation for production
GPT-OSS Safeguard 20B
Previously tested
ROOST InitiativebothOpen20B82%-
★ RECOMMENDED: Custom policy testing capability
Larger model = slower inference
LlamaFirewall
Previously tested
MetabothOpen---
Meta's comprehensive firewall solution
Cygnal 8B
Previously tested
Gray Swan AIbothClosed8B88%-
Internal Gray Swan model - use as baseline
Llama Guard 4 12B
Previously tested
Metacontent safetyOpen12B85%-
Content safety only - include only if multimodal track needed
⚠️ Does NOT detect prompt injection!
ShieldGemma 9B
Previously tested
Googlecontent safetyOpen9B83%-
Content safety only - not for PI testing
⚠️ Does NOT detect prompt injection!
Azure Prompt Shields
Previously tested
Microsoftprompt injectionClosed-85%-
★ RECOMMENDED: The specialist for prompt injection
AWS Bedrock Guardrails
Previously tested
AmazonbothClosed-78%-
★ RECOMMENDED: Industry standard, broad enterprise use
Performance lags behind specialized solutions
Google Model Armor
Previously tested
Google CloudbothClosed---
★ RECOMMENDED: Infrastructure defense layer
Tied to Google Cloud infrastructure
Lakera Guard
Lakeraprompt injectionClosed-88%-
★ RECOMMENDED: Zero-day catcher with live threat intel
OpenAI Moderation API
OpenAIcontent safetyClosed---
Content safety only - not suitable for PI testing
⚠️ Does NOT include prompt injection detection!
ProtectAI Guardian
Previously tested
ProtectAIbothClosed---
Enterprise ML security platform
Amazon Nova Micro (Classifier)
Amazonoutput classifierClosed---
★ RECOMMENDED: Amazon's submission for the challenge
GPT-4.1 (as guardrail)
Previously tested
OpenAIbothClosed-86%-
Baseline for LLM-as-classifier approach
Higher cost than specialized models
GPT-4.1-mini (as guardrail)
Previously tested
OpenAIbothClosed-80%-
Lower cost LLM classifier option
o4-mini (as guardrail)
Previously tested
OpenAIbothClosed---
Reasoning model for complex safety decisions