AI Models Comparison | Gray Swan AI

Selection Criteria Summary

Base Model Requirements

P0: High capabilities on multi-agent tool-calling
P0: Low built-in robustness (for measuring guardrail uplift)
P1: High speed (low latency for agent interactions)
P2: Low cost (for high-volume testing)

Recommended Candidates

R:5 Gemini 2.5 Pro (Google DeepMind)
R:5 Gemini 2.5 Flash (Google DeepMind)
R:4 Qwen 2.5 72B (Alibaba)
R:4 GLM-4 (Zhipu AI)
R:5 GLM-4 Plus (Zhipu AI)

AI Models

24 models

Model	Provider	Agentic	Robustness	Cost ($/M)	Context	Notes
Claude 3.5 Sonnet	Anthropic	9/10	9/10 (high)	$3 in $15 out	200K	Best-in-class for coding and agentic tasks
Claude Opus 4.5	Anthropic	10/10	10/10 (high)	$15 in $75 out	200K	Flagship model, highest capabilities
Command R+ 104B	Cohere	8/10	6/10 (medium)	$2.5 in $10 out	128K	Enterprise-focused, good for RAG
DeepSeek-R1	DeepSeek	9/10	6/10 (medium)	$0.55 in $2.19 out	128K	Reasoning model, competitive with o1
DeepSeek-V3 671B (MoE)	DeepSeek	8/10	5/10 (medium)	$0.27 in $1.1 out	128K	Very cost-effective, open weights MoE architecture
Gemini 2.5 Flash	Google DeepMind	8/10	5/10 (medium)	$0.075 in $0.3 out	1000K	Fast and cheap, similar robustness profile to 2.5 Pro
Gemini 2.5 Pro	Google DeepMind	9/10	5/10 (medium)	$1.25 in $5 out	1000K	★ CANDIDATE: High capability, lower robustness, good speed/cost
Gemini 3 Pro	Google DeepMind	9/10	8/10 (high)	$2 in $8 out	1000K	Latest generation, improved robustness
GLM-4	Zhipu AI	7/10	4/10 (low)	$0.14 in $0.14 out	128K	Chinese model with lower robustness profile
GLM-4 Plus	Zhipu AI	8/10	5/10 (low)	-	128K	Latest GLM version to evaluate
GPT-4.1	OpenAI	9/10	8/10 (high)	$2 in $8 out	128K	Latest GPT-4 series
GPT-4o	OpenAI	9/10	8/10 (high)	$2.5 in $10 out	128K	Flagship multimodal model
IBM Granite 3.1 8B 8B	IBM	6/10	6/10 (medium)	-	128K	Enterprise-focused, Apache 2.0 license
Grok-3	xAI	9/10	6/10 (medium)	$3 in $15 out	131K	Strong agentic capabilities
Kimi K2	Moonshot AI	8/10	7/10 (high)	-	1000K	Long context specialist, but higher robustness
Llama 3.1 405B 405B	Meta	8/10	7/10 (medium)	-	128K	Largest open model, previously tested 8B variant
Llama 4 Scout 17B active (MoE)	Meta	8/10	7/10 (medium)	-	10000K	MoE architecture, extremely long context
Mistral Large 2 123B	Mistral AI	8/10	6/10 (medium)	$2 in $6 out	128K	Previously tested Mistral-7B and Small variants
Amazon Nova Pro	Amazon	7/10	7/10 (medium)	$0.8 in $3.2 out	300K	Available through Bedrock
o1	OpenAI	-	9/10 (high)	$15 in $60 out	200K	Reasoning-focused model
o3-mini	OpenAI	-	8/10 (high)	$1.1 in $4.4 out	200K	Efficient reasoning model
Qwen 2.5 72B 72B	Alibaba	8/10	4/10 (low)	$0.4 in $1.2 out	131K	Open weights available. Recommended by xiaohan for lower robustness.
Qwen 3 72B 72B	Alibaba	9/10	7/10 (medium)	$0.5 in $1.5 out	131K	Improved robustness - less suitable for adversarial testing
Qwen 3-VL	Alibaba	8/10	6/10 (medium)	-	131K	Multimodal model for vision tasks

Guardrails & Classifiers

Critical distinction: Content safety classifiers (LlamaGuard, ShieldGemma) do NOT detect prompt injection. For the Safeguards Challenge, prioritize dedicated PI detectors or solutions that handle both.

Name	Provider	Type	Source	Size	Accuracy	Gen. Gap	Notes
Meta Prompt Guard 2	Meta	prompt injection	Open	86M	78%	-	★ RECOMMENDED: True PI detector, not just content safety Only detects prompt injection, not content safety
Qwen3Guard-8B Previously tested	Alibaba	both	Open	8B	85.3%	51.5%	★ RECOMMENDED: High bar for known attacks, punishes lazy red teams Huge generalization gap - poor on novel attacks
Granite Guardian 3.3 8B Previously tested	IBM	both	Open	8B	78%	6.5%	★ RECOMMENDED: Best generalization - the real test for novel attacks Lower peak accuracy than Qwen3Guard
WildGuard 7B	AI2 / AllenAI	both	Open	7B	-	-	★ RECOMMENDED: Calibration baseline for usability tradeoffs Research model, may need adaptation for production
GPT-OSS Safeguard 20B Previously tested	ROOST Initiative	both	Open	20B	82%	-	★ RECOMMENDED: Custom policy testing capability Larger model = slower inference
LlamaFirewall Previously tested	Meta	both	Open	-	-	-	Meta's comprehensive firewall solution
Cygnal 8B Previously tested	Gray Swan AI	both	Closed	8B	88%	-	Internal Gray Swan model - use as baseline
Llama Guard 4 12B Previously tested	Meta	content safety	Open	12B	85%	-	Content safety only - include only if multimodal track needed ⚠️ Does NOT detect prompt injection!
ShieldGemma 9B Previously tested	Google	content safety	Open	9B	83%	-	Content safety only - not for PI testing ⚠️ Does NOT detect prompt injection!
Azure Prompt Shields Previously tested	Microsoft	prompt injection	Closed	-	85%	-	★ RECOMMENDED: The specialist for prompt injection
AWS Bedrock Guardrails Previously tested	Amazon	both	Closed	-	78%	-	★ RECOMMENDED: Industry standard, broad enterprise use Performance lags behind specialized solutions
Google Model Armor Previously tested	Google Cloud	both	Closed	-	-	-	★ RECOMMENDED: Infrastructure defense layer Tied to Google Cloud infrastructure
Lakera Guard	Lakera	prompt injection	Closed	-	88%	-	★ RECOMMENDED: Zero-day catcher with live threat intel
OpenAI Moderation API	OpenAI	content safety	Closed	-	-	-	Content safety only - not suitable for PI testing ⚠️ Does NOT include prompt injection detection!
ProtectAI Guardian Previously tested	ProtectAI	both	Closed	-	-	-	Enterprise ML security platform
Amazon Nova Micro (Classifier)	Amazon	output classifier	Closed	-	-	-	★ RECOMMENDED: Amazon's submission for the challenge
GPT-4.1 (as guardrail) Previously tested	OpenAI	both	Closed	-	86%	-	Baseline for LLM-as-classifier approach Higher cost than specialized models
GPT-4.1-mini (as guardrail) Previously tested	OpenAI	both	Closed	-	80%	-	Lower cost LLM classifier option
o4-mini (as guardrail) Previously tested	OpenAI	both	Closed	-	-	-	Reasoning model for complex safety decisions