Selection Criteria Summary
Base Model Requirements
- P0: High capabilities on multi-agent tool-calling
- P0: Low built-in robustness (for measuring guardrail uplift)
- P1: High speed (low latency for agent interactions)
- P2: Low cost (for high-volume testing)
Recommended Candidates
- R:5 Gemini 2.5 Pro (Google DeepMind)
- R:5 Gemini 2.5 Flash (Google DeepMind)
- R:4 Qwen 2.5 72B (Alibaba)
- R:4 GLM-4 (Zhipu AI)
- R:5 GLM-4 Plus (Zhipu AI)
AI Models
24 models| Model | Provider | Agentic | Robustness | Cost ($/M) | Context | Notes |
|---|---|---|---|---|---|---|
Claude 3.5 Sonnet | Anthropic | 9/10 | 9/10 (high) | $3 in $15 out | 200K | Best-in-class for coding and agentic tasks |
Claude Opus 4.5 | Anthropic | 10/10 | 10/10 (high) | $15 in $75 out | 200K | Flagship model, highest capabilities |
Command R+ 104B | Cohere | 8/10 | 6/10 (medium) | $2.5 in $10 out | 128K | Enterprise-focused, good for RAG |
DeepSeek-R1 | DeepSeek | 9/10 | 6/10 (medium) | $0.55 in $2.19 out | 128K | Reasoning model, competitive with o1 |
DeepSeek-V3 671B (MoE) | DeepSeek | 8/10 | 5/10 (medium) | $0.27 in $1.1 out | 128K | Very cost-effective, open weights MoE architecture |
Gemini 2.5 Flash | Google DeepMind | 8/10 | 5/10 (medium) | $0.075 in $0.3 out | 1000K | Fast and cheap, similar robustness profile to 2.5 Pro |
Gemini 2.5 Pro | Google DeepMind | 9/10 | 5/10 (medium) | $1.25 in $5 out | 1000K | ★ CANDIDATE: High capability, lower robustness, good speed/cost |
Gemini 3 Pro | Google DeepMind | 9/10 | 8/10 (high) | $2 in $8 out | 1000K | Latest generation, improved robustness |
GLM-4 | Zhipu AI | 7/10 | 4/10 (low) | $0.14 in $0.14 out | 128K | Chinese model with lower robustness profile |
GLM-4 Plus | Zhipu AI | 8/10 | 5/10 (low) | - | 128K | Latest GLM version to evaluate |
GPT-4.1 | OpenAI | 9/10 | 8/10 (high) | $2 in $8 out | 128K | Latest GPT-4 series |
GPT-4o | OpenAI | 9/10 | 8/10 (high) | $2.5 in $10 out | 128K | Flagship multimodal model |
IBM Granite 3.1 8B 8B | IBM | 6/10 | 6/10 (medium) | - | 128K | Enterprise-focused, Apache 2.0 license |
Grok-3 | xAI | 9/10 | 6/10 (medium) | $3 in $15 out | 131K | Strong agentic capabilities |
Kimi K2 | Moonshot AI | 8/10 | 7/10 (high) | - | 1000K | Long context specialist, but higher robustness |
Llama 3.1 405B 405B | Meta | 8/10 | 7/10 (medium) | - | 128K | Largest open model, previously tested 8B variant |
Llama 4 Scout 17B active (MoE) | Meta | 8/10 | 7/10 (medium) | - | 10000K | MoE architecture, extremely long context |
Mistral Large 2 123B | Mistral AI | 8/10 | 6/10 (medium) | $2 in $6 out | 128K | Previously tested Mistral-7B and Small variants |
Amazon Nova Pro | Amazon | 7/10 | 7/10 (medium) | $0.8 in $3.2 out | 300K | Available through Bedrock |
o1 | OpenAI | - | 9/10 (high) | $15 in $60 out | 200K | Reasoning-focused model |
o3-mini | OpenAI | - | 8/10 (high) | $1.1 in $4.4 out | 200K | Efficient reasoning model |
Qwen 2.5 72B 72B | Alibaba | 8/10 | 4/10 (low) | $0.4 in $1.2 out | 131K | Open weights available. Recommended by xiaohan for lower robustness. |
Qwen 3 72B 72B | Alibaba | 9/10 | 7/10 (medium) | $0.5 in $1.5 out | 131K | Improved robustness - less suitable for adversarial testing |
Qwen 3-VL | Alibaba | 8/10 | 6/10 (medium) | - | 131K | Multimodal model for vision tasks |
Guardrails & Classifiers
Critical distinction: Content safety classifiers (LlamaGuard, ShieldGemma) do NOT detect prompt injection. For the Safeguards Challenge, prioritize dedicated PI detectors or solutions that handle both.
| Name | Provider | Type | Source | Size | Accuracy | Gen. Gap | Notes |
|---|---|---|---|---|---|---|---|
Meta Prompt Guard 2 | Meta | prompt injection | Open | 86M | 78% | - | ★ RECOMMENDED: True PI detector, not just content safety Only detects prompt injection, not content safety |
Qwen3Guard-8B Previously tested | Alibaba | both | Open | 8B | 85.3% | 51.5% | ★ RECOMMENDED: High bar for known attacks, punishes lazy red teams Huge generalization gap - poor on novel attacks |
Granite Guardian 3.3 8B Previously tested | IBM | both | Open | 8B | 78% | 6.5% | ★ RECOMMENDED: Best generalization - the real test for novel attacks Lower peak accuracy than Qwen3Guard |
WildGuard 7B | AI2 / AllenAI | both | Open | 7B | - | - | ★ RECOMMENDED: Calibration baseline for usability tradeoffs Research model, may need adaptation for production |
GPT-OSS Safeguard 20B Previously tested | ROOST Initiative | both | Open | 20B | 82% | - | ★ RECOMMENDED: Custom policy testing capability Larger model = slower inference |
LlamaFirewall Previously tested | Meta | both | Open | - | - | - | Meta's comprehensive firewall solution |
Cygnal 8B Previously tested | Gray Swan AI | both | Closed | 8B | 88% | - | Internal Gray Swan model - use as baseline |
Llama Guard 4 12B Previously tested | Meta | content safety | Open | 12B | 85% | - | Content safety only - include only if multimodal track needed ⚠️ Does NOT detect prompt injection! |
ShieldGemma 9B Previously tested | content safety | Open | 9B | 83% | - | Content safety only - not for PI testing ⚠️ Does NOT detect prompt injection! | |
Azure Prompt Shields Previously tested | Microsoft | prompt injection | Closed | - | 85% | - | ★ RECOMMENDED: The specialist for prompt injection |
AWS Bedrock Guardrails Previously tested | Amazon | both | Closed | - | 78% | - | ★ RECOMMENDED: Industry standard, broad enterprise use Performance lags behind specialized solutions |
Google Model Armor Previously tested | Google Cloud | both | Closed | - | - | - | ★ RECOMMENDED: Infrastructure defense layer Tied to Google Cloud infrastructure |
Lakera Guard | Lakera | prompt injection | Closed | - | 88% | - | ★ RECOMMENDED: Zero-day catcher with live threat intel |
OpenAI Moderation API | OpenAI | content safety | Closed | - | - | - | Content safety only - not suitable for PI testing ⚠️ Does NOT include prompt injection detection! |
ProtectAI Guardian Previously tested | ProtectAI | both | Closed | - | - | - | Enterprise ML security platform |
Amazon Nova Micro (Classifier) | Amazon | output classifier | Closed | - | - | - | ★ RECOMMENDED: Amazon's submission for the challenge |
GPT-4.1 (as guardrail) Previously tested | OpenAI | both | Closed | - | 86% | - | Baseline for LLM-as-classifier approach Higher cost than specialized models |
GPT-4.1-mini (as guardrail) Previously tested | OpenAI | both | Closed | - | 80% | - | Lower cost LLM classifier option |
o4-mini (as guardrail) Previously tested | OpenAI | both | Closed | - | - | - | Reasoning model for complex safety decisions |