Selection Criteria Summary
Base Model Requirements
- P0: High capabilities on multi-agent tool-calling
- P0: Low built-in robustness (for measuring guardrail uplift)
- P1: High speed (low latency for agent interactions)
- P2: Low cost (for high-volume testing)
Recommended Candidates
- R:5 Gemini 2.5 Pro (Google DeepMind)
- R:5 Gemini 2.5 Flash (Google DeepMind)
- R:3 Qwen 2.5 72B (Alibaba)
- R:4 GLM-4 (Zhipu AI)
- R:5 GLM-4 Plus (Zhipu AI)
AI Models
37 models| Model | Provider | Agentic | Robustness | Cost ($/M) | Context | Notes |
|---|---|---|---|---|---|---|
Claude 3.5 Sonnet | Anthropic | 9/10 | 8/10 (high) | $3 in $15 out | 200K | MEASURED 83/100 robustness. Best-in-class for coding and agentic tasks. |
Claude Opus 4.5 | Anthropic | 10/10 | 10/10 (high) | $5 in $25 out | 200K | CONTROL GROUP: MEASURED lowest ASR (0.5%). Most robust frontier model for high-robustness baseline. |
Command R+ 104B | Cohere | 8/10 | 5/10 (medium) | $2.5 in $10 out | 128K | MEASURED: 3.8% ASR. Enterprise-focused, good for RAG. |
DeepSeek-R1 671B (37B active) | DeepSeek | 5/10 | 1/10 (low) | $0.55 in $2.19 out | 128K | ★ TOP CANDIDATE: Lowest robustness among frontier models. CAVEAT: No native tool-calling (score 5). Use R1-0528 variant. |
DeepSeek-V3 671B (37B active) | DeepSeek | 8/10 | 4/10 (medium) | $0.27 in $1.1 out | 128K | Very cost-effective, open weights MoE architecture |
DeepSeek-V3.1 671B (37B active) | DeepSeek | 8/10 | 4/10 (low) | $0.2 in $0.8 out | 128K | ★ TOP CANDIDATE: MEASURED 5.3% ASR. Strong agentic (8/10) + low robustness. Superseded by V3.2 but still available. |
Gemini 2.5 Flash | Google DeepMind | 8/10 | 5/10 (medium) | $0.075 in $0.3 out | 1000K | Fast and cheap, similar robustness profile to 2.5 Pro |
Gemini 2.5 Pro | Google DeepMind | 9/10 | 5/10 (low) | $1.25 in $10 out | 1049K | ★ TOP CANDIDATE: MEASURED highest vulnerability (8.5% ASR). High agentic (9/10). Still available despite Gemini 3.0. |
Gemini 3 Pro | Google DeepMind | 9/10 | 8/10 (high) | $2 in $8 out | 1000K | Latest generation, improved robustness |
GLM-4 | Zhipu AI | 7/10 | 4/10 (low) | $0.14 in $0.14 out | 128K | Chinese model with lower robustness profile |
GLM-4 Plus | Zhipu AI | 8/10 | 5/10 (low) | - | 128K | Latest GLM version to evaluate |
GLM-4.7 355B (32B active) | Z.ai (Zhipu AI) | 9/10 | 3/10 (low) | $0.4 in $1.5 out | 200K | ★ TOP CANDIDATE: Open-source SOTA tool-calling (87.4% tau-bench) + low robustness. MIT license. 85% Promptfoo pass rate (88 failed probes). |
GPT-4.1 | OpenAI | 9/10 | 8/10 (high) | $2 in $8 out | 128K | Latest GPT-4 series |
GPT-4o | OpenAI | 9/10 | 8/10 (high) | $2.5 in $10 out | 128K | Flagship multimodal model |
GPT-5 | OpenAI | 9/10 | 9/10 (high) | $2 in $10 out | 400K | CONTROL GROUP: MEASURED 2% ASR. High robustness baseline. |
GPT-5.1 | OpenAI | 9/10 | 9/10 (high) | $1.75 in $12 out | 400K | CONTROL GROUP: MEASURED 2.5% ASR. |
IBM Granite 3.1 8B 8B | IBM | 6/10 | 6/10 (medium) | - | 128K | Enterprise-focused, Apache 2.0 license |
Grok 2 | xAI | 8/10 | 5/10 (medium) | $2 in $10 out | 131K | MEASURED: 4.4% ASR. |
Grok 4 | xAI | 9/10 | 8/10 (high) | $3 in $15 out | 131K | CONTROL GROUP: MEASURED 3% ASR. Strong agentic + good robustness. |
Kimi K2 1T (32B active) | Moonshot AI | 9/10 | 4/10 (low) | $0.6 in $2.5 out | 131K | ★ TOP CANDIDATE: MEASURED 4.8% ASR. Excellent agentic (9/10, 200-300 tools). Use Instruct variant for max vulnerability. |
Kimi K2 Thinking 1T (32B active) | Moonshot AI | 10/10 | 6/10 (medium) | $0.75 in $3 out | 256K | Reasoning variant with better safety alignment |
Llama 3.1 405B 405B | Meta | 8/10 | 4/10 (low) | - | 128K | ★ CANDIDATE: MEASURED 5.9% ASR. Good agentic capability + moderate vulnerability. |
Llama 3.3 70B Instruct 70B | Meta | 8/10 | 3/10 (low) | - | 128K | ★ TOP CANDIDATE: MEASURED 28/100 robustness, 6.7% ASR. Good for differential testing. |
Llama 4 Maverick 400B (17B active) | Meta | 8/10 | 6/10 (medium) | $0.31 in $0.85 out | 1000K | 1400+ LMArena Elo, very fast inference |
Llama 4 Scout 17B active (MoE) | Meta | 8/10 | 7/10 (medium) | - | 10000K | MoE architecture, extremely long context |
MiniMax M2 230B (10B active) | MiniMax | 9/10 | 5/10 (medium) | $0.3 in $1.2 out | 197K | MIT license, 92% cheaper than Claude. Strong coding/agentic. |
MiniMax M2 (Thinking) 230B (10B active) | MiniMax | 9/10 | 10/10 (high) | $0.3 in $1.2 out | 204K | CONTROL GROUP: Verified 100% resistance. 8% cost of Claude. Open weights. Ideal high-robustness baseline. |
Mistral Large 2 123B | Mistral AI | 8/10 | 6/10 (medium) | $2 in $6 out | 128K | Previously tested Mistral-7B and Small variants |
Amazon Nova Pro | Amazon | 7/10 | 7/10 (medium) | $0.8 in $3.2 out | 300K | Available through Bedrock |
o1 | OpenAI | - | 9/10 (high) | $15 in $60 out | 200K | CONTROL GROUP: MEASURED 2.7% ASR. Reasoning-focused model. |
o3 | OpenAI | - | 9/10 (high) | $10 in $40 out | 200K | CONTROL GROUP: MEASURED 2.9% ASR. Advanced reasoning model. |
o3-mini | OpenAI | - | 5/10 (medium) | $1.1 in $4.4 out | 200K | MEASURED: 4.3% ASR. Efficient reasoning model. |
Qwen 2.5 72B 72B | Alibaba | 8/10 | 3/10 (low) | $0.4 in $1.2 out | 131K | ★ CANDIDATE: Open weights. Qwen 2.5 family has low robustness. |
Qwen 2.5 7B Instruct 7B | Alibaba | 6/10 | 2/10 (low) | $0.1 in $0.3 out | 131K | ★ TOP CANDIDATE: MEASURED lowest robustness (18/100). Limited agentic capability. |
Qwen 3 72B 72B | Alibaba | 9/10 | 7/10 (medium) | $0.5 in $1.5 out | 131K | Improved robustness - less suitable for adversarial testing |
Qwen 3-VL | Alibaba | 8/10 | 6/10 (medium) | - | 131K | Multimodal model for vision tasks |
QwQ-32B 32B | Alibaba | 8/10 | 4/10 (medium) | $0.075 in $0.15 out | 131K | ★ TOP CANDIDATE: Strong tool-calling (BFCL 66.4), documented suffix attack vulnerability. Superseded by Qwen3 but good for testing. |
Guardrails & Classifiers
Critical distinction: Content safety classifiers (LlamaGuard, ShieldGemma) do NOT detect prompt injection. For the Safeguards Challenge, prioritize dedicated PI detectors or solutions that handle both.
| Name | Provider | Type | Source | Size | Accuracy | Gen. Gap | Notes |
|---|---|---|---|---|---|---|---|
Meta Prompt Guard 2 | Meta | prompt injection | Open | 86M (also 22M variant) | 82% | - | ★ RECOMMENDED with caveats. Strong PI detector but has documented encoding bypasses. Use as ONE layer in defense-in-depth. Only detects prompt injection, NOT content safety |
Qwen3Guard-8B Previously tested | Alibaba | both | Open | 8B | 85.3% | 57.2% | ⚠️ NOT RECOMMENDED: Catastrophic generalization gap makes it unreliable against novel attacks. Consider Granite Guardian instead. ⚠️ CRITICAL: 57.2-point generalization gap - LARGEST among all models |
Granite Guardian 3.3 8B Previously tested | IBM | both | Open | 8B | 81% | 6.5% | ★ TOP RECOMMENDED: Best generalization for novel attacks. #1 on REVEAL, #3 on LLM-AggreFact. English-only training |
WildGuard 7B | AI2 / AllenAI | both | Open | 7B (Mistral-7B-v0.3 base) | 82.8% | - | ★ RECOMMENDED for single-turn. 2.4% jailbreak success verified. Combine with multi-turn defenses for production. ⚠️ Vulnerable to multi-turn attacks: >90% ASR with X-Teaming adaptive attacks |
GPT-OSS Safeguard 20B Previously tested | ROOST Initiative | both | Open | 20B | 82% | - | ★ RECOMMENDED: Custom policy testing capability Larger model = slower inference |
LlamaFirewall Previously tested | Meta | both | Open | - | 50% | - | Meta's comprehensive firewall - has documented bypasses Only ~50% bypass rate in testing |
Cygnal 8B Previously tested | Gray Swan AI | both | Closed | 8B | 88% | - | Internal Gray Swan model - use as baseline |
Llama Guard 4 12B Previously tested | Meta | content safety | Open | 12B | 85% | 40% | Content safety only - too permissive for adversarial testing. Pair with Prompt Guard 2 for PI. ⚠️ Does NOT detect prompt injection! |
ShieldGemma 9B Previously tested | content safety | Open | 2B/4B/9B/27B variants | 54.7% | - | Content safety only - not for PI testing. Smaller models may be better. ⚠️ Does NOT detect prompt injection! | |
Azure Prompt Shields Previously tested | Microsoft | prompt injection | Closed | - | 89% | - | ★ RECOMMENDED for Azure ecosystem. Good accuracy but has documented character injection bypasses. Layer with other defenses. ⚠️ Character injection attacks reduce detection from 89% to 7% |
AWS Bedrock Guardrails Previously tested | Amazon | both | Closed | - | 78% | - | ★ RECOMMENDED: Industry standard, broad enterprise use Performance lags behind specialized solutions |
Google Model Armor Previously tested | Google Cloud | both | Closed | - | - | - | ★ RECOMMENDED: Infrastructure defense layer Tied to Google Cloud infrastructure |
Lakera Guard | Lakera (Check Point) | both | Closed | - | 92.5% | - | ★ TOP RECOMMENDED: Strongest real-time threat intel. Best for evolving attacks. Consider layering with static classifier. <90% on NotInject benchmark (vs LlamaGuard3 at 99.71%) |
OpenAI Moderation API | OpenAI | content safety | Closed | - | - | - | Content safety only - not suitable for PI testing ⚠️ Does NOT include prompt injection detection! |
ProtectAI Guardian Previously tested | ProtectAI | both | Closed | - | - | - | Enterprise ML security platform |
NeMo Guardrails | NVIDIA | both | Open | - | 27.46% | - | ⚠️ NOT RECOMMENDED as sole protection. Documented critical bypasses. Use only as ONE layer with input normalization. ⚠️ CRITICAL: 72.54% jailbreak bypass rate (ASR) |
Amazon Nova Micro (Classifier) | Amazon | output classifier | Closed | - | - | - | ★ RECOMMENDED: Amazon's submission for the challenge |
GPT-4.1 (as guardrail) Previously tested | OpenAI | both | Closed | - | 86% | - | Baseline for LLM-as-classifier approach Higher cost than specialized models |
GPT-4.1-mini (as guardrail) Previously tested | OpenAI | both | Closed | - | 80% | - | Lower cost LLM classifier option |
o4-mini (as guardrail) Previously tested | OpenAI | both | Closed | - | - | - | Reasoning model for complex safety decisions |