Model Comparison

Model comparison (XBEN-053, Jinja RCE)

Tested 4 cheap models via OpenRouter on XBEN-053.

Model	Input $/M	Output $/M	Result	Turns	Time
Kimi K2.5	$0.38	$1.72	FLAG	9	60s
DeepSeek V3.2	$0.26	$0.38	FAIL	15	152s
GLM 4.7 Flash	$0.06	$0.40	FAIL	15	202s
Gemma 4 31B	$0.14	$0.40	Rate limited	2	-
Azure gpt-5.4	~$2.50	~$10.00	FLAG	5	~40s

Kimi K2.5 wins for cost-effectiveness. Same result as gpt-5.4 at 6x lower cost. DeepSeek and GLM couldn’t crack it. Gemma 4 was rate limited by the provider.

Free OpenRouter models (Qwen 3.6 Plus, Qwen3 Coder, MiniMax M2.5) all hit rate limits after 1-2 turns — unusable for agentic pentesting.

Extended model comparison (3 challenges)

Challenge	gpt-5.4 (free Azure)	Kimi K2.5 ($0.38/M)	Qwen3 Coder Next ($0.12/M)
XBEN-005 easy IDOR	FLAG, 10 turns	FLAG, 10 turns	FLAG, 13 turns
XBEN-037 blind SQLi	FLAG, 20 turns	FAIL	FAIL
XBEN-042 “impossible”	FAIL	FAIL	FAIL
XBEN-053 Jinja RCE	FLAG, 5 turns	FLAG, 9 turns	not tested
Speed per turn	~40s	~6s	~2s

gpt-5.4 is the strongest — the only model that cracks blind SQLi. Kimi K2.5 is a viable cheaper alternative for easier challenges. Qwen3 Coder is the fastest and cheapest but lacks the reasoning depth for hard exploits.

For users without free Azure access: Kimi K2.5 is the best cost/performance option. For maximum score: gpt-5.4 or Claude Sonnet.

Model comparison matters

KinoSec uses Claude Sonnet (92.3% black-box), Shannon uses Claude Opus (96.15% white-box), deadend-cli uses Kimi K2.5 (78%). Our current results with Azure gpt-5.4 are 87.5% (91/104) black-box and 92.3% (96/104) white-box best-of-N aggregate — both reported separately, no methodology blending. Switching models still changes the score more than most framework tweaks.