Research
This page is the single source of truth for “why we made these decisions and what data backs them up.” All experiments run against the XBOW benchmark (104 Docker CTF challenges). For benchmark scores and flag tables, see Benchmark.
Topics
Section titled “Topics”How benchmark runs and verified findings are converted into labeled JSONL for triage-model training.
The 45 handcrafted features exposed by extractFeatures() and how they fit
into the hybrid triage direction.
What shipped in the agent loop: planning, reflection checkpoints, context compaction, dynamic playbooks, progress handoff, and EGATS.
The full false-positive reduction stack, why the layers are ordered the way they are, and how the dataset / feature foundation supports the shipped runtime layers.
Implementation notes for reachability, consensus verify, PoV generation, memories, adversarial debate, and multi-modal agreement with foxguard.
Why bash beats structured tools for pentesting. Includes A/B test data on prompt length, reasoning effort, sub-agent spawning, tool routing, and multi-checkpoint budgets.
Head-to-head testing of gpt-5.4, Kimi K2.5, Qwen3 Coder, DeepSeek, GLM, and free OpenRouter models. Cost, speed, and flag extraction across multiple XBOW challenges.
Shannon gap analysis (96% vs our 87.5% black-box / 92.3% white-box best-of-N aggregate), competitor verification, what moves the score, white-box vs black-box results, critical bugs found, and future benchmark targets (AutoPenBench, HarmBench, JailbreakBench).
Full competitor breakdown (Shannon 96%, KinoSec 92%, Cyber-AutoAgent 84%, deadend-cli 78%, MAPTA 77%), 10 ranked improvement techniques with expected impact, key research papers, and what we’ve shipped vs what’s next.
The big picture
Section titled “The big picture”pwnkit is not a template runner or static analyzer. It’s an autonomous agent that thinks like a pentester. Pentesters use terminals, not GUIs with dropdowns.
The scanner should feel like giving a skilled pentester SSH access. One command. Full autonomy. Real findings with proof.
The conclusion: the framework should get out of the model’s way. 3 tools, a 25-line prompt, and let the model’s training do the work. The ceiling is the model, not the framework.