Research

This page is the single source of truth for “why we made these decisions and what data backs them up.” All experiments run against the XBOW benchmark (104 Docker CTF challenges). For benchmark scores and flag tables, see Benchmark.

Topics

Triage Dataset

How benchmark runs and verified findings are converted into labeled JSONL for triage-model training.

Feature Extractor

The 45 handcrafted features exposed by extractFeatures() and how they fit into the hybrid triage direction.

Agent Techniques

What shipped in the agent loop: planning, reflection checkpoints, context compaction, dynamic playbooks, progress handoff, and EGATS.

FP Reduction Moat

The full false-positive reduction stack, why the layers are ordered the way they are, and how the dataset and feature foundation supports the shipped runtime layers.

Finding Triage ML

Implementation notes for reachability, consensus verify, PoV generation, memories, adversarial debate, and multi-modal agreement with foxguard.

Shell-First Rationale

Why bash beats structured tools for pentesting. Includes A/B test data on prompt length, reasoning effort, sub-agent spawning, tool routing, and multi-checkpoint budgets.

Model Comparison

Head-to-head testing of gpt-5.4, Kimi K2.5, Qwen3 Coder, DeepSeek, GLM, and free OpenRouter models. Cost, speed, and flag extraction across multiple XBOW challenges.

XBOW Analysis

Gap analysis against Shannon (96% vs. our best-of-N aggregate of 87.5% black-box and 92.3% white-box), competitor verification, what moves the score, white-box vs. black-box results, critical bugs found, and future benchmark targets (AutoPenBench, HarmBench, JailbreakBench).

Competitive Landscape

Full competitor breakdown (Shannon 96%, KinoSec 92%, Cyber-AutoAgent 84%, deadend-cli 78%, MAPTA 77%), 10 ranked improvement techniques with expected impact, key research papers, and what we’ve shipped vs what’s next.
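
The XBOW Analysis entry above quotes "best-of-N aggregate" scores. As a rough illustration of how such a figure can be computed, the sketch below assumes that a challenge counts as solved if any of the N runs captured its flag, and that the denominator is the full 104-challenge benchmark. The function name, the challenge IDs, and the two example runs are hypothetical and do not reflect real run data.

```python
TOTAL_CHALLENGES = 104  # benchmark size stated on this page


def best_of_n_rate(runs, total=TOTAL_CHALLENGES):
    """Return the fraction of challenges solved in at least one run.

    `runs` is a list of sets, each holding the IDs of the challenges
    whose flag was captured in that run (assumed representation).
    """
    solved = set().union(*runs)  # a challenge counts once, however many runs solved it
    return len(solved) / total


# Hypothetical challenge IDs and results, for illustration only.
run_a = {f"challenge-{i:03d}" for i in range(1, 81)}   # challenges 1..80
run_b = {f"challenge-{i:03d}" for i in range(70, 92)}  # challenges 70..91

rate = best_of_n_rate([run_a, run_b])
print(f"{rate:.1%}")  # the union covers 91 challenges: 91 / 104 = 87.5%
```

Under this definition, an aggregate can exceed every individual run's score, which is worth keeping in mind when comparing it with single-run numbers.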

The big picture

pwnkit is not a template runner or static analyzer. It’s an autonomous agent that thinks like a pentester. Pentesters use terminals, not GUIs with dropdowns.

The scanner should feel like giving a skilled pentester SSH access. One command. Full autonomy. Real findings with proof.

The conclusion: the framework should get out of the model’s way. Give it 3 tools and a 25-line prompt, then let the model’s training do the work. The ceiling is the model, not the framework.