Competitive Landscape

Synthesis of competitive intelligence and published research on autonomous pentesting agents, benchmarked against the XBOW validation suite (104 Docker CTF challenges). Data current as of April 2026.

pwnkit status (April 2026): 87.5% black-box (91/104) and 92.3% white-box best-of-N aggregate (96/104), reported separately. The aggregate beats MAPTA (76.9%), deadend-cli (77.6%), Cyber-AutoAgent (84.6%), XBOW’s own agent (85%), and BoxPwnr’s best single-model score of 81.7% (GLM-5). The black-box number alone also beats every one of those, though it trails KinoSec (92.3% black-box). pwnkit is still behind BoxPwnr’s 97.1% best-of-N aggregate and Shannon’s 96.15% white-box score.

For pwnkit’s own benchmark scores, see the Benchmark page. For the Shannon-specific gap analysis, see XBOW Analysis.

| Agent | Score | Model | Approach | Cost | Key differentiator |
| --- | --- | --- | --- | --- | --- |
| BoxPwnr | 97.1% (101/104) | Claude, GPT-5, others | Modular shell-first | Unknown | Context compaction + loop detection |
| Shannon | 96.15% (100/104) | Claude Opus/Sonnet/Haiku 3-tier | White-box, multi-agent | ~$50/scan | Source-to-sink taint analysis |
| KinoSec | 92.3% (96/104) | Claude Sonnet 4.6 | Black-box only | Unknown (proprietary) | 50-turn hard cap, pure HTTP |
| Cyber-AutoAgent | 84.62% (88/104) | Not disclosed | Single meta-agent | Unknown | Self-rewriting prompts |
| deadend-cli | 77.55% (~76/98) | Kimi K2.5 | Single-agent CLI | $122/104 challenges | ADaPT recursive decomposition |
| MAPTA | 76.9% (80/104) | GPT-5 | 3-role multi-agent | $21.38 total | Evidence-gated branching |

BoxPwnr. Current XBOW leader, by Francisco Oca (0ca). Modular framework with four components: Orchestrator (run management), Solver (LLM interaction), Executor (Docker sandbox), and Platform (challenge interface). Six solver strategies: single_loop_xmltag (default, shell-first), single_loop, single_loop_compactation (context compaction at 60% window), claude_code delegation, codex delegation, and hacksynth (multi-agent). The default strategy uses XML-tag shell-first execution: the LLM emits bash commands inside <COMMAND> tags, which run in a full Kali Linux Docker container with all security tools pre-installed.
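The XML-tag convention can be illustrated with a minimal sketch, assuming the model's reply arrives as a plain string and each command sits in one `<COMMAND>...</COMMAND>` pair. `extractCommands` is a hypothetical helper written for this page, not BoxPwnr's actual parser.

```typescript
// Hypothetical helper: pull shell commands out of an LLM reply that wraps
// them in <COMMAND>...</COMMAND> tags. Not BoxPwnr's real parser.
function extractCommands(reply: string): string[] {
  const pattern = /<COMMAND>([\s\S]*?)<\/COMMAND>/g;
  const commands: string[] = [];
  let match: RegExpExecArray | null = pattern.exec(reply);
  while (match !== null) {
    const command = match[1].trim();
    if (command.length > 0) {
      commands.push(command); // skip empty tag pairs
    }
    match = pattern.exec(reply);
  }
  return commands;
}

const sampleReply =
  "Probing the login form first.\n<COMMAND>\ncurl -s http://target/login\n</COMMAND>";
console.log(extractCommands(sampleReply));
```

Anything outside the tags is treated as free-form reasoning and ignored, which is what lets the model think in prose while the executor only sees shell input.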

Key techniques: context compaction triggers at 60% window fill (summarize and continue), loop/oscillation detection catches the agent repeating failed commands, and progress handoff between attempts preserves findings across retries. Cost tracking is built into the orchestrator. Supports Claude, GPT-5, DeepSeek, Grok-4, Gemini 3, and Kimi K2.5.

Beyond XBOW, BoxPwnr has solved HTB 250/523 (47.8%), PortSwigger 163/270 (60.4%), Cybench 40/40 (100%), and picoCTF 373/509 (73.3%). The breadth of benchmark coverage across five platforms is unmatched. Notably, the same author created the patched XBOW fork that pwnkit uses for its benchmark environment.

Shannon. 13-agent Temporal workflow system. Runs 5 parallel vulnerability+exploit agent pairs (injection, XSS, auth, authz, SSRF), each with 200-400 line domain-specific prompts. The white-box pipeline starts with source-to-sink taint analysis: 6 pre-recon sub-agents scan architecture, entry points, security patterns, XSS sinks, SSRF sources, and data security before any exploit agent fires. Structured exploitation queues route findings between agents. The 3-tier model strategy uses Opus for planning, Sonnet for exploitation, and Haiku for classification. At ~$50 per scan and 10,000 max turns, Shannon buys accuracy with compute.

KinoSec. Proprietary black-box agent. Uses Claude Sonnet 4.6 with a hard 50-turn budget. No source code access, no Playwright, pure HTTP. The score is remarkable given the constraints: it implies near-perfect exploitation efficiency on every challenge it attempts. Because the agent is closed-source, architecture details are limited, but the 50-turn cap suggests extremely focused prompt engineering and tool selection.

Cyber-AutoAgent. The biggest leap on the leaderboard: 46% to 84.62% through architecture changes alone. Single meta-agent with self-rewriting prompts, meaning the agent modifies its own system prompt based on challenge feedback. Uses a tool router hook to dynamically select tools and mem0 vector memory to persist knowledge across turns. No multi-agent coordination overhead. The self-rewriting prompt mechanism is the standout innovation: the agent literally edits its instructions mid-run.

deadend-cli. Single-agent CLI using ADaPT (Adaptive Decomposition and Planning for Tasks) recursive decomposition. Breaks complex challenges into sub-tasks, solves them sequentially, and backtracks on failure. Custom Playwright integration with RFC-bypass for browser-based challenges. Runs on Kimi K2.5 at $122 for all 104 challenges ($1.17/challenge). Notable for solving blind SQLi challenges that trip up most agents. Shows that a single agent is enough to reach roughly 78%.

MAPTA. Academic 3-role system: coordinator, sandbox executor, and validator. The coordinator plans attack strategy, the sandbox runs exploits in isolation, and the validator checks whether output constitutes a real flag. Evidence-gated branching means the system only pursues exploitation paths backed by concrete evidence from prior steps, with no speculative tool calls. Runs on GPT-5 for $21.38 total across all 104 challenges ($0.21/challenge). Published as a research paper with full methodology.

Reachability gate matches Endor Labs’ “Code API” moat (open-source)

Endor Labs’ disclosed ~95% false-positive elimination depends on a proprietary “Code API” reachability signal — they check whether a flagged sink is actually callable from an application entry point before they spend LLM tokens on it. pwnkit implements the same idea as a zero-dependency grep/pattern-based first pass in packages/core/src/triage/reachability.ts. This is the only open-source reachability gate for LLM pentest findings we are aware of.

foxguard × pwnkit cross-validation (unique)

Endor Labs’ triage accuracy comes from forcing neural + rules to agree. pwnkit has the open-source version: for every finding, run foxguard (Rust pattern scanner) against the same source tree and require agreement. Both fire → strong signal. Foxguard silent → likely false positive. Nobody else in the open-source pentest-agent space runs a second, independent scanner for cross-validation — this is unique to the pwnkit / foxguard / opensoar trinity.

Implementation: packages/core/src/triage/multi-modal.ts.

92.3% XBOW best-of-N aggregate — beats BoxPwnr’s best single-model by ~10.6pp

BoxPwnr’s headline 97.1% is a best-of-N aggregate across ~10 model+solver configurations (527 traces / 104 challenges ≈ 5 attempts each). Their best single model (GLM-5 + single_loop) scores 81.7%. pwnkit’s 92.3% white-box best-of-N aggregate beats that single-model number by ~10.6pp, and pwnkit’s 87.5% black-box number alone still beats it by ~5.8pp — while pwnkit’s aggregate trails BoxPwnr’s full multi-config aggregate.
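The percentage-point gaps quoted above can be reproduced from the solve counts. The sketch below uses only numbers stated on this page; the helper names are illustrative.

```typescript
// Solve counts and the 81.7% single-model figure come from the text above.
const TOTAL_CHALLENGES = 104;

function solveRate(solved: number, total: number): number {
  return (solved / total) * 100;
}

// Difference in percentage points, rounded to one decimal place.
function gapInPoints(a: number, b: number): number {
  return Math.round((a - b) * 10) / 10;
}

const pwnkitWhiteBox = solveRate(96, TOTAL_CHALLENGES);
const pwnkitBlackBox = solveRate(91, TOTAL_CHALLENGES);
const boxpwnrSingleModel = 81.7; // GLM-5 + single_loop, as quoted
const boxpwnrAggregate = solveRate(101, TOTAL_CHALLENGES);

console.log(gapInPoints(pwnkitWhiteBox, boxpwnrSingleModel)); // 10.6
console.log(gapInPoints(pwnkitBlackBox, boxpwnrSingleModel)); // 5.8
console.log(gapInPoints(pwnkitWhiteBox, boxpwnrAggregate)); // -4.8
console.log((527 / TOTAL_CHALLENGES).toFixed(2)); // "5.07" attempts per challenge
```

The negative third value is the distance pwnkit's aggregate still trails BoxPwnr's multi-config aggregate.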

Architecture matters less than tools + memory + search.

Shannon’s 13-agent system scores 96%. Cyber-AutoAgent’s single agent with self-rewriting prompts scores 84.62%. MAPTA’s 3-agent academic system scores 76.9%. deadend-cli’s single agent scores 77.55%.

The common thread across top performers is not agent count. It is:

  1. Tool quality — real security tools (sqlmap, Playwright, curl) beat structured wrappers
  2. Memory — persisting context across turns (mem0, relay, checkpoints) prevents repeated work
  3. Search — exploring multiple exploit paths (tree search, parallel pairs, backtracking) catches what linear execution misses

Shell + external memory can match multi-agent at 7.4% of the cost. Context quality drops past 40% fill — relay or reset is the fix.

Ranked by expected impact and implementation complexity. Estimates based on challenge-level gap analysis against the XBOW benchmark.

| Rank | Technique | Expected impact | Cost multiplier | Status |
| --- | --- | --- | --- | --- |
| 1 | Early-stop + retry at turn 20 | +3-5 flags | 1x | Shipped |
| 2 | Blind SQLi script templates | +2-4 flags | 1x | Shipped |
| 3 | Patched fork for all 104 challenges | +10-15 flags | 1x | Shipped |
| 4 | Context compaction at 60% window | +3-5 flags | 1x | Shipped |
| 5 | Loop/oscillation detection | +2-3 flags | 1x | Shipped |
| 6 | Dynamic playbooks after recon (13 playbooks) | +3-5 flags | 1x | Shipped |
| 7 | EGATS tree search | +5-9 flags | 2-3x | Shipped |
| 8 | Best-of-N strategy racing | +5-8 flags | 3x | Shipped |
| 9 | Progress handoff on retry | +3-5 flags | 1x | Shipped |
| 10 | Reachability gate | FP reduction | 1x | Shipped |
| 11 | foxguard multi-modal agreement | FP reduction | 1x | Shipped |
| 12 | Consensus (self-consistency) verify | FP reduction | Nx verify cost | Shipped |
| 13 | Triage memories (Semgrep-style) | FP reduction | 1x | Shipped |
| 14 | PoV gate | FP reduction | 1x | Shipped |
| 15 | External working memory | +2-3 flags | 1x | Planned |
| 16 | RAG from prior solves | +2-4 flags | 1x | Planned |
| 17 | Adversarial debate verification | FP reduction | 2x verify cost | Shipped |

Early-stop + retry at turn 20. When the agent has not made progress by turn 20, kill the run and restart with a fresh context window. Prevents the agent from burning 80 turns on a dead-end approach. Based on MAPTA’s finding that 40 tool calls is the sweet spot — if you’re halfway through with nothing, reset.
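A minimal sketch of the early-stop decision, assuming each turn is tagged with whether it produced new evidence. The `TurnRecord` shape and `shouldRestart` are hypothetical; the text does not show how pwnkit's loop tracks progress.

```typescript
// Illustrative types; pwnkit's real loop state is not shown in the text.
interface TurnRecord {
  turn: number;
  newEvidence: boolean; // e.g. a new endpoint, credential, or confirmed vuln
}

const EARLY_STOP_TURN = 20;

function shouldRestart(
  history: TurnRecord[],
  checkpoint: number = EARLY_STOP_TURN
): boolean {
  if (history.length < checkpoint) {
    return false; // too early to judge
  }
  // Restart only if none of the first `checkpoint` turns produced evidence.
  return !history.slice(0, checkpoint).some((t) => t.newEvidence);
}

const staleRun: TurnRecord[] = [];
for (let i = 1; i <= 20; i++) {
  staleRun.push({ turn: i, newEvidence: false });
}
console.log(shouldRestart(staleRun)); // true
```

What counts as "progress" is the real design decision here; a boolean per turn is the simplest stand-in.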

Blind SQLi script templates. Pre-built exploitation scripts for time-based and boolean-based blind SQL injection. The agent injects these into the shell rather than trying to write sqlmap commands from scratch. deadend-cli’s blind SQLi solves motivated this — the challenge type has high variance without templates.

Patched XBOW fork for all 104 challenges. Several XBOW challenges had environment bugs (broken Docker configs, missing dependencies, timing issues). The patched fork fixes these so the agent isn’t fighting infrastructure. This is the single highest-impact change: +10-15 flags from challenges that were previously unsolvable due to benchmark bugs.

Loop/oscillation detection. Detects when the agent is repeating the same failed commands or oscillating between two ineffective approaches. When a loop is detected, the agent is forced to change strategy or escalate. Based on BoxPwnr’s oscillation detection mechanism, which catches the most common failure mode in long-running pentesting sessions — the agent trying the same exploit with minor variations indefinitely.
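One way to detect both failure shapes is to look at a short window of recent commands. This is a sketch under simplifying assumptions: commands are compared by exact match after whitespace normalization, so near-duplicates with minor variations would need fuzzier matching. `detectLoop` is a hypothetical name, not the function in native-loop.ts.

```typescript
// Hypothetical detector: flags exact repetition (A A A A) and two-command
// oscillation (A B A B) in the most recent commands.
function normalizeCommand(command: string): string {
  return command.trim().replace(/\s+/g, " ");
}

function detectLoop(
  commands: string[],
  windowSize: number = 4
): "repeat" | "oscillation" | null {
  if (commands.length < windowSize) {
    return null;
  }
  const recent = commands.slice(-windowSize).map(normalizeCommand);
  if (recent.every((c) => c === recent[0])) {
    return "repeat";
  }
  // Oscillation: even positions match the first command, odd the second.
  const alternates = recent.every((c, i) => c === recent[i % 2]);
  if (alternates && recent[0] !== recent[1]) {
    return "oscillation";
  }
  return null;
}

console.log(detectLoop(["sqlmap -u x", "curl x", "sqlmap -u x", "curl x"])); // "oscillation"
```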

Context compaction at 60% window. When the context window reaches 60% capacity, summarize the current state (discovered endpoints, credentials, attack progress) and continue with a compacted context. Prevents the quality degradation that occurs past 40-60% fill. Based on BoxPwnr’s single_loop_compactation solver, which triggers compaction at 60% and has proven effective across hundreds of challenges. More aggressive than the originally planned 30k-token relay — compaction preserves the full conversation thread rather than doing a hard reset.
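The 60% trigger can be sketched as follows. This is an illustration only: `summarize` stands in for an LLM summarization call supplied by the caller, the `Message` shape is assumed, and keeping the last few messages verbatim is one possible policy rather than the one in native-loop.ts.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const COMPACTION_THRESHOLD = 0.6;

function shouldCompact(usedTokens: number, windowTokens: number): boolean {
  return usedTokens / windowTokens >= COMPACTION_THRESHOLD;
}

// Replace older messages with one summary; keep system prompts and the
// most recent `keepRecent` messages unchanged.
function compact(
  messages: Message[],
  summarize: (older: Message[]) => string,
  keepRecent: number = 4
): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  if (rest.length <= keepRecent) {
    return messages; // nothing worth compacting yet
  }
  const older = rest.slice(0, rest.length - keepRecent);
  const recent = rest.slice(rest.length - keepRecent);
  const summary: Message = {
    role: "user",
    content: "Summary of earlier work:\n" + summarize(older),
  };
  return system.concat([summary], recent);
}

console.log(shouldCompact(120000, 200000)); // true
```

Because compaction can fire again once the window refills, the summary message from one pass simply becomes part of the "older" material in the next.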

Dynamic playbooks after recon. packages/core/src/agent/playbooks.ts — 13 playbooks (sqli, ssti, idor, xss, ssrf, lfi, auth_bypass, blind_exploitation, cve_exploitation, command_injection, deserialization, request_smuggling, creative_idor). detectPlaybooks() matches tool-result text against per-type patterns and injects at most 3 into the conversation. The XSS playbook alone cracked XBEN-011 and XBEN-018.
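The matching step can be pictured with a small sketch. The patterns below are invented for illustration and cover only five of the thirteen types; the real per-type patterns live in playbooks.ts and are not reproduced here. `detectPlaybooksSketch` is deliberately named differently from the real detectPlaybooks().

```typescript
// Illustrative patterns only; not the contents of agent/playbooks.ts.
const PLAYBOOK_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "sqli", pattern: /sql syntax|sqlite|mysql/i },
  { name: "ssti", pattern: /jinja|twig/i },
  { name: "xss", pattern: /<script|onerror=/i },
  { name: "lfi", pattern: /\.\.\/|\/etc\/passwd/i },
  { name: "ssrf", pattern: /169\.254\.169\.254/i },
];

const MAX_PLAYBOOKS = 3;

// Return the names of matching playbooks, capped at MAX_PLAYBOOKS,
// in the order the patterns are declared.
function detectPlaybooksSketch(toolOutput: string): string[] {
  const matched: string[] = [];
  for (const entry of PLAYBOOK_PATTERNS) {
    if (matched.length >= MAX_PLAYBOOKS) {
      break;
    }
    if (entry.pattern.test(toolOutput)) {
      matched.push(entry.name);
    }
  }
  return matched;
}

console.log(detectPlaybooksSketch("You have an error in your SQL syntax near ''"));
```

With a first-match cap like this, declaration order acts as a priority order, which is one simple way to keep the injected context small.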

EGATS tree search. packages/core/src/agent/egats.ts — evidence-gated attack tree search. Each node is a hypothesis; a mini agent loop gathers evidence; only branches whose evidence score clears the threshold are expanded. Beam search keeps the top-K branches per level.
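The pruning rule can be sketched as a level-by-level beam search. To stay self-contained, the sketch assumes evidence scores are already attached to each hypothesis, whereas the real system would obtain them by running a mini agent loop per node. The `Hypothesis` shape and `beamSearch` are hypothetical.

```typescript
interface Hypothesis {
  label: string;
  score: number; // evidence score in [0, 1], precomputed for this sketch
  children: Hypothesis[];
}

// Expand level by level: drop branches below `threshold`, keep the top
// `beamWidth` survivors, and only expand children of survivors.
function beamSearch(
  root: Hypothesis,
  threshold: number,
  beamWidth: number,
  maxDepth: number
): Hypothesis[] {
  let frontier: Hypothesis[] = [root];
  const expanded: Hypothesis[] = [];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const survivors = frontier
      .filter((h) => h.score >= threshold)
      .sort((a, b) => b.score - a.score)
      .slice(0, beamWidth);
    let next: Hypothesis[] = [];
    for (const h of survivors) {
      expanded.push(h);
      next = next.concat(h.children);
    }
    frontier = next;
  }
  return expanded;
}

const tree: Hypothesis = {
  label: "login form is injectable",
  score: 0.9,
  children: [
    { label: "boolean-based blind SQLi", score: 0.8, children: [] },
    { label: "stored XSS in username", score: 0.3, children: [] },
  ],
};
console.log(beamSearch(tree, 0.5, 2, 3).map((h) => h.label));
```

The threshold is what makes the search evidence-gated: a weak branch is never expanded, so its subtree costs nothing.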

Best-of-N strategy racing. packages/core/src/racing.ts — runs the same target with multiple different strategies in parallel (aggressive, methodical, creative, each with its own system-prompt override and temperature hint), takes the first vulnerability. Inspired by BoxPwnr running ~10 solver configs.

Progress handoff. packages/core/src/agent/native-loop.ts + agentic-scanner.ts — the prior attempt’s structured progress (endpoints, credentials, confirmed vulns, failed approaches, tech stack) is injected into the retry’s “Prior Attempt Results” section.

Reachability gate. packages/core/src/triage/reachability.ts — suppresses findings whose sink is not reachable from an application entry point. The open-source version of Endor Labs’ “Code API” moat.
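A grep-style reachability check can be approximated in a few lines. The sketch assumes function bodies have already been extracted into a name-to-source map and that names are plain identifiers; it treats "name followed by an opening parenthesis" as a call. All names here are hypothetical and the real first pass in reachability.ts may work differently.

```typescript
// Assumed input: function name -> source text of its body.
type SourceIndex = { [fn: string]: string };

// Grep-style call detection: a known function counts as called when its
// name appears followed by "(". Assumes names contain no regex metacharacters.
function callsIn(body: string, index: SourceIndex): string[] {
  return Object.keys(index).filter((name) =>
    new RegExp("\\b" + name + "\\s*\\(").test(body)
  );
}

// Breadth-first walk from the entry points over the approximate call graph.
function isReachable(
  entryPoints: string[],
  sink: string,
  index: SourceIndex
): boolean {
  const seen: { [fn: string]: boolean } = {};
  const queue = entryPoints.slice();
  while (queue.length > 0) {
    const current = queue.shift() as string;
    if (current === sink) {
      return true;
    }
    if (seen[current] || index[current] === undefined) {
      continue;
    }
    seen[current] = true;
    for (const callee of callsIn(index[current], index)) {
      queue.push(callee);
    }
  }
  return false;
}

const sampleIndex: SourceIndex = {
  handleLogin: "const user = findUser(req.body.name);",
  findUser: "return runQuery('SELECT * FROM users WHERE name = ' + name);",
  runQuery: "return db.exec(sql);",
  runRawQuery: "return db.exec(input);", // never called from an entry point
};
console.log(isReachable(["handleLogin"], "runQuery", sampleIndex)); // true
console.log(isReachable(["handleLogin"], "runRawQuery", sampleIndex)); // false
```

A pattern-based pass like this over-approximates and under-approximates in different places (dynamic dispatch, aliased imports), so it works best as a cheap first filter before any LLM tokens are spent.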

foxguard multi-modal agreement. packages/core/src/triage/multi-modal.ts — for every pwnkit finding, run foxguard against the same source tree and require agreement before auto-accepting. Endor Labs’ rules-plus-neural architecture, open-source.
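The agreement rule reduces to a join between two finding lists. The sketch below assumes both tools' outputs have already been normalized to the same `{ file, category }` shape; foxguard's real output format is not described on this page, so that normalization and every name in the block are assumptions.

```typescript
interface Finding {
  file: string;
  category: string; // e.g. "sqli", "xss"
}

type Agreement = "strong-signal" | "likely-false-positive";

// Assumes both inputs are already normalized to the same shape.
function crossValidate(
  agentFindings: Finding[],
  scannerFindings: Finding[]
): { finding: Finding; agreement: Agreement }[] {
  const keyOf = (f: Finding): string => f.file + "::" + f.category;
  const scannerKeys = scannerFindings.map(keyOf);
  return agentFindings.map((finding) => {
    const agreement: Agreement =
      scannerKeys.indexOf(keyOf(finding)) >= 0
        ? "strong-signal"
        : "likely-false-positive";
    return { finding, agreement };
  });
}

const agentSide: Finding[] = [
  { file: "src/db.ts", category: "sqli" },
  { file: "src/view.ts", category: "xss" },
];
const scannerSide: Finding[] = [{ file: "src/db.ts", category: "sqli" }];
console.log(crossValidate(agentSide, scannerSide).map((r) => r.agreement));
```

Matching on file plus category is coarse; a line-range overlap would be stricter, at the cost of missing findings the two tools locate slightly differently.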

Consensus (self-consistency) verification. packages/core/src/triage/verify-pipeline.ts: runSelfConsistencyVerify runs the structured verify pipeline N times in parallel and takes the majority vote, terminating early once one verdict holds a lead that the remaining runs cannot overturn.
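The early-termination rule is simple counting: stop as soon as the gap between verdicts exceeds the number of votes still outstanding. The sketch below runs passes sequentially for clarity, whereas the text describes parallel execution. `runVerify` stands in for one pass of the verify pipeline, and the tie-break toward "rejected" is an assumption, not documented behavior.

```typescript
type Vote = "confirmed" | "rejected";

// `runVerify` is a stand-in for one verify-pipeline pass.
function majorityVote(
  runVerify: (attempt: number) => Vote,
  n: number
): { verdict: Vote; runsUsed: number } {
  let confirmed = 0;
  let rejected = 0;
  for (let i = 0; i < n; i++) {
    if (runVerify(i) === "confirmed") {
      confirmed++;
    } else {
      rejected++;
    }
    const remaining = n - (i + 1);
    // Stop once the trailing verdict cannot catch up with the votes left.
    if (Math.abs(confirmed - rejected) > remaining) {
      return {
        verdict: confirmed > rejected ? "confirmed" : "rejected",
        runsUsed: i + 1,
      };
    }
  }
  // Ties (possible for even n) fall back to "rejected": an assumed default.
  return { verdict: confirmed > rejected ? "confirmed" : "rejected", runsUsed: n };
}

console.log(majorityVote(() => "confirmed", 5)); // { verdict: 'confirmed', runsUsed: 3 }
```

With N = 5 and unanimous passes, the vote locks after three runs, saving two verify calls.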

Triage memories. packages/core/src/triage/memories.ts — Semgrep-style per-target persistent FP context. User marks a finding as FP with a reason; the reason becomes a TriageMemory scoped to global, package, or target. Strong matches auto-reject future findings without any LLM call.
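Scope matching is the core of the mechanism. The sketch below shows one plausible shape for a stored memory and a lookup that honors the three scopes; the field names and the `matchMemory` helper are hypothetical and not taken from memories.ts.

```typescript
type Scope = "global" | "package" | "target";

// Hypothetical shape; the real TriageMemory fields are not shown in the text.
interface TriageMemorySketch {
  scope: Scope;
  scopeValue: string; // package name or target id; ignored for "global"
  category: string;
  pathPattern: RegExp;
  reason: string;
}

interface IncomingFinding {
  target: string;
  pkg: string;
  category: string;
  path: string;
}

// Return the first stored memory that applies to this finding, or null.
function matchMemory(
  finding: IncomingFinding,
  memories: TriageMemorySketch[]
): TriageMemorySketch | null {
  for (const m of memories) {
    const inScope =
      m.scope === "global" ||
      (m.scope === "package" && m.scopeValue === finding.pkg) ||
      (m.scope === "target" && m.scopeValue === finding.target);
    if (inScope && m.category === finding.category && m.pathPattern.test(finding.path)) {
      return m;
    }
  }
  return null;
}

const storedMemories: TriageMemorySketch[] = [
  {
    scope: "package",
    scopeValue: "web-admin",
    category: "xss",
    pathPattern: /^test\//,
    reason: "Fixtures under test/ intentionally contain unescaped HTML",
  },
];
const incoming: IncomingFinding = {
  target: "staging",
  pkg: "web-admin",
  category: "xss",
  path: "test/fixtures/render.html",
};
console.log(matchMemory(incoming, storedMemories)?.reason);
```

A match here would let the triage stage reject the finding with the stored reason and skip the LLM call entirely.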

PoV gate. packages/core/src/triage/pov-gate.ts — a narrowly-scoped mini agent loop must produce a concrete executable exploit; no PoV → severity downgrade to info. Based on “All You Need Is A Fuzzing Brain” (arXiv:2509.07225).

Adversarial debate verification. packages/core/src/triage/adversarial.ts — prosecutor vs. defender agents debate each finding with fresh contexts; a skeptical judge decides. Based on Anthropic’s debate paper (arXiv:2402.06782). The point is uncorrelated error modes vs. single-pass verify.

External working memory. Persist structured notes (discovered endpoints, credentials, observed behaviors) in a memory store the agent can query. Prevents the agent from re-discovering information it already found. Inspired by Cyber-AutoAgent’s mem0 integration.

RAG from prior solves. Build a vector index of successful exploit chains from prior runs. When the agent encounters a similar challenge, retrieve relevant prior solves as context. Bootstraps experience without increasing the model’s context window.

| Paper | Reference | Key finding for pwnkit |
| --- | --- | --- |
| Meta-analysis of AI pentesting agents | arXiv:2602.17622 | Architecture matters less than tools + memory + search |
| MAPTA | arXiv:2508.20816 | 3-role system with evidence-gated branching, 40 tool calls is the sweet spot, $0.21/challenge |
| Co-RedTeam | Published 2025 | Multi-agent red teaming with shared memory improves coverage |
| TermiAgent | Published 2025 | Terminal-native agents outperform structured-tool agents on security tasks |
| CurriculumPT | Published 2025 | Curriculum learning for penetration testing: easy challenges first improves hard-challenge performance |
| CHAP | NDSS 2026 | Challenge-aware heuristic attack planning, presented at top security venue |

The meta-analysis (arXiv:2602.17622) is the most directly relevant. Its core claim — that the combination of tool quality, memory persistence, and search breadth predicts performance better than agent count or model choice — aligns with pwnkit’s shell-first philosophy. The paper surveyed all major agents on the XBOW benchmark and found that single-agent systems with good tools consistently outperform multi-agent systems with mediocre tools.

MAPTA’s evidence-gated branching is the clearest academic validation of “don’t speculate, verify.” Their system refuses to pursue an exploitation path unless prior steps produced concrete evidence. This is the principle behind pwnkit’s early-stop mechanism: if you haven’t found evidence of progress by turn 20, you’re speculating.

CHAP at NDSS 2026 introduces challenge-aware heuristic planning: the agent classifies the challenge type before attacking and selects a heuristic attack plan. This is the academic counterpart of pwnkit’s dynamic playbooks feature, which has already shipped.

| Feature | Based on | Notes |
| --- | --- | --- |
| Early-stop + retry | MAPTA turn budget data | native-loop.ts, agentic-scanner.ts |
| Blind SQLi templates | deadend-cli | prompts.ts |
| Patched XBOW fork | Challenge-level bug analysis | +10-15 flags |
| Shell-first architecture | TermiAgent, meta-analysis | Foundation; 7.4% cost of multi-agent |
| Loop detection | BoxPwnr oscillation detection | native-loop.ts |
| Context compaction (LLM-based, multi-recompaction) | BoxPwnr | native-loop.ts |
| Dynamic playbooks (13 types) | CurriculumPT, Cyber-AutoAgent | agent/playbooks.ts |
| EGATS attack tree search | MAPTA | agent/egats.ts |
| Best-of-N strategy racing | BoxPwnr | racing.ts |
| Progress handoff on retry | BoxPwnr | native-loop.ts, agentic-scanner.ts |

Triage-stage techniques (FP reduction moat)

See FP Reduction Moat for the full stack and published numbers.

| Feature | Based on | Notes |
| --- | --- | --- |
| Holding-it-wrong filter | Internal CVE-hunt analysis | triage/holding-it-wrong.ts |
| Feature extractor (45 features) | VulnBERT hybrid | triage/feature-extractor.ts |
| Reachability gate | Endor Labs “Code API” moat | triage/reachability.ts |
| Per-class exploit oracles | MAPTA evidence-gated branching | triage/oracles.ts |
| foxguard multi-modal agreement | Endor Labs rules+neural | triage/multi-modal.ts |
| Structured 4-step verify | GitHub Security Lab taskflow-agent | triage/verify-pipeline.ts |
| Consensus (self-consistency) verify | Self-consistency decoding | triage/verify-pipeline.ts |
| PoV gate | Fuzzing Brain (arXiv:2509.07225) | triage/pov-gate.ts |
| Triage memories | Semgrep Assistant | triage/memories.ts |
| Adversarial debate | Anthropic Debate (arXiv:2402.06782) | triage/adversarial.ts |

Near-term:

  • External working memory — agent writes plan/creds to disk, injected at reflection checkpoints
  • Layer 2 CodeBERT fine-tune on D2A labels

Medium-term:

  • RAG from prior solves — requires a corpus of successful runs to bootstrap
  • Tree-sitter-based interprocedural reachability to replace the grep-based first pass