Skip to content

Roadmap

This roadmap is opinionated. It prioritises product leverage over surface-area creep, and it stays honest about what has actually shipped vs what is still being scoped.

The current thesis is unchanged from earlier in the year:

  1. Make the core agentic pipeline trustworthy.
  2. Make the outputs operationally useful for real teams.
  3. Then add orchestration and control-plane UX on top.

What has changed is that the trust layer is now real. The 92.3% white-box best-of-N aggregate on the public XBOW benchmark is the proof. Most of the next-quarter work is about taking that capability and making it operationally easy to live with — for one developer running a single review, for a CI pipeline gating PRs, and for a security team running a continuous campaign.

These are the things that landed since the last public roadmap snapshot. They should not be in the “Now” column anymore — they are done.

  • 96 / 104 unique flags on XBOW (white-box best-of-N aggregate, 92.3%). Up from 95/104 the day before via the lean-scaffolding A/B sweep that solved XBEN-061 in 8 turns.
  • 91 / 104 on the published black-box mode (87.5%). Both numbers reported separately, no methodology blending.
  • Cost ceiling enforcement. Scans abort cleanly when a per-run USD budget is exceeded, instead of silently spending past it.
  • Direct OSV advisory lookup in the npm audit pipeline. No more relying on npm audit alone for known-CVE coverage.
  • Phase 4 pre-recon CVE check for white-box mode — the agent fingerprints the target stack against known vulnerabilities before it starts attacking.
  • Phase A flag-regex extractor + gate filters. Catches flag-shaped strings the agent emits and validates them against expected format. This is what caught the XBEN-079 honeypot decoy (FLAG{I'm_a_Script_Kiddie}) before it became a silent false positive.
  • Phase B malicious-package detection in the audit pipeline.
  • Cloud handshake (Phase 1). Findings produced by the OSS engine can be normalised and ingested by an orchestrator service that tracks scan jobs end-to-end.
  • A/B sweep cadence. Overnight workflow dispatches across feature combinations are now the default way improvements get tested. The first sweep cost $3.08 across 8 challenges, added one solve, and surfaced one actionable failure mode (the honeypot).

These are the next four things in priority order. Each one is small enough to ship in one or two focused passes.

Goal: stop the agent from declaring victory on the first flag-shaped string it finds when the target deliberately seeds a fake one.

Why: the v1 A/B sweep caught the agent submitting FLAG{I'm_a_Script_Kiddie} on XBEN-079 (“Poison Inclusion - Protected”). The real flag was FLAG{7dda4e738d29a996f343a610}. The flag-extractor gate caught the mismatch and the run failed cleanly — but a non-CTF target could plant a similar honeypot string in a .git/config or a config file decoy and the agent would happily submit it as a finding.

Deliverables:

  • on flag-shaped match, mark provisional and continue at least one more layer of exploration
  • prefer hex/uuid shapes that match the surrounding suite’s flag format over jokey decoys
  • expose the heuristic as --decoy-detection (default: on)

This is a small, falsifiable change. If it lands XBEN-079 it almost certainly catches a class of similar honeypots in real engagements.

2. Statistical evaluation methodology — n=10 runs per cell

Section titled “2. Statistical evaluation methodology — n=10 runs per cell”

Goal: replace single-shot benchmark anecdotes with measured per-attempt success rates and confidence intervals.

Why: the v1 sweep produced a single solve on XBEN-061 with a handoff,no-hiw,no-evidence combo that looked like a generalisable winning configuration. The v2 sweep ran the same combo against the same challenge as a regression test the next afternoon. It failed. The v1 solve was noise inside a 20–40% per-attempt success rate, not a signal worth defaulting on. This is the methodology lesson: a single XBOW solve is an anecdote, and any configuration recommendation that comes from a single solve is unsafe to promote.

Deliverables:

  • benchmark harness flag for --repeat N that runs each (challenge, configuration) cell N times and reports the success rate plus confidence interval
  • default protocol going forward: n=10 per cell when evaluating a new feature combination, before any promotion to default
  • per-cell cost ceiling so the n=10 protocol stays under ~$5 per cell
  • a separate methodology page in the docs explaining the difference between best-of-N (what XBOW reports) and per-attempt success rate (what we now measure internally)

Note: this change demoted the previous “lean scaffolding default for long-horizon white-box” priority that was here in the morning version of this roadmap. The lean combo is now treated as a hypothesis to test under the new protocol, not a default to promote.

Goal: if a long review or scan dies, resume from stored state instead of restarting.

Why:

  • the repo already persists agent_sessions and pipeline_events
  • long-running agentic workflows are expensive to restart
  • this is what makes pwnkit feel like infrastructure instead of a disposable CLI run

Deliverables:

  • pwnkit-cli resume <scan-id>
  • stage-level checkpointing
  • partial-result recovery after crash or timeout
  • resume-safe report generation

Goal: make findings manageable across repeated runs.

Why: “found a thing” is not enough for teams. Repeated findings need dedupe, suppression, and audit history.

Deliverables:

  • finding fingerprinting across scans
  • statuses such as new, accepted, suppressed, needs-human, regression
  • suppression rules with reason + expiration
  • comments / notes on findings
  • diff view between scans

These become much more valuable once the items above are solid.

Goal: make the GitHub Action and CI path fast enough to use on every PR.

Why: full deep review on every pull request is too expensive. Most teams want “changed files first, expand when suspicious.”

Deliverables:

  • changed-file targeting for review
  • priority scoring for touched paths, auth, secrets, network, tool-use, eval-like sinks
  • optional fallback to full review on high-risk deltas
  • PR summary output tuned for reviewer action

Goal: every confirmed finding should be reproducible on demand.

Why: replay is how the tool earns trust. It is the bridge between “AI said so” and “I can see it myself.”

Deliverables:

  • replay command from finding ID
  • saved exploit inputs / requests / prompts
  • verifier transcript and verdict trace
  • artifact bundle for share / export

Goal: scan many repos, packages, or endpoints as one campaign. This is where subagents actually matter.

Good use of subagents:

  • fan out research across many targets
  • parallel blind verification
  • aggregate results into one campaign view

Bad use of subagents:

  • navigation gimmicks
  • vague “AI assistant” behaviour with no task boundary

Deliverables:

  • campaign runs
  • worker pool / concurrency controls
  • queueing and retry policy
  • shared target inventory and cross-target clustering

Goal: expose the stored scan state as a real operator interface for running the autonomous control plane, working the review inbox, and inspecting runtime failures.

Status:

  • baseline shipped: grouped findings, thread-level workflow, quick filtering, scan dossiers, recent shadcn rebuild
  • next cut: operations-first home, active run stage progress, replay launch, and better provenance links between threads and runs

Core views:

  • operations control as the primary home
  • review inbox for operator decisions and blocked automation
  • scan dossiers and pipeline timelines as supporting provenance views
  • replay / evidence viewer
  • target inventory
  • scan history and trend charts

These are valuable, but they should not outrank the workflow / control-plane work above.

  • suppressions as code
  • severity gates by environment
  • org-level runtime / model defaults
  • approved attack template sets

10. Richer target inventory and trend analysis

Section titled “10. Richer target inventory and trend analysis”
  • first-seen / last-seen attack surface changes
  • recurring finding families
  • regression alerts
  • “what changed since last green run”

11. Distributed workers / remote execution

Section titled “11. Distributed workers / remote execution”
  • remote queue workers
  • large campaign execution
  • shared artifact store
  • eventually a hosted control plane if adoption justifies it

Things that sound flashy but should stay below the line for now:

  • a giant SaaS dashboard before the local workflow is excellent
  • “chat with your findings” before replay, dedupe, and triage are strong
  • adding lots of new scan modes without stronger replay and campaign ergonomics
  • subagents used as UI magic instead of bounded workers
  • EGATS-style tree search on challenges this size — the v1 sweep proved it currently costs more than it earns

The best version of pwnkit is:

  • a sharp local CLI for one-off deep work
  • a reliable CI primitive for PRs and repos
  • a persistent evidence store for findings and agent runs
  • a local operations shell on top of that state
  • eventually a separate distributed agentic security control plane for campaigns and remote workers

That is more compelling than being “yet another scanner with more templates.” The XBOW number is the proof that the core capability is real. The roadmap above is the work to turn that capability into something teams can actually live with.