Roadmap
This roadmap is opinionated. It prioritises product leverage over surface-area creep, and it stays honest about what has actually shipped vs what is still being scoped.
The current thesis is unchanged from earlier in the year:
- Make the core agentic pipeline trustworthy.
- Make the outputs operationally useful for real teams.
- Then add orchestration and control-plane UX on top.
What has changed is that the trust layer is now real. The 92.3% white-box best-of-N aggregate on the public XBOW benchmark is the proof. Most of the next-quarter work is about taking that capability and making it operationally easy to live with — for one developer running a single review, for a CI pipeline gating PRs, and for a security team running a continuous campaign.
Recently shipped (April 2026)
Section titled “Recently shipped (April 2026)”These are the things that landed since the last public roadmap snapshot. They should not be in the “Now” column anymore — they are done.
- 96 / 104 unique flags on XBOW (white-box best-of-N aggregate, 92.3%). Up from 95/104 the day before via the lean-scaffolding A/B sweep that solved XBEN-061 in 8 turns.
- 91 / 104 on the published black-box mode (87.5%). Both numbers reported separately, no methodology blending.
- Cost ceiling enforcement. Scans abort cleanly when a per-run USD budget is exceeded, instead of silently spending past it.
- Direct OSV advisory lookup in the npm audit pipeline. No more relying on
npm auditalone for known-CVE coverage. - Phase 4 pre-recon CVE check for white-box mode — the agent fingerprints the target stack against known vulnerabilities before it starts attacking.
- Phase A flag-regex extractor + gate filters. Catches flag-shaped strings the agent emits and validates them against expected format. This is what caught the XBEN-079 honeypot decoy (
FLAG{I'm_a_Script_Kiddie}) before it became a silent false positive. - Phase B malicious-package detection in the audit pipeline.
- Cloud handshake (Phase 1). Findings produced by the OSS engine can be normalised and ingested by an orchestrator service that tracks scan jobs end-to-end.
- A/B sweep cadence. Overnight workflow dispatches across feature combinations are now the default way improvements get tested. The first sweep cost $3.08 across 8 challenges, added one solve, and surfaced one actionable failure mode (the honeypot).
These are the next four things in priority order. Each one is small enough to ship in one or two focused passes.
1. Anti-honeypot / decoy-flag heuristic
Section titled “1. Anti-honeypot / decoy-flag heuristic”Goal: stop the agent from declaring victory on the first flag-shaped string it finds when the target deliberately seeds a fake one.
Why: the v1 A/B sweep caught the agent submitting FLAG{I'm_a_Script_Kiddie} on XBEN-079 (“Poison Inclusion - Protected”). The real flag was FLAG{7dda4e738d29a996f343a610}. The flag-extractor gate caught the mismatch and the run failed cleanly — but a non-CTF target could plant a similar honeypot string in a .git/config or a config file decoy and the agent would happily submit it as a finding.
Deliverables:
- on flag-shaped match, mark provisional and continue at least one more layer of exploration
- prefer hex/uuid shapes that match the surrounding suite’s flag format over jokey decoys
- expose the heuristic as
--decoy-detection(default: on)
This is a small, falsifiable change. If it lands XBEN-079 it almost certainly catches a class of similar honeypots in real engagements.
2. Statistical evaluation methodology — n=10 runs per cell
Section titled “2. Statistical evaluation methodology — n=10 runs per cell”Goal: replace single-shot benchmark anecdotes with measured per-attempt success rates and confidence intervals.
Why: the v1 sweep produced a single solve on XBEN-061 with a handoff,no-hiw,no-evidence combo that looked like a generalisable winning configuration. The v2 sweep ran the same combo against the same challenge as a regression test the next afternoon. It failed. The v1 solve was noise inside a 20–40% per-attempt success rate, not a signal worth defaulting on. This is the methodology lesson: a single XBOW solve is an anecdote, and any configuration recommendation that comes from a single solve is unsafe to promote.
Deliverables:
- benchmark harness flag for
--repeat Nthat runs each (challenge, configuration) cell N times and reports the success rate plus confidence interval - default protocol going forward: n=10 per cell when evaluating a new feature combination, before any promotion to default
- per-cell cost ceiling so the n=10 protocol stays under ~$5 per cell
- a separate methodology page in the docs explaining the difference between best-of-N (what XBOW reports) and per-attempt success rate (what we now measure internally)
Note: this change demoted the previous “lean scaffolding default for long-horizon white-box” priority that was here in the morning version of this roadmap. The lean combo is now treated as a hypothesis to test under the new protocol, not a default to promote.
3. Resumable scans
Section titled “3. Resumable scans”Goal: if a long review or scan dies, resume from stored state instead of restarting.
Why:
- the repo already persists
agent_sessionsandpipeline_events - long-running agentic workflows are expensive to restart
- this is what makes pwnkit feel like infrastructure instead of a disposable CLI run
Deliverables:
pwnkit-cli resume <scan-id>- stage-level checkpointing
- partial-result recovery after crash or timeout
- resume-safe report generation
4. Finding inbox + triage workflow
Section titled “4. Finding inbox + triage workflow”Goal: make findings manageable across repeated runs.
Why: “found a thing” is not enough for teams. Repeated findings need dedupe, suppression, and audit history.
Deliverables:
- finding fingerprinting across scans
- statuses such as
new,accepted,suppressed,needs-human,regression - suppression rules with reason + expiration
- comments / notes on findings
- diff view between scans
These become much more valuable once the items above are solid.
5. Diff-aware PR scanning
Section titled “5. Diff-aware PR scanning”Goal: make the GitHub Action and CI path fast enough to use on every PR.
Why: full deep review on every pull request is too expensive. Most teams want “changed files first, expand when suspicious.”
Deliverables:
- changed-file targeting for
review - priority scoring for touched paths, auth, secrets, network, tool-use, eval-like sinks
- optional fallback to full review on high-risk deltas
- PR summary output tuned for reviewer action
6. Deterministic replay for every finding
Section titled “6. Deterministic replay for every finding”Goal: every confirmed finding should be reproducible on demand.
Why: replay is how the tool earns trust. It is the bridge between “AI said so” and “I can see it myself.”
Deliverables:
- replay command from finding ID
- saved exploit inputs / requests / prompts
- verifier transcript and verdict trace
- artifact bundle for share / export
7. Multi-target orchestration
Section titled “7. Multi-target orchestration”Goal: scan many repos, packages, or endpoints as one campaign. This is where subagents actually matter.
Good use of subagents:
- fan out research across many targets
- parallel blind verification
- aggregate results into one campaign view
Bad use of subagents:
- navigation gimmicks
- vague “AI assistant” behaviour with no task boundary
Deliverables:
- campaign runs
- worker pool / concurrency controls
- queueing and retry policy
- shared target inventory and cross-target clustering
8. Local dashboard / operations shell
Section titled “8. Local dashboard / operations shell”Goal: expose the stored scan state as a real operator interface for running the autonomous control plane, working the review inbox, and inspecting runtime failures.
Status:
- baseline shipped: grouped findings, thread-level workflow, quick filtering, scan dossiers, recent shadcn rebuild
- next cut: operations-first home, active run stage progress, replay launch, and better provenance links between threads and runs
Core views:
- operations control as the primary home
- review inbox for operator decisions and blocked automation
- scan dossiers and pipeline timelines as supporting provenance views
- replay / evidence viewer
- target inventory
- scan history and trend charts
These are valuable, but they should not outrank the workflow / control-plane work above.
9. Policy packs and organisation presets
Section titled “9. Policy packs and organisation presets”- suppressions as code
- severity gates by environment
- org-level runtime / model defaults
- approved attack template sets
10. Richer target inventory and trend analysis
Section titled “10. Richer target inventory and trend analysis”- first-seen / last-seen attack surface changes
- recurring finding families
- regression alerts
- “what changed since last green run”
11. Distributed workers / remote execution
Section titled “11. Distributed workers / remote execution”- remote queue workers
- large campaign execution
- shared artifact store
- eventually a hosted control plane if adoption justifies it
Non-Goals Right Now
Section titled “Non-Goals Right Now”Things that sound flashy but should stay below the line for now:
- a giant SaaS dashboard before the local workflow is excellent
- “chat with your findings” before replay, dedupe, and triage are strong
- adding lots of new scan modes without stronger replay and campaign ergonomics
- subagents used as UI magic instead of bounded workers
- EGATS-style tree search on challenges this size — the v1 sweep proved it currently costs more than it earns
Product Direction
Section titled “Product Direction”The best version of pwnkit is:
- a sharp local CLI for one-off deep work
- a reliable CI primitive for PRs and repos
- a persistent evidence store for findings and agent runs
- a local operations shell on top of that state
- eventually a separate distributed agentic security control plane for campaigns and remote workers
That is more compelling than being “yet another scanner with more templates.” The XBOW number is the proof that the core capability is real. The roadmap above is the work to turn that capability into something teams can actually live with.