Triage Dataset

packages/benchmark/src/triage-data-collector.ts converts benchmark artifacts and local verified findings into a single JSONL dataset for training true-positive / false-positive classifiers.

The output is designed to be useful for two families of models:

  • pure text classifiers over finding title / description / request / response
  • hybrid models that fuse text embeddings with pwnkit’s handcrafted 45-feature vector

This is the data pipeline behind the paper plan tracked in issue #67.

The collector supports four input surfaces:

| Input | Flag | Ground truth source |
| --- | --- | --- |
| XBOW / Cybench-style results JSON | --results <file> | Flag extraction |
| npm-bench results JSON | --npm-bench <file> | Package verdict |
| pwnkit SQLite DB | --db <file> | Blind verify status |
| Directory of scan DBs | --scan-dir <dir> | Blind verify status |

If you run the collector with no explicit --results or --npm-bench flag, it will also auto-scan packages/benchmark/results/*.json and route files by filename:

  • *npm-bench*.json -> npm-bench path
  • every other .json -> XBOW-style path
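The filename routing above can be sketched as a small helper. `routeResultsFile` is a hypothetical name for illustration, not the collector's actual implementation:

```typescript
// Sketch of the auto-scan filename routing described above.
// routeResultsFile is a hypothetical helper, not the collector's real code.
type ResultsKind = "npm-bench" | "xbow";

function routeResultsFile(filename: string): ResultsKind | null {
  if (!filename.endsWith(".json")) return null; // only .json files are scanned
  // *npm-bench*.json takes the npm-bench path; every other .json is XBOW-style
  return filename.includes("npm-bench") ? "npm-bench" : "xbow";
}
```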

Run against a specific benchmark artifact:

```sh
pnpm --filter @pwnkit/benchmark exec tsx src/triage-data-collector.ts \
  --npm-bench packages/benchmark/results/npm-bench-latest.json \
  --output packages/benchmark/results/triage-dataset.jsonl
```

Combine npm-bench with local verified findings from the SQLite DB:

```sh
pnpm --filter @pwnkit/benchmark exec tsx src/triage-data-collector.ts \
  --npm-bench packages/benchmark/results/npm-bench-latest.json \
  --db ~/.pwnkit/pwnkit.db \
  --output packages/benchmark/results/triage-dataset-mixed.jsonl
```

Pull labels from a whole directory of scan databases:

```sh
pnpm --filter @pwnkit/benchmark exec tsx src/triage-data-collector.ts \
  --scan-dir ./.pwnkit/scans \
  --output packages/benchmark/results/triage-dataset-from-db.jsonl
```

Each line is one JSON object with this shape:

| Field | Type | Meaning |
| --- | --- | --- |
| text | string | Flattened training text: title, category, severity, description, request, response, optional analysis |
| features | number[45] | Handcrafted feature vector from extractFeatures() |
| label | 0 \| 1 | Numeric classification target |
| label_text | "true_positive" \| "false_positive" | Human-readable target |
| source | string | Provenance string identifying the benchmark case or verified scan |
| label_source | string | How the ground truth was assigned |
| confidence | number | Agent-reported confidence, copied from the finding when available |
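The row shape above can be written down as a TypeScript interface. The `TriageSample` name matches the collector's type; the interface body is a reconstruction from the field table, and `parseTriageLine` is a hypothetical reader added for illustration:

```typescript
// Reconstruction of the JSONL row shape from the field table above.
// parseTriageLine is a hypothetical helper, not part of the collector.
interface TriageSample {
  text: string;
  features: number[]; // always 45 entries
  label: 0 | 1;
  label_text: "true_positive" | "false_positive";
  source: string;
  label_source: string;
  confidence?: number; // present only when the finding reported one
}

function parseTriageLine(line: string): TriageSample {
  const row = JSON.parse(line) as TriageSample;
  if (!Array.isArray(row.features) || row.features.length !== 45) {
    throw new Error("expected a 45-entry features array");
  }
  return row;
}
```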

The label_source field of TriageSample currently takes one of these values:

| label_source | Meaning | Emitted by current collector? |
| --- | --- | --- |
| flag_extraction | The agent captured the real benchmark flag, so the finding is treated as a true positive | Yes |
| package_verdict | The benchmark labels the package as malicious, vulnerable, or safe | Yes |
| blind_verify | The finding status in the SQLite DB says it was verified / confirmed vs. false-positive / rejected | Yes |
| manual | Reserved for future hand-curated rows or external labels | Not by the built-in collector today |
How each source maps onto positive and negative labels:

| Source | Positive label | Negative label | Notes |
| --- | --- | --- | --- |
| XBOW / Cybench results | flagFound = true | flagFound = false | One benchmark result can yield many finding rows |
| npm-bench | package verdict is malicious or vulnerable | package verdict is safe | Coarse package-level labeling, not per-finding labeling |
| SQLite DB | finding status is verified or confirmed | finding status is false_positive or rejected | Skips rows with unknown status |
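The label polarity rules can be sketched as two small mappers, one per ground-truth source. Both helper names are hypothetical; a null return stands for "skip the row":

```typescript
// Sketch of the label polarity described above. For npm-bench, the verdict
// labels every finding against the package; both function names are
// illustrative, not the collector's actual code.
function labelForNpmVerdict(verdict: string): 0 | 1 | null {
  if (verdict === "malicious" || verdict === "vulnerable") return 1;
  if (verdict === "safe") return 0; // findings against safe packages become negatives
  return null;
}

function labelForDbStatus(status: string): 0 | 1 | null {
  if (status === "verified" || status === "confirmed") return 1;
  if (status === "false_positive" || status === "rejected") return 0;
  return null; // unknown status: skip the row
}
```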

source is deliberately simple and human-readable:

| Source family | Format | Example |
| --- | --- | --- |
| XBOW / Cybench-style JSON | <challenge-id> | XBEN-001 |
| npm-bench | npm-bench:<pkg>:<verdict> | npm-bench:event-stream:malicious |
| SQLite DB | <target>-<scan_id> | https://example.com-scan_01HXYZ... |

The row id is stricter and used only for deduplication. It is built from the benchmark case / scan id plus the finding id when available.
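A minimal sketch of that id construction and the dedup pass, under the assumptions that the parts are joined with ":" and that the first occurrence of an id wins; both helper names are hypothetical:

```typescript
// Sketch of id-based deduplication as described above. buildRowId and
// dedupeById are illustrative names; the ":" separator is an assumption.
function buildRowId(caseOrScanId: string, findingId?: string): string {
  return findingId ? `${caseOrScanId}:${findingId}` : caseOrScanId;
}

function dedupeById<T>(rows: T[], idOf: (row: T) => string): T[] {
  const seen = new Set<string>();
  return rows.filter((row) => {
    const id = idOf(row);
    if (seen.has(id)) return false; // drop repeats, keep first occurrence
    seen.add(id);
    return true;
  });
}
```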

Pretty-printed example, abbreviated for readability. The real features array always has 45 numeric entries.

```json
{
  "text": "Title: Prototype pollution\nCategory: prototype_pollution\nSeverity: high\nDescription: Vulnerable merge path reachable from user input\nRequest: GET /api/search?q=__proto__\nResponse: HTTP/1.1 500 Internal Server Error\nAnalysis: Confirmed by benchmark ground truth",
  "features": [500, 0, 0, 1, 0, 0, 0, 0, 0, 33, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 3, 0.9, 1, 1, 0, 1, 0, 0, 58, 1, 1, 0, 1, 42, 0, 1, 1, 1, 1, 2.7, 1.5, 1],
  "label": 1,
  "label_text": "true_positive",
  "source": "npm-bench:[email protected]:vulnerable",
  "label_source": "package_verdict",
  "confidence": 0.9
}
```
When training on this dataset:

  • Dedup by id first. The collector already does this before writing.
  • Stratify by both label_text and label_source, not just the binary label.
  • Keep benchmark families separated where possible; for example, don’t let rows from the same benchmark case leak across train and test.
  • For mixed datasets, report metrics per source family as well as global averages: a model that performs well on web findings may do poorly on npm supply-chain findings.

Recommended split policy:

  1. Hold out one whole source family if you want a domain-transfer test.
  2. Otherwise split within each label_source bucket.
  3. Preserve class balance after deduplication, not before.
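Step 2 of the policy can be sketched as a per-bucket split. `splitByLabelSource` and the 80/20 default are illustrative assumptions; in real use, shuffle rows (after dedup) before cutting:

```typescript
// Sketch of splitting within each label_source bucket, so every bucket
// contributes to both train and test. Names and ratio are illustrative.
function splitByLabelSource<T extends { label_source: string }>(
  rows: T[],
  testFraction = 0.2,
): { train: T[]; test: T[] } {
  const buckets = new Map<string, T[]>();
  for (const row of rows) {
    const bucket = buckets.get(row.label_source) ?? [];
    bucket.push(row);
    buckets.set(row.label_source, bucket);
  }
  const train: T[] = [];
  const test: T[] = [];
  for (const bucket of buckets.values()) {
    const cut = Math.floor(bucket.length * (1 - testFraction));
    train.push(...bucket.slice(0, cut));
    test.push(...bucket.slice(cut));
  }
  return { train, test };
}
```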

package_verdict is intentionally coarse. If a package is labeled safe, then every finding emitted against it becomes a false_positive row. That is useful because it gives us cheap negative labels at scale, but it is not the same thing as hand-labeling each finding individually.

That trade-off is acceptable for baseline training and ablation work, but any paper or benchmark should call out the noise floor explicitly.