Feature Extractor

packages/core/src/triage/feature-extractor.ts exports two symbols that together define a 45-element numeric feature vector, used as a fast first-pass triage signal:

  • extractFeatures(finding): number[]
  • FEATURE_NAMES: string[]
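As a rough illustration of how the two exports fit together, the sketch below pairs each value with its name so a vector can be inspected by feature name. The `toNamedFeatures` helper is hypothetical and not part of the module, and the short inline arrays merely stand in for `FEATURE_NAMES` and the output of `extractFeatures(finding)`.

```typescript
// Hypothetical helper (not part of feature-extractor.ts): zip feature names
// with a feature vector so individual values can be looked up by name.
function toNamedFeatures(
  names: readonly string[],
  values: readonly number[],
): Record<string, number> {
  if (names.length !== values.length) {
    throw new Error(
      `length mismatch: ${names.length} names vs ${values.length} values`,
    );
  }
  const named: Record<string, number> = {};
  names.forEach((name, i) => {
    named[name] = values[i] ?? 0;
  });
  return named;
}

// Stand-ins for FEATURE_NAMES and extractFeatures(finding).
// The real vector has 45 entries; three are enough to show the idea.
const sampleNames = ["resp_http_status", "resp_sql_error", "resp_stack_trace"];
const sampleValues = [500, 1, 0];

const named = toNamedFeatures(sampleNames, sampleValues);
console.log(named["resp_http_status"], named["resp_sql_error"]);
```

Throwing on a length mismatch is a design choice in this sketch: a vector that does not line up with its names is more likely a versioning bug than something to paper over.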

The extractor is intentionally cheap:

  • pure regex and string operations
  • no LLM calls
  • no network requests
  • no external scanners
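To make "pure regex and string operations" concrete, here is a minimal sketch of how two boolean response features could be computed. The patterns and the `sketchResponseFlags` function are illustrative assumptions; the actual regexes in feature-extractor.ts are not reproduced here and are likely broader.

```typescript
// Illustrative patterns only (assumptions, not the extractor's real regexes).
const SQL_ERROR_PATTERN =
  /(sql syntax|sqlstate\[|ora-\d{5}|unterminated quoted string)/i;
const STACK_TRACE_PATTERN =
  /(Traceback \(most recent call last\)|^\s+at \S+ \(.*:\d+:\d+\))/m;

// Encode a boolean check as 0 or 1, matching the "bool 0/1" feature type.
function flag(pattern: RegExp, text: string): number {
  return pattern.test(text) ? 1 : 0;
}

// Returns [sqlErrorFlag, stackTraceFlag] for a raw response body.
function sketchResponseFlags(responseText: string): number[] {
  return [
    flag(SQL_ERROR_PATTERN, responseText), // resp_sql_error-style signal
    flag(STACK_TRACE_PATTERN, responseText), // resp_stack_trace-style signal
  ];
}

console.log(sketchResponseFlags("You have an error in your SQL syntax"));
```

Neither pattern uses the global flag, so `RegExp.prototype.test` keeps no state between calls and the function stays deterministic.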

It is the handcrafted half of pwnkit’s VulnBERT-inspired hybrid direction.

VulnBERT’s reported ablations are the key reference point:

| Model family | Recall | False positive rate |
| --- | --- | --- |
| Handcrafted features alone | 76.8% | 15.9% |
| CodeBERT alone | 84.3% | 4.2% |
| Hybrid features + neural fusion | 92.2% | 1.2% |

pwnkit uses the handcrafted layer today because it is deterministic, explainable, and cheap enough to run before any paid verification.

| Group | Count | Indices |
| --- | --- | --- |
| Response features | 13 | 0-12 |
| Request features | 10 | 13-22 |
| Metadata features | 8 | 23-30 |
| Text quality features | 10 | 31-40 |
| Cross-field features | 4 | 41-44 |

Response features (indices 0-12)

| Idx | Name | Type | Range | Rationale |
| --- | --- | --- | --- | --- |
| 0 | resp_http_status | int | 0+ | Raw HTTP status extracted from evidence response |
| 1 | resp_sql_error | bool | 0/1 | SQL error strings are strong exploit evidence for SQLi |
| 2 | resp_stack_trace | bool | 0/1 | Stack traces often indicate unexpected execution paths |
| 3 | resp_error_message | bool | 0/1 | Generic error responses raise exploit plausibility |
| 4 | resp_payload_exact_reflection | bool | 0/1 | Exact reflection is strong for reflected XSS / echo-driven bugs |
| 5 | resp_payload_partial_reflection | bool | 0/1 | Partial reflection catches normalized or truncated echoes |
| 6 | resp_sensitive_data | bool | 0/1 | Leaked credentials / tokens / PII are high-signal evidence |
| 7 | resp_flag_pattern | bool | 0/1 | CTF-style flag markers are direct benchmark proof |
| 8 | resp_content_type_match | bool | 0/1 | Matching content type helps align evidence with the claimed category |
| 9 | resp_length | int | 0+ | Response size often distinguishes empty errors from real leakage |
| 10 | resp_waf_signature | bool | 0/1 | WAF blocks can explain false negatives and noisy probes |
| 11 | resp_redirect | bool | 0/1 | Redirect behavior matters for auth and route-confusion findings |
| 12 | resp_5xx_status | bool | 0/1 | Server errors are noisy but still useful exploit context |

Request features (indices 13-22)

| Idx | Name | Type | Range | Rationale |
| --- | --- | --- | --- | --- |
| 13 | req_sql_syntax | bool | 0/1 | SQL syntax in the request aligns with SQLi claims |
| 14 | req_xss_payload | bool | 0/1 | Script / event-handler patterns align with XSS claims |
| 15 | req_ssti_syntax | bool | 0/1 | Template delimiters align with SSTI claims |
| 16 | req_path_traversal | bool | 0/1 | Traversal markers align with file-read claims |
| 17 | req_command_injection | bool | 0/1 | Shell metacharacters align with command-injection claims |
| 18 | req_encoding_detected | bool | 0/1 | Encoded payloads often show intentional bypass attempts |
| 19 | req_http_method | int | small ordinal | Request-method encoding lets the model distinguish GET/POST/etc. |
| 20 | req_auth_header | bool | 0/1 | Auth-bearing requests matter for IDOR / auth-boundary findings |
| 21 | req_param_count | int | 0+ | Parameter fanout helps characterize probe complexity |
| 22 | req_body_length | int | 0+ | Body size separates tiny probes from full exploit payloads |

Metadata features (indices 23-30)

| Idx | Name | Type | Range | Rationale |
| --- | --- | --- | --- | --- |
| 23 | meta_severity_ordinal | int | 0-4 | Encodes severity as an ordinal prior |
| 24 | meta_confidence | float | 0-1 typical | Carries the agent’s own confidence into triage |
| 25 | meta_high_confidence_category | bool | 0/1 | Some categories are more reliable than others in practice |
| 26 | meta_injection_class | bool | 0/1 | Injection bugs share structural traits worth flagging |
| 27 | meta_access_control_class | bool | 0/1 | Access-control bugs differ materially from injection bugs |
| 28 | meta_has_template_id | bool | 0/1 | Template-backed findings tend to be more structured |
| 29 | meta_has_cwe | bool | 0/1 | CWE references correlate with more mature analysis text |
| 30 | meta_has_cve | bool | 0/1 | CVE references often indicate external corroboration |

Text quality features (indices 31-40)

| Idx | Name | Type | Range | Rationale |
| --- | --- | --- | --- | --- |
| 31 | text_description_length | int | 0+ | Very short descriptions often correlate with low-quality findings |
| 32 | text_repro_steps | bool | 0/1 | Reproduction steps are a strong signal of finding quality |
| 33 | text_impact_statement | bool | 0/1 | Explicit impact reasoning raises trust in the finding |
| 34 | text_hedging_language | bool | 0/1 | Hedging often correlates with weak or speculative findings |
| 35 | text_verification_language | bool | 0/1 | “confirmed”, “verified”, “reproduced” are strong positive cues |
| 36 | text_analysis_length | int | 0+ | Richer analysis text often means a more grounded claim |
| 37 | text_code_blocks | bool | 0/1 | Embedded code or PoC snippets raise exploit credibility |
| 38 | text_evidence_request_nonempty | bool | 0/1 | Missing request evidence is a major weakness |
| 39 | text_evidence_response_nonempty | bool | 0/1 | Missing response evidence is a major weakness |
| 40 | text_evidence_analysis_nonempty | bool | 0/1 | Missing analyst context weakens triage quality |

Cross-field features (indices 41-44)

| Idx | Name | Type | Range | Rationale |
| --- | --- | --- | --- | --- |
| 41 | cross_payload_category_consistent | bool | 0/1 | Checks that the request payload matches the claimed bug class |
| 42 | cross_severity_confidence_interaction | float | 0+ | Multiplies severity prior by agent confidence |
| 43 | cross_response_request_length_ratio | float | 0+ | Large response / request ratios can indicate leakage or reflected amplification |
| 44 | cross_evidence_completeness | float | 0-1 | Non-empty request / response / analysis count divided by 3 |
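The last three cross-field features (indices 42-44) are plain arithmetic over other fields, which can be sketched as follows. The `EvidenceSketch` shape and all function names are hypothetical, and two details are assumptions rather than confirmed behavior: the guard against an empty request when dividing, and treating whitespace-only strings as empty.

```typescript
// Hypothetical, simplified input shape; the real finding type may differ.
interface EvidenceSketch {
  request: string;
  response: string;
  analysis: string;
}

// cross_severity_confidence_interaction: severity ordinal (0-4) times confidence.
function severityConfidenceInteraction(
  severityOrdinal: number,
  confidence: number,
): number {
  return severityOrdinal * confidence;
}

// cross_response_request_length_ratio: response length over request length.
// Assumption: divide by at least 1 so an empty request does not yield Infinity.
function responseRequestLengthRatio(e: EvidenceSketch): number {
  return e.response.length / Math.max(e.request.length, 1);
}

// cross_evidence_completeness: count of non-empty evidence fields divided by 3.
// Assumption: whitespace-only strings count as empty.
function evidenceCompleteness(e: EvidenceSketch): number {
  const fields = [e.request, e.response, e.analysis];
  const nonEmpty = fields.filter((f) => f.trim().length > 0).length;
  return nonEmpty / 3;
}

const sampleEvidence: EvidenceSketch = {
  request: "GET /?id=1",
  response: "x".repeat(100),
  analysis: "",
};
console.log(
  severityConfidenceInteraction(4, 0.5),
  responseRequestLengthRatio(sampleEvidence),
  evidenceCompleteness(sampleEvidence),
);
```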

The features deliberately mix five signal families:

  • response evidence
  • request payload shape
  • metadata priors
  • text-quality heuristics
  • cross-field consistency

That split reflects how triage actually works in practice: a finding is not credible because any one field looks good, but because several independent signals line up.

The extractor is meant to pair with text embeddings, not replace them.

The natural hybrid shape is:

  1. text -> encoder embedding
  2. features -> linear projection
  3. fusion head over both representations
  4. binary TP/FP classifier

That is the same broad architecture family referenced from Finding Triage ML.
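The four steps above can be sketched as a single forward pass. This is only an illustration under stated assumptions: fusion is done by concatenation followed by one linear layer and a sigmoid, the weights are untrained placeholders, and every name here (`FusionParams`, `fusionForward`, and so on) is hypothetical rather than taken from pwnkit.

```typescript
type Vector = number[];
type Matrix = number[][]; // rows x cols

// Dense layer: y = W x + b.
function linear(W: Matrix, b: Vector, x: Vector): Vector {
  return W.map((row, i) =>
    row.reduce((sum, w, j) => sum + w * (x[j] ?? 0), b[i] ?? 0),
  );
}

const sigmoid = (z: number): number => 1 / (1 + Math.exp(-z));

interface FusionParams {
  projW: Matrix; // step 2: projects the handcrafted features
  projB: Vector;
  headW: Matrix; // steps 3-4: maps the fused vector to a single logit
  headB: Vector;
}

// Returns a score in (0, 1), read as the estimated probability of a true positive.
function fusionForward(
  embedding: Vector, // step 1: assumed to come from a text encoder
  features: Vector,
  p: FusionParams,
): number {
  const projected = linear(p.projW, p.projB, features);
  const fused = [...embedding, ...projected]; // concatenation fusion (assumption)
  const [logit = 0] = linear(p.headW, p.headB, fused);
  return sigmoid(logit);
}

// Toy dimensions: 2-d embedding, 3 features projected down to 2.
const toyParams: FusionParams = {
  projW: [
    [1, 0, 0],
    [0, 0, 1],
  ],
  projB: [0, 0],
  headW: [[1, -1, 1, -1]],
  headB: [0],
};
const toyScore = fusionForward([1, 2], [1, 0, 3], toyParams);
console.log(toyScore);
```

In a real setup the embedding would come from a trained encoder and the weights from training; the point here is only the data flow between the two representations.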

These features are strongest on web and exploit-style findings because many of them look for:

  • reflected payloads
  • status codes
  • stack traces
  • SQL errors
  • request / response evidence density

On npm supply-chain findings, many of those fields are sparse or zero. That is not a bug in the extractor; it is a real domain-transfer limitation and exactly the kind of thing a paper should report honestly.
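One way to quantify that sparsity when reporting results is to measure the share of non-zero entries per feature group over a batch of vectors. The sketch below assumes the index ranges from the group table; `FEATURE_GROUPS` and `nonZeroRateByGroup` are hypothetical names, not exports of the module.

```typescript
// Inclusive index ranges, taken from the group table in this document.
const FEATURE_GROUPS: Record<string, [number, number]> = {
  response: [0, 12],
  request: [13, 22],
  metadata: [23, 30],
  textQuality: [31, 40],
  crossField: [41, 44],
};

// Fraction of non-zero entries per group, pooled over a batch of vectors.
function nonZeroRateByGroup(vectors: number[][]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const [group, [start, end]] of Object.entries(FEATURE_GROUPS)) {
    let nonZero = 0;
    let total = 0;
    for (const v of vectors) {
      for (let i = start; i <= end; i++) {
        if ((v[i] ?? 0) !== 0) nonZero++;
        total++;
      }
    }
    rates[group] = total === 0 ? 0 : nonZero / total;
  }
  return rates;
}

// Toy vector: only an HTTP status (index 0) and a severity ordinal (index 23) are set.
const sparseVector = new Array<number>(45).fill(0);
sparseVector[0] = 200;
sparseVector[23] = 3;
const sparsityRates = nonZeroRateByGroup([sparseVector]);
console.log(sparsityRates);
```

Comparing these rates between a web-exploit corpus and an npm supply-chain corpus would show which groups carry signal in each domain.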

Key properties of the extractor:

  • deterministic
  • source available
  • pure local computation
  • no model calls
  • no network dependence

That makes the feature vector suitable for ablations, baselines, and air-gapped experimentation.
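A simple ablation, for instance, zeroes out one feature family and re-runs the downstream classifier on the masked vector. The sketch below selects features by name prefix (`resp_`, `req_`, `meta_`, `text_`, `cross_`, as seen in the feature tables); `ablateByPrefix` is a hypothetical helper, and the short arrays stand in for `FEATURE_NAMES` and a real feature vector.

```typescript
// Hypothetical ablation helper: return a copy of the vector with every feature
// whose name starts with one of the given prefixes set to 0.
function ablateByPrefix(
  names: readonly string[],
  values: readonly number[],
  prefixes: readonly string[],
): number[] {
  return values.map((value, i) => {
    const name = names[i] ?? "";
    return prefixes.some((p) => name.startsWith(p)) ? 0 : value;
  });
}

// Stand-ins for FEATURE_NAMES and an extracted vector.
const ablationNames = ["resp_http_status", "req_sql_syntax", "meta_has_cve"];
const ablationValues = [200, 1, 1];

// Drop all response-derived signal, keep everything else.
const withoutResponse = ablateByPrefix(ablationNames, ablationValues, ["resp_"]);
console.log(withoutResponse);
```

Because the helper returns a new array and leaves its inputs untouched, the same extracted vectors can be reused across several ablation runs.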