Enrichment Deep Dive
This document explains VulnParse-Pin enrichment sources, runtime flow, policy controls, and how to interpret enrichment-derived output fields.
Purpose
Enrichment adds threat and context signals to scanner findings so downstream prioritization can distinguish:
- baseline scanner context,
- known exploitation and prevalence signals,
- exploitability indicators,
- confidence provenance.
Enrichment does not replace scanner truth. It augments findings with external intelligence and records source evidence.
Enrichment Sources
VulnParse-Pin enrichment currently integrates the following source families:
- CISA KEV
- EPSS
- Exploit-DB
- NVD
- GHSA
CISA KEV
What it provides:
- Presence/absence of CVE in Known Exploited Vulnerabilities catalog.
How it is represented:
- finding.cisa_kev boolean.
- Confidence/source contribution includes kev when matched.
Implementation details:
- Loaded through feed-cache workflow with TTL and checksum sidecars.
- Supports online and offline file import modes.
Implications:
- KEV hit is a strong prioritization signal and can influence exploitability interpretation.
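As an illustration of the presence/absence lookup, here is a minimal sketch that builds a CVE set from an offline KEV catalog file. It assumes the catalog is the JSON form with a top-level vulnerabilities list whose entries carry a cveID key; load_kev_ids is a hypothetical helper name, not the project's loader, and cache TTL and checksum handling are left out.

```python
import json


def load_kev_ids(path):
    """Return the set of CVE IDs in a KEV catalog JSON file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        catalog = json.load(handle)
    # Assumed layout: {"vulnerabilities": [{"cveID": "CVE-..."}, ...]}
    return {
        entry["cveID"].strip().upper()
        for entry in catalog.get("vulnerabilities", [])
        if isinstance(entry, dict) and isinstance(entry.get("cveID"), str)
    }
```

A finding's KEV flag then reduces to a set membership test, for example any(cve in kev_ids for cve in finding_cves).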
EPSS
What it provides:
- Probabilistic exploitation likelihood score per CVE.
How it is represented:
- finding.epss_score float.
- Confidence/source contribution includes epss when populated.
Implementation details:
- Supports streamed online .csv.gz and offline .csv/.csv.gz import.
- Normalized via scoring policy in pass phases.
Implications:
- EPSS helps rank findings with similar base severity.
- Absence of EPSS score is explicitly tracked as a miss during enrichment telemetry.
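The offline import path can be sketched as follows. This is an illustrative reader, not VulnParse-Pin's loader: load_epss_scores is a hypothetical name, and the layout is assumed to be the public EPSS CSV shape (optional leading comment lines starting with #, then a header row with cve and epss columns).

```python
import csv
import gzip
from pathlib import Path


def load_epss_scores(path):
    """Return {CVE ID: EPSS score} from a .csv or .csv.gz file (hypothetical helper)."""
    path = Path(path)
    # Pick a text-mode opener based on the file extension.
    opener = gzip.open if path.suffix == ".gz" else open
    scores = {}
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # Drop comment lines so DictReader sees the header row first.
        rows = (line for line in handle if not line.startswith("#"))
        for row in csv.DictReader(rows):
            try:
                scores[row["cve"].strip().upper()] = float(row["epss"])
            except (KeyError, ValueError, TypeError, AttributeError):
                continue  # Skip malformed rows instead of failing the whole import.
    return scores
```

With this shape, scores.get(cve) returning None is the condition a caller would count as an EPSS miss.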
Exploit-DB
What it provides:
- Known exploit references and exploit availability evidence.
How it is represented:
- finding.exploit_references list.
- finding.exploit_available boolean.
- Confidence/source contribution includes exploitdb when exploit signal is present.
Implementation details:
- Can run from online import or offline local dataset.
- Feed cache and integrity controls apply to source ingestion.
Implications:
- Presence of exploit references is a high-value operational signal.
NVD
What it provides:
- CVSS vectors/scores and fallback metadata for CVE-driven normalization.
How it is represented:
- finding.cvss_vector
- finding.cvss_score
- Confidence/source contribution includes nvd when lookup produces usable data.
Implementation details:
- Uses NVD cache policy with yearly/modified feed control.
- Supports targeted CVE loading and SQLite-backed optimization.
Implications:
- NVD helps normalize inconsistent scanner-provided vector data.
- Sentinel vector markers are used when no authoritative vector is available.
GHSA
What it provides:
- GitHub advisory matches by CVE and package token.
- Advisory severity and references.
- Optional exploit-signal inference from high/critical advisories.
How it is represented:
- finding.enrichment_sources includes ghsa when matched.
- finding.references may be augmented with GHSA advisory links.
- finding.confidence_evidence can include:
  - ghsa
  - ghsa_bonus
  - ghsa_exploit_signal
- finding.exploit_available may be promoted when configured high-severity signal mode is enabled.
Implementation details:
- CLI activation is required (--ghsa); config does not auto-enable.
- Online mode supports CVE prefetch budget (--ghsa-budget or config fallback).
- Token auth uses configured env var (enrichment.ghsa_token_env), defaulting to VP_GHSA_TK, with GITHUB_TOKEN fallback.
- Offline mode supports advisory database file or repo-directory ingest.
- SQLite warm-cache accelerates repeated offline loads for target CVEs.
GHSA auth hardening:
- ghsa_token_env is validated as an env-var name (^[A-Za-z_][A-Za-z0-9_]{0,127}$).
- Token values are rejected if they contain header-unsafe control characters (\r or \n).
- Token values must match known GitHub token prefixes:
  - ghp_
  - gho_
  - ghu_
  - ghs_
  - ghr_
  - github_pat_
- Invalid custom env names or rejected token values fall back to the next candidate (GITHUB_TOKEN), and secrets are never logged.
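The hardening rules above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: resolve_ghsa_token and its parameters are hypothetical names, and only the env-name pattern, the control-character check, the prefix list, and the fallback order come from this document.

```python
import os
import re

# Pattern and prefixes as listed in this document.
ENV_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,127}$")
TOKEN_PREFIXES = ("ghp_", "gho_", "ghu_", "ghs_", "ghr_", "github_pat_")


def resolve_ghsa_token(configured_env="VP_GHSA_TK", environ=None):
    """Return the first acceptable token, or None (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    candidates = []
    # fullmatch avoids "$" accepting a name that ends with a newline.
    if isinstance(configured_env, str) and ENV_NAME_RE.fullmatch(configured_env):
        candidates.append(configured_env)
    if "GITHUB_TOKEN" not in candidates:
        candidates.append("GITHUB_TOKEN")
    for name in candidates:
        token = environ.get(name, "")
        if not token:
            continue
        if "\r" in token or "\n" in token:
            continue  # Header-unsafe value: move on to the next candidate.
        if not token.startswith(TOKEN_PREFIXES):
            continue  # Unknown prefix: move on to the next candidate.
        return token
    return None
```

The function never prints or logs the token value, which mirrors the requirement that secrets stay out of logs.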
Implications:
- GHSA expands context where CVE metadata alone is sparse.
- Signal bonus and exploit promotion behavior are policy-controlled and auditable.
Runtime Flow
At a high level, enrichment runs before pass phases and updates each finding in-place:
- Normalize confidence policy.
- Build source lookups (KEV/EPSS/NVD/GHSA indexes).
- For each finding, evaluate CVEs and enrichment matches.
- Merge source-derived fields (scores, KEV flags, references, vectors).
- Compute source-attributed confidence and evidence map.
- Emit summary telemetry and miss logs.
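The lookup, merge, and miss-tracking steps might look like the sketch below. It is a simplified illustration covering only KEV and EPSS: the Finding dataclass, the enrich_findings name, and the choice to keep the highest EPSS score across a finding's CVEs are all assumptions, not the project's actual types or policy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    """Hypothetical, trimmed-down finding used only for this sketch."""
    cves: list
    cisa_kev: bool = False
    epss_score: Optional[float] = None
    enrichment_sources: list = field(default_factory=list)


def enrich_findings(findings, kev_ids, epss_scores):
    """Update findings in place and return per-source miss counts."""
    misses = {"kev": 0, "epss": 0}
    for finding in findings:
        # KEV: a single matching CVE flags the finding.
        if any(cve in kev_ids for cve in finding.cves):
            finding.cisa_kev = True
            finding.enrichment_sources.append("kev")
        else:
            misses["kev"] += 1
        # EPSS: keep the highest score among matched CVEs (assumed policy).
        scores = [epss_scores[cve] for cve in finding.cves if cve in epss_scores]
        if scores:
            finding.epss_score = max(scores)
            finding.enrichment_sources.append("epss")
        else:
            misses["epss"] += 1
    return misses
```

Building the lookups once and reusing them per finding keeps the per-finding work to dictionary and set membership checks.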
Confidence and Source Evidence
Confidence is derived from:
- baseline scanner signal,
- per-source weights,
- GHSA advisory count bonus (bounded),
- optional GHSA exploit-signal bonus.
Output fields:
- finding.confidence integer score.
- finding.confidence_evidence map containing component contributions and final cap.
Interpretation guidance:
- Treat confidence as provenance strength, not direct risk severity.
- Compare confidence with risk score and triage priority for final decisions.
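One way to picture the model is an additive score with a bounded GHSA bonus and a final cap. The sketch below is illustrative only: every number in DEFAULT_POLICY, the policy key names, and the compute_confidence function are assumptions made for the example, not values or names taken from the project's config.

```python
# All weights and bounds here are made-up illustration values.
DEFAULT_POLICY = {
    "baseline": 20,
    "weights": {"kev": 30, "epss": 15, "exploitdb": 20, "nvd": 10, "ghsa": 10},
    "ghsa_bonus_per_advisory": 2,
    "ghsa_bonus_max": 6,
    "ghsa_exploit_signal_bonus": 5,
    "cap": 100,
}


def compute_confidence(sources, ghsa_advisory_count=0, ghsa_exploit_signal=False, policy=None):
    """Return (confidence, evidence) for a finding (hypothetical helper)."""
    policy = policy or DEFAULT_POLICY
    evidence = {"baseline": policy["baseline"]}
    # Count each matched source once, in first-seen order.
    for source in dict.fromkeys(sources):
        weight = policy["weights"].get(source, 0)
        if weight:
            evidence[source] = weight
    if "ghsa" in sources:
        # The advisory-count bonus is bounded by ghsa_bonus_max.
        bonus = min(ghsa_advisory_count * policy["ghsa_bonus_per_advisory"],
                    policy["ghsa_bonus_max"])
        if bonus:
            evidence["ghsa_bonus"] = bonus
        if ghsa_exploit_signal:
            evidence["ghsa_exploit_signal"] = policy["ghsa_exploit_signal_bonus"]
    # Sum contributions before recording the cap in the evidence map.
    confidence = min(sum(evidence.values()), policy["cap"])
    evidence["cap"] = policy["cap"]
    return confidence, evidence
```

Keeping each contribution as its own key in the evidence map is what makes the final integer auditable: a reviewer can see which sources raised the score and whether the cap applied.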
Offline vs Online Modes
General behavior:
- Online mode favors freshness with network dependency.
- Offline mode favors deterministic reproducibility.
Operational guidance:
- Use offline mode for controlled, repeatable CI/CD and regulated workflows.
- Use online mode for near-real-time feed freshness.
Caching and Integrity
Enrichment source ingestion uses cache and integrity controls across loaders.
Common controls include:
- TTL-based freshness,
- content checksum sidecars,
- optional HMAC-integrity modes via environment variables,
- bounded response-size protections for remote fetches.
For deep implementation details, see:
- Caching Deep Dive
- Runtime Policy Deep Dive
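As a rough picture of how TTL freshness and a checksum sidecar combine, consider the sketch below. It assumes a SHA-256 hex digest stored next to the data file with a .sha256 suffix; the sidecar naming, the is_cache_usable name, and the use of file mtime for age are all assumptions for illustration, and the HMAC and response-size controls are not shown.

```python
import hashlib
import time
from pathlib import Path


def is_cache_usable(data_path, ttl_seconds, now=None):
    """Return True if the cached file is fresh and matches its sidecar digest."""
    data_path = Path(data_path)
    # Assumed sidecar naming: "<file>.sha256" beside the cached file.
    sidecar = data_path.with_name(data_path.name + ".sha256")
    if not data_path.is_file() or not sidecar.is_file():
        return False
    now = time.time() if now is None else now
    # TTL check based on the file's modification time.
    if now - data_path.stat().st_mtime > ttl_seconds:
        return False
    parts = sidecar.read_text(encoding="utf-8").split()
    if not parts:
        return False
    expected = parts[0].lower()
    digest = hashlib.sha256()
    with data_path.open("rb") as handle:
        # Hash in chunks so large feeds are not read into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

A loader built this way would refetch or refuse the file whenever the check returns False, rather than silently using stale or altered data.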
Field Interpretation Reference
Common output fields affected by enrichment:
- cisa_kev: KEV membership signal.
- epss_score: exploitation probability signal.
- cvss_vector and cvss_score: normalized vector/score path.
- exploit_available: exploitability indicator from exploit refs, KEV implication, or configured GHSA signal path.
- enrichment_sources: ordered source provenance list.
- confidence and confidence_evidence: source-attributed confidence model outputs.
Technical Caveats
- Source availability varies by network mode, cache freshness, and dataset completeness.
- Missing source data is expected in some environments and is logged as miss telemetry.
- Confidence contribution from sources is configurable; changing policy can alter downstream ranking behavior.
- GHSA high-severity exploit-signal promotion is opt-in by policy.
Implementation Pointers
Core implementation locations:
- src/vulnparse_pin/app/enrichment.py
- src/vulnparse_pin/app/enrichment_source_loader.py
- src/vulnparse_pin/utils/enricher.py
- src/vulnparse_pin/utils/ghsa_enrichment.py
- src/vulnparse_pin/resources/config.yaml
- src/vulnparse_pin/core/schemas/config.schema.json
Key test coverage locations:
- tests/test_ghsa_enrichment.py
- tests/test_enrichment_confidence_policy.py
- tests/test_config_schema_validation.py
- tests/test_cli_flags_matrix.py
Recommended Validation Checklist
- Validate config schema and policy bounds.
- Run focused source tests after enrichment changes.
- Run full regression suite before release.
- Inspect output artifact samples for source attribution and confidence evidence.
- Confirm run manifests reflect selected mode and source behavior.