Vulnerability risk score · Method · Tier B

One number, honestly.

Every patch team needs one number to rank a backlog of CVEs, and collapsing the three public signals — EPSS, CVSS, and SSVC — into that number is easy to do badly and hard to do honestly. This is a way to do it that a non-specialist can read and trust: a likelihood times an impact, with every coefficient traced to either a measurement or an explicitly-stated value judgement, and an account of where the naive combinations quietly reorder the queue the wrong way.

TL;DR

The score in one box

Likelihood  L = EPSS_percentile × (0.74 + 0.26 × Automatable)
Impact      I = 0.314 × TechnicalImpact + 0.686 × AssetWeight
Score       = 100 × (L × I)
then        if (KEV or EPSS ≥ 0.50) and band < High → floor at High

Measured inputs: EPSS, the Automatable signal, and technical impact from the CVSS vector.
Chosen inputs: the SSVC factor weights, the asset tiers, the band cut-offs, and the override.
What it is: decision support that maps to a patch deadline — not a probability of loss.
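
Read as code, the box is only a few lines. What follows is a minimal sketch rather than the project's own implementation: the function names are hypothetical, the EPSS percentile is assumed to arrive precomputed on a 0 to 1 scale, and the standard-tier asset weight of 0.40 is an assumption back-solved from the worked examples further down, not a setting quoted anywhere.

```python
def likelihood(epss_percentile: float, automatable: bool) -> float:
    """L = EPSS_percentile x (0.74 + 0.26 x Automatable)."""
    return epss_percentile * (0.74 + 0.26 * (1.0 if automatable else 0.0))


def impact(technical_impact: float, asset_weight: float) -> float:
    """I = 0.314 x TechnicalImpact + 0.686 x AssetWeight, inputs in [0, 1]."""
    return 0.314 * technical_impact + 0.686 * asset_weight


def risk_score(epss_percentile, automatable, technical_impact, asset_weight):
    """Score = 100 x (L x I), before any band override is applied."""
    return 100.0 * likelihood(epss_percentile, automatable) * impact(
        technical_impact, asset_weight
    )


# Assumption: inferred from the worked examples, not a documented constant.
STANDARD_ASSET_WEIGHT = 0.40

bluekeep = risk_score(1.0, True, 1.0, STANDARD_ASSET_WEIGHT)
struts = risk_score(1.0, False, 1.0, STANDARD_ASSET_WEIGHT)
print(round(bluekeep, 1), round(struts, 1))
```

Under those assumptions the two calls land on the BlueKeep and Struts scores shown in the worked-examples table.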

The problem

Severity barely predicts whether anyone will bother.

A security team does not get to patch everything and does not get to patch nothing; it gets a queue. More than a quarter-million CVEs (Common Vulnerabilities and Exposures) now exist, the joined dataset behind this analysis carries 341,899 of them with a CVSS vector, and any given environment lights up with hundreds at once, so the question is never "is this bad?" but "is this worse than the other forty things open this week?" That is a ranking problem, and a ranking problem wants one number.

The obvious number is the wrong one. CVSS (the Common Vulnerability Scoring System) gives every CVE a 0–10 severity score, and teams have sorted by it for twenty years because it was there, but CVSS severity describes how bad a flaw would be if exploited, not whether anyone will exploit it. Rank CVEs by raw CVSS base score and ask how well that ordering surfaces the bugs that actually get exploited in the wild — using CISA's KEV list (Known Exploited Vulnerabilities) as the test set — and the base score lands at an AUC of 0.77. AUC, the area under the ROC curve, is the probability that a score ranks a truly-exploited bug above a random un-exploited one, so 0.5 is a coin flip and 1.0 is perfect. An empirical exploitation model does much better: EPSS, FIRST's Exploit Prediction Scoring System, scores 0.94 on the same hold-out (a number to read with the endogeneity caveat below, since KEV and EPSS share inputs). Severity on its own is a weak predictor, and the data is blunt about why: the CVSS base score explains only about 3% of the variance in EPSS.
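
The pairwise reading of AUC can be computed directly, which makes the definition concrete. This is a sketch on invented toy data, not the evaluation code behind the figures quoted here; `pairwise_auc` is a hypothetical name, and the double loop is only sensible for small samples.

```python
from itertools import product


def pairwise_auc(scores, exploited):
    """Fraction of (exploited, un-exploited) pairs the score orders correctly.

    Ties count as half a win, matching the usual ROC AUC definition.
    """
    pos = [s for s, y in zip(scores, exploited) if y]
    neg = [s for s, y in zip(scores, exploited) if not y]
    if not pos or not neg:
        raise ValueError("need at least one exploited and one un-exploited item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1]
toy_exploited = [True, False, True, False, False]
toy_auc = pairwise_auc(toy_scores, toy_exploited)
print(toy_auc)
```

Five of the six exploited-versus-not pairs are ordered correctly in the toy data, so the function returns 5/6.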

So the modern instinct is right — lead with EPSS, a model trained to estimate the probability that a CVE will be exploited in the next thirty days. But "just use EPSS" has its own hole, and EPSS's own authors name it: the model "measures threat only," and is explicitly "not a complete picture of risk." It says nothing about how much damage a flaw would do, or whether the affected system is a crown-jewel database or a disposable sandbox, so a queue sorted purely by exploitation probability floats a noisy, low-consequence bug above a quiet flaw that would hand an attacker a regulated system. EPSS was never asked about consequence.

The shape of an honest score

Likelihood times impact, because both have to be true.

The spine is deliberately boring: Risk = Likelihood × Impact, multiplied out to a 0–100 number. The multiplication is the load-carrying choice. A product is an AND-gate — if either side is near zero the result is near zero — so a CVE has to be both plausibly exploitable and consequential on a system you care about to score high. A flaw with no realistic path to exploitation should not consume an emergency change window however severe it reads, and a flaw on a worthless asset should not either however exploitable. I tested an additive form (a weighted sum) and rejected it because it fails this test: addition lets a high impact rescue a near-zero likelihood, so a severe-but-dormant bug climbs the queue anyway — it reorders the queue heavily (Kendall τ ≈ 0.46 against the product) while scoring worse against KEV (AUC 0.910 vs 0.959).
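
A two-CVE toy case shows the AND-gate at work. The numbers are invented, and the equal 0.5 / 0.5 weighting in the additive form is a hypothetical stand-in, since the exact additive weights I tested are not restated here.

```python
def product_score(likelihood, impact):
    """AND-gate: near zero if either side is near zero."""
    return 100.0 * likelihood * impact


def additive_score(likelihood, impact, w_likelihood=0.5):
    """Weighted sum; the 0.5 default is a placeholder, not a fitted weight."""
    return 100.0 * (w_likelihood * likelihood + (1.0 - w_likelihood) * impact)


# (likelihood, impact) pairs, invented for illustration.
severe_but_dormant = (0.02, 1.00)
active_but_moderate = (0.60, 0.35)

for name, form in [("product", product_score), ("additive", additive_score)]:
    print(name, round(form(*severe_but_dormant), 1), round(form(*active_but_moderate), 1))
```

The product ranks the active bug first; the sum lets the dormant bug's impact carry it past.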

The four governance factors come from SSVC, CISA's Stakeholder-Specific Vulnerability Categorization, a decision tree that classifies a vulnerability by four inputs. The structural result that keeps the score legible is that those four split cleanly across the two halves of the product:

SSVC factor	Weight	Side	Instrument
Exploitation	0.366	Likelihood	EPSS (continuous, empirical)
Automatable	0.128	Likelihood	SSVC Automatable (0/1 amplifier)
Mission & Well-being	0.347	Impact	Asset weight (per deployment)
Technical Impact	0.158	Impact	CVSS impact sub-metrics

Those weights (0.37 / 0.13 / 0.35 / 0.16 in table order, renormalised to sum to 1.0) are a governance choice, not a measurement: a reverse-engineered fit of the SSVC tree, explained two sections down. The two likelihood factors sum to 0.495 and the two impact factors to 0.505, so the product treats the halves almost even-handedly. Inside the likelihood half I use the EPSS percentile rather than the raw probability, because EPSS is extremely right-skewed (a median near 0.0035 in the CVSS-v4 subset measured here, with under a quarter-percent of CVEs above 0.50), so the raw number crushes everything into a thin band near zero while the percentile spreads the field out for ranking.
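
Two small calculations sit behind that paragraph: rescaling the table weights inside each half, and turning raw EPSS into a percentile. The sketch below assumes the percentile means "share of the scored population at or below this value", which is one common definition and may not match the exact pipeline; the helper names and the toy EPSS values are hypothetical. From the rounded table weights the impact split comes out near 0.313 / 0.687, which agrees with the 0.314 / 0.686 in the box to within rounding.

```python
from bisect import bisect_right

SSVC_WEIGHTS = {  # values copied from the factor table
    "exploitation": 0.366,
    "automatable": 0.128,
    "mission": 0.347,
    "technical_impact": 0.158,
}


def within_half(weights, names):
    """Rescale the named weights so they sum to 1 inside their half."""
    total = sum(weights[n] for n in names)
    return {n: weights[n] / total for n in names}


def percentile_ranks(values):
    """Share of the population at or below each value (0 < rank <= 1)."""
    ordered = sorted(values)
    return [bisect_right(ordered, v) / len(ordered) for v in values]


likelihood_half = within_half(SSVC_WEIGHTS, ["exploitation", "automatable"])
impact_half = within_half(SSVC_WEIGHTS, ["technical_impact", "mission"])
print({k: round(v, 3) for k, v in likelihood_half.items()})
print({k: round(v, 3) for k, v in impact_half.items()})

# Invented right-skewed values: raw numbers bunch near zero, ranks spread out.
toy_epss = [0.001, 0.002, 0.0035, 0.004, 0.01, 0.95]
toy_ranks = percentile_ranks(toy_epss)
print(toy_ranks)
```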

The endogeneity trap

You are scoring with signals that already feed each other.

This is the idea that makes the score more than another formula, and the one most spreadsheet versions get wrong without noticing. EPSS is not an independent opinion you can check against your other signals; it is a model whose inputs include the very signals you would reach for to validate it. FIRST's documentation lists, among EPSS features, "CVSS metrics (base vector from CVSS 3.x, via NVD)," and the v4 model additionally pulls cve.org CNA/ADP enrichment along with exploit-code presence and public mentions. So when an EPSS-based score appears to "predict" which CVEs land on KEV, part of what you see is the model agreeing with its own wiring. KEV membership and "EPSS ≥ 0.5" are not neutral referees; they share information with the score, which means every empirical claim here tests ranking sanity, not ground truth. There is no clean external "risk" label to validate against, because no one observes counterfactual breaches.

The trap gets sharper with SSVC. One of its four factors, Exploitation (none / PoC / active), is by definition derived from KEV membership and public-exploit presence, the same evidence EPSS already consumes. So the natural-looking move of multiplying EPSS by an SSVC-Exploitation term counts the same evidence twice. It looks like you are combining two signals, but you are really squaring one. I built that naive form deliberately and compared it to the single-instrument version that uses EPSS alone for exploitation: the two rankings agree at only Kendall τ = 0.84, and among the most urgent 1% of CVEs the two "top" sets share a Jaccard overlap of just 0.47. For two equal-sized sets that works out to each sharing about 64% of its members with the other, so roughly a third of what the double-counted score calls most urgent is an artifact of scoring the evidence a second time, and that is the part most likely to drive an emergency change window.
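
The top-set comparison is simple to reproduce on any two score columns. The sketch below uses hypothetical helper names and says nothing about the real data. Its last function is plain set arithmetic: for two sets of equal size, a Jaccard overlap J means each set shares a fraction 2J / (1 + J) of its members with the other.

```python
def top_fraction(scores, fraction):
    """IDs of the highest-scoring `fraction` of items (at least one item)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def shared_share(j):
    """For two equal-sized sets, the fraction of each set found in the other."""
    return 2.0 * j / (1.0 + j)


print(round(shared_share(0.47), 2))
```

At J = 0.47 the shared fraction is about 0.64, so a little over a third of each top set is absent from the other.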

Measured versus chosen

The facts and the judgement calls live in separate drawers.

The honesty claim only means something if a reader can act on it, so the design keeps measured and chosen in separate code — empirical.py versus governance.py — and tags every coefficient. If you disagree with a score, the disagreement is almost always with the governance file, by design. On the measured side sit EPSS, the extraction of technical impact from the CVSS vector (the sub-metrics are a standard, though mapping each None/Low/High to 0/0.5/1 is itself a chosen, tunable encoding), and the effect sizes from the prior empirical study. The Automatable signal is the one to dwell on, because the amplifier rests on it. On the sparse CVSS-v4 Automatable field (only 790 CVEs, about 2.86%) it looked like noise and earlier work wrote it off; measured properly on CISA's SSVC Automatable field across 158,294 CVEs it is a stable correlate of EPSS (Cliff's δ = 0.39) that retains independent signal after controlling for both base score and exploitation status, with the partial rank correlation falling from 0.275 to 0.182 to only 0.164 rather than to zero. Every one of those numbers is an association in observational data, not a causal claim.
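
For readers who have not met Cliff's δ: it is the chance that a value from one group exceeds a value from the other, minus the chance of the reverse. The sketch below is the textbook definition on invented EPSS values, not the study's code, and `cliffs_delta` is a hypothetical name.

```python
def cliffs_delta(xs, ys):
    """(#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Invented EPSS values for automatable and non-automatable CVEs.
toy_automatable = [0.02, 0.10, 0.40]
toy_not_automatable = [0.01, 0.05, 0.20]
toy_delta = cliffs_delta(toy_automatable, toy_not_automatable)
print(round(toy_delta, 3))
```

A δ of 0 means the two groups are interchangeable and 1 means every automatable value beats every non-automatable one, so 0.39 describes a clear but far from total shift.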

On the chosen side sit the SSVC weights, the asset tiers, the band cut-offs, and the override. The weights are easy to mistake for measurements, so they deserve a careful word: SSVC has no weights at all, it is a decision tree. I obtained them by reverse-engineering — a Monte-Carlo search (a million random weightings plus an exhaustive grid) over the 36 combinations of the CISA Coordinator tree, asking which single weighted sum best reproduces the tree's own recommended actions. A weighted sum reproduces 89% of the tree under a linear encoding and 94% under an EPSS-style one, so it comes close but never all the way, and good weightings are rare: across a million random ones, agreement clusters around 47% and fewer than one in a thousand reaches the best score. The tree encodes a real priority ordering, and the weights are a faithful additive shadow of it, but they remain a defensible default to tune rather than a coefficient anyone measured.
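
The shape of that search is easier to see in code than in prose. Everything specific below is an assumption made for illustration: `toy_tree` is an invented stand-in and not the CISA Coordinator tree, the 0 / 0.5 / 1 level encoding and the three cut-offs are placeholders, and the trial count is far below the million used in the real run. Only the structure matches: enumerate the 36 rows, draw random weightings, and keep the one whose thresholded sum agrees with the tree most often.

```python
import itertools
import random

LEVELS = {  # assumed linear encoding of each factor's levels
    "exploitation": [0.0, 0.5, 1.0],   # none / PoC / active
    "automatable": [0.0, 1.0],         # no / yes
    "technical_impact": [0.0, 1.0],    # partial / total
    "mission": [0.0, 0.5, 1.0],        # low / medium / high
}
FACTORS = list(LEVELS)
COMBOS = list(itertools.product(*LEVELS.values()))  # 3 * 2 * 2 * 3 = 36 rows


def toy_tree(combo):
    """Invented stand-in for the decision tree: returns an action index 0..3."""
    exploitation, automatable, technical_impact, mission = combo
    if exploitation == 1.0 and mission >= 0.5:
        return 3  # a non-compensatory jump that no single factor explains
    return min(2, int(exploitation + automatable + technical_impact + mission))


def agreement(weights, cutoffs=(0.25, 0.5, 0.75)):
    """Share of the 36 rows where the thresholded weighted sum matches the tree."""
    hits = 0
    for combo in COMBOS:
        total = sum(w * x for w, x in zip(weights, combo))
        action = sum(total >= c for c in cutoffs)
        hits += action == toy_tree(combo)
    return hits / len(COMBOS)


def random_search(trials=2_000, seed=0):
    """Return (best agreement, weights) over random weightings that sum to 1."""
    rng = random.Random(seed)
    best = (0.0, None)
    for _ in range(trials):
        raw = [rng.random() for _ in FACTORS]
        weights = [r / sum(raw) for r in raw]
        score = agreement(weights)
        if score > best[0]:
            best = (score, weights)
    return best


best_score, best_weights = random_search()
print(len(COMBOS), round(best_score, 3))
```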

Where the naive versions break

Three failure modes, and one rule no formula can replace.

The first failure is the double-count already covered. The second was the single most useful correction in the whole exercise. The literal reading of the weights says the likelihood half should be a weighted sum, L = 0.74·EPSS + 0.26·Automatable, which looks reasonable until you notice what it does to the roughly 77% of CVEs that are not flagged automatable: for each of them the Automatable term is zero, so the formula replaces 1.0·EPSS with 0.74·EPSS, while every flagged CVE collects a flat 0.26 however low its EPSS. The unflagged majority is discounted against any flagged bug, dormant or not. Since most known-exploited bugs are not flagged automatable, this measurably hurt the ranking, dropping KEV coverage at 5% remediation effort from 0.88 to 0.62. The multiplicative amplifier EPSS·(0.74 + 0.26·Automatable) keeps the exact same 0.26 weight but scales the boost with EPSS, so it only ever raises a wormable bug's score in proportion to its likelihood, never discounting EPSS within a class, and restores coverage at 5% to 0.87. The third failure is the additive composition rule itself, which breaks the "no impact means no risk" intuition, so I use the product.
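
The difference between the two likelihood forms shows up on a single pair of CVEs. The percentiles below are invented and the function names hypothetical; the coefficients are the ones from the formulas above. Under the weighted sum any flagged CVE scores at least 0.26, which is enough to outrank an unflagged CVE with an EPSS percentile up to 0.26 / 0.74, about 0.35.

```python
def additive_likelihood(epss_pct, automatable):
    """Literal weighted sum: L = 0.74 * EPSS + 0.26 * Automatable."""
    return 0.74 * epss_pct + 0.26 * automatable


def amplified_likelihood(epss_pct, automatable):
    """Multiplicative amplifier: L = EPSS * (0.74 + 0.26 * Automatable)."""
    return epss_pct * (0.74 + 0.26 * automatable)


# (EPSS percentile, automatable flag), invented for illustration.
dormant_wormable = (0.05, 1)
likelier_plain = (0.30, 0)

for form in (additive_likelihood, amplified_likelihood):
    print(form.__name__, round(form(*dormant_wormable), 3), round(form(*likelier_plain), 3))

# Unflagged percentile below which the sum ranks it under every flagged CVE.
print(round(0.26 / 0.74, 3))
```

The sum puts the dormant wormable bug first; the amplifier keeps the likelier unflagged one ahead.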

The fourth case is one no smooth formula handles, and the right response is to stop pretending it does. The reverse-engineering surfaced a few tree outcomes that no weighted sum can reproduce, because they are non-compensatory — they flip on a combination of factors rather than on any one factor's contribution. So instead of torturing the weights, I add one explicit rule: if a CVE is in KEV or has EPSS ≥ 0.50 and its computed band came out below High, floor it at High. That guarantees no scoreable, actively-exploited CVE gets a long deadline; it fires for only 0.52% of the scoreable population at the standard tier, so it acts as a safety rail over the smooth score rather than a scoring system of its own.
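
The override is short enough to write out in full. In this sketch only the rule itself comes from the text; the band cut-offs and the "Low" band are placeholders I picked to be consistent with the examples shown on this page, and the real governance file may differ. The EPSS test is read as applying to the raw probability, not the percentile.

```python
BANDS = ["Info", "Low", "Medium", "High", "Critical"]  # "Low" is assumed

# Hypothetical cut-offs, chosen only to agree with the worked examples.
CUTOFFS = [(80.0, "Critical"), (40.0, "High"), (30.0, "Medium"), (15.0, "Low")]


def band_for(score):
    """Map a 0-100 score to a band using the placeholder cut-offs."""
    for floor, name in CUTOFFS:
        if score >= floor:
            return name
    return "Info"


def apply_override(band, in_kev, raw_epss):
    """Floor the band at High when the CVE is in KEV or raw EPSS >= 0.50."""
    if (in_kev or raw_epss >= 0.50) and BANDS.index(band) < BANDS.index("High"):
        return "High"
    return band


# Heartbleed-like case: smooth score 37.9, KEV-listed, so the floor applies.
print(apply_override(band_for(37.9), in_kev=True, raw_epss=1.0))
# Sleeper-like case: low EPSS, not in KEV, so the smooth band stands.
print(apply_override(band_for(34.3), in_kev=False, raw_epss=0.01))
```

The rule only ever raises a band, so a CVE that already scores High or Critical passes through untouched.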

Eight real CVEs make the behaviour concrete, scored at the standard asset tier:

CVE	EPSS	Auto	TI	Score	Band
CVE-2019-0708 BlueKeep	1.00	yes	1.00	58.8	High
CVE-2018-11776 Struts	~1.00	no	1.00	43.5	High
CVE-2014-0160 Heartbleed	1.00	yes	0.33	37.9	High*
CVE-2021-45105 Log4j DoS	1.00	no	0.33	28.1	High*
CVE-2023-25573 mid-EPSS	0.50	yes	0.33	37.4	Medium
CVE-2024-0916 sleeper	0.01	yes	1.00	34.3	Medium
CVE-2022-40292 typical low	0.005	no	0.17	9.4	Info
CVE-2021-37976 Chrome KEV	0.20	no	0.33	27.2	High*

* band floored to High by the KEV / EPSS override.

BlueKeep (automatable, 58.8) outranks Struts (43.5) at identical EPSS and impact — the amplifier doing its one job. Heartbleed and the Log4j denial-of-service both sit at maximum EPSS, yet their partial impact pulls them below the total-impact remote-code-execution bugs, so impact demotes likelihood exactly where a severity-blind, EPSS-only queue would over-rank them. The sleeper, CVE-2024-0916, has a 1% EPSS and would be buried in any pure-likelihood queue, but because it is automatable with total impact it lands at Medium, not Info — the low-probability, high-consequence case triage most often misses. And the same BlueKeep moves from 41.7 (High) on a low-value asset to 100.0 (Critical) on a crown-jewel: the asset weight setting the level while likelihood sets the order.
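
The asset-tier sweep for BlueKeep can be checked by hand. The three asset weights below are not quoted anywhere on this page; they are back-solved from the 41.7, 58.8 and 100.0 scores above and should be treated as an assumption, as should the tier names used as dictionary keys.

```python
# Assumption: weights inferred from the BlueKeep scores quoted in the text.
ASSET_WEIGHTS = {"low_value": 0.15, "standard": 0.40, "crown_jewel": 1.00}


def risk_score(epss_percentile, automatable, technical_impact, asset_weight):
    """100 * L * I with the coefficients from the score box."""
    likelihood = epss_percentile * (0.74 + 0.26 * (1.0 if automatable else 0.0))
    impact = 0.314 * technical_impact + 0.686 * asset_weight
    return 100.0 * likelihood * impact


# BlueKeep: top EPSS percentile, automatable, total technical impact.
bluekeep_by_tier = {
    tier: round(risk_score(1.0, True, 1.0, weight), 1)
    for tier, weight in ASSET_WEIGHTS.items()
}
print(bluekeep_by_tier)
```

Likelihood is identical in all three rows, so the whole spread comes from the asset weight.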

Does it actually help?

It cannot out-predict EPSS on KEV — and that is the wrong test.

This is where it is easiest to lie, so it is worth being deflating. Ranked against the KEV hold-out at the standard tier:

Model	AUC (KEV)	cov@1%	cov@5%
Recommended (EPSS-pct × I + amplifier)	0.959	0.278	0.781
Pedestrian (EPSS-pct × I, no amplifier)	0.964	0.477	0.749
Raw EPSS	0.945	0.474	0.752
Raw CVSS base score	0.772	0.045	0.225

The composite (0.959) sits a hair below the simpler pedestrian form (0.964) and barely above raw EPSS (0.945), and that ordering is not a defeat but exactly what endogeneity predicts: EPSS already encodes KEV-like evidence, so KEV detection is near the ceiling for any EPSS-based score. CVSS base score trails badly at 0.772, the same point this opened on. The composite's value was never going to show up as a higher KEV AUC, because KEV cannot reward the two things it adds — an impact dimension and an asset dimension — and it actively penalises the behaviours that make the score useful for triage, like demoting a high-EPSS, low-impact Heartbleed or promoting a low-EPSS, high-impact sleeper. That trade is correct for patch triage and invisible to a KEV benchmark.
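
The cov@k% columns can be read as "patch the top k% of the ranked list and count what share of the known-exploited CVEs that catches". The sketch below implements that reading on invented data; the function name and the rounding of the cut-off are my assumptions, and the benchmark code may define the cut differently.

```python
def coverage_at(scores, exploited, effort):
    """Share of exploited items inside the top `effort` fraction by score."""
    total = sum(exploited)
    if total == 0:
        raise ValueError("no exploited items to cover")
    n = len(scores)
    k = max(1, int(round(n * effort)))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    caught = sum(exploited[i] for i in order[:k])
    return caught / total


# Twenty invented CVEs, already in descending score order; three are exploited.
toy_scores = [float(s) for s in range(20, 0, -1)]
toy_exploited = [i in (0, 1, 10) for i in range(20)]
toy_cov = coverage_at(toy_scores, toy_exploited, effort=0.10)
print(round(toy_cov, 3))
```

Patching the top two of twenty catches two of the three exploited items, so coverage at 10% effort is 2/3.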

The caveat to say loudest is redundancy. Correlated against its own inputs at a fixed asset tier, the score tracks EPSS at Spearman 0.96, CVSS base at 0.51, technical impact at 0.40, and Automatable at 0.32, so within a single asset tier the score is essentially EPSS in a hat, and it is not, and does not claim to be, a better likelihood model than EPSS. Its differentiation lives entirely across asset tiers. The reassuring part is stability: going to equal weights (throwing the SSVC weights away) still gives Kendall τ = 0.909 against the recommended ranking, which means roughly 95% of pairs keep their order; nudging any single weight by ±20% changes that figure by under 2%; and swapping every asset tier holds τ near 0.93. A score that flipped wildly when you nudged a weight you admit you guessed would not be worth defending; this one does not.
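
Kendall τ is easy to misread as a percentage of preserved pairs. It is the share of concordant pairs minus the share of discordant ones, so with no ties the concordant share is (1 + τ) / 2. The sketch below is the plain tau-a definition with hypothetical function names, not the library routine used for the figures here.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Tau-a: (concordant - discordant) / pairs; tied pairs count as neither."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length rankings of at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


def concordant_share(tau):
    """Share of pairs ordered the same way, assuming no ties."""
    return (1.0 + tau) / 2.0


print(round(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]), 3))
print(round(concordant_share(0.909), 2))
```

One swapped pair out of six gives τ of about 0.667, and τ = 0.909 corresponds to about 95% of pairs in the same order.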

What it is, and what it is not

Decision support, stated with its limits.

It is decision support, not a probability of loss. Half of it — the entire impact side — is a value model you chose, so the 0–100 number must never be read as a "chance of breach." Endogeneity runs through every empirical claim: EPSS consumes CVSS base vectors and exploit evidence as features, so "the score predicts KEV" is partly the model agreeing with its own inputs, which is why SSVC Exploitation is excluded and why I never claim the score independently confirms or causes exploitation. The validation labels themselves are endogenous — KEV and "EPSS ≥ 0.5" both share information with the score — so they test ranking sanity, not truth, and there is no ground-truth "risk" to validate against.

The coverage limits are real: only 52.9% of CVEs are scoreable, because the rest lack both a CVSS vector and an SSVC Technical-Impact value; the SSVC fields exist for about 46% of CVEs, assigned by CISA's ADP and skewed toward recent, CISA-prioritised vulnerabilities, so the effect sizes should not be generalised past that population; the EPSS percentile is population-relative and re-ranks if the scored set changes; and the weights are a best additive approximation of the SSVC tree, every effect size an association rather than a cause. The right disposition is to treat the score as a consistent, explicit, tunable way to order a backlog, and to keep the measurements and the judgement calls in separate drawers — so that when someone asks "why is this one first?" the answer is always either a number you can show them or a choice you can own.

The honest version is a ranking aid, not a forecast.

The method keeps the measured signals and the chosen value judgements in separate layers and tags every coefficient, so a reader can see which numbers are facts and which are policy. The research page holds the other anchor hypotheses and the contradictions log; the thesis page connects them to the program POV.
