Security Data Works

Field-shaped case study

Migrating 800 detection rules across seven parallel ingest buses.

The question that kills more security-lakehouse projects than any benchmark or schema debate is this: how do I move 800 detection rules off Splunk SPL without missing a threat during the migration? This is the playbook I've watched work in production: tiered triage, LLM-assisted translation, 90-day regression testing, and a phased cutover that lets each business unit move at its own risk tolerance.

Reading time: roughly 19 minutes. Evidence tier: B (anonymized composite of one large federated migration plus my consulting work across twelve SOC teams totaling around 9,600 rules), with one Tier C note where I cite LLM-translation accuracy ranges and one Tier D forward-looking remark on the parallel-run cost calculus.

Why this is the blocker

Migration risk, not technology choice, is what kills these projects.

When I sit down with a security architect who has already chosen Iceberg, picked a query engine, and modeled out the cost savings, the conversation rarely ends with a technology argument. It ends with a question that sounds like this: "We have 800 detection rules in Splunk SPL. How do I move them to SQL without missing ransomware during the cutover?"

That fear is rational. A missed detection during migration carries career risk for the CISO, regulatory risk if the missing control was documented as a compliance boundary, operational risk because the SOC cannot go dark for a quarter, and a category I think of as unknown-unknowns: the possibility that the legacy SIEM was catching something nobody on the current team remembers configuring.

The result is paralysis, and it usually arrives as some version of "the modern data stack sounds great, but the migration risk is too high, so we'll stay on Splunk." I have heard that sentence often enough to treat it as the central problem the migration playbook needs to solve, which is not the technology choice but the migration mechanics that let a security leader sign off on the cutover without betting the SOC.

Translation is not the hard part

A single SPL rule is rarely a single SQL statement.

Take a typical brute-force authentication rule. In Splunk SPL it might look something like this:

index=windows sourcetype=WinEventLog:Security EventCode=4625
| stats count by src_ip, user
| where count > 10
| lookup threat_intel_ip src_ip OUTPUT threat_score
| where threat_score > 70
| eval severity=case(count>50, "critical", count>20, "high", 1=1, "medium")
| outputlookup failed_auth_summary

That one rule does seven things: search, aggregate, threshold filter, threat-intel enrichment, second threshold filter, dynamic severity scoring, and a write to a summary lookup. The SQL equivalent is rarely one statement and usually runs to three, a main aggregation query, a join against the threat-intel table, and an INSERT into the summary table, because there's no native lookup command, no outputlookup, and string-matching semantics differ in ways that matter for false-positive rates.

Multiply that translation work by 800 and the engineering estimate that comes back is usually 400 to 600 hours of manual translation. That estimate is the moment I see projects fold. "We can't do this in our timeline. We'll wait."

The estimate is correct, but the premise underneath it is wrong, because you should not be trying to translate all 800 in the first place.

Triage before translation

Not every rule deserves to live.

Across the SOC teams I've audited, the alert distribution is consistent enough that I now lead with it as a planning assumption. Roughly 20% of rules generate 80% of the alerts the SOC actually actions. Another 30% generate around 15% of alerts. The remaining 50% generate the last 5%, and most of those come with false-positive rates between 70% and 90%. Analysts have learned to ignore them.

That gives you a triage frame that does most of the work for you:

  • Tier 1 (around 20% of rules). Ransomware, lateral movement (PsExec, WMI, Pass-the-Hash), data exfiltration (DNS tunneling, anomalous egress), privileged-account abuse. These carry the SOC. Migrate first. Human review on every rule.
  • Tier 2 (around 30%). Phishing, vulnerability exploitation, insider threat. Migrate second. Sample-review for accuracy; full regression test before cutover.
  • Tier 3 (around 50%). Experimental rules nobody tuned, detections for vendor products that were decommissioned three years ago, informational rules that fire on every login. Don't migrate. Delete. The SOC isn't working those alerts anyway.

On the case I'm anonymizing here, 800 SPL rules became roughly 400 SQL rules after triage. Half the ruleset was deprecated rather than translated. The SOC manager's framing ("we don't action these alerts anyway, so don't waste engineering hours moving them") is the framing I now coach every architect through before any LLM gets involved.

The mistake to avoid is treating this as a cost-cutting exercise when it is really a coverage exercise, because you're deleting those 400 rules not to save engineering time but because the SOC has already de-facto retired them, and migrating dead rules forward creates the false impression of preserved coverage. See the field-mapping anti-pattern for the related failure mode where translation effort goes into rules that were already broken upstream.

Phase 1

LLM-assisted translation, with humans on the load-bearing rules.

The cost-saving move is to use an LLM for the first pass on every rule, then concentrate human review where the cost of a translation bug is highest. The shape that has worked for me is roughly: 40 hours of LLM work to draft SQL for all 800 rules, 80 hours of human engineering review across the 160 Tier 1 rules (about 30 minutes per rule), and 16 hours of spot-checking a 10% sample of Tier 2 and Tier 3. That's 136 hours total against the 400-hour fully-manual estimate.

Across the migration projects I've seen using GPT-4-class and Claude 3.5-class models, LLM translation accuracy lands roughly where you'd expect once you sort by rule complexity. Simple aggregations and filters translate at 92–95% accuracy on the first pass. Queries with subsearches, lookup enrichments, and outputlookup writes drop to 75–85%; the LLM gets you 80% of the way and the engineer finishes the rest. Highly custom macros and Splunk-specific functions (tstats, mvcombine) fall to 40–60% accuracy and usually need full manual rewrites. The ranges here are Tier C; they're my observation across roughly eight migration projects, not a published benchmark.

Read those ranges against the published text-to-SQL ceiling, because the BIRD-benchmark numbers are the outside view on what "the LLM got it right" actually means, and I work through them in LLM-assisted OCSF mapping. My own 92–95% figure for simple aggregations is plausibly real precisely because SPL aggregate-and-threshold rules are the easy end of that distribution; the 40–60% I see on tstats and custom macros is where my anecdote and the benchmark agree. Treat the published numbers as the ceiling and my ranges as one practitioner's slice underneath it.

Here's a concrete example of what an LLM gets right and what it gets wrong. The input rule:

index=edr sourcetype=process_telemetry
| search process_name=powershell.exe command_line=*DownloadString*
| stats count by host, user, command_line
| where count > 5

The LLM's first-pass SQL:

SELECT
    device_hostname,
    actor_user_name,
    process_cmd_line,
    COUNT(*) AS download_attempts
FROM ocsf_process_activity
WHERE event_date >= CURRENT_DATE - 1
  AND process_file_path LIKE '%powershell%'
  AND process_cmd_line LIKE '%DownloadString%'
GROUP BY device_hostname, actor_user_name, process_cmd_line
HAVING download_attempts > 5;

Human review catches three things that the LLM consistently misses. NULL handling: the SPL version implicitly drops events where command_line is empty; the SQL version will return them unless you add AND process_cmd_line IS NOT NULL. Time-window semantics: Splunk's earliest/latest is a rolling window from query time, while event_date >= CURRENT_DATE - 1 is calendar-day-boundary behavior, so the two are different rules. And field mapping: the LLM mapped host to device_hostname, which is correct against OCSF in most cases but wrong for sources where the upstream parser put the hostname in device.name instead.

Look at the shape of all three, because none of them throws an error: the NULL-handling miss, the calendar-day versus rolling-window mismatch, and the field mapped to the wrong OCSF attribute each produce valid SQL that runs clean and returns a plausible result set. This is the failure mode that matters, and it isn't the rule that fails to translate, because a rule that fails to translate announces itself and the engineer sees the error, fixes it, and moves on. The dangerous one is the silent error, where the LLM emits valid-but-semantically-wrong logic that passes a shallow eyeball check and ships looking correct, so that a rule which drops a third of its events to an unhandled NULL still produces alerts, just fewer of them, and "fewer alerts" reads like a quieter SOC rather than a coverage hole until the threat you needed it to catch slips through.

That is why the regression gate is load-bearing rather than optional. You cannot inspect your way out of a silent error; the SQL looks right by construction. The only check that surfaces it is replaying the rule against labeled events and comparing the output to a known-good baseline, which is exactly the 90-day regression in Phase 2. The field-mapping miss is the highest-frequency version of this I've seen across LLM-translated rules, and it's why I don't trust LLM translation without regression testing in front of it. See LLM-assisted OCSF mapping for the longer treatment of where LLMs help and where they hallucinate field-name correctness.

Phase 2

Regression testing against ninety days of known threats.

Regression testing answers one question: does the new SQL rule catch the same threats as the old SPL rule? The methodology is to re-run both against the same 90-day historical window and compare. Ninety days is the window I've settled on after watching shorter baselines miss seasonal threats and longer baselines add storage cost without proportional coverage improvement. A typical SOC sees 500 to 2,000 distinct threat types per quarter; that's the diversity the baseline needs to capture.

The mechanics are three steps. Export the historical Splunk alerts for the rule under test:

index=notable sourcetype=splunk_alerts rule_name="Brute Force Authentication"
  earliest=-90d latest=now
| table _time, src_ip, user, count, severity
| outputcsv brute_force_90day_baseline.csv

That baseline is your "known true positives": the alerts the migrated rule must reproduce. Run the SQL equivalent against the same window:

SELECT
    event_time,
    src_endpoint_ip,
    actor_user_name,
    COUNT(*) AS failed_auth_count
FROM ocsf_authentication_events
WHERE event_date >= CURRENT_DATE - 90
  AND activity_id = 1  -- Authentication failed
GROUP BY event_time, src_endpoint_ip, actor_user_name
HAVING failed_auth_count > 10
ORDER BY event_time;

And diff the two result sets. The diff query is the most useful artifact in the entire migration; it tells you exactly which historical alerts the new rule missed:

SELECT
    splunk.event_time, splunk.src_ip, splunk.user,
    splunk.count AS splunk_count,
    sql.failed_auth_count AS sql_count
FROM splunk_baseline splunk
LEFT JOIN sql_results sql
  ON splunk.src_ip = sql.src_endpoint_ip
  AND splunk.user = sql.actor_user_name
  AND DATE(splunk.event_time) = DATE(sql.event_time)
WHERE sql.failed_auth_count IS NULL
ORDER BY splunk.event_time;

On the migration I'm anonymizing here, the first pass against this single brute-force rule produced 1,839 SQL alerts against a Splunk baseline of 1,847; eight missed alerts, about 99.6% parity. The investigation broke down as six alerts lost to a field-mapping bug (source IP was in device.ip rather than src_endpoint.ip) and two lost to the time-window difference between Splunk's rolling 24-hour and the SQL version's calendar-day boundary. Both fixes were one-line changes, and the re-test produced 1,847 alerts, which is full parity.

Aggregated across all 160 Tier 1 rules in the case I'm working from, the first-pass parity was around 98.5%, which is close but not the 100% you need to sign off on cutover, and after fixing field-mapping bugs (the single largest source of miss) and time-zone handling (Splunk runs UTC, the new engine ran local time) parity reached 100%. That number, 100% Tier 1 parity validated against a documented 90-day baseline, is what gets a CISO to sign the cutover authorization, because in my experience anything less tends to stall.

For Tier 2 rules I accept 95–98% parity, and for Tier 3, if you migrate any, 80–90%, because those tolerances reflect the actual fidelity the SOC is operating at on those rules anyway, so demanding 100% parity on rules that already run at 80% false-positive rates is effort spent for appearances rather than coverage.

Phase 3

Parallel execution catches what regression cannot.

Regression testing validates the rule against threats that already happened. Parallel execution validates it against threats that haven't happened yet. Splunk continues running the Tier 1 ruleset and feeding alerts to the SOC; the new platform runs the SQL equivalents and feeds alerts to a validation queue that no analyst actions. A daily automated comparison surfaces any rule where the two systems disagree:

SELECT
    rule_name,
    COUNT(DISTINCT splunk_alert_id) AS splunk_alerts,
    COUNT(DISTINCT sql_alert_id) AS sql_alerts,
    COUNT(DISTINCT splunk_alert_id) - COUNT(DISTINCT sql_alert_id) AS delta
FROM alert_comparison
WHERE alert_date = CURRENT_DATE - 1
GROUP BY rule_name
HAVING delta != 0
ORDER BY ABS(delta) DESC;

Week one almost always surfaces a cluster of discrepancies. On the case I'm describing, twelve rules showed 5–15% fewer alerts on the SQL side, and the root cause was again time-zone handling, Splunk's UTC default versus the new engine's local time, which took one fix applied across all twelve rules. Weeks two through four came back with zero discrepancies, so those rules cleared for production.

The hard part of parallel execution isn't technical but financial, because each business unit is paying for two SIEM-grade environments simultaneously during their parallel window. On the federated case the seven business units settled into a wide range of parallel-window durations based on their own risk tolerance: roughly three weeks at the short end for a unit with compliance urgency that needed the Splunk contract gone, three to four months in the middle, and six months at the long end for a critical-infrastructure unit whose CISO refused to move faster.

The total double-spend across the seven units landed in the mid-six-figure range, which sounds scary in isolation, so the frame I've found persuasive is to compare it against the cost of the failure case: cutting over without parallel validation, missing a ransomware detection, and paying for incident response plus the credibility loss that ends the migration program entirely. The parallel-operation spend is insurance, and the premium is calibrated by the unit's own risk appetite rather than by a corporate timeline.

That last point is the central decision the federated coalition rests on, because each business unit controls its own parallel-window duration and forcing uniformity tends to collapse the coalition. The unit with the longest window is paying the most, but they're also the unit most likely to walk away from the migration entirely if rushed, and walking away takes the corporate-wide cost-savings case down with them.

Phase 4

Phased cutover, not big-bang.

The cutover sequence inside each business unit is consistent. Disable the Splunk Tier 1 rules first. Watch for seven days. Verify that analysts are not seeing missed detections, that escalation rates are stable, that the SOC's MTTR (mean time to respond) hasn't drifted. If all three hold, Tier 1 is done.

Splunk continues running Tier 2 and Tier 3 through this window. The reason is rollback safety: if Tier 1 fails, you have a partial-rollback option (re-enable the Splunk Tier 1 rules, leave everything else alone). That's a meaningfully different recovery profile from a big-bang cutover where a single broken rule means full rollback and a project credibility wound that the team may not come back from.

A month later, Tier 2 follows the same pattern. Tier 3 is deleted rather than migrated; the SOC doesn't work those alerts and migrating them forward would re-create the noise the team is trying to escape. On the case I'm describing, that decision eliminated something like 75 alerts per day of analyst noise. The SOC manager described it as the single most appreciated outcome of the migration, ahead of the cost savings.

For the federated context (seven units, seven parallel timelines, staggered starts) the corporate rollout took roughly twelve months from the first unit's pilot to the last unit's Splunk sunset. The alternative timeline that some leadership teams push for, "all seven cut over together to maximize the contract negotiation leverage," is the timeline that destroys the coalition. I've watched it fail twice and recommend against it categorically.

The federated wrinkle

Seven business units do not share a threat model.

The 800 rules in this case were not 800 corporate-mandated rules running uniformly across the business. They were a federation. Different units faced different threats, ran different log sources, and maintained different detection content. Forcing a shared ruleset would have collapsed the operational reality of the migration.

The structure that worked was a two-layer detection library. The first layer was a corporate-shared baseline of roughly 100 rules covering the threats every unit must detect (ransomware, lateral movement, privilege escalation), carried as a board-level compliance requirement and owned by the corporate SOC. The second was a per-unit overlay, where each business unit maintains its own detection content on top of the baseline, owned by that unit's security team, so that corporate doesn't touch the overlay and the unit doesn't touch the baseline.

Mathematically, the migration shrank from 800 unit-specific rules to 100 shared rules plus around 300 unit-specific rules across seven units (an average of 40-something per unit after Tier 3 deprecation). Corporate migrated the shared baseline once; each unit migrated its overlay independently. That's a usefully different workload from "migrate 800 rules end-to-end" and it's the structural reason the timeline closes in a year rather than two.

One unit in the federation ran Zeek with non-standard fields that did not map cleanly to OCSF. Corporate IT spent about a month in a joint working session with that unit's security engineers using LLM-assisted mapping to get to semantic correctness. The timeline impact was a two-week delay against plan. The trust impact, in the unit's own words, was substantial: the unit went into the migration worried that "corporate is going to flatten our custom sources," and came out describing the corporate IT team as collaborators rather than overlords. That's not a quantifiable outcome, but it's the outcome that makes the next federated initiative possible.

Failure modes

Five ways I've watched this go wrong.

The failure modes repeat across migrations, and what's striking is that almost none of them are technical, because they're decision-making errors that show up six months into the project, when the cost of reversing them is highest.

  • "We'll migrate all 800 rules." 400 hours of translation, 6–9 months of timeline, team burnout, no improvement in detection coverage at the end. The 80/20 triage is not optional; it's the move that makes the project finishable.
  • "We trust the LLM translation." LLMs are 92–95% accurate on simple SPL patterns and progressively worse on complex ones. That means 5–8% of rules have bugs the LLM didn't catch. Human review on Tier 1 is non-negotiable; regression testing on every migrated rule is the backstop.
  • "We'll cut over everything at once." If any rule breaks, full rollback. If the rollback works, the project credibility is wounded; if it doesn't, the SOC is dark. Phased cutover preserves the partial-rollback option.
  • "We don't need regression testing; we tested ten rules and they worked." The ten you tested are not the ten that will break. Field-mapping bugs and time-zone discrepancies are the long tail. The 90-day regression catches them; spot-checking does not.
  • "Everyone follows the corporate timeline." The unit with the most conservative CISO needs the longest parallel window. Forcing them onto a fast unit's timeline causes them to exit the coalition, and the coalition is the thing the corporate cost-savings case depends on. Let each unit pick its window.

The first four are mechanical, but the fifth is political, and it's the one I see most often mishandled, because a federated migration is a governance problem that only looks like a SQL problem. The architecture work that gets the most attention is the SPL-to-SQL translation, while the work that decides whether the project succeeds is the governance design that lets seven units move at seven speeds without breaking the shared baseline.

The numbers

What "100% Tier 1 parity" actually looked like.

Across the 160 Tier 1 rules on the case I'm working from, the regression-testing numbers landed like this. Ransomware-family rules came in four short on the first pass, all four traced to a field-mapping bug and fixed to 100% parity; every family closed to full parity after the field-mapping and time-zone fixes.

Tier 1 rule family 90-day Splunk baseline SQL first pass Parity after fix
Ransomware-family 342 338 100%
Lateral movement 219 215 100%
Data exfiltration 156 151 100%
Privileged-account abuse 284 279 100%
Phishing 523 518 100%

Aggregate Tier 1: 1,524 baseline alerts, 1,501 first-pass alerts (98.5% parity), 1,524 after fixing field mapping and time-zone bugs. That last number is the artifact the CISO signed against. That the first pass was already at 98.5% is what made the project finishable in the timeline; the 2-percentage-point gap was the work that justified the human-review investment.

On the cost side, the project's headline savings landed in the low seven-figures annually after all seven units sunset their Splunk contracts. The parallel-operation insurance premium consumed roughly 15–20% of the first-year savings: real money, but not enough to change the business case. The analysis in the hidden cost of SIEM migration covers the broader cost framing including the categories that are easy to miss in early budgeting.

Conclusion

Migration risk is solvable, but only with the discipline to triage before translating.

The reason migration is the number-one blocker to lakehouse adoption in security is that the default plan ("translate 800 rules, cut over, hope") is terrifying on its merits, and the security leader who refuses to authorize it is making the right call against that plan. The playbook in this essay replaces the default plan with one that a CISO can actually sign.

Triage first, because the 80/20 distribution is consistent enough across SOCs that you can plan around it, and roughly half the legacy ruleset is usually noise that the SOC has already retired in practice. Translate with LLM assistance but never without human review on the high-stakes rules, regression test against ninety days of historical alerts and chase first-pass parity to 100% on Tier 1, then run parallel against new threats for as long as each unit's CISO needs to sign and cut over in phases that preserve a partial-rollback option. And let federated units control their own timelines, because the coalition is the thing that makes the cost-savings case real.

None of this is a clever architecture, it's procedural discipline applied to a problem that the industry keeps trying to solve with better translation tools. The translation tools are getting better (LLMs are real and useful here) but the gating factor is the governance design rather than the translator. That's the part of the playbook I think is most under-discussed in vendor pitches, and the part I'd insist on getting right before any tool selection.