Okta on DuckDB-in-Lambda

Public production architecture teardown

Okta's security data platform is built around serverless OLAP. DuckDB runs embedded inside AWS Lambda functions for log normalization and operational metadata harvesting, which avoids the per-query warehouse cost that traditional ETL stacks accumulate. Each invocation gets its own mini database instead of sharing one engine.

250 GB/min

Peak normalization throughput, sustained through AWS Lambda concurrency with DuckDB embedded in each invocation. Daily volume swings between 1.5 and 50 TB/day (CloudTrail and VPC Flow Logs); over six months in production, 7.5 trillion records were normalized across 130M files.
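To put the headline figures side by side, the short calculation below derives average rates from the quoted daily volumes and compares them with the 250 GB/min peak. It is a back-of-envelope sketch only: it assumes decimal units (1 TB = 1000 GB), which the source does not specify, and the function and key names are illustrative.

```python
# Figures quoted in the teardown.
PEAK_GB_PER_MIN = 250
DAILY_TB_RANGE = (1.5, 50)      # quiet day, busy day
TOTAL_RECORDS = 7.5e12          # over six months
TOTAL_FILES = 130e6

# Assumption: decimal units, since the source does not say which it uses.
GB_PER_TB = 1000
MINUTES_PER_DAY = 24 * 60


def avg_gb_per_min(tb_per_day):
    """Average ingest rate if a day's volume arrived evenly."""
    return tb_per_day * GB_PER_TB / MINUTES_PER_DAY


def summarize():
    quiet, busy = (avg_gb_per_min(tb) for tb in DAILY_TB_RANGE)
    return {
        "avg_gb_per_min_quiet_day": quiet,
        "avg_gb_per_min_busy_day": busy,
        # How far the peak sits above the busiest day's average rate.
        "peak_over_busy_day_avg": PEAK_GB_PER_MIN / busy,
        # Daily volume if the peak rate were held for 24 hours.
        "tb_per_day_if_peak_held": PEAK_GB_PER_MIN * MINUTES_PER_DAY / GB_PER_TB,
        "avg_records_per_file": TOTAL_RECORDS / TOTAL_FILES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions a 50 TB day averages roughly 35 GB/min, so the quoted peak is about seven times the busiest day's average, and the average file carries on the order of 58,000 records. The gap between average and peak is the kind of burstiness that per-invocation scaling is meant to absorb.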

The pipeline

Sources: AWS logs (CloudTrail · VPC Flow)
→ Ingest: Kinesis / S3 raw (buffered event streams)
→ Transform: Lambda + DuckDB (SQL normalization in-function)
→ Store: S3 normalized (durable result set)
→ Serve: Downstream engines (detection + investigation)

What composes, what’s brittle

7.5T records / 6 months. Cumulative production scale across 130M files.
Why DuckDB. Embedded OLAP with full SQL; no cluster to operate.
Why serverless. Auto-scales with event volume; pay-per-invocation.
What composes. Normalized result feeds other engines for query-time work.
What's distinctive. Mini databases per invocation, not one shared engine.
What's brittle. Tail latency from Lambda cold starts; keeping the DuckDB version pinned consistently across deploys.
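One way to make the version-pinning risk visible is a startup guard that compares installed package versions against an expected pin and reports any drift. The sketch below is an illustration, not something the sources describe: the pinned version string and the `check_pins` helper are hypothetical, and the `lookup` parameter exists only so the check can be exercised without the package installed.

```python
from importlib import metadata

# Hypothetical pin; a real deployment would take this from its lock file.
PINNED_VERSIONS = {"duckdb": "1.1.3"}


def check_pins(pins, lookup=metadata.version):
    """Return human-readable mismatches; an empty list means all pins hold."""
    problems = []
    for package, expected in pins.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(
                f"{package}: installed {installed}, expected {expected}"
            )
    return problems


if __name__ == "__main__":
    # Run once at import time of the Lambda module so drift fails fast.
    for problem in check_pins(PINNED_VERSIONS):
        print("version drift:", problem)
```

Running the check at module import means it executes once per cold start, not on every invocation, so it adds no cost to warm calls.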

Sources: Data Council talk "Processing Trillions of Records at Okta with Mini Serverless Databases" · MotherDuck case study · Julien Hurault, "Okta's Multi-Engine Data Stack"
