Public production architecture teardown
Okta on DuckDB-in-Lambda
Security data platform built around serverless OLAP. DuckDB runs inside AWS Lambda for normalization and operational metadata harvesting, eliminating the per-query warehouse cost that traditional ETL stacks accumulate. Mini databases per invocation, not one shared engine.
Peak normalization throughput is sustained by AWS Lambda concurrency, with DuckDB embedded in each invocation. Daily volume swings between 1.5 and 50 TB (CloudTrail + VPC Flow); at production scale, 7.5 trillion records were normalized over six months across 130M files.
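To put those totals in per-file and per-day terms, a rough back-of-envelope calculation helps. The sketch below uses only the figures quoted above; the length of "six months" (half a 365-day year) is an assumption, and the results are averages over the period, not peak rates.

```python
# Figures quoted in the text above.
TOTAL_RECORDS = 7_500_000_000_000   # 7.5 trillion records
TOTAL_FILES = 130_000_000           # 130M files
MIN_TB_PER_DAY, MAX_TB_PER_DAY = 1.5, 50.0

# Assumption (not from the source): "six months" is half of a 365-day year.
DAYS = 365 / 2

records_per_file = TOTAL_RECORDS / TOTAL_FILES
records_per_day = TOTAL_RECORDS / DAYS
files_per_day = TOTAL_FILES / DAYS
records_per_second = records_per_day / 86_400   # seconds in a day
swing_ratio = MAX_TB_PER_DAY / MIN_TB_PER_DAY   # busiest day vs quietest day

print(f"avg records per file:   {records_per_file:,.0f}")
print(f"avg records per day:    {records_per_day:,.0f}")
print(f"avg files per day:      {files_per_day:,.0f}")
print(f"avg records per second: {records_per_second:,.0f}")
print(f"daily volume swing:     {swing_ratio:.1f}x")
```

Under that assumption the averages come out to roughly 58 thousand records per file and about 41 billion records per day, while daily volume varies by a factor of about 33. Small files arriving at a highly variable rate are the kind of workload where per-invocation scaling is a natural fit.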
The pipeline
- Sources. AWS logs: CloudTrail · VPC Flow.
- Ingest. Kinesis / S3 raw: buffered event streams.
- Transform. Lambda + DuckDB: SQL normalization in-function.
- Store. S3 normalized: durable result set.
- Serve. Downstream engines: detection + investigation.
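The Transform stage can be pictured as a function that builds a throwaway in-memory database, runs SQL over one batch, and discards the database when it returns. The following is a minimal sketch of that pattern, not Okta's implementation: Python's standard-library `sqlite3` stands in for DuckDB so the example runs without extra dependencies, and the `handler` payload shape, table name, and column names are all hypothetical.

```python
import sqlite3  # stand-in for DuckDB so the sketch runs on the standard library alone

# Hypothetical normalization query; the columns are illustrative, not Okta's schema.
NORMALIZE_SQL = """
SELECT
    event_time,
    lower(event_name)              AS event_name,
    COALESCE(source_ip, 'unknown') AS source_ip
FROM raw_events
WHERE event_name IS NOT NULL
ORDER BY event_time
"""


def handler(event, context=None):
    """Lambda-style entry point: one in-memory database per invocation."""
    records = event["records"]  # assumed payload shape: a list of dicts
    con = sqlite3.connect(":memory:")  # fresh engine, no state shared across calls
    try:
        con.execute(
            "CREATE TABLE raw_events (event_time TEXT, event_name TEXT, source_ip TEXT)"
        )
        con.executemany(
            "INSERT INTO raw_events VALUES (?, ?, ?)",
            [
                (r.get("eventTime"), r.get("eventName"), r.get("sourceIPAddress"))
                for r in records
            ],
        )
        rows = con.execute(NORMALIZE_SQL).fetchall()
    finally:
        con.close()  # the database disappears with the invocation
    return [{"event_time": t, "event_name": n, "source_ip": ip} for t, n, ip in rows]


# Usage with a made-up batch: one complete record, one missing the IP,
# and one missing the event name (which the query filters out).
batch = {
    "records": [
        {"eventTime": "2024-01-01T00:00:02Z", "eventName": "ConsoleLogin",
         "sourceIPAddress": "203.0.113.7"},
        {"eventTime": "2024-01-01T00:00:01Z", "eventName": "AssumeRole"},
        {"eventTime": "2024-01-01T00:00:03Z"},
    ]
}
normalized = handler(batch)
print(normalized)
```

In the pipeline described here the normalized rows would be written to S3 rather than returned, and the engine would be DuckDB, whose Python client exposes a similar connect/execute/fetchall shape. The point of the sketch is the lifecycle: nothing about the database outlives the call, so concurrency comes from running more invocations rather than from scaling one shared engine.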
What composes, what’s brittle
- 7.5T records / 6 months. Cumulative production scale across 130M files.
- Why DuckDB. Embedded OLAP with full SQL; no cluster to operate.
- Why serverless. Auto-scales with event volume; pay-per-invocation.
- What composes. Normalized result feeds other engines for query-time work.
- What's distinctive. Mini databases per invocation, not one shared engine.
- What's brittle. Lambda cold-start tail latency; keeping the DuckDB version pinned across deploys.
Sources: Data Council talk "Processing Trillions of Records at Okta with Mini Serverless Databases" · MotherDuck case study · Julien Hurault, "Okta's Multi-Engine Data Stack"