Security Data Works

Technology deep-dive

Arrow and ADBC: a foundational pillar for security data.

Apache Arrow is one of four foundational open standards I name when describing the security lakehouse, alongside Apache Iceberg for table format, OCSF for event schema, and Sigma for detection portability. Arrow Database Connectivity (ADBC) is the piece that finally lets the wire protocol catch up with the storage format. The headline claim is a roughly 90% query-time reduction in some analytical workloads, but the honest version has more texture, and it turns out to be more useful for security architects making decisions.

Reading time: about 18 minutes. Evidence tier: B overall (industry journalism, vendor documentation, Apache project documentation), with one Tier A primary source for the Arrow project's own scope statement and one Tier D speculative note on GPU acceleration. Security-specific benchmarks for ADBC are not yet published, and I flag that throughout.

The driver problem

JDBC and ODBC are older than the workloads they now serve.

A modern SOC often runs three different query engines for three different jobs, with ClickHouse handling real-time correlation at sub-100ms latency, DuckDB doing laptop-scale threat hunting against a local cache, and Trino running federation queries that span Snowflake, S3, and PostgreSQL in a single statement. That's the pattern I see most often in reference architectures, and it's the one I recommend.

The traditional way to talk to any of them is through a JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) driver. ODBC was released in 1992. JDBC followed in 1997. Both were designed for the dominant database shape at the time: row-oriented transactional systems where the unit of work is "fetch the next row and update one of its fields." That assumption is baked into the wire protocol.

Security analytics is the opposite shape. Columnar OLAP (online analytical processing) workloads scan a few fields across billions of rows. The query "give me source IP and destination port for every event in the last hour where severity exceeded 3" wants two columns from a table that may have twenty or more. Forcing that workload through a row-oriented wire protocol creates a tax that has nothing to do with the database itself or the network. It's a serialization tax, paid on every query.

That tax is what ADBC is trying to remove. The reason it isn't just a marketing pitch is that the mismatch is structural: it has sat in the stack for roughly three decades, and it has only become visible because the rest of the stack (Iceberg, Parquet, ClickHouse, DuckDB) finally went columnar end-to-end.

What Arrow actually is

A standard memory layout, not a database.

Apache Arrow is an in-memory columnar format. That phrasing matters because Arrow is not a storage format on disk (that's Parquet), not a table format (that's Iceberg or Delta), and not a query engine (that's DuckDB, ClickHouse, Trino). Arrow specifies how columnar data should be laid out in memory so that multiple processes (or multiple steps inside the same process) can read it without copying or re-encoding.

Two consequences fall out of that. First, Arrow enables zero-copy data sharing across systems. A ClickHouse result set can be handed directly to a Python analytics process or a DuckDB instance without anyone serializing to JSON or CSV in the middle. Second, Arrow buffers are friendly to SIMD (Single Instruction, Multiple Data) vectorized CPU instructions: because column values are laid out contiguously in memory, the CPU can apply the same operation to many of them in parallel.
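To make the zero-copy point concrete, here is a minimal standard-library sketch. It is an analogy, not Arrow itself: Python's array and memoryview stand in for an Arrow buffer and a second reader of that buffer, and the column name dst_ports and its values are made up for illustration.

```python
from array import array

# One column stored contiguously as unsigned 16-bit integers (hypothetical data).
dst_ports = array("H", [443, 53, 22, 8080, 443])

# A memoryview is a second "reader" over the same bytes; nothing is copied.
view = memoryview(dst_ports)

# A slice of a memoryview is still a view onto the original buffer.
first_three = view[:3]

# Writing through the owner is visible through the view: the memory is shared.
dst_ports[0] = 8443
shared = first_three[0] == 8443

# By contrast, a list built from the column is an independent copy.
copied = list(dst_ports)
dst_ports[1] = 5353
copy_unchanged = copied[1] == 53
```

The view sees every change because it points at the same memory, while the list does not. Arrow applies the same idea with a layout that is standardized across languages and processes.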

Arrow has been winning the analytical interchange war for several years now. It's the in-memory representation or a first-class interchange format for Polars, DuckDB, Dremio, and Snowflake's Snowpark; it's how pandas talks to Parquet; it's the format Iceberg readers usually emit. ADBC is the natural next step. If everything already passes Arrow buffers around internally, why should the database wire protocol stop and convert back to rows on the way out?

Row vs column wire

What the 90% number actually measures.

Consider a query: SELECT src_ip, dst_port FROM network_events WHERE severity > 3. The events table has roughly twenty columns: timestamp, source IP, destination IP, source port, destination port, protocol, severity, action, and so on.

Through a JDBC or ODBC driver, the database scans the matching rows, serializes every field of every row into a byte stream, and sends them across the wire. The client deserializes them back into row objects, and only then can it pull out the two columns it actually asked for. So the driver moved one hundred percent of the qualifying data to use roughly ten percent of it (two columns out of twenty, assuming similar column widths), and that ratio is where the tax lives.

Through an ADBC driver, the database streams only the two requested columns, already laid out as Arrow buffers, and the client reads them directly. There is no row-to-column conversion and no JSON or length-prefixed-byte gymnastics in the middle, so the query engine on the receiving end can pass those buffers into vectorized operations without touching them again.
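The two wire shapes described above can be sketched with the standard library alone. This is a toy model under loud assumptions: every field is a 32-bit integer, the column positions SRC_IP and DST_PORT are invented, and struct and array stand in for a real row protocol and real Arrow buffers. It follows the simplified picture in the text rather than any specific driver's behavior.

```python
import struct
from array import array

NUM_COLUMNS = 20          # width of the hypothetical network_events table
NUM_ROWS = 1_000          # rows that passed the severity filter
SRC_IP, DST_PORT = 1, 4   # made-up positions of the two requested columns

# Synthetic rows: every field is a 32-bit unsigned integer for simplicity.
rows = [tuple((r * NUM_COLUMNS + c) % 65_536 for c in range(NUM_COLUMNS))
        for r in range(NUM_ROWS)]

# Row-oriented wire, as modeled in the text: every field of every row is sent.
row_format = struct.Struct(f"<{NUM_COLUMNS}I")
row_wire = b"".join(row_format.pack(*row) for row in rows)

# Client side: decode each row, then keep only the wanted field.
decoded = [row_format.unpack_from(row_wire, i * row_format.size)
           for i in range(NUM_ROWS)]
src_ip_from_rows = [row[SRC_IP] for row in decoded]

# Column-oriented wire: one contiguous buffer per requested column.
src_ip_col = array("I", (row[SRC_IP] for row in rows))
dst_port_col = array("I", (row[DST_PORT] for row in rows))
column_wire = src_ip_col.tobytes() + dst_port_col.tobytes()

# Client side: reinterpret the bytes as a typed column, no per-row decoding.
src_ip_from_columns = array("I")
src_ip_from_columns.frombytes(column_wire[: NUM_ROWS * src_ip_col.itemsize])

# Share of the row-wire bytes that the column wire needed to carry.
useful_fraction = len(column_wire) / len(row_wire)
```

Both paths recover the same src_ip values, but in this model the column wire carries one tenth of the bytes and skips the per-row decode loop. Real protocols add headers, variable-width fields, and compression, so treat the ratio as illustrative only.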

DuckDB's documentation states ADBC provides "greater than 90% reduction in query times for many analytical applications," and that language is doing work, because the "greater than 90%" headline applies to workloads that combine three properties at once: highly selective column projection (a few fields out of many), large result sets where the serialization cost dominates total query time, and a query engine on the client side that can consume Arrow buffers natively. The "many" qualifier is where the honesty sits, and since DuckDB has not published a benchmark suite with methodology I can audit, I treat the 90% number as Tier B evidence and don't quote it without that qualifier.

For security workloads, the directionally right way to think about it is in the table below. The numbers come from the DuckDB and Apache Arrow project framing; I have not independently benchmarked these on security data, and you should treat them as a starting hypothesis to test against your own workload, not as a vendor promise.

Columns selected / total    ADBC improvement (directional)
1 of 20 (5%)                80-90%
3 of 20 (15%)               50-70%
10 of 20 (50%)              20-30%
20 of 20 (all)              minimal

The pattern is intuitive once stated: the more selective your projection, the more the row-oriented wire was hurting you, and the more ADBC helps. A SOC analyst running a threat hunt that pulls two or three fields out of a wide event table sits squarely in the "ADBC may help materially" zone. A forensic export that needs every column ends up in the "may not matter much" zone.
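A back-of-envelope way to reason about that pattern is to compute how many transferred bytes go unused under a wire that ships every column. The helper below is a hypothetical illustration that assumes equal-width columns. It gives an upper bound on byte savings, not a query-time prediction, which is one plausible reason the directional improvements quoted above sit below it: scan time and network latency do not shrink with projection.

```python
def unused_byte_fraction(selected: int, total: int) -> float:
    """Fraction of transferred bytes the client never uses, assuming
    equal-width columns and a wire that ships every column."""
    if not 0 < selected <= total:
        raise ValueError("selected must be between 1 and total")
    return 1 - selected / total


for selected in (1, 3, 10, 20):
    share = unused_byte_fraction(selected, 20)
    print(f"{selected:>2} of 20 columns: {share:.0%} of bytes unused")
```

For 1 of 20 columns this gives 95% unused bytes, and for 20 of 20 it gives 0%, which matches the intuition that a full-width forensic export has little to gain.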

Adoption state

Who has shipped an ADBC driver as of 2026.

Adoption breaks along a predictable line: analytical engines moved first, transactional engines have not really moved at all, and SIEM vendors are still absent. The list of vendors with shipping ADBC support, drawn from project documentation and vendor announcements, includes:

  • DuckDB: native ADBC support, including the bulk-ingest path that lets you load Arrow buffers straight into a DuckDB table without re-encoding.
  • Snowflake: ADBC connector available; relevant for security teams already using Snowflake as a long-tail archive.
  • Google BigQuery: ADBC connector built; matters for organizations whose security data lands in BigQuery via Chronicle or similar.
  • Dremio: ADBC connector available, plus Arrow Flight as a related protocol for federation.
  • Databricks: ADBC connector available against Delta tables and SQL warehouses.
  • Microsoft Power BI: ADBC integration, mostly relevant for dashboard layers.

The notable absences and partial cases for security teams:

  • Splunk, IBM QRadar, and Micro Focus / OpenText ArcSight: no ADBC driver as of early 2026.
  • Elastic: shipped Arrow as a response format over the ES|QL REST API in Elasticsearch 8.16 (October 2024). That is Arrow-as-format rather than Arrow Flight or ADBC, so you get columnar payloads but no zero-copy streaming and no JDBC/ODBC replacement, because Elastic has no Flight server, no Flight SQL endpoint, and no ADBC driver.
  • ClickHouse: ClickHouse Inc. shipped an official ADBC driver in February 2026 (the ClickHouse/adbc_clickhouse repo under the ClickHouse GitHub org). It is still preview/WIP with stubbed methods, but it is vendor-owned, not community-maintained.
  • Trino: the ADBC driver is maintained by the independent ADBC Driver Foundry (the adbc-drivers GitHub org), explicitly "Not affiliated with Trino"; Trino itself has not committed maintainer time (see open issue #24586).

ADBC driver maturity in general varies by language binding, with Python usually the most complete while Java and Go are catching up.

For the broader landscape: Apache Arrow's first-party ADBC drivers as of mid-2026 include Flight SQL, PostgreSQL, SQLite, Snowflake, BigQuery, DuckDB, and a JDBC bridge, plus Databricks, Spark, Hive, and Impala C# drivers and a DataFusion Rust driver. The adbc-drivers Foundry is the community second-tier home for vendor-adjacent drivers, useful to know when scanning the ecosystem for what's upstream versus what's community-maintained.

Apache Iceberg itself does not have a native ADBC driver. You reach Iceberg data through a query engine (DuckDB, Trino, Dremio, Spark) and then use that engine's ADBC connector. That's a sensible architectural choice because Iceberg is a table format, not a query engine, but it's worth knowing if you're sketching architecture diagrams.

Commercial signal

Columnar Inc. and the bet on professional ADBC tooling.

One data point that helped me take ADBC seriously as a foundational pillar, rather than a curiosity: Ian Cook, a core Apache Arrow contributor, founded Columnar Inc. in 2025 with $4M in seed funding from Bessemer Venture Partners. The company's focus is commercial ADBC drivers and tooling. This is Tier B evidence (industry journalism, The New Stack's November 2025 coverage), and venture funding is not a guarantee of category success, but it suggests at least one credible founder and one credible VC believe the JDBC-to-ADBC transition is durable enough to build a company on.

Cook's framing in that piece is worth quoting verbatim, because it captures the shape of the bet:

"Arrow has been a big success story, but there's this final frontier that Arrow has just begun to cross in the last couple of years, and that is displacing the dominant data connectivity standards like ODBC and JDBC, which are growing quite outdated and are grossly inefficient for data analytics applications in particular."

For security architects, the relevant inference isn't that ADBC will replace JDBC tomorrow, but rather that the people closest to the Arrow project consider the connectivity layer the obvious next push, and that commercial tooling is starting to appear, which changes the calculus on whether to bet architecture decisions on ADBC support being there in two or three years.

Security use cases

Where ADBC may earn its keep in a SOC.

Multi-engine query patterns

The most common SOC pattern that benefits from ADBC is the "pull from one engine, push into another" flow. An analyst pulls a window of alerts from ClickHouse for real-time triage, then wants to cross-reference those alerts against a year of historical data sitting in an Iceberg lake via DuckDB. Without ADBC, each engine has its own driver, its own row-oriented wire protocol, and its own conversion step in the middle. The Python script connecting them spends most of its time serializing and deserializing, work that has nothing to do with the analyst's actual question.

With ADBC, the same script can pull Arrow buffers from ClickHouse, hand them directly to DuckDB's bulk-ingest path, and run the cross-reference query against the combined data with no serialization in the middle, and because the driver API the analyst's code talks to stays the same across engines, the script stops being three different APIs glued together with conversion code.
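A minimal sketch of that pull-then-push flow, written against the ADBC Python DB-API style interface (cursor.fetch_arrow_table and cursor.adbc_ingest). The function name copy_query_result is hypothetical, and both arguments are assumed to be already-open ADBC connections, for example from adbc_driver_sqlite.dbapi.connect(). Which driver packages exist for your particular engines is something to verify rather than assume.

```python
def copy_query_result(source_conn, sink_conn, sql, target_table):
    """Run `sql` on the source engine and bulk-load the result into the sink.

    Both connections are assumed to be ADBC DB-API connections.
    Returns the number of rows moved.
    """
    with source_conn.cursor() as src:
        src.execute(sql)
        # The result arrives as an Arrow table; no row objects are built.
        arrow_table = src.fetch_arrow_table()

    with sink_conn.cursor() as dst:
        # Bulk ingest of Arrow data; mode="create" makes a new table.
        dst.adbc_ingest(target_table, arrow_table, mode="create")

    # ADBC DB-API connections are not autocommit by default.
    sink_conn.commit()
    return arrow_table.num_rows
```

The point of the sketch is that the same three calls apply regardless of which engines sit on either side. For results too large for client memory, fetching a record batch reader instead of a full table would be the more careful choice.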

I want to be careful about overclaiming here. I have not personally benchmarked a multi-engine threat-hunting workflow under ADBC versus JDBC end-to-end on production-shaped security data. That's on the lab roadmap and would be Tier B evidence at best when it lands. Until then, DuckDB's published "greater than 90% reduction in query times for many analytical applications" framing is the basis for the directional claim, and I treat the actual security-specific improvement as somewhere between 30% and 80% depending on column selectivity, with significant variance.

Threat intel enrichment

A second pattern: enrich event streams against a threat intelligence table. Consider ten million events per hour that you want to check against a million known malicious IPs. The textbook in-memory hash table is fastest if the threat intel set is small and stable, but threat intel rarely is. It's constantly refreshed, often joined against multiple feeds, and frequently audited.

Per-event JDBC lookups against a remote threat intel database are catastrophically slow because of the round-trip and serialization tax. Batch ADBC queries that return Arrow buffers, which the detection engine can consume directly, sit in the right shape: you ask for the malicious-IP column once, get it back as a contiguous columnar buffer, and intersect it with your event stream in a vectorized operation. The transfer cost may be roughly an order of magnitude lower than the row-oriented equivalent, which is consistent with the directional 90%-reduction framing for analytical queries with highly selective column projection.
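The difference between per-event lookups and a single batched column fetch can be sketched without any database at all. Everything here is a stand-in: ThreatIntelStub is a hypothetical class that only counts round trips, the IP addresses are invented, and a Python set membership test plays the role of the vectorized intersection a real engine would run over Arrow buffers.

```python
class ThreatIntelStub:
    """Stand-in for a remote threat intel database that counts round trips."""

    def __init__(self, malicious_ips):
        self._ips = list(malicious_ips)
        self.round_trips = 0

    def is_malicious(self, ip):
        self.round_trips += 1          # one network round trip per event
        return ip in self._ips

    def fetch_ip_column(self):
        self.round_trips += 1          # one round trip for the whole column
        return list(self._ips)


def enrich_per_event(events, lookup):
    """Per-event pattern: one remote lookup for every event."""
    return [e for e in events if lookup(e["src_ip"])]


def enrich_batched(events, fetch_column):
    """Batched pattern: fetch the column once, then test membership locally."""
    malicious = set(fetch_column())
    return [e for e in events if e["src_ip"] in malicious]


events = [{"src_ip": f"10.0.0.{i % 50}", "dst_port": 443} for i in range(1_000)]
per_event_db = ThreatIntelStub(["10.0.0.7", "10.0.0.13"])
batched_db = ThreatIntelStub(["10.0.0.7", "10.0.0.13"])

hits_per_event = enrich_per_event(events, per_event_db.is_malicious)
hits_batched = enrich_batched(events, batched_db.fetch_ip_column)
```

Both functions flag the same events, but the per-event version makes one round trip per event while the batched version makes one in total. The serialization savings discussed above come on top of that.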

SIEM-to-lake migration

The third pattern is the migration story: moving event data out of a legacy SIEM into an Iceberg lakehouse. Today that path almost always runs through CSV or JSON exports: you dump from Splunk, parse, convert, write Parquet, and register with Iceberg. Every step costs CPU, memory, and engineer time, and the CSV intermediate is the single largest source of data-quality bugs I see in migration projects.

An ADBC-native migration path would look like this: the ADBC connector reads from the source SIEM and streams Arrow buffers directly into an Iceberg writer, with no CSV intermediate and no schema-on-read parsing tax. That's the architectural picture, but the blocker is real, because Splunk, QRadar, and ArcSight do not ship ADBC drivers today, so this is a roadmap argument for vendor pressure rather than a recipe you can deploy now. I flag it because security architects often have negotiating room during contract renewals to push for ADBC support, and that negotiating room is wasted if no one is asking.

When ADBC isn't the right tool

JDBC and ODBC still win in their original territory.

The Apache Arrow project itself says, in primary-source documentation (Tier A), that "ADBC doesn't intend to replace JDBC/ODBC in general." That phrasing has been there since the ADBC FAQ shipped, and I think it's worth taking at face value rather than dismissing it as politeness.

The places JDBC and ODBC still earn their seat:

  • OLTP transactions: row-by-row INSERT, UPDATE, DELETE workloads. A columnar wire protocol adds no value when the unit of work is a single row.
  • Legacy applications: decades of working JDBC code in enterprise security tooling. The economics of a port rarely justify the disruption unless analytical performance is the binding constraint.
  • Non-columnar databases: key-value stores (Redis), document stores (MongoDB), graph databases (Neo4j). Without a column-oriented backing store, ADBC has nothing to optimize.
  • Wide column selection: SELECT * workloads where the columnar projection advantage disappears, like forensic exports.
  • Cursor navigation: application patterns that fetch one row, decide what to do, fetch the next. The Arrow batch model is a poor fit for this style.

For a security architect, the line is roughly ADBC for analytical work against columnar engines (ClickHouse, DuckDB, Trino, Snowflake, BigQuery) and JDBC or ODBC for everything else, and the two can coexist in the same SOC, since you make the call per-workload rather than committing the whole organization to one or the other.

Technical shape

What ADBC actually exposes.

The canonical ADBC specification is defined in C (the adbc.h header), with language bindings for Python, Java, Go, R, and Ruby. The Python bindings are the most mature in early 2026; Java and Go are catching up; R and Ruby are usable but smaller surface areas.

The core operations the API exposes:

  • Connect: establish a connection to the database, with URI-style connection strings that are uniform across drivers.
  • Query: execute SQL and get results back as Arrow RecordBatches (the unit of columnar data), not as a row cursor.
  • Stream: iterative retrieval of large result sets, batch by batch, without buffering the entire result in client memory.
  • Metadata: schema discovery and catalog browsing through a uniform API rather than vendor-specific dialect.
  • Ingest: bulk load Arrow buffers into a database table, useful for the multi-engine pattern described above.
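The Query and Stream operations can be sketched in the Python bindings as follows. This is a sketch under assumptions: conn is taken to be an open ADBC DB-API connection, the helper name count_rows_streaming is hypothetical, and the example relies on cursor.fetch_record_batch() returning a pyarrow RecordBatchReader that yields one RecordBatch per iteration.

```python
def count_rows_streaming(conn, sql):
    """Count result rows batch by batch, without holding the full result."""
    total = 0
    with conn.cursor() as cur:
        cur.execute(sql)
        # A reader over the result stream rather than a materialized table.
        reader = cur.fetch_record_batch()
        for batch in reader:           # each item is one Arrow RecordBatch
            total += batch.num_rows
    return total
```

Only one batch needs to be resident at a time, which is what makes the streaming path suitable for result sets larger than client memory.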

The key shift from JDBC is the result shape. A JDBC query returns a row cursor that the application steps through one row at a time. An ADBC query returns Arrow RecordBatches that the application processes column-at-a-time, with vectorized operations available throughout. For SOC analyst code that's already in pandas, Polars, or DuckDB, this is a near-drop-in change, while for code written around the row-cursor mental model, it's a refactor.
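The refactor from a row cursor to column-at-a-time processing can be illustrated with the standard library. In this sketch sqlite3 plays the part of a row-cursor driver (it is not JDBC), plain Python lists stand in for Arrow RecordBatches, and the table contents and the column_batches helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE network_events (src_ip TEXT, dst_port INTEGER, severity INTEGER)"
)
conn.executemany(
    "INSERT INTO network_events VALUES (?, ?, ?)",
    [(f"10.0.0.{i}", 1000 + i, i % 6) for i in range(24)],
)
QUERY = "SELECT src_ip, dst_port FROM network_events WHERE severity > 3"

# Row-cursor style: step through the result one row at a time.
high_ports_rowwise = []
for src_ip, dst_port in conn.execute(QUERY):
    if dst_port >= 1010:
        high_ports_rowwise.append(dst_port)


def column_batches(cursor, batch_size):
    """Yield {column_name: [values, ...]} dicts, one per batch of rows."""
    names = [d[0] for d in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        # Transpose the batch from a list of rows into per-column lists.
        yield dict(zip(names, map(list, zip(*rows))))


# Column-at-a-time style: work on a whole column per batch.
high_ports_columnar = []
for batch in column_batches(conn.execute(QUERY), batch_size=4):
    high_ports_columnar.extend(p for p in batch["dst_port"] if p >= 1010)
```

Both loops produce the same ports. The difference is the unit the application code handles: a row in the first case, a named column of values in the second. With a real ADBC driver the transpose step disappears, because the data arrives columnar already.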

Speculative direction

GPU acceleration and the longer arc.

Ian Cook's New Stack interview also raises a longer-arc possibility: Arrow as a bridge to GPU-resident analytical processing. His framing ("the CUDA ecosystem for GPUs is built around a tabular data model, which could benefit from faster load times as well") describes a stack where security telemetry flows from storage through Arrow buffers directly into GPU memory for ML inference, with no CPU-side deserialization in the middle.

That's Tier D evidence: speculation by a credible expert, not validated production deployment. I include it because it's a plausible direction the standard may move, and security teams investing in ML-driven detection should at least be aware that the columnar-wire-to-GPU path is on the longer roadmap. I would not bet architecture decisions on it landing in 2026 or 2027.

The shorter and more concrete adjacent piece is Arrow Flight, a separate protocol (also from the Arrow project) that streams Arrow data over gRPC. Flight is the right tool when you want a streaming columnar pipe between systems rather than a query-response driver, so Flight and ADBC complement each other rather than competing. Both are worth distinguishing from a third thing in the Arrow family that sometimes gets conflated: Arrow-as-response-format, which is what Elastic shipped in Elasticsearch 8.16 over the existing ES|QL REST API. That delivers columnar payloads but not the gRPC streaming of Flight or the driver surface of ADBC. The result is three different layers that often get mentioned interchangeably in vendor marketing.

Honest gaps

What's still missing in early 2026.

The gaps I'd put at the top of any security architect's evaluation list, none of which I'd hide from a stakeholder evaluating ADBC for a real deployment:

  • SIEM vendor support is absent. Splunk, QRadar, and ArcSight do not ship ADBC drivers. Until that changes, ADBC sits downstream of the SIEM, not in front of it.
  • Security-specific benchmarks are not published. The 90% claim is for "many analytical applications," not for OCSF-shaped event data. I have not seen a methodology-disclosed ADBC benchmark against firewall logs, EDR telemetry, or DNS data. This is on the lab roadmap.
  • Driver maturity varies by language. Python is the most complete. Java and Go are catching up. If your security tooling is JVM-heavy, expect rough edges in the ADBC Java bindings that may not exist in Python.
  • Iceberg lacks a native ADBC connector. You reach Iceberg through a query engine. That's architecturally clean but worth noting if you're sketching an "Iceberg plus ADBC" picture.
  • The 90% number is unaudited. DuckDB's claim is internally consistent and matches the structural argument, but the absence of a published methodology means I cite it with the qualifier, not as a guaranteed performance promise.

None of these are reasons to dismiss ADBC, but they are reasons to size expectations correctly and to treat ADBC adoption as a vendor-pressure and roadmap-tracking exercise alongside the technical work.

Why this matters

Arrow plus ADBC in the foundational-standards stack.

When I describe the security lakehouse to architects, I name four foundational open standards. Two live at the data layer: Apache Arrow for in-memory representation, Apache Iceberg for the on-disk table format. One lives at the schema layer: OCSF (Open Cybersecurity Schema Framework) for portable security event definitions. One lives at the detection layer: Sigma for detection-rule portability across SIEMs and lakehouses.

Arrow is the piece that makes the others compose well, so that OCSF events written into Parquet, registered as Iceberg tables, and queried through DuckDB or Trino with ADBC drivers move through a pipeline where no step has to re-encode the data into a different shape. The schema is portable, the storage is portable, and the query engine is interchangeable, and now the wire protocol stops adding a serialization tax on top of all that.

That composability is the real argument for ADBC. The 90% query-time reduction is the marketing headline, but the structural reason to care is that the security data stack finally has the option of being columnar end-to-end (storage, table format, query engine, and now driver) without paying a row-oriented translation tax at any layer. For a SOC team weighing schema-on-read SIEM costs against a lakehouse migration, that composability is what compounds over the life of the architecture.

ADBC is not finished, and the obvious next gate is SIEM vendor support, with driver maturity in non-Python bindings as the second and security-specific benchmarks as the third. But the trajectory is real, the commercial signal (Columnar Inc., Bessemer) is real, and the structural argument (JDBC and ODBC are older than the workloads they now serve) is the kind of argument that wins on a long enough timeline.

Practical guidance

What to do in 2026.

Four moves that I'd actually recommend to a security architect today, ordered by how cheap they are to execute:

  • Track ADBC support in the engines you already run. If you're on DuckDB, Snowflake, BigQuery, Databricks, or Dremio, the ADBC driver is already available, and switching analytical workloads over is a Python-library swap, not an architecture change.
  • Evaluate ADBC for new analytical pipelines, not migrations of working JDBC code. The economics are best for greenfield analytical work where you haven't already paid the JDBC integration cost. Don't rewrite working code chasing a percentage improvement.
  • Push SIEM vendors for ADBC support during contract renewals. Splunk, QRadar, and ArcSight don't ship drivers, and one plausible reason is that too few customers have asked. Add it to your vendor scorecards. Even if you never adopt it, the pressure matters for the broader ecosystem.
  • Benchmark your own workloads. The 90% number is for "many analytical applications," not for OCSF-shaped security data. If ADBC adoption is going to be load-bearing in your architecture, run the benchmark against your real workload before committing. This is what I'd ask any vendor claiming ADBC performance to back up with methodology.
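For the benchmarking step, a minimal timing harness might look like the sketch below. The function names median_seconds and reduction are hypothetical, and the callables you pass in (one running your query through the existing driver, one through ADBC) are yours to supply. The harness only measures wall-clock time and says nothing about memory or server-side load.

```python
import statistics
import time


def median_seconds(fn, repeats=5, warmup=1):
    """Median wall-clock time of fn() over `repeats` runs after `warmup` runs."""
    for _ in range(warmup):
        fn()                           # warm caches and connections first
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def reduction(baseline_seconds, candidate_seconds):
    """Fractional query-time reduction of the candidate versus the baseline."""
    return 1 - candidate_seconds / baseline_seconds
```

A baseline of 2.0 seconds against a candidate of 0.2 seconds corresponds to a 0.9 reduction, which is the shape of claim behind the 90% headline. Reporting the median together with the repeat count and the exact query is the minimum methodology worth asking a vendor for.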

JDBC turns 29 this year, and ODBC is older still. Neither was wrong for its era; they're just no longer the right shape for analytical security workloads on columnar engines, which is where ADBC comes in as the columnar-native alternative. It may not displace JDBC in every domain (it isn't trying to), but in the analytical corner of the security data stack, it's the standard worth tracking.