Technology deep-dive

Arrow Flight and Flight SQL: the columnar wire protocol for security data.

This is the second piece in my Arrow series. The first, Arrow and ADBC, covered the driver question: why JDBC and ODBC are the wrong shape for analytical workloads and what a columnar-native database driver looks like. This piece goes a layer deeper, to the wire protocol that ADBC's Flight SQL driver actually rides on: Apache Arrow Flight, and its SQL specialization, Flight SQL. Flight is the streaming-data plumbing of the modern columnar stack, and for a SOC moving toward a lakehouse architecture it's worth understanding distinctly from ADBC and from Arrow-the-format, because these are three different layers that vendor marketing routinely conflates.

Reading time: about 20 minutes. Evidence tier: B overall (Apache Arrow project documentation and vendor engineering pages), with one Tier A primary source on the Flight specification and a Tier C benchmark from a 2022 academic paper that I cite directionally rather than as a load-bearing performance promise. Security-specific Flight benchmarks remain unpublished and are flagged throughout.

The three-layer confusion

Flight, ADBC, and Arrow-format are not interchangeable.

Before going further, it's worth disambiguating, because I see these three concepts blurred almost weekly in vendor marketing, and the conflation makes architecture conversations harder than they need to be when each layer is actually solving a distinct problem.

Arrow-as-format is an in-memory columnar layout: a specification for how columnar data is arranged in RAM so that processes can share buffers without re-encoding. It says nothing about how data moves between machines. Elastic's ES|QL Arrow response format in Elasticsearch 8.16 is at this layer: columnar payloads over the existing REST API, but no streaming protocol and no driver API.

Arrow Flight is a gRPC-based wire protocol for moving Arrow record batches between systems. It defines the methods (DoGet, DoPut, DoExchange, GetFlightInfo, ListFlights), the endpoint model (multiple parallel locations a client can fetch from), and the authentication hooks. Flight SQL is a specialization of Flight that adds SQL-specific commands (CommandStatementQuery, prepared statements, metadata queries like CommandGetTables) on top of the same RPC framework.

ADBC is a client-side API standard: the database-driver API that application code talks to. ADBC's Flight SQL driver is one implementation that wraps a Flight SQL endpoint behind the ADBC API; other ADBC drivers (PostgreSQL, SQLite, Snowflake) reach databases that do not speak Flight SQL at all. ADBC is what your Python script imports. Flight SQL is what travels over the network when the ADBC Flight SQL driver is the one in use.

A useful test: if a vendor says they "support Arrow," ask which layer. Arrow-format means columnar payloads, Arrow Flight means streaming gRPC transport, Flight SQL means a SQL command interface on that transport, and ADBC means a client driver API that may or may not sit on Flight SQL underneath. The layers can be adopted independently, and they often are: Elasticsearch shipped Arrow-format without Flight, ClickHouse shipped Flight and Flight SQL alongside an ADBC driver, and Snowflake's ADBC driver does not ride on Flight SQL at all but goes through Snowflake's existing protocol with Arrow batches embedded.

What Flight is

A gRPC framework for streaming Arrow batches.

The Apache Arrow project's own framing is the right starting point: Flight is "an RPC framework for high-performance data services based on Arrow data, and is built on top of gRPC and the IPC format." Three things follow from that sentence.

First, Flight is gRPC. The Flight team didn't build a new transport; they built a new payload convention on top of an existing one, so the HTTP/2 underneath buys bidirectional streaming, multiplexing, and a mature TLS story without reinventing any of them. That matters because gRPC is broadly understood by operators, and client libraries exist in every language gRPC supports.

Second, the payloads on the wire are not Protobuf-serialized rows but Arrow IPC batches: contiguous columnar buffers placed inside the gRPC message frame. Protocol Buffers define the RPC contracts and the metadata envelope, while the data itself skips Protobuf serialization. That is the structural reason Flight is fast in the benchmarks people quote.

Third, Flight is built around the idea that a single query result may live in multiple places. A client calls GetFlightInfo, receives a list of FlightEndpoints (each with its own URI and ticket), and fetches from those endpoints in parallel, potentially from different physical hosts. That parallel-streams model is what distinguishes Flight from a conventional single-connection driver and what makes it suitable for distributed query engines where result data is partitioned across executors.

The core method surface is small. DoGet downloads a stream of record batches identified by an opaque ticket. DoPut uploads record batches with server acknowledgment. DoExchange is bidirectional for stateful operations. GetFlightInfo and PollFlightInfo retrieve metadata (PollFlightInfo targets long-running queries). ListFlights is discovery. Authentication is pluggable: token handshake, header middleware, mTLS, or gRPC's standard mechanisms. Everything else is convention layered on top.

What zero-copy actually means

The in-process path is zero-copy. The wire path is not.

"Zero-copy" is one of the most-repeated and most-misunderstood claims in the Arrow ecosystem. It is true at one layer and false at another, and the distinction matters when you're designing architecture.

Within a single process, or between processes that share memory, passing Arrow buffers around is truly zero-copy. The receiving code holds a pointer to the same bytes the sender wrote, with no serialization, no allocation. That's what Arrow's in-memory format makes possible.
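The in-process claim can be illustrated without Arrow at all. The sketch below uses Python's standard-library array and memoryview as a stand-in for an Arrow buffer (an analogy, not Arrow's actual implementation), and the event ID values are made up. It shows that a view and a slice of a view both observe the same bytes as the original, while a copy does not.

```python
from array import array

# A columnar buffer: one contiguous run of 64-bit integers (made-up event IDs).
event_ids = array("q", [101, 102, 103, 104])

# memoryview exposes the same bytes without copying them.
view = memoryview(event_ids)
window = view[1:3]  # slicing a memoryview does not copy either

# A write through the original is visible through the view, because there is
# only one underlying buffer.
event_ids[1] = 999
shared = window.tolist()

# A real copy, by contrast, is detached from the original.
copied = array("q", event_ids)
event_ids[2] = -1
```

After the final assignment, window sees the new value at its second position and copied still holds the old one, which is the whole distinction between sharing a buffer and re-materializing it.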

Across the network, Flight doesn't bend physics. Bytes leave memory, pass through the kernel network stack, traverse the wire, and land in the destination. There are copies (the kernel does some, the gRPC runtime does some). What Flight avoids is the redundant serialization-and-deserialization step ODBC and JDBC inject between the database's internal representation and the network frame. With Flight, the wire bytes are already in Arrow IPC format. The sending side doesn't encode into a different shape; the receiving side doesn't decode back.

The honest framing: "minimum-copy on the network path, true zero-copy in-process." When a vendor pitches Flight as "zero-copy over the network," that's loose language. The 2022 TU Delft paper "Benchmarking Apache Arrow Flight" (Tier C academic, not security-specific) reports up to roughly 6000 MB/s for DoGet and 4800 MB/s for DoPut over Mellanox ConnectX-3 interconnects, using up to about 95% of link bandwidth, which is directionally useful but isn't a promise about your SOC's network.

Flight removes the serialization tax JDBC and ODBC pay on every result set, though it doesn't remove network, kernel, or TLS costs. For high-fanout federated queries across multiple security data sources, serialization is often the largest single cost and Flight may help materially, while for point lookups against a single low-latency engine the savings are smaller because there isn't much serialization to remove.
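A rough way to see where the serialization tax comes from is to compare the two shapes side by side. The sketch below is standard-library Python with made-up data. JSON stands in for a row-oriented encoding (JDBC and ODBC use binary protocols, so this exaggerates the text-rendering cost), and raw typed buffers stand in for Arrow IPC (which also carries schema and framing that the sketch omits).

```python
import json
from array import array

# Made-up result set, held column-wise as a columnar engine would hold it.
columns = {
    "event_id": array("q", [1, 2, 3]),
    "bytes_out": array("q", [512, 2048, 64]),
}


def send_row_oriented(columns):
    # Row path: pivot columns into rows, then encode every value individually.
    names = list(columns)
    rows = [dict(zip(names, values)) for values in zip(*columns.values())]
    return json.dumps(rows).encode()


def receive_row_oriented(payload):
    # The receiver parses every field back into an object, one at a time.
    return json.loads(payload)


def send_columnar(columns):
    # Columnar path: each column is already a contiguous typed buffer, so
    # "encoding" is handing those bytes to the transport.
    return {name: col.tobytes() for name, col in columns.items()}


def receive_columnar(buffers):
    # The receiver reinterprets the bytes as 64-bit integers; nothing is parsed.
    return {name: memoryview(buf).cast("q") for name, buf in buffers.items()}
```

The columnar path still copies bytes (tobytes allocates), which matches the "minimum-copy, not zero-copy" framing above. What it avoids is the per-value encode and decode work.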

Flight SQL specifically

A SQL command surface on top of Flight.

Flight by itself is a generic data-transfer protocol with opaque tickets and opaque streams. Flight SQL is the specialization that turns it into a database protocol. It adds a set of pre-defined commands that clients and servers agree on, so that a Flight SQL client can talk to any Flight SQL server without knowing vendor-specific details.

The command surface, drawn from the Apache Arrow Flight SQL documentation:

CommandStatementQuery: execute an ad-hoc SQL string and stream back Arrow batches.
ActionCreatePreparedStatementRequest and CommandPreparedStatementQuery: prepared statements with parameter binding.
CommandStatementUpdate and CommandStatementIngest: DML and bulk ingest operations.
CommandGetCatalogs, CommandGetDbSchemas, CommandGetTables: catalog and schema discovery returning Arrow-formatted metadata.
CommandGetPrimaryKeys, CommandGetImportedKeys, CommandGetExportedKeys: relationship metadata.
CommandGetSqlInfo: server capability discovery, including which SQL features are supported.

The query execution flow is two-step. A client calls GetFlightInfo with a CommandStatementQuery wrapping the SQL string. The server plans the query, decides how many endpoints will serve the result, and returns a FlightInfo with one or more FlightEndpoints (each containing a ticket). The client then calls DoGet against each ticket (potentially in parallel against different endpoints) to retrieve the result as a stream of Arrow batches. The metadata itself comes back as Arrow data, which means the same code path handles both query results and catalog browsing.

Flight SQL also includes a JDBC driver implementation contributed by Dremio (in Apache Arrow 10.0.0 and later) that lets legacy JDBC applications talk to a Flight SQL endpoint without code changes. That driver is useful as a migration bridge. Existing BI tools and JDBC-based security tooling can point at a Flight SQL server while you decide whether to migrate them to native ADBC clients later. The cost is that the JDBC abstraction undoes some of the columnar advantage on the client side (results get materialized into the JDBC row-cursor model before the application sees them), so the bridge earns its keep as a compatibility convenience while doing little for performance.

Adoption state

Who has shipped Flight or Flight SQL as of 2026.

The adoption pattern for Flight roughly tracks the adoption pattern for ADBC (analytical engines first, transactional systems and SIEM vendors absent), but with the additional twist that Flight server implementations are more work for a vendor to ship than an ADBC client driver, so the list is shorter and the maturity varies more.

Dremio: origin and reference implementation. Dremio engineers contributed the Flight specification work and ship Flight SQL as a first-class protocol for client applications. Dremio's own internal data movement uses Flight, and the JDBC-over-Flight-SQL driver in Apache Arrow was a Dremio software grant. If you treat any vendor as the production reference for Flight, it's Dremio.
ClickHouse: ClickHouse Inc. shipped an Arrow Flight interface, including Flight SQL support, with query execution, data insertion, prepared statements, session management, and metadata queries. The documentation does not explicitly designate the interface as experimental versus stable, which I read as "shipped but treat as preview" until I see production case studies. ClickHouse also has an official ADBC driver (separate effort), so the same database is reachable two ways.
DuckDB: DuckDB does not ship Flight server or Flight SQL natively. Community extensions fill the gap. The airport extension (from Query.Farm, in the DuckDB community repository for DuckDB 1.3.0+) turns DuckDB into a Flight client that can query remote Flight servers. quackflight and duckarrow are other community efforts. A forthcoming first-party DuckDB client-server protocol called Quack (announced May 2026) may consolidate this space, but Flight remains the community-extension path today.
Apache DataFusion: DataFusion ships a Flight client and Flight server example in the project repo, and the datafusion-flight-sql-server contrib project provides a usable Flight SQL endpoint on top of DataFusion as the query engine. The DataFusion Federation project pairs Flight SQL with cross-database query planning, pushing compute down to remote engines via Flight SQL, which is an architecture worth knowing about for federated security queries.
InfluxDB 3.0: uses Flight, DataFusion, Arrow, and Parquet as its core stack (the FDAP architecture), which is production validation that Flight can be the primary query protocol for a database product rather than only a side interface.
Snowflake: Snowflake's ADBC driver uses Arrow internally and reduces serialization overhead, but it does not ride on Flight SQL. It goes through Snowflake's existing Go connector and protocol with Arrow batches embedded. Worth knowing because "Snowflake supports Arrow" is true in two different senses, and only one of them is Flight.
BigQuery: Arrow is used internally by BigQuery's client libraries, but Arrow data and the Arrow iterator are not consistently exposed externally as of early 2026. There is no Flight SQL endpoint for BigQuery. The BigQuery Storage Read API delivers Arrow batches, which is close in spirit but not a Flight protocol implementation.

The notable absences are the same as for ADBC. Splunk, IBM QRadar, and OpenText / Micro Focus ArcSight do not ship Flight or Flight SQL endpoints. Elasticsearch's Arrow support in 8.16 is response-format only, not Flight, and Microsoft Sentinel and Google Chronicle do not expose Flight SQL either. For now, Flight lives in the analytical-engine and lakehouse-engine corner of the stack, and it hasn't reached the SIEM corner.

A practical implication: today, you are most likely to encounter Flight as the protocol between query engines and downstream consumers in a lakehouse architecture (Dremio to Tableau, DataFusion to a Python notebook, ClickHouse to a Flight SQL client) rather than between a SIEM and a lake, which leaves the SIEM-side gap as the obvious next gate.

Security use cases

Where Flight may earn its keep in security infrastructure.

Streaming Arrow batches across SOC infrastructure

A modern SOC's data plane is often a graph of services: a detection engine pulls normalized events, enriches them against threat intel, hands the enriched stream to a scoring service, writes scored output to a long-term lake. Each hop today is usually JSON-over-HTTP or row-oriented database queries, and each pays a serialization tax on the way in and out.

Flight as the inter-service protocol replaces those hops with Arrow record-batch streams. The detection engine streams via DoGet; the enrichment service consumes Arrow, operates on the columns with vectorized code, and emits via DoPut to the next stage. The Python or Rust enrichment code doesn't parse JSON; the detection engine doesn't re-encode into JDBC rows for a downstream Java service to immediately re-decode. I have not built this end-to-end in a production SOC and have not seen a published security-specific case study, which makes it Tier D, a structural argument that hasn't been validated in deployment.

Federated query result transport

Federation is the security use case where Flight's parallel-endpoints model has the cleanest fit. A federation query (Trino or Dremio asking three different backends a fragment each) produces result streams from multiple physical hosts. The conventional shape is that the federation engine gathers, materializes, and ships a unified result over JDBC or ODBC. That gather-and-reship step is where the federation engine often becomes the bottleneck.

Flight's FlightInfo-with-multiple-FlightEndpoints model gives the federation engine an alternative. Rather than gather, it can return a FlightInfo whose endpoints point directly at backend executors. The client fetches result fragments in parallel, with the federation engine handling planning but not data movement. This is how Dremio uses Flight internally and what the DataFusion Federation project is building toward. See Trino as a federation layer for the broader federation discussion. For security teams, this matters because telemetry naturally lives in multiple stores: EDR in one place, network in another, identity in a third, cloud audit in a fourth. The traditional answer was to ETL all of it into one SIEM, whereas the lakehouse answer is to query in place, and Flight is the transport that makes that affordable at scale.

Distributed hunt result aggregation

A pattern that surfaces in mature threat-hunting teams: a hunter runs a query against every regional data store, then aggregates local results into a global answer. With JDBC/ODBC, each regional query returns a row-oriented result set, and the hunter's tool aggregates by deserializing everything into a local structure.

With Flight, regional engines emit Arrow batches the hunter's tool consumes directly in a vectorized aggregation step. The work becomes "concatenate Arrow buffers, group by" rather than "deserialize, append rows, group by," and for a query like "show me every process execution across all regions in the last 24 hours matching these characteristics," that shift could plausibly drop a run from minutes to seconds. The security-specific benchmark to back this claim does not exist publicly, so treat it as a hypothesis to test against your own hunt workloads.

Flight vs ADBC choice

When to reach for Flight directly vs ADBC.

A reasonable question for an architect is whether application code should talk to ADBC or to Flight SQL directly. They share most of the same wire format (the ADBC Flight SQL driver is, by construction, a thin wrapper over Flight SQL), so the choice is mostly about ergonomics and portability.

Reach for ADBC when the application is a SQL client and you want driver portability. ADBC gives you a uniform connect/query/stream API regardless of whether the backing database speaks Flight SQL, the Postgres protocol, or a vendor-specific protocol. Your code doesn't care, and you can swap engines without rewriting the data-access layer. That's the common case for analyst tooling and detection logic.

Reach for Flight SQL directly when the application needs DoPut (bulk push), DoExchange (bidirectional state), or custom actions that don't map cleanly to SQL. The ADBC Flight SQL driver explicitly does not implement bulk ingestion, per the Apache Arrow ADBC documentation, which describes Flight SQL as lacking a dedicated bulk-ingest API (CommandStatementIngest is in the Flight SQL spec, but adoption is uneven). Reach for plain Flight (without SQL) when the data being moved isn't relational. Arrow batches streamed between two enrichment services don't need a SQL command surface.

A useful heuristic: ADBC for "I have SQL," Flight SQL for "I have SQL and need parallel-endpoint control or non-SQL operations," plain Flight for "I have data flowing between services and SQL is the wrong model." All three can coexist in the same architecture, because ADBC and Flight SQL are complementary rather than competing, and the Arrow project ships both since they sit at different layers.
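Written out as code, the heuristic is a three-way branch. The function and parameter names below are hypothetical, and the sketch encodes only the rule of thumb from this section, not a complete selection procedure.

```python
def choose_client_layer(
    has_sql,
    needs_bulk_push=False,
    needs_bidirectional=False,
    needs_endpoint_control=False,
):
    """Return which layer application code should talk to, per the heuristic."""
    if not has_sql:
        # Data flowing between services, where SQL is the wrong model.
        return "plain Flight"
    if needs_bulk_push or needs_bidirectional or needs_endpoint_control:
        # SQL plus DoPut, DoExchange, or explicit parallel-endpoint handling.
        return "Flight SQL"
    # Plain SQL client: prefer the portable driver API.
    return "ADBC"
```

An analyst notebook running SELECT statements lands on ADBC, while an enrichment service passing record batches to a scorer lands on plain Flight.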

Operational shape

What running Flight in production looks like.

Operational facts that change versus a traditional JDBC-based architecture:

It's gRPC, with all that implies. Flight servers listen on a configurable port (ClickHouse defaults to 9090; vendors vary). You need HTTP/2-aware load balancers, gRPC health-check support, and observability tooling that understands gRPC trailers and streaming. If your infrastructure has standardized on REST behind ALBs, this needs new operational muscle.

Authentication is your integration problem. Flight defines hooks (token handshake, header middleware, mTLS) but prescribes no scheme. Dremio uses username/password plus personal access tokens plus TLS; ClickHouse uses username/password plus TLS. Federated identity (SAML, OIDC) for Flight services is bridge-work you do yourself.

TLS is essentially required. Arrow record batches on the wire are plaintext columnar buffers, not encrypted by the protocol. For security data crossing untrusted networks, TLS termination at the Flight layer is the default, and the grpc+tls URI scheme is the canonical signal, so treat untunneled Flight as a development convenience rather than a production posture.
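One cheap guardrail is a configuration check that flags any Flight location not using the grpc+tls scheme. The sketch below uses only the standard library. The function names and hostnames are hypothetical, and it deliberately treats every other scheme (grpc, grpc+tcp, grpc+unix) as not TLS, which you may want to relax for local Unix sockets.

```python
from urllib.parse import urlparse


def is_tls_flight_uri(uri):
    """True only when the Flight location uses the grpc+tls scheme."""
    return urlparse(uri).scheme == "grpc+tls"


def plaintext_endpoints(uris):
    """Return the configured Flight locations that would not use TLS."""
    return [uri for uri in uris if not is_tls_flight_uri(uri)]
```

Running plaintext_endpoints over a deployment's configured locations at startup or in CI gives a list to fail on. It checks only the URI scheme, so it says nothing about certificate validation or mTLS.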

Parallel endpoints implicate your network topology. Fetching from multiple FlightEndpoints in parallel only delivers throughput when the network is actually parallel: separate hosts, NICs, load-balancer capacity. In a single-AZ cluster with one ingress, the parallelism is logical rather than physical.

Backpressure exists but is your responsibility. gRPC streaming gives you the primitives for flow control, but Flight imposes no specific policy. For high-fanout SOC workloads where a slow consumer could back up a detection pipeline, designing the backpressure scheme is architectural work. Reference Flight servers handle this with reasonable defaults; understand them before betting throughput SLOs on them.
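The simplest backpressure policy is a bounded buffer: the producer blocks once a fixed number of batches are in flight. The standard-library sketch below shows that policy in isolation. It is not gRPC flow control and not how any particular Flight server behaves; run_pipeline and max_in_flight are hypothetical names.

```python
import queue
import threading


def run_pipeline(batches, max_in_flight=2):
    """Move batches from a producer thread to a consumer through a bounded buffer."""
    buffer = queue.Queue(maxsize=max_in_flight)
    done = object()  # sentinel marking the end of the stream
    peak_depth = 0

    def producer():
        nonlocal peak_depth
        for batch in batches:
            # put() blocks while the buffer is full, which is the backpressure:
            # a slow consumer slows the producer instead of growing memory.
            buffer.put(batch)
            peak_depth = max(peak_depth, buffer.qsize())
        buffer.put(done)

    thread = threading.Thread(target=producer)
    thread.start()

    received = []
    while True:
        item = buffer.get()
        if item is done:
            break
        received.append(item)
    thread.join()
    return received, peak_depth
```

Blocking is only one choice. Dropping, sampling, or spilling to disk are alternatives, and which one is acceptable for a detection pipeline is the architectural decision this section is pointing at.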

Honest gaps

What's still missing in early 2026.

The gaps I'd surface for any architect evaluating Flight or Flight SQL for a security deployment:

Server maturity varies more than client maturity. Shipping an ADBC client driver is a smaller commitment than shipping a Flight SQL server. Dremio is mature; ClickHouse is shipped but I'd characterize it as preview-grade until production case studies surface; DuckDB relies on community extensions; Snowflake doesn't expose Flight SQL at all. Verify the maturity of the server you intend to depend on, not just "the vendor supports Arrow."
No SIEM vendor ships a Flight SQL endpoint, which is the same gap as ADBC. Until Splunk, QRadar, ArcSight, Sentinel, or Chronicle ship Flight SQL, Flight sits downstream of the SIEM in a lakehouse architecture rather than in front of it.
Bulk ingestion via Flight SQL is unevenly supported. The Apache Arrow ADBC Flight SQL driver documentation notes that Flight SQL "does not have a dedicated API for bulk ingestion of Arrow data into a given table," and the ADBC driver therefore does not implement bulk ingest. CommandStatementIngest is in the Flight SQL spec, but adoption is uneven, and vendors otherwise fill the gap with non-standard extensions. For a security pipeline that needs to bulk-load enrichment results into a database, you may end up writing vendor-specific code.
Security-specific benchmarks are not published. The TU Delft 2022 paper and the Dremio benchmarks measure throughput on synthetic or analytics workloads, not on OCSF-shaped security event data. The directional argument for Flight in security analytics is structural; the quantitative argument has not been made publicly. This is on the lab roadmap.
Operator tooling for Flight is thinner than for HTTP/REST. APM, API gateways, and WAFs that understand REST may not understand gRPC streaming well. Observability of long-running Flight streams is improving but not at parity with HTTP-based stacks.
The JDBC-over-Flight-SQL driver loses the columnar advantage on the client. It's a useful migration bridge, but JDBC materializes results into a row cursor before the application sees them, so you get the wire benefits without the application-side vectorization benefits, and ADBC is the client API that delivers both.

None of these gaps are reasons to dismiss Flight, but they are reasons to size expectations correctly and to treat the protocol as one architectural ingredient rather than a magic solvent.

Why this matters

Flight as the wire complement to Iceberg, OCSF, and Sigma.

I name four foundational open standards for the security lakehouse: Apache Arrow (in-memory representation), Apache Iceberg (table format), OCSF (schema), Sigma (detection portability). See Foundation: source health, flow health, data quality and Iceberg vs Delta Lake for security data for the broader framing.

Flight isn't a separate fifth pillar; it's part of the Arrow standard. But it deserves its own treatment because the Arrow project ships three distinguishable things (in-memory format, Flight wire protocol, Flight SQL), and architects need to pick which to depend on. The stack composes well when layers agree on representation: OCSF events in Parquet, as Iceberg tables, queried by DuckDB, ClickHouse, or Dremio, delivered to detection logic via ADBC or Flight SQL, transported between services as Arrow batches over Flight, with no step along that path re-encoding the data.

For a SOC weighing a lakehouse migration, the part worth understanding distinctly is the wire. ADBC is the client API; Flight is the transport. If the analytical engine you're choosing ships a Flight SQL server, that's a signal about both performance and interoperability. If it does not, you're back in vendor-specific drivers even if the engine speaks Arrow internally. The trajectory mirrors ADBC's: Dremio bet on it years ago, ClickHouse added it, and InfluxDB 3.0 built its query path on it alongside DataFusion. The SIEM vendors haven't moved, even though the argument for them to do so is the same as for ADBC: the rest of the stack is columnar end-to-end, and the wire is the last row-oriented holdout in many analytical paths.

Practical guidance

What to do in 2026.

Four moves that I'd actually recommend, ordered roughly by how cheap they are to execute:

Default to ADBC for application code, not Flight directly. ADBC's Flight SQL driver gives you Flight's wire performance without coupling application code to the Flight API. If you later swap to a non-Flight backend (Postgres, Snowflake), the application doesn't change. Talk to Flight directly only when you need DoPut, DoExchange, or non-SQL data movement.
Audit your engines for Flight SQL server support, not just "Arrow support." The three-layer question is whether the engine ships Arrow-format responses (cheap), Flight as a transport (moderate), or Flight SQL as a query interface (real commitment). When evaluating analytical engines for a security lakehouse, the third tier is the one that signals long-term interoperability.
Use Flight SQL as the protocol between federation engines and backend executors, not as the primary client protocol. Dremio's architecture and DataFusion Federation show the right shape: Flight SQL is the inter-engine wire that makes federation cheap. Client applications still want a SQL interface they recognize (ADBC, sometimes JDBC), and the Flight SQL benefit is mostly invisible to them, paid in latency and throughput improvements rather than API changes. See Dremio as a semantic layer for the federation discussion in context.
Push SIEM vendors for Flight SQL during contract renewals. Same argument as for ADBC. The asks are different (ADBC is a client driver, Flight SQL is a server endpoint), but both are gaps worth closing. Vendors don't ship them because customers don't ask for them, so add both to vendor scorecards. The buying leverage already exists; using it is the part that doesn't happen by default.

Flight is one of the under-discussed foundational standards in the columnar stack. It doesn't get the headlines that Iceberg or Arrow do, and it sits at a layer most architects don't think about until a federation query becomes a bottleneck. But it's the layer where the JDBC-and-ODBC era is most obviously aging out, and it's the layer where the gap between "Arrow-aware" and "actually columnar-native" gets exposed in vendor pitches, so it's worth knowing distinctly and worth asking about by name.