From Tap to Lake: Scalable, Resilient, AI-Ready NetFlow Observability

Enabling Scalable Network Visibility and High-Fidelity Telemetry for AI-Driven Threat Detection

Marco Graziano
13 min read · Apr 6, 2025
Illustration generated with OpenAI DALL·E via ChatGPT

Abstract

As network traffic grows in volume and complexity, security teams need more than legacy NetFlow — they need structured, enriched, and AI-ready telemetry. This post introduces a cloud-native architecture built for scalable, high-fidelity NetFlow collection and analytics, using tools like nProbe, Kafka, Amazon S3, and Bauplan.

By combining deep packet inspection, streaming ingestion, lakehouse analytics, and an MCP Server for AI agents, the system delivers fast, flexible, and cost-efficient visibility across enterprise networks. It’s built to handle today’s scale — and tomorrow’s threats.

1. Introduction

Modern attacks don’t just hit the perimeter — they move laterally, hide in encrypted traffic, and exfiltrate quietly. To detect them, security teams need more than logs or packet captures: they need high-fidelity network telemetry, enriched with application-layer insights.

While commercial NetFlow platforms offer visibility, they often come with steep costs and vendor lock-in. Open-source tools exist too, but many lack the scalability, flexibility, or cloud-native design needed for today’s networks — let alone tomorrow’s AI-assisted operations.

This post introduces a new approach: a modular, cloud-native pipeline for DPI-enriched NetFlow collection and analysis — built with nProbe, Kafka, S3, and Bauplan. It’s designed for long-term retention, low-latency analytics, and future-facing use cases like AI-driven triage and autonomous threat detection.

2. Architecture at a Glance

At the heart of this system is a modular, cloud-native pipeline purpose-built for high-fidelity network telemetry — from NetFlow capture at the edge to structured, queryable data in the cloud.

🟪 nProbe at the Edge: Enriched NetFlow with DPI Context
The pipeline begins with nProbe, a DPI-capable flow exporter developed by ntop. Deployed passively at a network TAP or SPAN port, nProbe inspects traffic in real time and emits NetFlow v9/IPFIX records enriched with application-layer metadata.

Thanks to its built-in nDPI engine, nProbe can classify L7 protocols (including encrypted traffic like QUIC or TLS), generate risk scores based on 56 threat categories, and extract identifiers like JA3/JA4 hashes, SNI, DNS queries, and more. This turns standard flow logs into high-fidelity security sensors — without the overhead of full packet capture.

🟨 Kafka Backbone: Streaming, Schema-Enforced Ingestion
From the edge, flows are sent to a Kafka-based ingestion layer built on Confluent Cloud. Kafka serves as the backbone for high-throughput, real-time telemetry transport. Each flow message is serialized using a JSON Schema definition and validated by Confluent’s Schema Registry, ensuring consistency and forward-compatibility across the pipeline.

🟦 S3 as the Data Lake: Compressed, Long-Term Storage
Kafka consumers persist all flows to Amazon S3 in Parquet format, compressed using Snappy for space efficiency. This forms a durable, low-cost data lake that supports:

  • Multi-year retention for compliance and audit purposes
  • Schema-on-read access for forensics

By decoupling storage from analytics, this model enables cost control without compromising availability or depth of data.

🟧 Bauplan for Analytics: Iceberg Lakehouse Tables
To enable fast, flexible querying over the telemetry, the system uses Bauplan, a data lakehouse platform. Bauplan transforms raw Parquet data from S3 into structured Apache Iceberg tables — organizing flow records into partitioned, query-optimized datasets.
This Lakehouse model powers:

  • Near real-time alerting pipelines
  • Ad hoc investigations over months of telemetry
  • Efficient AI model input for anomaly detection and correlation

Bauplan pipelines transform raw NetFlow records into domain-specific tables like alerts, anomalies_by_host, dns_exfil_candidates, or tls_sni_usage, making it easy to plug into any detection or visualization layer.

🧠 Downstream Consumers: From Dashboards to AI Agents
The structured outputs from Iceberg tables are consumed by:

  • Generative AI agents trained to triage alerts, explain anomalies, and automate initial investigation steps
  • Dashboards and search interfaces for security analysts
  • Custom alerts and detectors triggered from flow-based indicators (e.g., suspicious JA3 + exfil volume)

This layered architecture ensures that network telemetry is not just stored — but used, explained, and acted on, in near real-time.

Schematic Architecture Diagram

3. Sensor Strategy: Enriched NetFlow with nProbe at the Edge

The effectiveness of any network telemetry pipeline starts at the edge — at the point where packets become structured, actionable data. In this architecture, that point is nProbe, a high-performance flow exporter deployed at key visibility points across the network.

🔍 Passive Deployment at SPAN or TAP
nProbe is typically deployed on a lightweight Linux VM or bare-metal appliance attached to a SPAN port (port mirroring) or network TAP. This passive setup ensures full visibility into network traffic without introducing inline latency or risk. It can observe:

  • Internal east-west traffic within data centers or cloud VPCs
  • Ingress/egress flows at network boundaries
  • Specific segments like VPN concentrators or DNS resolvers

This flexibility allows for surgical deployment of sensors in high-risk or high-value zones.

🧠 DPI Enrichment: Structured Risk-Aware Telemetry
Unlike traditional NetFlow exports, nProbe emits a rich set of IPFIX fields that go far beyond basic IPs and ports. Leveraging its internal nDPI engine, nProbe parses each flow in real time and attaches application-layer protocol identifiers, risk indicators, and metadata for encrypted and unencrypted traffic. The result is telemetry that functions as a security sensor feed.

Here are some of the key enriched fields:

  • L7_PROTO: The detected application-layer protocol (e.g., HTTP, QUIC, Telegram)
  • L7_CONFIDENCE: Confidence score (0–100) indicating certainty of L7 protocol detection
  • L7_PROTO_RISK: Bitfield representing known risks (e.g., SQL injection, XSS, DNS tunneling)
  • L7_RISK_SCORE: Integer risk score summarizing the severity of the observed behavior
  • L7_RISK_INFO: Decoded list of risk flags with human-readable labels
  • DNS_QUERY, TLS_SERVER_NAME, HTTP_URL, HTTP_USER_AGENT: Application-level metadata extracted from flows
  • JA4C_HASH: TCP fingerprint hash for encrypted client identification (JA4C)
nProbe supports dozens of additional fields to capture TCP behavior (e.g., retransmissions, out-of-order packets), network directionality, MAC addresses, and even user agent strings in HTTP flows. The entire schema is structured, extensible, and ideal for downstream ingestion by schema-aware systems like Kafka or Iceberg-based tables.

Together, these fields provide the foundation for risk-aware network visibility — enabling threat detection, anomaly scoring, and AI-assisted triage without needing raw packets.
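
To make this concrete, here is a minimal sketch of what a single enriched flow record can look like once decoded downstream, together with a toy triage check. Field names follow the nProbe labels above; the values (and the 80-point threshold) are purely illustrative.

```python
# A minimal sketch: one DPI-enriched flow record as it might appear after
# decoding from the pipeline. All values are illustrative, not real traffic.
enriched_flow = {
    "IPV4_SRC_ADDR": "10.1.20.14",
    "IPV4_DST_ADDR": "185.199.108.153",
    "L4_SRC_PORT": 52114,
    "L4_DST_PORT": 443,
    "PROTOCOL": 6,                       # TCP
    "L7_PROTO": "TLS",                   # detected by the nDPI engine
    "L7_CONFIDENCE": 95,                 # 0-100 detection confidence
    "L7_RISK_SCORE": 85,                 # aggregate risk score (0-100)
    "L7_RISK_INFO": ["Known protocol on non-standard port"],
    "TLS_SERVER_NAME": "cdn.example.net",
    "JA4C_HASH": "ab12cd34",             # 8-character placeholder fingerprint
    "IN_BYTES": 4210,
    "OUT_BYTES": 98234,
}

def is_high_risk(flow: dict, threshold: int = 80) -> bool:
    """Toy triage rule: flag flows whose aggregate risk score exceeds a threshold."""
    return flow.get("L7_RISK_SCORE", 0) >= threshold

print(is_high_risk(enriched_flow))  # True for this illustrative record
```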

🆚 Why nProbe Beats Router-Based NetFlow
While most enterprise switches, routers, and firewalls support exporting NetFlow or sFlow, these built-in exporters come with key limitations:

  • Shallow visibility: Typically limited to L3/4 headers, and no DPI
  • Sampling: Flows are often rate-limited or sampled for performance reasons
  • Lack of customization: Little control over templates or enrichment
  • Hardware lock-in: You’re dependent on what the vendor supports

4. Streaming Pipeline: Kafka as the Real-Time Core

Ingesting high-volume network telemetry — especially enriched flow data with DPI metadata — requires more than just a message bus. It needs a reliable, scalable, and schema-aware transport layer that can handle bursty traffic, fan out to multiple consumers, and enforce data contracts. For this reason, Apache Kafka sits at the center of the system’s real-time streaming pipeline.

⚙️ Why Kafka?
Kafka is built for distributed, high-throughput stream processing, making it ideal for ingesting and distributing telemetry at scale. It enables:

  • Buffering of traffic bursts during high-load periods without data loss
  • Decoupling of producers (nProbe exporters) from consumers (in our case, the S3 sink connector for storage)
  • Backpressure management, allowing slow consumers to catch up without dropping flows
  • Horizontal scalability, ensuring the pipeline can grow with network volume

This makes Kafka a natural backbone for network observability in cloud-native systems.

📜 Schema Validation with Confluent Schema Registry
To ensure consistency across flow messages and downstream processors, each NetFlow message is serialized as JSON and validated against a schema defined in Confluent Schema Registry. This enforces a data contract at the front door of the pipeline. For example, the JSON Schema might define L7_RISK_SCORE as an integer between 0 and 100, or require that JA4C_HASH is always an 8-character string. Any record that violates this contract is rejected at ingest, preventing bad data from contaminating the pipeline.
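
As a hedged sketch of what that contract looks like in practice, the snippet below registers a pared-down JSON Schema and produces a validated flow message with Confluent's Python client. The field constraints mirror the example above; the topic name, endpoints, and credentials are placeholders.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Pared-down JSON Schema for a flow record; constraints are illustrative.
FLOW_SCHEMA = """
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "EnrichedNetFlow",
  "type": "object",
  "properties": {
    "IPV4_SRC_ADDR": {"type": "string"},
    "IPV4_DST_ADDR": {"type": "string"},
    "L7_PROTO": {"type": "string"},
    "L7_RISK_SCORE": {"type": "integer", "minimum": 0, "maximum": 100},
    "JA4C_HASH": {"type": "string", "minLength": 8, "maxLength": 8}
  },
  "required": ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L7_RISK_SCORE"]
}
"""

# Endpoints and credentials are placeholders for a Confluent Cloud setup.
schema_registry = SchemaRegistryClient({
    "url": "https://<sr-endpoint>",
    "basic.auth.user.info": "<key>:<secret>",
})
serializer = JSONSerializer(FLOW_SCHEMA, schema_registry)

producer = Producer({"bootstrap.servers": "<broker>:9092"})
record = {"IPV4_SRC_ADDR": "10.1.20.14", "IPV4_DST_ADDR": "185.199.108.153",
          "L7_PROTO": "TLS", "L7_RISK_SCORE": 85, "JA4C_HASH": "ab12cd34"}

# Serialization raises if the record violates the registered schema.
producer.produce(
    topic="netflow.enriched",
    value=serializer(record, SerializationContext("netflow.enriched", MessageField.VALUE)),
)
producer.flush()
```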

Together, Kafka and Confluent Schema Registry form the reliable, flexible backbone of this pipeline — handling real-time ingestion while keeping data structured, clean, and future-proof.

5. Storage Model: S3 and Parquet for Retention & Compliance

Once flow data is ingested, the next challenge is storage — at scale, cost-effectively, and in a format that supports both long-term compliance and rapid analytics. Rather than relying on traditional data warehouses or specialized logging platforms, this architecture takes a cloud-native, decoupled approach, storing telemetry as compressed files in Amazon S3.

💾 Why Not a Traditional Data Warehouse?
Data warehouses like BigQuery, Snowflake, or Redshift offer convenience — but they come at a cost:

  • Price scales with ingestion and query volume, not retention
  • Limited schema flexibility makes it harder to adapt as NetFlow formats evolve
  • Less transparency into performance tuning, indexing, and query paths
  • Lock-in to specific tools, user interfaces, and billing models

In contrast, storing data in open formats on S3 means we control the lifecycle, layout, and access models — independent of any single vendor or toolchain.

🧱 S3 as the Data Lake: Durable, Cost-Efficient, Compliant
Amazon S3 is ideal for multi-year retention of telemetry, offering:

  • Virtually unlimited scale
  • Durable (11 9s) storage across availability zones
  • Built-in lifecycle policies for automatic tiering (e.g., to Glacier for archive compliance)
  • Low storage costs compared to database or warehouse retention

This allows us to store all flows, even those not currently queried, without worrying about runaway costs.
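
To illustrate how that tiering is wired up, here is a minimal boto3 sketch that transitions Parquet flow files to Glacier Deep Archive after 90 days and expires them after roughly three years. The bucket name, prefix, and retention windows are assumptions to be adapted to your own compliance requirements.

```python
import boto3

s3 = boto3.client("s3")

# Bucket, prefix, and retention windows are illustrative assumptions.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-netflow-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-netflow-parquet",
                "Filter": {"Prefix": "netflow/"},
                "Status": "Enabled",
                # After 90 days, move to Glacier Deep Archive for cheap long-term retention.
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
                # Expire after ~3 years once the compliance window has passed.
                "Expiration": {"Days": 1095},
            }
        ]
    },
)
```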

📦 Parquet + Snappy: Structured, Compressed, and Fast
Flow data is written to S3 in Apache Parquet format, a columnar file format optimized for analytical workloads. Combined with Snappy compression, this approach delivers:

  • 80–90% storage savings over JSON or CSV
  • Excellent read performance for specific fields (e.g., just L7_RISK_SCORE, DST_IP)
  • Compatibility with Spark, DuckDB, Trino, Athena, and Bauplan

Each Parquet file is schema-aligned to the Kafka topic schema, enabling strong data guarantees and easy evolution over time.
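
The practical payoff of the columnar layout is that an investigation only pays for the columns it touches. A minimal pyarrow sketch (the S3 path is a placeholder; column names follow the enriched fields discussed earlier):

```python
import pyarrow.dataset as ds

# Placeholder S3 path; in practice this is the prefix written by the Kafka-to-S3 sink.
flows = ds.dataset("s3://example-netflow-lake/netflow/", format="parquet")

# Columnar reads: only the referenced columns (and matching row groups) are fetched.
high_risk = flows.to_table(
    columns=["IPV4_DST_ADDR", "L7_PROTO", "L7_RISK_SCORE"],
    filter=ds.field("L7_RISK_SCORE") >= 80,
)
print(high_risk.num_rows)
```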

🧭 Why Not Use Athena Directly?
Amazon Athena provides serverless SQL over S3, but it’s not the best fit for real-time security analytics. Issues include:

  • Query latency (cold start penalties, especially for small ad hoc queries)
  • Partitioning limitations, requiring manual tuning to avoid scanning huge files
  • Limited join performance and caching for complex correlations

Instead, this architecture uses Bauplan and Iceberg (covered next) to layer a true Lakehouse on top of S3 — delivering low-latency querying, time-travel, and schema evolution.

6. Analytics Layer: Lakehouse with Bauplan + Iceberg

Raw flow data is valuable, but without structure and context, it’s just noise. To extract real insight — at speed and scale — this architecture leverages a modern Lakehouse model, transforming semi-structured telemetry into clean, queryable datasets. At the core of this layer is Bauplan, a data lakehouse platform designed for scalable ELT and real-time analytics.

🛠️ Declarative Pipelines with Bauplan
Bauplan allows us to define transformation logic as simple declarative pipelines, rather than writing raw SQL or managing distributed jobs manually. Each pipeline:

  • Ingests Parquet data from S3
  • Applies models, mappings, joins, or enrichment logic
  • Writes to Apache Iceberg tables in-place

This makes it easy to maintain complex logic, version transformations, and onboard new data products with minimal friction — ideal for fast-moving security use cases.

🧊 Iceberg Tables: Modern Lakehouse Foundations
Apache Iceberg is the backbone format for the Lakehouse layer. Compared to traditional Hive-style tables or even warehouse-native tables, Iceberg offers:

  • Partitioning without the pain of rigid directory layouts
  • Time-travel queries, allowing analysts or models to compare states over time
  • Schema evolution with backward compatibility (e.g., adding new risk fields)
  • Support for streaming inserts and low-latency reads

This means the analytics stack can grow and evolve as the telemetry and detection logic evolves — without needing to rebuild or reload entire tables.

📊 From Raw Flow to Anomalies
For example, a simple Bauplan pipeline might:

  • Filter out internal-only traffic
  • Group flows by SRC_IP, DST_PORT, and L7_PROTO
  • Aggregate L7_RISK_SCORE over a 5-minute window
  • Flag entries that exceed a risk threshold
  • Write the results into a table called anomalies_by_host

These tables power downstream dashboards, alerts, and AI agents without any custom ETL scripts or external schedulers.
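
Below is a hedged sketch of such a pipeline written as a Bauplan Python model. The decorator style follows Bauplan's SDK as I understand it and may differ between versions; the parent table name, field names, window, and risk threshold are illustrative.

```python
import bauplan

# A sketch of a Bauplan model producing the anomalies_by_host table.
# Decorator names follow Bauplan's Python SDK as I understand it; the parent
# table name, window, and threshold are illustrative assumptions.
@bauplan.model()
@bauplan.python("3.11", pip={"pandas": "2.2.0"})
def anomalies_by_host(flows=bauplan.Model("enriched_netflow")):
    import pandas as pd

    df = flows.to_pandas()

    # 1. Filter out internal-only traffic (toy predicate: keep flows leaving 10.0.0.0/8).
    df = df[~df["IPV4_DST_ADDR"].str.startswith("10.")]

    # 2. Aggregate risk over 5-minute windows per source, destination port, and L7 protocol.
    df["window"] = pd.to_datetime(df["FLOW_START_MILLISECONDS"], unit="ms").dt.floor("5min")
    agg = (
        df.groupby(["window", "IPV4_SRC_ADDR", "L4_DST_PORT", "L7_PROTO"], as_index=False)
          .agg(total_risk=("L7_RISK_SCORE", "sum"), flows=("L7_RISK_SCORE", "size"))
    )

    # 3. Keep only windows that exceed the (illustrative) risk threshold.
    return agg[agg["total_risk"] >= 500]
```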

⚡ Near Real-Time Detection & Low-Latency Queries
Because the flow data is ingested continuously and Iceberg supports streaming compaction and fast query execution, the system can surface anomalies within seconds to minutes after the traffic is seen. This enables:

  • Fast triage of high-risk flows
  • Live dashboards over recent telemetry
  • Query patterns like “show all new JA3 hashes to suspicious domains in the last hour”

All while keeping historical data available for deeper forensic investigation or training ML models.
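
As a concrete illustration of that last query pattern, here is roughly what the SQL could look like when run through Bauplan's Python client. The client call, the baseline and suspicious-domain tables, and the column names are assumptions made for the sake of the example, and the exported JA4C hash stands in for JA3.

```python
import bauplan

# Client usage follows Bauplan's Python SDK as I understand it (an assumption);
# table and column names are illustrative.
client = bauplan.Client()

NEW_JA4_TO_SUSPICIOUS = """
SELECT f.JA4C_HASH,
       f.TLS_SERVER_NAME,
       COUNT(*)         AS flows,
       SUM(f.OUT_BYTES) AS bytes_out
FROM   tls_sni_usage AS f
JOIN   suspicious_domains AS d
       ON f.TLS_SERVER_NAME = d.domain
WHERE  f.flow_start >= NOW() - INTERVAL '1' HOUR
  AND  f.JA4C_HASH NOT IN (SELECT hash FROM known_ja4_baseline)
GROUP  BY f.JA4C_HASH, f.TLS_SERVER_NAME
ORDER  BY bytes_out DESC
"""

result = client.query(NEW_JA4_TO_SUSPICIOUS)  # returns an Arrow table
print(result.to_pandas().head())
```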

🧠 Bridging to AI: The MCP Server
To make this data actionable by generative AI agents, we wrote a custom MCP (Model Context Protocol) server that exposes Bauplan’s metadata and query interfaces over an API designed for LLM consumption. This service allows agents to:

  • Discover available Iceberg tables and fields
  • Run structured queries with guardrails
  • Retrieve anomalies or flow summaries as inputs to reasoning chains

This bridges the gap between structured telemetry and autonomous investigation — laying the groundwork for AI-assisted detection, correlation, and narrative generation.

🔍 AI Integration: Querying the Lakehouse with the MCP Server
To make structured telemetry accessible to large language models (LLMs) and autonomous agents, we introduced a lightweight Model Context Protocol (MCP) Server. This service acts as a translator between natural language intent and secure, schema-aware data access, built specifically for AI consumption.
The MCP server provides tools to:

  • Discover schemas, columns, and data types from Bauplan-managed Iceberg tables
  • Query the data lake through a layer that validates, executes, and guards SQL generated by LLMs
  • Return responses as clean JSON, structured for downstream parsing, visualization, or reasoning
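
Here is a minimal sketch of what that tool surface can look like, using the FastMCP helper from the MCP Python SDK. The guardrails are deliberately simplified, and the table whitelist and Bauplan client call are illustrative assumptions rather than the production implementation.

```python
from mcp.server.fastmcp import FastMCP
import bauplan

mcp = FastMCP("netflow-lakehouse")   # server name is illustrative
client = bauplan.Client()            # Bauplan SDK usage is an assumption

ALLOWED_TABLES = {"alerts", "anomalies_by_host", "dns_exfil_candidates", "tls_sni_usage"}

@mcp.tool()
def list_tables() -> list[str]:
    """Expose the curated Iceberg tables an agent is allowed to query."""
    return sorted(ALLOWED_TABLES)

@mcp.tool()
def run_query(sql: str) -> str:
    """Execute LLM-generated SQL with toy guardrails: read-only, whitelisted tables."""
    lowered = sql.lower()
    if not lowered.strip().startswith("select"):
        return '{"error": "only SELECT statements are allowed"}'
    if not any(table in lowered for table in ALLOWED_TABLES):
        return '{"error": "query must target an allowed table"}'
    # Return results as JSON for downstream parsing by the agent.
    return client.query(sql).to_pandas().to_json(orient="records")

if __name__ == "__main__":
    mcp.run()  # serves the tools over MCP (stdio by default)
```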

🔐 Why the MCP Layer Matters
This integration turns raw flow data into a structured reasoning substrate for AI, enabling use cases like:

  • Flow anomaly triage
  • Autonomous alert prioritization
  • Summarization of high-risk hosts
  • Narrative generation for analyst reports

7. Benefits of the Architecture

This system wasn’t designed as a proof-of-concept — it was built to scale, evolve, and serve both analysts and AI in real-world environments. By combining high-fidelity flow data, modern cloud infrastructure, and an AI-ready data interface, it achieves a level of visibility and flexibility that few traditional platforms can match.

Here’s what this architecture delivers:
✅ Scalability by Design

  • Handles millions of enriched flows per hour
  • Horizontal scaling via Kafka and S3-backed storage
  • No bottlenecks tied to a monolithic backend

✅ Deep Visibility without Full Packet Capture

  • DPI metadata from nProbe provides application-layer context without payload retention
  • Fields like L7_RISK_SCORE, TLS_SERVER_NAME, and JA4C_HASH support advanced detection with minimal overhead

✅ Cost-Efficient Long-Term Retention

  • Parquet + Snappy + S3 makes it possible to store months to years of telemetry
  • No per-query or per-record cost explosion typical of SaaS or warehouse billing models
  • Lifecycle policies and cold storage options ensure compliance without waste

✅ Flexible Analytics with Lakehouse Tools

  • Apache Iceberg enables schema evolution, time-travel queries, and real-time inserts
  • Bauplan simplifies pipeline logic and manages data transformations declaratively
  • No vendor lock-in — everything runs on open formats and open standards

✅ LLM-Ready Data Access

  • The custom MCP server bridges telemetry to AI agents and tools
  • Structured, secure access to threat data supports:
      • Autonomous investigation
      • Narrative generation
      • Alert triage and summarization

✅ Vendor-Neutral, Modular, and Future-Proof

  • Each component — nProbe, Kafka, S3, Bauplan — can be swapped or extended
  • Avoids lock-in while staying aligned with modern data engineering and cloud-native best practices
  • Designed to evolve as new AI, storage, or detection techniques emerge

This architecture isn’t just a backend. It’s a platform for ongoing security intelligence, built for humans, automation, and the AI agents that increasingly augment both.

8. Lessons Learned & Future Work

Building this pipeline was as much about architecture as it was about operational maturity. While the components — nProbe, Kafka, S3, Bauplan — each served their purpose well, putting them together in a production-grade, AI-augmented workflow required hands-on iteration, performance tuning, and a lot of learning along the way.

🛠️ Operationalizing nProbe at Scale
Deploying nProbe across multiple SPAN and TAP points revealed some real-world friction points:

  • Tuning nProbe for high-throughput environments (multi-Gbps) required careful CPU pinning and PF_RING acceleration
  • Because nProbe lacks the native ability to authenticate over TLS when exporting to Kafka, we developed a relay that streams flow records to Confluent Kafka using SASL/PLAIN over an encrypted connection (a minimal sketch follows this list)
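
For reference, here is a stripped-down sketch of that relay. It assumes nProbe has been configured to emit newline-delimited JSON flow records to a local listener (the exact export mechanism depends on your nProbe build and configuration); brokers, credentials, and the topic name are placeholders.

```python
import socket
from confluent_kafka import Producer

# Confluent Cloud connection; endpoints and credentials are placeholders.
producer = Producer({
    "bootstrap.servers": "<broker>.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
})

# Assumes nProbe exports newline-delimited JSON flows to this local port.
listener = socket.create_server(("127.0.0.1", 5556))
conn, _ = listener.accept()
buffer = b""

while True:
    chunk = conn.recv(65536)
    if not chunk:
        break
    buffer += chunk
    # Forward each complete JSON line to the enriched-flow topic.
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if line.strip():
            producer.produce("netflow.enriched", value=line)
    producer.poll(0)  # serve delivery callbacks, keep the producer healthy

producer.flush()
```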

That said, nProbe’s ability to generate consistent, enriched flow records across diverse environments made it one of the most valuable and flexible components in the stack.

💰 Balancing Cost in the Cloud
With great observability comes great data volume. We quickly had to make smart choices around:

  • S3 tiering policies, moving old Parquet files into Glacier Deep Archive after 90 days
  • Parquet file compaction, to reduce small-file overhead and optimize Athena/Trino-style access
  • Kafka retention tuning, ensuring consumers could keep up without incurring unnecessary storage lag

These trade-offs were worth it: we achieved multi-month telemetry retention for less than the cost of a single commercial threat analytics license.

🤖 The Future is AI-Driven
One of the most exciting (and still evolving) parts of the system is the integration with AI agents. With structured flow data, rich metadata, and the MCP server as a control surface, we’re laying the foundation for true LLM-powered detection and triage.

This is the focus of our upcoming companion story:
📘 Flow to Insight: AI-Driven Threat Detection with Generative Agents
Where we show how LLMs can interpret anomalies, suggest remediations, and generate investigation narratives — autonomously.

9. Conclusion

By building a NetFlow pipeline from tap to lake, we unlocked a scalable, resilient, and insight-rich foundation for security observability — designed not only for today’s traffic, but also for tomorrow’s threats.

This architecture brings together high-fidelity flow data from nProbe, real-time ingestion via Kafka, cost-efficient long-term storage in S3, and fast, structured analytics powered by Bauplan and Iceberg. But more than just a telemetry pipeline, it forms the data substrate for AI — enabling agents and LLMs to reason about risk, correlate signals, and surface actionable insights automatically.

With its modular design, open standards, and AI-ready interface, the system is built to evolve. Whether it’s onboarding new sensors, integrating additional signal types like VPC flows or EDR traces, or scaling up autonomous threat investigation with generative agents — this pipeline is ready for what’s next.

Security is no longer just about collecting data. It’s about making sense of it — fast, flexibly, and intelligently. This architecture is our answer to that challenge.

Written by Marco Graziano

Engineer, apprentice renaissance man. I am the founder of technology start-ups in Palo Alto and of Graziano Labs Corp.
