Real-time Analysis of Network Flow Data with Apache Spark on Databricks

Network monitoring at scale is a big data problem that requires a big data platform¹. This post describes how to perform real-time analysis of network flow metadata while storing it in a data lake on AWS S3, using Spark on Databricks.



Network flows composed of only a few attributes are a thing of the past. Modern network flow collectors, such as the high-performance and inexpensive nProbe from the Italian company NTOP², can inspect packets as an inline tool and extract Layer-4 to Layer-7 metadata as network flows, which is where it counts, at speeds of up to 100 Gbit/s on commodity hardware.

This post describes an architecture and the software used to perform real-time analysis of network flows from a number of nProbe collectors, while storing the data in a data lake on S3 in a way that allows for fast retrieval.

System Architecture

Network Flows Format

For this setup, in addition to selecting JSON as the format and exporting the flows to the Kafka server, 50 fields are specified in the nprobe.conf configuration file, including a custom field with a constant to identify the collector:

Ingestion Pipeline

Reading from Kafka

$ ./bin/ --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic netflows
$ ./bin/ --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic netflows-events

The first topic, netflows, is used by the collectors to publish the NetFlows. The second topic, netflows-events, is used in the pipeline to publish real-time suspicious detections for downstream processing.

In the Databricks notebook, it is convenient to mount an existing, empty, S3 bucket to the Databricks File System (DBFS) allowing access to objects in object storage as if they were on the local file system:
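As a minimal sketch of such a mount, the helper below wraps the Databricks dbutils.fs.mount() and dbutils.fs.mounts() calls. The function name, the bucket name and the mount name are hypothetical, and the dbutils object is passed in explicitly because it only exists inside a Databricks notebook.

```python
def mount_s3_bucket(dbutils, bucket_name, mount_name):
    """Mount s3a://<bucket_name> at /mnt/<mount_name> unless already mounted.

    `dbutils` is the utility object that Databricks notebooks provide.
    Bucket and mount names are placeholders chosen for illustration.
    """
    mount_point = f"/mnt/{mount_name}"
    # dbutils.fs.mounts() lists current mounts; each entry exposes mountPoint.
    already_mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())
    if not already_mounted:
        dbutils.fs.mount(source=f"s3a://{bucket_name}", mount_point=mount_point)
    return mount_point


# In a notebook cell (hypothetical names):
#   mount_s3_bucket(dbutils, "my-netflows-bucket", "netflows")
```

Checking the existing mounts first keeps the cell safe to re-run, since mounting an already mounted path raises an error.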

Note: for proper S3 authentication, both the AWS key and secret need to be set as environment variables in the Databricks cluster, or added to the bucket’s path. Also, the newer s3a connector, which supersedes the s3n native connector and uses Amazon’s libraries, is the preferred Spark S3 connector.

Kafka is a native data source in Spark and therefore it is fairly straightforward to read a stream from a Kafka topic into a Spark dataframe:
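A minimal PySpark sketch of such a read is shown here. The broker address localhost:9092, the startingOffsets value and the function names are assumptions; only the netflows topic name comes from the setup above. The spark session is passed in as an argument, as it would be available in a notebook.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's built-in Kafka source (values are assumptions)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",  # only read flows published from now on
    }


def read_netflows(spark, bootstrap_servers="localhost:9092", topic="netflows"):
    """Return a streaming DataFrame with the Kafka timestamp and the JSON text."""
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as bytes, so cast it to a string column.
    return raw.selectExpr("timestamp", "CAST(value AS STRING) AS value")
```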

The schema of jsonDF has two attributes: an arrival timestamp added by Kafka and the value, which is a string containing the JSON NetFlow.

Data Enrichment

With the defined schema, it is then possible to parse the JSON held in the value column using the from_json() function, and subsequently access the data using column names like json.IN_PKTS and json.IPV4_SRC_ADDR. This is also an opportunity to flatten the schema:


  • Use a projection to keep only the 46 attributes needed for storage and analysis
  • In the process, assign columns simpler names

Also note:

  • The L7_PROTO column contains a decimal number of the form XXX.YYY indicating the master protocol XXX and application protocol YYY. This is split into two distinct columns using the integer and decimal parts.
  • A Duration column is added, obtained by subtracting FLOW_START_MILLISECONDS from FLOW_END_MILLISECONDS.
  • The FLOW_START_MILLISECONDS and FLOW_END_MILLISECONDS columns contain Epoch time, which is converted to a Timestamp.
  • The Kafka arrival timestamp is removed, as the FLOW_START_MILLISECONDS and FLOW_END_MILLISECONDS timestamps are more accurate.
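The per-record logic of these enrichment steps can be sketched in plain Python, outside Spark, to make the transformations concrete. This is an illustration rather than the pipeline code: the output names (src_ip, l7_master_proto and so on) and the sample values are hypothetical, and only a handful of the attributes are shown.

```python
import json
from datetime import datetime, timezone


def enrich_flow(value):
    """Parse one JSON NetFlow string and derive the enriched attributes."""
    # parse_float=str keeps L7_PROTO exactly as sent, e.g. "7.178".
    flow = json.loads(value, parse_float=str)
    # Integer part is the master protocol, decimal part the application protocol.
    master, _, app = str(flow["L7_PROTO"]).partition(".")
    start_ms = int(flow["FLOW_START_MILLISECONDS"])
    end_ms = int(flow["FLOW_END_MILLISECONDS"])
    return {
        "src_ip": flow["IPV4_SRC_ADDR"],  # simpler, hypothetical column names
        "in_pkts": int(flow["IN_PKTS"]),
        "l7_master_proto": int(master),
        "l7_app_proto": int(app or 0),
        "start_time": datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc),
        "end_time": datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc),
        "duration_ms": end_ms - start_ms,
    }


# Hypothetical sample record, for illustration only.
sample = (
    '{"IPV4_SRC_ADDR": "10.0.0.1", "IN_PKTS": 12, "L7_PROTO": 7.178, '
    '"FLOW_START_MILLISECONDS": 1588291200000, '
    '"FLOW_END_MILLISECONDS": 1588291201500}'
)
enriched = enrich_flow(sample)
```

Splitting L7_PROTO as text rather than arithmetically avoids floating-point rounding when separating the two protocol identifiers.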

Save NetFlows to S3


  • Using partitionBy() as above results in Hive-style partitioning, which speeds up queries that filter on the partition columns
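To illustrate what Hive-style partitioning means for the object layout, the plain-Python sketch below builds and parses paths of the form column=value. The partition columns date and hour, the base path and the function names are assumptions made for this example; the columns actually used by the pipeline may differ.

```python
from datetime import datetime, timezone


def hive_partition_path(base_path, start_time):
    """Directory a flow would land in when partitioned by date and hour."""
    return f"{base_path}/date={start_time:%Y-%m-%d}/hour={start_time:%H}"


def partition_values(path):
    """Recover the partition columns encoded in a Hive-style path."""
    pairs = (part.split("=", 1) for part in path.split("/") if "=" in part)
    return {column: value for column, value in pairs}


path = hive_partition_path(
    "/mnt/netflows/flows", datetime(2020, 5, 1, 13, tzinfo=timezone.utc)
)
```

Because the partition values are part of the path, a query that filters on them only has to read the matching directories instead of scanning every object.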

A short list of S3 objects resulting from the workflow:

NetFlows Ingestion Rate

The chart below was obtained by running the code above (without the watermark) as a query on four hours of NetFlows held on S3. It shows that the cluster used for the ingestion can sustain 15k flows per second:

Real-time Analysis

Save Rollups to S3

The following is an example of creating one-minute host statistics and saving them to S3 in a JSON file:
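The aggregation itself can be sketched in plain Python as a tumbling one-minute window keyed by host. This is not the Spark streaming job, which would group by a time window and the host column; the output field names and the sample flows are hypothetical.

```python
import json
from collections import defaultdict

MINUTE_MS = 60_000


def minute_rollups(flows):
    """Aggregate flows into one-minute, per-host statistics as JSON lines."""
    stats = defaultdict(lambda: {"flows": 0, "in_pkts": 0})
    for flow in flows:
        # Truncate the start time to the beginning of its one-minute window.
        window_start = flow["FLOW_START_MILLISECONDS"] // MINUTE_MS * MINUTE_MS
        entry = stats[(window_start, flow["IPV4_SRC_ADDR"])]
        entry["flows"] += 1
        entry["in_pkts"] += flow["IN_PKTS"]
    return [
        json.dumps({"window_start_ms": window, "host": host, **values})
        for (window, host), values in sorted(stats.items())
    ]


# Hypothetical flows: two in the same minute, one in the next.
rollups = minute_rollups([
    {"IPV4_SRC_ADDR": "10.0.0.1", "IN_PKTS": 10, "FLOW_START_MILLISECONDS": 1588291200000},
    {"IPV4_SRC_ADDR": "10.0.0.1", "IN_PKTS": 20, "FLOW_START_MILLISECONDS": 1588291230000},
    {"IPV4_SRC_ADDR": "10.0.0.1", "IN_PKTS": 5, "FLOW_START_MILLISECONDS": 1588291260000},
])
```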

The following is an example of some statistics in the JSON files produced:


The examples above are representative of how to provide real-time detection, network traffic analysis, and data aggregates using a big data approach with streaming data. Given the plethora of attributes in the NetFlow format provided by modern collectors, it is easy to imagine the power of this approach in giving network visibility to security teams and, looking ahead, as a foundation for an AIOps system as well.

Databricks is a mature platform that removes the difficulties associated with integrating libraries and deploying a Spark application in a cluster. It also provides a convenient JDBC connector out of the box. Databricks also offers an intensive online training program that is relatively expensive but very good.

In a follow-on article, a real-time dashboard for network flow data will be presented.

[1] A. D’Alconzo, I. Drago, A. Morichetta, M. Mellia and P. Casas, “A survey on big data for network traffic monitoring and analysis”, IEEE Trans. Netw. Service Manag., vol. 16, no. 3, pp. 800–813, Sep. 2019.



