Analytics Dashboard of Network Flow Data with Apache Spark on Databricks

Marco Graziano
May 12, 2021

This post is the continuation of:

Real-time Analysis of Network Flow Data with Apache Spark on Databricks

The first post described how to store the network flows in a data lake on S3 and perform real-time analysis on the stream. This post is about querying the data lake to build an analytics dashboard using Apache Spark on Databricks. Refer to the first post for details of the system architecture, the data lake, and the cloud infrastructure used for running these workflows.

Access the S3 bucket in Databricks

In the Databricks notebook, the easiest way to access a data lake on S3 is to mount the bucket to the Databricks File System (DBFS), which allows access to objects in object storage as if they were on the local file system:

Note: for proper S3 authentication, both the AWS key and secret need to be set as Spark environment variables in the Databricks cluster, or added to the bucket’s path.
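The original notebook cell is not reproduced here. As a minimal sketch of the mount step, the helper below mounts a bucket only if it is not mounted already. The bucket name and mount name are hypothetical placeholders, and `dbutils` is passed in as an argument because it only exists inside a Databricks notebook. It assumes the credentials are already configured on the cluster as described in the note above.

```python
def mount_bucket(dbutils, bucket_name, mount_name):
    """Mount an S3 bucket on DBFS unless it is already mounted.

    bucket_name and mount_name are placeholders chosen by the caller;
    they are not taken from the original post.
    Returns the DBFS mount point.
    """
    mount_point = f"/mnt/{mount_name}"
    # dbutils.fs.mounts() lists the current mounts; each entry has a mountPoint
    already_mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())
    if not already_mounted:
        # Credentials are assumed to come from the cluster configuration
        dbutils.fs.mount(f"s3a://{bucket_name}", mount_point)
    return mount_point
```

In a notebook this could be called as `display(dbutils.fs.ls(mount_bucket(dbutils, "my-netflow-bucket", "netflows")))`, where both names are illustrative.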

The dbutils.fs.ls call returns the following listing:

Dataset Description

The whole network flow dataset is loaded into a Spark dataframe and its schema is printed:

root
 |-- PROBEID: string (nullable = true)
 |-- IPV4_SRC_ADDR: string (nullable = true)
 |-- IPV4_DST_ADDR: string (nullable = true)
 |-- L4_SRC_PORT: long (nullable = true)
 |-- L4_DST_PORT: long (nullable = true)
 |-- IN_SRC_MAC: string (nullable = true)
 |-- OUT_DST_MAC: string (nullable = true)
 |-- IPV6_SRC_ADDR: string (nullable = true)
 |-- IPV6_DST_ADDR: string (nullable = true)
 |-- PROTOCOL: long (nullable = true)
 |-- IP_PROTOCOL_VERSION: long (nullable = true)
 |-- L7_APP: string (nullable = true)
 |-- L7_PROTO: string (nullable = true)
 |-- TLS_VERSION: long (nullable = true)
 |-- IN_PKTS: long (nullable = true)
 |-- IN_BYTES: long (nullable = true)
 |-- OUT_PKTS: long (nullable = true)
 |-- OUT_BYTES: long (nullable = true)
 |-- Duration: double (nullable = true)
 |-- startTime: timestamp (nullable = true)
 |-- endTime: timestamp (nullable = true)
 |-- SRC_VLAN: long (nullable = true)
 |-- CLIENT_TCP_FLAGS: long (nullable = true)
 |-- SERVER_TCP_FLAGS: long (nullable = true)
 |-- L7_PROTO_RISK: long (nullable = true)
 |-- CLIENT_NW_LATENCY_MS: double (nullable = true)
 |-- SERVER_NW_LATENCY_MS: double (nullable = true)
 |-- APPL_LATENCY_MS: double (nullable = true)
 |-- TCP_WIN_MAX_IN: long (nullable = true)
 |-- TCP_WIN_MAX_OUT: long (nullable = true)
 |-- OOORDER_IN_PKTS: long (nullable = true)
 |-- OOORDER_OUT_PKTS: long (nullable = true)
 |-- RETRANSMITTED_IN_PKTS: long (nullable = true)
 |-- RETRANSMITTED_OUT_PKTS: long (nullable = true)
 |-- SRC_FRAGMENTS: long (nullable = true)
 |-- DST_FRAGMENTS: long (nullable = true)
 |-- DNS_QUERY: string (nullable = true)
 |-- DNS_QUERY_TYPE: string (nullable = true)
 |-- DNS_RET_CODE: string (nullable = true)
 |-- HTTP_URL: string (nullable = true)
 |-- HTTP_SITE: string (nullable = true)
 |-- HTTP_METHOD: string (nullable = true)
 |-- HTTP_RET_CODE: string (nullable = true)
 |-- TLS_SERVER_NAME: string (nullable = true)
 |-- SRC_TOS: long (nullable = true)
 |-- DST_TOS: long (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- hour: integer (nullable = true)
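A load that yields a schema like the one above might look like the following sketch. The storage format is an assumption: Parquet is used as the default here, but the original post does not state it in this section, so the format is left as a parameter. The trailing year, month, day and hour columns look like partition columns, which Spark would normally discover from the directory layout on its own.

```python
def load_flows(spark, path, file_format="parquet"):
    """Read every file under `path` into a single DataFrame.

    `file_format` defaults to Parquet as an assumption; pass "json",
    "orc", etc. if the data lake uses a different format.
    """
    return spark.read.format(file_format).load(path)
```

In a notebook, `flows = load_flows(spark, "/mnt/netflows")` followed by `flows.printSchema()` would print the schema; the mount path is illustrative.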

Statistics on the roughly four hours of network flows captured in the bucket are obtained with the following code:

This step took a little more than 30 seconds on the four-worker Spark cluster described in the first post. The action scans the whole dataset of 102 million flows, and Spark caches all of them, so any subsequent query takes advantage of the cache and executes faster.
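The statistics and caching step described above can be sketched as follows. This is not the author's code: the choice of `describe` and the list of summarized columns are assumptions made for illustration.

```python
# Columns to summarize; an illustrative subset of the schema, not the author's list
SUMMARY_COLUMNS = ("IN_BYTES", "OUT_BYTES", "IN_PKTS", "OUT_PKTS", "Duration")

def flow_statistics(flows, columns=SUMMARY_COLUMNS):
    """Mark the DataFrame for caching and compute basic statistics.

    Returns (cached_flows, stats), where stats holds count, mean,
    stddev, min and max for each requested column.
    """
    cached = flows.cache()  # only marks the data for caching
    # Computing the statistics scans the dataset, which fills the cache
    stats = cached.describe(*columns)
    return cached, stats
```

In a notebook, `flows, stats = flow_statistics(flows)` followed by `display(stats)` shows the summary table, and later queries on `flows` read from the cache.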

Select a Time Range

A time range can be selected using an interactive Databricks widget directly in the notebook, and a new dataframe is created by filtering on the specified start and end times:
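One way to turn the widget values into a filter is sketched below. The timestamp format and the choice to select flows by their `startTime` alone are assumptions; the widget names used in the usage note are hypothetical as well. Parsing the strings first rejects malformed input before it reaches Spark.

```python
from datetime import datetime

# Assumed input format for the widget values
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def time_range_condition(start, end):
    """Build a Spark SQL filter selecting flows whose startTime is in [start, end)."""
    start_dt = datetime.strptime(start, TIME_FORMAT)
    end_dt = datetime.strptime(end, TIME_FORMAT)
    if start_dt >= end_dt:
        raise ValueError("start must be earlier than end")
    return (
        f"startTime >= TIMESTAMP '{start_dt.strftime(TIME_FORMAT)}' "
        f"AND startTime < TIMESTAMP '{end_dt.strftime(TIME_FORMAT)}'"
    )
```

In a notebook, the widgets could be created with `dbutils.widgets.text("start", "2021-05-01 00:00:00", "Start time")` and a matching `"end"` widget, and the filtered dataframe obtained with `flows.filter(time_range_condition(dbutils.widgets.get("start"), dbutils.widgets.get("end")))`.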

Network Traffic Statistics

In the following sections, some of the most relevant network traffic statistics derived from the flows are extracted and visualized in the dashboard.

All Traffic (bytes/sec)

All Traffic (packets/sec)
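The two rate charts can be derived from a query like the one built below. This is a sketch under assumptions: the one-minute bucket is an arbitrary choice, the view name is hypothetical, and each flow's counters are attributed entirely to the bucket containing its `startTime`, which is a simplification for flows that span several buckets.

```python
def traffic_rate_sql(view, in_col, out_col, bucket_seconds=60):
    """Return a Spark SQL query for the per-second rate of in_col + out_col.

    Flows are grouped into fixed time buckets by startTime and the summed
    counters are divided by the bucket length in seconds.
    """
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    return (
        f"SELECT window.start AS bucket_start, "
        f"SUM(COALESCE({in_col}, 0) + COALESCE({out_col}, 0)) / {bucket_seconds} "
        f"AS per_second "
        f"FROM {view} "
        f"GROUP BY window(startTime, '{bucket_seconds} seconds') "
        f"ORDER BY bucket_start"
    )
```

After registering the dataframe with `flows.createOrReplaceTempView("flows")`, the bytes chart would come from `display(spark.sql(traffic_rate_sql("flows", "IN_BYTES", "OUT_BYTES")))` and the packets chart from the same call with `"IN_PKTS"` and `"OUT_PKTS"`.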

Top 10 Destination IP Addresses

Top 10 Source IP Addresses
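Both top-10 lists follow the same pattern, sketched below as a query builder. Ranking by total bytes is an assumption; the original charts may rank by flow count instead. The view name is hypothetical.

```python
# Address columns from the schema that this sketch accepts
ADDRESS_COLUMNS = {"IPV4_SRC_ADDR", "IPV4_DST_ADDR"}

def top_talkers_sql(view, address_col, n=10):
    """Return a Spark SQL query ranking addresses by total bytes transferred."""
    if address_col not in ADDRESS_COLUMNS:
        raise ValueError(f"unsupported address column: {address_col}")
    return (
        f"SELECT {address_col}, COUNT(*) AS flow_count, "
        f"SUM(COALESCE(IN_BYTES, 0) + COALESCE(OUT_BYTES, 0)) AS total_bytes "
        f"FROM {view} "
        f"WHERE {address_col} IS NOT NULL "
        f"GROUP BY {address_col} "
        f"ORDER BY total_bytes DESC "
        f"LIMIT {n}"
    )
```

With the dataframe registered as a temporary view named `flows`, `display(spark.sql(top_talkers_sql("flows", "IPV4_DST_ADDR")))` gives the destination ranking and `"IPV4_SRC_ADDR"` gives the source ranking.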

Protocol Statistics

Read a table with the protocol codes:

+---+-----------+-------+----------+--------------------+
| id|Id Protocol|Layer_4|     Breed|            Category|
+---+-----------+-------+----------+--------------------+
|  0|    Unknown|    TCP|   Unrated|         Unspecified|
|  1|FTP_CONTROL|    TCP|    Unsafe|Download-FileTran...|
|  2|     2 POP3|    TCP|    Unsafe|               Email|
|  3|     3 SMTP|    TCP|Acceptable|               Email|
|  4|     4 IMAP|    TCP|    Unsafe|               Email|
+---+-----------+-------+----------+--------------------+
only showing top 5 rows

Top Layer 7 Protocols
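A ranking of Layer 7 protocols could be produced by joining the flows with the protocol table, as in the sketch below. The join key is an assumption: it supposes that `L7_PROTO` holds a numeric code that matches the `id` column of the protocol table, which the original post does not confirm. Both view names are hypothetical.

```python
def top_l7_protocols_sql(flows_view, protocols_view, n=10):
    """Return a Spark SQL query counting flows per Layer 7 protocol.

    Assumes L7_PROTO can be cast to an integer matching protocols.id.
    """
    return (
        f"SELECT p.`Id Protocol` AS protocol, p.Category AS category, "
        f"COUNT(*) AS flow_count "
        f"FROM {flows_view} f "
        f"JOIN {protocols_view} p ON CAST(f.L7_PROTO AS INT) = p.id "
        f"GROUP BY p.`Id Protocol`, p.Category "
        f"ORDER BY flow_count DESC "
        f"LIMIT {n}"
    )
```

The backticks are needed because the `Id Protocol` column name contains a space. With both dataframes registered as temporary views, the chart would come from `display(spark.sql(top_l7_protocols_sql("flows", "protocols")))`.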

Top Services

Unique Source IP Addresses (15 min intervals)

Unique Destination Addresses (15 min intervals)
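The two unique-address charts can be sketched with a windowed distinct count. The view name is hypothetical, and bucketing by `startTime` is an assumption about how the intervals are defined.

```python
def unique_addresses_sql(view, address_col, interval_minutes=15):
    """Return a Spark SQL query counting distinct addresses per time interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return (
        f"SELECT window.start AS interval_start, "
        f"COUNT(DISTINCT {address_col}) AS unique_addresses "
        f"FROM {view} "
        f"GROUP BY window(startTime, '{interval_minutes} minutes') "
        f"ORDER BY interval_start"
    )
```

With a temporary view named `flows`, `display(spark.sql(unique_addresses_sql("flows", "IPV4_SRC_ADDR")))` covers the source chart and `"IPV4_DST_ADDR"` the destination chart. On very large ranges, Spark's `approx_count_distinct` is a cheaper alternative when an estimate is acceptable.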

Creating a Dashboard

The dashboard is a feature of the Databricks notebook that collects the notebook's display outputs in a separate view, producing a presentation like the one in the following picture:

Summary

This article introduced a real-time dashboard for network flow data as a continuation of the first article, Real-time Analysis of Network Flow Data with Apache Spark on Databricks.

Although the dashboard is built on static data, the Databricks widget feature makes it possible to change the time frame under investigation and update the dashboard in the view.
