This post is the continuation of:

Real-time Analysis of Network Flow Data with Apache Spark on Databricks

The first post described how to store network flows in a data lake on S3 and perform real-time analysis on the stream. This post covers querying that data lake to build an analytics dashboard using Apache Spark on Databricks. See the first post for details of the system architecture, the data lake, and the cloud infrastructure used to run these workflows.

Access the S3 bucket in Databricks

In the Databricks notebook, the easiest way to access a data lake on S3 is to…

Network monitoring at scale is a big data problem that requires a big data platform¹. This post describes how to perform real-time analysis while storing network flow metadata in a data lake on AWS S3 using Spark on Databricks.

Photo by NASA on Unsplash


Network visibility and analysis rely on data from the network. These days it is not uncommon for enterprise networks to carry very large amounts of data in their traffic. While packet capture has long been the tool of choice, it is too costly to use as the preferred way of performing data analysis…

Marco Graziano

Engineer, apprentice renaissance man. I am the founder of technology start-ups in Palo Alto.
