Several of my colleagues from Red Hat and I attended Spark Summit East 2015. Apache Spark is a powerful distributed compute engine. Spark enables users to write data pipelines in Scala, Python, Java, and R using the Resilient Distributed Dataset (RDD) abstraction. Spark can keep data in memory between operations, which improves performance for iterative algorithms like K-means clustering or PageRank that make repeated passes over the same data.
However, the real power of Spark is its multi-modal programming model. Users can apply both functional and SQL programming paradigms to the same data, interweaving the two approaches as desired. There is no overhead from dumping, reformatting, and loading data, as there would be when pairing a separate distributed compute system with a relational database.
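To make the interweaving concrete, here is a minimal PySpark sketch. It is an illustration under assumptions, not code from any of the talks: the `user_id,n_bytes` record format and the helper names `parse_event` and `top_users` are hypothetical, and it uses the `SparkSession` and `createOrReplaceTempView` API from Spark 2.x and later (in 2015 the equivalents were `SQLContext` and `registerTempTable`).

```python
def parse_event(line):
    """Parse a hypothetical 'user_id,n_bytes' record.

    Returns (user_id, n_bytes) or None if the line is malformed.
    """
    parts = line.split(",")
    if len(parts) != 2:
        return None
    user_id = parts[0].strip()
    try:
        n_bytes = int(parts[1].strip())
    except ValueError:
        return None
    if not user_id:
        return None
    return (user_id, n_bytes)


def top_users(spark, lines, min_total):
    """Mix functional and SQL steps on the same distributed data.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    # Imported lazily so parse_event can be used without PySpark installed.
    from pyspark.sql import Row

    # Functional step: parse and drop malformed records with map/filter.
    parsed = (
        spark.sparkContext.parallelize(lines)
        .map(parse_event)
        .filter(lambda rec: rec is not None)
    )
    rows = parsed.map(lambda rec: Row(user_id=rec[0], n_bytes=rec[1]))

    # SQL step: aggregate the same data with a query, no export or reload.
    spark.createDataFrame(rows).createOrReplaceTempView("events")
    totals = spark.sql(
        "SELECT user_id, SUM(n_bytes) AS total FROM events GROUP BY user_id"
    )

    # Back to functional: filter and sort the query result as an RDD.
    return (
        totals.rdd.map(lambda r: (r["user_id"], r["total"]))
        .filter(lambda kv: kv[1] >= min_total)
        .sortBy(lambda kv: -kv[1])
        .collect()
    )
```

With PySpark installed, this could be called with a local session, for example `top_users(SparkSession.builder.master("local[*]").getOrCreate(), ["alice,10", "bob,3", "alice,5"], 10)`. The data never leaves Spark between the functional and SQL steps.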
Additionally, Spark has support for REPLs, which makes it easier to “play” with Spark and data. However, the command-line REPLs do not support in-process data visualization or saving state. Users from the Python world (like myself) may be disappointed when comparing the REPLs to the IPython Notebook. This matters because, as data scientists, we spend a lot of time trying to understand the structure of data, discover underlying patterns, and evaluate approaches for extracting meaning.
Many of the talks at this year’s Spark Summit emphasized visualization and interactive development. Multiple startups, including Databricks, a sponsor of Spark, are building products around Spark that target data scientists and end users. It’s clear that the gap between the Spark compute engine and user needs is going to be an important problem to solve going forward.
But are the only solutions proprietary? The answer is no, though it’s treated as something of an open secret. Apache Zeppelin, Spark Notebook, and Jove Jupyter are three promising open-source notebooks for use with Apache Spark and Scala. Cloudera and others have blogged about integrating the IPython Notebook with Spark.
Realizing the need to empower internal users to build and experiment independently, my team will be evaluating these options going forward. As we learn more, I’ll post updates here.