Apache Spark is the solution. What is the problem?

Using the right tool for the right job


Having worked on the business side of data analytics over the past few years, I have seen Apache Spark gather momentum, along with components that address a broad set of analytical use cases. Spark also supports several programming languages and has a strong ecosystem of developers and ISVs building tools on top of it, especially for visual analytics and data visualization. It is open source and ships out-of-the-box with all the major Hadoop distributions.

If I am an enterprise trying to figure out how to upgrade to the new generation of analytics tools for large data volumes, this raises the question: is Spark the one-stop shop for all my analytics needs? My prediction: not yet, and likely not for a long while, if ever.

Most enterprises consider Spark the new kid on the block among analytics tools. Look inside Spark, though, and you will see that its components are at different stages of maturity. Let's consider where each of them stands.

Spark SQL executes SQL queries on data at rest but is still quite far from full SQL compliance and from offloading enterprise data warehouse workloads. Existing SQL-on-Hadoop products, including Apache Hive, are a better fit for batch-mode and interactive SQL analytics over data stored in Hadoop.

MLlib has a core set of machine learning algorithms, but it is certainly not as complete as R or other machine learning libraries such as Apache MADlib. SparkR is in the process of combining the power of R's interactive interface with Spark's distributed processing, but you will still need a robust machine learning library callable from analytics applications to implement advanced analytics use cases.

Spark Streaming is likely the most mature of all the components built on Spark. It processes terabytes of streams in "micro-batches", can combine non-real-time data and machine learning capabilities with stream processing, integrates out-of-the-box with widely used stream ingestion frameworks, and provides a high-level API for all these capabilities. It has also survived Netflix's Chaos Monkey test and has seen good adoption in production and prototyping environments.
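To make the "micro-batch" idea concrete, here is a toy sketch in plain Python, not Spark code: records arriving on a stream are sliced into small batches, and each batch is then processed with ordinary batch logic. Spark Streaming slices by time interval rather than by record count, but the processing model is the same.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterator into small batches.

    Toy stand-in for the micro-batch model: Spark Streaming cuts the
    stream by time interval; this sketch cuts by count instead.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each micro-batch is handled with plain batch logic (a word count here).
events = ["spark", "streaming", "spark", "graphx", "spark", "mllib", "sql"]
for batch in micro_batches(events, batch_size=3):
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    print(counts)
# prints:
# {'spark': 2, 'streaming': 1}
# {'graphx': 1, 'spark': 1, 'mllib': 1}
# {'sql': 1}
```

Because each micro-batch is just a small batch job, the same engine (and the same machine learning and SQL code) can be reused on streams, which is exactly the integration advantage described above.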

GraphX enables users to build, transform and gain insights from data structured as graphs. While GraphX has stabilized its API, it is still in the beta phase. That said, it is emerging as the major open source graph analytics package and will become better integrated with the rest of the Spark components over time.

Another key aspect of maturity with respect to enterprise adoption is support for configuration and resource management of the Spark stack. While Spark support on YARN was added recently for resource management, full configuration support using native Hadoop tools is still in its infancy.

Given the current state of Spark and its components, let's look at where Spark is likely to end up in the analytics tool chest. The figure below shows my predictions on which tools (or categories of tools) will address the various analytics use cases as the ecosystem matures.

Enterprise data warehouses and SQL-on-Hadoop engines (including Apache Hive) are here to stay, delivering SQL-compliant interactive and batch-mode analytics on structured data. NoSQL systems will play a vital role in processing object data. Spark Streaming and GraphX will see prominent usage, and MLlib will likely be the de facto machine learning library for the majority of use cases.

What remains to be seen is how well the "other components" listed in the figure will integrate with Spark in general and Spark Streaming in particular. That will largely depend on individual vendor strategies.

At the end of the day, it all boils down to one thing: using the right tool for the right job, that is, the right analytics use case. So while the ecosystem is excited about Spark, it behooves the enterprise to step back and build a tool chest with more than just Spark, one that is most effective at solving business problems today and into the future as Hadoop-based tools mature.

