As big data has blown up over the past few years, so too has Hadoop. The position of the Hadoop Distributed File System (HDFS) as the storage platform of choice for big data is not in question, and it is extremely unlikely that it will be any time soon. However, while Hadoop MapReduce - the data processing engine powering Hadoop - has many fine qualities, it also has a number of limitations that make it vulnerable as the market leader, and it seems just a matter of time before Apache Spark overtakes it.
Apache Spark is an open source cluster computing framework created at the AMPLab at UC Berkeley in 2009. Spark was originally designed to handle interactive queries and iterative computation. Its developers claim performance up to 100 times faster than MapReduce for some applications, which makes it well suited to machine learning.
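To see why iterative computation benefits from keeping data in memory, consider a toy model in plain Python. This is only a sketch and does not use Spark or Hadoop: `CountingStore`, `mean_step`, `run_rereading` and `run_cached` are hypothetical names invented for the illustration, and the "storage" is just a list that counts how often it is read in full.

```python
# Toy model only: counts simulated storage reads; it does not use Spark or Hadoop.


class CountingStore:
    """Stands in for distributed storage and counts full reads of the dataset."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read_all(self):
        self.reads += 1
        return list(self._values)


def mean_step(values, guess, rate=0.5):
    """One iteration: move the guess part of the way toward the data mean."""
    mean = sum(values) / len(values)
    return guess + rate * (mean - guess)


def run_rereading(store, iterations):
    """Disk-oriented style: every iteration reloads the input from storage."""
    guess = 0.0
    for _ in range(iterations):
        guess = mean_step(store.read_all(), guess)
    return guess


def run_cached(store, iterations):
    """In-memory style: load the input once, then iterate over the cached copy."""
    cached = store.read_all()
    guess = 0.0
    for _ in range(iterations):
        guess = mean_step(cached, guess)
    return guess


data = [2.0, 4.0, 6.0, 8.0]
disk_store, memory_store = CountingStore(data), CountingStore(data)
disk_result = run_rereading(disk_store, iterations=10)
memory_result = run_cached(memory_store, iterations=10)

# Same answer either way, but ten full reads versus one.
print(disk_store.reads, memory_store.reads)  # prints: 10 1
```

Both runs produce the same result; the only difference is how often the input is fetched. On a real cluster each full read means disk and network traffic, which is why algorithms that pass over the same data many times, as much of machine learning does, gain the most from caching.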
Its take-off has been exceptional, in a sector where exceptional growth is the norm. According to a recent survey of 2,100 developers by Typesafe, awareness of Spark is still growing, with 71% of respondents saying they have some experience with the framework. It has now reached more than 500 organizations of all sizes, which are committing thousands of developers and extensive resources to the project. It has even received backing from industry giant IBM, which has integrated Spark into its own products and open sourced its own machine learning technology to increase Spark's capabilities. IBM has also announced that it will commit more than 3,500 researchers and developers to Spark-related projects, and on October 26 said that it is using Spark to enhance several of its existing software products, including the SPSS predictive analytics software, which it bought for $1.2 billion in 2009.
Its rise is primarily down to its focus on plugging the holes in MapReduce, namely its lack of speed and the absence of in-memory processing. In benchmark tests, Spark sorted 100 TB of data in just 23 minutes on Amazon Elastic Compute Cloud (EC2) machines, compared to the 72 minutes Hadoop MapReduce needed to sort the same volume. Spark managed this using less than one tenth of the machines - 206 compared to 2,100 for Hadoop. The feat saw it win the 2014 GraySort Benchmark (Daytona 100TB category), with a team from Databricks, including Spark committers Reynold Xin, Xiangrui Meng, and Matei Zaharia, tying for first with the Themis team from UCSD and jointly setting a new world record in sorting.
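The benchmark figures quoted above can be combined into a rough per-machine comparison. The short Python calculation below is only back-of-the-envelope arithmetic on those numbers: it treats every machine as equivalent, which the two clusters were not, so the result is an indication rather than a measurement. The function name `tb_per_machine_minute` is made up for this sketch.

```python
# Figures quoted in the article for the 100 TB sort.
DATA_TB = 100
spark = {"minutes": 23, "machines": 206}
hadoop = {"minutes": 72, "machines": 2100}


def tb_per_machine_minute(run):
    """Terabytes sorted per machine per minute (assumes identical machines)."""
    return DATA_TB / (run["minutes"] * run["machines"])


speedup = hadoop["minutes"] / spark["minutes"]
machine_ratio = hadoop["machines"] / spark["machines"]
efficiency_ratio = tb_per_machine_minute(spark) / tb_per_machine_minute(hadoop)

print(f"Wall-clock speedup: {speedup:.1f}x")                    # 3.1x
print(f"Machines used: {machine_ratio:.1f}x fewer")             # 10.2x
print(f"Per-machine throughput: {efficiency_ratio:.1f}x higher")  # 31.9x
```

In other words, finishing about three times sooner on about a tenth of the machines works out to roughly thirty times more data sorted per machine per minute.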
Being able to stream data on the fly and combine it with other sources in real time is perfect for business, bringing the speed of analysis much closer to the speed of thought. This enables quicker and more accurate decision making. Perhaps most importantly, it also means Spark is well placed to solve many of the issues that the Internet of Things (IoT) will throw up when it takes off in earnest.
One of Spark’s major selling points, particularly to businesses, is its accessibility to anyone with an understanding of databases and some scripting skills. It offers interactive web interfaces and supports languages such as Python, which data scientists and statisticians tend to use. This makes it easier for businesses to recruit people who can understand their data, as well as to find the tools to process it. Unlike MapReduce, Spark is also able to run without Hadoop, and it works well with resource managers like YARN or Mesos.
None of this is to say that Spark will make MapReduce obsolete. MapReduce is still an incredibly useful tool for many tasks. However, for the requirements of big data in the future, Spark seems better suited to solving many of the problems that are likely to arise.