There was a time when Hadoop was synonymous with Big Data, but that time, it appears, is nearing an end. Sri Ambati, the CEO and co-founder of H2O.ai, famously said earlier this year that 'Hadoop is dead,' and his is not a lone voice, with Gartner claiming that many organizations are re-examining its role. In Gartner's hype cycle earlier this year, the advisory giant even declared Hadoop distributions 'obsolete' due to complexity, and stated that 'the questionable usefulness of the entire Hadoop stack is causing many organizations to reconsider its role in their information infrastructure.'
The reasons for this are complex, and the way forward is shrouded by would-be giants who want to take its place. First, a potted history. A large corporation 50 or 60 years ago would buy an IBM System/360 mainframe. It had a redundant power supply, was properly engineered with high-quality components, and ran software that played nicely with the hardware. Over time, as you had to deal with more and more transactions, you would get the next model up, and then the next, and so forth. Eventually, we reached a point where the largest organizations were, and still are, operating a number of large data centers, much of which exist only in case something goes wrong. They are good because they are built to last and rarely fail. They are, however, extremely expensive.
The alternative was developed by Google, which needed a tremendous amount of computing power but in the early days lacked the funds to buy it. Legend has it that they would drive the streets of Silicon Valley looking for systems that had been thrown out. There were obvious issues with this approach, so they came up with two technologies: the Google File System, described in a paper released in 2003, and MapReduce. The Google File System inspired the Hadoop Distributed File System, and MapReduce remained MapReduce. These two technologies are the core of Hadoop.
At the recent Big Data & Analytics Innovation Summit in Sydney, Mike Seddon, Senior Data Engineer at integrated energy company AGL, described his issues with Hadoop. According to Seddon, there are three myths around Hadoop that have been perpetuated over the years and contributed significantly to its growth. The first is that there is an avalanche of unstructured data being thrown at every organization. There is not. Most data in your organization came from a relational database, argues Seddon. If you want to run video through your Hadoop cluster, the first thing you have to do, and the first thing you are paying machine learning engineers to do, is restructure that data. You may have thrown away the structure in some cases, but it is still structured data. There is no point dumping data into a cluster and hoping you will be able to sort it out later, because that likely won't happen.
The second myth Seddon notes is that data types don't matter. Hadoop distributors have somehow managed to convince people that types can be left for later, but they are something you have to get right up front. If not, you're going to get queries that don't behave as you'd expect: if some of your customer IDs are stored as strings and others as integers, for example, joins and comparisons will not work properly. Analysts shouldn't have to know this; they should just have data that works. It is also computationally expensive to re-infer types later from CSV or whatever format you've got.
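Seddon's point about types is easy to demonstrate. A minimal Python sketch (the customer IDs here are invented for illustration) shows what happens when the same logical value is stored with two different types:

```python
# Hypothetical customer IDs stored with inconsistent types.
customer_id_from_system_a = "1001"   # stored as a string
customer_id_from_system_b = 1001     # stored as an integer

# The "same" customer no longer matches in joins or lookups.
print(customer_id_from_system_a == customer_id_from_system_b)  # False

# Sorting string-typed numbers is lexicographic, not numeric.
ids = ["9", "10", "100"]
print(sorted(ids))           # ['10', '100', '9'] -- not what an analyst expects
print(sorted(ids, key=int))  # ['9', '10', '100'] -- only correct once types are fixed
```

The analyst running the query has no way of knowing any of this from the query itself, which is exactly why Seddon argues types must be right up front.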
Lastly, Seddon takes issue with the whole premise of Hadoop: that you should move your compute to your data. As it turns out, this doesn't really matter. Data locality is important, which is why RAM is important, but once you get to a certain cluster size, the probability of having compute available right next to the data drops very low. As long as you can ship your data to where you need it quickly enough, it doesn't really matter where it's stored.
The main problem we have, and that Hadoop has, is that Google built this technology for the constraints of 1997, and those are no longer the constraints faced by organizations today. Hardware has since shifted to match changing needs. Backblaze, for example, produces hard disc racks holding 60 discs and 720 TB of data each, and publishes all of its statistics about disc reliability. Put enough of these racks together and you get massive clusters of beautifully redundant data. These machines have essentially no processing power; they are just there for storage.
Then you have machine learning boxes from Facebook that are useless at storing data but tremendous at processing tensors and vectors. Put a lot of them together and you essentially have a supercomputer.
These two kinds of hardware combined have enabled us to choose how we compute things. Hadoop stacks storage and compute on top of each other, which means you have to grow them together, and that doesn't work, Seddon argues, because you might want to do machine learning, and you might not want to do it on the same computer you store data on. Furthermore, as you get more and more workloads running on your Hadoop cluster, you start to get huge contention; it snowballs on itself until you reach a state where nothing appears to work. Then people restart their jobs and make the problem even worse.
Apache Spark is also proving to be a tremendous boon. Spark is essentially a distributed SQL engine that is open source and capable of doing machine learning. As you add more compute, you get nice speed-ups thanks to the additional cores. You can choose what sort of job you want to run to meet your business requirements, deciding whether a task requires you to throw 200 CPU cores at it or can just be run on a cheap box somewhere, wherever that may be.
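The core idea is partition-and-combine: split a job into independent pieces, run them on however many workers the business case justifies, and merge the results. A minimal sketch using Python's standard library, with a local thread pool standing in for Spark's executors (a real Spark job would distribute these partitions across machines, which is where the actual speed-up comes from):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """Sum the squares over one partition of the input range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def total(n, workers):
    """Split [0, n) into one partition per worker, run the partitions
    on a pool of workers, and combine the partial results."""
    step = n // workers
    parts = [(w * step, n if w == workers - 1 else (w + 1) * step)
             for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, parts))

# Same job, different amounts of compute: the answer is identical,
# only the resources thrown at it change.
print(total(1_000_000, workers=2))
print(total(1_000_000, workers=8))
```

Because the partitions are independent, the worker count is purely an economic decision, which is exactly the flexibility Seddon describes.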
Another technology helping organizations move past Hadoop is Parquet, a columnar file format that stores data on disc column by column. This means you get really good compression and you only read the data you care about, which makes moving to cloud environments much easier because you only have to ship a small amount of data.
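The columnar idea can be sketched in plain Python. This is an illustration of the layout only, not Parquet's actual file format (reading real Parquet files would go through a library such as pyarrow), and the table values are invented:

```python
# Row-oriented layout: every record carries every field, so answering a
# question about one column means scanning all of them.
rows = [
    {"id": 1, "name": "alice", "spend": 120.0},
    {"id": 2, "name": "bob",   "spend": 80.0},
    {"id": 3, "name": "carol", "spend": 45.5},
]

# Column-oriented layout: each column is stored contiguously, so a query
# that touches one column reads (and ships over the network) only that
# column. Long runs of similar values also compress far better this way.
columns = {
    "id":    [1, 2, 3],
    "name":  ["alice", "bob", "carol"],
    "spend": [120.0, 80.0, 45.5],
}

# "SELECT sum(spend)" only has to touch one list.
print(sum(columns["spend"]))  # 245.5
```

In a cloud setting, where data is shipped from remote storage to wherever the compute happens to be, reading one column instead of the whole table is the difference that makes the "ship data to compute" model practical.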
Finally, we have containerization technology, which enables us to package applications up and deploy them easily. The advantage is that when a container stops working, you can just take it down; it doesn't matter, because you still have everything and can simply spin up another one. While containers are running they have value, and when they don't, you can just get rid of them.
So we are now at a point where we have cloud infrastructure and a lot of compute that is easily scaled. But what do we do with it, and how do we deal with large datasets?
This is still an issue in companies operating legacy systems. Over the last half century, we have seen businesses grow rapidly and their computational needs grow with them. The technology, however, often failed to keep up, leading businesses to partition themselves into separate parts which then had to ship information between them. Now we have so much compute that this is no longer a problem. At AGL, Seddon says, they take raw data, process it in the cloud where compute can be scaled dynamically, and recalculate everything at the end of each day. This may sound convoluted, but it enables them to be more agile. On top of this they use SQL DW, the data warehouse edition of SQL Server, which gives people a SQL interface; everything they do in Spark is in SQL. This matters because SQL is a definition of intent: rather than telling the computer how to do something, you declare what you want to happen. That is why, a few years ago when Hadoop and Hive were both big, you could pick up your SQL queries from Hive, run them on Spark, and get a big speed-up, and the next version of Spark promises to be faster still.
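The "definition of intent" point is easy to see in miniature, with Python's built-in SQLite standing in for Spark SQL (the table and values here are invented for illustration):

```python
import sqlite3

# A hypothetical energy-usage table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (customer_id INTEGER, kwh REAL)")
conn.executemany("INSERT INTO usage VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# Declarative: the query states WHAT is wanted and the engine decides HOW.
# Because only intent is expressed, the same query text can move between
# engines -- SQLite here, Hive or Spark SQL in production -- which is how
# Hive queries could be rerun on Spark for a speed-up.
total_per_customer = conn.execute(
    "SELECT customer_id, SUM(kwh) FROM usage "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(total_per_customer)  # [(1, 15.0), (2, 7.5)]
```

Nothing in the query says to scan the table, build a hash of groups, or sort the output; each engine is free to pick its own plan, and a faster engine makes the same query faster for free.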
Hadoop is not dead yet, but it is definitely no longer the center of attention, and organizations are realizing there are other options out there. Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse, says: 'I can't find a happy Hadoop customer. It's sort of as simple as that. It's very clear to me, technologically, that it's not the technology base the world will be built on going forward.' Muglia may be overstating the case, but it is clear that it is now the turn of other technologies to drive the world of data forward.