Hadoop has become one of the key tools for those looking to make use of Big Data. It supports the most complex data-driven initiatives, yet it also lets companies try it out without a huge expense.
It is versatile and scalable, and companies have even built multi-million dollar businesses on top of the initial product.
However, its versatility and the range of options around it have also created some confusion for companies, the main question being whether it is better to deploy Hadoop in the cloud or on-site.
We have considered this across four key areas to work out which is better:
The cloud has often been the source of hacks and data breaches in recent years. The hacking of Apple’s iCloud earlier this year is testament to the fact that any company, regardless of status, can be hacked, and this is no different when the data is being sent to Hadoop in the cloud.
This is not to say that it is unsafe online; in fact, the chances of being hacked are very remote. Security systems are constantly being updated, and even though cyber criminals are upgrading and attacking with more technology every day, the truth is that security experts are generally stopping them. Indeed, the Apple hack happened because of weak passwords chosen by individuals rather than any real breach of the company's systems.
The least secure time for any data in the cloud is not when it is being held there, but when it is in transit during upload. The ways in which the NSA and GCHQ gained access to much of their data show that it is at this point, when data is between the source and the final destination, that it is at its most vulnerable.
On-site certainly provides a more robust security platform, especially if it is kept in a closed network. Restricting the people who have access to an internal database means that there is less chance of data being lost or cyber criminals gaining access. That being said, the burden of security then falls on your internal IT team, which makes upgrades more complex and up-to-date protection more expensive.
Winner: On-site, purely based on the potential to keep Hadoop on a closed network.
On-site systems have by far the biggest setup cost. They require considerable resources, from the servers needed to store the data to the extra processing power needed to run queries effectively. They often require additional time from the IT department for upkeep, and potentially new hires for maintenance and security.
Upgrading on-site systems is also incredibly expensive. If performance needs to be increased or storage space needs to grow, this can often cost thousands of dollars. Even the space needed to house new servers and machines adds to the bill.
Cloud-based solutions do not have this issue, as many of them simply require a monthly subscription fee. This increases or decreases based on how much of the cloud-based system’s resources a company is using. Scalability is therefore easy: simply pay more and use more. Scaling requires neither considerable work from an in-house team nor a large initial investment.
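To make the trade-off between a large up-front investment and a pay-as-you-go fee concrete, here is a rough sketch in Python. Every figure in it is hypothetical and invented purely for illustration, the function names are our own, and a real comparison would need actual quotes, hardware refresh cycles and staffing costs.

```python
def cumulative_on_site(months, upfront, monthly_upkeep):
    """Total on-site spend after `months`: setup cost plus ongoing upkeep."""
    return upfront + monthly_upkeep * months


def cumulative_cloud(months, monthly_fee):
    """Total cloud spend after `months` at a flat monthly subscription."""
    return monthly_fee * months


def break_even_month(upfront, monthly_upkeep, monthly_fee):
    """First whole month in which cloud spend reaches on-site spend.

    Returns None when the cloud fee never exceeds on-site upkeep, in which
    case cloud spend never catches up under this simple model.
    """
    diff = monthly_fee - monthly_upkeep
    if diff <= 0:
        return None
    return -(-upfront // diff)  # ceiling division on integers


# Hypothetical figures (assumptions, not real prices).
UPFRONT = 120_000       # servers, space, installation
UPKEEP = 2_000          # monthly on-site maintenance
CLOUD_FEE = 6_000       # monthly cloud subscription

if __name__ == "__main__":
    for m in (6, 12, 24):
        print(m, cumulative_on_site(m, UPFRONT, UPKEEP), cumulative_cloud(m, CLOUD_FEE))
    print("break-even month:", break_even_month(UPFRONT, UPKEEP, CLOUD_FEE))
```

With these made-up numbers the cloud option stays cheaper for the first couple of years. The model also ignores the cloud's ability to scale the fee down when usage drops, which a fixed on-site investment cannot do.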
Winner: Cloud-based, as it is considerably cheaper on most metrics.
Cloud-based Hadoop services allow companies access from anywhere in the world. A data analyst could be in a tent in the middle of a field but, as long as they had an internet connection, could at least check reports and view progress (even if uploading or downloading may be a bit optimistic). This allows for flexibility in the ways that people can work.
On-site systems often will not have remote access because of the complexity and security concerns it would create for internal security systems. Many on-site systems are kept on-site precisely so that they stay in a closed network, so being able to access them whenever is most convenient would be counterproductive, as it would potentially compromise the main benefit of an on-site system: security.
Winner: Cloud-based, as it can be accessed from almost anywhere as long as there is an internet connection.
Overall Winner: Cloud-based.
Hadoop in the cloud comes out as the winner based on its price, ease of access and simple scalability. On-site Hadoop implementations are not a bad option, but companies need to be aware that the costs of implementation and upkeep are considerably higher than for cloud-based ones. The benefits of on-site rest mainly in security, as it offers a considerably more secure system, especially when used in a closed network. Therefore, although Hadoop in the cloud is the winner, if a company is looking to use highly classified or personal data, on-site is the better option.