What Are Data Lakes?

What Do They Mean For Analytics?


Data Oceans, Data Lakes, Data Ponds, Data Puddles, Data Reservoirs, Data Fjords, Data Vernal Pools. The world of Big Data is drowning in water metaphors.

Data Lakes are very much the in thing at the moment. They have been heralded by many as the future for Big Data, but also treated with scepticism by some. For analytics, there appears to be vast potential. But how much of this is likely to be realized?

Data Lakes are enterprise-wide data management platforms that store data from disparate sources in its native format until you query it for analysis. So, rather than putting the data in a purpose-built data store, you move it into a Data Lake in its original format.
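To make "native format until you query it" concrete, here is a minimal sketch of the idea, often called schema-on-read. Everything in it is an assumption made for illustration: the in-memory `lake` dictionary, the file names, the order fields and the helper functions are hypothetical, and a real Data Lake would sit on distributed storage rather than a Python dict.

```python
import csv
import io
import json

# Hypothetical "lake": raw payloads kept exactly as they arrived, keyed by source name.
lake = {
    "web_orders.jsonl": '{"customer": "a1", "amount": "19.50"}\n'
                        '{"customer": "b2", "amount": "5.00"}\n',
    "store_orders.csv": "customer,amount\na1,7.25\nc3,12.00\n",
}


def read_orders(name, raw):
    """Apply a schema only at query time, yielding (customer, amount) pairs."""
    if name.endswith(".jsonl"):
        rows = (json.loads(line) for line in raw.splitlines() if line.strip())
    elif name.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(raw))
    else:
        raise ValueError(f"no reader for {name}")
    for row in rows:
        # Type decisions (amount as float) are made here, not at ingestion.
        yield row["customer"], float(row["amount"])


def total_by_customer(lake):
    """Cross-correlate both sources without having reshaped either up front."""
    totals = {}
    for name, raw in lake.items():
        for customer, amount in read_orders(name, raw):
            totals[customer] = totals.get(customer, 0.0) + amount
    return totals


print(total_by_customer(lake))
```

The point of the sketch is that nothing about the stored data had to be decided in advance. A different analysis could read the same raw payloads with a different schema.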

The Data Lake concept aims to solve a number of problems that arise with traditional storage. Firstly, there is the problem of data silos. Data silos are operationally inefficient and limit the ability to cross-correlate data to drive better insights. The architecture of a Data Lake means that silos are minimized, as it combines all sources that were previously independently managed into a single unmanaged infrastructure. This, theoretically, should help increase information use and sharing. It should also lead to lower costs and remove licensing constraints.

Data Lakes also enable processing to take place with little resistance. Applications are no longer islands, instead existing within the Data Cloud. This allows them to exploit high-bandwidth access to data and scalable computing resources. Data itself is also not held back by initial schema decisions, which means that organizations can exploit it more freely, and time-to-value for analytics is reduced from months or weeks to minutes.

There are, however, a number of problems inherent in Data Lakes, as Gartner has been quick to point out. Firstly, there is the risk that a lake will turn into a Data Graveyard. If there is no information governance, it risks ending up as a collection of disconnected data pools or information silos held in one place. Gartner says that users cannot determine data quality, or trace the lineage of findings by other analysts or users who have previously found value in the same data in the lake. Nick Heudecker, research director at Gartner, notes that: ‘The fundamental issue with the data lake is that it makes certain assumptions about the users of information. It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without “a priori knowledge” and that they understand the incomplete nature of datasets, regardless of structure.’

Much of what Mr Heudecker says is true, but surely having a number of well-informed and experienced people who can access and contribute to analyzing the data is better than having just a few, or one. With strong information governance, Data Lakes can be a huge boost to an organization’s analytics in speed, cost, and accuracy, yielding insights that can drive a huge competitive advantage over those stuck with siloed data in prohibitively expensive, proprietary technologies.
