The hype around data lakes increased dramatically in 2016, with Gartner finding that inquiries related to the term rose 21% year-on-year. But while interest in data lakes has mushroomed, so too has skepticism about whether they actually work, and many believe they are due for a fall from grace in 2017.
Data lakes are enterprise-wide data management platforms that store data from disparate sources in its native format until it is queried for analysis. Rather than moving data into a purpose-built data store, you place it in the lake in its original form. Their popularity rests on the belief that consolidating data removes the information silos created by independently managed collections, thereby increasing information use and sharing. Supporters also cite lower costs through server and license reduction, cheap scalability, flexibility for use with future systems, and the ability to keep data until a use for it emerges.
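To make the 'native format until queried' idea concrete, here is a minimal sketch of schema-on-read in plain Python. The file names, fields, and sample records are hypothetical, and a real lake would sit on distributed storage rather than a local directory; the point is only that structure is imposed at query time, not at load time.

```python
import csv
import io
import json
import os
import tempfile

def land_raw(lake_dir, name, payload):
    """Store a payload exactly as received, with no upfront schema."""
    path = os.path.join(lake_dir, name)
    with open(path, "w") as f:
        f.write(payload)
    return path

def query_ages(lake_dir):
    """Impose structure only at read time: pull (name, age) pairs out of
    whatever formats happen to be sitting in the lake."""
    records = []
    for fname in sorted(os.listdir(lake_dir)):
        with open(os.path.join(lake_dir, fname)) as f:
            raw = f.read()
        if fname.endswith(".json"):
            row = json.loads(raw)
            records.append((row["name"], int(row["age"])))
        elif fname.endswith(".csv"):
            for row in csv.DictReader(io.StringIO(raw)):
                records.append((row["name"], int(row["age"])))
    return records

lake = tempfile.mkdtemp()
# Two hypothetical sources land in their original formats, untouched:
land_raw(lake, "crm_export.json", '{"name": "Ada", "age": 36}')
land_raw(lake, "hr_dump.csv", "name,age\nGrace,41\n")
print(query_ages(lake))  # -> [('Ada', 36), ('Grace', 41)]
```

Note that all the reconciliation work (knowing which files hold ages, how each format encodes them) lives in the query, which is exactly the burden on users that the critics quoted below describe.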
While these benefits persuaded many to look to data lakes as a solution, companies that took the plunge often have not seen the results they wanted; indeed, data lakes frequently create more problems than they solve. In 2014, Andrew White, vice president and distinguished analyst at Gartner, said, ‘The need for increased agility and accessibility for data analysis is the primary driver for data lakes. Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realized.’ The same is true today. Adam Wray, CEO and president of NoSQL database company Basho, has gone so far as to call data lakes ‘evil’, explaining, ‘They’re evil because they’re unruly, they’re incredibly costly and the extraction of value is infinitesimal compared to the value promised.’
The flaws are many and the risks substantial. For a start, data lakes lack semantic consistency and governed metadata, raising the level of skill required of users trying to find data for manipulation and analysis. According to Gartner research director Nick Heudecker, ‘The fundamental issue with the data lake is that it makes certain assumptions about the users of information. It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without "a priori knowledge" and that they understand the incomplete nature of datasets, regardless of structure.’ Simply put, most companies’ business users still lack the expertise to actually use data lakes. The marketing for many data lake products suggests that any user can dip in and pull out insights as if it were an arcade game, which simply isn’t true, and this is leading to a great deal of disillusionment.
Another question mark around data lakes concerns the quality of the data. The entire point of a data lake is that it pulls in any data with no governance, and with no restrictions on the cleanliness of that data, there is a real risk it will eventually turn into a data swamp. The lack of governance also creates security issues. Companies may not even know where the data they’re collecting comes from, what kind of data it is, or the regulatory requirements around its privacy. Companies cannot simply store all their data wherever and however they please; there are rules, and the security around data lakes is often lacking. Data protection is vital to a company’s reputation and demands strict governance; without it, companies leave themselves wide open to all manner of privacy risks.
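The contrast with a governed pipeline can be sketched in a few lines. The field names (`source`, `collected_at`, `payload`) and records below are hypothetical; the sketch simply shows the kind of ingest-time provenance check that a governed store applies and an ungoverned lake skips, letting records of unknown origin accumulate.

```python
# Fields a governed pipeline might require before accepting a record
# (hypothetical names for illustration).
REQUIRED = {"source", "collected_at", "payload"}

def validate(record):
    """Return a list of governance problems; an empty list means the
    record is acceptable for ingestion."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - record.keys())]
    if "source" in record and not record["source"]:
        problems.append("unknown provenance")
    return problems

clean = {"source": "crm", "collected_at": "2017-01-09", "payload": {"id": 1}}
murky = {"payload": {"id": 2}}  # no source, no timestamp: future swamp material

print(validate(clean))  # -> []
print(validate(murky))  # -> ['missing field: collected_at', 'missing field: source']
```

An ungoverned lake stores both records; a governed one rejects or quarantines the second, which is precisely what keeps provenance and regulatory questions answerable later.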
None of this means data lakes will disappear this year, only that companies considering a data lake investment should weigh carefully whether it really is the best option, rather than assuming it will somehow be the answer to their business intelligence dreams. This year is, however, likely to put more onus on certified data sets created by IT. Certified data can be shared across departments, solving the problems of data lakes while retaining the benefits.