Believe it or not, the Managing Editor of this website studied Archaeology at university. Although archaeology has no obvious relevance to his eventual career of writing about Big Data, it taught him one thing: you need to sift through a lot of dirt to find the gold.
The same is true of anything involving data, but it's something that many companies have yet to fully realize. It is why, when a company comes out and says that it has several petabytes of data, the question should not be ‘wow, you must get so much insight from that’ but instead ‘why do you have that much?’.
A large database is often not indicative of a successful data program, but of one that requires considerable work. With a database of this size, it is often the case that the data has simply been collected indiscriminately, from anywhere and everywhere. That creates inaccuracy within the data and makes accurate analysis of it almost impossible.
It is the definition of ‘garbage in, garbage out’.
Experian Data Quality claims that inaccurate data affects the bottom line of 88 percent of organizations and impacts up to 12 percent of revenues. Numbers like these are huge, and they show that the mis-collection of data is practically endemic across today's business landscape.
The ability to filter, categorize, and analyze data comes exclusively from knowing what the data is and what it represents. This is not something that can easily be done by technology, and many data science teams find that the majority of their work is taken up with cleaning data before it can be used.
As DJ Patil notes, ‘you have to start with a very basic idea: Data is super messy, and data cleanup will always be literally 80 percent of the work. In other words, data is the problem’.
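To make that ‘80 percent of the work’ concrete, here is a minimal sketch of the kind of cleanup a data team does before any analysis can start. The records, field names, and formats below are entirely hypothetical, invented for illustration: inconsistent capitalization and whitespace, two different date formats, a formatted number, a missing value, and a duplicate entry.

```python
# A minimal, hypothetical example of routine data cleanup:
# normalize names, parse mixed date formats, coerce numbers,
# handle missing values, and drop duplicates.
from datetime import datetime

raw_records = [
    {"name": " Alice Smith ", "signup": "2021-03-05", "revenue": "1,200"},
    {"name": "alice smith",   "signup": "2021-03-05", "revenue": "1200"},  # duplicate
    {"name": "Bob Jones",     "signup": "05/03/2021", "revenue": None},    # other date format, missing value
]

def clean(record):
    """Normalize one raw record; return None if it cannot be salvaged."""
    # Collapse whitespace and standardize capitalization.
    name = " ".join(record["name"].split()).title()
    # Accept the two date formats observed in the raw data.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            signup = datetime.strptime(record["signup"], fmt).date()
            break
        except ValueError:
            continue
    else:
        return None  # unparseable date: discard rather than guess
    # Strip thousands separators; treat missing revenue as zero.
    revenue = float(record["revenue"].replace(",", "")) if record["revenue"] else 0.0
    return {"name": name, "signup": signup, "revenue": revenue}

seen = set()
cleaned = []
for rec in map(clean, raw_records):
    if rec is None:
        continue
    key = (rec["name"], rec["signup"])  # deduplicate on normalized name + date
    if key not in seen:
        seen.add(key)
        cleaned.append(rec)
```

Even this toy version shows why the work dominates: every decision (what counts as a duplicate, what to do with a missing value, which date formats to accept) requires knowing what the data represents, which no tool can decide on its own.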
Data is like anything else: if you have too much of it and you let it pile up in messy, unorganized heaps, it becomes garbage, useful for nothing except the bin.
It is something that many companies do not consider when they begin gathering data, but knowing what is being collected and, perhaps more importantly, why it is being collected should be at the core of everything done within a data program. It is important to have this as a solid foundation and then build out, rather than bringing in everything at once and organizing it afterwards, as doing that just creates garbage.