Why Cleanliness Is Next To Godliness In Data

How important is clean data in an analysis?


In most current discussions of data, the aspect people focus on is quantity.

That might mean the number of sources the data is drawn from, the amount of insight gained, or simply the number of gigabytes a company can store.

Many therefore believe that these numbers are what make a successful analytics programme, but this is not the case.

To get the most from data, cleanliness is the most important thing to concentrate on.

This is because an analysis cannot be accurate if it is based on flawed data. The flaw could be anything from outdated records to duplicates skewing results. If your data is not clean, everything beyond that point is essentially a waste of time.
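To make the point about duplicates and outdated data concrete, here is a minimal sketch in Python. The records, the field names (customer_id, spend, updated) and the cutoff date are all invented for illustration; the only point is that the same average changes once the dirty rows are removed.

```python
from datetime import date
from statistics import mean

# Hypothetical records (all values invented for illustration).
# Customer 2 appears twice by accident; customer 3 was last updated years ago.
records = [
    {"customer_id": 1, "spend": 100.0, "updated": date(2015, 3, 1)},
    {"customer_id": 2, "spend": 400.0, "updated": date(2015, 2, 10)},
    {"customer_id": 2, "spend": 400.0, "updated": date(2015, 2, 10)},
    {"customer_id": 3, "spend": 300.0, "updated": date(2009, 6, 5)},
]


def clean(rows, cutoff):
    """Drop exact duplicates and rows last updated before `cutoff`."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["customer_id"], row["spend"], row["updated"])
        if key in seen or row["updated"] < cutoff:
            continue  # skip repeated rows and stale rows
        seen.add(key)
        kept.append(row)
    return kept


# Average spend on the raw data, then on the cleaned data.
raw_mean = mean(r["spend"] for r in records)
cleaned = clean(records, cutoff=date(2014, 1, 1))  # cutoff is an assumed policy
clean_mean = mean(r["spend"] for r in cleaned)

print(raw_mean, clean_mean)
```

With these invented figures the raw average is 300.0 and the cleaned average is 250.0, so anything built on the raw figure starts from an overstatement of 20 percent.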

Although data cleaning (like almost any other cleaning) is often a slow and laborious job, especially with large datasets, it is vital, and the consequences of skipping it should not be taken lightly.

So what kind of implications can there be from using unreliable data in analysis?

Perhaps the best example is the economic crash of the late 2000s.

This occurred in part because the data being used was not clean. It often came from unreliable sources who input data with a bias, meaning that they wanted the data to show something it should not have shown. This ranged from the way that mortgage data was input to how systems were set up to accept only the parts of the data that showed maximum profits and minimum risk.

This was the case with many deals at some of the major banks, where the total amount of money lent was shown on balance sheets without taking into account the credit ratings of many of the borrowers. The data shown to those assessing the health and performance of these institutions was therefore not clean, and the analysis showed them what they wanted to see.

Although this is an extreme example, it shows the impact that basing analyses on unreliable data can have: not only inaccurate results, but the destruction of companies and economies in the process.

The struggle that many companies have with cleaning their data is simply that they are required to go back through everything they have and validate it, which is both expensive and time-consuming. This is often off-putting, and because many companies are already having partial success with their data, they are unwilling to change.

Startups, or companies venturing into data for the first time, almost have an advantage here: they can build a programme from the ground up, creating systems that use clean data or that highlight unclean data when it is added.
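One way a new system might highlight unclean data as it is added is to check every incoming record against a few rules and set aside anything that fails. The sketch below is only an illustration: the field names (customer_id, amount, credit_score), the rules themselves and the 300 to 850 score range are assumptions, not a description of any particular company's system.

```python
def find_problems(record):
    """Return a list of rule violations for one incoming record (hypothetical rules)."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        problems.append("amount must be a positive number")
    score = record.get("credit_score")
    if score is None:
        problems.append("missing credit_score")
    elif not 300 <= score <= 850:  # assumed valid range, for illustration only
        problems.append("credit_score outside 300-850")
    return problems


def ingest(records):
    """Split incoming records into accepted rows and flagged rows with reasons."""
    accepted, flagged = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            flagged.append((record, problems))  # keep the reasons for review
        else:
            accepted.append(record)
    return accepted, flagged


# Usage with two invented records: one clean, one with several problems.
incoming = [
    {"customer_id": "c1", "amount": 1000, "credit_score": 700},
    {"customer_id": "", "amount": -5},
]
accepted, flagged = ingest(incoming)
for record, problems in flagged:
    print(record, problems)
```

Flagged records are kept alongside the reasons they failed rather than silently dropped, so someone can decide whether to correct or discard them.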

To gain full success from data, it is necessary to use clean data. We have seen that companies that break this rule not only fail to get the full return on their data programmes but also run the risk of far more serious consequences. The way to think of data in this context is as fuel in a car: the car will run perfectly on good, clean fuel, but dirty fuel will not only make it run badly, it could wreck the engine altogether.

