Data Analytics Is Not For Datasets Full Of 'White Noise'

Cleaning data is just a first step in the process of analytics


Data analytics has gained immense popularity in recent years. Businesses of every kind, whether consumer goods, retail and e-commerce, healthcare, or manufacturing, can benefit from adequate and appropriate data collection. The data is then fed into software for analysis that can reveal many useful insights, and companies are investing heavily in it.

The reality on the ground, however, is that it is not easy. The data used for analytics is often incomplete and therefore misleading. Data, particularly customer data, also changes frequently: people move homes, change their jobs or phone numbers, gain or lose weight, develop new medical conditions and recover from old ones, and so forth. As a result, it is important that organizations regularly revise and update their databases. Data cleansing helps prevent costly mistakes, but it is often overlooked.

Dirty data: what is it, and why does it matter now?

One thing is for sure: there are no clean datasets. Datasets are always messy and full of irrelevant facts and 'white noise'. Even the most organized and tidy datasets are bound to display irregularities one way or another. So the question is how much of your data is garbage, and how big the margin of error is in the analytics you draw from such a database. Some of the most common types of dirty data are:

  • Data that violates business rules
  • Data without standardized formatting
  • Duplicate data
  • Inaccurate data
  • Incorrect data
  • Incorrectly punctuated or spelled data
  • Misleading data
  • Non-integrated data

Inaccurate and incomplete data is the main woe of market research and marketing projects. In e-commerce, each store has its own way of recording information: abbreviations, keywords, detailed descriptions, and so on. Customers making online purchases are required to fill in their details, and knowingly or unknowingly they leave particular fields empty because they find them irrelevant. It becomes impossible to slice and dice data collected in this manner unless processes for revision and standardization are in place. Before the sales, marketing, or customer care departments can use it, such data requires thorough cleansing.
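A first step toward such cleansing is simply measuring how often each field is left empty. The following is a minimal sketch in Python, not a definitive method; the function name, the field names, and the rule that a blank or whitespace-only value counts as missing are all assumptions made for illustration.

```python
from collections import Counter


def count_missing_fields(records, required_fields):
    """Count, per field, how many records leave that field empty.

    Assumption: a field is "missing" if it is absent, None, or blank
    after stripping whitespace. Real business rules may differ.
    """
    missing = Counter()
    for record in records:
        for field in required_fields:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return missing


# Hypothetical customer records, as they might arrive from a checkout form.
customers = [
    {"name": "Ann", "phone": ""},
    {"name": "Bob"},
    {"name": " ", "phone": "555-0100"},
]
report = count_missing_fields(customers, ["name", "phone"])
```

A report like this shows which fields customers tend to skip, which can guide decisions about which form fields to make mandatory, redesign, or drop.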

How does data become 'dirty'?

Existing datasets, social media, websites, smartphones, and many other channels serve as data sources. Many of these platforms ask users for additional information and collect self-reported data, which is relatively reliable. Even so, users tend to make mistakes and at times neglect to enter all the required details accurately.

Existing datasets are bound to decay, for the reasons discussed at the beginning: people's circumstances change. Also, in order to produce additional insights, researchers, with the help of decision analysts, merge data collected from various sources. Improper merging in such cases leads to duplicate data. Typos are another prominent source of incorrect and misleading data.
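To illustrate how a careless merge creates duplicates, and one way to avoid them, here is a hedged sketch in Python. It assumes that records are plain dictionaries and that a normalized email address identifies a customer; the function names and the "email" key are hypothetical, and keeping the first record seen is a simplification rather than a recommended policy.

```python
def normalize_email(email):
    """Trim whitespace and lowercase, so trivial variants compare equal."""
    return email.strip().lower()


def merge_without_duplicates(*sources, key="email"):
    """Merge several record lists, keeping the first record per normalized key.

    Assumption: every record has the key field. A real pipeline would also
    decide how to reconcile conflicting values instead of dropping them.
    """
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(normalize_email(record[key]), record)
    return list(merged.values())


# Two hypothetical sources that describe the same customer differently.
crm = [{"email": "Ann@Example.com", "name": "Ann"}]
web = [
    {"email": " ann@example.com ", "name": "Ann B."},
    {"email": "bob@example.com", "name": "Bob"},
]
customers = merge_without_duplicates(crm, web)
```

Simply concatenating the two lists would yield three records, two of them for the same person. Matching on a normalized key reduces this to two, though typos inside the address itself would still slip through.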

Data becomes incorrect when field values fall outside the valid range. The value in a month field should range from 1 to 12; if 13 is entered, the value is invalid. Assumptions can be made about what the customer meant by entering 13, but there is no certain way of knowing.
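The month example can be expressed as a simple range check. The sketch below, in Python, is one possible form under stated assumptions: the function names and the "month" field are hypothetical, and out-of-range values are set aside for review rather than guessed at, since the intended value cannot be known.

```python
def is_valid_month(value):
    """Return True only for whole numbers from 1 to 12.

    Accepts integers and strings of decimal digits; rejects everything else.
    """
    if isinstance(value, bool):
        return False  # True/False are technically ints, but not months
    if isinstance(value, int):
        return 1 <= value <= 12
    if isinstance(value, str) and value.strip().isdecimal():
        return 1 <= int(value.strip()) <= 12
    return False


def split_by_month_validity(records, field="month"):
    """Separate records into those with a valid month and those without."""
    valid, rejected = [], []
    for record in records:
        target = valid if is_valid_month(record.get(field)) else rejected
        target.append(record)
    return valid, rejected


# Hypothetical input: one good value, one out of range, one missing.
rows = [{"month": 7}, {"month": 13}, {"month": None}]
valid_rows, rejected_rows = split_by_month_validity(rows)
```

Keeping the rejected records, instead of silently correcting or deleting them, leaves room for a person to follow up with the source.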

Intentional mistakes cannot be ignored either. Users often aren't convinced of the benefits of sharing valid information, and they misspell their names, invent email addresses, and garble mobile or landline numbers and area codes in order to remain anonymous and avoid unsolicited calls and emails.

Dirty data, no need to feel vulnerable

Sophisticated algorithms or data cleansing experts should be deployed to analyze the data for duplicates and other mistakes. Data modeling and visualization help a lot here: they improve the procedures and workflows of data collection and guide the design of forms and records with more standardized fields.

As data hygiene becomes more embedded in business operations, organizations get better at identifying error and warning conditions, responding rapidly, and eliminating dirty data even before it is added to the database.

Making the origins and history of data visible and transparent also helps. This way, companies are able to trace every single error back to its source. Put in the correct context, such errors can serve as valuable leads to additional information. Identifying an unreliable source of information also makes it possible to exclude that source from further data collection.

The methodology of data collection remains of paramount importance. It affects the completeness, validity, and consistency of your overall database. Every business should develop data collection methodologies tailored to its own data needs.

Data analytics is usually about looking out for irregularities, data spikes, and things that are out of the norm. This part of the job usually succeeds only if due attention is paid to the collection and cleansing of data. Machine learning has made software increasingly competent at telling right from wrong. Still, the more detailed and accurate the interpretation needs to be, the more human intervention is required, as we all would and should agree. At the end of the day, it is we humans, not the computers, who draw conclusions from big data. Computers can collect, store, crunch, and sort millions of pages, but they fail miserably when it comes to providing complete answers to data problems. Data certainly needs evaluation, analysis, and interpretation, but do you have the data analysts on board to do it?

