Four Essential Steps In Dealing With Unstructured Data Sets

Embracing unstructured data sets


The analytics world has spent the last two decades consolidating data sets. Analytics has been synonymous with data consolidation, data integration, or, as some call it, ETL (Extract, Transform, Load). With the advent of Big Data, some aspects of these platforms have evolved and changed.

Not everything is necessarily in a relational database anymore. HDFS and NoSQL data stores have become part of the vernacular. However, there is still a push to fit unstructured data into the same round hole: 'Put it in a table, create structures around it, tag it'. The key question I ask is: 'How natural is it to put unstructured data into a structure?'

Human conversation and language are rich with data. Whether we are discussing locations, schedules, likes or dislikes, we are constantly revealing information. We spit out as much data as any device! So does it make sense to try to structure all of this natural language exchange into a table? Language belongs in a document. It is meant to be written in a Word document, or maybe a PDF, not in an Excel file. The data store for language is whatever environment supports its creation: text editors, word processors, audio recordings and so on. Having said that, we should have the means to get key data sets out of these data stores. Here are some ideas on how this can be done and how we may be able to embrace unstructured data sets.

1. You need powerful tools that find hidden data sets in all the standard documents we have. Documents can be PDFs, Word documents, contracts, or even HTML files. Locating data is the first step.

2. You need platforms that can then reveal the possible 'extractable' data, i.e. help me explore the possibilities within my data sets and my client documents. Show me all the options available in these documents, and help me find what is common and repeated across them.

3. Now you need to pull those valuable data sets out and treat them as data assets that can be used for other needs. I want to compare them, I want to aggregate them, or I just want to list them.

4. Help me use these data sets to plan activities for tomorrow. Load them into my planning tools and platforms. I want to bring them into the fold of my plan for tomorrow, or run analytics on them.
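To make the four steps concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: the documents are already plain text, the "hidden data" is limited to ISO-style dates and dollar amounts, and the names `PATTERNS`, `locate`, `explore`, `extract` and `total_amount` are hypothetical helpers invented for this example, not part of any product or library. Real documents would need far richer rules or models.

```python
import re
from collections import Counter

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}

def locate(text):
    """Step 1: find candidate data values hidden in free text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

def explore(docs):
    """Step 2: count how many documents contain each kind of value."""
    counts = Counter()
    for text in docs.values():
        for name, hits in locate(text).items():
            if hits:
                counts[name] += 1
    return counts

def extract(docs, fields):
    """Step 3: pull the chosen fields out as rows that can be compared or listed."""
    rows = []
    for doc_id, text in docs.items():
        found = locate(text)
        for field in fields:
            for value in found[field]:
                rows.append({"doc": doc_id, "field": field, "value": value})
    return rows

def total_amount(rows):
    """Step 4: a simple aggregate that could feed a planning tool."""
    return sum(
        float(row["value"].lstrip("$").replace(",", ""))
        for row in rows
        if row["field"] == "amount"
    )

# Made-up sample documents standing in for contracts and memos.
docs = {
    "contract_a.txt": "Signed on 2023-01-15. The fee is $1,200.00, due by 2023-02-01.",
    "contract_b.txt": "Renewal dated 2023-03-10 for a fee of $800.50.",
    "memo.txt": "Let's meet next week to discuss the plan.",
}

rows = extract(docs, ["date", "amount"])
print(explore(docs))       # which kinds of data recur across documents
print(total_amount(rows))  # aggregate of the extracted amounts
```

The point of the sketch is the shape of the flow: the text stays in its documents, and only the values worth treating as data are lifted out into rows.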

So am I just talking about ETL for unstructured data sets, or does it go beyond that? Well, it does go beyond just ETL. ETL effectively turned into a higher-level programming language, where each ETL platform required its own niche skill: folks were trained in Informatica, IBM DataStage, etc. The challenge here is that although ETL was good at quickly giving you a view of your 'source' data sets, that access is not so easy with unstructured data, and so the platform has to be much more evolved. Understanding which parts of a document can be turned into data is not straightforward. Words can mean multiple things. In some contexts they may be data and in others they may not. Hence ETL for unstructured data is far more complicated than ETL for structured data. It is here that we use the gamut of modern techniques such as Machine Learning and Artificial Intelligence to truly separate language into what is data and what is not.
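As a toy illustration of why context matters, consider the word "May", which is sometimes a month (data) and sometimes just a verb (not data). The sketch below uses a single hand-written rule as a stand-in for the machine learning techniques mentioned above; the rule and the helper name `find_may_dates` are assumptions made up for this example, and a real system would learn such distinctions rather than hard-code them.

```python
import re

# Assumed rule: "May" counts as data only when followed by a day number,
# optionally with a comma and a four-digit year.
MAY_AS_DATE = re.compile(r"\bMay\s+\d{1,2}(?:,\s*\d{4})?\b")

def find_may_dates(sentence):
    """Return the spans where 'May' is being used as part of a date."""
    return [match.group(0) for match in MAY_AS_DATE.finditer(sentence)]

print(find_may_dates("May I call you on Friday?"))
print(find_may_dates("The contract ends on May 31, 2024."))
```

Even this one word needs a rule of its own, which hints at how quickly hand-written rules stop scaling across a whole language.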

