Machine Learning Is Making Unstructured Data Accessible

New technology is allowing companies to mine previously un-mineable data


A 2013 IBM report estimated that roughly 2,500,000TB of data was being created every day. The figure is very likely far higher now, as wearables, AI, and connected devices have become increasingly embedded in society, gathering a veritable tidal wave of additional information for organisations to interrogate.

This data comes in three forms: unstructured, semi-structured, and structured. Since the dawn of IT, structured data has been the main resource of analysts, and this is still the case today. In a 2015 IDG Enterprise study on big data and analytics, 83% of IT professionals said structured data initiatives were a high priority at their organizations, while just 43% said unstructured data initiatives were a top priority. Yet it is estimated that 90% of all data is either semi-structured or unstructured. For organizations, that is a tremendous number of potential insights to leave on the table.

Structured data is anything that fits in a relational database: it falls within a defined set of values or has a specific set of characteristics. Semi-structured data has no formal data model but does have some kind of structure, e.g. emails, zipped files, HR records and XML data. Unstructured data, meanwhile, is everything that does not fit into relational databases. This includes videos, PowerPoint presentations, company records, social media, RSS, documents, and text.
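To make the distinction concrete, here is a minimal Python sketch that stores the same fact in each of the three forms and reads it back. The table, the XML layout, the customer name, and the sample note are all invented for illustration; they are not taken from any system mentioned in this article.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Structured: fixed columns and types in a relational table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (?, ?)", ("Acme", 120.5))
structured_amount = conn.execute(
    "SELECT amount FROM orders WHERE customer = ?", ("Acme",)
).fetchone()[0]

# Semi-structured: tags give the data some shape, but no schema is enforced.
xml_doc = "<order><customer>Acme</customer><amount>120.5</amount></order>"
semi_structured_amount = float(ET.fromstring(xml_doc).findtext("amount"))

# Unstructured: free text; the value has to be dug out with a heuristic.
note = "Spoke to Acme today, they confirmed the order for 120.5 dollars."
match = re.search(r"(\d+(?:\.\d+)?) dollars", note)
unstructured_amount = float(match.group(1)) if match else None

print(structured_amount, semi_structured_amount, unstructured_amount)
```

The first two lookups work because the data declares where the amount lives. The third relies on a pattern that happens to fit this one sentence and would break as soon as the wording changed, which is roughly why unstructured data has been so much harder to analyze.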

Both structured and unstructured data are necessary to use analytics to its full potential, to build a complete picture of a company’s health and to pinpoint areas for growth. Essentially, structured data analytics describes what’s happening, while unstructured data analytics explains why it’s happening. Knowing what’s happening may enable you to form an idea of what’s going on and take action, but without understanding why, you run too high a risk that the action is wrong.

There are several reasons why companies have so far largely not analyzed their unstructured data in any meaningful way, central among which is simply the absence of the necessary tools. Advances in machine learning, however, mean that many now do, analyzing their mountains of unstructured content in ways they could not before.

Machine learning is valuable for the analysis of structured data, but indispensable when it comes to its unstructured counterpart because of the difference in scale. A human being simply cannot process that amount of data.

We spoke to Dave Copps, CEO of Brainspace, a maker of unstructured data analytics and eDiscovery software that uses machine learning. Dave noted that, ‘Before, all we really did with unstructured data was search, get a load of documents together and hack at it with keywords. Technologies like Tableau and QlikView were always good for looking at structured data, but those that tried to use unstructured data were really just taking it out and putting it into structured data platforms. Once you pull words out of a document, you destroy their context. So, say you’re analyzing résumés. If you take the Java out of a software developer’s CV, you don’t know if that’s only in there because the person has said ‘I suck at Java.’ What we’re doing is, rather than just analysing words, we’re looking at the whitespace between the words - the context.’
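The résumé example can be sketched in a few lines of Python. This is a toy illustration of why a bare keyword hit loses context; it is not Brainspace's method. The hand-written cue list and both function names are assumptions made for the sake of the example.

```python
import re

# Hypothetical phrases that flip the meaning of a skill mention.
NEGATIVE_CUES = ("suck at", "no experience with", "never used", "not familiar with")

def keyword_match(text, skill):
    """Naive search: true if the skill word appears anywhere in the text."""
    return skill.lower() in text.lower()

def context_aware_match(text, skill):
    """Count a mention only if its sentence does not negate the skill."""
    skill = skill.lower()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        if skill not in lowered:
            continue
        if any(f"{cue} {skill}" in lowered for cue in NEGATIVE_CUES):
            continue  # the skill is mentioned, but negatively
        return True
    return False

resumes = {
    "A": "Five years building backend services in Java.",
    "B": "I suck at Java. I mostly write Python.",
}

for name, text in resumes.items():
    print(name, keyword_match(text, "Java"), context_aware_match(text, "Java"))
```

Keyword search treats both candidates as Java developers, while the context check keeps only candidate A. A fixed list of cues will not scale to real documents, which is the gap machine-learned models of context are meant to fill.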

There are a number of areas where machine learning-driven unstructured data analytics software can be applied, with eDiscovery, internal discovery, and defence intelligence among the major ones. Copps uses the example of the recent VW scandal, noting that the company could have saved billions of dollars in fines if it had been able to analyze its communications earlier and identify the culprits. Marketing is another area with big potential: machine learning can help to make available the mass of public opinion in social interactions, showing not just whether people have mentioned a company but how they have mentioned it, and so providing a far more rounded view of the customer.

Take, for example, Donald Trump. MogIA is an AI company that analyzes data from Google, Twitter, and Facebook. It has predicted that Donald Trump will win on election day simply because he has had more public engagement, a number gathered by looking at Facebook Live and Twitter. This is, however, just the number of results. It says nothing of the context, which is often likely to be negative given the reaction to many of his irrational statements.

Essentially, however, it is a search problem. Search is fundamentally flawed when you try to analyze unstructured data through a tool designed for structured data. When you have half a million documents indexed for search, users are left to throw words in blindly, guessing at the contents. New tools that use machine learning and better data visualizations to show the results mean that, before searching, you can see what’s in there and search accordingly, often finding things you didn’t know would be in there.
