3 Things Every Company Needs To Do Before It Can Begin Integrating Machine Learning Into Its Operations

We spoke to John J. Thomas, Distinguished Engineer & Director at IBM, about what he thinks companies need to have in place before they even begin ML integration


Integration of ML is at the top of almost every company's to-do list nowadays. On average, we collectively produce around 2.5 quintillion bytes of data every day. That's a staggering amount of data, and its possible applications are innumerable. However, without a system to effectively sift through it for new and exciting insights, it's all essentially useless to an enterprise.

Adoption of machine learning does not just 'happen', though. It is a process, and if the foundations aren't laid, it will likely end in failure.

In the lead-up to the Big Data & Analytics Innovation Summit this month in Hong Kong, I had a series of discussions with John J. Thomas, Distinguished Engineer & Director at IBM, about the issues facing organizations in their data efforts. John has 25 years of experience in enterprise architecture, competitive strategy, data science, and worldwide client-facing engagements. His expertise spans a spectrum of technology areas including enterprise systems, cloud, analytics, machine learning, and AI.

As such, he is extremely familiar with the kind of challenges faced by those trying to integrate machine learning into their decision-making processes. This made him the ideal person to discuss this topic with, starting with the first hurdle every such company comes across: 

What are the most important things companies need to have in place before they begin integrating machine learning into day-to-day operations?


The most important thing to realize first is that there is no magic box called machine learning that will automatically give you all the answers you're looking for. We have two phrases we often use:

'You need data for data science'

'Before you can get to AI you need IA' (Information Architecture).

What do I mean by this? Well, you need to be very careful with the data you're dealing with before you get to the actual building of machine learning models and weaving them into your decision-making processes. Remember, garbage in equals garbage out.

You need to have a structure around how you ingest data, how you understand data lineage, and how you govern data. All of these things are very, very important. Answers to questions like 'who has touched and modified the data?' and 'how has it been transformed along the way?' are important to know before it ever gets to the data scientist. This is the data lineage discussion.

Data management, lineage, and governance (data lineage is kind of a subset under the bigger umbrella of data governance) are important to understand before you can start doing real-world data science. Hence, I believe the most important thing to do if you are dealing with machine learning and data science in an enterprise setting is to think very seriously about the information architecture that will support it.
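The lineage questions John raises, who touched the data and how it was transformed, can be made concrete with a small sketch. The class and field names below are illustrative assumptions, not any particular governance tool's API; the point is simply that every transformation leaves an auditable record:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """One entry in a dataset's audit trail."""
    actor: str      # who touched the data
    operation: str  # what transformation was applied
    timestamp: str

@dataclass
class Dataset:
    name: str
    rows: list
    lineage: list = field(default_factory=list)

    def transform(self, actor, operation, fn):
        """Apply a transformation and record who did it, what it was, and when."""
        self.rows = [fn(r) for r in self.rows]
        self.lineage.append(LineageRecord(
            actor=actor,
            operation=operation,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        return self

# Example: cleanse raw readings before they ever reach the data scientist.
ds = Dataset("sensor_readings", rows=[" 42 ", "17", " 99"])
ds.transform("etl_pipeline", "strip whitespace", str.strip)
ds.transform("etl_pipeline", "cast to int", int)

# The lineage answers 'who has touched the data, and how was it transformed?'
for rec in ds.lineage:
    print(rec.actor, "-", rec.operation)
```

Real systems push this metadata into a governance catalog rather than keeping it on the object, but the principle is the same: lineage is captured at transformation time, not reconstructed afterwards.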

Secondly, within this domain, people now have the notion that just because they are great at some particular algorithm, or are a great TensorFlow programmer or an awesome Spark ML programmer, that automatically makes them a good data scientist. This is incorrect because, for data science to be successful, it needs to bring together multiple disciplines. Areas such as data engineering, machine learning, and visualization all need to come together to create an effective data science team.

I'm not saying that any one individual needs to have all these skills themselves. Sure, if they do, that's awesome, but teams need to have these skills come together. This relates to what I said earlier - if you have a data engineering skill set, you also need an information architecture expert to get you to AI. The data engineer is the person who works with the data pipelines, the one who cleanses the data, understands where it comes from, deals with governance, and so on. So you need that discipline combined with the discipline of building machine learning and deep learning models.

Next, how do you visualize the results of these models so that business users, customers, and end users can consume them in an efficient way? Having a model with great predictive power is not by itself useful, not until it's consumable in some business processes. It is important to understand that these are disciplines that all need to come together.

The final thing you need is what we at IBM call 'operationalizing machine learning'. What I mean by that is, you can't just leave models stuck inside your data scientist's workbench. It's not enough to build, train, and test models; you need to be able to go beyond that and deploy them so that they can be consumed easily. This means you can monitor the performance of the model not only at the point of deployment, but over time.

What often ends up happening once the models are exposed to the real world is that those which initially performed really well start degrading over time. We see this all the time; sometimes it is because the patterns in the data you trained with are no longer the patterns present today. Whatever the reason, model performance often degrades. Hence, the question becomes: how do you make sure that your model remains stable over time? Scheduled evaluations, the ability to retrain the model, and keeping the model current over time are all very important. We call this operationalizing machine learning, and it is a very important aspect of integrating ML into operations.
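The monitoring step John describes can be sketched in a few lines. This is a minimal, hypothetical illustration, the class name, window size, and tolerance threshold are all assumptions rather than any IBM tooling: a deployed model's rolling accuracy is compared against its accuracy at deployment time, and a drop beyond tolerance flags it for retraining.

```python
from collections import deque

class DriftMonitor:
    """Flag a deployed model for retraining when its accuracy degrades.

    `baseline` is the accuracy measured at deployment time. If the
    rolling accuracy over the last `window` predictions falls more
    than `tolerance` below it, the model is due for retraining.
    """
    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # True = correct prediction

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return self.baseline  # no live data yet; assume baseline holds
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        return self.rolling_accuracy() < self.baseline - self.tolerance

# A model deployed at 90% accuracy starts missing in production.
monitor = DriftMonitor(baseline=0.90, tolerance=0.05, window=4)
for pred, actual in [(1, 1), (0, 1), (1, 0), (0, 1)]:
    monitor.record(pred, actual)
print(monitor.needs_retraining())  # True: rolling accuracy is 0.25
```

In practice this check would run as a scheduled evaluation job, with the retraining flag feeding an automated pipeline rather than a print statement.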

So those are the 3 things you need to get right first:

1. Understanding the information architecture in place

2. Bringing multiple disciplines together

3. Operationalizing ML: deploying and monitoring models, scheduling evaluations, and retraining

Keep your eye out for the rest of our interview with IBM's John Thomas. However, if you want to hear more in person from John and other data experts, attend our Big Data & Analytics Innovation Summit in Hong Kong April 18-19.


Read next:

Why We Need Data Visualization To Understand Unstructured Data