In a recent interview with us, Vijay D'Souza, Director of the Center for Enhanced Analytics at the US Government Accountability Office, noted that, ‘Regardless of the goals, it’s important to understand the quality of the data you have. The quality determines how much you can rely on the data to make good decisions.’ As it stands, bad data is undermining companies' data initiatives. This is one of the primary reasons why just 25% of businesses successfully use their data to optimize revenue, despite the tremendous resources being pumped into these initiatives. IBM estimates that bad data costs organizations some $3.1 trillion a year in the US alone, while in Experian’s Data Quality survey, 83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.
The issue of bad data is going to become an even bigger problem as machine learning adoption increases. Machine learning algorithms are only as good as the data they are trained on; indeed, the majority of machine learning practitioners consider the data more important than the algorithms themselves. In CrowdFlower’s 2017 Data Scientist report, when asked to identify the biggest bottleneck in successfully completing AI projects, over half the respondents cited getting good quality training data or improving the training dataset. Jérôme Selles, Director of Data Science & Analytics at Turo, argues that, ‘Depending on the quality of the data that is being used, automating the learning loop can be a challenge and, today, requires manual supervision. A good illustration of that is what happened with the Microsoft chatbot Tay that became racist within 24 hours. For Machine Learning to achieve its own potential, the learning process needs to be kept under control and values need to be respected. Data quality for the models is as important as education values in our society and we need more automated and systematic ways to make this happen.’
However, while data scientists may realize the importance of quality data, the same is not necessarily true of business leaders. They are more focused on adopting the technology as soon as possible to maintain a competitive edge, and often understand little about the practicalities. This means people are using whatever is in their datasets and hoping for the best, under the misapprehension that quantity can compensate for quality. Mistakes in the training data infect a system like urine in a swimming pool, polluting all the results and rendering any insights untrustworthy. They lead to wrong solutions, bad decision making, and potentially unethical AI. These issues could ultimately destroy the technology as organizations grow disillusioned when they don't get the return on investment they expected and abandon ship.
Bad data is a problem that the majority of companies will, unfortunately, struggle to deal with. All datasets are flawed; some are just less flawed than others. Bad data is the result of a number of factors, the first of which is basic data decay. At any given time, as much as 70% of datasets are outdated. People's lives are constantly changing - they move house, they switch jobs, they change phone numbers, and so forth. As a result, email addresses change at a rate of about 23% a year, postal addresses at about 20%, and telephone numbers at roughly 18%. If you’re using this data, your machine learning algorithms will be coming to conclusions that may have worked yesterday, but will not represent the realities of today and tomorrow.
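The decay rates above compound quickly. A minimal sketch of the arithmetic, treating the quoted figures as constant annual churn rates (an illustrative assumption, not an exact model):

```python
# Sketch: compound contact-data decay over time, using the churn rates
# quoted above (23% email, 20% postal, 18% phone per year). Assumes a
# constant annual rate compounding independently - a simplification.

def fraction_still_valid(annual_decay_rate: float, years: float) -> float:
    """Fraction of records still accurate after `years` at a constant
    annual churn rate."""
    return (1 - annual_decay_rate) ** years

for field, rate in [("email", 0.23), ("postal address", 0.20), ("phone", 0.18)]:
    print(f"{field}: ~{fraction_still_valid(rate, 2):.0%} still valid after 2 years")
```

At a 23% annual churn rate, only about 59% of email addresses survive two years untouched - a dataset that felt fresh at collection time can be nearly half stale by the time a model trained on it ships.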
Another cause of bad data is bias at the source. Data is liable to be guided by false assumptions and drawn from sources like ill-considered market research. And people often don't realize this until long after they have drawn their conclusions and enacted strategies based upon them. For example, when the municipal authority in charge of Boston, Massachusetts released a mobile app called Street Bump in 2011, they did so with the admirable goal of finding a more efficient way to discover roads that needed repair by crowdsourcing data. The app, sensibly, used the smartphone’s accelerometer to detect jolts as cars went over potholes and GPS to record where each jolt was felt. Yet the system soon began to report a disproportionate number of potholes in wealthier neighborhoods - something common sense would dictate to be highly unlikely. The authority realized that the app was far more likely to be downloaded by younger, more affluent citizens with better digital knowledge, so the sample was heavily skewed towards this particular group. Another example was seen in a landmark paper released in 2001 that suggested legalizing abortion had reduced crime rates - a conclusion with major policy implications. In 2005, two economists at the Federal Reserve Bank of Boston showed that this correlation was actually due to a coding error in the model and a sampling mistake.
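The Street Bump failure mode is easy to reproduce in miniature. The sketch below uses made-up numbers (the neighborhood names, pothole counts, and adoption rates are all hypothetical) to show how unequal app adoption alone can make two identically damaged areas look very different in the raw report counts:

```python
# Illustrative sketch of Street Bump-style sampling bias: both areas have
# the same number of potholes, but app adoption differs, so raw report
# counts overstate the problem in the affluent area. All figures invented.
import random

random.seed(0)

true_potholes = {"affluent": 100, "low_income": 100}    # ground truth: equal
adoption_rate = {"affluent": 0.60, "low_income": 0.15}  # chance a pothole is reported

reports = {
    area: sum(random.random() < adoption_rate[area] for _ in range(count))
    for area, count in true_potholes.items()
}
print(reports)  # affluent area appears to have several times more potholes
```

The model here isn't wrong about the reports it sees; the reports themselves are a biased sample of reality, which is exactly why the conclusions have to be checked against the data-collection process.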
Even more concerning is bad data that may not be an accident. According to the Principal Researcher at Microsoft Research and the founder of Data & Society, Danah Boyd, 'countless actors (are) trying to develop new, strategic ways to purposefully mess with systems with an eye on messing with the training data that all of you use. They are trying to fly below the radar. And if you don't have a structure in place for strategically grappling with how those with an agenda might try to rout your best laid plans, you're vulnerable.' The internet is one of the most popular sources of machine learning training data, particularly open APIs from the likes of Twitter. Developers may remove problematic content and language, but there will always be people working to skew the data, and they are always finding new ways to do so, whether simply for fun or for more sinister reasons.
There are a number of things you can do to solve the issue of bad data. In terms of deliberately skewed data, Danah Boyd argues that, 'We need to actively and intentionally build a culture of adversarial testing, auditing and learning into our development practice. We need to build analytic approaches to accept the biases of any data set we use. And we need to build tools to monitor how the systems evolve with as much effort as we build the models in the first place.' Getting rid of decaying data is more difficult, though. It is a costly, time-consuming process that requires talent and resources. Data needs to be assessed frequently and the assumptions underlying analysis challenged by decision makers at every stage. If something seems wrong, you need to look back at the data you have rather than simply assuming it’s right. Your data also needs to be cleaned by a dedicated data scientist, and such specialists are in short supply.

Synthetic data is one potential way around this. Synthetic data is artificially produced data that closely mimics the properties of real data. For example, if you applied a generative model built on a neural network to a series of images of faces for the purposes of facial recognition, it would produce fake images of faces. Synthetic data is still an extremely nascent technology, though, and it is up for debate as to whether it is actually any better. Ultimately, keeping datasets clean may be a tremendous challenge, but if it is not tackled, machine learning projects will almost certainly fail, so make sure it is always at the front of your mind.
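To make the idea concrete, here is a minimal sketch of one very simple flavor of synthetic data: fit per-column statistics on real numeric records, then sample fake records from those fitted distributions. A neural-network generative model of the kind described above captures far richer structure than this; the sketch only preserves each column's marginal mean and spread, and the column names and figures are hypothetical:

```python
# Minimal synthetic-data sketch: fit per-column mean and standard deviation
# on real numeric records, then sample fake records from independent
# Gaussians. Preserves only marginal statistics - a deliberate simplification
# relative to real generative models. Example columns are hypothetical.
import random
import statistics

def fit_and_sample(real_rows, n_synthetic, seed=42):
    rng = random.Random(seed)
    columns = list(zip(*real_rows))  # transpose rows into columns
    params = [(statistics.mean(col), statistics.stdev(col)) for col in columns]
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in params)
        for _ in range(n_synthetic)
    ]

# Hypothetical real records: (age, income)
real = [(30, 52000.0), (41, 61000.0), (35, 58000.0), (29, 49500.0)]
fake = fit_and_sample(real, 3)
```

Even this toy version shows the appeal: the fake rows can be shared or used for testing without exposing any real individual's record, while retaining the broad statistical shape of the original data.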