The HeLa cell line is notorious in medical research. It was the first human cell line to prove naturally ‘immortal’, meaning its cells do not die after a set number of divisions. This made it highly desirable for early-stage medical research, where results can be obtained easily and quickly. HeLa cells were crucial in the development of Jonas Salk’s polio vaccine, to name but one example, and have since been used in research into cancer and AIDS.
However, in 1967 a problem was discovered: HeLa cells could contaminate other cultures like no other cell line. If even one HeLa cell was dropped by mistake into a culture, or blown across a laboratory by a current of air, it would begin to grow and multiply faster than its host, contaminating even the most seemingly impenetrable laboratories and rendering worthless any research done with a cell line caught in its path. Much of the scientific community was, in one way or another, essentially working with the wrong material, getting the wrong results, and years of research went down the drain.
There are valuable lessons here for the AI community. Many practitioners have correctly realized that data matters more than algorithms, but are now rushing to collect as much training data as possible under the mistaken belief that quantity of data is more important than quality. In the same way that HeLa cells infected thousands of other cell lines, bad data can contaminate machine learning, with mistakes compounding rapidly and infecting an entire training set, thereby leading to wrong solutions, bad decision making, unethical AI, and any number of other crippling issues.
This is not to say that the machine learning and deep learning techniques modern AI relies on do not need huge volumes of data; of course they do. AI algorithms learn from the data in their training sets. Buying a commercial Machine-Learning-as-a-Service product without a plan or a budget for training data is like trying to make cement without water. This importance is reflected in a number of surveys. According to CrowdFlower’s 2017 Data Scientist Report, when asked to identify the biggest bottleneck in successfully completing AI projects, over half the respondents named issues related to training data (‘getting good quality training data or improving the training dataset’), while fewer than 10% identified the machine learning code as the biggest bottleneck. The survey also asked data scientists whether they would rather delete their machine learning code, delete their training data, or break a leg. Just 21% preferred accidentally deleting their training data, compared with 28% for breaking a limb and 52% for deleting an algorithm.
Getting a large quantity of data together for a training set is fairly easy; ensuring that it is of high enough quality is far harder, and far more important. Better data does not simply mean more data. Cleaning, labeling and categorizing data isn’t especially sexy, but it is critical, and data scientists spend the majority of their time doing it. One increasingly viable solution being touted is synthetic data.
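To make the cleaning step concrete, here is a minimal sketch of the kind of hygiene pass a training set typically needs before use. The record schema (`text` and `label` fields) is a hypothetical example, not something from the original article:

```python
def clean(records):
    """Drop malformed rows and exact duplicates from raw training records.

    Assumes each record is a dict with 'text' and 'label' fields
    (a hypothetical schema for illustration).
    """
    seen = set()
    cleaned = []
    for r in records:
        text = (r.get("text") or "").strip()
        label = r.get("label")
        if not text or label is None:   # malformed: empty text or missing label
            continue
        key = (text.lower(), label)
        if key in seen:                 # exact duplicate of an earlier record
            continue
        seen.add(key)
        cleaned.append({"text": text, "label": label})
    return cleaned

raw = [
    {"text": "good product", "label": 1},
    {"text": "good product", "label": 1},   # duplicate
    {"text": "   ", "label": 0},            # empty text
    {"text": "broken", "label": None},      # missing label
    {"text": "terrible", "label": 0},
]
cleaned = clean(raw)   # only the two valid, distinct records survive
```

Real pipelines add far more (outlier checks, label audits, schema validation), but even this trivial pass shows why the work consumes so much of a data scientist’s time.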
Synthetic data is artificially produced data that mimics, more or less exactly, the properties of real data. There are two primary ways to generate it: by observing the statistical distributions of the original data and drawing random samples from them to produce fake data, or by building a model that explains the observed behaviour and then generating random data from that model. For example, a generative model built on a neural network and trained on a series of images of faces for facial recognition would produce fake images of faces. The same approach can be applied to a wide range of other data: establish the patterns, then produce something that fits within the range established. In this sense, synthetic data still needs real datasets to work and will never be able to replace them entirely. No model will ever be able to generate examples of things it has never seen real examples of before.
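The first approach described above, fitting observed statistical distributions and drawing random samples from them, can be sketched in a few lines. This toy version fits an independent Gaussian to each numeric column; the example data (height/weight pairs) is invented purely for illustration:

```python
import random
import statistics

def fit_columns(rows):
    """Estimate the mean and standard deviation of each numeric column."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def sample_synthetic(params, n, seed=0):
    """Draw n fake rows from the fitted per-column normal distributions."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

# Toy "real" dataset: (height_cm, weight_kg) observations
real = [(170, 65), (182, 80), (165, 58), (175, 72), (168, 62)]
params = fit_columns(real)
fake = sample_synthetic(params, 100)   # 100 synthetic rows with similar statistics
```

Production systems model the joint distribution rather than each column independently (correlations between height and weight are lost here), but the principle, observe the real data, then draw from the fitted model, is the same.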
Potentially, you could use synthetic examples from generative models alongside a small number of real examples to train a system as effectively as you could using a large number of real examples. This has several advantages, providing more, and hopefully cleaner, data. Further, a large amount of synthetic data can be designed to reflect a fuller range of possibilities than the sample itself represents, including rare situations that simply haven’t occurred in it.
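The idea of padding a small real dataset with synthetic examples can be sketched as follows. The `jitter` generator here is a hypothetical stand-in for a real generative model; it simply perturbs randomly chosen real rows:

```python
import random

def augment(real_rows, synthesize, target_size, seed=0):
    """Pad a small real dataset with synthetic rows up to target_size."""
    rng = random.Random(seed)
    synthetic = [synthesize(rng) for _ in range(target_size - len(real_rows))]
    combined = real_rows + synthetic
    rng.shuffle(combined)   # mix real and synthetic examples together
    return combined

# Three real (feature_a, feature_b) observations -- invented for illustration
real = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.8)]

def jitter(rng):
    """Toy 'generative model': add small Gaussian noise to a real example."""
    x, y = rng.choice(real)
    return (x + rng.gauss(0, 0.1), y + rng.gauss(0, 0.1))

train_set = augment(real, jitter, 50)   # 3 real + 47 synthetic rows
```

A real generative model would also let you steer generation toward the rare situations mentioned above, rather than merely perturbing what you already have.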
One early-stage startup gaining ground in the field is Automated DL. The Virginia-based company creates synthetic data using generative models that produce data resembling, or in some way related to, the historical examples they’re trained on. CEO Jeff Schilling said, ‘We want the AI people to get there better, faster.’ The company is currently applying the technique primarily to cybersecurity, but there is no reason it will not work in AI development across every field. Synthetic data is still at an extremely early stage, but given the dangers presented by contaminated datasets and the challenge inherent in cleaning them, it is a solution that definitely has legs and one to keep an eye on.