Since the dawn of big data, privacy concerns have overshadowed every advancement and every new algorithm. The same is true of machine learning, which learns from big data to essentially think for itself. This presents an entirely new threat to privacy, opening up volumes of data for analysis on a whole new scale. Many standard applications of machine learning and statistics will, by default, compromise the privacy of individuals represented in the data sets. They are also vulnerable to hackers, who could tamper with the training data, compromising both the data and the final goal of the algorithm.
A recent project demonstrating how machine learning could be used directly to invade privacy was carried out by researchers at Cornell Tech in New York. Ph.D. candidate Richard McPherson, Professor Vitaly Shmatikov, and Reza Shokri applied basic machine learning algorithms - not even specifically written for the purpose - to identify people in blurred and pixelated images. In tests where humans had virtually no chance of identifying the person (0.19%), they say the algorithm had 71% accuracy. This went up to 83% when the computer was given five attempts.
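The gap between the 71% and 83% figures is easier to see with a small scoring sketch. It assumes that "five opportunities" means the algorithm submits a ranked list of guesses and is counted as correct if the right person appears anywhere in the first five, which is the usual "top-k" convention; the article itself does not spell this out. The function name, the people and the rankings below are all hypothetical.

```python
def top_k_accuracy(ranked_guesses, true_labels, k):
    """Fraction of examples whose true label appears among the first k guesses.

    ranked_guesses: one list per example, best guess first (hypothetical format).
    true_labels:    the correct identity for each example.
    """
    if len(ranked_guesses) != len(true_labels):
        raise ValueError("need exactly one ranked list per true label")
    if not true_labels:
        raise ValueError("need at least one example")
    hits = sum(
        truth in guesses[:k]
        for guesses, truth in zip(ranked_guesses, true_labels)
    )
    return hits / len(true_labels)


# Made-up rankings for four blurred photos (illustration only, not real data).
ranked = [
    ["ann", "bob", "cat"],
    ["bob", "ann", "cat"],
    ["cat", "bob", "ann"],
    ["bob", "cat", "ann"],
]
truth = ["ann", "ann", "ann", "cat"]

top1 = top_k_accuracy(ranked, truth, k=1)  # only the first guess counts
top2 = top_k_accuracy(ranked, truth, k=2)  # either of the first two guesses counts
```

Allowing more guesses can only keep the score the same or raise it, which is why the five-guess figure is the higher of the two.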
Blurred and pixelated images have long been used to disguise people and objects. License plate numbers are routinely blurred on television, as are the faces of underage criminals, victims of particularly horrific crimes and tragedies, and those wishing to remain anonymous when interviewed. YouTube even offers its own facial blurring tool, developed to mask protesters and prevent potential retribution.
Possibly the most shocking part of the research is the ease with which the researchers were able to make it work. The team used mainstream machine learning methods, in which the computer is trained on a set of example data rather than being explicitly programmed. McPherson noted that, ‘One was almost a tutorial, the first one you download and play with when you’re learning neural nets.’ They also noted that the training data could be as simple as images on Facebook or a staff directory on a website. For numbers and letters (even handwritten), the training data is publicly available online, opening up the further risk of fraud.
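To make the idea of training on examples concrete, here is a minimal sketch in plain Python. It is not the researchers' method: they used neural networks on real photographs, whereas this swaps in a much simpler nearest-centroid classifier and synthetic random "faces", purely for illustration. The function names, image sizes and noise level are all assumptions. The point it shows is that the classifier never un-blurs anything; it simply learns what each person looks like after pixelation.

```python
import random


def pixelate(img, block):
    """Mosaic pixelation: replace each block x block tile with its mean value."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r0 in range(0, h, block):
        for c0 in range(0, w, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, h))
                     for c in range(c0, min(c0 + block, w))]
            mean = sum(img[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


def flatten(img):
    """Turn a 2-D image into a flat feature vector."""
    return [v for row in img for v in row]


def noisy_photo(rng, template, noise):
    """A new 'photo' of one person: their template plus per-pixel noise."""
    return [[v + rng.gauss(0, noise) for v in row] for row in template]


def train_centroids(examples):
    """Average the pixelated training vectors of each person."""
    sums, counts = {}, {}
    for label, vec in examples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, vec):
    """Return the person whose average pixelated image is closest to vec."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(centroids[lab], vec)),
    )


def run_experiment(seed=0, people=10, size=16, block=4,
                   noise=0.1, train_per=5, test_per=5):
    """Train and test on pixelated synthetic images; return the accuracy."""
    rng = random.Random(seed)
    # Hypothetical identities: one fixed random pattern per person.
    templates = {p: [[rng.random() for _ in range(size)] for _ in range(size)]
                 for p in range(people)}

    def sample(p):
        return flatten(pixelate(noisy_photo(rng, templates[p], noise), block))

    train = [(p, sample(p)) for p in templates for _ in range(train_per)]
    test = [(p, sample(p)) for p in templates for _ in range(test_per)]
    centroids = train_centroids(train)
    correct = sum(predict(centroids, vec) == p for p, vec in test)
    return correct / len(test)


accuracy = run_experiment()
```

With ten synthetic people, random guessing would be right about 10% of the time, so any accuracy well above that means the pixelated images still carry enough information to tell people apart. Real faces are far harder than these toy patterns, so the numbers from this sketch say nothing about real-world performance.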
The stated purpose of the research was to warn privacy and security enthusiasts that machine learning and AI could be used as tools for identification and data collection. However, the threat may not be as bad as they claim. Machine learning cannot reverse the pixelation and recreate the original images, so anyone worried about blurred pictures of themselves can rest a little easier, at least for now. It is also only successful at identifying things it has been specifically trained to look for. That said, hackers could quite easily train the system using photos taken from social media. And for protesters - the people YouTube’s tool was purportedly built to protect - whose faces will likely already be on file, it is cold comfort.
As machine learning becomes increasingly powerful, algorithms could conceivably make high-confidence predictions without having direct access to your private information. This was already seen to an extent when retailer Target’s predictive algorithm suggested a girl was pregnant and sent her promotional material accordingly, accidentally revealing the secret to her father. In future, it may not even be necessary to have access to any of her personal details to figure it out. In such a world, there is no such thing as private information. Given the emphasis that so many put on privacy, the reaction to this is likely to be highly negative.