Five or ten years ago, barely anyone would have used tech that could be controlled solely by voice. Put simply, the technology available wasn't developed enough to be genuinely usable. Today, most people who've grown up with technology as part of their lives will have used something like Siri, Amazon Echo, or Google Voice. The technology is no longer part of a vision of the future; it's there in our smartphones.
At the Machine Learning Innovation Summit, which took place this month in San Francisco, Shubho Sengupta, Research Scientist at Facebook AI, presented on the neural networks behind speech and language in computing. To go from obscure tech to mainstream area of focus is a testament to the sheer potential behind voice assisted computing. As Shubho says, 'the talk is really "what just happened?", "what caused this sudden explosion?"'
The first ingredient in what Shubho calls the ‘secret sauce’ is the rapid growth in the amount of data available. With access to ‘a lot of very good, labeled data’, tech companies were able to draw on structured databases to vastly improve their intelligent tech. ‘We also have very good translation data, thanks to legislation in the EU, where every document has to be translated into multiple languages,’ Shubho says. ‘The same thing is in Canada, where every official document has to be in English and French - these are very good human generated translations.’
The second vital catalyst for the voice assistant boom is encoder-decoder networks - the main facet of Shubho's talk. He compared the networks to 'modern day babel fishes' - a fictitious alien fish, featured in the science fiction novel The Hitchhiker's Guide to the Galaxy, which performs instant translations when coupled with brainwaves. 'They're babel fishes in a very general sense. They can take not only sound waves and generate sound waves, they can do it for any sequence they can think of, including characters, words, and things like that.' The technology is complex, but ultimately it allows the machine to encode an input sequence into an internal representation and then generate an appropriate output sequence from it.
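The encode-then-decode flow described above can be sketched in a few lines. This is a deliberately minimal, untrained toy with random weights - the vocabulary, sizes, and simple RNN cells are illustrative assumptions, not details from the talk - but it shows the two halves: an encoder folding an input sequence into a single vector, and a decoder unrolling an output sequence from that vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (illustrative assumptions, not from the talk)
vocab = list("abcdefgh")
V, H = len(vocab), 16          # vocabulary size, hidden state size

# Random (untrained) weights for a simple RNN encoder and decoder
W_xh_enc = rng.normal(0, 0.1, (V, H))
W_hh_enc = rng.normal(0, 0.1, (H, H))
W_xh_dec = rng.normal(0, 0.1, (V, H))
W_hh_dec = rng.normal(0, 0.1, (H, H))
W_hy     = rng.normal(0, 0.1, (H, V))

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def encode(seq):
    """Fold the whole input sequence into one hidden state vector."""
    h = np.zeros(H)
    for ch in seq:
        x = one_hot(vocab.index(ch))
        h = np.tanh(x @ W_xh_enc + h @ W_hh_enc)
    return h

def decode(h, steps):
    """Generate an output sequence one token at a time, conditioned on h."""
    out, x = [], np.zeros(V)
    for _ in range(steps):
        h = np.tanh(x @ W_xh_dec + h @ W_hh_dec)
        i = int(np.argmax(h @ W_hy))   # greedy choice of next token
        out.append(vocab[i])
        x = one_hot(i)                 # feed the prediction back in
    return "".join(out)

thought = encode("abc")
print(decode(thought, 4))  # untrained weights, so the output is arbitrary
```

Because the decoder only sees the encoder's final vector, the same skeleton works for any sequence pair - audio frames to characters, English words to Mandarin characters - which is exactly the 'babel fish in a very general sense' point.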
The difficulty involved in developing the technology lies in the sheer scale of it. 'These networks are some of the largest neural networks in existence,' Shubho explains. 'They are much larger than any network that you typically use for, say, object classification or object segmentation for images.' Shubho developed Baidu's system for Mandarin, for example, and that uses roughly 100 to 200 million parameters, depending on what it is being used for and how much data it is being trained on. Google's translate function uses around 200 to 400 million.
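It is easy to see how counts climb into the hundreds of millions once layers are stacked. As a rough sketch - the layer sizes below are illustrative assumptions, not figures from the talk - a stack of large LSTM layers (a common recurrent cell in these systems) reaches the quoted range quickly:

```python
def lstm_params(input_size, hidden_size):
    """Weights plus biases for one LSTM layer (4 gates per cell)."""
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

# Illustrative architecture: 5 stacked layers, 2048 units each
hidden, layers = 2048, 5
total = layers * lstm_params(hidden, hidden)
print(f"{total / 1e6:.0f}M parameters")  # ~168M, within the quoted 100-200M
```

Each layer alone contributes about 34 million parameters here, so a modest stack already matches the scale Shubho describes, before any input or output layers are counted.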
So the scale is vast, and coupled with the immense data available, it is a formidable technology. As Shubho explains, a typical industrial strength speech recognition system 'will be trained on about 15 to 20 million utterances, which is about 10,000 hours of audio.' Translate functions can use 30 to 40 million input samples. When these numbers are combined with a 200 to 400 million parameter net, 'you end up with a training FLOP requirement of anywhere between 50 to 100 exaFLOPs' - an exaFLOP being a billion billion floating-point operations.
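A back-of-envelope calculation shows how the article's figures land in that range. The frame rate and number of training passes below are assumptions added for illustration (they are not from the talk), as is the standard rough rule that the backward pass costs about twice the forward pass:

```python
# Figures from the article
params = 150e6                  # speech model, ~100-200M parameters
hours  = 10_000                 # ~15-20M utterances of audio

# Assumptions for the sketch
frames_per_sec = 100            # typical audio feature frame rate
epochs = 20                     # passes over the training data

frames = hours * 3600 * frames_per_sec

# ~2 FLOPs per parameter per frame forward; backward ~2x forward, so 3x total
flops = 2 * 3 * params * frames * epochs
print(f"{flops / 1e18:.0f} exaFLOPs")  # falls in the quoted 50-100 range
```

With these assumptions the estimate comes out around 65 exaFLOPs of total training compute, consistent with the 50 to 100 exaFLOPs Shubho cites.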
This is computing on an incredible scale. Technological innovations have made it possible, and Shubho then goes on to talk about the current research taking place in the field of voice assisted computing. The three key areas remain translation, text-to-speech, and speech-to-text. For translation, there are certain combinations of languages - like English to Mandarin - that need significant work. Nuance also needs developing, given how much more effective translation currently is for factual, straightforward prose. 'God forbid poetry, songs, and things like that. We are far away from doing that in translation.' In terms of text-to-speech, Baidu's Deep Voice, for example, is wildly impressive, and there are a number of companies working to bring natural intonation to computer generated voices.
Speech-to-text is equally challenging. Shubho explains that 'what we have made the most progress on is what we call single speaker, near field audio. So, what I mean by that is just me, talking to a phone, very close to my mouth. To do a lot of people talking at a party, where they're far away from the microphone, that is incredibly hard.' The idealized goal for those working in speech-to-text is to be able to place a device in the middle of a table and have an accurate transcript of a meeting logged from the speech - or, if you like, multi-speaker, far-field audio.
According to Shubho, we are some way from multi-speaker, far-field technology, and though machine literacy is improving rapidly and the current capabilities are already impressive, there are still key areas in which steps forward can be made. Voice assisted tech is well on its way to becoming a fixture of the mainstream, though, and with ever-increasing computing power behind it, the capabilities will only become more mind-blowing.