Amazon is forecast to sell 29 million Echos in 2018 and 39 million in 2019. Amazon reportedly surpassed its full-year shipment goal for the Echo Dot within just a few months earlier this year which, given the device’s increasing ubiquity, is not hard to imagine.
The product has come some distance since its release in 2014, when it was introduced to relatively little fanfare. However, Amazon Alexa has heralded a new dawn in voice recognition software, using extremely complex machine learning processes to revolutionize the way we conduct everyday tasks.
At the recent Machine Learning Innovation Summit in San Francisco, Francois Mairesse, Senior Machine Learning Scientist at Amazon, outlined in detail how the technology behind the Amazon Echo works and its continuing evolution as the retail giant searches for perfection.
Speech recognition software first took off in the 1990s, when significant amounts of DARPA funding were invested in a range of research projects. Universities like Cambridge and CMU developed their own recognition systems, some of which saw a degree of commercial success, particularly when integrated into desktop operating systems such as Windows. The primary issue with many of these was that there simply wasn’t sufficient data to train them on. You needed to do speaker adaptation, which meant shifting the model’s parameters to suit a specific user, and that is not very practical when you are operating on relatively limited processing power.
However, this has changed over the last decade, with the digital era bringing a marked increase in the amount of data at hand to train models from, as well as advancements in processing power. By 2011, this had reached a point where Apple felt confident enough to introduce Siri on the iPhone, and it was rapidly followed by the likes of Google, Samsung, and Microsoft.
These applications put speech recognition on the map, but adoption of phone-based Voice User Interfaces (VUIs) has been slow. The main problem is that if they are not accurate, users will simply revert to typing, as it does not take any longer. The real inflection point came in November 2014, when Amazon launched the Echo and introduced far-field technology. The device seems to realize the promise of voice as a more natural and frictionless way to interact with technology, with far-field arguably the ultimate application for speech recognition software.
Amazon had launched several products that used close-talk technology prior to the Echo. The first was Amazon Dash, which you would put in your kitchen. The Dash allowed you to add milk, butter, and other household products to your shopping list. It was a simple device requiring users to speak directly into a microphone, which is much easier to process than a far-field task, where the user will be some distance away and the device has to distinguish between actual voice input and background noise. The user’s speech is then compared to the catalog of all the possible grocery items, which is large, but not at the scale seen in later devices. This was followed by Fire TV, which similarly used close-talk technology but extended the catalog to include films and so forth. Finally, Amazon added voice search to its shopping app. Here the catalog is huge, as almost every word in the English language is a product, meaning there was a very large output space.
Then came the move to Amazon Echo, which was Amazon’s first far-field device. Echo is a fully fledged virtual assistant, rather than just a voice search tool as Amazon’s previous attempts were. It relies on machine learning, getting more accurate the more data it gets. It started with shopping, the weather, music, and so forth, and now enables phone calls and messaging, learning the more it is used. The philosophy, says Mairesse, is to keep it simple, implement all features in the cloud, and ensure that the device is application agnostic and the applications are device agnostic, so you can easily add more of either.
The journey from voice input to the Echo producing a result may seem to happen quickly, but it relies on an incredibly complex process. For example, a user could say ‘Alexa, what’s the weather in Seattle?’. The first step from here is signal processing, which gives the device as many chances as possible to make sense of the audio by cleaning the signal. Signal processing is one of the most important challenges in far-field audio. The goal is to enhance the target signal, which means being able to identify ambient noise like the TV or dishwasher and minimize it. To do this, beamforming is used: seven microphones identify roughly where the signal is coming from so the device can focus on it. Acoustic echo cancellation handles the device’s own output: the Echo knows what it is playing and can subtract that signal so that only the important signal remains.
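The beamforming idea can be illustrated with delay-and-sum, the simplest textbook form of the technique. This is a minimal sketch and not a description of what the Echo actually runs: it assumes the per-microphone arrival delays are already known (a real device has to estimate them), uses three toy channels rather than seven, and the function name `delay_and_sum` is ours.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its arrival delay (in samples) and average them.

    Signals arriving from the target direction line up and reinforce each
    other; sounds from other directions do not line up and are attenuated.
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(usable)
    ]


# Toy data (assumed): one source reaches three microphones 0, 1 and 3 samples late.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
delays = [0, 1, 3]
total = len(source) + max(delays)
channels = [[0.0] * d + source + [0.0] * (total - d - len(source)) for d in delays]

aligned = delay_and_sum(channels, delays)
print(aligned)
```

With the correct delays the averaged output reproduces the source; with the wrong delays the copies partially cancel, which is how steering towards one direction suppresses others.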
Once this is done, the next task is Wake Word Detection. This determines whether the user has said one of the words the device is programmed to wake on, such as Alexa or Echo. The detector has to minimize both false positives and false negatives, which could lead to accidental purchases and angry customers. This is further complicated because it needs to cope with differences in pronunciation, and it needs to do so on the device, which has limited CPU power. It also needs to do this quickly, so the detector requires both high accuracy and low latency.
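The trade-off between false positives and false negatives can be sketched as a thresholding problem. The example below assumes a hypothetical on-device model that emits a wake-word score between 0 and 1 for each audio frame; the function name, the smoothing window, and the threshold value are all illustrative choices, not Amazon’s.

```python
def detect_wake_word(scores, threshold=0.8, window=3):
    """Return the index of the first frame where the moving average of the
    last `window` scores exceeds `threshold`, or None if it never does."""
    for end in range(window, len(scores) + 1):
        average = sum(scores[end - window:end]) / window
        if average > threshold:
            return end - 1
    return None


# A sustained run of high scores triggers the device...
print(detect_wake_word([0.1, 0.2, 0.9, 0.95, 0.9, 0.3]))
# ...while a single noisy spike does not.
print(detect_wake_word([0.1, 0.99, 0.1, 0.1]))
```

Raising the threshold or lengthening the window reduces false wakes at the cost of missing more genuine ones and reacting later, which is why accuracy and latency have to be tuned together.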
If the wake word is detected, the signal is then sent to the speech recognition software in the cloud, which takes the audio and converts it to text. This essentially moves the process from a binary classification problem to a sequence-to-sequence problem. The output space here is huge, as it covers all the words in the English language, and the cloud is the only technology capable of scaling sufficiently. Typically, the entropy of what the user might say is very high: it is not a yes or no dilemma, rather it is every possible question you could ask. Therefore, you also need the context or it will not work. This is further complicated by the number of people who use the Echo for music, as many artists spell their names differently from ordinary words.
To convert the audio into text, Alexa will extract 10-millisecond windows of audio. It will then analyze characteristics of the user’s speech, such as frequency and pitch, to give you feature values. A decoder will determine what the most likely sequence of words is, given the input features and the model, which is split into two pieces. The first of these pieces is the prior, which gives you the most likely sequence based on a huge amount of existing text, without looking at the features. The other is the acoustic model, which is trained with deep learning by looking at pairings of audio and transcripts. These are combined and dynamic decoding is applied, which has to happen in real time.
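How the prior and the acoustic model combine can be shown with a toy rescoring step. This sketch assumes the candidate transcripts and their log-probabilities are already available; all of the numbers are invented for illustration, and a real decoder searches over a very large space of word sequences rather than a two-item list.

```python
def best_hypothesis(acoustic_logp, prior_logp, prior_weight=1.0):
    """Pick the transcript maximizing log P(audio | words) + weight * log P(words)."""
    return max(
        acoustic_logp,
        key=lambda words: acoustic_logp[words] + prior_weight * prior_logp[words],
    )


# Invented scores: the two candidates sound almost identical to the acoustic model...
acoustic_logp = {
    "what's the weather in seattle": -10.0,
    "what's the whether in seattle": -9.8,
}
# ...but the prior, learned from text alone, strongly prefers the first.
prior_logp = {
    "what's the weather in seattle": -12.0,
    "what's the whether in seattle": -20.0,
}

print(best_hypothesis(acoustic_logp, prior_logp))
```

Setting `prior_weight` to zero makes the function fall back on acoustics alone, and the homophone wins, which is the kind of error the prior exists to prevent.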
This is when the Natural Language Understanding (NLU) kicks in and converts the text into a meaningful representation. This is still a classification task, but the output space is smaller. It is a discrete-to-discrete mapping, so typically you will start with rules and regular expressions, but there are so many edge cases that you end up needing to rely on statistical models. So, if you were to ask for the weather in Seattle, the intent would be ‘get weather’ and the location would be Seattle. There are problems with cross-domain intent classification. It is similar to Chinese whispers in some ways. For example, ‘play remind me’ is very different to ‘remind me to go play’, but it could be misinterpreted if this stage falls down, and the Echo could produce the wrong result. Speech recognition errors also carry through: ‘play like a prayer BY Madonna’ could be heard as ‘play like a prayer BUY Madonna’, which has obvious consequences. Out-of-domain utterances that make no sense are also rejected at this stage, which again prevents the device from mistakenly acting on commands heard from televisions and the like.
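The rule-based starting point for NLU can be sketched with a handful of regular expressions that map an utterance to an intent plus named values. The intent names, patterns, and the `parse` function below are hypothetical, chosen to mirror the examples in the text; production systems replace or supplement such rules with statistical models precisely because hand-written patterns miss edge cases.

```python
import re

# Hypothetical rules: each intent is paired with a pattern whose named
# groups become the extracted values.
RULES = [
    ("GetWeather", re.compile(r"^what(?:'s| is) the weather in (?P<location>.+)$")),
    ("PlayMusic", re.compile(r"^play (?P<song>.+?)(?: by (?P<artist>.+))?$")),
    ("SetReminder", re.compile(r"^remind me to (?P<task>.+)$")),
]


def parse(utterance):
    """Return (intent, slots) for the first matching rule, or (None, {})."""
    text = utterance.lower().replace("\u2019", "'").strip().rstrip("?")
    for intent, pattern in RULES:
        match = pattern.match(text)
        if match:
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return intent, slots
    return None, {}  # out-of-domain: reject rather than guess


print(parse("What's the weather in Seattle?"))
print(parse("play remind me"))
print(parse("remind me to go play"))
print(parse("turn left at the lights"))
```

Note that ‘play remind me’ and ‘remind me to go play’ land in different intents only because each pattern is anchored at the start of the utterance; a looser rule set would confuse them.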
The application layer would then fetch the weather, and the dialog manager decides whether more information is needed to provide an accurate answer. The language generator essentially formulates the prompt for Alexa to speak, and text-to-speech technology converts that text into the audio Alexa speaks out loud. Formulating the prompt relies on Natural Language Generation (NLG). Typically, when you build a conversational agent, you start with templates, which work until the system has to scale up. Speech engines use concatenative synthesis, where audio is sliced into tiny units and the machine tries to find the optimal sequence of pieces to maximize the naturalness of the audio given the sequence of words.
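The dialog manager and template-based generation can be sketched together. Everything in this example is an assumption made for illustration: the required-slot table, the follow-up question, the template wording, and the `respond` function are not taken from Alexa.

```python
# Hypothetical tables: which values each intent needs, what to ask when one
# is missing, and the sentence template used once everything is available.
REQUIRED_SLOTS = {"GetWeather": ["location"]}
FOLLOW_UPS = {"location": "For which city?"}
TEMPLATES = {
    "GetWeather": "Right now in {location} it's {temperature} degrees and {condition}.",
}


def respond(intent, slots, data):
    """Ask a follow-up if a required slot is missing; otherwise fill the template."""
    for name in REQUIRED_SLOTS.get(intent, []):
        if name not in slots:
            return FOLLOW_UPS[name]
    return TEMPLATES[intent].format(**slots, **data)


weather = {"temperature": 54, "condition": "cloudy"}  # made-up application result
print(respond("GetWeather", {"location": "Seattle"}, weather))
print(respond("GetWeather", {}, weather))
```

Templates like this are easy to write for one intent, but every new feature needs its own hand-written sentences, which is the scaling problem the text refers to.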
The Echo is nearly there, but researchers are working constantly to improve the speech recognition software, particularly around sensing the emotion in a person’s voice. Further improvements will see Alexa better able to hold a conversation - remembering what a person has said previously, and applying that knowledge to subsequent interactions, taking the Echo from highly effective to magical.