Massachusetts Institute of Technology (MIT) computer scientists have developed a machine-learning system that identifies objects within an image based on a spoken description. As the image is described, the system highlights the relevant regions in real time.
The researchers hope the new speech–object recognition technique will reduce the hours of manual annotation labor required today and open new opportunities in speech and image recognition.
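The article does not detail how the localization works, but one plausible mechanism is to embed speech frames and image regions into a shared vector space and score every frame against every region. The sketch below is illustrative only: the embeddings are random placeholders standing in for the outputs of trained audio and image networks, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in a real system these would come from
# trained audio and image networks, not random numbers.
audio_frames = rng.normal(size=(50, 128))     # 50 speech frames, 128-d each
image_cells = rng.normal(size=(14, 14, 128))  # 14x14 grid of image regions

# Similarity between every speech frame and every image cell.
matchmap = np.einsum('td,hwd->thw', audio_frames, image_cells)

# For one spoken word (say, frames 10..20), the region it "points at"
# is the cell with the highest average similarity over those frames.
word_sim = matchmap[10:20].mean(axis=0)       # shape (14, 14)
best_cell = np.unravel_index(word_sim.argmax(), word_sim.shape)
print(best_cell)
```

Thresholding such a similarity map over time is one way a system could highlight the described region of the image as the speaker talks.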
According to MIT computer science and technology writer Robert Matheson, the model could also be applied to language translation, removing the need for an annotator.
"Of the estimated 7,000 languages spoken worldwide, only 100 or so have enough transcription data for speech recognition," Matheson said. "Consider, however, a situation where two different-language speakers describe the same image.
"If the model learns speech signals from language A that correspond to objects in the image and learns the signals in language B that correspond to those same objects, it could assume those two signals – and matching words – are translations of one another," he noted.
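The cross-lingual idea Matheson describes can be sketched concretely: if each language's words have learned association scores with the same image objects, words with similar association profiles are candidate translations. The numbers, words, and object list below are made up purely for illustration.

```python
import numpy as np

# Hypothetical association scores between spoken words and three image
# objects (dog, tree, car), learned separately for each language from
# descriptions of the same images. Values are invented for illustration.
lang_a = {"chien": np.array([0.90, 0.10, 0.00]),
          "arbre": np.array([0.10, 0.80, 0.10])}
lang_b = {"dog":  np.array([0.85, 0.15, 0.00]),
          "tree": np.array([0.05, 0.90, 0.05])}

def cosine(u, v):
    """Cosine similarity between two object-association profiles."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pair each language-A word with the language-B word whose object
# profile is most similar -- a candidate translation.
translations = {a: max(lang_b, key=lambda b: cosine(va, lang_b[b]))
                for a, va in lang_a.items()}
print(translations)  # {'chien': 'dog', 'arbre': 'tree'}
```

The matching here is a toy nearest-neighbor pairing; the point is only that a shared visual grounding can align vocabularies without any transcribed parallel text.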
MIT Computer Science and Artificial Intelligence Laboratory research scientist David Harwath said: "We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to.
"We got the idea of training a model in a manner similar to walking a child through the world and narrating what you're seeing."