Google has made its natural language processing (NLP) pre-training model, Bidirectional Encoder Representations from Transformers (BERT), available as open source for NLP researchers.
The BERT model can be used for a range of tasks such as "question answering and language inference, without substantial task-specific architecture modifications", according to the accompanying research paper.
According to Google AI research scientists Jacob Devlin and Ming-Wei Chang, one of the biggest challenges in NLP is the shortage of training data: the field is diverse, spanning many distinct tasks, and most task-specific datasets contain only a few hundred or a few thousand human-labeled training examples. Modern deep learning-based NLP models, however, benefit from much larger amounts of data, the scientists said.
Pre-training, according to the scientists, aims to close this gap by training general-purpose language representation models on the vast amount of unannotated text available on the web.
"The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch," the researchers said.
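The pre-train-then-fine-tune pattern the researchers describe can be sketched with a deliberately tiny toy. The hand-written word vectors and perceptron head below are illustrative stand-ins, not BERT's actual architecture: the point is only that the representation is produced once from (notionally) unlabeled text and kept fixed, while a small task head is trained on a handful of labeled examples.

```python
# Toy sketch of pre-training vs. fine-tuning (illustrative only).
# The "pre-trained" word vectors are hand-written stand-ins for
# representations learned from unlabeled text; real BERT learns far
# richer, contextual representations.

PRETRAINED = {            # word -> 2-d vector (frozen "representation")
    "good":  (1.0, 0.2), "great": (0.9, 0.1),
    "bad":  (-1.0, 0.3), "awful": (-0.9, 0.2),
    "movie": (0.0, 1.0), "film":  (0.0, 0.9),
}

def embed(sentence):
    """Average the frozen pre-trained vectors of known words."""
    vecs = [PRETRAINED[w] for w in sentence.split() if w in PRETRAINED]
    n = len(vecs) or 1
    return (sum(v[0] for v in vecs) / n, sum(v[1] for v in vecs) / n)

def fine_tune(examples, epochs=20, lr=0.5):
    """Train only a small task head (a perceptron) on a few labeled
    examples; the underlying representation stays fixed."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:      # label: +1 positive, -1 negative
            x = embed(text)
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1
            if pred != label:
                w[0] += lr * label * x[0]
                w[1] += lr * label * x[1]
                b += lr * label
    return w, b

train = [("good movie", 1), ("great film", 1),
         ("bad movie", -1), ("awful film", -1)]
w, b = fine_tune(train)

def classify(text):
    x = embed(text)
    return "positive" if w[0] * x[0] + w[1] * x[1] + b > 0 else "negative"

print(classify("great movie"))  # a sentence not seen during fine-tuning
```

With only four labeled examples, the head converges because the heavy lifting has already been done by the (here, faked) pre-trained representation; that division of labor is what the quoted accuracy improvements rest on.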
Furthermore, they said that the release enables anyone to train their own "state-of-the-art" systems in about 30 minutes on a single cloud tensor processing unit (TPU), or in a few hours using a single graphics processing unit (GPU).
"BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia)," the scientists said. Because the model is bidirectional, it draws on context from both the left and the right of each word; because it is unsupervised, it can learn from plain text that has not been labeled or categorized.
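The "deeply bidirectional, unsupervised" pre-training works by hiding words in unlabeled text and asking the model to predict them from the surrounding context on both sides. A minimal sketch of that masking step follows; the function and the 15% default rate are illustrative of the idea only (BERT's actual procedure also sometimes swaps in random words or leaves a selected position unchanged).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Toy BERT-style masking: randomly hide some tokens; the model
    must then predict each hidden token using context from BOTH sides,
    which is what makes the pre-training deeply bidirectional."""
    rng = random.Random(seed)   # seeded for reproducibility
    masked = list(tokens)
    targets = {}                # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = mask_token
            targets[i] = tok
    return masked, targets

sentence = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)   # sentence with some words replaced by [MASK]
print(targets)  # the hidden words the model would have to recover
```

Note that no labels are needed: the "answers" are the hidden words themselves, which is why plain, uncategorized text like Wikipedia suffices for pre-training.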
According to the scientists, the BERT models, available on GitHub, are currently offered only in English, but the aim is to release models pre-trained on a variety of languages in the future.