Language learners, beware: artificial intelligence is watching you.
And it does crazy things with language. Some of them seem strange to us humans, but that only makes them more interesting to examine.
The secret lies in NLP (Natural Language Processing), the science at the intersection of artificial intelligence and computational linguistics. It's not as popular as Big Data, but we all encounter its work daily. NLP is present in machine translation, smartphone autocorrect, chatbots, web scraping tools, etc.
It seems computers are playing with us. They interpret languages in ways linguists and translators could only dream of. Can you forecast election results or a flu epidemic relying on your linguistic skills alone? This is what computers are capable of today! It's enough to make one wonder: was Stephen Hawking right when he warned that AI could be the "worst event in the history of our civilization"?
So what's so magical about artificial intelligence? What can a computer do with languages that could impress us?
1. Machine translation
This technology appeared in 1954 and then lost its funding. Ten years of development didn't bring the expected results, so the program was closed.
Things are going better today. Computers understand words and idioms in many niches, including technical guidelines. They use two approaches for this: rule-based and statistical. The first one is difficult to apply because we can't describe languages with rules alone. So modern machine translators rely on the second approach, which consists of these stages:
- Model training
How does it work?
First, the system compares original texts with their human-made translations to build a translation model. Then, it scans texts in the target language to build a language model. And finally, a decoder selects the most appropriate variant from the translation model, checks it against the language model, and provides the best result.
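The decoding step can be sketched in a few lines of Python. This is only a toy illustration: the candidate sentences and all probabilities below are made-up assumptions, whereas a real system would estimate them from large corpora.

```python
import math

# Hypothetical toy numbers; real systems learn these from data.
# Translation model: how well each candidate matches the source sentence,
# estimated from texts paired with their human translations.
translation_model = {
    "the house is small": 0.40,
    "the home is little": 0.35,
    "small the house is": 0.25,
}

# Language model: how natural each candidate sounds in the target language,
# estimated from texts written in that language.
language_model = {
    "the house is small": 0.050,
    "the home is little": 0.010,
    "small the house is": 0.001,
}

def decode(candidates):
    """Pick the candidate with the best combined (log) score."""
    def score(candidate):
        return math.log(translation_model[candidate]) + math.log(language_model[candidate])
    return max(candidates, key=score)

best = decode(list(translation_model))
print(best)  # the house is small
```

The language model is what rules out "small the house is": it may match the source words, but it rarely occurs in fluent text.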
2. Sentiment analysis
Here computers analyze information such as emotional state or attitude toward something. Brands use this technology to analyze and manage their online reputation, and political experts use it to forecast election results by analyzing relevant tweets and blog posts.
Sentiment analysis performs polarity detection: it grades texts as positive, neutral, or negative. The main problem here is polysemy and context. Let's say you read a tweet about a new "lightweight" phone. This message conveys a positive sentiment. But what if you say the same about a politician or a newly published book? Not so complimentary, is it?
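Here is a minimal sketch of polarity detection based on word lists. The two tiny lexicons are hypothetical and chosen only for illustration; production systems use much larger lexicons or trained classifiers.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"love", "great", "fast", "lightweight"}
NEGATIVE = {"hate", "bad", "slow", "broken"}

def polarity(text):
    """Grade a text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(word in POSITIVE for word in words) - sum(word in NEGATIVE for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love my new lightweight phone"))  # positive
print(polarity("The battery is bad and slow"))      # negative
print(polarity("It arrived on Monday"))             # neutral
print(polarity("What a lightweight politician"))    # positive, although it is an insult
```

The last call shows the context problem: a fixed word list cannot tell that "lightweight" flatters a phone but not a politician.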
It seems that computers need to know how to find synonyms. The Word2Vec technology, developed in 2013 by Tomas Mikolov at Google, can do that.
Do you remember the "you shall know a word by the company it keeps" dictum by J. R. Firth? Or the "words that occur in the same contexts tend to have similar meanings" mantra from Zellig Harris? Both principles underlie the Word2Vec technology today.
It collects statistics on how often words appear together in phrases and uses neural networks to reduce the dimensionality of that data. The result is a set of vector representations that reflect the relationships between words in the texts. Word2Vec captures many linguistic patterns. It turns out that linear operations on word vectors correspond to semantic transformations!
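To see what "linear operations on word vectors" means, consider the sketch below. The three-dimensional vectors are hand-made assumptions, not the output of a trained Word2Vec model, which would typically use hundreds of dimensions learned from text.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not trained embeddings).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting "man" and adding "woman" moves the "king" vector toward "queen", which is the kind of semantic transformation the paragraph above describes.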
Yet, such synonymy holds only for texts on the same subject.
How do you understand a text's topic without reading it? The bag-of-words technique helps here.
How does it work?
Let's say you turn a text into an N-dimensional vector:
- N is the number of words in the language;
- Each component of the vector is the frequency of the corresponding word in the text.
This method works for topic classification.
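A minimal sketch of such a vector is shown below. For readability it assumes a hypothetical five-word vocabulary instead of every word in a language.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return word counts for the text, in the order given by the vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

# Hypothetical tiny vocabulary, for illustration only.
vocabulary = ["ball", "goal", "match", "election", "vote"]

vector = bag_of_words("The match ended when the late goal beat the first goal.", vocabulary)
print(vector)  # [0, 2, 1, 0, 0]
```

The non-zero components fall on "goal" and "match", so a classifier comparing this vector with known examples would lean toward sports rather than politics.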
To improve the results, you can use N-grams (bigrams, trigrams, etc.) along with the bag-of-words. These are word combinations that occur together: not just idioms or common phrases but any pairs (triplets, quads, etc.) of words that often follow one another in a text.
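Extracting N-grams takes only a sliding window over the words. The sketch below uses a made-up sample sentence and naive whitespace tokenization, both assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample sentence, split on whitespace.
text = "machine translation is hard but machine translation is useful"
tokens = text.split()

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("machine", "translation")])  # 2
print(ngrams(tokens, 3)[0])  # ('machine', 'translation', 'is')
```

Counting the tuples like this gives the N-gram frequencies that can be added to the bag-of-words vector as extra components.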
Bag-of-words classification is not about topics only. Depending on the assigned characteristics, a computer can guess the author of a text, its genre, and more. Provided you don't give it a plagiarized text, of course.
5. Automatic language detection
Today machines not only translate but also detect the source language. But how? Does Google Translate browse all dictionaries in all languages in seconds to find the requested word in one of them?
It's the work of N-grams again. Each language has its own set of characteristic letter combinations, which a computer uses for precise language identification.
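The idea can be sketched with character trigrams. The one-sentence "training" samples below are assumptions made for illustration; real detectors build their profiles from far more text and use more careful scoring.

```python
# Hypothetical one-sentence samples per language, for illustration only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog runs with this and that",
    "german": "der schnelle braune fuchs springt und der faule hund schaut dann mit dem kind zu",
    "french": "le renard brun rapide saute par dessus le chien paresseux et puis le chien court avec ceci",
}

def char_ngrams(text, n=3):
    """Return all character n-grams of the text, padded with spaces at both ends."""
    text = " " + text.lower() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A language profile is simply the set of trigrams seen in its sample.
PROFILES = {lang: set(char_ngrams(sample)) for lang, sample in SAMPLES.items()}

def detect(text):
    """Pick the language whose profile shares the most trigrams with the text."""
    grams = char_ngrams(text)
    scores = {lang: sum(gram in profile for gram in grams) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)

print(detect("where is the dog"))        # english
print(detect("der hund und der fuchs"))  # german
print(detect("le chien et le renard"))   # french
```

No dictionary lookup is involved: combinations such as " th", "der", or " le" are enough to separate these three languages.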
6. LDA (Latent Dirichlet Allocation)
Let's say you have a group of texts. You want to classify them, but here's the problem: you know nothing about them. How many texts are in the group? How long are they? What are their topics?
This is text clustering. The technology that deals with it is Latent Dirichlet Allocation (LDA). Used in recommender systems, it sees a text as a mix of topics, where every topic can generate words. It sounds complex, but it's not as scary as it sounds.
Let's take a model with a topic labeled CAT_related. It generates words such as "meow," "milk," or "kitten" that might relate to cats and puts them into a "CAT" bag. So-called stop words with no specific meaning will have the same weight across different topics.
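The sketch below illustrates only the generative view behind LDA: a document is a mix of topics, and each topic produces words. The topics, words, and weights are made-up assumptions. Actual LDA works in the opposite direction, inferring such topics from the texts, and that inference is not shown here.

```python
import random

# Hypothetical topics with made-up word weights. The stop word "the"
# has the same weight in both topics.
topics = {
    "CAT": {"meow": 0.3, "milk": 0.2, "kitten": 0.2, "the": 0.3},
    "DOG": {"bark": 0.3, "bone": 0.2, "puppy": 0.2, "the": 0.3},
}

def generate_document(topic_mixture, length, rng):
    """Generate words by first drawing a topic, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(topic_mixture), weights=list(topic_mixture.values()))[0]
        word_weights = topics[topic]
        word = rng.choices(list(word_weights), weights=list(word_weights.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the run is repeatable
document = generate_document({"CAT": 0.8, "DOG": 0.2}, 10, rng)
print(document)
```

A document drawn with an 80/20 mix will mostly contain cat words, while "the" appears at about the same rate whichever topic is picked, so it says nothing about the subject.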
In 1950, Alan Turing wrote "Computing Machinery and Intelligence," where he described the test we know today.
NLP's mission is to create AI that would allow people to get information without programming, simply by addressing a computer in their natural language. As we see, machines already solve many subtasks efficiently. And although it is still hard to call modern chatbots intelligent, NLP is rapidly evolving. Experts believe that the future of text processing belongs to Deep Learning.