A To Z Of Analytics

A comprehensive list of data-related terms and their definitions


Analytics has taken the world by storm and is the powerhouse behind the digital transformation happening in every industry.

Today, everybody is generating tons of data, and all of it is stored in Big Data platforms. But storing data gets us nowhere unless analytics is then applied to it. Hence it is extremely important to close the loop with insights from analytics.

Here is my version of A to Z for Analytics:

Artificial Intelligence: AI is the capability of a machine to imitate intelligent human behavior. BMW, Tesla, and Google are using AI for self-driving cars. AI should be used to solve tough real-world problems, from climate modeling to disease analysis, for the betterment of humanity.

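Boosting and Bagging: Techniques used to generate more accurate models by ensembling multiple models together. Bagging trains models independently on bootstrap samples of the data and combines their outputs; boosting trains models sequentially, with each one focusing on the errors of the previous ones.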
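
To make bagging concrete, here is a minimal sketch in plain Python. It assumes a toy one-dimensional data set and a deliberately simple base model (a "decision stump" that predicts 1 when x is above a threshold); the function names and data are hypothetical and only illustrate the idea of bootstrap sampling plus majority voting. Boosting is not shown here.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold t that misclassifies the fewest points.
    The stump predicts 1 when x > t, otherwise 0."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in points}):
        err = sum((x > t) != bool(y) for x, y in points)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_stumps(points, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points])
            for _ in range(n_models)]

def predict(thresholds, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (x, label) pairs
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]
ensemble = bagged_stumps(data)
print(predict(ensemble, 0.5), predict(ensemble, 10))
```

Each stump sees a slightly different sample, so individual thresholds vary, but the vote across all of them tends to be more stable than any single stump.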

CRISP-DM: The Cross-Industry Standard Process for Data Mining. It was developed by a consortium of companies including SPSS, Teradata, Daimler and NCR Corporation in 1997 to bring order to the development of analytics models. The six main steps are business understanding, data understanding, data preparation, modeling, evaluation and deployment.

Data preparation: In analytics deployments, more than 60% of the time is spent on data preparation. As the rule goes, garbage in, garbage out; hence it is important to cleanse and normalize the data and make it available for consumption by the model.
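
As a small illustration of cleansing and normalizing, the sketch below fills missing values with the median and then rescales a column to the range 0 to 1 (min-max scaling). This is only one of many possible preparation steps; the function name and the sample column are assumptions made for the example.

```python
from statistics import median

def clean_and_scale(values):
    """Fill missing entries (None) with the median, then min-max scale to [0, 1]."""
    present = [v for v in values if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in values]
    lo, hi = min(filled), max(filled)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(v - lo) / (hi - lo) for v in filled]

incomes = [20, None, 40, 60]     # hypothetical raw column with a gap
print(clean_and_scale(incomes))  # [0.0, 0.5, 0.5, 1.0]
```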

Ensembling: The technique of combining two or more models to get more robust predictions. It is like combining all the marks we obtain in exams to arrive at a final overall score. Random Forest is one such example, combining multiple decision trees.

Feature selection: Simply put, this means keeping only those features or variables in the data that are really relevant and removing the rest. This can improve model accuracy significantly.
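
One simple way to do this is a correlation filter: keep a feature only if it is sufficiently correlated with the target. The sketch below assumes numeric columns and a hypothetical cut-off of 0.5; real projects often use more sophisticated methods, so treat this as an illustration rather than a recommendation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_features(columns, target, min_abs_corr=0.5):
    """Keep the names of columns whose |correlation| with the target meets the cut-off."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) >= min_abs_corr]

# Hypothetical data: sales fall as price rises; the other columns are irrelevant.
columns = {
    "price":    [1, 2, 3, 4, 5],
    "noise":    [3, 1, 3, 1, 3],
    "constant": [7, 7, 7, 7, 7],
}
sales = [10, 8, 6, 4, 2]
print(select_features(columns, sales))  # ['price']
```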

Gini Coefficient: A measure of the predictive power of a model, typically used in credit scoring tools that predict who will repay and who will default on a loan.
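
In credit scoring, the Gini coefficient is commonly computed from the area under the ROC curve as Gini = 2 × AUC − 1. The sketch below assumes that convention, with label 1 meaning "defaulted" and a higher score meaning higher predicted risk; the data is made up for illustration.

```python
def gini_coefficient(labels, scores):
    """Gini = 2 * AUC - 1, with AUC computed by comparing every
    (defaulter, non-defaulter) pair of scores. Ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1

# Hypothetical loans: 1 = defaulted, 0 = repaid
labels = [0, 0, 1, 1]
print(gini_coefficient(labels, [0.1, 0.2, 0.8, 0.9]))  # 1.0 (perfect ranking)
print(gini_coefficient(labels, [0.1, 0.8, 0.2, 0.9]))  # 0.5
print(gini_coefficient(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.0 (no better than chance)
```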

Histogram: A graphical representation of the distribution of a set of numeric data, usually drawn as a vertical bar graph. It is used in exploratory analytics and in the data preparation step.
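
Behind the bars, a histogram is just a count of how many values fall into each equal-width bin. A minimal sketch, assuming equal-width bins and made-up values:

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1  # all values identical: everything lands in the first bin
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, not a bin past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]   # hypothetical data with one outlier
for i, count in enumerate(histogram(values, 3)):
    print(f"bin {i}: {'#' * count}")
```

The long first bar and the lone count in the last bin make the outlier easy to spot, which is exactly why histograms are useful during data preparation.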

Independent Variable: The variable that is changed or controlled in a scientific experiment to test its effect on the dependent variable, such as the effect of increasing the price on sales.

Jubatus: An online machine learning library covering classification, regression, recommendation (nearest neighbor search), graph mining, anomaly detection and clustering.

KNN: The k-nearest neighbors algorithm in machine learning, used for classification problems. It labels a data point based on its distance or similarity to other data points.
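
The idea fits in a few lines. The sketch below assumes Euclidean distance and a simple majority vote among the k closest training points; the training data is hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (point, label) pairs."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_predict(train, (2, 2)))  # A
print(knn_predict(train, (7, 7)))  # B
```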

Lift Chart: Widely used in campaign targeting problems to determine which customers to target for a specific campaign. It also tells you how much response you can expect from the new target base.
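
The numbers behind a lift chart can be sketched as follows: rank customers by model score, split them into equal buckets, and compare each bucket's response rate with the overall rate. The example assumes the number of customers divides evenly into the buckets, and the scores and responses are made up.

```python
def lift_by_bucket(responded, scores, n_buckets=5):
    """Lift per bucket = bucket response rate / overall response rate.
    Bucket 0 holds the highest-scored customers.
    Assumes len(scores) is a multiple of n_buckets."""
    pairs = sorted(zip(scores, responded), key=lambda p: p[0], reverse=True)
    ranked = [r for _, r in pairs]
    overall = sum(ranked) / len(ranked)
    size = len(ranked) // n_buckets
    lifts = []
    for b in range(n_buckets):
        bucket = ranked[b * size:(b + 1) * size]
        lifts.append((sum(bucket) / len(bucket)) / overall)
    return lifts

# Hypothetical campaign: model scores and whether each customer responded
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1,   1,   1,   0,   0,   1,   0,   0,   0,   0]
print([round(x, 2) for x in lift_by_bucket(responded, scores)])
```

A lift of 2.5 in the top bucket would mean that targeting only those customers yields 2.5 times the response rate of targeting everyone at random.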

Model: There are more than 50 modeling techniques, including regression, decision trees, SVM, GLM and neural networks, available in technology platforms such as SAS Enterprise Miner, IBM SPSS or R. They are broadly categorized as supervised or unsupervised methods and cover tasks such as classification, clustering and association rules.

Neural Networks: These are typically organized in layers made up of nodes. They mimic the way the brain learns. Today, Deep Learning is an emerging field based on deep neural networks.

Optimization: The use of simulation techniques to identify the scenarios that will produce the best results within the available constraints, e.g. sale price optimization, or identifying the optimal inventory for maximum fulfilment while avoiding stock-outs.
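
A toy version of sale price optimization under an inventory constraint might look like this. The linear demand curve (100 − 2 × price) is purely an assumption for the example, as are the function name and the candidate prices; a real project would estimate demand from data and likely use a proper solver.

```python
def best_price(candidate_prices, stock):
    """Return the candidate price with the highest revenue,
    given that we cannot sell more units than we have in stock."""
    def revenue(price):
        demand = max(0, 100 - 2 * price)  # hypothetical demand curve
        units = min(demand, stock)        # inventory constraint
        return price * units
    return max(candidate_prices, key=revenue)

prices = range(1, 51)
print(best_price(prices, stock=1000))  # 25 (stock is not a limiting factor)
print(best_price(prices, stock=30))    # 35 (limited stock pushes the price up)
```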

PMML: Predictive Model Markup Language, the XML-based file format developed by the Data Mining Group to transfer models between various technology platforms.

Quartile: Dividing the sorted output of a model into four groups for further action.
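
As a sketch, Python's standard library can compute the three cut points and assign each score to a quartile. Here quartile 1 holds the lowest scores and quartile 4 the highest; the scores are hypothetical.

```python
from bisect import bisect_left
from statistics import quantiles

def assign_quartiles(scores):
    """Return the quartile (1 = lowest, 4 = highest) for each score."""
    cuts = quantiles(scores, n=4)  # three cut points between the four groups
    return [bisect_left(cuts, s) + 1 for s in scores]

scores = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical model output
print(assign_quartiles(scores))     # [1, 1, 2, 2, 3, 3, 4, 4]
```

A typical follow-up action would be to treat the top quartile differently, for example by sending only those customers a premium offer.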

R: Today, universities and corporations alike use R for statistical model building. It is freely available, and there are also licensed versions such as Microsoft R. More than 7,000 packages are now at the disposal of data scientists.

Sentiment Analytics: The process of determining whether information or a service provided by a business leads to positive, negative or neutral human feelings or opinions. Consumer product companies measure sentiment 24/7 and adjust their marketing strategies accordingly.

Text Analytics: Used to discover and extract meaningful patterns and relationships from collections of text, such as posts on social media sites like Facebook, Twitter and LinkedIn, blogs, and call center scripts.

Unsupervised Learning: Algorithms that are expected to find patterns even though there is only input data and no labeled output. Clustering and association algorithms such as k-means and Apriori are the best examples.

Visualization: The method of enhancing exploratory data analysis and showing the output of modeling results with highly interactive statistical graphics. Any model output has to be presented to senior management in the most compelling way. Tableau, QlikView and Spotfire are leading visualization tools.

What-If analysis: The method of simulating business scenarios to answer questions such as: if we increased our marketing budget by 20%, what would be the impact on sales? Monte Carlo simulation is a very popular technique for this.
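
A minimal Monte Carlo sketch of the marketing budget question is shown below. It assumes, purely for illustration, that each marketing dollar returns an uncertain amount of sales drawn from a normal distribution (mean 3.0, standard deviation 0.5); the budgets and parameters are hypothetical.

```python
import random
from statistics import mean

def simulate_sales(budget, n_runs=10_000, seed=42):
    """Simulate sales many times under an uncertain return per marketing dollar."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        sales_per_dollar = rng.gauss(3.0, 0.5)  # assumed, not measured
        results.append(budget * sales_per_dollar)
    return results

baseline = simulate_sales(100_000)
scenario = simulate_sales(120_000)   # what if the budget goes up by 20%?
uplift = mean(scenario) - mean(baseline)
print(f"Expected extra sales: {uplift:,.0f}")
```

Because every run draws a different return, the simulation yields a whole distribution of outcomes, so you can also report best-case and worst-case ranges instead of a single number.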

What do you think should come for X, Y, Z?
