Automated Statistical Computation: Welcome To The Statistics Revolution

Are you going to the party?

Right now we’re in the middle of an incredible revolution in statistics.

Well, actually that’s not entirely true.

The revolution is happening in the computation of statistics.

How Did We Get Here?

Mathematical and statistical computation has evolved over the years from finger counting, tally marks in the sand and the abacus, through pen-and-paper calculations to the first algorithms on modern computers back in the 1960s. Then various generic statistical packages popped up, such as SPSS (launched in 1968) and Minitab (1972), that allowed scientists to do their calculations faster and more accurately, reducing the potential for calculation errors.

They were a boon for scientists and analysts the world over, and they transformed the way data was handled and processed.

These packages, though, had the drawback of being rather inflexible. By the late 1970s a raft of statistical programming languages were appearing, spearheaded by SAS and S (both launched in 1976).

These had the advantage of allowing great flexibility to the user, although you needed to be a rather accomplished computer programmer if you wanted to use them.

The '80s and '90s saw the introduction of S-PLUS (1988) and R (1993), whose programming languages were designed to be somewhat more user-friendly.

The issue with these statistical languages has mostly centred on their speed of computation. Although much quicker than SAS and S, they were – and still are – much slower than languages such as C++ and Java.

Well, what’s a few milliseconds between friends, I hear you ask?

Well, when your dataset is comparatively small it doesn’t really matter how quickly or slowly your result is returned. Do you really care whether you get your result in 0.5 seconds rather than 0.005 seconds?

Not really, but the difference between these speeds really racks up when you have a huge dataset, complex algorithms and/or millions of computations.

Welcome to the Revolution

This is where the statistical revolution is happening. Over the past decade, there has been an enormous explosion in the amount of data being collected. We’re in the Big Data Age, where data storage is now measured in exabytes. Since 2012, 2.5 exabytes (2.5×10^18 bytes) of data have been created every day. Sure, most of it is cat videos on YouTube or 13-year-olds exchanging text messages about the next Twilight movie, but the amount of genuine research data being created is giving researchers some serious headaches.

Even Small Data can become Big Data (comparatively speaking) very quickly.

The Problem

For example, take a small patient healthcare dataset (say 2000 patients) of 100 variables and try to find all the relationships between the variables in the dataset.

This is a very common problem in healthcare and probably represents the most used family of statistical tests – correlations and associations.

We all know what these tests are, we hear them every day in the form of

  • ‘is smoking associated with lung cancer?’ or
  • ‘is there a correlation between obesity and diabetes?’

Comparing all the variables with each other in this small dataset requires you to run around 5,000 statistical hypothesis tests (99 + 98 + 97 + … + 1 = 4,950).

I guess that isn’t too bad, but think about this for a couple of seconds. How many mouse clicks does it take to run a Student’s t-test or Chi-Squared test between two variables in SPSS? Five? Six?

OK, multiply that number by the number of hypothesis tests we need to run. That’s around 25,000 mouse clicks.
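
If you want to sanity-check those numbers, here’s a quick back-of-the-envelope calculation in Python (the dataset size comes from the example above; the five-clicks-per-test figure is just a rough assumption):

    # Back-of-the-envelope arithmetic for the example above
    n_variables = 100        # variables in the hypothetical patient dataset
    clicks_per_test = 5      # rough assumption: mouse clicks per test in a GUI package

    n_pairwise_tests = n_variables * (n_variables - 1) // 2   # 99 + 98 + ... + 1
    total_clicks = n_pairwise_tests * clicks_per_test

    print(n_pairwise_tests)  # 4950 hypothesis tests
    print(total_clicks)      # 24750 mouse clicks - call it 25,000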

I don’t know about you but I’m getting carpal tunnel syndrome just thinking about it.

OK, run all those tests and take the results to your boss (3 months later), and he says ‘hmmm, very interesting. How does that break down by gender?’.

So now you have to go away and run every test again for each of the ‘male’ and ‘female’ subsets. That’s another 50,000+ mouse clicks and another 6 months.

‘But what about the populations of smokers and non-smokers?’ he asks. Another 50,000+ clicks and you’ve just lost another 6 months of your life.

‘And what about the subsets of males that smoke, males that don’t smoke, females that smoke and females that don’t smoke?’. Another 100,000+ mouse clicks, and you’ve now gone gray and have suitcases under your eyes.
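
(For the record, the arithmetic behind those click counts, using the same rough assumptions as before – roughly 4,950 pairwise tests and 5 clicks per test: each two-way split means re-running every test in two subsets, so 2 × 4,950 = 9,900 tests, or about 49,500 clicks, and the four gender-by-smoking subsets give 4 × 4,950 = 19,800 tests, or about 99,000 clicks.)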

Lost the will to live yet? No, don’t give up now – we’ve only just begun. There are potentially hundreds more subsets and sub-subsets that can be analyzed this way to get valuable information. If you’ll only sacrifice yourself for the cause we can really make a difference here…

Oh yes, and I forgot. When you’ve done all of that several years later, your boss tells you that the dataset is now out of date. Several thousand more patients have been included in the shiny new patient database and you’ve got to start again!

The Real Problem…

Of course, no self-respecting researcher, analyst or statistician is going to do this. So what do they do instead?

They ask what the primary hypothesis is – which pair of variables do we suspect might be related? Then they ask which handful of other variables might have some influence on that most important pair, and which 3 or 4 data subsets would be most worthwhile checking for important relationships.

In other words, they cherry-pick which analyses they want to run. After all, life is just too damn short…

But is this really the big picture? Does this method help you find all the important relationships in the dataset?

Of course not!

Analysts have been grappling with just this type of problem for a few years now and realize that programs like SPSS and Minitab just won’t cut it – they may be useful for doing a few stats here and there, but they won’t really help when you’re faced with several months’ or years’ worth of analysis. They are generic and manual.

They are fossils!

SAS and R similarly have drawbacks. Although you can create programs to automate your analyses, they just run too slowly.

The Solution

So what is the solution?

Well, you need a seriously fast computing language that is amenable to statistical analysis. Fortunately, many of the quicker languages – C++, Java, Python, Julia and many others – now have plug-in stats packages that are free to use.
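
To give a flavour of what that looks like, here’s a minimal sketch in Python using two of those free packages, pandas and SciPy. It runs every pairwise correlation in a table automatically; the data and column names are made up purely for illustration, and a real analysis would also need multiple-testing corrections and proper handling of categorical variables:

    # A minimal sketch: automate every pairwise correlation in a table
    # using free, open-source packages. The data here are purely illustrative.
    from itertools import combinations

    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(2000, 100)),
                      columns=[f"var_{i}" for i in range(100)])  # stand-in for 100 patient variables

    results = []
    for a, b in combinations(df.columns, 2):      # all 4,950 pairs of variables
        r, p = pearsonr(df[a], df[b])
        results.append({"var_1": a, "var_2": b, "r": r, "p": p})

    results = pd.DataFrame(results).sort_values("p")
    # In real use you would correct these p-values for multiple testing
    # (e.g. with statsmodels' multipletests) and switch to chi-squared or
    # t-tests where the variables are categorical rather than continuous.
    print(results.head())

The point isn’t the particular test – it’s that once the loop is written, your boss’s ‘how does that break down by gender?’ becomes one extra line of code rather than another 50,000 mouse clicks.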

Wow, that’s great! But what if you can’t program in any of those languages? Ah well, I guess you’re stuck then…

And here is where we’re getting to the crux of the problem. There are very few dedicated, automated packages that will run your stats and analyses with a minimum of fuss and just let you get on with your research. What researchers and analysts really need is a program that gives them the story of their data in real time, not in several months.

With the advent of Big Data, innovative ways of handling, storing and visualising data are being found, but stats seems to be falling behind a little.

Why? Because they’re hard!

Don’t get me wrong, there is a revolution happening in statistical computation, you just don’t hear much about it because stats isn’t sexy and it’s happening slower than in other areas of Big Data handling.

And anyway, that’s not really the problem. How many companies have got Big Data? Not many.

How many people need to handle and analyze Big Data? Same answer.

Big Data is just the tip of the iceberg. There is far more Small Data in the world that needs analyzing and there’s almost nobody creating innovative automated stats packages for the PhD researchers, medical interns and research nurses. These are the people with the biggest problem. They know their data is valuable and lives can be saved if only they can get real insights from the data – but there just isn’t enough time to do all the analyses with fossilized generic manual programs!

Big Data, Meet Small Data

There are a few small, innovative companies, though, that are starting to bring Big Data solutions to the growing problem of Small Data, and Chi-Squared Innovations is one of them.

Programs like CorrelViz give you an intuitive, interactive 3D visualization of the story of your data in minutes, not months, and with just one click. They show you all the valuable correlations and associations in your data and help you discover the answers to your ‘what if…’ questions in real time.

We know it’s what researchers need because they’ve told us so, but it’s interesting to hear the dissent from some other statisticians about what we’re doing.

Here’s a typical conversation that I’ve had many times:

Me: We’re automating correlations and associations.

Statistician: You can’t do that.

Me: Why not?

Statistician: Because of A...

Me: We’ve solved that problem (give full explanation).

Statistician: Well, you still can’t do it.

Me: Why not?

Statistician: Because of B...

Me: We’ve solved that too (explain).

Continue through C, D, E and F.

Statistician: Well, I still don’t think you can do it.

Me: Why not?

Statistician: It’s arrogant to think that you can automate stats, and anyway, just because you CAN, it doesn’t mean you SHOULD...

There are lots of statisticians that get what we’re trying to do, and understand the automated stats revolution in general, but there are still lots that would rather put barriers up than see the opportunities.

We’re in the auto-stat space and have been building automated stats products for over a decade, so we know it can be done (with great care, obviously...).

Are You Going to the Party?

I had an interesting conversation on social media recently with Mark Montgomery, Founder & CEO of Kyield, who said that:

'We need far more automation now to survive as a species'.

This is a really good point.

Have a look at the World Population Graph below.

(An alternative graph can be found at http://www.worldometers.info/world-population/, about a fifth of the way down the page.)

The growth of the world population over the last 200 years has been staggering, and it began at the start of the industrial revolution – the dawn of automation.

The world has been changed forever – and to the betterment of mankind – by automation and it’s now the time of data, analytics, and statistics.

No doubt there were people at the beginning of the industrial revolution that were fiercely opposed to automating industry.

They were wrong.

The data-doubters are also wrong. I’ve given up trying to convince them – they’ll arrive at the party late, kicking and screaming no doubt, but they will be there because they will have no choice.

The world of analytics is changing and we’ll try and do our little bit to help it on its way...

About the Author

Lee Baker is an award-winning software creator with a passion for turning data into a story.

A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!

Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress - but 100 times the fun!
