Want to Cure Cancer?

Think Small Data, not Big Data


It’s an old tech tale - boy meets boy, creates start-up, fails, creates 2nd start-up, succeeds, sells to Google for $80 million, creates 3rd start-up, raises $138 million in venture capital and cures cancer.

I’m sure we’ve all heard this before… Oh, hold on a minute. Cures cancer? Two techies? I think I must have missed that memo.

Well, actually it hasn’t happened yet, but that’s the plan of two of New York’s finest young tech entrepreneurs, Nat Turner and Zach Weinberg. The only thing missing in their plan is the final step, the ‘cure cancer’ thingy.

The Big Cancer Data Plan

The company they founded, Flatiron, is looking to harness the power of Big Data to try and find a cure for cancer. Their plan is to use their considerable tech knowledge to gather together all the world’s cancer data into one über dataset.

No, really, stop laughing - they’re serious. And so are Google, who have invested more than $100 million in them to do it.

It doesn’t seem to matter that neither of them has studied biology - they studied economics and entrepreneurship - or that previous efforts to build, organize and analyze such huge cancer databases have so far failed spectacularly, such as caBIG, which sucked up a reported $500 million in public funding before being quietly closed down in 2012.

In fairness to Nat and Zach, they have realized that the problem of building such large cancer databases cannot be solved by technology alone, so they’ve created a system that works with physicians to organise the clinical information into neat categories, using artificial intelligence to learn from the physicians as it goes.

Well, I hope that I’m wrong, but I think that they are doomed to failure, just like all the others that have tried to tackle this mammoth task before them.

Problems of Big Cancer Data

Whilst the concept of collecting world cancer data together to form one giant database is laudable, there are huge regulatory issues that must be dealt with.

In many parts of the world there are laws that prevent the sharing of personal health data that could be used to uniquely identify the patient.

OK, no problem, we just strip out their name and add in a unique ID number.

Only that doesn’t work.

Let’s take the example of a patient with a date of birth of 25/12/1960. In a country of, say, 100 million people you’d be hard-pressed to identify who that data belongs to. Now add in their postal code. The number of people to whom these data belong now drops to a couple, or just a few at most. If the patient has been diagnosed as having breast cancer, then you can pretty much uniquely identify them without too much difficulty.

For the clinicians who treat these patients, there are markers of disease that can make a patient equally identifiable. Triple negative breast cancer occurs in fewer than 20% of cases, so putting this information together with other such markers increases the chance of uniquely identifying the patient. Clinicians can often pin-point exactly which patient they are dealing with just by looking at the tumour profile.
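To see how quickly these quasi-identifiers erode anonymity, here is a minimal sketch of the k-anonymity idea, using invented records: k is the size of the smallest group of records sharing the same combination of fields, and k = 1 means that at least one patient is uniquely identifiable.

```python
from collections import Counter

# Invented, synthetic records: even with names stripped and an opaque
# ID added, the remaining fields can act as a fingerprint.
records = [
    {"id": 1, "dob": "1960-12-25", "postcode": "EH1 2AB", "diagnosis": "breast cancer"},
    {"id": 2, "dob": "1960-12-25", "postcode": "G12 8QQ", "diagnosis": "lung cancer"},
    {"id": 3, "dob": "1960-12-25", "postcode": "EH1 2AB", "diagnosis": "melanoma"},
    {"id": 4, "dob": "1971-03-04", "postcode": "EH1 2AB", "diagnosis": "breast cancer"},
    {"id": 5, "dob": "1971-03-04", "postcode": "G12 8QQ", "diagnosis": "breast cancer"},
    {"id": 6, "dob": "1971-03-04", "postcode": "G12 8QQ", "diagnosis": "lung cancer"},
]

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same values
    for the given quasi-identifiers (k = 1 means someone is unique)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Date of birth alone: every date is shared by three people (k = 3),
# but adding the postal code already makes someone unique (k = 1).
print(k_anonymity(records, ["dob"]))
print(k_anonymity(records, ["dob", "postcode"]))
```

Each extra field can only shrink the groups, never grow them - which is why ‘strip the name, add an ID’ is not enough.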

As well as issues about anonymizing data, there are laws regarding the transfer of personal data across national boundaries. These laws are different according to the countries concerned and can create a lack of will-to-live in all those that have to deal with the necessary paperwork and regulatory requirements.

Although these can be difficult problems to deal with, a further - often intractable - problem is that of the clinician or health board that declares that ‘this is my data, it cost me a lot of money to collect it and I don’t see any reason why I should give it to you for free - what’s in it for me?’.

Ultimately, if a world cancer database project is to succeed it must have wide support and give as much as it receives. Given the past record of such attempts, scepticism is likely to be high and that may just be the biggest barrier to overcome.

However, there is an alternative.

Big Data versus Small Data

Around the time that caBIG was starting, I was beginning a new job as a statistical consultant in a major teaching hospital. On a day-to-day basis I was analyzing different datasets that had a lot in common and often contained data from some of the same patients. The data were collected by independent researchers, clinicians, biologists, nurses and pathologists who weren’t aware that the same data were being collected by other people and analyzed for similar, often overlapping, purposes.

I started to think about setting up a system to collect all this data into one large database. Ah, but we have collaborators in other institutions. OK, we’ll collect their data too. Pretty soon I was thinking in the same terms - and size - as caBIG. I very quickly realized that the task was not only too big, but too difficult, given the legal, regulatory and compliance issues I mentioned earlier.

So how else could you accomplish your goals of finding a cure for cancer without creating a giant database?

Well, why think ‘Big Data’ when you’ve already got lots of ‘Small Data’?

Problems of Small Cancer Data

Obviously Small Data is easy to work with, but has the disadvantage of being distributed and difficult to access by anyone other than the ‘gatekeeper’ (researcher, data manager, etc.) - hence the drive to gather it together into Big Data.

But if we can’t get past the Big Data problems mentioned above then how can we make Small Cancer Data give us the answers that Big Cancer Data can’t?

Well, although one might conclude that most cancer datasets are too small to bother with, together they hold the vast majority of all cancer data.

In my experience, typically around 90% of the data in small cancer datasets is not analyzed. There are often a number of reasons for this, including:

- Only a small fraction of the dataset is concerned with the researcher’s primary hypothesis

- Statistical analysis is difficult - particularly for scientists with little-to-no stats training

- Comprehensive analysis is extremely time-consuming, potentially taking months or years to complete manually

As a result, only the data that ‘fit’ with the researcher’s tightly focused hypothesis are cherry-picked for statistical analysis.

Once they’ve done that, they write their paper, thesis or report and - due to publication bias pressures - include only a small selection of their results. Once this pruning exercise is complete, barely 1% of all possible results from the dataset makes its way into the big wide world for others to learn from and build upon.

There is a vast amount of untapped information contained in the small datasets around the world. Just imagine what crucial discoveries we could make from the worldwide data that currently exists. Imagine if we had the analytical tools to automate the analyses. Perhaps the discoveries that could be made from this untapped 99% of the data could take us several leaps forward in the search for a cure for cancer.

Certainly that’s one thing that we can learn from the Big Data movement - Big Data is too large to analyse manually, so we have to find innovative ways of automating the data analyses.

So why not use the kind of automated data analysis tools that are commonly found in Big Data on Small Data? Then we can analyse all the data in a very short period of time.

OK, so then what?

We have our Small Cancer Data and our Small Cancer Data stats results. So what do we do with them?

Aggregate The Statistical Results

If aggregating all the Small Cancer Data is too difficult a task, then maybe supplying the tools to automate the analyses and then aggregating the results of statistical analyses might just be the answer.

After all - unlike with data - there are no regulatory issues with stats results. They don’t need to be anonymised (they already are) and there are no problems with p-values and odds ratios crossing geographical boundaries. Data gate-keepers also have no reason to keep the results secret - they are often contractually obliged to publish results of their research.

There are other advantages too. The same set of analyses will be repeated many times by independent researchers across the globe. When these analyses yield similar results time and again, researchers can gain confidence that the result is unlikely to have occurred by chance and may well be a real-world phenomenon. Differing results give low statistical confidence and may suggest that further research is needed in this area.

These are results that are less likely to be gleaned from working with a single Big Cancer Dataset. After all, why would you split your wonderful Big Dataset into smaller chunks and lose statistical power on your results? Doesn’t that defeat the purpose of having a Big Cancer Dataset in the first place?

Cancer Stats Bank

No, what is needed is some sort of Cancer Stats Bank, where all the results of worldwide cancer investigations can be pulled together and aggregated.

So how would a Cancer Stats Bank system work?

Well, research around the world would continue the same way that it always has - researchers and clinicians continue their lab and clinical research, collecting data as they go. The one difference would be that once they have completed their data collection and are ready to analyze it, they would have access to tools that can automate the analysis in minutes rather than months.

These tools would have the facility to allow the submission of the completed results to the Cancer Stats Bank where a hybrid physician-machine-learning system would standardise and curate the results in much the same way as data is currently being curated by Turner and Weinberg’s system.

Once the results have been validated they can be added to the pool of results in the Cancer Stats Bank.

If similar analyses have been performed before, such as smoking versus lung cancer (i.e. is there a relationship between them?), there will already be a pool of p-values and odds ratios (results of association tests). More than that, there will be a distribution of p-values and odds ratios, so standard meta-analysis techniques can be applied to these pooled results to yield combined p-values and odds ratios and - more crucially - determine the level of confidence in these results.
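As a sketch of the kind of aggregation the Stats Bank might perform, here is Fisher’s method - a standard meta-analysis technique for combining independent p-values for the same hypothesis - implemented with nothing but the Python standard library. The study p-values below are invented for illustration.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, -2 * sum(ln p) follows a chi-squared
    distribution with 2k degrees of freedom (k = number of studies).
    With an even number of degrees of freedom the chi-squared survival
    function has an exact closed-form series, used below."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

# Invented p-values from five independent studies of the same association:
# none is individually convincing, but together they are.
pooled = fisher_combined_p([0.08, 0.12, 0.05, 0.20, 0.03])
print(pooled)
```

Note the point this makes for the Stats Bank: five unremarkable individual results combine into a pooled p-value well below 0.01 - evidence that no single Small Data study could provide on its own.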

Node-Link Visualizations

What would we do with these results?

Obviously a list of millions of p-values and odds ratios is pretty much useless, but if we plot the independent relationships between pairs of variables to form a large node-link map, we could have an interactive visualization that shows all the significant associations and correlations found in cancer research across the globe.

Sure, the visualization would be a very large one, but you could zoom in on the particular part of it that concerns your research and visualize, say, all relationships within 4 degrees of separation from your chosen biomarker.

Clicking on a node would give you information about the chosen variable, and selecting a link between a pair of nodes would allow you to investigate the relationship between them.
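A minimal sketch of the ‘degrees of separation’ query, assuming the map is stored as a simple adjacency structure (the variables and links below are invented): a breadth-first search collects every variable within a chosen number of links of the starting node.

```python
from collections import deque

# Invented association map: each link is a significant relationship
# reported somewhere in the aggregated results.
links = {
    "lung cancer": ["smoking", "age"],
    "smoking": ["lung cancer", "BMI"],
    "age": ["lung cancer", "BMI"],
    "BMI": ["smoking", "age", "exercise"],
    "exercise": ["BMI"],
}

def neighbourhood(graph, start, max_depth=4):
    """Breadth-first search: every variable within max_depth links
    (degrees of separation) of the chosen start node."""
    seen = {start: 0}           # node -> degrees of separation from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue            # don't expand beyond the chosen radius
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen

# Centre the view on lung cancer, two degrees of separation out:
print(neighbourhood(links, "lung cancer", max_depth=2))
```

The real visualization would attach the aggregated p-values and odds ratios to each link, but the zooming operation itself is just this kind of radius-limited graph traversal.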

A system such as this could be used to generate new hypotheses for scientists and allow them to ask more ‘what if…’ questions in real time:

Let’s centre the visualization on lung cancer and view related variables up to 4 degrees of separation

- What if we want the same visualisation for men only?

- What about men under 50 years old who don’t smoke and have a low BMI?

We would be able to investigate the differences between each of these visualisations, see what research has already been conducted worldwide, what the aggregated results are, what gaps exist in the research, and determine the level of confidence in these results.
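The subgroup questions above amount to filtering the pool of results by their tags before aggregating and redrawing the map. A minimal sketch, with invented results and subgroup tags:

```python
# Invented aggregated results, each tagged with the subgroup it describes.
results = [
    {"pair": ("smoking", "lung cancer"), "p": 0.01, "subgroup": {"sex": "male"}},
    {"pair": ("smoking", "lung cancer"), "p": 0.04, "subgroup": {"sex": "female"}},
    {"pair": ("BMI", "lung cancer"), "p": 0.20, "subgroup": {"sex": "male", "age": "<50"}},
]

def matching(results, **criteria):
    """Keep only results whose subgroup tags satisfy every criterion."""
    return [r for r in results
            if all(r["subgroup"].get(k) == v for k, v in criteria.items())]

# 'What if we want men only?' - two results survive the filter.
for r in matching(results, sex="male"):
    print(r["pair"], r["p"])
```

Each ‘what if…’ question just adds another criterion; the gaps in the research show up as pairs of variables with no surviving results at all.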

Clearly such a resource would be highly valuable in the search for a cure for cancer.

So Who Is Going To Do It?

We are.

We’re at a very early stage in this venture, but we’re creating a dynamic portfolio of the automated statistical analysis tools that would be used to analyse the current and future Cancer Data from around the world - both Big and Small.

Very soon the first of these tools - CorrelViz - will be launched.

As the name suggests, CorrelViz allows you to visualize all the correlations (and associations) found within your data and is completely automated, taking you ‘From Data To Story’ in minutes rather than months or years.

In time, CorrelViz will allow researchers to submit their statistical results directly to the Cancer Stats Bank for aggregation with results of other similar analyses.

Obviously we’re going to need significant venture capital investment to build the Cancer Stats Bank and create the hybrid physician-machine-learning system needed to standardize and curate the incoming statistical results, but at the moment we’re just taking baby steps.

Apparently the journey of a thousand miles starts in similar fashion…

Is There Still A Place For Big Data?


Perhaps the biggest issue with Small Cancer Data is the lack of power in many studies. If you want to have a high degree of confidence in your results then you’re going to need a lot of patients to contribute to the study.

Add in the fact that there are an estimated 20,000-25,000 genes in the human genome, undergoing alternative splicing and post-translational modifications, and it’s clear that Big Cancer Data still has a very large role to play in the search for a cure for cancer.

We’re just not entirely convinced that a single über dataset is the way to go…

About the Author

Lee Baker is an award-winning software creator with a passion for turning data into a story.

A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!

Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress - but 100 times the fun!
