Human Genome, Bioinformatics: Is Big Data Pushing New Frontiers?

Analytics might indeed provide some answers to puzzles hidden in our DNA


Genome: an organism’s complete set of genetic instructions. Each genome contains all of the information needed to build that organism and allow it to grow and develop.

If printed out, the 3.2 billion letters in your genome would:

Fill a stack of books 200 feet (61 m) high

Take a century to recite, if we recited one letter per second 24 hours a day

Extend 3,000 km (1,864 miles), about the distance from London to the Canary Islands
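The recital and distance figures above are easy to sanity-check. The following is a rough sketch only: the function names are illustrative, and the year length of 365.25 days is an assumption, not something stated in the list.

```python
# Rough check of the figures quoted in the list above.
GENOME_LETTERS = 3_200_000_000           # 3.2 billion letters, as quoted
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length
KM_PER_MILE = 1.609344


def years_to_recite(letters, letters_per_second=1.0):
    """Years needed to read the letters aloud without ever stopping."""
    return letters / letters_per_second / SECONDS_PER_YEAR


def km_to_miles(km):
    """Convert kilometres to statute miles."""
    return km / KM_PER_MILE


def mm_per_letter(total_km, letters):
    """Width each printed letter would need to span the given distance."""
    return total_km * 1_000_000 / letters  # 1 km = 1,000,000 mm


print(round(years_to_recite(GENOME_LETTERS), 1))      # about a century
print(round(km_to_miles(3000)))                       # miles in 3,000 km
print(round(mm_per_letter(3000, GENOME_LETTERS), 2))  # implied letter width
```

At one letter per second the recital comes out at just over 101 years, which matches "a century", and the 3,000 km figure implies letters a little under a millimetre wide.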


Making sense of something this complex is the work of bioinformatics, an interdisciplinary field that develops methods and software tools for understanding biological data. It combines the insights and scientific approach of data science, statistics, mathematics and engineering to analyze and interpret that data.

One of its key research areas is genetics and genomics, in particular the search for patterns and anomalies in the data. The ultimate objective is to understand the genetic basis of a disease, of unique adaptations, or of differences between populations.

This is making a positive impact on modern medical research, drug discovery and, through them, society. Pharma companies like Novartis or Astellas (i-Genes: What the DNA and Data revolutions mean for our health) are driving some exciting initiatives in this field. For patients, this holds the promise of treatment and personalized medicine based on an individual’s unique genetic profile.

Bioinformatics is a prime example of the importance of ‘getting your data right’ if you are aiming at serious outcomes. Data analysis that can reveal valuable insights into oncology, autism, or rheumatology needs to be based on a thorough, structured data preparation process.

Only once your data sets are cleansed, enriched and integrated can you proceed with the data analysis itself, based on machine learning, specific algorithms and other data science paradigms.
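To make the order of operations concrete, here is a minimal sketch of a cleanse-then-integrate step in plain Python. Everything in it is hypothetical: the record layout, the field names, the sample values and the helper functions are invented for illustration and do not come from any real pipeline or tool.

```python
# Hypothetical variant records; field names and values are illustrative only.
raw_variants = [
    {"sample": "S1", "gene": " brca1 ", "allele": "A"},
    {"sample": "S1", "gene": " brca1 ", "allele": "A"},  # exact duplicate
    {"sample": "S2", "gene": "TP53", "allele": None},     # missing value
    {"sample": "S3", "gene": "tp53", "allele": "G"},
]

# Hypothetical second source: a study group label per sample.
clinical_groups = {"S1": "case", "S3": "control"}


def cleanse(records):
    """Drop incomplete rows, normalise gene names, remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if record["allele"] is None:
            continue  # incomplete row
        key = (record["sample"], record["gene"].strip().upper(), record["allele"])
        if key in seen:
            continue  # duplicate after normalisation
        seen.add(key)
        cleaned.append({"sample": key[0], "gene": key[1], "allele": key[2]})
    return cleaned


def integrate(records, groups):
    """Enrich each record with its group label; skip samples with no label."""
    return [
        dict(record, group=groups[record["sample"]])
        for record in records
        if record["sample"] in groups
    ]


prepared = integrate(cleanse(raw_variants), clinical_groups)
for row in prepared:
    print(row)
```

Only the `prepared` rows would be handed on to the analysis stage. Real genomic data sets are vastly larger and messier, so this shows the shape of the process rather than a workable tool.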

Joseph Szustakowski, who leads Translational Bioinformatics at Bristol Myers Squibb, another leading pharma company, came up with an interesting comparison to demonstrate the magnitude of Bioinformatics research: looking for a 1-meter needle in a haystack that stretched from Earth to the sun. Big data indeed: this is certainly not a matter of analyzing a few company Excel sheets.

For a data preparation effort as complex as the one suggested above, traditional data warehousing and ETL (extract, transform, load) technologies simply cannot keep up, both in terms of the comprehensive insights required and in terms of the high cost and lack of agility of these solutions.

Fortunately, there are solutions that can assist with data preparation, transformation and unification. They can match the depth of the Big Data, Hadoop and Spark technologies of 2015 while remaining agile and flexible, which ultimately improves TTR (time to results), perhaps a more apt measure in this case than the usual TCO (total cost of ownership).

Thoughts welcome, always.

Bruno Polach

