Influenza often seems harmless. Many people even go into work with it, spreading their germs around like they’re throwing confetti into a ceiling fan. The flu kills, though, especially those in vulnerable groups like the elderly and the very young. A 2004 study by the Centers for Disease Control and Prevention (CDC) found that more than 200,000 patients are hospitalized with seasonal flu each year, and thousands die from conditions associated with the virus.
Vital to preventing the spread of flu is what is known as ‘disease forecasting’, which uses data analytics to try to predict the spread of disease. The use of analytics for this purpose is already one of the best-known use cases of big data. It is also often held up as one of the best examples of big data’s limitations. The CDC saw great success during the Ebola epidemic with a system called BioMosaic, which provided near real-time visibility of the global air transportation network. It also enabled them to pinpoint at-risk populations and build a mosaic map of the diaspora population - both on the move from affected areas, and statically in terms of the US resident population. On the other hand, Google Flu Trends (GFT), which analyzed internet search terms to try to predict flu outbreaks, was discontinued despite some early success, after a team at Northeastern University and Harvard University discovered that its prediction system had overestimated the number of influenza cases in the US for 100 of 108 weeks.
Computational epidemiologists at Boston Children’s Hospital seem to have found a better way to approach flu detection. The team is led by John Brownstein, PhD, Boston Children's chief innovation officer and co-founder of the disease tracking site HealthMap, and Mauricio Santillana, PhD, of Boston Children's Computational Health Informatics Program and the Harvard John A. Paulson School of Engineering and Applied Sciences. Their project uses ensemble modeling, an analytics approach that pulls together disparate sources of information to predict emerging trends - in this case, flu. They use four major sources of data: search data from Google, messages on Twitter, near real-time EHR data from athenahealth, and crowd-sourced surveillance data from Flu Near You, a HealthMap site. Researchers then apply machine learning techniques to synthesize the results.
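The basic idea behind ensemble modeling can be illustrated with a toy sketch: weight each noisy signal by its historical agreement with a reference series, then blend the latest readings into a single estimate. Everything below - the numbers, the weighting scheme, and the function names - is a hypothetical simplification; the actual system trains far more sophisticated machine-learning models against historical CDC data.

```python
# Toy ensemble "nowcast": combine several noisy weekly signals into one
# estimate. All data here is synthetic; the real system draws on Google
# searches, Twitter, athenahealth EHRs, and Flu Near You reports.

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ensemble_weights(sources, reference):
    """Weight each source by its historical agreement with the reference."""
    corrs = [max(pearson(s, reference), 0.0) for s in sources]
    total = sum(corrs)
    return [c / total for c in corrs]

def ensemble_estimate(latest_values, weights):
    """Blend the newest reading from each source into one estimate."""
    return sum(w * v for w, v in zip(weights, latest_values))

# Synthetic history: CDC flu activity plus four noisy proxy signals.
cdc    = [1.0, 1.4, 2.1, 3.0, 2.6, 1.8]
search = [1.1, 1.5, 2.3, 3.2, 2.5, 1.7]   # stand-in for search volume
tweets = [0.9, 1.6, 2.0, 2.8, 2.9, 1.9]   # stand-in for Twitter mentions
ehr    = [1.0, 1.3, 2.2, 3.1, 2.6, 1.8]   # stand-in for EHR data
crowd  = [1.2, 1.2, 1.9, 2.7, 2.4, 2.0]   # stand-in for crowd reports

weights = ensemble_weights([search, tweets, ehr, crowd], cdc)
estimate = ensemble_estimate([s[-1] for s in (search, tweets, ehr, crowd)], weights)
```

The payoff of this approach is that idiosyncratic errors in any one source (a spike in searches driven by news coverage, say) are dampened by the other sources, which is why the whole can be more accurate than any individual signal.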
Brownstein explained: ‘We've focused for many years on using individual data sources for tracking a range of diseases. This represents the next logical step: combining data in a new way where the whole is more valuable than the sum of its parts.’
Social media data in particular has proved an important ingredient in the project’s success. Twitter has already been used in other analytics projects to predict events such as HIV outbreaks, adverse drug events, and spikes in emergency department visits for asthma. Brownstein noted: ‘One individual on social media talking about their illness is not going to be that useful. But in aggregate, that information can tell us really useful things about epidemics. It can even tell us about new things, like the Enterovirus epidemic that we recently experienced. So we are developing systems that are much more crowdsourcing in nature. We are trying to better engage the public, to put the “public” back in public health. That provides us some really exciting opportunities to understand what’s happening on the ground level.’
The CDC has also been aggressively gathering flu data for some time, though theirs is roughly one to two weeks out of date - by which time a hospital could already have been overwhelmed and its limited resources exhausted. The combined model from Boston Children’s Hospital works in near real time, was found to correlate almost perfectly with the CDC’s reports of annual flu activity, and reached a 90 percent correlation for a two-week prediction horizon. At present, these capabilities only work at the national level, but the team hopes to refine its methods to provide flu predictions at the local level, and to develop a public-facing tool that affords widespread access.
According to the McKinsey Global Institute, savings of between $300 billion and $450 billion could be made by using data to better predict the healthcare needs of the U.S. population. In a nation where per capita spending on health is the highest in the world, such savings could prove game-changing.