Limits to Big Data in healthcare analytics

March 16, 2014

Researchers suggest that the surge in analysis of large data sets may have limitations for understanding healthcare patterns

The man-in-the-street impression about Big Data and social-media data gathering is that new digital-data-based companies (Google, Facebook et al.) can scoop up massive amounts of trend data and report them, both quickly and precisely, while running rings around “traditional” (i.e., not new-media) polling and surveying sources. But a just-published paper in Science shoots some significant holes in this belief, and serves as something of a cautionary tale about the limits and risks of new-media, big-data analytics.

In the paper, “The Parable of Google Flu: Traps in Big Data Analysis” (Science, March 14, 2014, 343 (6176): 1203-1205), scientists from Northeastern University, Harvard and a couple of other research institutes review the track record of Google Flu Trends (GFT), which was started by Google in 2008 and is based in part on Google’s ability to track keyword searches believed to be associated with flu. Knowledge of the searcher’s IP address is believed to allow fairly precise tracking of regional variation. As has been reported in the past, Google Flu Trends greatly overestimated flu prevalence in 2013 (by over 50%) and, according to the authors, “Although not widely reported until 2013, the new GFT has been persistently overestimating flu prevalence for a much longer time.” There actually isn’t a good time series for GFT’s performance, since Google has been modifying the search terms and the underlying search algorithm along the way. Looking back at the record, the researchers found that a better predictor of flu prevalence could be obtained by merging the trend data from GFT with that of the Centers for Disease Control, which publishes an analysis based on doctor visits.
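To make the idea of merging the two sources concrete, here is a minimal Python sketch. It is not the model used in the paper, which is not described here. It assumes, purely for illustration, that the CDC figure arrives two weeks late, and it blends that lagged figure with the search-based estimate using a single weight fitted by least squares. All numbers and function names are invented.

```python
# Illustrative sketch only: the weekly values below are invented, not real
# CDC or Google Flu Trends figures, and the two-week lag is an assumption.

def blend_weight(actual, search_based, lagged):
    """Least-squares weight w for: estimate = w*search_based + (1-w)*lagged."""
    num = sum((a - l) * (s - l) for a, s, l in zip(actual, search_based, lagged))
    den = sum((s - l) ** 2 for s, l in zip(search_based, lagged))
    return num / den if den else 0.0

def blend(search_based, lagged, w):
    """Weighted combination of the two sources."""
    return [w * s + (1 - w) * l for s, l in zip(search_based, lagged)]

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical percentage of doctor visits for flu-like illness, by week.
cdc = [1.2, 1.5, 2.1, 3.0, 4.2, 3.8, 2.9, 2.0]
# Hypothetical search-based estimates that run consistently high.
gft_est = [1.6, 2.1, 3.0, 4.4, 6.3, 5.5, 4.1, 2.7]

LAG = 2                  # assumed reporting delay, in weeks
actual = cdc[LAG:]       # what we are trying to estimate
gft_now = gft_est[LAG:]  # search-based estimate available the same week
lagged = cdc[:-LAG]      # most recent CDC figure available that week

w = blend_weight(actual, gft_now, lagged)
combined = blend(gft_now, lagged, w)

print("search-based only:", mean_squared_error(actual, gft_now))
print("lagged CDC only:  ", mean_squared_error(actual, lagged))
print("combined:         ", mean_squared_error(actual, combined))
```

Because the weight is chosen to minimize squared error on these same weeks, the combined estimate cannot do worse than either source alone on this data. A real evaluation would fit the weight on past seasons and test it on later ones.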

The researchers propose that there are two basic problems with these flash analyses based on search or other online trend data. The first they call “big data hubris”: “Quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability and dependencies among data.” The second is “algorithm dynamics”: “the changes made by engineers to improve the commercial service and by consumers in using that service.” In effect, Google engineers were refining their search and analytics along the way, and as they did so, consumers were modifying their search habits based on what they were hearing from places like GFT itself.

The problem is not that Google’s approach or Big Data analytics are inherently flawed. It is the assumption that greatly expanding the volume of data being analyzed, without close review of how the data are generated, yields answers more quickly and accurately than traditional (“small data”) methods. “Instead of focusing on a ‘big data revolution,’ perhaps it is time we were focused on an ‘all data revolution,’ where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources,” the authors conclude.