Google scientist warns of medical data pitfalls

Kathryn Rough outlines the importance of understanding the limitations of the vast quantities of medical data we have available today


"Electronic health record (EHR) data has its limitations and more data does not necessarily help," remarked Kathryn Rough, research scientist at Google, on the AI in Healthcare stage at DATAx San Francisco.

Rough's presentation covered the benefits as well as the pitfalls of medical data, and she emphasized the importance of understanding the weaknesses of the vast medical datasets available today.

"US healthcare data alone reached 150 exabytes in 2011. For reference, five exabytes of data would contain all the words ever spoken by everyone on earth," Rough said. "And we need to make this data work for patients."

"It's messy and complex – and it was not intended for research purposes – and as much potential as there is there, we have to be careful in how we use it."

Rough went on to discuss some of the pitfalls of medical data that everyone working in the healthcare industry should be aware of.

Firstly, Rough noted, data quality can limit the usefulness of a medical dataset. Causes include errors in data entry and important information being trapped in unstructured text and "not replicated in the structured information we like to work with". Rule-out diagnoses, upcoding and errors in data processing also need to be borne in mind.

"Think through and check the data quality for your important variable," she remarked. "I'd recommend finding a validation study check as well."

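As a rough illustration of what checking a variable against a validation study can involve, the Python sketch below compares a hypothetical EHR-coded diagnosis flag with a chart-review reference standard and computes sensitivity and positive predictive value. The function name, the data layout and the sample values are all assumptions made up for illustration; none of them come from Rough's talk.

```python
def validation_metrics(coded, reference):
    """Compare EHR-coded flags with a reference standard (e.g. chart review).

    Both arguments are equal-length sequences of booleans, one per patient.
    Returns (sensitivity, ppv); a value is None when its denominator is zero.
    """
    if len(coded) != len(reference):
        raise ValueError("coded and reference must have the same length")

    # Count agreement and disagreement between the two sources.
    tp = sum(1 for c, r in zip(coded, reference) if c and r)
    fp = sum(1 for c, r in zip(coded, reference) if c and not r)
    fn = sum(1 for c, r in zip(coded, reference) if not c and r)

    sensitivity = tp / (tp + fn) if (tp + fn) else None
    ppv = tp / (tp + fp) if (tp + fp) else None
    return sensitivity, ppv


# Hypothetical sample: invented values for eight patients.
coded_flags = [True, True, True, False, False, True, False, False]
chart_review = [True, True, False, True, False, True, False, False]

sensitivity, ppv = validation_metrics(coded_flags, chart_review)
print(f"sensitivity={sensitivity:.2f}, ppv={ppv:.2f}")
```

In a sketch like this, a low positive predictive value would suggest that the code often appears for patients who do not actually have the condition, which is one way rule-out diagnoses or upcoding might show up in the data.
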
Other factors affecting the validity of medical data include patient loss to follow-up, confounding and overemphasis on statistical significance. Finally, she stressed the importance of reporting.

"This is not so much a pitfall as an important consideration," commented Rough. "At the bare minimum, when we are analyzing medical data we need to really explain what's been done, so that the reader can believe the analyses that have taken place. It's crucial to transparently report, thoughtfully address limitations and not exaggerate findings."
