The Importance Of Accuracy In Open Data

There is a balance to find, but where is it?


Open data has become the cornerstone for many companies, governments, and NGOs over the last decade. The spread in its use has seen huge leaps forward in the way that our society operates, with apps like Citymapper essentially becoming a UX skin on top of open data from public transport. This has allowed the company to plan trips for residents of cities with impressive accuracy, reporting on real-time departures and delays and giving multiple options for a single trip.

Cities like Chicago have also fully embraced this new influx of open data, even bringing together groups of developers to help create new programs and analysis to help improve the city further. For instance, Tom Schenk told Socrata that ‘We have the Flu Shot Tracker. We made that data available and Tom Campari, a programmer, created an application and a map around it. It has been replicated in other cities.’

These have become huge successes. Chicago is now seen as a world leader in open data, with many of its initiatives having spread to other cities, and Citymapper is now worth in excess of £250 million ($312 million), operating in 38 cities with more on the way. They are prime examples of how open data can have a profound positive impact, but there are some conditions that need to be met for this to be the case. None is more important than the accuracy and authenticity of the data being made available.

Without accurate data, any analysis is essentially useless, because it rests on assumptions that simply aren’t true. For instance, if a city wanted to analyze the cleanliness of its streets, it would need to make sure that reporting was consistent throughout. It may be that every small amount of rubbish is reported in one area, whilst another area has a considerable amount that simply isn’t reported. If this data is then fed into a system, it would naturally conclude that the cleaner area is dirty and the dirty area is clean.

There are also issues of format, especially when it comes to historical data. Over the last decade, data has largely been recorded in Excel and other comparatively standardized formats (lists, numbers, letters and so on), but this certainly wasn’t the case only 20 years ago. It has only been since the turn of the millennium that the bulk of municipal data has been collected digitally, meaning that almost everything before that was written by hand or printed on paper. It is theoretically simple to get this data into a usable format using optical character recognition software, but the issue, especially with handwriting, is that 1’s could be read as L’s or 5’s as S’s, creating considerable confusion if the result is then fed into a system.
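One way to picture the problem is a check that runs on digitized records before they are loaded anywhere. The Python sketch below is only an illustration under stated assumptions: the look-alike table and the function name `check_numeric_field` are hypothetical, and a real project would tune the table to its own scans. It flags a value for human review rather than silently correcting it.

```python
# Hypothetical table of letters that OCR may confuse with digits.
# The pairs are illustrative assumptions, not an exhaustive list.
LOOKALIKES = {"l": "1", "L": "1", "I": "1", "O": "0", "o": "0", "S": "5"}


def check_numeric_field(raw: str):
    """Classify an OCR'd value that is expected to contain only digits.

    Returns a (status, value) pair where status is one of:
    "ok"            the text is already all digits
    "needs_review"  swapping look-alike letters yields all digits
    "rejected"      the text cannot be read as a number at all
    """
    text = raw.strip()
    if text.isascii() and text.isdigit():
        return ("ok", text)
    # Replace each look-alike letter with the digit it resembles.
    candidate = "".join(LOOKALIKES.get(ch, ch) for ch in text)
    if candidate.isascii() and candidate.isdigit():
        return ("needs_review", candidate)
    return ("rejected", text)
```

Returning the suggested reading alongside a "needs_review" status, instead of overwriting the original, keeps a person in the loop for exactly the ambiguous cases.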

Format problems also cause confusion with uncommon names. Joshua Tauberer, in his book ‘Open Government Data: The Book’, gives the example of Congresswoman Debbie Wasserman Schultz: ‘Congresswoman Debbie Wasserman Schultz is ‘Rep. Wasserman Schultz’ not ‘Rep. Schultz’ as you might think. Her last name defies convention by having a space separate the two parts rather than a hyphen. And so in a spreadsheet of names of Members of Congress, a single name field containing ‘Debbie Wasserman Schultz’ would almost certainly lead to an embarrassing outcome of causing someone who does not know the idiosyncrasy to refer to the congresswoman by the wrong name.’
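The single-field problem can be shown in a few lines of Python. This is a minimal sketch, not a recommended schema: the class `MemberName` and the helper `naive_formal` are hypothetical names introduced for illustration. It contrasts guessing the family name from one combined field with storing the name parts separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberName:
    """A name stored as separate parts (hypothetical structure)."""
    given: str
    family: str  # may itself contain a space

    def formal(self, title: str = "Rep.") -> str:
        # The family name is stored explicitly, so nothing is guessed.
        return f"{title} {self.family}"


def naive_formal(full_name: str, title: str = "Rep.") -> str:
    """Guess the family name as the last word of a single name field."""
    return f"{title} {full_name.split()[-1]}"


member = MemberName(given="Debbie", family="Wasserman Schultz")
print(member.formal())                           # structured record
print(naive_formal("Debbie Wasserman Schultz"))  # single-field guess
```

The structured record yields the form Tauberer gives as correct, while the last-word guess drops the first half of the surname.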

Inaccuracies can cause real issues too, with Joshua Chambers describing one particular incident on FutureGov: ‘I was struck by this when we reported on an app in Australia that was issuing alerts for forest fires that didn’t exist. The data was coming from public emergency calls, but wasn’t verified before being displayed. This meant that app users would be alerted of all possible fires, but also could be caused unnecessarily panic. The government takes the view that more alerts are better than slower verified ones, but there is the potential for people to become less likely to trust all alerts on the app.’ Essentially, this unverified and inaccurate data was on the verge of causing a ‘boy who cried wolf’ situation, which could easily have led to the loss of life.

In an interview, Paul Maltby, former Director of Open Data and Government Innovation for the Cabinet Office in the UK, claims that a lack of complete accuracy is used as an excuse for departments to not release performance data that can then be scrutinized, ‘There’s sometimes a reluctance to get data out from the civil service; and whilst we see many examples of people understanding the reasons why data has been put to use, I’d say the general default is still not pro-release.’

All of this shows that open data requires a delicate balancing act between accuracy, which makes the data easier to use later on, and flexibility, which makes it as simple as possible for companies and governments to release it. Those who have utilized open data successfully have found this balance. The big question is how those who are less mature in their approaches reach that point.
