COVID-19 and why public datasets are so important
In the wake of the global spread of coronavirus/COVID-19, it is more important than ever to have access to public data. While we are all trying to tackle the outbreak, data, and public data in particular, is one of the most important means of fighting back with technology and maintaining trust during these uncertain times. This post highlights why public data is important for transparency, and how open research data and the ability to collaborate can be crucial for discovering new insights to fight the virus.
Rasmus Hauch | Head of Product Management | March | 2020
Accessibility of public data for data science projects
Public data is meant to be accessible to everyone. However, in many cases, data that should be available is only published as summaries, or in formats that are difficult for a machine to interpret. Detailed public data is vital for understanding an outbreak in depth and uncovering the patterns behind it. That understanding leads to new machine learning models that can make valuable predictions about the future, or help save human lives by, for example, mining COVID-19 research articles.
Public data needs to be assembled by someone, and in many cases, the organizations that you think would be responsible for assembling this data are not really doing so.
You would think that the official organizations responsible for handling global outbreaks, such as virus epidemics, would have been the ones releasing detailed data on COVID-19. But from what we have seen lately, public datasets were not made available at the beginning of the outbreak, and those released since have not contained the necessary details.
For example, many countries release data on the number of people in quarantine; however, this is not reflected in any global public dataset so far.
Similarly, the type of diagnosis made for each patient, the types of testing each country is currently doing, and patients' initial symptoms are not available in the form of data.
The power of data
Data is one of our most important allies when it comes to fighting the outbreak of COVID-19.
Johns Hopkins University was the first to release a dataset that contained a global view of how the COVID-19 virus is spreading. The dashboard and dataset, released on GitHub, were made available on January 22.
Image: The interactive web-based dashboard (static snapshot shown above) hosted by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, to visualize and track reported cases in real-time.
The “coronavirus tracking map” was developed to monitor and record confirmed cases of coronavirus worldwide in real time. The dashboard currently receives an average of 1.2 billion interactions a day. The purpose of making this data available was to provide researchers, public health authorities, and the general public with a user-friendly tool to track the outbreak as it unfolds.
South Korea can be seen as a frontrunner in releasing detailed data on its COVID-19 situation, building its response on transparency and technology. The country had 7,869 confirmed cases of COVID-19 as of midday March 12, the fourth-highest number in the world after China, Italy, and Iran. However, its handling of the crisis has been widely lauded as a benchmark, both for its effective response and for its open and democratic approach to using cutting-edge technology.
The large number of confirmed cases in South Korea can be attributed to the country’s widespread testing, which has covered more than 200,000 people with a capacity of up to 20,000 tests per day. This was made possible by deploying multiple technologies such as diagnostic apps, innovative testing kits, and telecommuting solutions.
By testing widely, and not only people with symptoms, South Korea has detected more asymptomatic positive cases of coronavirus than Italy, particularly among young people. This data is valuable because it shows that younger people, who may not otherwise be tested for COVID-19 because they are asymptomatic, might be the ones spreading the virus. With broad testing as a public health measure, asymptomatic people who carry the virus can isolate even if they don’t feel sick, and so avoid spreading it.
The graph shows that younger people in South Korea, who are tested for the disease regardless of whether they show symptoms, are perhaps more likely to be asymptomatic. Photo credit/Source: Medium/Andreas Backhaus.
During this recent outbreak, many new public tech offerings have appeared.
In China, big data has been used to help stop the spread of the virus. Tencent, a leading provider of internet value-added services, has been rolling out a QR code system on the social network WeChat to track potential COVID-19 carriers on public transportation. Passengers entering a bus, subway, or taxi can submit their information through Tencent’s “ride registration code”, and the system will link their real names with the vehicle’s license plate, boarding time, and other information. When a passenger is discovered to have been infected, other passengers who might have been exposed receive a warning message.
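The matching step behind such a ride registration system can be sketched in a few lines of Python. This is a minimal illustration, not Tencent's implementation: the `Ride` record, the `find_exposed` function, the one-hour exposure window, and the sample rides are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Ride:
    passenger: str        # name submitted through the registration code
    plate: str            # vehicle license plate
    boarded_at: datetime  # boarding time


def find_exposed(rides, infected, window=timedelta(hours=1)):
    """Return passengers who boarded the same vehicle as `infected` within `window`.

    The one-hour window is an assumed threshold, chosen only for illustration.
    """
    by_plate = defaultdict(list)
    for ride in rides:
        by_plate[ride.plate].append(ride)

    exposed = set()
    for ride in rides:
        if ride.passenger != infected:
            continue
        for other in by_plate[ride.plate]:
            close_in_time = abs(other.boarded_at - ride.boarded_at) <= window
            if other.passenger != infected and close_in_time:
                exposed.add(other.passenger)
    return exposed


# Hypothetical registrations, invented for this example
rides = [
    Ride("A", "BUS-1", datetime(2020, 3, 1, 8, 0)),
    Ride("B", "BUS-1", datetime(2020, 3, 1, 8, 20)),
    Ride("C", "BUS-1", datetime(2020, 3, 1, 15, 0)),
    Ride("D", "TAXI-7", datetime(2020, 3, 1, 8, 5)),
]
print(sorted(find_exposed(rides, infected="A")))  # ['B']
```

Only passenger B is flagged: C used the same bus hours later, and D was in a different vehicle. A real system would also need to decide how long a registration is retained, which ties into the privacy concerns discussed later in this post.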
Open research data sets
Another response to the pandemic is open research data sets. Offering freely available data sets to the global research community is a way to generate new insights in support of the ongoing fight against COVID-19. These data sets are open for the community to apply recent advances such as natural language processing and other AI techniques.
The White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19), a resource of over 29,000 scholarly articles available on Kaggle, which the world’s AI experts can use to develop text and data mining tools that help the medical community answer high-priority scientific questions.
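To give a feel for what a very simple text mining tool over such a collection could look like, here is a rough Python sketch that ranks articles by how often the terms of a question appear in them. It does not use the actual CORD-19 files or schema: the `articles` dictionary, its ids and texts, and the `rank` function are placeholders invented for the example, and real tools would use far more capable NLP methods.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(articles, query):
    """Rank article ids by how often the query terms occur in each text."""
    terms = set(tokenize(query))
    scores = {}
    for article_id, text in articles.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score:
            scores[article_id] = score
    # Highest score first; ties are broken by article id
    return sorted(scores, key=lambda article_id: (-scores[article_id], article_id))


# Placeholder texts, not taken from CORD-19
articles = {
    "doc-1": "Incubation period estimates for a novel coronavirus",
    "doc-2": "Hospital capacity planning during an outbreak",
    "doc-3": "Incubation and transmission: incubation period in household contacts",
}
print(rank(articles, "incubation period"))  # ['doc-3', 'doc-1']
```

Plain term counting like this ignores synonyms and context, which is exactly where the natural language processing techniques mentioned above come in.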
What about data privacy?
These new and innovative tech offerings and public datasets that are being rolled out to fight the outbreak of COVID-19 also raise data privacy concerns, since they describe people and their behavior. This is why Denmark, for instance, has only recently started releasing information about its patients.
The concern for people’s privacy is understandable when releasing these data sets, and it is therefore necessary to combine k-anonymity principles with properly audited anonymization of the datasets. This is done to provide a scientific guarantee that the individuals who are the subjects of the data cannot be re-identified, while the data remains useful to data scientists.
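As a rough illustration of the k-anonymity idea, a dataset is k-anonymous when every combination of quasi-identifier values (attributes such as age or region that could be linked to a person) is shared by at least k records. The Python sketch below measures k for a small table and shows how generalizing exact ages into bands raises it. The records, field names, and helper functions are made up for the example and are not taken from any released dataset.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def generalize_age(record, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Made-up records, for illustration only
patients = [
    {"age": 34, "region": "North", "status": "recovered"},
    {"age": 37, "region": "North", "status": "hospitalized"},
    {"age": 62, "region": "South", "status": "recovered"},
    {"age": 68, "region": "South", "status": "recovered"},
]

quasi = ["age", "region"]
print(k_anonymity(patients, quasi))  # 1: every record is unique

banded = [generalize_age(patient) for patient in patients]
print(k_anonymity(banded, quasi))  # 2: each record now shares its group with one other
```

Note that in the banded table both records in the South group have the same status, so the sensitive value is still revealed for that group. This is one reason k-anonymity needs to be combined with an audited anonymization process rather than relied on alone.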
For the common fight against COVID-19
Because COVID-19 is spreading so rapidly that it is difficult to keep up, there is a huge demand for information that can be trusted.
At the same time, we see a growing urgency for new approaches, such as new tech offerings and open research data sets for the research communities.
Publicly available data becomes even more important as it can be used for a variety of purposes that are crucial during an outbreak like this. In the fight against COVID-19, open research data that is available to global communities, data scientists, and AI experts gives us a foundation for working toward a common goal: discovering new insights to fight the virus.
In many instances, these people will be working on their own, developing projects and models that can be of great importance.
With the number of people working on this growing exponentially, bringing the data sets together on a platform can offer significant advantages. For these communities and the brightest minds within AI, a platform provides an opportunity to work together, collaborate, and share projects and models.
At 2021.AI, we will soon be offering open access to our Data and AI Platform to foster collaboration and support the development of advanced data analysis and AI models that accelerate innovative and robust responses to the COVID-19 crisis.
Stay tuned, and stay safe!
Head of Product Management, 2021.AI
Rasmus is VP in Engineering and Chief Architect at 2021.AI. Rasmus has an abundance of experience in positions like Program Manager, Lead Architect, and Senior Consultant for various international Financial, Energy, and Telecom customers. His skills include leadership, mentoring and enterprise architecture.