Crowdsourcing Difficult-to-Collect Epidemiological Data in Pandemics: Lessons from Ebola to the current COVID-19 Pandemic
Curator: Stephen J. Williams, Ph.D.
At the onset of the COVID-19 pandemic, epidemiological data from the origin of the SARS-CoV-2 outbreak, notably from the Wuhan region in China, were sparse. In fact, official individual patient data rarely become available early in an outbreak, when those data are needed most. Epidemiological data were just emerging from China as countries like Italy, Spain, and the United States started to experience rapid growth of the outbreak within their own borders. China, made up of 31 provincial-level regions, is a vast and complex country, with both large urban and rural areas.

Geographic Regions of China. Source: https://www.unicef.cn/en/figure-11-geographic-regions-china
As a result of this geographical diversity and differences in healthcare coverage across the country, collecting epidemiological data can be challenging. For instance, cancer incidence for individual regions and for the whole country is difficult to calculate because there are few regional cancer data collection efforts. By contrast, cancer statistics in the United States are meticulously collected by cancer registries in each region, state and municipality. Therefore, countries like China must depend on hospital records and autopsy reports in order to back-extrapolate cancer incidence. This is also the case in some developed countries like Italy, where a cancer registry is administered by a local government and may not be as extensive (for example, in the Napoli region of Italy).

Population density China by province. Source https://www.unicef.cn/en/figure-13-population-density-province-2017
In areas where data collection is challenging, epidemiologists are relying on alternate means of data collection, such as devices connected to the internet-of-things (for example, mobile devices) and, in some cases, social media, which is becoming useful for obtaining health-related data. One such effort, which used the social media site Twitter to acquire data on pharmacovigilance, patient engagement, and adherence to oral chemotherapeutics, has been discussed in an earlier post (see below):
Twitter is Becoming a Powerful Tool in Science and Medicine at https://pharmaceuticalintelligence.com/2014/11/06/twitter-is-becoming-a-powerful-tool-in-science-and-medicine/
Now epidemiologists are finding that crowd-sourced data from social media and social networks are useful for collecting COVID-19 related data in countries where health data collection efforts may be sub-optimal. In a recent paper in The Lancet Digital Health [1], authors Kaiyuan Sun, Jenny Chen, and Cecile Viboud present data from the COVID-19 outbreak in China, using information collected from social network sites as well as public news outlets, and find strong correlations with later-released government statistics. This shows the usefulness of such social and crowd-sourcing strategies for collecting pertinent, time-sensitive data. In particular, the authors' aim was to investigate whether this strategy of data collection can reduce the time delays between infection and detection, isolation and reporting of cases.
The paper is summarized below:
Kaiyuan Sun, PhD; Jenny Chen, BScn; Cécile Viboud, PhD. (2020). Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. The Lancet Digital Health; Volume 2, Issue 4, e201-e208.
Summary
Background
As the outbreak of coronavirus disease 2019 (COVID-19) progresses, epidemiological data are needed to guide situational awareness and intervention strategies. Here we describe efforts to compile and disseminate epidemiological information on COVID-19 from news media and social networks.
Methods
In this population-level observational study, we searched DXY.cn, a health-care-oriented social network that is currently streaming news reports on COVID-19 from local and national Chinese health agencies. We compiled a list of individual patients with COVID-19 and daily province-level case counts between Jan 13 and Jan 31, 2020, in China. We also compiled a list of internationally exported cases of COVID-19 from global news media sources (Kyodo News, The Straits Times, and CNN), national governments, and health authorities. We assessed trends in the epidemiology of COVID-19 and studied the outbreak progression across China, assessing delays between symptom onset, seeking care at a hospital or clinic, and reporting, before and after Jan 18, 2020, as awareness of the outbreak increased. All data were made publicly available in real time.
Findings
We collected data for 507 patients with COVID-19 reported between Jan 13 and Jan 31, 2020, including 364 from mainland China and 143 from outside of China. 281 (55%) patients were male and the median age was 46 years (IQR 35–60). Few patients (13 [3%]) were younger than 15 years and the age profile of Chinese patients adjusted for baseline demographics confirmed a deficit of infections among children. Across the analysed period, delays between symptom onset and seeking care at a hospital or clinic were longer in Hubei province than in other provinces in mainland China and internationally. In mainland China, these delays decreased from 5 days before Jan 18, 2020, to 2 days thereafter until Jan 31, 2020 (p=0·0009). Although our sample captures only 507 (5·2%) of 9826 patients with COVID-19 reported by official sources during the analysed period, our data align with an official report published by Chinese authorities on Jan 28, 2020.
Interpretation
News reports and social media can help reconstruct the progression of an outbreak and provide detailed patient-level data in the context of a health emergency. The availability of a central physician-oriented social network facilitated the compilation of publicly available COVID-19 data in China. As the outbreak progresses, social media and news reports will probably capture a diminishing fraction of COVID-19 cases globally due to reporting fatigue and overwhelmed health-care systems. In the early stages of an outbreak, availability of public datasets is important to encourage analytical efforts by independent teams and provide robust evidence to guide interventions.
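The percentages quoted in the Findings paragraph can be checked with simple arithmetic. The short Python sketch below recomputes them from the counts given in the abstract; the variable names and the helper `pct` are illustrative choices, not anything taken from the paper.

```python
# Counts quoted in the Findings paragraph of the abstract.
total_sample = 507
mainland_china = 364
outside_china = 143
male = 281
under_15 = 13
official_total = 9826


def pct(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# The two subgroups should add up to the full sample.
assert mainland_china + outside_china == total_sample

print(f"male share:      {pct(male, total_sample):.1f}%")
print(f"under-15 share:  {pct(under_15, total_sample):.1f}%")
print(f"sample coverage: {pct(total_sample, official_total):.1f}%")
```

Rounded to whole numbers, the first two values agree with the 55% and 3% in the abstract, and the last agrees with the quoted 5·2% coverage of officially reported cases.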
A Few notes on Methodology:
- The authors used crowd-sourced reports from DXY.cn, a social network for Chinese physicians, health-care professionals, pharmacies and health-care facilities. This online platform provides real time coverage of the COVID-19 outbreak in China
- Additional data were curated from news media and television, and include time-stamped information on COVID-19 cases
- These reports are publicly available, de-identified patient data
- No patient consent was needed and no ethics approval was required
- Data were collected between January 20, 2020 and January 31, 2020
- Sex, age, province of identification, travel history, and dates of symptom development were collected
- Additional data was collected for other international sites of the pandemic including Cambodia, Canada, France, Germany, Hong Kong, India, Italy, Japan, Malaysia, Nepal, Russia, Singapore, UK, and USA
- All patients in the database had laboratory confirmation of infection
Results
- Data were collected for 507 patients, of whom 153 had visited Wuhan and 152 were residents of Wuhan
- Reported cases were skewed toward males; however, the overall population in China is also skewed toward males
- The largest share of cases (26%) came from Beijing (an urban area), while an equal number came from rural areas combined (Shaanxi and Yunnan)
- The age distribution of COVID-19 cases was skewed toward older age groups, with a median age of 46 years (IQR 35–60); few patients (3%) were younger than 15 years, which the authors interpret as a deficit of infections among children
- Outbreak progression based on the crowd-sourced patient line was consistent with the data published by the China Center for Disease Control
- Median reporting delay in the authors' crowd-sourced data was 5 days
- Crowd-sourced data was able to detect apparent rapid growth of newly reported cases during the collection period in several provinces outside of Hubei province, which is consistent with local government data
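To make the delay analysis concrete, here is a minimal Python sketch of how a median delay between symptom onset and seeking care could be computed from a patient line list, split at January 18, 2020. The records below are hypothetical illustrative values, not data from the paper, and splitting by date of symptom onset is an assumption made for this example.

```python
from datetime import date
from statistics import median

# Hypothetical line-list records (illustrative values, NOT from the paper):
# (date of symptom onset, date of first visit to a hospital or clinic)
line_list = [
    (date(2020, 1, 10), date(2020, 1, 15)),
    (date(2020, 1, 12), date(2020, 1, 18)),
    (date(2020, 1, 14), date(2020, 1, 18)),
    (date(2020, 1, 19), date(2020, 1, 21)),
    (date(2020, 1, 22), date(2020, 1, 23)),
    (date(2020, 1, 25), date(2020, 1, 28)),
]

# Split point mentioned in the abstract; splitting on onset date is an assumption.
CUTOFF = date(2020, 1, 18)


def median_delay(records):
    """Median number of days between symptom onset and seeking care."""
    return median((visit - onset).days for onset, visit in records)


before = [r for r in line_list if r[0] < CUTOFF]
after = [r for r in line_list if r[0] >= CUTOFF]

print("median delay before cutoff:", median_delay(before), "days")
print("median delay after cutoff: ", median_delay(after), "days")
```

The same pattern would apply to reporting delays by replacing the second date in each record with the date the case was reported.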
The following graphs show age distribution for China in 2017 and predicted for 2050.

Projected age distribution, China 2050. Source: https://chinapower.csis.org/aging-problem/
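The population age structure matters because the abstract describes the age profile of Chinese patients as "adjusted for baseline demographics". One simple way to make such an adjustment, sketched below in Python, is to divide each age group's share of cases by that group's share of the population. This is an assumed, simplified formulation rather than the authors' exact method, and all counts and population shares in the example are hypothetical.

```python
# Hypothetical case counts by age group (illustrative, NOT from the paper).
cases = {"0-14": 5, "15-64": 70, "65+": 25}

# Hypothetical population shares by age group (illustrative; must sum to 1).
population_share = {"0-14": 0.18, "15-64": 0.70, "65+": 0.12}


def relative_risk(cases, population_share):
    """Share of cases in each age group divided by its population share.

    A value below 1 means the group is under-represented among cases
    relative to its size in the population; above 1 means over-represented.
    """
    total_cases = sum(cases.values())
    return {
        group: (count / total_cases) / population_share[group]
        for group, count in cases.items()
    }


rr = relative_risk(cases, population_share)
for group, value in rr.items():
    print(f"{group}: relative risk = {value:.2f}")
```

With these made-up inputs, children come out under-represented and older adults over-represented, which is the kind of pattern the abstract describes as a deficit of infections among children.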
The authors have previously used this curation of news methodology to analyze the Ebola outbreak[2].
A further use of the crowd-sourced database was availability of travel histories for patients returning from Wuhan and onset of symptoms, allowing for estimation of incubation periods.
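As a rough illustration of why travel histories help, the Python sketch below bounds a single traveller's incubation period using the window of possible exposure (the stay in Wuhan) and the date of symptom onset. The function name `incubation_bounds` and the example dates are hypothetical, and this simple bounding is not the statistical method used in the studies cited below, which fit distributions to interval-censored observations of this kind.

```python
from datetime import date


def incubation_bounds(exposure_start, exposure_end, onset):
    """Return (shortest, longest) possible incubation period in days.

    Infection is assumed to have occurred at some point within the exposure
    window and no later than symptom onset.
    """
    if exposure_end < exposure_start:
        raise ValueError("exposure window ends before it starts")
    # Infection cannot occur after symptom onset, so cap the window at onset.
    last_possible = min(exposure_end, onset)
    if last_possible < exposure_start:
        raise ValueError("symptom onset precedes the exposure window")
    shortest = (onset - last_possible).days
    longest = (onset - exposure_start).days
    return shortest, longest


# Hypothetical traveller: in Wuhan Jan 10-14, 2020; symptoms began Jan 19, 2020.
shortest, longest = incubation_bounds(
    date(2020, 1, 10), date(2020, 1, 14), date(2020, 1, 19)
)
print(f"incubation period between {shortest} and {longest} days")
```

A short, well-documented stay gives a narrow interval, which is why returning travellers are especially informative for this kind of estimate.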
The following published literature has also used these datasets:
Backer JA, Klinkenberg D, Wallinga J: Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20-28 January 2020. Euro surveillance : bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin 2020, 25(5).
Lauer SA, Grantz KH, Bi Q, Jones FK, Zheng Q, Meredith HR, Azman AS, Reich NG, Lessler J: The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Annals of internal medicine 2020, 172(9):577-582.
Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, Ren R, Leung KSM, Lau EHY, Wong JY et al: Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia. The New England journal of medicine 2020, 382(13):1199-1207.
The dataset is available on the website of the Laboratory for the Modeling of Biological and Socio-technical Systems at Northeastern University: https://www.mobs-lab.org/.
References
1. Sun K, Chen J, Viboud C: Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. The Lancet Digital Health 2020, 2(4):e201-e208.
2. Cleaton JM, Viboud C, Simonsen L, Hurtado AM, Chowell G: Characterizing Ebola Transmission Patterns Based on Internet News Reports. Clinical Infectious Diseases 2016, 62(1):24-31.