Clustering of Country-Based Data in COVID-19 Infections by Coronavirus outbreak features – First wave, Data up to date 28/5/2020
Authors: Akad Doha, Markman Ofer and Lefkort Jared
This study investigated connections between the infection cycles of countries around the world, using factors such as the day of maximum infections, the total infections up to that day, and deaths and recoveries per million. In addition, countries that have completed the infection cycle were compared to understand similarities and differences in these factors and in others.
Note: All variables are reported as of 28/5/2020.
The variables:
Country
State / status – The state of the outbreak
Daily_peak – Maximum number of new daily infections
Total_at_daily_peak – The number of infections from the beginning of the outbreak to the maximum day of the new infections.
Death_per_m – The deaths per million people
Recovered_per_m – The recovery cases per million people
Continent – Continent
Time_to_peak – Time in days from the first day of the outbreak to the day of maximum new infections.
Break_time – Time in days from the day of maximum new infections until the outbreak fades (only for countries where the number of infections has decreased significantly, so that the outbreak can be considered to have ended).
Total_time – Time in days from the first day of the outbreak to its end.
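The variable list above can be pictured as one record per country. The following is only a minimal sketch: the class name CountryRecord, the field types, and the sample values are assumptions made for illustration and are not the authors' data or code. Break_time and Total_time are optional because they exist only for countries that have completed the outbreak cycle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CountryRecord:
    """One row of the data set; field names follow the variable list."""
    country: str
    status: str                  # e.g. "ongoing", "subsiding", "completed"
    daily_peak: int              # maximum number of new daily infections
    total_at_daily_peak: int     # infections from outbreak start to the peak day
    death_per_m: float           # deaths per million people
    recovered_per_m: float       # recoveries per million people
    continent: str
    time_to_peak: int            # days from first outbreak day to the peak day
    break_time: Optional[int] = None   # days from peak to fading, if faded
    total_time: Optional[int] = None   # days from first outbreak day to end

    def has_completed_cycle(self) -> bool:
        """True only when both end-of-cycle variables are available."""
        return self.break_time is not None and self.total_time is not None


# Hypothetical values, for illustration only (not taken from the data set).
finished = CountryRecord(
    country="Country A", status="completed", daily_peak=1200,
    total_at_daily_peak=15000, death_per_m=50.0, recovered_per_m=900.0,
    continent="Europe", time_to_peak=30, break_time=25, total_time=55,
)
ongoing = CountryRecord(
    country="Country B", status="ongoing", daily_peak=4000,
    total_at_daily_peak=80000, death_per_m=20.0, recovered_per_m=300.0,
    continent="Asia", time_to_peak=70,
)
print(finished.has_completed_cycle(), ongoing.has_completed_cycle())
```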
Clustering:
Figure 1. Classification 1, clustering based on the variables: the maximum number of new daily infections, the number of infections from the beginning of the outbreak to the day of maximum infections, the deaths per million people, the recoveries per million people, and the time to the day of maximum new infections.
Cluster 1 – red – characterized by:
- The maximum number of new daily infections below average
- The number of infections from the beginning of the outbreak to the day of maximum daily infections below average
- Deaths per million persons below average
- Recoveries per million less than the number of deaths and below average.
Cluster 2 – blue – characterized by:
- The maximum number of new daily infections usually above average
- Deaths per million people above average
- Recoveries per million above average yet less than deaths
- Time to the maximum day for new infections less than average.
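The text does not name the clustering algorithm that produced these clusters. As an illustration only, the sketch below assumes k-means on standardized variables, with a simple deterministic farthest-point initialization; the function names and the sample rows are hypothetical and do not come from the study.

```python
import math


def standardize(rows):
    """Z-score each column so variables with large ranges do not dominate."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column would give a zero spread; fall back to 1.0 in that case.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; returns one 0-based cluster label per point."""
    # Assumption of this sketch: start from the first point, then repeatedly
    # add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


# Hypothetical rows: daily_peak, total_at_daily_peak, death_per_m,
# recovered_per_m, time_to_peak (illustrative numbers, not real data).
rows = [
    [200, 3000, 10, 8, 70],
    [300, 4000, 15, 10, 65],
    [250, 3500, 12, 9, 75],
    [5000, 90000, 400, 300, 30],
    [6000, 100000, 450, 350, 35],
    [5500, 95000, 500, 320, 28],
]
labels = kmeans(standardize(rows), k=2)
print(labels)
```

Standardizing first matters here because the raw variables differ by several orders of magnitude; without it the infection counts would decide the clusters almost alone.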
Figure 2. Classification 2, clustering based on the variables: the maximum number of new daily infections, the number of infections from the beginning of the outbreak to the day of maximum infections, the deaths per million people, and the recoveries per million people.
Cluster 1 (red): the maximum number of new daily infections is below average, the number of infections from the beginning of the outbreak to the day of maximum new infections is close to average, deaths per million people are about average, and recoveries per million people are above average.
Cluster 2 (green): the maximum number of new daily infections is above average, the number of infections from the beginning of the outbreak to the day of maximum daily infections is most often above average, yet lower than the maximum number of new daily infections, deaths per million are above average, and recoveries per million are above average but lower than deaths.
Cluster 3 (blue): the maximum number of new daily infections is below average and lower than in cluster 1, the number of infections from the beginning of the outbreak to the day of maximum new infections is below average, deaths per million people are below average, and recoveries per million people are below average and lower than deaths.
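The cluster descriptions above compare each cluster with the overall average. A minimal sketch of that comparison follows; the function name, the 10% tolerance used for "about average", and the sample values are assumptions for illustration and are not taken from the study.

```python
def cluster_profile(values, labels, tolerance=0.10):
    """Describe each cluster's mean relative to the overall mean of one variable."""
    overall = sum(values) / len(values)
    profile = {}
    for cluster in sorted(set(labels)):
        members = [v for v, label in zip(values, labels) if label == cluster]
        mean = sum(members) / len(members)
        # Assumption: within `tolerance` of the overall mean counts as "about average".
        if abs(mean - overall) <= tolerance * abs(overall):
            profile[cluster] = "about average"
        elif mean > overall:
            profile[cluster] = "above average"
        else:
            profile[cluster] = "below average"
    return profile


# Hypothetical deaths per million and cluster labels (not real data).
death_per_m = [10, 20, 400, 500, 210, 240]
cluster_labels = [3, 3, 2, 2, 1, 1]
profile = cluster_profile(death_per_m, cluster_labels)
print(profile)
```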
Figure 3. Classification 3, clustering based on all variables for countries that have already completed the outbreak cycle.
Cluster 1 (red): maximum number of new daily infections above average, number of infections from the beginning of the outbreak to the day of maximum new infections above average, recoveries per million people below average, fading time below average, and total time to completion of the outbreak cycle below average.
Cluster 2 (blue): maximum number of new daily infections below average, number of infections from the beginning of the outbreak to the day of maximum new infections below average, fading time usually above average although not necessarily longer than in cluster 1, and total time to the end of the outbreak cycle above average.
This classification is based on a small number of countries, since few countries have completed the outbreak cycle, so we use it only to understand what kinds of classifications are obtained when fading time and total time are available.
Figure 4. World map by classification 1:
The map shows that the countries of Asia, Northeastern Europe, Africa, Central America and South America, and some of North America are classified as Cluster 1, which means that they have Cluster 1 characteristics.
Western Europe, eastern South America, and part of North America belong to Cluster 2. (Please refer to the cluster properties in the explanation of Figure 1.)
Figure 5. World Map by Classification 2:
Northern North America, South America, the Middle East, parts of Europe, and North Asia are classified as Cluster 1.
Western Europe, Southeastern America, and some of North America are classified as Cluster 2.
East Asia, Africa, parts of Northern Europe, parts of South America and Central America are classified into Cluster 3. (Please refer to Cluster properties in explanation of Figure 2).
Figure 6. Summary Classification – Combining the two classifications 1 and 2:
Cluster 1 (red) is characterized by a maximum number of new daily infections above average (the highest of all clusters), a number of infections from the beginning of the outbreak to the day of maximum new daily infections at or above average, deaths above average and above cluster 4, and recoveries per million people above average, yet lower than deaths.
Cluster 2 (green) is characterized by a maximum number of new daily infections close to average, tending to be above average in most cases, a number of infections from the beginning of the outbreak to the day of maximum new daily infections close to average, deaths mostly at or above average but below cluster 1, and recoveries per million above average and greater than deaths.
Cluster 3 (blue) is characterized by a maximum number of new daily infections below average, a number of infections from the beginning of the outbreak to the day of maximum new daily infections at or below average, deaths below average (the lowest of all clusters), and recoveries per million people below average and lower than deaths.
Cluster 4 (purple) is characterized by a maximum number of new daily infections below average, a number of infections from the beginning of the outbreak to the day of maximum new daily infections below average, deaths above average and above clusters 2 and 3, and recoveries per million above average and above deaths (the greatest number of recoveries).
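The text does not state how classifications 1 and 2 were combined into the four summary clusters. As one possible first step only, the sketch below groups countries by their pair of labels from the two classifications, which shows which combinations actually occur. The function name, country names, and labels are hypothetical.

```python
from collections import defaultdict


def cross_tabulate(names, labels_a, labels_b):
    """Group names by the pair (label in classification A, label in classification B)."""
    groups = defaultdict(list)
    for name, a, b in zip(names, labels_a, labels_b):
        groups[(a, b)].append(name)
    return dict(groups)


# Hypothetical countries and labels (not the study's actual assignments).
names = ["Country A", "Country B", "Country C", "Country D", "Country E"]
classification_1 = [2, 2, 1, 1, 1]
classification_2 = [2, 2, 1, 3, 3]

groups = cross_tabulate(names, classification_1, classification_2)
for pair, members in sorted(groups.items()):
    print(pair, members)
```

Each non-empty pair is a candidate for a combined cluster; how the authors merged or relabeled such pairs is not described in the text.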
Figure 7. Distribution of time until the day of maximum new infections by the summary classification.
Cluster 3 has the highest average time to the day of maximum new infections, followed by Cluster 1 and then Cluster 2; Cluster 4 has the lowest average.
Figure 8. The world map classified according to the summary classification:
Southern South America, parts of North America, and Western Europe are classified as Cluster 1.
Table 1. Countries in the first cluster:
Status | Country |
Ongoing | USA |
Subsiding | Belgium |
Subsiding | UK |
Subsiding | Italy |
Ongoing | Brazil |
Subsiding | France |
Subsiding | Spain |
Western South America, parts of North America, the Middle East, North Asia and some parts of Europe are classified as Cluster 2.
Table 2. Countries in the second cluster:
status | country | status | country |
ongoing | Panama | ongoing | Russia |
completed | Norway | subsiding | Turkey |
subsiding | Germany | reemerged | Iran |
ongoing | Peru | ongoing | Canada |
subsiding | Netherlands | ongoing | Saudi Arabia |
ongoing | Sweden | ongoing | Chile |
completed | Israel | subsiding | Portugal |
completed | Austria | subsiding | Ecuador |
subsiding | Denmark |
Parts of America, Africa, East Asia and parts of Europe are classified into Cluster 3.
Table 3. Countries in the third cluster:
status | country | status | country |
ongoing | South Africa | ongoing | Poland |
ongoing | Philippines | ongoing | Mexico |
ongoing | Dominican Republic | ongoing | India |
ongoing | Egypt | ongoing | Pakistan |
completed | South Korea | ongoing | Bangladesh |
subsiding | Czechia | ongoing | Ukraine |
ongoing | Argentina | ongoing | Indonesia |
ongoing | Algeria | subsiding | Romania |
subsiding | Finland | completed | Japan |
subsiding | Hungary | ongoing | Colombia |
completed | China |
Small parts of Western Europe are classified into Cluster 4. (Please refer to Cluster properties in explanation of Figures 6 and 7)
Table 4. Countries in the fourth cluster:
status | country |
completed | Switzerland |
completed | Ireland |
Interesting discovery:
While searching for variables that contribute to a clearer picture of the world situation, we found that some countries have a day that repeats every week and is characterized by the minimum number of deaths from coronavirus. These countries include the United States, Brazil, the Netherlands, Sweden, and Israel.
In addition, India had a day that repeats every week and is characterized by a maximum number of new infections.
Peru had a particular day that repeats every week and is characterized by a minimum number of new infections.
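A weekly pattern of this kind can be checked by finding, for each 7-day window, the day with the lowest (or highest) value and counting which weekday it falls on most often. The sketch below is one way to do this under stated assumptions: the function name, the use of consecutive 7-day windows from the start date, and the sample series are hypothetical and are not the authors' method or data.

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekly_extreme_day(start, daily_values, pick=min):
    """Return (weekday name, share of weeks) for the most frequent weekly extreme.

    `pick` is `min` for a weekly minimum or `max` for a weekly maximum.
    Only complete 7-day windows, counted from `start`, are used.
    """
    if len(daily_values) < 7:
        raise ValueError("need at least one full week of values")
    counts = Counter()
    weeks = 0
    for i in range(0, len(daily_values) - 6, 7):
        week = daily_values[i:i + 7]
        offset = week.index(pick(week))          # position of the extreme in the week
        day = start + timedelta(days=i + offset)
        counts[WEEKDAYS[day.weekday()]] += 1
        weeks += 1
    name, hits = counts.most_common(1)[0]
    return name, hits / weeks


# Hypothetical daily death counts for three weeks starting on a Monday.
start = date(2020, 4, 6)
deaths = [50, 55, 60, 58, 52, 40, 30,
          48, 57, 61, 59, 50, 42, 28,
          45, 50, 55, 53, 49, 38, 35]
result = weekly_extreme_day(start, deaths, pick=min)
print(result)
```

Passing pick=max gives the weekday of the weekly maximum instead, as in the observation about India.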
Statistical insights appendix:
Figure 9. Distribution of the quantitative variables
We can see that:
- The maximum number of new daily infections in most countries is less than 10000 people; only in individual cases is it over 10000.
- The number of deaths from the virus in most countries is less than 200 people per million.
- The number of people who have recovered from the virus in most countries is under 2000 people per million.
- The time to the day of maximum new infections varies by country and there is no common number of days, but the chart suggests that most countries reach the day of maximum new infections in under 80 days.
- The number of infections from the beginning of the outbreak to the day of maximum new infections in most countries does not exceed 250000 infections.
Relationships and correlations between variables:
Figure 10. Correlation between the different variables
The most prominent correlations between the variables are:
- The number of new infections on the day of maximum new infections and the number of infections from the beginning of the outbreak to that day show a strong positive correlation.
- The number of deaths and the number of recoveries show a moderate positive correlation.
- The number of recoveries per million and the time to the day of maximum new infections show a moderate negative correlation.
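The text does not say which correlation coefficient was used or where the boundaries between "moderate", "strong" and "very strong" lie. The sketch below assumes the Pearson coefficient and uses illustrative cut-offs of 0.4, 0.6 and 0.8; both the cut-offs and the sample values are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def describe(r):
    """Turn a coefficient into words; the cut-offs are assumptions of this sketch."""
    size = abs(r)
    if size >= 0.8:
        strength = "very strong"
    elif size >= 0.6:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


# Hypothetical values (not real data): daily peak and total infections at the peak.
daily_peak = [200, 300, 250, 5000, 6000, 5500]
total_at_daily_peak = [3000, 4000, 3500, 90000, 100000, 95000]
r = pearson(daily_peak, total_at_daily_peak)
print(round(r, 3), describe(r))
```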
Figure 11. Correlation between all variables for countries that completed the outbreak cycle:
The most prominent correlations between the variables are:
- The number of new infections on the day of maximum new infections and the number of infections from the beginning of the outbreak to that day show a very strong positive correlation.
- The number of deaths and the number of recoveries show a strong positive correlation.
- The number of recoveries has a moderate negative correlation with both the maximum number of new daily infections and the number of infections from the beginning of the outbreak to the day of maximum new infections.
- The fading time of the outbreak and the total time of the complete outbreak cycle show a very strong positive correlation.
- The maximum number of new daily infections has a strong negative correlation with both the fading time and the total time of the outbreak cycle.
- The number of infections from the beginning of the outbreak to the day of maximum new infections has a very strong negative correlation with both the fading time and the total time of the complete outbreak cycle.
* Note that these correlations are based on a small number of countries, so they may be biased relative to the true situation. With more countries that have completed the outbreak cycle, the estimates would be more precise; this is recommended for future research.
Figure 12. Diagram of the correlation between variables by PCA analysis (For all countries)
The diagram shows the relationships between all variables, which can be interpreted as follows:
- As the total number of infections from the beginning of the outbreak to the day of maximum new infections increases, the maximum number of new daily infections increases.
- As the number of deaths increases, the number of recovered patients also increases.
- As the time to the day of maximum new infections decreases, the number of recovered patients increases.
- The variables depicted in red are those that are most significant for understanding the world data; the variables in blue are less significant but are still necessary for understanding the data. Therefore, one analysis was performed including the time to the day of maximum new infections variable, and one was performed without it.
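The PCA diagram itself is not reproduced here, and the text does not say how it was computed. As a rough sketch of the underlying idea, the code below finds the first principal component of the correlation matrix by power iteration. Variables whose loadings share a sign move together along that component, and the squared loading is one common measure of how much a variable contributes to it. Reading the red and blue coloring as this kind of contribution is an assumption, and the sample values are hypothetical.

```python
import math


def zscores(column):
    """Standardize one column; assumes the column is not constant."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def correlation_matrix(columns):
    """Correlation matrix of the given columns (one list per variable)."""
    z = [zscores(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def first_component(matrix, iterations=200):
    """Power iteration: leading eigenvector (loadings) and its eigenvalue."""
    # Slightly uneven start vector, to make an orthogonal start less likely.
    v = [1.0 + 0.1 * i for i in range(len(matrix))]
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(m * x for m, x in zip(row, v))
                     for vi, row in zip(v, matrix))
    return v, eigenvalue


# Hypothetical columns (not real data).
daily_peak = [200, 300, 250, 5000, 6000, 5500]
total_at_daily_peak = [3000, 4000, 3500, 90000, 100000, 95000]
time_to_peak = [70, 65, 75, 30, 35, 28]

matrix = correlation_matrix([daily_peak, total_at_daily_peak, time_to_peak])
loadings, eigenvalue = first_component(matrix)
# The trace of a correlation matrix equals the number of variables.
explained_share = eigenvalue / len(matrix)
contributions = [x * x for x in loadings]  # squared loadings sum to 1
print(loadings, explained_share, contributions)
```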
Figure 13. Diagram of the correlation between variables by PCA analysis (Countries that have completed the outbreak cycle)
This chart shows the relationships of the variables with the two variables that are available only for countries that have completed the outbreak cycle: 1. fading time, 2. total time to completion.
- As the fading time (the time from the day of maximum infections until infection rates decline) increases, so does the total length of the infection cycle. There also appears to be a negative relationship between these two variables and the time to the day of maximum new infections.
- As the fading time and the total time decrease, the total number of infections up to the day of maximum new infections and the maximum number of new daily infections increase (very interesting).
Reference:
The data was collected from:
https://ourworldindata.org/covid-deaths