Data Science

Covid-9 Evolution Model: Calibration results for Italy

Disclaimer: what follow is just a logic/math/coding personal exercise, it does not have any medical implications. I’m not a virologist, nor an expert of medical topics. What I’m sharing is just made for intellectual pleasure.

Here is the results of an calibration exercise based on Italian data as of April, 19th. I’ll present the results of the calibration only, mathematical details of the model and a general description of variables and calibration strategy are described here; the code of the model, saved results and notebooks are available at this github repository

Results

In what follows, I’ll consider only the model optimized for a composite error measure, considering major KPIs involved. Moreover, I’ve also estimated the model for every Italian region to study differences in estimated parameters and in the evolution of the contagion. For practical reason, I’ll focus on only two regions, Emilia Romagna, Lombardia and Piemonte, and I’ll post graphs and analysis by region in the following Jupyter Notebook that I’ll try to update on a regular basis.

In general, the dynamics resulting from the Italy model may be affected by the different "maturity" and conditions of the contagion in each regions. Moreover, the model is highly sensitive to rg and ra parameters: to project into the future their values and provide predicions the model makes the hypotesis that rg and ra are the same as the last observed value (actually average of the last two windows to avoid possible excessive oscillations of the last window).

Lastly, it seems to me that the ability of the model to mimic the shape of the KPIs curves is quite impressive: they are not derived from regression / curve fitting (with which it easier to provide very close mimic of the observed values by increasing the order of the polynomial of the fitted curve, with evident overfitting issues), but from the intrinsic logic of the math model that governs the relations between variables (see graphs below).

Model Calibration and Parameters Analysis

The estimated parameters from the calibration are shown in the table below, from which we can try to tell the "story of the virus lifecycle". If a person gets infected and is asymptomatic, she will be infecting 0.45 (ra, median) people per day, for 4 days (ta2), after which she’ll be recovered and never be infecting again. On the contrary, if she is symptomatic, she’ll infect other 0.09 (rg) people per day for 4 days (t1). The new infected from either asymptomatic or symptomatic will be herself symptomatic with a probability of 80%. After t1, the symptomatics will get severely infected, 10% of them will go in intensive care, where they’ll stay for 10-14 days until recovering. During this period, they may die each day with a probability of 2-5% per day. The remaining 90% will be either staying home or hospedalized for about 28 days until recovery. During this period, they my die with a probability of 0.5% per day.

Italy

The National model estimates a peak in the number of currently infected people of 105.24K on April 15th (vs the current value of 108.26K cases reported as of April 19th). The Currently Infected people is estimated quite effectively, with with a difference of 4.4% from the actual value, while mortality and recovery rate are a bit overestimated. The constant decrease in the number of Recovered in intensive care is also quite well captured. Finally, the model estimates ar R0 of 2.39 and the total number of infected people (the ones known from ufficial statistics plus the ones unknown) is 1.26 times the number of known ones (quite low compared to what seems to be the general understanding). The following picture sums up the key statistics and kpi of the model on April 19th. One note on the evolution curves shown in picture "Model: tot – Prediction": the curve of Currently Infected (Igc_cum, green line) has flattened quite significantly (much lower peak, much flatter right tail). This is a clear evidence of the effect of the lock down measure mentioned above.

Emilia Romagna

The calibration has done a good job for Emilia Romagna: both delta vs actual data and structural kpi (e.g. Mortality rate, % Increase in infected, % Recovered / Tot, % Dead / Tot, etc.) seems to be quite close to reality. To notice: Emilia Romagna showed a constant decrease in R0, with an initial peak that is below the national value. The model predicts a peak in currently infected of 13.6K on April 12th, a more gentile peak in Intensive Care of 271 on March 29th, and a peak in daily deads of 103 on April 4th.

For the other regions, I’ll avoid to comment so not to be too repetitive, as all info are in the graphs.

Lombardia

Few remarks: R0 is still a bit higher than most of the other regions, at 1.57. There seems to be something strange in the sudden drop in Intensive care on the one hand, meanwhile on the other hand recovery and death seems to be lower than what the model predicts.

Piemonte

Remarks: quite hight R0 at 1.73; all KPIs are well represented by the model, but the number of Intensive Care.

Data Science

Covid-19 Evolutionary Model: study on Italy

Here is an attempt to describe the epidemic diffusion of Covid-19 and its effect on the population of a State. The evolution through time is described by a discrete time micro-founded model. The model is calibrated on Italian data, available thanks to the Dipartimento Protezione Civile, and available at this GitHub repository. After the calibration, it is possible to study the characteristics of the parameters on the evolution of the disease, forecast its spread and time to peak, and evaluate the impact of restraining measures put in place by the Government.

Another feature of the model that may be of interest is the possibility to infer some key indicators of the phenomena, like r0 (which is a measure of the capacity of an infected individual to infect other people), time to recover, death probability, etc.

For those interested in looking at the code, please visit this GitHub repo (I’ll update it in the next days, so if you still find it empty, please come back in a little while).

DISCLAIMER: This is a preliminary attempt of a home-made exercise, so please consider it as NON-SCIENTIFIC. I’ve no medical background, and I’m not a researcher, just a person who likes numbers and coding 🙂 I’ll update my findings in this blog as new data comes in, and as I discover and fix bugs / implement new features.

DISCLAIMER 2: This is by no means a post about estimating number of people infected or dead, it is just a math/logical/coding exercise. The high number of infected / dead may be due to the fact that the calibration takes into consideration what happened in the first days of the infection, which is typically characterised by high contagious rate.

The Model

The model describes the evolution in time of a society in which an initial number of people affected by the virus is introduced. At each point in time, every person in the society is in one of the following categories:

1. $Ias_t$: unknown asymptomatic, people that will not suffer from the severe consequences of the virus and will continue their normal life. They will recover from the virus after $t2_{as}$ periods, becoming $Gas_t$.
2. $Igs_t$: unknown seriously affected, people that are currently not showing symptoms but will be hit by the serious consequences of the virus after $t1$ periods, becoming either $Igci_t$ with probability $\gamma$ or $Igcn_t$ with probability $1 - \gamma$.
3. $Igci_t$: known seriuos affected in intensive care, they may die during each period of the hospitalization with probability $\beta_{gci}$, or, if they don’t die, they will recover after $t2_{gi}$ periods becoming $Gci_t$.
4. $Igcn_t$: known serious affected but not in intensive care, they also may die during each period with probaility $\beta_{gcn}$, or, alternatively, they will recover if they survive after $t2_{gn}$ periods becoming $Gcn_t$.
5. $Popi_t$: people that have not been affected by the virus and so can be affected by an infected person. Once infected, the person can become an unknown serious affected ($Igs_t$) with probability $\alpha$, or an unknown asymptomatic ($Ias_t$) with probability $1 - \alpha$

Evolutionary equations

The equations that describe the evolution of each variable in time are desribed below. We wrote the equation with the following convention, to make it more readible: the prefix $N$ is used to descrive people that are new to the category $N$ refers to, while $U$ is used in the same fashion for those that exited the category in which they where before.

$Ias_{t+1} = Ias_{t} + NIas_{t+1} - UIas_{t+1}$

where $NIas_{t+1}$ is the number of people been infected by either $Ias_t$ or $Igs_t$ during period $t$ and become new asymptomatic infected at time $t+1$, and $UIas_{t+1}$ are those that recovered during/after time $t$

$NIas_{t+1} = (1 - \alpha) (rg_t Igs_t + ra_t Ias_t)$

$UIas_{t+1} = Gas_{t+1} = NIas_{t+1 - t2_{as}}$

$Igs_{t+1} = Igs_{t} + NIgs_{t+1} - UIgs_{t+1}$

$NIgs_{t+1} = \alpha (rg_t Igs_t + ra_t Ias_t)$

$UIgs_{t+1} = NIgs_{t+1 - t1}$

$Igci_{t+1} = Igci_t + NIgci_{t+1} - UIgci_{t+1}$

$NIgci_{t+1} = \gamma UIgs_{t+1}$

$Uigci_{t+1} = Ggci_{t+1} + Mgci_{t+1}$

$Ggci_{t+1} = NIgci_{t+1-t2_{gi}} (1-\beta)^{(t2_{gi} - t1)}$

$Mgci_{t+1} = \beta Igci_t$

$Igcn_{t+1} = Igcn_t + NIgcn_{t+1} - UIgcn_{t+1}$

$NIgcn_{t+1} = (1 - \gamma) UIgs_{t+1}$

$UIgcn_{t+1} = GIgcn_{t+1} + Mgcn_{t+1}$

$Ggcn_{t+1} = NIgcn_{t+1 - t2_{gn}} (1 - \beta_{gcn})^{(t2_{gn} - t1)}$

$Mgcn_{t+1} = \beta_{gcn} Igcn_{t}$

$Igc_t = Igci_t + Igcn_t$

Finally, $rg_t$ and $ra_t$ can be seen as a way to calculate what in medical theory is called $r_0$. They represent the number of people that will be infected in a period of time by an unknown seriously affected person and by an unknown asymptomatic, respectively. $rg_t$ and $ra_t$ are not constant parameters, as they vary due to environmental conditions: the higher are the chances of contacts between individuals, for instance, the higher they will be. From a mathematical perspective, they represents an important closure condition: we’ll make them depened on the population not yet affected, and this will make the model not exploding as time goes.

$rg_t = rg \frac{Popi_t}{Pop}$ $ra_t = ra \frac{Popi_t}{Pop}$

where $Pop$ is the total population at time 0. In other words, you can see $rg$ and $ra$ as $rg_{t0}$ $ra_{t0}$, where $t0$ is the starting time when the first infected person is introduced in the society.

Model characteristics and parameter sensitivity

The model is highly dependent on exogenous parameters: here we describe how the most important variable evolve in time given changes in the exogenous parameters.

The rest of the sensitivity graphs are shown at the end of the page, for convenience reasons.

Calibration

In order to calibrate the model, we proceed with an extensive grid search on all parameters and initial conditions. The procedure we developed consists on three different grid searches, that will be described below, each of which follows this path:

1. create a grid of parameters
2. run the model for every combination of parameters in the grid
3. compare the results of the model to actual data coming from the Dipartimento della Protezione Civile Italiana, and calculate an error measure (more on this later)
4. select an optimal model for each error calculated: we calculate errors on the number of total infected people since the beginning of the infection $err^{Igc_{cum}}$, number of people currently infected $err^{Igc}$, total number of deaths $err^{M_{cum}}$, total number of known recovered peolpe $err^{Gc_{cum}}$ and the average of all the above $err^{tot}$

The calibration steps mentioned above are the following:

1. Initial grid search: create a grid of all parameters and find optimal models
2. Parameter fine tuning: starting from the optimal parameters from step 1, create a grid with +- a percentage increase of the optimal parameters and find optimal models
3. Window Search: we divide the periods in windows of fixed length (7 periods), and for each of them we allow the $rg$ and $ra$ parameter to change (like in step 2), so to allow for changing conditions in the infection rate. This will try to find changes in exogenous conditions affecting the infection rate, like social restriction measures taken by the government.

Results

Here are the results of a calibration performed on Italian data at national level, updated as of March 28th. Differences between the model and the window optimized one will be also presented.

DISCLAIMER: this is a quite preliminary version of the model, so its results are still under investigation. In the next days/weeks I’ll be working on it and more robust analysis will be shared. For this reasons, please consider the
following analysis only a pure academic exercise. For the same reasons, I’m not investing time in describing the results,
but I’m only showing them as they are, also to get feedbacks that are always very very wellcome.

Parameter Model Opt Window
rg 0.155 0.153
ra 0.888 0.870
alpha 0.599 0.599
beta 0.011 0.011
beta_gcn 0.009 0.009
gamma 0.064 0.064
t1 1 1
tgi2 15 15
tgn2 15 15
ta2 103 103
Igs_t0 156 156
Peak Var Model Opt Window
Total Infected (Igc_cum) 9.85M @ Sep 08 8.72M @ Sep 08
Daily Infected (Igc) 2.76M @ May 28 2.14M @ Jun 03
Daily Infected Int. Care (Igci) 175K @ May 28 136K @ Jun 02
Daily Deaths (M) 26K @ May 29 20K @ Jun 04
Total Deaths (M_cum) 1.27M @ Aug 04 1.12M @ Aug 20

Finally, here is a comparison between the model, optimal window model and actual data for key variables.

Data Science

Things in Data Science 2015-11

Apache Spark Machine Learning Tutorial
Informative tutorial that walks us through using Spark’s machine learning capabilities and Scala to train a logistic regression classifier on a larger-than-memory dataset.
by Dmitry Petrov

60+ Free books on Bg Data, Data Science, Data Mining, …
Here is a great collection of eBooks written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science. by Brendan Martin

Advanced Jupyter Notebook Tricks – Part 1
Worth reading to know about “Magics”: it allows to use multiple languages in a single notebook. In particular, with rmagic you can mix multiple python with R.