I am running a multivariate regression. PyMC3 works when the individual data values are small, but my dataset has large values (e.g., 5,340,343), and with values this large PyMC3 fails. It took me a long time to work out that this was what was causing the chains to fail. Sampling stops at, say, 7%, and then I get the following messages:
ValueError: Mass matrix contains zeros on the diagonal.
RuntimeError: Chain 3 failed.
The two solutions I found both amount to scaling down the entire dataset:
Standardize the entire dataset. This works, but I need the unstandardized coefficients after PyMC3 has run. Standardized coefficients are expressed in standard-deviation units, which is not what I need, so I would have to undo the standardization for every coefficient after the trace has been computed.
Divide the dataset by a constant such as 100,000. This also works, but the results are vastly different depending on whether I divide by, say, 10,000 or 100,000.
Is there any other solution to this? If not, PyMC3 is not practical for real-world datasets that have large values or large standard deviations.
If I had to guess, I would say you are probably running into floating point issues. You can try changing the default floating point precision as discussed here.
Alternatively, standardizing data is a fairly conventional approach and it alleviates all sorts of issues. Besides not having to worry too much about the numerical representations, it’s also often easier to specify reasonable priors because you automatically know what the basic distribution of all variables is (i.e., each has a mean of 0 and an SD of 1). If you later need coefficients in the native space, it does require a bit of algebra to un-transform things, but it’s typically not that big a deal.
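To make the "bit of algebra" concrete, here is a minimal NumPy sketch. It uses ordinary least squares rather than PyMC so it runs quickly, and the data, the `ols` helper, and all variable names are made up for illustration. If the model is fit on z-scored predictors and a z-scored outcome, each raw-scale slope is the standardized slope times `y_sd / x_sd`, and the raw-scale intercept follows from the means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with large raw values (all numbers here are made up).
n = 200
X = rng.normal(loc=5_000_000.0, scale=800_000.0, size=(n, 2))
true_alpha, true_beta = 1_000.0, np.array([0.5, -0.2])
y = true_alpha + X @ true_beta + rng.normal(scale=50_000.0, size=n)

# Standardize predictors (column-wise) and outcome.
x_mean, x_sd = X.mean(axis=0), X.std(axis=0)
y_mean, y_sd = y.mean(), y.std()
Xz = (X - x_mean) / x_sd
yz = (y - y_mean) / y_sd


def ols(design, target):
    """Least-squares fit with an intercept; returns (intercept, slopes)."""
    A = np.column_stack([np.ones(len(design)), design])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef[0], coef[1:]


# Fit on the standardized scale.
alpha_z, beta_z = ols(Xz, yz)

# Undo the standardization.
beta = beta_z * y_sd / x_sd
alpha = y_mean + y_sd * alpha_z - x_mean @ beta

# Fitting directly on the raw data should give the same coefficients.
alpha_raw, beta_raw = ols(X, y)
print("back-transformed slopes:", beta)
print("raw-data slopes:        ", beta_raw)
```

The same algebra covers plain division by a constant c: treat the mean as 0 and the SD as c. In a Bayesian model the priors live on whatever scale the data are on, so they need to be rescaled along with the data for results to be comparable across different choices of c.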
You can also automatically unscale your coefficients to the data range using a Deterministic random variable in the model.
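A rough sketch of what that could look like, assuming a plain linear regression with Normal priors on the standardized scale. The `standardize` and `build_model` helpers, the variable names, and the prior choices are all hypothetical, not something from your model. It is written against `pymc3`; I would expect the same calls to work in newer `pymc` releases, but treat that as an assumption.

```python
import numpy as np


def standardize(a):
    """Return (z, mean, sd) with z = (a - mean) / sd, computed column-wise."""
    mean = a.mean(axis=0)
    sd = a.std(axis=0)
    return (a - mean) / sd, mean, sd


def build_model(X, y):
    """Regression on standardized data that also records raw-scale coefficients."""
    # Imported here so the helper above can be used without PyMC3 installed.
    import pymc3 as pm

    Xz, x_mean, x_sd = standardize(X)
    yz, y_mean, y_sd = standardize(y)

    with pm.Model() as model:
        # Priors on the standardized scale (an assumed, weakly informative choice).
        alpha_z = pm.Normal("alpha_z", mu=0.0, sigma=1.0)
        beta_z = pm.Normal("beta_z", mu=0.0, sigma=1.0, shape=X.shape[1])
        sigma_z = pm.HalfNormal("sigma_z", sigma=1.0)

        mu = alpha_z + pm.math.dot(Xz, beta_z)
        pm.Normal("y_obs", mu=mu, sigma=sigma_z, observed=yz)

        # Back-transformed quantities, stored in the trace for every draw.
        beta = pm.Deterministic("beta", beta_z * y_sd / x_sd)
        pm.Deterministic("alpha", alpha_z * y_sd + y_mean - pm.math.dot(x_mean, beta))
        pm.Deterministic("sigma", sigma_z * y_sd)
    return model
```

You would then call `pm.sample()` inside `with build_model(X, y):` and read the `beta`, `alpha`, and `sigma` entries of the trace, which are already in the original units, so nothing has to be undone by hand afterwards.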
It might also be worth noting that the NUTS sampler (the default one in PyMC) has no issues with large numbers by themselves, but tends to have problems when different variables have very different scales and there are some correlations between them.
For such cases, if there aren’t many parameters in your model, instead of reparametrizing (or rescaling), you can also try using a full mass matrix by passing init="adapt_full" to pm.sample. This blogpost has some extra context on this topic and some trade-offs between reparametrization and full mass matrices.
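As a toy illustration of the point about scales and correlations, the NumPy sketch below takes a made-up 2x2 posterior covariance and compares the geometry the sampler would face with no rescaling, with a per-parameter (diagonal) rescaling, and with a full whitening transform. This is only an analogy for what mass matrix adaptation does, not PyMC's actual adaptation code, and the numbers and the `condition_after` helper are hypothetical.

```python
import numpy as np

# Hypothetical posterior covariance for two parameters on very different
# scales (SDs of 1e-3 and 1e3) that are also strongly correlated (rho = 0.95).
sd = np.array([1e-3, 1e3])
rho = 0.95
corr = np.array([[1.0, rho], [rho, 1.0]])
cov = corr * np.outer(sd, sd)


def condition_after(transform, cov):
    """Condition number of the covariance after the change of variables
    x -> transform @ x (closer to 1 means easier geometry)."""
    return np.linalg.cond(transform @ cov @ transform.T)


# No adaptation: the sampler sees the raw geometry.
cond_raw = condition_after(np.eye(2), cov)

# Diagonal rescaling: each parameter divided by its own SD.
# This removes the scale difference but leaves the correlation.
cond_diag = condition_after(np.diag(1.0 / sd), cov)

# Full whitening with the inverse Cholesky factor of the covariance.
# This removes both the scale difference and the correlation.
# (In PyMC the full mass matrix is requested via pm.sample(init="adapt_full").)
L = np.linalg.cholesky(cov)
cond_full = condition_after(np.linalg.inv(L), cov)

print(f"raw: {cond_raw:.3g}, diagonal: {cond_diag:.3g}, full: {cond_full:.3g}")
```

The diagonal case is left with a condition number of (1 + rho) / (1 - rho), which comes entirely from the correlation. The cost of the full version is that the matrix has one entry per pair of parameters, which is why it is mainly attractive for small models.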
Thanks. That seems to fix the chain failure issue, but I am not getting meaningful results yet. It keeps telling me to increase tuning. Let me play around with it more. Thank you for your help. At least it works now.