Help with fitting Gamma distribution with more than 10k samples


I’m having hard time to fit a Gamma distribution… The weird thing is that It work nice when the observable samples are <= 10k, at 50k or more I get some convergence warning, that can be mitigated by increasing the number of tuning steps…

But at 200k samples it crashes all the time, with this error message :

ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV alpha_log__.ravel()[0] is zero.

This is my toy dataset (my ‘observables’) :

data = np.random.gamma(shape=1.6,scale=1.0/0.00169,size=10000)

This 10k dataset histogram looks like this :

This is my model :

with pm.Model() as model:
    alpha = pm.Exponential('alpha',lam=1)
    beta = pm.Exponential('beta',lam=1)
    obs = pm.Gamma('obs',alpha=alpha, beta=beta, observed=data)
    trace = pm.sample(draws=1000, tune=1000,chains=1) 

It samples fast : (>500 it/s)

… And nice :

The hidden parameters are correctly estimated. I’m happy :smiley:

But now simply adding more observables data, from 10k to 200k for instance make the model crash !!

My toy dataset is then :
data = np.random.gamma(shape=1.6,scale=1.0/0.00169,size=200000)

The model starts to sample ok, but suddenly, at around 30%, it stalls, and then crashes with this error message.

ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `alpha_log__`.ravel()[0] is zero.

I don’t understand this behavior, a model is either good or bad but it should not depend on the number of observables? Intuitively I would even think that the more observable I have, the better it is, but in this case this intuition seems wrong…

Any help will be appreciated.


I would reason that the increase of observation makes the posterior distribution much more peaky (concentrated), which makes the sampler quickly adapted to a small step size (as the parameter space is much smaller now), which makes it much easier to stumble upon the local maximum and returns 0 gradient error (the derivative at the maximum or minimum is 0).

1 Like

I am with hwassner here that this behavior seems counter-intuitive, as the data points are all perfectly sampled from the distribution in question, and more datapoints should (in some ways) ease the inferential burden on PyMC’s sampler.

Are the suggestions for efficiently dealing with this to simply mess with tuning and warmup settings, or is this more just an unfortunate downside to nuts that just has to be accepted?

ok thanks @junpenglao, this interpretation totally make sens for me.

But how can I correct it ?

Is there a way to forbid the too small step size ?
or adding a small jitter to the step size ?

And I’m looking for a scalable solution, able to deal with whatever observable size…