Hi,
I am trying to switch my model from pm.sample() to pm.fit() for performance reasons. I have gotten my model to sample well with pm.sample() on a subset of my data (0 divergences, all r_hats under 1.01, and reasonable posteriors), but my full dataset is much too large for pm.sample() to be a viable option long term, unless there's some performance boost I'm missing. I am using nutpie, and with ~12k records pm.sample() takes about 6.5 minutes, but my full dataset is 100x that size.
So, when I try pm.fit(method="advi"), I immediately get the following error for several of my parameters:
The current approximation of RV parameter_x.ravel()[0] is NaN.
I'm not understanding how this could be happening when I'm able to use pm.sample() without issue. I have tried using the start parameter, but that doesn't seem to do anything, nor does setting initvals in the model. Is there a specific reparametrization required when switching from the sample method to the fit method?
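For reference, here is a stripped-down sketch of the calls (the model and variable names are placeholders, not my real model):

```python
import numpy as np
import pymc as pm

with pm.Model() as model:
    # Placeholder structure; the real model has many more parameters.
    parameter_x = pm.Normal("parameter_x", mu=0.0, sigma=1.0, shape=3)

with model:
    # This samples cleanly on the data subset:
    # idata = pm.sample(nuts_sampler="nutpie")

    # This fails immediately with "The current approximation of RV ... is NaN":
    approx = pm.fit(
        method="advi",
        n=50_000,
        start={"parameter_x": np.zeros(3)},  # seems to have no effect
    )
```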
Thanks
Do you get the error with the full dataset or the same one you tried sample() with?
I can report that I’ve also experienced this 
Meaning the same dataset is less stable with VI than with MCMC, or only with a larger dataset?
May need smaller learning rates?
Same dataset.
I assumed it’s a feature of NaN handling in our VI code vs MCMC. MCMC will just give the sample -inf probability if it’s nan, whereas VI fails outright (as reported).
Well, MCMC can just stay where it was and try a new trajectory; not sure VI can do anything? It's a deterministic optimization, no?
It could backtrack and adjust the learning rate, or try a new batch (in a minibatch setting). Optax, for example, has apply_if_finite, which would be nice for us to have in combination with other stochastic-optimization niceties like gradient clipping and learning rate scheduling.
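Roughly the pattern I mean, in Optax terms (just a sketch of the Optax side, not something PyMC wires up for you; the hyperparameters are arbitrary):

```python
import optax

# Learning-rate schedule: decay the step size over time.
schedule = optax.exponential_decay(
    init_value=1e-2, transition_steps=1_000, decay_rate=0.9
)

# Clip gradients, use the schedule, and skip updates with non-finite values
# instead of letting a single NaN kill the whole optimization.
optimizer = optax.apply_if_finite(
    optax.chain(
        optax.clip_by_global_norm(1.0),       # gradient clipping
        optax.adam(learning_rate=schedule),   # learning-rate scheduling
    ),
    max_consecutive_errors=5,
)
```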
@ricardoV94 It occurs on the same dataset that I use pm.sample(). I haven’t tried it out on any larger set.
@jessegrabowski Do you know of any workaround or ways to troubleshoot? I'm not really understanding the NaN issue when my dataset doesn't have any NaNs...
The NaN happens during the optimization, probably because the gradients become too extreme.
Is there a typical workflow for trying to deal with that?
Try a different optimizer, and/or adjust the learning rate. For optimizers with momentum parameters, you can play around with those as well.
As I mentioned, we don't have fancier tools like gradient clipping / learning rate scheduling / retry on NaN (yet!), so help wanted.
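For example, something like this (a sketch; the learning rate, momentum values, and number of steps are just starting points to experiment with, not recommendations):

```python
import pymc as pm
from pymc.variational.updates import adam  # depending on version, also reachable as pm.adam

with model:
    approx = pm.fit(
        n=100_000,
        method="advi",
        # Swap in Adam with an explicit, smaller learning rate; lower it further
        # if you still hit NaNs. beta1/beta2 are the momentum-style parameters
        # you can also play around with.
        obj_optimizer=adam(learning_rate=1e-3, beta1=0.9, beta2=0.999),
    )
```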
Will follow up if I find something that works 
In my (admittedly limited) experience, I’ve found RMSProp to be a good one to try first.
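E.g., something along these lines (again just a sketch; the hyperparameters shown are the kind of thing to tune, not recommendations):

```python
import pymc as pm
from pymc.variational.updates import rmsprop

with model:
    approx = pm.fit(
        n=100_000,
        method="advi",
        obj_optimizer=rmsprop(learning_rate=1e-3, rho=0.9, epsilon=1e-6),
    )
    idata = approx.sample(2_000)  # draw from the fitted approximation
```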