PyMC3 on AWS EC2

Hi

I’m trying to run a model on an AWS EC2 instance running Red Hat, and I keep getting the error “Bad initial energy: nan. The model might be misspecified.” The same model runs successfully on my local Ubuntu desktop. To simplify things, I also tried running this extremely simple model:

import pymc3 as pm

with pm.Model() as model:
    p = pm.Uniform('p', lower=0.0, upper=1.0)
    b = pm.Binomial('b', 100, p, observed=10)

with model:
    start = pm.find_MAP()
    trace = pm.sample(10000, start=start)

and I get the same behavior: it works on my Ubuntu desktop but fails on AWS.

The versions of NumPy (1.13.3), SciPy (1.0.0), Theano (0.9.0), and PyMC3 (3.2) are the same on both machines.

Taking a look at step_methods/hmc/nuts.py, it seems that at some point the call to self.potential.random() at line 177 returns [inf], leading to infinite energy. This only happens on the AWS machine.
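For what it’s worth, the log-probability at the default starting point can be checked with something like this sketch (using the simple model above):

import pymc3 as pm

with pm.Model() as model:
    p = pm.Uniform('p', lower=0.0, upper=1.0)
    b = pm.Binomial('b', 100, p, observed=10)

# total log-probability at the default starting point; a nan or -inf
# here would implicate the model itself rather than the sampler
print(model.logp(model.test_point))

# per-variable breakdown to locate a problematic term
for rv in model.basic_RVs:
    print(rv.name, rv.logp(model.test_point))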

I’m completely unsure of how to proceed here. Any help would be appreciated. Thank you.

Did you use a GPU? Also, try casting all the values to float:

with pm.Model() as model:
    p = pm.Uniform('p', lower=0.0, upper=1.0)
    b = pm.Binomial('b', 100.0, p, observed=10.0)

Thanks for the reply.

I didn’t use a GPU. Casting to float yields the same behavior: it runs on my desktop and fails with infinite energy on AWS.

What are the Theano float type and int type on your EC2 instance? With such a simple model, the only possibility I can think of is that p was rounded down to zero…
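You can check with something like:

import theano
print(theano.config.floatX)       # default float dtype, e.g. 'float64'
print(theano.config.device)       # 'cpu' or a gpu device
print(theano.config.cast_policy)  # rules for implicit casts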

float64 and float32 give the same error.

I realized my original (more complex) model also stopped working on my desktop after I upgraded PyMC3 to try to reproduce the EC2 behavior (the simple model described above still works). I’m not sure which version I was running before, but I had downloaded it a few months ago. I downgraded PyMC3 to 3.1 with pip install -I --no-cache-dir 'pymc3<3.2' --no-deps, and now both the simple model and the complex model work on both my laptop and EC2. This will work for my needs, but it’s probably worth investigating what’s going on.

Thanks for reporting back.
In your more complex model, does the error occur at the beginning of sampling, or a bit after sampling starts?

Hi,

I was getting the same error yesterday. I am trying out multinomial regression with continuous predictors. I think it has something to do with using NUTS on a discrete output, but I am not sure about this, so I’d love to hear from the pros. I switched to Metropolis steps and am getting mostly reasonable results.

I have an implementation of softmax regression with a multinomial observed variable, and it runs fine for me. How do you handle the continuous predictors? If you also apply a softmax to make p sum to one, you should try restricting one column to zero or setting all the priors to N(0, 1).
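Restricting the first column to zero would look roughly like this (a sketch; mu_rest is a hypothetical placeholder for the part of the linear predictor built from your alphas and betas):

import numpy as np
import theano.tensor as T

n_obs, n_labels = 5, 3
# hypothetical stand-in for the linear predictor of all labels except
# the first, shape (n_obs, n_labels - 1)
mu_rest = T.as_tensor_variable(np.random.randn(n_obs, n_labels - 1))

# pin the first column to zero: softmax is unchanged when the same
# constant is added to every entry of a row, so fixing one column
# removes that degeneracy
mu_full = T.concatenate([T.zeros((n_obs, 1)), mu_rest], axis=1)
probs = T.nnet.softmax(mu_full)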

This is what gave me the error.

import theano.tensor as T

# labels, input_var, target_var, n_labels, n_cols come from the data
with pm.Model() as multinomial_model:
    global_alpha_mean = pm.Normal("global_alpha_mean", mu=0, sd=100)
    global_alpha_sigma = pm.HalfCauchy("global_alpha_sigma", beta=10)

    global_beta_mean = pm.Normal("global_beta_mean", mu=0, sd=100)
    global_beta_sigma = pm.HalfCauchy("global_beta_sigma", beta=10)

    # centered parametrization
    local_alphas = pm.Normal("local_alphas", mu=global_alpha_mean, sd=global_alpha_sigma, shape=n_labels)
    local_betas = pm.Normal("local_betas", mu=global_beta_mean, sd=global_beta_sigma, shape=(n_labels, n_cols))

    mu = local_alphas[labels] + (local_betas[labels] * input_var).sum(axis=1)

    probs = T.nnet.softmax(mu)

    y = pm.Categorical('y', p=probs, observed=target_var)
    #y = pm.Multinomial('y', n=n_counts, p=probs, observed=target_var)

with multinomial_model:
    start = pm.find_MAP()
    step = pm.HamiltonianMC()
    trace = pm.sample(3000, step=step, start=start, tune=1000, njobs=4)

This is what I have running now. But the diagnostics aren’t looking too hot.

with pm.Model() as multinomial_model:
    global_alpha_mean = pm.Normal("global_alpha_mean", mu=0, sd=1)
    global_alpha_sigma = pm.HalfCauchy("global_alpha_sigma", beta=1)

    global_beta_mean = pm.Normal("global_beta_mean", mu=0, sd=1)
    global_beta_sigma = pm.HalfCauchy("global_beta_sigma", beta=1)

    # non-centered reparametrization
    alpha_offset = pm.Normal('a_offset', mu=0, sd=1, shape=n_labels)
    local_alphas = pm.Deterministic("local_alphas", global_alpha_mean + alpha_offset * global_alpha_sigma)

    # non-centered reparametrization
    beta_offset = pm.Normal('beta_offset', mu=0, sd=1, shape=(n_labels, n_cols))
    local_betas = pm.Deterministic("local_betas", global_beta_mean + beta_offset * global_beta_sigma)

    # centered alternative
    #local_alphas = pm.Normal("local_alphas", mu=global_alpha_mean, sd=global_alpha_sigma, shape=n_labels)
    #local_betas = pm.Normal("local_betas", mu=global_beta_mean, sd=global_beta_sigma, shape=(n_labels, n_cols))

    mu = local_alphas[labels] + (local_betas[labels] * input_var).sum(axis=1)

    probs = T.nnet.softmax(mu)
    y = pm.Categorical('y', p=probs, observed=target_var)
    #y = pm.Multinomial('y', n=n_counts, p=probs, observed=target_var)

with multinomial_model:
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(10000, step=step, start=start, tune=1000, njobs=4)[2000:10000:5]

Yeah, the softmax often makes the model unidentifiable, because adding the same constant to every component of mu does not change the probabilities anymore. In my problem, setting a Normal(0, 1) prior works well (I might even get rid of the HalfCauchy). Another solution is to restrict one of the columns of mu to zero.

The model you are running now seems fine. Did you try it with the defaults (just doing trace = pm.sample(1000, tune=1000, njobs=4), for example)?


Wow. Yeah, I ran the default sampler on the new model, and the convergence and Gelman-Rubin diagnostics are looking much better! Thanks!
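For reference, those diagnostics can be checked with something along these lines (assuming the multi-chain trace from above):

print(pm.gelman_rubin(trace))  # values near 1.0 suggest the chains converged
print(pm.effective_n(trace))   # effective sample size per variable
pm.traceplot(trace)            # visual check of mixing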


In both models the error occurs after about 100 NUTS steps.

In that case, it usually indicates there is some latent problem somewhere. It would be helpful if you could post the real model with some simulation code.

I’ll need to do some rewriting before sharing it here, but sure. However, can you think of any reason why it would work with PyMC3 3.1 and not with PyMC3 3.2? I think the main point here is this difference, which the extremely simple model above also shows on AWS: version 3.1 works while 3.2 does not.

Hard to tell, but we did move some of the computation from Theano to NumPy for NUTS. cc @aseyboldt