I’m trying to run a model on an AWS instance running Red Hat, and I keep getting the error “Bad initial energy: nan. The model might be misspecified.” The same model runs successfully on my local Ubuntu desktop. For simplicity, I also tried running the extremely simple
import pymc3 as pm

with pm.Model() as model:
    p = pm.Uniform('p', lower=0.0, upper=1.0)
    b = pm.Binomial('b', 100, p, observed=10)

with model:
    start = pm.find_MAP(model=model)
    trace = pm.sample(10000, start=start)
and I get the same behavior: it works on my Ubuntu desktop and it fails on AWS.
The versions of NumPy (1.13.3), SciPy (1.0.0), Theano (0.9.0), and PyMC3 (3.2) are the same on both machines.
Taking a look at step_methods/hmc/nuts.py, it seems that at some point the call to self.potential.random() at line 177 returns [inf], leading to infinite energy. This only happens on the AWS machine.
I’m completely unsure of how to proceed here. Any help would be appreciated. Thank you.
What are the Theano float and int types on your EC2 instance? With such a simple model, the only possibility I can think of is that p was rounded down to zero…
I realized my original model (more complex) also stopped working on my desktop after I upgraded PyMC3 to try to reproduce the EC2 behavior (the simple model described above does still work). I’m not sure which version I was running before, but it had been a few months since I downloaded it. I downgraded PyMC3 to 3.1 with pip install -I --no-cache-dir 'pymc3<3.2' --no-deps, and now both the simple model and the complex model work on both my laptop and EC2. This will work for my needs, but it’s probably worth investigating what’s going on.
I was getting the same error yesterday. I am trying out multinomial regression with continuous predictors. I think it has something to do with using NUTS on a discrete output. But I am not sure about this, so I’d love to hear from the pros about it. I switched to Metropolis steps and am getting mostly reasonable results.
I have an implementation of softmax regression with a multinomial observed variable, and it runs fine for me. How do you deal with the continuous predictor? If you also apply a softmax to make p sum to one, you should try restricting one column to zero or setting all the priors to N(0, 1).
Yeah, the softmax often makes the model unidentifiable, because the scaling does not play a role anymore. In my problem, setting a Normal(0, 1) prior works well (I might even get rid of the HalfCauchy). Another solution is to restrict one of the columns of mu to zero.
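The shift-invariance behind this unidentifiability is easy to demonstrate with plain NumPy (a minimal sketch, independent of PyMC3): adding any constant to all of the softmax inputs leaves the output unchanged, which is exactly the degree of freedom that pinning one column to zero removes.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max())
    return e / e.sum()

mu = np.array([1.0, 2.0, 3.0])
shifted = mu + 5.0  # add the same constant to every component

# The probabilities are identical, so the sampler cannot
# distinguish mu from mu + c for any constant c
print(np.allclose(softmax(mu), softmax(shifted)))  # True
```

Fixing one component (say, the first) to zero picks a single representative from each equivalence class, which is why the model becomes identifiable again.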
The model you are running now seems fine. Did you try it with the defaults (for example, just trace = pm.sample(1000, tune=1000, njobs=4))?
In that case, it is usually an indication that there is some latent problem somewhere. It would be helpful if you posted the real model with some simulation code.
I’ll need to do some rewriting before sharing it here, but sure. However, can you think of any reason why it would work with PyMC3 3.1 and not PyMC3 3.2? I think the main point here is this difference, which is also observed with the extremely simple model above on AWS: version 3.1 works while 3.2 does not.