Sampling time is very long (minutes per sample)

Hi,

I am working on a problem to predict a continuous output variable, label (possible values: [0, infinity)), given a continuous input variable, feature (possible values: [0, infinity)). Based on previous MCMC work I’ve done with emcee (http://dfm.io/emcee/current/), I have found that the following relationship works reasonably well:

index = feature > threshold
predicted_label[index] = scale * (feature[index] - offset) ** exponent
predicted_label[~index] = 0

With scale ~ 0.43, offset ~ 31, exponent ~ 0.66, and threshold = 40.

I am now trying to apply the same model using PyMC3, but I am finding that the sampling process gets stuck for long periods of time. Here is sample data and my model:

sample_data.csv (769.1 KB)

sample_data = pd.read_csv(‘sample_data.csv’)
feature = sample_data[‘feature’].values
label = sample_data[‘label’].values

with pm.Model() as model:
    threshold = pm.Uniform('threshold', lower=5, upper=50)
    scaling = pm.HalfNormal('scaling', sd=0.3)
    exponent = pm.Normal('exponent', mu=0.7, sd=0.15)
    offset = pm.Uniform('offset', lower=5, upper=50)
    
    model_1 = scaling * (feature - offset) ** exponent
    model_2 = np.zeros(len(sample_data))
    condition = (feature < threshold) | (feature < offset)
    model_ = pm.math.switch(condition, model_2, model_1)
    
    obs = pm.Normal('obs', mu=model_, sd=9, observed=label)

with model:
    trace = pm.sample(500, n_init=50000)

I’m currently sitting on sample 37/500. Going from 36/500 to 37/500 took several minutes. Any ideas on what I’m doing wrong would be greatly appreciated!

Thanks,
Shane Bussmann

The reason this is slowing down is that NUTS decides for some reason that it needs to simulate very long trajectories for each sample, so that the samples are more or less independent.
I’m not sure, but by best guess is that in this case this is because the posterior density isn’t smooth in your model. You could try to rewrite the switch using a smooth approximation, something like in Frequently Asked Questions

Thanks for the response, @aseyboldt. I tried to implement the suggestion to replace the switch with a sigmoid. The problem I am getting now is that feature - offset can sometimes be negative, so (feature - offset) ** exponent can be nan, since exponent can take on values less than 1.

Maybe I need to find a different approach that moves offset outside the exponent?

Here is my model:

feature = sample_data['feature'].values
label = sample_data['label'].values

with pm.Model() as model:
    threshold = pm.Uniform('threshold', lower=5, upper=50)
    scaling = pm.HalfNormal('scaling', sd=0.3)
    exponent = pm.Normal('exponent', mu=0.7, sd=0.15)
    offset = pm.Uniform('offset', lower=5, upper=50)
    
    model_1 = scaling * (feature - offset) ** exponent
    model_2 = np.zeros(len(sample_data))
    
    # feature must be larger than both `threshold` and `offset` to apply the non-zero model
    weightA = tt.nnet.sigmoid(2 * (feature - threshold))
    weightB = tt.nnet.sigmoid(2 * (feature - offset))
    weight = weightA * weightB
    model_ = weight * model_1 + (1 - weight) * model_2
    
    obs = pm.Normal('obs', mu=model_, sd=9, observed=label)

with model:
    trace = pm.sample(500, n_init=70000)

Hm, I see the problem. Maybe it would actually be better to go back a step and think about the model a bit.

Just from looking at the data it looks like there might actually be a pretty linear relationship between log(feature) and log(label), but this seems to be hidden because of rounding, and the fact that is looks like all values for label that should be smaller than 3.6 are just 0. I think you are actually doing something like that, just assuming abolute errors instead of relative (which from a quick glance I don’t think is right).

pos = data.label > 0
plt.scatter(np.log(data.feature)[pos],
            np.log(data.label[pos]),
            marker='.')

So to me that looks like a reasonable model for this might involve censoring? (Basically, for some cases we don’t observed the actual value, but only that the value is smaller or larger than a certain value. This comes up a lot in survival analysis, eg in a cancer treatment study, where a patient might have died in a car crash, so you don’t know how long he would have lived otherwise.)
From the dataset alone it kind of looks like all observations with label==0 might just be censored at 3.6, so we don’t actually know the label value, but just that is smaller than 3.6. I know nothing about the dataset, so I might be completely off, but I’ll run with it for now :wink:

Maybe you could assume that the observations are distributed like this (at some point in the future I hope we get latex in here)

theta = intercept + feature_scale * log(feature)

P(log(label[i])=x) = 
    if x == -inf:
        normal_cdf(log(3.6), mu=theta[i], sd=sd)
    else:
        normal_pdf(x, mu=theta[i], sd=sd)

You could do that using something like this:

class CensoredNormal(pm.Distribution):
    def __init__(self, censoring_point, mu, sd):
        self._censoring_point = censoring_point
        self._mu = mu
        self._sd = sd

    def logp(self, value):
        # This distribution can only be used as an observed
        assert isinstance(value, np.ndarray)
        # stuff like 
        return tt.switch(
            value == -np.inf,
            something-involving-pm.math.erfc,
            pm.Normal.dist(mu=self._mu, sd=self._sd).logp(value)
        )

    # We need this for posterior predictive checks
    def random(self, shape):  # check required args
        mu, sd = draw_values([self._mu, self._sd])
        vals = sd * np.random.randn(shape) + mu
        return np.where(vals < np.log(3.6), 0, vals)

The model could then look something like this:

with model:
    intercept = pm.Flat('intercept')
    feature_sd = pm.HalfStudent('feature_sd', nu=3, sd=2.5)  # TODO think about args

    theta = intercept + feature_sd * np.log(data.feature)
    sd = pm.HalfStudentT('sd', nu=3, sd=2.5)
    CensoredLogNormal('y', censoring_point, mu=mu, sd=sd, observed=np.log(data.label))

You should then also do some predictive posterior checking. Get samples using pm.sample_ppc and compare the result to the actual values (you can also round the ppc samples the same way as the actual data). I’d especially look at the relationship between the residual y - data.label, and if the variance of that depends on mu.

This is only a very rough sketch of the code, I hope you get the idea. If not, don’t hesitate to ask.

1 Like

Thanks for the guidance @aseyboldt. Here is my attempt at filling in the blanks in your code.

class CensoredNormal(pm.Distribution):
    def __init__(self, censoring_point, mu, sd):
        self._censoring_point = censoring_point
        self._mu = mu
        self._sd = sd

    def logp(self, value):
        # This distribution can only be used as an observed
        assert isinstance(value, np.ndarray)
        # stuff like 
        return tt.switch(
            value == -np.inf,
            pm.math.erfc(mu=self._mu, sd=self._sd).logp(self.censoring_point),
            pm.Normal.dist(mu=self._mu, sd=self._sd).logp(value)
        )

    # We need this for posterior predictive checks
    def random(self, shape):  # check required args
        mu, sd = draw_values([self._mu, self._sd])
        vals = sd * np.random.randn(shape) + mu
        return np.where(vals < self.censoring_point, 0, vals)

And to build the model:

# should censoring_point actually be np.exp(3.6) ?
censoring_point = 3.6

with pm.Model() as model:
    intercept = pm.Flat('intercept')
    
    # Using HalfStudentT yields "AttributeError: 'bool' object has no attribute 'any'"
    # `feature_sd` here should be analogous to the exponent applied to `feature`
    feature_sd = pm.HalfNormal('feature_sd', sd=0.5)  # TODO think about args

    # not sure about putting in the offset here, but I think doing so most closely mimics my original approach
    offset = pm.Flat('offset')
    theta = intercept + feature_sd * np.log(feature - offset)

    # same problem with HalfStudentT here
    sd = pm.HalfNormal('sd', sd=9)

    CensoredNormal('y', censoring_point, mu=theta, sd=sd, observed=np.log(label)) 

I am running into an error that I don’t understand (besides the issue with HalfStudentT that I am skipping for now):

AttributeError: 'CensoredNormal' object has no attribute 'dtype'

Do I need to define a dtype attribute when defining the class CensoredNormal? If so, what would a sensible choice be here?

I haven’t tried running this, but the dtype and certain other arguments go up through superclasses. I think if you changed your initialization to

class CensoredNormal(pm.Distribution):
    def __init__(self, censoring_point, mu, sd, **kwargs):
        self._censoring_point = censoring_point
        self._mu = mu
        self._sd = sd
        super(CensoredNormal, self).__init__(**kwargs)

you’ll have better luck (note the last line there, that tells the super class to do its magic).

Thanks @colcarroll. I tried your suggestion, but I am getting an error message that says init requires 3 arguments, and I only gave it 1. Were you suggesting that I need to fill in **kwargs with appropriate keywords for my use case? If so, I’m not sure how to determine what the appropriate keywords should be. Here is the full error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-79f951066c5c> in <module>()
      7     theta = intercept + feature_sd * tt.log(feature - offset)
      8     sd = pm.HalfNormal('sd', sd=9)
----> 9     y = CensoredNormal('y', censoring_point, mu=theta, sd=sd, observed=np.log(label))

/Users/rbussman/anaconda/lib/python2.7/site-packages/pymc3/distributions/distribution.pyc in __new__(cls, name, *args, **kwargs)
     36                 raise TypeError("observed needs to be data but got: {}".format(type(data)))
     37             total_size = kwargs.pop('total_size', None)
---> 38             dist = cls.dist(*args, **kwargs)
     39             return model.Var(name, dist, data, total_size)
     40         else:

/Users/rbussman/anaconda/lib/python2.7/site-packages/pymc3/distributions/distribution.pyc in dist(cls, *args, **kwargs)
     47     def dist(cls, *args, **kwargs):
     48         dist = object.__new__(cls)
---> 49         dist.__init__(*args, **kwargs)
     50         return dist
     51 

<ipython-input-7-1c3e0ac2b7ed> in __init__(self, censoring_point, mu, sd, **kwargs)
      4         self._mu = mu
      5         self._sd = sd
----> 6         super(CensoredNormal, self).__init__(**kwargs)
      7 #     def __init__(self, censoring_point, mu, sd, **kwargs):
      8 #         self._censoring_point = censoring_point

TypeError: __init__() takes at least 3 arguments (1 given)

Shoot – I looked a little closer, and this is what you will need:

class CensoredNormal(pm.Distribution):
    def __init__(self, censoring_point, mu, sd, **kwargs):
        self._censoring_point = censoring_point
        self._mu = mu
        self._sd = sd
        kwargs.setdefault('shape', ())
        kwargs.setdefault('dtype', theano.config.floatX)            
        super(CensoredNormal, self).__init__(**kwargs)

(I’m using dict.setdefault which is a bit of a deep cut as far as python goes – that could also just be hardcoded). You have to dig through the source a little to find that Distribution requires those two arguments explicitly. I tried to actually get the code running, but was unsure of what your intention was with the erfc stuff, but the stack trace you’ll get there will probably be more informative for you than me!

This should do the trick:

class CensoredNormal(pm.Distribution):
    def __init__(self, censoring_point, mu, sd, **kwargs):
        self._censoring_point = censoring_point
        self._mu = mu
        self._sd = sd
        kwargs.setdefault('shape', ())
        kwargs.setdefault('dtype', theano.config.floatX)            
        super(CensoredNormal, self).__init__(**kwargs)

    def logp(self, value):
        return tt.switch(
            tt.eq(value, -np.inf),
            tt.log(tt.erfc(-(self._censoring_point - self._mu) / self._sd / np.sqrt(2)) / 2),
            pm.Normal.dist(mu=self._mu, sd=self._sd).logp(value)
        )

    def random(self, **kwargs): 
        vals = pm.Normal.dist(mu=self._mu, sd=self._sd).random(**kwargs)
        return np.where(vals < self._censoring_point, -np.inf, vals)


censoring_point = np.log(3.6)

with pm.Model() as model:
    intercept = pm.Flat('intercept')
    
    feature_sd = pm.HalfStudentT('feature_sd', sd=2.5, nu=3)

    # not sure about putting in the offset here, but I think doing so most closely mimics my original approach
    offset = pm.Flat('offset')
    theta = intercept + feature_sd * tt.log(data.feature.values - offset)

    sd = pm.HalfNormal('sd', sd=1)

    CensoredNormal('y', censoring_point, mu=theta, sd=sd, observed=np.log(data.label).values) 

I played around a bit with directly modeling the rounding to multiples to 3.6, but ran into a bad theano bug…

You can get rounded posterior predictive samples using something like this:

with model:
    ppc = pm.sample_ppc(trace)

vals_round = (np.exp(ppc['y']) / 3.6).astype('int')[-1] * 3.6

pos = data.label > 0
plt.scatter(np.log(data.feature)[pos],
            np.log(data.label[pos]),
            marker='.')

pos = vals_round > 0
plt.scatter(np.log(data.feature)[pos],
            np.log(vals_round[pos]),
            marker='.')
1 Like

This fell off my radar for a bit, but I’m coming back to it now. I tried @aseyboldt’s suggestion. When I use ADVI to initialize the sampler, I get an error message indicating a nan value encountered during optimization. In this case, it happened during iteration 39 (i.e., very early on but not immediate). Here’s the error message:

---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<ipython-input-26-cf2f886771e1> in <module>()
      1 with model:
----> 2     trace = pm.sample()

/Users/rbussman/anaconda/lib/python2.7/site-packages/pymc3/sampling.pyc in sample(draws, step, init, n_init, start, trace, chain, njobs, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, **kwargs)
    241         start_, step = init_nuts(init=init, njobs=njobs, n_init=n_init,
    242                                  model=model, random_seed=random_seed,
--> 243                                  progressbar=progressbar, **args)
    244         if start is None:
    245             start = start_

/Users/rbussman/anaconda/lib/python2.7/site-packages/pymc3/sampling.pyc in init_nuts(init, njobs, n_init, model, random_seed, progressbar, **kwargs)
    600             callbacks=cb,
    601             progressbar=progressbar,
--> 602             obj_optimizer=pm.adagrad_window
    603         )  # type: pm.MeanField
    604         start = approx.sample(draws=njobs)

/Users/rbussman/anaconda/lib/python2.7/site-packages/pymc3/variational/inference.pyc in fit(n, local_rv, method, model, random_seed, start, inf_kwargs, **kwargs)
    753                         'or Inference instance' %
    754                         set(_select.keys()))
--> 755     return inference.fit(n, **kwargs)

/Users/rbussman/anaconda/lib/python2.7/site-packages/pymc3/variational/inference.pyc in fit(self, n, score, callbacks, progressbar, **kwargs)
    124         progress = tqdm.trange(n, disable=not progressbar)
    125         if score:
--> 126             self._iterate_with_loss(n, step_func, progress, callbacks)
    127         else:
    128             self._iterate_without_loss(n, step_func, progress, callbacks)

/Users/rbussman/anaconda/lib/python2.7/site-packages/pymc3/variational/inference.pyc in _iterate_with_loss(self, n, step_func, progress, callbacks)
    165                     scores = scores[:i]
    166                     self.hist = np.concatenate([self.hist, scores])
--> 167                     raise FloatingPointError('NaN occurred in optimization.')
    168                 scores[i] = e
    169                 if i % 10 == 0:

FloatingPointError: NaN occurred in optimization. 

Meanwhile, I get reasonably good results using Metropolis sampling. I’d be interested to know if there is something I’m doing incorrectly that is causing ADVI to encounter nan’s, but I’m at a point now where I can get moving on this topic. Really appreciate the help!

The autocorrelation in the trace from Metropolis is quite high. I will suggest you upgrade to version 3.2 (the most recent release) and try the new default initialisation for NUTS.

1 Like

As @aseyboldt has said before, if NUTS doesn’t sample properly, usually metropolis will find it hard as well (even worse have wrong/deceiving results). The jitter+adapt_diag and advi+adapt_diag should help and also playing around with the advi parameters (e.g cost_grad_scale) has helped with my NaN issues before as well.

3 Likes

Why not trying the SMC … I know I always give the same answer :wink: . However, unfortunately you would need to turn of transformations in your RVs …

Thanks for the suggestion @junpenglao! After upgrading to 3.2 I am getting much better results with NUTS. Seems to be a really significant improvement with the new version!

2 Likes