Geometric variable not being properly sampled


#4

Maybe you can try stacking it on a different dimension, but I cannot try it yet as I also need inital_pos["CT"] and n. Is n = len(obs)?


#5

Apologies for the incompleteness.
inital_pos = {"CT": pm.distributions.multivariate.MvNormal.dist(mu=np.array([255, 240]), cov=np.array([[15.**2, 0.], [0., 10.**2]]))}

n=2, n_samples=10000


#6

OK, so if I understand correctly, your data is a random walk in 2D, and at some point there is a change in mu?
In that case, shouldn't you use a DiscreteUniform(time_0, time_1) to model the change point instead of a Geometric distribution?
Better yet, use a Uniform(time_0, time_1) and tt.switch to construct the drift, roughly as sketched below.
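A minimal sketch of that idea (time_0, time_1, and the number of steps T are placeholders here, not values from the model above):

with pm.Model() as sketch_model:
    # continuous switch point somewhere between time_0 and time_1
    switchpoint = pm.Uniform('switchpoint', lower=time_0, upper=time_1)
    mu_start = pm.Normal('start_drift', mu=0., sd=10., shape=2)
    mu_hold = pm.Normal('hold_drift', mu=0., sd=10., shape=2)
    t = np.arange(T)[:, np.newaxis]                          # time index, shape (T, 1)
    # drift is mu_start before the switch point and mu_hold after it
    drift = tt.switch(t <= switchpoint, mu_start, mu_hold)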


#7

Yes, that is correct. I assume that cov is the same over time but that the drift will change at some point. However, since each of my samples is a different length, how will I be able to model the switching point when I don't know what the upper bound will be? I would like a general solution, and that's why I picked Geometric: even if I can't sample it well, I can at least infer p. If I were to create a DiscreteUniform for each sample, I think the sampler would still fail to sample properly.

Also, I have tried tt.switch but couldn't get it to work; have you had any success with it? And have you gotten tt.stack to work with what @fonnesbeck wrote?


#8

I see what you mean now. Well, it is still worth trying to model it with tt.switch, since you can then treat the switch point as a continuous variable, and ADVI and NUTS become available.
For example, this seems to work (but it is slow and needs further optimization):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# obs, n and inital_pos["CT"] are defined as in the earlier posts
with pm.Model() as position_model:
    eta = 1  # pm.Exponential('eta', lam=0.5, testval=0.5)
    sd_dist = pm.HalfCauchy.dist(beta=2.5)
    packed_chol = pm.LKJCholeskyCov('chol_cov', n=n, eta=eta, sd_dist=sd_dist)
    # compute the covariance matrix
    chol = pm.expand_packed_triangular(n, packed_chol, lower=True)
    cov = tt.dot(chol, chol.T)

    mu_start = pm.Normal('start_drift', mu=5., sd=10., shape=n)
    mu_hold = pm.Normal('hold_drift', mu=0., sd=10., shape=n)
    switchpoint = pm.Beta('switchpoint', alpha=1, beta=1)

    pos = []
    for i in range(len(obs)):
        px = obs[i]
        obs_n = len(px) - 1
        # drift is mu_start before the (normalized) switch point and mu_hold after
        drift = tt.switch(switchpoint >= np.linspace(0, 1, obs_n)[:, np.newaxis],
                          mu_start, mu_hold)
        pos.append(pm.MvGaussianRandomWalk('position_CT_rnd_%i' % i, mu=drift,
                                           cov=cov, init=inital_pos["CT"],
                                           observed=px))

    trace = pm.sample(1000)

The problem with using a Geometric distribution is that Metropolis becomes very inefficient, so no samples are accepted.
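For reference, the kind of Geometric switch-point parameterization being discussed would look roughly like this (a sketch only; the variable names and the per-walk length T are assumptions). The discrete switch time is what forces Metropolis here:

with pm.Model() as geom_model:
    p = pm.Beta('p', alpha=1., beta=1.)
    switch_t = pm.Geometric('switch_t', p=p)           # discrete, so Metropolis gets assigned
    mu_start = pm.Normal('start_drift', mu=5., sd=10., shape=n)
    mu_hold = pm.Normal('hold_drift', mu=0., sd=10., shape=n)
    t = np.arange(T)[:, np.newaxis]                    # time index of one walk
    drift = tt.switch(t < switch_t, mu_start, mu_hold)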


#9

Update: ADVI runs OK, NUTS not so much…

Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 31,678:  20%|█▉        | 39899/200000 [13:23<57:27, 46.44it/s]     
Convergence archived at 39900
Interrupted at 39,900 [19%]: Average Loss = 5.378e+05
  0%|          | 6/1500.0 [00:48<5:37:50, 13.57s/it]

#10

Thanks for the code (I see the mistake I was making with tt.switch: my condition wasn't T x 2 dimensional). A few questions (more as an aside):

  • You replaced LKJCorr with LKJCholeskyCov; is there a particular reason for that?
  • Is find_MAP still encouraged in practice? I notice a lot of code samples have stopped using find_MAP.
  • Is the NUTS sampler with ADVI always encouraged? I know that it only supports continuous variables.

#11

When debugging models like this, where NUTS gets stuck at the beginning, it can help to decrease max_treedepth (to maybe ~7). That limits the length of the transitions, so you get samples that you can look at to try to find out what the problem is. The samples will probably be quite a bit more correlated than usual, however.

with model:
    trace = pm.sample(nuts_kwargs={'max_treedepth': 7})

#12

About your questions:

  • LKJCholeskyCov should work much better in most situations. It solves the problem with LKJCorr, namely that it doesn't have support on $R^n$.
  • find_MAP is pretty much discouraged now in most situations. The MAP is not a representative point in high dimensions, and it can make life difficult for NUTS, as the gradients are small at the MAP.
  • Short answer: use NUTS when you can; it scales reasonably to higher dimensions. How to properly initialize it is a bit of an open question right now, but ADVI is probably the best we have at the moment (see the sketch below).
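A minimal sketch of that initialization in practice (assuming the pm.sample interface of this PyMC3 version, with its init and n_init arguments; the numbers are placeholders):

with model:
    trace = pm.sample(1000, init='advi', n_init=200000)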

#13

  • LKJCholeskyCov is encouraged because it is numerically more stable (thanks to @aseyboldt) and faster to sample (not in this case, as MvGaussianRandomWalk currently only accepts cov).
  • find_MAP is really, really bad in high-dimensional cases because the MAP is usually not in the typical set (Betancourt's paper is the best reference: https://arxiv.org/abs/1701.02434).
  • ADVI is awesome for initializing NUTS, because it gives a very good scaling matrix. And in general, the estimate from ADVI is already very close to the posterior from MCMC (you can even get a better ADVI result using FullRankADVI; a rough sketch follows below).
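For what it's worth, a minimal sketch of using FullRankADVI through the pm.fit interface (the iteration and draw counts are arbitrary placeholders):

with model:
    approx = pm.fit(n=50000, method='fullrank_advi')
    trace = approx.sample(1000)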

#14

@aseyboldt you beat me to it! :joy:


#15

For ADVI, I'm iterating at a much slower pace than your results suggest. Is there a way to set a higher tolerance so that ADVI terminates much earlier? The advi code here suggests we can use tol_obj, but I'm not sure how to access that value when it seems to be using the more general ADVI/Inference wrapper class.
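Roughly what I have in mind is something like this (a sketch only, assuming the callbacks argument of pm.fit and the CheckParametersConvergence callback are available in this version; the tolerance and iteration count are arbitrary):

with position_model:
    approx = pm.fit(
        n=100000,
        method='advi',
        # stop early once the parameters stop changing by more than the tolerance
        callbacks=[pm.callbacks.CheckParametersConvergence(tolerance=1e-2)],
    )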


#16

My suggestion is to try reparameterizing first; right now the model is not very efficient. I think writing a custom logp instead of appending to a list would help. Also, starting with fewer observations (e.g., for i in range(3):) might help.


#17

Hi @kpei, did you make any progress on this? It has recently come to our attention that using pm.dist_math.switch might break the gradient. You can see this post for a case study, and here for some recommendations/workarounds (the sigmoid idea is sketched roughly below).
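For context, the sigmoid-style workaround amounts to replacing the hard tt.switch with a smooth weight, so the gradient with respect to the switch point is no longer zero almost everywhere. A rough sketch (the sharpness k and the time grid are assumptions, not part of the model above):

# smooth alternative to tt.switch(switchpoint >= t, mu_start, mu_hold)
t = np.linspace(0, 1, obs_n)[:, np.newaxis]          # normalized time, shape (obs_n, 1)
k = 50.                                              # sharpness of the transition (assumed)
weight = tt.nnet.sigmoid(k * (switchpoint - t))      # ~1 before the switch, ~0 after
drift = weight * mu_start + (1. - weight) * mu_hold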


#18

I am currently redoing the code with a custom logp. It seems like I may have to use a non-Theano op to carry out the logp calculation (which may prevent me from using NUTS/ADVI). I will try the sigmoid workaround and see if it helps.


#19

I seem to be stuck using Theano's as_op to define the likelihood. Below is my code; note that my input is now an (N, T, 2) tensor, where T is some arbitrary value I set. If the length of a sample is less than T, it is padded with np.nan. The sample can be downloaded here.

@theano.compile.ops.as_op(itypes=[tt.ftensor3, tt.fmatrix, tt.fmatrix, tt.fvector, tt.fmatrix], otypes=[tt.fscalar])
def logp(x, mu, cov, init_mu, init_cov):
    ll = tt.sum(pm.MvNormal.dist(mu=init_mu, cov=init_cov).logp(x[:,0]))
    for i in range(x.shape[0]):
        v = x[i,:]
        v = v[~np.isnan(v).any(axis=1)]
        innov_like = pm.MvNormal.dist(mu=v[:-1] + mu[v.shape[0]-1], cov=cov).logp(v[1:])
        ll += tt.sum(innov_like)
    return ll

Since then, I have added an additional variable to iterate over (the player). Assuming for now that we have one player, the data is still the same as before.

        px = np.array(obs[player]['CT'])
        len_player = len(px)
        print 'Adding %i Rounds for %s' % (len_player, player)
        drift = tt.switch(switchpoint[k] >= np.linspace(0,1,cf-1)[:, np.newaxis], mu_start[k], mu_hold[k])
        loglike = pm.DensityDist('ll', lambda x: logp(x, drift, cov, inital_pos["CT"].mu, inital_pos["CT"].cov), observed=px)

        db = pm.backends.Text('pos_dump_%s'  %player)
        pm.sample(n_samples, trace = db)

I get the following error: AttributeError: 'scratchpad' object has no attribute 'test_value'. I know from this issue that as_op should hold only non-Theano functions. However, even if I use a trivial logp like def logp(x, mu, ...): return 0.0, I still get the same error.

Full log:

Adding 96 Rounds for rush
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-128-e423220c249f> in <module>()
     28     theano.config.compute_test_value = 'off'
     29     #step = pm.Metropolis()
---> 30     pm.sample(n_samples, trace = db)

C:\Anaconda2\lib\site-packages\pymc3\sampling.pyc in sample(draws, step, init, n_init, start, trace, chain, njobs, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, **kwargs)
    241         start_, step = init_nuts(init=init, njobs=njobs, n_init=n_init,
    242                                  model=model, random_seed=random_seed,
--> 243                                  progressbar=progressbar, **args)
    244         if start is None:
    245             start = start_

C:\Anaconda2\lib\site-packages\pymc3\sampling.pyc in init_nuts(init, njobs, n_init, model, random_seed, progressbar, **kwargs)
    598             callbacks=cb,
    599             progressbar=progressbar,
--> 600             obj_optimizer=pm.adagrad_window
    601         )  # type: pm.MeanField
    602         start = approx.sample(draws=njobs)

C:\Anaconda2\lib\site-packages\pymc3\variational\inference.pyc in fit(n, local_rv, method, model, random_seed, start, inf_kwargs, **kwargs)
    741                 local_rv=local_rv, model=model, random_seed=random_seed,
    742                 start=start,
--> 743                 **inf_kwargs
    744             )
    745         except KeyError:

C:\Anaconda2\lib\site-packages\pymc3\variational\inference.pyc in __init__(self, local_rv, model, cost_part_grad_scale, scale_cost_to_minibatch, random_seed, start)
    363             cost_part_grad_scale=cost_part_grad_scale,
    364             scale_cost_to_minibatch=scale_cost_to_minibatch,
--> 365             random_seed=random_seed, start=start)
    366 
    367     @classmethod

C:\Anaconda2\lib\site-packages\pymc3\variational\inference.pyc in __init__(self, op, approx, tf, local_rv, model, op_kwargs, **kwargs)
     55             approx = approx(
     56                 local_rv=local_rv,
---> 57                 model=model, **kwargs)
     58         elif isinstance(approx, Approximation):    # pragma: no cover
     59             pass

C:\Anaconda2\lib\site-packages\pymc3\variational\opvi.pyc in __init__(self, local_rv, model, cost_part_grad_scale, scale_cost_to_minibatch, random_seed, **kwargs)
    576         self.lbij = DictToArrayBijection(ArrayOrdering(self.local_vars), {})
    577         self.flat_view = model.flatten(
--> 578             vars=self.local_vars + self.global_vars
    579         )
    580         self._setup(**kwargs)

C:\Anaconda2\lib\site-packages\pymc3\model.pyc in flatten(self, vars)
    697         order = ArrayOrdering(vars)
    698         inputvar = tt.vector('flat_view', dtype=theano.config.floatX)
--> 699         inputvar.tag.test_value = flatten_list(vars).tag.test_value
    700         replacements = {self.named_vars[name]: inputvar[slc].reshape(shape).astype(dtype)
    701                         for name, slc, shape, dtype in order.vmap}

AttributeError: 'scratchpad' object has no attribute 'test_value'

#20

I don't think you need the as_op decorator - did you try removing it?


#21

Yes, however the following code would not work:

    for i in range(x.shape[0]):
        v = x[i,:]
        v = v[~np.isnan(v).any(axis=1)]

as these are numpy procedures, not Theano ones. Do you know of any way to efficiently get the likelihood for the values that are present but return 0 for the NaNs? That would handle the different vector lengths and also get around the list implementation.
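Something along these lines is roughly what I have in mind: precompute a 0/1 mask outside the graph and zero out the log-likelihood contributions of the padded steps instead of slicing inside the graph. A sketch only; px is the padded (N, T, 2) array from above, mu is the (T-1, 2) drift, and the other names are assumptions:

mask = (~np.isnan(px).any(axis=-1)).astype(theano.config.floatX)   # (N, T), 1 where a real step exists
x_filled = np.nan_to_num(px)                                       # padded NaNs become 0, masked out below

def masked_logp(x, mu, cov):
    ll = 0.
    for i in range(x.shape[0]):                                     # plain Python loop over samples
        v = x[i]                                                    # (T, 2) numpy array, padding included
        innov_like = pm.MvNormal.dist(mu=v[:-1] + mu, cov=cov).logp(v[1:])  # (T-1,) symbolic
        ll += tt.sum(mask[i, 1:] * innov_like)                      # drop contributions from padded steps
    return ll

# ll = pm.Potential('likelihood', masked_logp(x_filled, drift, cov))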

Edit: Getting the same result using Potential: ll = pm.Potential('likelihood', logp(px, drift, cov, inital_pos["CT"].mu, inital_pos["CT"].cov)). The logp function no longer has the as_op decorator, but I still get the same error.

Edit: After clearing the cache, the Potential method seems to be sampling, although I sometimes get a type error (this has happened throughout a lot of my computations): TypeError: expected type_num 11 (NPY_FLOAT32) got 12. I am using the CUDA GPU to train it.

Edit v4: It seems that removing floatx=float from my .theanorc fixed it. It is now sampling at around ~16 it/s with Metropolis using the Potential method; I will try the pm.switch workaround plus NUTS/ADVI and see how they work.


#22

Thanks for the update.

Just a quick heads-up: training on the GPU has not been tested extensively, and there are quite a few known issues with GPU use (e.g., here). For now, try to make sure your model works on the CPU first.

Keep me posted!


#23

Just wanted to update that my original code samples perfectly on the CPU (and even faster on a single core!) :rage:. Nevertheless, I made a few switches based on your original suggestion, so I definitely appreciate it.