Geometric variable not being properly sampled

GitHub issue #2303 by kpei:

I am new to pymc3 and have scanned through the list of outstanding/closed issues without finding an exact solution, so I am posting here. If there is an existing solution or write-up, I'd appreciate it if you could link me to it.

The Problem

I am trying to sample from an MvGaussianRandomWalk with a switching component, where after each time period the drift changes permanently with probability p. The natural conclusion I came to was to use a Geometric variable whose value represents the switching point. However, using the Metropolis sampler on the newest theano (and the stable release of pymc3), I find that this variable is not being properly sampled when I increase the number of observations. See the picture below.

The Code

with pm.Model() as position_model:

    eta = pm.distributions.continuous.Exponential('eta', lam=0.5, testval=0.5)
    sd = pm.distributions.continuous.Exponential('sd', lam=0.5, shape=n, testval=0.5)
    C_triu = pm.distributions.multivariate.LKJCorr('C_triu', eta=eta, n=n, testval=0.5)
    C = pm.Deterministic('C', tt.fill_diagonal(C_triu[np.zeros((n, n), dtype=np.int32)], 1.))
    sigma_diag = pm.Deterministic('sigma_mat', tt.nlinalg.diag(sd))
    sigma = pm.Deterministic('cov', tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag))

    mu_start = pm.distributions.continuous.Normal('start_drift', mu=5., shape=n, sd=10.)
    mu_hold = pm.distributions.continuous.Normal('hold_drift', mu=0., sd=10., shape=n)
    p = pm.distributions.continuous.Beta('p_hold', alpha=1, beta=1)
    t_hold = pm.Geometric('time_until_hold', p=p, dtype='int64')
    
    pos = []
    for i in range(len(obs['CT'])):
        px = obs['CT'][i]
        obs_n = len(px)
        tth = pm.math.minimum(t_hold, obs_n-1)
        ht = obs_n-tth-1
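        # the drift covers the obs_n-1 innovations: tth rows of start_drift followed by ht rows of hold_drift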
        drift = pm.math.concatenate([tt.tile(mu_start, (tth,1)), tt.tile(mu_hold, (ht,1))])
        pos.append(pm.distributions.timeseries.MvGaussianRandomWalk('position_CT_rnd_%i' % i, mu=drift, 
                                                            cov=sigma, init=inital_pos["CT"],
                                                            observed=px))
    print 'Starting PyMC with %i samples' % len(pos) # Starting PyMC with 94 samples
    start = pm.find_MAP(vars=position_model.cont_vars)
    print start
    step = pm.Metropolis()
    trace = pm.sample(n_samples, step=step, start=start, tune=2000)

I believe the problem lies within this piece:

        obs_n = len(px)
        tth = pm.math.minimum(t_hold, obs_n-1)
        ht = obs_n-tth-1
        drift = pm.math.concatenate([tt.tile(mu_start, (tth,1)), tt.tile(mu_hold, (ht,1))])

Since each observation (a T×2 matrix) differs in size, restricting the Geometric variable with a minimum and maximum can produce intractable results. With a small number of samples (e.g. range(5)), the Geometric variable seems to be sampled well and produces different values each iteration, but with the full 94 samples it seems to produce the same output every time. Also, let me know if my drift specification is correct.

Things I’ve tried:

  1. Using different samplers: NUTS (very slow, < 1 it/s), Metropolis (currently the only one working), and HamiltonianMC (gives TypeError: expected type_num 11 (NPY_FLOAT32) but got 12).
  2. Using pm.math.switch for the drift, but it throws a dimension mismatch error (my mu vectors are length 2).
  3. Using Bound for the Geometric variable (throws errors about negative array lengths).

Could you please also provide the data px for this model?

Also, a small tip: you don't need to write pm.distributions.continuous.Exponential(...); you can just use pm.Exponential(...).

And as @fonnesbeck mentioned on GitHub:
It should help to tt.stack your list of random walks (where tt is theano.tensor). You end up with a huge graph otherwise.

Here is the npy file for the variable obs['CT']. The admins have disabled uploading here.

Also, I will try tt.stack for the multivariate random walks, thanks for the advice.

edit: It seems I cannot use tt.stack, as my random walks have different lengths in observed. I get the error ValueError: all the input array dimensions except for the concatenation axis must match exactly, which does not occur when the observed arrays I stack are all the same length.

My code: pos = tt.stack(pos), placed after the list is built.

Maybe you can try stacking it along a different dimension, but I cannot try it yet, as I also need inital_pos["CT"] and n. Is n = len(obs)?

Apologies for the incompleteness.
inital_pos = {"CT": pm.distributions.multivariate.MvNormal.dist(mu=np.array([255, 240]), cov=np.array([[15.**2, 0.], [0., 10.**2]]))}

n=2, n_samples=10000

OK, so if I understand correctly, your data is a random walk in 2D, and at some point there is a change in mu?
In that case, shouldn't you use a DiscreteUniform(time_0, time_1) to model the change point instead of a Geometric distribution?
Better yet, use a Uniform(time_0, time_1) and tt.switch to construct the drift.


Yes, that is correct. I assume that the covariance is the same over time but the drift changes at some point. However, since each of my samples has a different length, how will I be able to model the switching point when I don't know what the upper bound will be? I would like a general solution, and that's why I picked the Geometric: even if I can't sample it well, I can at least infer p. If I were to create a DiscreteUniform for each sample, I think the sampler would still fail to sample properly.

Also, I have tried tt.switch but couldn't get it to work; have you had any success with it? And have you gotten tt.stack to work, as @fonnesbeck suggested?

I see what you mean now. Well, it is still worth trying to model it with tt.switch, as you can then treat the switch point as a continuous variable, and ADVI and NUTS become available in that case.
For example, this seems to work (but it is slow and needs further optimization):

with pm.Model() as position_model:
    eta = 1 #pm.Exponential('eta', lam=0.5, testval=0.5)
    sd_dist = pm.HalfCauchy.dist(beta=2.5)
    packed_chol = pm.LKJCholeskyCov('chol_cov', n=n, eta=eta, sd_dist=sd_dist)
    # compute the covariance matrix
    chol = pm.expand_packed_triangular(n, packed_chol, lower=True)
    cov = tt.dot(chol, chol.T)
    
    mu_start = pm.Normal('start_drift', mu=5., sd=10., shape=n)
    mu_hold = pm.Normal('hold_drift', mu=0., sd=10., shape=n)
    switchpoint = pm.Beta('switchpoint', alpha=1, beta=1)
    
    pos = []
    for i in range(len(obs)):
        px = obs[i]
        obs_n = len(px)-1
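        # switchpoint is a scalar in (0, 1); comparing it with an (obs_n, 1) grid gives a column of
        # booleans that broadcasts against the length-2 mu vectors, so drift has shape (obs_n, 2)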
        drift = tt.switch(switchpoint >= np.linspace(0,1,obs_n)[:, np.newaxis], mu_start, mu_hold)
        pos.append(pm.MvGaussianRandomWalk('position_CT_rnd_%i' % i, mu=drift, 
                                                            cov=cov, init=inital_pos["CT"],
                                                            observed=px))

    trace = pm.sample(1e3)

The problem with using a Geometric distribution is that Metropolis becomes very inefficient, so no proposals are accepted.
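A quick way to see this is to count how many distinct values the discrete variable actually visited (a minimal diagnostic sketch, assuming the trace and the time_until_hold name from the model above):

import numpy as np

# if Metropolis never accepts a proposal for the Geometric variable,
# the trace contains (almost) only its starting value
vals = trace['time_until_hold']
print('%i distinct values in %i draws' % (len(np.unique(vals)), len(vals)))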


Update: ADVI runs OK, NUTS not so much…

Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 31,678:  20%|█▉        | 39899/200000 [13:23<57:27, 46.44it/s]     
Convergence archived at 39900
Interrupted at 39,900 [19%]: Average Loss = 5.378e+05
  0%|          | 6/1500.0 [00:48<5:37:50, 13.57s/it]

Thanks for the code (I see the mistake I was making with tt.switch: my condition wasn't of dimension T x 2). A few questions (more as an aside):

  • You replaced LKJCorr with LKJCholeskyCov; is there a particular reason for that?
  • Is find_MAP still encouraged in practice? I notice a lot of code samples have stopped using find_MAP.
  • Is the NUTS sampler with ADVI initialization always encouraged? I know that it only supports continuous variables.

When debugging models like this, where NUTS gets stuck in the beginning, it can help to decrease max_treedepth (to maybe ~7). That limits the length of the transitions, so you get samples that you can look at to try to find out what the problem is. The samples will probably be quite a bit more correlated than usual, however.

with model:
    trace = pm.sample(nuts_kwargs={'max_treedepth': 7})

About your questions:

  • LKJCholeskyCov should work much better in most situations. It solves the problem that LKJCorr does not have support on $\mathbb{R}^n$.
  • find_MAP is pretty much discouraged now in most situations. The MAP is not a representative point in high dimensions anymore, and it can make life difficult for NUTS, as the gradients are small at the MAP.
  • Short answer: use NUTS when you can; it scales reasonably to higher dimensions. How to properly initialize it is a bit of an open question right now, but ADVI is probably the best we have at the moment.
  • LKJCholeskyCov is encouraged because it is numerically more stable (thanks to @aseyboldt) and faster to sample (though not in this case, as MvGaussianRandomWalk currently only accepts cov).
  • find_MAP is really bad in high-dimensional cases because the MAP is usually not in the typical set (Betancourt's paper is the best reference: https://arxiv.org/abs/1701.02434).
  • ADVI is great for initializing NUTS because it gives a very good scaling matrix. And in general, the estimate from ADVI is already very close to the posterior from MCMC (you can even get a better ADVI result using FullRankADVI; a minimal sketch is below).
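For the last point, a minimal sketch of running (FullRank)ADVI directly and sampling from the approximation (the iteration count is just an illustration):

with position_model:
    approx = pm.fit(n=50000, method='fullrank_advi')  # or method='advi' for the mean-field version
    advi_trace = approx.sample(1000)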

@aseyboldt you beat me to it! :joy:


For ADVI, I'm iterating at a much slower pace than your results suggest. Is there a way to set a higher tolerance so that ADVI terminates much earlier? The advi code here suggests we can use tol_obj, but I'm not sure how to access that value since it seems to go through the more general ADVI/Inference wrapper class.

My suggestion is to try reparameterization first; right now the model is not very efficient. I think writing a custom logp instead of appending to a list would help. Also, starting with fewer observations (e.g., for i in range(3):) might help.


Hi @kpei, did you make any progress on this? It recently came to our attention that using pm.dist_math.switch might break the gradient. You can see this post for a case study, and here for some recommendations/workarounds.

I am currently redoing the code with a custom logp. It seems like I may have to use a non-theano op to carry out the logp calculation (which may prevent me from using NUTS/ADVI). I will try the sigmoid workaround to see if it helps.
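For reference, a minimal sketch of the sigmoid idea, replacing the hard tt.switch with a smooth transition (switchpoint and obs_n are the names from the model above; the sharpness constant k is purely illustrative):

grid = np.linspace(0, 1, obs_n)[:, np.newaxis]       # same grid as in the tt.switch version
k = 50.                                              # larger k -> closer to a hard switch
weight = tt.nnet.sigmoid(k * (switchpoint - grid))   # ~1 before the switch point, ~0 after
drift = weight * mu_start + (1. - weight) * mu_hold  # smooth interpolation keeps the gradient usable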

I seem to be stuck using the theano as_op to define the likelihood. Below is my code; note that my input is now an (N, T, 2) tensor, where T is an arbitrary value I set. If the length of a sample is less than T, it is padded with np.nan. The sample can be downloaded here.
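For reference, the padding step might look roughly like this (a sketch, assuming obs['CT'] is still the list of variable-length (T_i, 2) arrays):

T = max(len(px) for px in obs['CT'])                 # pad every round up to the longest one
padded = np.full((len(obs['CT']), T, 2), np.nan)
for i, px in enumerate(obs['CT']):
    padded[i, :len(px)] = px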

@theano.compile.ops.as_op(itypes=[tt.ftensor3, tt.fmatrix, tt.fmatrix, tt.fvector, tt.fmatrix], otypes=[tt.fscalar])
def logp(x, mu, cov, init_mu, init_cov):
    ll = tt.sum(pm.MvNormal.dist(mu=init_mu, cov=init_cov).logp(x[:,0]))
    for i in range(x.shape[0]):
        v = x[i,:]
        v = v[~np.isnan(v).any(axis=1)]
        innov_like = pm.MvNormal.dist(mu=v[:-1] + mu[v.shape[0]-1], cov=cov).logp(v[1:])
        ll += tt.sum(innov_like)
    return ll

Since then, I have added an additional variable to iterate over (the player). Assuming for now that we have one player, the data is still the same as before.

        px = np.array(obs[player]['CT'])
        len_player = len(px)
        print 'Adding %i Rounds for %s' % (len_player, player)
        drift = tt.switch(switchpoint[k] >= np.linspace(0,1,cf-1)[:, np.newaxis], mu_start[k], mu_hold[k])
        loglike = pm.DensityDist('ll', lambda x: logp(x, drift, cov, inital_pos["CT"].mu, inital_pos["CT"].cov), observed=px)

        db = pm.backends.Text('pos_dump_%s'  %player)
        pm.sample(n_samples, trace = db)

I get the following error: AttributeError: 'scratchpad' object has no attribute 'test_value'. I know from this issue that as_op should wrap only non-theano functions. However, even if I use a trivial logp like def logp(x, mu, ...): return 0.0, I still get the same error.

full log

Adding 96 Rounds for rush
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-128-e423220c249f> in <module>()
     28     theano.config.compute_test_value = 'off'
     29     #step = pm.Metropolis()
---> 30     pm.sample(n_samples, trace = db)

C:\Anaconda2\lib\site-packages\pymc3\sampling.pyc in sample(draws, step, init, n_init, start, trace, chain, njobs, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, **kwargs)
    241         start_, step = init_nuts(init=init, njobs=njobs, n_init=n_init,
    242                                  model=model, random_seed=random_seed,
--> 243                                  progressbar=progressbar, **args)
    244         if start is None:
    245             start = start_

C:\Anaconda2\lib\site-packages\pymc3\sampling.pyc in init_nuts(init, njobs, n_init, model, random_seed, progressbar, **kwargs)
    598             callbacks=cb,
    599             progressbar=progressbar,
--> 600             obj_optimizer=pm.adagrad_window
    601         )  # type: pm.MeanField
    602         start = approx.sample(draws=njobs)

C:\Anaconda2\lib\site-packages\pymc3\variational\inference.pyc in fit(n, local_rv, method, model, random_seed, start, inf_kwargs, **kwargs)
    741                 local_rv=local_rv, model=model, random_seed=random_seed,
    742                 start=start,
--> 743                 **inf_kwargs
    744             )
    745         except KeyError:

C:\Anaconda2\lib\site-packages\pymc3\variational\inference.pyc in __init__(self, local_rv, model, cost_part_grad_scale, scale_cost_to_minibatch, random_seed, start)
    363             cost_part_grad_scale=cost_part_grad_scale,
    364             scale_cost_to_minibatch=scale_cost_to_minibatch,
--> 365             random_seed=random_seed, start=start)
    366 
    367     @classmethod

C:\Anaconda2\lib\site-packages\pymc3\variational\inference.pyc in __init__(self, op, approx, tf, local_rv, model, op_kwargs, **kwargs)
     55             approx = approx(
     56                 local_rv=local_rv,
---> 57                 model=model, **kwargs)
     58         elif isinstance(approx, Approximation):    # pragma: no cover
     59             pass

C:\Anaconda2\lib\site-packages\pymc3\variational\opvi.pyc in __init__(self, local_rv, model, cost_part_grad_scale, scale_cost_to_minibatch, random_seed, **kwargs)
    576         self.lbij = DictToArrayBijection(ArrayOrdering(self.local_vars), {})
    577         self.flat_view = model.flatten(
--> 578             vars=self.local_vars + self.global_vars
    579         )
    580         self._setup(**kwargs)

C:\Anaconda2\lib\site-packages\pymc3\model.pyc in flatten(self, vars)
    697         order = ArrayOrdering(vars)
    698         inputvar = tt.vector('flat_view', dtype=theano.config.floatX)
--> 699         inputvar.tag.test_value = flatten_list(vars).tag.test_value
    700         replacements = {self.named_vars[name]: inputvar[slc].reshape(shape).astype(dtype)
    701                         for name, slc, shape, dtype in order.vmap}

AttributeError: 'scratchpad' object has no attribute 'test_value'

I don't think you need the as_op decorator; did you try removing it?
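For what it's worth, here is a minimal sketch of a DensityDist whose logp is built from plain theano expressions rather than an as_op, so the gradient (and therefore NUTS/ADVI) stays available. The toy data and names are purely illustrative:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

data = np.random.randn(100)

with pm.Model() as toy_model:
    mu = pm.Normal('mu', mu=0., sd=10.)
    sd = pm.HalfNormal('sd', sd=1.)

    def logp(x):
        # an ordinary theano graph, so pymc3 can differentiate through it
        return tt.sum(pm.Normal.dist(mu=mu, sd=sd).logp(x))

    ll = pm.DensityDist('ll', logp, observed=data)
    trace = pm.sample(1000)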