Maybe you can try stacking it on a different dimension, but I can not try yet as I also need inital_pos["CT"]
and n
. Is n = len(obs)
?
Apologies on the incompleteness.
inital_pos = {"CT": pm.distributions.multivariate.MvNormal.dist(mu=np.array([255, 240]), cov=np.array([[15.**2, 0.], [0., 10.**2]]))}
n=2, n_samples=10000
OK, so if I understand correctly, your data is A Random walk in 2D, and that some point there is a change in mu
?
In that case shouldn’t use a DiscreteUniform(time_0, time_1) to model the change point instead of a Geometric distribution?
Better yet is to use a Uniform(time_0, time_1) and use tt.switch to construct drift
Yes, that is correct. I assume that cov is same over time but the drift will at some point change. However, since each of my sample is different length, how will i be able to model the switching point when i dont know what the upper bound with be? I would like a general soln and thats why i picked geometric because even if i cant sample it well, I can at least infer P. If i were to create a discreteuniform for each sample, i think the sampler will still fail to sample properly
Also I have tried tt.switch but couldnt get it to work, have you had any success with it? Also have you gotten the tt.stack to work with what @fonnesbeck wrote?
I see what you mean now. Well, it is still worth to try modelling it with tt.switch
as you can model the switch point as a continuous variable, ADVI and NUTS would be available in that case.
For example, this seems to work (but slow, need to further optimized):
with pm.Model() as position_model:
eta = 1 #pm.Exponential('eta', lam=0.5, testval=0.5)
sd_dist = pm.HalfCauchy.dist(beta=2.5)
packed_chol = pm.LKJCholeskyCov('chol_cov', n=n, eta=eta, sd_dist=sd_dist)
# compute the covariance matrix
chol = pm.expand_packed_triangular(n, packed_chol, lower=True)
cov = tt.dot(chol, chol.T)
mu_start = pm.Normal('start_drift', mu=5., sd=10., shape=n)
mu_hold = pm.Normal('hold_drift', mu=0., sd=10., shape=n)
switchpoint = pm.Beta('switchpoint', alpha=1, beta=1)
pos = []
for i in range(len(obs)):
px = obs[i]
obs_n = len(px)1
drift = tt.switch(switchpoint >= np.linspace(0,1,obs_n)[:, np.newaxis], mu_start, mu_hold)
pos.append(pm.MvGaussianRandomWalk('position_CT_rnd_%i' % i, mu=drift,
cov=cov, init=inital_pos["CT"],
observed=px))
trace = pm.sample(1e3)
The problem of using Geometric distribution is that Metropolis becomes very inefficient, so no sample is accepted.
Update: ADVI runs OK, NUTS not so much…
Autoassigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 31,678: 20%█▉  39899/200000 [13:23<57:27, 46.44it/s]
Convergence archived at 39900
Interrupted at 39,900 [19%]: Average Loss = 5.378e+05
0%  6/1500.0 [00:48<5:37:50, 13.57s/it]
Thanks for the code (I see that I was making the mistake with tt.switch where my condition wasn’t in T x 2 dim). A few questions (more as an aside):
 You replaced LKJCorr with LKJCholeskyCov, is there a partricular reason for that?
 Is find_MAP still encouraged in practice? i notice a lot of code samples stopped using find_MAP
 Is NUTS sampler with ADVI always encouraged? I know that it only supports continuous
When debugging models like this, where nuts gets stuck in the beginning, it can help to decrease max_treedepth
(to maybe ~7). That will limit the lengths of the transitions, so you get samples that you can look at and try to find out what the problem is. The samples will probably be quite a bit more correlated than usual, however.
with model:
trace = pm.sample(nuts_kwargs={'max_treedepth': 7})
About your questions:

LKJCholeskyCov
should work much better in most situations. It solves the problem withLKJCorr
, that it doesn’t have support on $R^n$. 
find_MAP
is pretty much discouraged now in most situations. The MAP is not a representative point in high dimensions anymore, and it can make life difficult for NUTS, as the gradients are small at the MAP.  Short answer: Use nuts when you can, it scales reasonably in higher dimensions. How to properly initialize it is a bit of an open question right now, but advi is probably the best we have at the moment.
 LKJCholeskyCov is encouraged because it is numerically more stable (thanks to @aseyboldt), and faster to sample (not in this case as MvGaussianRandomWalk only accept cov currently).
 find_MAP is really really bad in high dimension cases because the MAP is usually not in the typical set (Betancourt’s paper is the best reference https://arxiv.org/abs/1701.02434)
 ADVI is awesome for initialling NUTS, because it gives a very good scaling matrix. And in general, the estimation from ADVI is already very close to the posterior from MCMC (you can even get better ADVI result using FullRankADVI)
For ADVI, I’m iterating at a much slower pace than what your results suggest, is there a way to set a higher tolerance so that ADVI will terminate much earlier? The advi code here suggests we can use tol_obj but I’m not sure how to access that value when it seems to be using the more general wrapper ADVI/Inference class.
My suggestion is to try to reparameterization first. Right now the model is not very efficient. I think writing a custom logp instead of appending a list would help. Also, start with fewer observations (e.g., for i in range(3):
) might help.
Hi @kpei, did you make any progress on this? It recently come to our attention that using pm.dist_math.switch
might break the gradient. You can see this post for a case study, and here for some recommendation/workaround.
I am currently redoing the code with a custom logp. It seems like I may have to use a nontheano op to carry the logp calculation (which may limit me from using NUTS/ADVI). I will try the sigmoid workaround to see if it helps.
I seem to be stuck using the theano as_op to define the likelihood. Below is my code, note that my input is now an (N, T, 2) tensor where T is some arbitrary value i set. If the length of the sample < T, then it is padded with np.nan. The sample can be downloaded here
@theano.compile.ops.as_op(itypes=[tt.ftensor3, tt.fmatrix, tt.fmatrix, tt.fvector, tt.fmatrix], otypes=[tt.fscalar])
def logp(x, mu, cov, init_mu, init_cov):
ll = tt.sum(pm.MvNormal.dist(mu=init_mu, cov=init_cov).logp(x[:,0]))
for i in range(x.shape[0]):
v = x[i,:]
v = v[~np.isnan(v).any(axis=1)]
innov_like = pm.MvNormal.dist(mu=v[:1] + mu[v.shape[0]1], cov=cov).logp(v[1:])
ll += tt.sum(innov_like)
return ll
Since then, I have added in an additional variable to iterate over (the player). Assuming for now that we have 1 player, The data is still the same as before.
px = np.array(obs[player]['CT'])
len_player = len(px)
print 'Adding %i Rounds for %s' % (len_player, player)
drift = tt.switch(switchpoint[k] >= np.linspace(0,1,cf1)[:, np.newaxis], mu_start[k], mu_hold[k])
loglike = pm.DensityDist('ll', lambda x: logp(x, drift, cov, inital_pos["CT"].mu, inital_pos["CT"].cov), observed=px)
db = pm.backends.Text('pos_dump_%s' %player)
pm.sample(n_samples, trace = db)
I get the following error: AttributeError: 'scratchpad' object has no attribute 'test_value'
. I know from this issue, that as_op holds only nontheano function. However if I were to use some logp like def logp(x, mu ...): return 0.0
. I still get the same error.
full log
Adding 96 Rounds for rush
Autoassigning NUTS sampler...
Initializing NUTS using ADVI...

AttributeError Traceback (most recent call last)
<ipythoninput128e423220c249f> in <module>()
28 theano.config.compute_test_value = 'off'
29 #step = pm.Metropolis()
> 30 pm.sample(n_samples, trace = db)
C:\Anaconda2\lib\sitepackages\pymc3\sampling.pyc in sample(draws, step, init, n_init, start, trace, chain, njobs, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, **kwargs)
241 start_, step = init_nuts(init=init, njobs=njobs, n_init=n_init,
242 model=model, random_seed=random_seed,
> 243 progressbar=progressbar, **args)
244 if start is None:
245 start = start_
C:\Anaconda2\lib\sitepackages\pymc3\sampling.pyc in init_nuts(init, njobs, n_init, model, random_seed, progressbar, **kwargs)
598 callbacks=cb,
599 progressbar=progressbar,
> 600 obj_optimizer=pm.adagrad_window
601 ) # type: pm.MeanField
602 start = approx.sample(draws=njobs)
C:\Anaconda2\lib\sitepackages\pymc3\variational\inference.pyc in fit(n, local_rv, method, model, random_seed, start, inf_kwargs, **kwargs)
741 local_rv=local_rv, model=model, random_seed=random_seed,
742 start=start,
> 743 **inf_kwargs
744 )
745 except KeyError:
C:\Anaconda2\lib\sitepackages\pymc3\variational\inference.pyc in __init__(self, local_rv, model, cost_part_grad_scale, scale_cost_to_minibatch, random_seed, start)
363 cost_part_grad_scale=cost_part_grad_scale,
364 scale_cost_to_minibatch=scale_cost_to_minibatch,
> 365 random_seed=random_seed, start=start)
366
367 @classmethod
C:\Anaconda2\lib\sitepackages\pymc3\variational\inference.pyc in __init__(self, op, approx, tf, local_rv, model, op_kwargs, **kwargs)
55 approx = approx(
56 local_rv=local_rv,
> 57 model=model, **kwargs)
58 elif isinstance(approx, Approximation): # pragma: no cover
59 pass
C:\Anaconda2\lib\sitepackages\pymc3\variational\opvi.pyc in __init__(self, local_rv, model, cost_part_grad_scale, scale_cost_to_minibatch, random_seed, **kwargs)
576 self.lbij = DictToArrayBijection(ArrayOrdering(self.local_vars), {})
577 self.flat_view = model.flatten(
> 578 vars=self.local_vars + self.global_vars
579 )
580 self._setup(**kwargs)
C:\Anaconda2\lib\sitepackages\pymc3\model.pyc in flatten(self, vars)
697 order = ArrayOrdering(vars)
698 inputvar = tt.vector('flat_view', dtype=theano.config.floatX)
> 699 inputvar.tag.test_value = flatten_list(vars).tag.test_value
700 replacements = {self.named_vars[name]: inputvar[slc].reshape(shape).astype(dtype)
701 for name, slc, shape, dtype in order.vmap}
AttributeError: 'scratchpad' object has no attribute 'test_value'
I dont think you need the as_op
decorator  did you try removing it?
Yes however the follow code would not work:
for i in range(x.shape[0]):
v = x[i,:]
v = v[~np.isnan(v).any(axis=1)]
as they are numpy procedures and not theano. Do you know anyway to be able to efficiently get the likelihood for the values that are there but return a 0 for nans? This would deal with any different vector lengths and also get around list implementation
Edit: Getting the same results using Potential: ll = pm.Potential('likelihood', logp(px, drift, cov, inital_pos["CT"].mu, inital_pos["CT"].cov))
. The logp function no longer has the as_op decorator but still getting the same error.
Edit: cleared the cache, the potential method seems to be sampling albeit I get a type error sometimes (this has happened throughout a lot of my computations): TypeError: expected type_num 11 (NPY_FLOAT32) got 12
. I am using the cuda gpu to train it
Edit v4: Seems like removing floatx=float from theanorc fixed it. It is now sampling at around ~16 it/s for metropolis with the potential method, will try the pm.switch workaround + using NUTS/ADVI and see how they work.
Thanks for the update.
Just a quick heads up: training under GPU has not been tested extensively, and there are quite a few known issue using GPU (e.g., here). For now, try to make sure your model works on CPU first.
Keep me posted!
Just wanted to update that my original code samples perfectly under cpu (and even faster under single core!!) . nevertheless I made a few switches based on your original suggestion so i definitely appreciate that.