I noticed that in some papers people determine whether the model has converged based on the perplexity on data set aside earlier for testing. Is there a way to compute the perplexity of a model with the new OPVI API? As far as I understand, to get the perplexity we need to compute the values of log(q) of the current variational approximation q on user-provided data.
I had a look at the source code and it seems it would amount to evaluating logq (or norm_logq, I’m not sure which), but the complexity of the OPVI API is beyond me at this point and I have no idea how to supply user data to these functions. Maybe I’m looking in the wrong place and there is an easier way around it.
Long term I hope to implement a perplexity-based convergence test as a callback.
Are other techniques for evaluating convergence, apart from simply looking at the average loss, available in pymc3? For example, the Wikipedia article also mentions using cross-entropy as a test function.
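For concreteness, by perplexity I mean the exponential of the negative average held-out log-likelihood, something like the following (logp_test and n_test are just placeholder names):
import numpy as np
# logp_test: log-likelihood of the held-out set under the current approximation
# n_test: number of held-out observations
perplexity = np.exp(-logp_test / n_test)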
As for evaluating the values of log(q), maybe @ferrine knows a quick way to do that.
– pwl: @junpenglao Doesn’t that callback simply track a change in the norm of the overall parameter vector of the approximations q_i (that is, the means and stdevs of the transformed probability distributions) instead of looking at the loss? In particular, it completely ignores the loss in its __call__ definition.
– ferrine: @pwl Yes, we are currently looking at the parameters, as that works for both parametric and non-parametric gradient descent.
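For reference, the parameter-based check is attached to fit along these lines (the every/tolerance values below are just examples):
inference = pm.ADVI()
# stop once the relative change in the parameter vector drops below the tolerance,
# checked every 100 iterations
inference.fit(100000, callbacks=[
    pm.callbacks.CheckParametersConvergence(every=100, diff='relative', tolerance=1e-3)
])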
I hope I have correctly understood the main idea of the proposed method. Evaluating the likelihood for the approximate model is quite easy. You can access approx.model.observed_RVs in the callback and, on the first iteration, compile a function using approx_lik = approx.apply_replacements(likelihood). Note that the likelihood is scaled to the full dataset there if you are using minibatches.
– pwl: @ferrine thanks for the tip. How do I access likelihood? Is it stored somewhere in approx?
# `approx` is the approximation (e.g. pm.ADVI().approx), n_samples the number of posterior samples
import theano
import theano.tensor as tt
# first, you create a tensor; it can be any expression that uses pymc3 random nodes,
# in our case it is the likelihood
likelihood = tt.sum([v.logpt for v in approx.model.observed_RVs])
# to evaluate it under the approximate posterior you replace the random nodes with the approximation
# for a single sample it is
approx_lik = approx.apply_replacements(likelihood)
# for many samples
approx_liks = approx.sample_node(likelihood, n_samples)
# then you can take the expectation under the posterior
marginal_lik = approx_liks.mean(0)
# finally, the function can be compiled
marginal_lik_fn = theano.function([], marginal_lik)
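For example, one way to use it is to record the value during fitting with a Tracker callback (assuming approx comes from an inference object such as inference = pm.ADVI(); keep in mind that Tracker evaluates the function at every single iteration):
# track the approximate marginal likelihood during fitting
tracker = pm.callbacks.Tracker(marginal_lik=marginal_lik_fn)
inference.fit(10000, callbacks=[tracker])
marginal_lik_hist = tracker['marginal_lik']  # one value per iteration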
@ferrine Thanks for a thorough answer. One thing I still don’t understand is how I can supply a custom data point for the likelihood to be evaluated at. I guess instead of computing the mean(0) I would have to evaluate each logpt at some point, but I will take a closer look at that tomorrow.
The data point where the likelihood is evaluated can be a shared variable or just a symbolic tensor. There is no way to change it except by using theano.clone and keeping a reference to your data. As a shortcut you can use approx.sample_node(likelihood, n_samples, more_replacements={old_data_tensor: new_data_tensor}).
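Connecting this back to the perplexity-based callback idea from the start of the thread, a rough sketch could look like the following (X_shared, X_test, the check interval and the tolerance are all placeholders, and inference is the pm.ADVI() object the approximation comes from):
import numpy as np
# substitute the held-out data into the likelihood and compile it once
X_heldout = theano.shared(X_test)
heldout_lik = approx.sample_node(likelihood, n_samples,
                                 more_replacements={X_shared: X_heldout}).mean(0)
heldout_lik_fn = theano.function([], heldout_lik)

perplexity_history = []

def perplexity_convergence(approx, losses, i):
    if i % 500 != 0:  # evaluate only every 500 iterations
        return
    perplexity_history.append(np.exp(-heldout_lik_fn() / len(X_test)))
    if len(perplexity_history) > 1:
        prev, curr = perplexity_history[-2], perplexity_history[-1]
        if abs(curr - prev) / prev < 1e-3:
            # fit() catches StopIteration from callbacks and stops early,
            # the same mechanism CheckParametersConvergence uses
            raise StopIteration('perplexity converged')

inference.fit(100000, callbacks=[perplexity_convergence])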
@ferrine I am playing around with the new functionality of sample_node and apply_replacements and I’m slowly getting the hang of it (mostly thanks to your recent blog post; having working examples to build on really helps). I still have some questions though, as some things are not entirely clear to me yet.
Why are there two functions, sample_node and apply_replacements, if sample_node with size=1 seems to do the same thing as apply_replacements?
When running a model with a Tracker callback involving sample_node or apply_replacements I’m getting much higher CPU usage than without the callback (all 8 cores at 100%). Is that expected? Why?
As for the two methods: the shapes differ. Moreover, internally sample_node uses scan while apply_replacements does not. I agree that this is not the best design and it would be better to wrap them up in a single function.
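A quick way to see the shape difference, using the likelihood node from earlier (a scalar):
# apply_replacements keeps the shape of the original node,
# sample_node adds a leading sample dimension
single = approx.apply_replacements(likelihood)
many = approx.sample_node(likelihood, 100)
print(single.eval().shape)  # ()
print(many.eval().shape)    # (100,)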
Maybe I was not entirely clear about the CPU load. I’m doing inference on a GPU with usually just one core active, so I was surprised to see such a dramatic increase in CPU usage. But that was my mistake: I had not wrapped the test data set in theano.shared, so these computations were done on the CPU instead of on the GPU (as far as I understand it). It’s now working much better with the shared arrays.
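In case it helps anyone else, the fix amounted to roughly this (test_array, another_test_array and X_shared are placeholder names from my own code, not part of the API):
# keep the held-out set in a shared variable so it is transferred to the GPU once
test_data = theano.shared(test_array.astype(np.float32))
tracked = approx.sample_node(likelihood, 100,
                             more_replacements={X_shared: test_data})
# a different test set can later be swapped in without recompiling anything
test_data.set_value(another_test_array.astype(np.float32))
Separately, I’m running into a problem with the following snippet: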
import numpy as np
import theano
import pymc3 as pm

with pm.Model():
    val = pm.Poisson.dist(
        mu=pm.Gamma('U', 1, 1, shape=3)).logp(theano.shared(np.arange(3)))
    inference = pm.ADVI()
    inference.approx.sample_node(val)
I’m getting a MethodNotDefined error. Am I doing something wrong here or is it a bug? Also, the above snippet works perfectly fine if I remove theano.shared and use the numpy array directly.
data = np.random.randint(3, size=(2, 2)).astype(np.float32)

with pm.Model():
    a = pm.Beta('a', 1, 1, shape=2)
    b = pm.Beta('b', 1, 1, shape=2)
    psi = theano.tensor.outer(a, b)
    theta = pm.Gamma('g', 1, 1, shape=(2, 2))
    data = pm.ZeroInflatedPoisson('data', theta=theta, psi=psi, observed=data)
    inference = pm.ADVI()
    val = inference.approx.sample_node(
        pm.ZeroInflatedPoisson.dist(theta=theta, psi=psi).logp(data), size=1)
    inference.fit(n=2,
                  callbacks=[pm.callbacks.Tracker(val=val.eval)])
yields an AttributeError but trains fine if I remove the callbacks. I tried replacing outer with explicit multiplication, but then it complains that AttributeError: mul.0 has no test value. It does not happen for scalar variables, though (that is, if a and b are scalars).
One more thing: it seems to happen only for ZeroInflatedPoisson.dist(theta=theta, psi=psi); I tried using Poisson.dist(mu=psi) without any issues. And it makes no difference whether the tracker is defined outside of the call to fit, if that’s what you meant.