Perplexity of a model

Github issue #2297 by pwl:

I noticed that in some papers people determine whether the model converged based on perplexity on some data set aside earlier for testing. Is there a way to compute the perplexity of a model with the new OPVI API? As far as I understand, to get the perplexity we need to compute the values of log(q) of the current variational approximation q on user-provided data.

I had a look at the source code and it seems it would amount to evaluating logq (or norm_logq, I’m not sure at this point) but the complexity of the OPVI API is beyond me at this point and I have no idea how to supply user data to these functions. Maybe I’m looking in the wrong place and there is an easier way around it.

Long term I hope to implement a perplexity-based convergence test as a callback.

Are other techniques for evaluating convergence, apart from simply looking at the average loss, available in PyMC3? For example, the Wikipedia article also mentions using cross-entropy as a test function.
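For reference, perplexity is usually defined as the exponential of the negative mean held-out log-likelihood, so a lower perplexity means the model assigns higher probability to the test data. A minimal sketch (the function name is mine, not from the thread):

```python
import numpy as np

def perplexity(logps):
    """Perplexity from per-test-point log-likelihoods (natural log)."""
    return float(np.exp(-np.mean(logps)))

# a model assigning probability 1/4 to each test point has perplexity
# close to 4, and a perfect model (probability 1) has perplexity 1
pp = perplexity(np.log(np.full(10, 0.25)))
```

With per-point log q values from the variational approximation in hand, this is all that is needed to turn them into a perplexity number for a convergence check.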


junpenglao: You can find the convergence evaluation callback for VI in PyMC3 here
So far we are doing it by looking at the change in loss over a given window.
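A hypothetical window-based loss check in that spirit might look like the following. This is an illustrative sketch under my own assumptions (class name and thresholds are mine, and it is not PyMC3's actual callback); it only assumes the callback is invoked with the approximation, the loss history, and the iteration number:

```python
import numpy as np

class RelativeLossChange:
    """Hypothetical callback: stop when the mean loss over the last
    `window` iterations changes by less than `tol` relative to the
    mean over the window before it."""

    def __init__(self, window=100, tol=1e-3):
        self.window = window
        self.tol = tol

    def __call__(self, approx, losses, i):
        if i < 2 * self.window:
            return  # not enough history yet
        prev = np.mean(losses[-2 * self.window:-self.window])
        curr = np.mean(losses[-self.window:])
        if abs(curr - prev) / abs(prev) < self.tol:
            raise StopIteration('converged at iteration %d' % i)
```

Raising StopIteration from a callback is how PyMC3's VI loop is interrupted early.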

As for evaluating the values of log(q) maybe @ferrine knows a quick way to do that.

pwl: @junpenglao Doesn’t that callback simply track a change in the norm of the overall parameter vector of the approximations q_i (so the means and stdevs of the transformed probability distributions) instead of looking at the loss? In particular, it completely ignores the loss in its __call__ definition.

ferrine: @pwl Yes, we are currently looking at the parameters, as that works for both parametric and non-parametric gradient descent.
I hope I correctly understood the main idea of the proposed method. Evaluating the likelihood for the approximate model is quite easy. You can access approx.model.observed_RVs in the callback and, on the first iteration, compile a function using approx_lik = approx.apply_replacements(likelihood). Note that the likelihood there is scaled to the full dataset if you are using minibatches.

pwl: @ferrine thanks for the tip. How do I access likelihood? Is it stored somewhere in approx?

ferrine: I think not yet, but you can construct it yourself:

import theano
import theano.tensor as tt

# first, create a tensor; this can be any expression that uses pymc3 random nodes
# (in our case, the likelihood)
likelihood = tt.sum([v.logpt for v in approx.model.observed_RVs])
# to evaluate it under the approximate posterior, replace the random nodes
# with the approximation; for a single sample:
approx_lik = approx.apply_replacements(likelihood)
# for many samples:
approx_liks = approx.sample_node(likelihood, n_samples)
# then you can take the expectation under the posterior:
marginal_lik = approx_liks.mean(0)
# finally, the function can be compiled:
marginal_lik_fn = theano.function([], marginal_lik)

Wow, nice! It will be great to continue the discussion here.

Great! Thanks for setting this up @junpenglao!

@ferrine Thanks for the thorough answer. One thing I still don’t understand is how I can supply a custom data point for the likelihood to be evaluated at. I guess instead of computing the mean(0) I would have to evaluate each logpt at some point, but I will take a closer look at that tomorrow.

The data point where the likelihood is evaluated can be a shared variable or just a symbolic tensor. There is no way to change it except by using theano.clone and keeping a reference to your data. As a shortcut you can use approx.sample_node(likelihood, n_samples, more_replacements={old_data_tensor: new_data_tensor})

@ferrine I am playing around with the new sample_node and apply_replacements functionality and I’m slowly getting the hang of it (mostly thanks to your recent blog post; having working examples to build on really helps). I still have some questions though, as some things are not entirely clear to me yet.

Why are there two functions, sample_node and apply_replacements, if sample_node with size=1 seems to do the same thing as apply_replacements?

When running a model with the Tracker callback involving sample_node or apply_replacements, I’m getting much higher CPU usage than without the callback (all 8 cores at 100%). Is that expected? Why?

More computations are done; that’s why you get the CPU load.

As for the two methods: the shapes differ. Moreover, internally a scan is behind sample_node, while apply_replacements does not use one. I agree that is not the best design and it would be better to wrap them up in a single function :slight_smile:
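The shape difference can be illustrated with a plain numpy stand-in (assumed semantics, not PyMC3 internals): apply_replacements yields a single draw with the node’s own shape, while sample_node stacks n_samples draws along a new leading axis, which mean(0) then collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def single_draw(shape=(3,)):
    # analogue of approx.apply_replacements(node): one draw, node's own shape
    return rng.normal(size=shape)

def stacked_draws(n_samples, shape=(3,)):
    # analogue of approx.sample_node(node, n_samples): extra leading sample axis
    return np.stack([single_draw(shape) for _ in range(n_samples)])

draws = stacked_draws(100)
assert draws.shape == (100, 3)
assert draws.mean(0).shape == (3,)  # averaging collapses the sample axis
```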

Maybe I was not entirely clear about the CPU load. I’m doing inference on a GPU, usually with just one core active, so I was surprised to see such a dramatic increase in CPU usage. But that was my mistake: I had not wrapped the test data set with theano.shared, so these computations were done on the CPU instead of on the GPU (as far as I understand it). It’s now working much better with the shared arrays.

When running

with pm.Model():
    val = pm.Poisson.dist(
        mu=pm.Gamma('U', 1, 1, shape=3)).logp(theano.shared(np.arange(3)))
    inference = pm.ADVI()

I’m getting a MethodNotDefined error. Am I doing something wrong here, or is it a bug? The above snippet works perfectly fine if I remove theano.shared and use the numpy array directly.

To be clear, I’m using these flags:

os.environ['THEANO_FLAGS'] = 'device=cuda,floatX=float32,exception_verbosity=high'

Are you on master? The above code works fine for me.

Yes, I am on master (in particular, it works for me already).

Oh I see your edit. It must be a GPU issue then…

I posted an issue about that.


Continuing on sample_node related errors:

data = np.random.randint(3, size=(2, 2)).astype(np.float32)
with pm.Model():
    a = pm.Beta('a', 1, 1, shape=2)
    b = pm.Beta('b', 1, 1, shape=2)
    psi = theano.tensor.outer(a, b)
    theta = pm.Gamma('g', 1, 1, shape=(2, 2))
    data = pm.ZeroInflatedPoisson('data', theta=theta, psi=psi, observed=data)

    inference = pm.ADVI()
    val = inference.approx.sample_node(

yields an AttributeError but trains fine if I remove the callbacks. I tried replacing outer with explicit multiplication, but then it complains that AttributeError: mul.0 has no test value. It does not happen for scalar variables, though (that is, if a and b are scalars).

Are you able to reproduce the problem on master?

Yes, I’m using the master branch (updated a few hours ago).

You’d better define the tracker before calling fit.

I see the problem is hidden in missing test_values somewhere in logp. That’s strange. I feel more worried about the context-dependent flag change.

One more thing: it only seems to happen for ZeroInflatedPoisson.dist(theta=theta, psi=psi); I tried Poisson.dist(mu=psi) without any issues. And there is no difference between defining the tracker outside of the call to fit, if that’s what you meant.