How to (re-)sample from an observed random (mixture) variable

I have multimodal data that I want to model, with the goal of re-sampling from the model to study the variance of its mean.

This is my (simplified) code for the model:

    import pymc3 as pm
    import theano.tensor as tt
    import numpy as np
    import matplotlib.pyplot as plt

    # observed data
    y = np.concatenate([np.random.normal(loc=0,scale=1,size=1000),
        np.random.normal(loc=5,scale=1,size=1000)])
    # show data
    plt.hist(y,bins=100); plt.show()

    with pm.Model() as model:
        # model description
        mix_nbr = 6
        sd = pm.HalfNormal('sd', sd=np.asarray([1.0]*mix_nbr), shape=mix_nbr)
        mu = pm.HalfNormal('mu', sd=np.asarray([5.0]*mix_nbr), shape=mix_nbr)
        components = pm.Normal.dist(mu=mu,sd=sd,shape=mix_nbr)
        w = pm.Dirichlet('w',a=np.array([1.0]*mix_nbr))
        mix = pm.Mixture('mix',w=w,comp_dists=components, observed=y)

        # optimization
        mean_field = pm.fit(n=10000, obj_optimizer=pm.adagrad(learning_rate=0.1))
        trace = mean_field.sample(1000)
        # plot trace
        pm.traceplot(trace,varnames=['mu','sd','w']);plt.show()
        # plot posterior predictive check
        ppc = pm.sample_ppc(trace,vars=[mix],samples=len(y))
        plt.hist(ppc['mix'],bins=100); plt.show()

I get a PPC of my data whose histogram seems to fit the data; I’m happy with that.

My problem appears when I apply the mean operator to this re-sampled data and compare it with the original data:

    y.mean() # original data mean
    => 2.4788735942752873

    ppc['mix'].mean() # re-sampled mean
    => 2.5360738033551042

They are different… That’s normal, since it’s a re-sample: it’s not exactly the same data. But each time I re-sample and compare, I discover a bias: the re-sampled means are always bigger than the original data mean!
I expected the re-sampled means to spread around the original data mean.

Execute this line several times and you’ll see the problem:

    ppc = pm.sample_ppc(trace, vars=[mix], samples=len(y), model=model); ppc['mix'].mean()
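Or, equivalently, as a loop (a sketch; the 20 repetitions are an arbitrary choice):

    # re-sample repeatedly and compare the re-sampled means with y.mean()
    means = [pm.sample_ppc(trace, vars=[mix], samples=len(y),
                           model=model)['mix'].mean()
             for _ in range(20)]
    print(np.mean(means), y.mean())  # the re-sampled means sit above y.mean()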

My understanding is that the optimization targets the likelihood only, so there is no reason for the mean to behave as expected.

So I want to constrain my model so that the mean of the re-sampled data doesn’t go too far from the original data mean.

This means that I need access to samples from my ‘mix’ variable in order to compute the mean and build a Potential penalty based on the difference between that mean and the original data mean.
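Something like the following is what I have in mind (just a sketch; ‘re_mix’ is the extra mixture variable I try to introduce below, and lam is a hypothetical penalty weight):

    # hypothetical sketch, inside the model context, after re_mix is declared:
    # penalize the squared distance between the mean of model-generated
    # data and the observed data mean
    lam = 10.0  # hypothetical penalty weight
    pm.Potential('mean_penalty', -lam * tt.sqr(re_mix.mean() - y.mean()))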

That’s where my problem starts… :wink:

I tried to create a new random mixture variable with the same components and weights (just after the ‘mix’ declaration):

    re_mix = pm.Mixture('re_mix', w=w, comp_dists=components, shape=len(y))

But adding this simple line to my model changed the results! The PPC is ruined: it doesn’t fit the original data any more.

This was unexpected, since the re_mix random variable is not observed!

The bias could also come from the mean-field approximation - did you try full-rank ADVI?

OK… How do I do this?

I confess that I picked the sampling invocation line from the mixture example: http://docs.pymc.io/notebooks/gaussian-mixture-model-advi.html

And also because the basic “sample” invocation is far too slow on my complex model.

I’d really like to have an introductory document that could help me find the right way to perform an optimization that won’t lead to a bias on the mean…

And I’d also like to understand why adding the ‘re_mix’ line to my model changes the result (even though this random variable is not observed).

You can specify a different method when calling pm.fit:

    fullrank = pm.fit(n=10000, method='fullrank_advi',
                      obj_optimizer=pm.adagrad(learning_rate=0.1))

I don’t think there are any; variational inference is still an actively developing subject.

You are not supposed to put an unobserved node like this in a VI model - currently PyMC3 will try to build an approximation of any free random variable, even if, as in this case, the node is purely generative and not linked to the model logp. If you do mean_field.params[0].eval().shape you will find that the approximation includes all the dimensions from re_mix.
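For example (a sketch; the exact counts depend on the automatic transforms PyMC3 applies, so the numbers in the comments are indicative only):

    # mean_field.params[0] holds the means of the approximation, one entry
    # per flattened free-parameter dimension
    print(mean_field.params[0].eval().shape)
    # without re_mix: mu (6) + sd (6) + stick-breaking w (5) = 17 dimensions
    # with re_mix:    17 + len(y) = 2017 dimensions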

Overall, I have to point out that some biases are expected when you are using VI. If the PPC looks fine and the model prediction (e.g. on a test data set) is reasonable, I would settle for that. If you need full Bayesian inference with a proper account of uncertainty, or the full posterior, you should try to get the sampler to work.
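For reference, swapping the mean-field fit for the sampler would look something like this (a sketch only; it is slower, but avoids the approximation bias discussed above):

    # replace pm.fit(...) / mean_field.sample(...) with MCMC
    with model:
        trace = pm.sample(1000, tune=1000)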

Thanks for all these hints.

The full-rank version doesn’t change the bias problem, but using the ‘basic’ sample method fixed it!

It would be cool to have a list of the different inference methods available, with links to the corresponding articles explaining each technique. I didn’t find such a section in the docs… I found all the inference calls I use in code examples (I know it’s bad, but I’d like to do better :wink:).

When using the sample function (MCMC), re_mix is OK and does not alter the posterior predictive check.
I confess that I’m surprised by this difference between ADVI and MCMC. It’s certainly because I don’t understand ADVI well enough… I’d be happy to have any reference about it (even scientific articles rather than easy tutorials).

Thanks for all your support.

I tend to trust sampling a bit more as well. There really isn’t enough theory around automatic VI yet, unfortunately.

I think including a generative RV in the model is currently unsafe in PyMC3. In theory it should not change the inference, as the generative node is conditioned on its parent parameters and has no effect on the overall model inference. But currently it does (see a discussion here):

For sampling, the problem is that the automatic compound step does not fully follow the order of the graph, which means you might sometimes end up sampling the generative RV before its parent; this introduces a bias, because the parent parameters then depend on the sample of the generative RV. But even when the generative RV is sampled after its parent, it still has an effect on the next sample, as the model.logp includes the generative RV, and thus the MH update depends on the value of the generative RV.
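One way to see this is that the generative node is registered as an ordinary free variable (a sketch; run it with re_mix declared in the model):

    # re_mix appears among the free RVs, so the compound step assigns it a
    # sampler and the joint logp gains a logp(re_mix | mu, sd, w) term
    print(model.free_RVs)  # expect something like [sd_log__, mu_log__, w_stickbreaking__, re_mix]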

It might work in some models, but overall you should always add the generative node after inference, using sample_ppc, e.g.
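(a sketch for the model above: inference runs without re_mix, and the new draws come from sample_ppc afterwards)

    with model:
        trace = pm.sample(1000, tune=1000)
        # generate new data from the fitted mixture after inference,
        # instead of declaring re_mix inside the model
        ppc = pm.sample_ppc(trace, vars=[mix], samples=len(y))
    # across repeats, ppc['mix'].mean() now spreads around y.mean()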