I have multimodal data that I want to model, with the goal to re-sample from the model to study the variance of it mean.
This is my (simplifyed) code for the model :
import sys import pymc3 as pm, theano.tensor as tt import numpy as np import matplotlib.pyplot as plt # observed data y = np.concatenate([np.random.normal(loc=0,scale=1,size=1000), np.random.normal(loc=5,scale=1,size=1000)]) # show data plt.hist(y,bins=100); plt.show() with pm.Model() as model: # model description mix_nbr = 6 sd = pm.HalfNormal('sd', sd=np.asarray([1.0]*mix_nbr), shape=mix_nbr) mu = pm.HalfNormal('mu', sd=np.asarray([5.0]*mix_nbr), shape=mix_nbr) components = pm.Normal.dist(mu=mu,sd=sd,shape=mix_nbr) w = pm.Dirichlet('w',a=np.array([1.0]*mix_nbr)) mix = pm.Mixture('mix',w=w,comp_dists=components, observed=y) # optimization mean_field = pm.fit(n=10000, obj_optimizer=pm.adagrad(learning_rate=0.1)) trace = mean_field.sample(1000) # plot trace pm.traceplot(trace,varnames=['mu','sd','w']);plt.show() # plot Preditive Posterior Check ppc = pm.sample_ppc(trace,vars=[mix],samples=len(y)) plt.hist(ppc['mix'],bins=100); plt.show()
I got a PPC of my data, which histogram seem to fit the data, I’m happy with that.
My problem appears when it comes to apply the mean operator on this re-sampled data and comparing with original data :
y.mean() # original data mean => 2.4788735942752873 ppc['mix'].mean() # re-sampled mean => 2.5360738033551042
There are different… That’s normal since it’s a a re-sample, it’s not exactly the same data. But each time I re-sample and compare, I discover a biais : re-samples means are always bigger that the original data mean !
I expected that the resampled means would spread around the original data mean.
Execute this line several time and you’ll see the problem :
ppc = pm.sample_ppc(trace,vars=[mix],samples=len(y), model=model);ppc['mix'].mean()
My understanding is that the optimization is on the likelihood only, so there is no reason that the mean behaves the way as expected.
So I want to constrain my model such the mean of resampled data don’t go too far form the original data mean.
This means that I need to acces to samples from my ‘
mix’ variable in order to compute the mean and build a
penalty based on the difference of that mean and the original data mean.
That’s where my problem starts…
I tryed to create a new random variable mixture with the same components and weights : (just after the ‘mix’ declaration)
re_mix = pm.Mixture('re_mix',w=w,comp_dists=components,shape=len(y))
But adding this simple line to my model changed the results!!
The PPC is ruined! It doesn’t fit the original data any more.
This wasn’t expected since the
re_mix random variable is not observed !