I have multimodal data that I want to model, with the goal to re-sample from the model to study the variance of it mean.
This is my (simplifyed) code for the model :
import sys
import pymc3 as pm, theano.tensor as tt
import numpy as np
import matplotlib.pyplot as plt
# observed data
y = np.concatenate([np.random.normal(loc=0,scale=1,size=1000),
np.random.normal(loc=5,scale=1,size=1000)])
# show data
plt.hist(y,bins=100); plt.show()
with pm.Model() as model:
# model description
mix_nbr = 6
sd = pm.HalfNormal('sd', sd=np.asarray([1.0]*mix_nbr), shape=mix_nbr)
mu = pm.HalfNormal('mu', sd=np.asarray([5.0]*mix_nbr), shape=mix_nbr)
components = pm.Normal.dist(mu=mu,sd=sd,shape=mix_nbr)
w = pm.Dirichlet('w',a=np.array([1.0]*mix_nbr))
mix = pm.Mixture('mix',w=w,comp_dists=components, observed=y)
# optimization
mean_field = pm.fit(n=10000, obj_optimizer=pm.adagrad(learning_rate=0.1))
trace = mean_field.sample(1000)
# plot trace
pm.traceplot(trace,varnames=['mu','sd','w']);plt.show()
# plot Preditive Posterior Check
ppc = pm.sample_ppc(trace,vars=[mix],samples=len(y))
plt.hist(ppc['mix'],bins=100); plt.show()
I got a PPC of my data, which histogram seem to fit the data, I’m happy with that.
My problem appears when it comes to apply the mean operator on this re-sampled data and comparing with original data :
y.mean() # original data mean
=> 2.4788735942752873
ppc['mix'].mean() # re-sampled mean
=> 2.5360738033551042
There are different… That’s normal since it’s a a re-sample, it’s not exactly the same data. But each time I re-sample and compare, I discover a biais : re-samples means are always bigger that the original data mean !
I expected that the resampled means would spread around the original data mean.
Execute this line several time and you’ll see the problem :
ppc = pm.sample_ppc(trace,vars=[mix],samples=len(y), model=model);ppc['mix'].mean()
My understanding is that the optimization is on the likelihood only, so there is no reason that the mean behaves the way as expected.
So I want to constrain my model such the mean of resampled data don’t go too far form the original data mean.
This means that I need to acces to samples from my ‘mix
’ variable in order to compute the mean and build a Potential
penalty based on the difference of that mean and the original data mean.
That’s where my problem starts…
I tryed to create a new random variable mixture with the same components and weights : (just after the ‘mix’ declaration)
re_mix = pm.Mixture('re_mix',w=w,comp_dists=components,shape=len(y))
But adding this simple line to my model changed the results!!
The PPC is ruined! It doesn’t fit the original data any more.
This wasn’t expected since the re_mix
random variable is not observed !