Best way to model an observed variable that is a mixture of two distributions

Hi PyMC3 Community,
I have a data generating process where I am trying to model contaminants.
My observed data is generally sampled from a single Normal distribution, but occasionally it is sampled from a contaminant process (which is modeled with a number of Uniform distributions). I'd like to infer which data samples are contaminants.
If I have a variable that signifies whether a data sample is contaminated or not:
theta ~ pm.Bernoulli(p=0.1, shape=n_data)

Then how do I model the observed data as a switch on theta?

There seem to be a few possible ways to do this:

  1. Build my own distribution that switches between Normal and Uniform. Something like this:
class SwitchingContinuous(pm.Continuous):
    def __init__(self, switch_val, dist1, dist2, *args, **kwargs):
        self.dist1 = dist1
        self.dist2 = dist2
        self.switch_val = switch_val
        super(SwitchingContinuous, self).__init__(*args, **kwargs)

    def logp(self, value):
        # Could also just use a tt.switch statement here
        return ((1 - self.switch_val) * self.dist1.logp(value)
                + self.switch_val * self.dist2.logp(value))

theta = pm.Bernoulli('theta', p=0.1, shape=n_data)
a = pm.Normal.dist(mu=9, sd=1)
b = pm.Uniform.dist(0, 10, shape=n_data)
obs = SwitchingContinuous('mix', switch_val=theta, dist1=a, dist2=b, observed=data)

  2. Pack theta into two columns and use the pm.Mixture() class:

theta = pm.Bernoulli('theta', p=0.1, shape=n_data)
theta_stacked = tt.stack([1 - theta, theta], axis=1)
a = pm.Normal.dist(mu=9, sd=1)
b = pm.Uniform.dist(0, 10, shape=n_data)
pm.Mixture('mix', w=theta_stacked, comp_dists=[a, b], observed=data)
  3. Use DensityDist (not sure how to use this, but I could work it out if it's the right way to do it)

  4. Somehow make a switched variable observed:

theta = pm.Bernoulli('w', p=0.5, shape=n_data)
a = pm.Normal.dist(mu=9, sd=1)
b = pm.Uniform.dist(0, 10, shape=n_data)
obs_data = tt.switch(theta, a, b)
obs = pm.Deterministic('obs', obs_data, observed=data)

FYI, dummy data could be generated like this:

    theta = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=bool)
    data1 = np.random.normal(size=(len(theta), 1)) + 9      # clean (Normal) samples
    data2 = np.random.uniform(0, 10, size=(len(theta), 1))  # contaminant (Uniform) samples
    data = np.zeros((len(theta), 1))
    data[~theta] = data1[~theta]
    data[theta] = data2[theta]

Definitely option 2, but with a Beta prior for theta. You can use something like a Beta(1, 10) to reflect the strong prior belief that contamination is sparse.
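
For example, here is a minimal sketch of that approach, assuming the `data` array from your dummy-data snippet and the same component parameters you used above (the variable names and sampling settings are just placeholders):

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    # Weight of the contaminant component; Beta(1, 10) puts most of the prior
    # mass on small contamination fractions.
    w = pm.Beta('w', alpha=1., beta=10.)

    clean = pm.Normal.dist(mu=9., sd=1.)
    contaminant = pm.Uniform.dist(lower=0., upper=10.)

    # Marginalized mixture: no per-datapoint Bernoulli indicators to sample.
    # data.ravel() flattens the (n, 1) array from the snippet above to 1-D.
    pm.Mixture('obs', w=tt.stack([1. - w, w]),
               comp_dists=[clean, contaminant], observed=data.ravel())

    trace = pm.sample(1000, tune=1000)

To flag individual contaminants afterwards, you can compute each data point's posterior responsibility for the Uniform component, i.e. w * Uniform_pdf(x_i) / (w * Uniform_pdf(x_i) + (1 - w) * Normal_pdf(x_i)), averaged over the posterior draws of w.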


Awesome stuff! Thanks.