Gaussian mixture of regressions


#1

df.csv (94.7 KB)
Hello,

I’m curious to know why I’m getting two different results from the mixture-of-regressions model. I tried standardizing the data but still obtained different results from the two model setups. The code used in the analysis is below, and I have also attached the data.

Model 1:

import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as T

# Load the attached data (column names as in the reply below)
df = pd.read_csv('df.csv')
X = df['X'].values
Y = df['Y'].values

k = 2

with pm.Model() as Mixture_regression1:
    # Priors for the mixture weights
    π1 = pm.Uniform('π1', 0, 1)
    π2 = pm.Deterministic('π2', 1. - π1)

    # Priors for unknown model parameters
    α1 = pm.Normal('α1', mu=0, sd=100)  # intercept
    β1 = pm.Normal('β1', mu=0, sd=100)  # slope
    σ1 = pm.HalfCauchy('σ1', 5)         # noise

    α2 = pm.Normal('α2', mu=0, sd=100)  # intercept
    β2 = pm.Normal('β2', mu=0, sd=100)  # slope
    σ2 = pm.HalfCauchy('σ2', 5)         # noise

    # One regression line per mixture component
    μ1 = α1 + β1*X
    μ2 = α2 + β2*X

    mu = T.stack([μ1, μ2]).T
    sd = T.stack([σ1, σ2])
    π = T.stack([π1, π2])

    likelihood = pm.NormalMixture('likelihood', π, mu, sd=sd, observed=Y)
    trace1 = pm.sample(1000, tune=1000)  # sampling call not shown in the original post

Model 1 results:
[posterior plot]

Model 2:

n = X.shape[0]
niter = 10000  # number of draws; not defined in the original post

with pm.Model() as Mixture_regression2:
    # Proportion in each mixture component
    π = pm.Dirichlet('π', np.array([1]*k), testval=np.ones(k)/k)

    # Priors for unknown model parameters
    α = pm.Normal('α', mu=Y.mean(), sd=100, shape=k)  # intercepts
    β = pm.Normal('β', mu=0, sd=100, shape=k)         # slopes
    σ = pm.HalfCauchy('σ', 5, shape=k)                # noise

    # Latent class assignment for each observation
    category = pm.Categorical('category', p=π, shape=n)

    α1 = pm.Deterministic('α1', α[category])
    β1 = pm.Deterministic('β1', β[category])
    σ1 = pm.Deterministic('σ1', σ[category])

    μ = α1 + β1*X

    # Likelihood
    likelihood = pm.Normal('likelihood', mu=μ, sd=σ1, observed=Y)

    # Metropolis for the continuous parameters, elementwise Gibbs-style
    # updates for the discrete labels
    step1 = pm.Metropolis([π, α, β, σ])
    step2 = pm.ElemwiseCategorical([category], values=range(k))
    trace = pm.sample(niter, [step1, step2], random_seed=1345, progressbar=True)

Model 2 results:
[posterior plot]

Thank you for the help!


#2

Did you check the trace? I am getting the same result as Model 1 (also much faster with NUTS):

This is the code I am using (small rewrite so it is more compact):

df = pd.read_csv('mixture.csv')
X = df['X'].values[:, None]  # column vector so it broadcasts against the (1, k) parameters
Y = df['Y'].values
k = 2
with pm.Model() as Mixture_regression1:
    # Priors for the mixture weights
    π = pm.Dirichlet('π', np.array([1]*k), testval=np.ones(k)/k)

    # Priors for unknown model parameters
    α = pm.Normal('α', mu=0, sd=100, shape=(1, k))  # intercepts
    β = pm.Normal('β', mu=0, sd=100, shape=(1, k))  # slopes
    σ = pm.HalfCauchy('σ', 5, shape=k)              # noise

    # Broadcasting gives an (n, k) matrix of component means
    mu = α + β*X

    likelihood = pm.NormalMixture('likelihood', π, mu, sd=σ, observed=Y)
    trace = pm.sample(1000, tune=1000)
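
To compare the two setups it also helps to look at the chains directly; a minimal check using PyMC3's built-in diagnostics (function names as in PyMC3 3.x):

# Visual check of the chains: label switching shows up as mirrored traces
pm.traceplot(trace, varnames=['π', 'α', 'β', 'σ'])

# Numerical summary with posterior means, credible intervals and
# effective sample sizes
print(pm.summary(trace))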

#3

OK, I think I know why this happens: there is label switching in the mixture. For example, in the first trace π = [.05, .95] but in the second trace π = [.95, .05].

You can use an ordered transform to restrict the order of α: see Mixture models and breaking class symmetry.
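
A minimal sketch of what that looks like in the marginalized model, assuming the ordered transform available in recent PyMC3 versions and reusing k, X (as an (n, 1) column vector) and Y from above; the testval gives the sampler a valid ordered starting point:

with pm.Model() as ordered_model:
    π = pm.Dirichlet('π', np.ones(k), testval=np.ones(k)/k)

    # Forcing α[0] < α[1] removes the label-switching symmetry
    α = pm.Normal('α', mu=0, sd=100, shape=k,
                  transform=pm.distributions.transforms.ordered,
                  testval=np.array([-1., 1.]))
    β = pm.Normal('β', mu=0, sd=100, shape=k)
    σ = pm.HalfCauchy('σ', 5, shape=k)

    mu = α + β*X  # with X as an (n, 1) column this is an (n, k) mean matrix
    likelihood = pm.NormalMixture('likelihood', π, mu, sd=σ, observed=Y)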

[EDIT]:
After some more thought and playing with the model, I think your problem is not related to label switching (plus, you are running one chain only, right?). In fact, when I reran the model I got the same output as your Model 1 quite consistently. So what gives? My intuition is that the posterior is multimodal, with one mode around π = [.05, .95] and one around [.35, .65]. A more informative prior should help with the model identification issue.
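
For instance (a sketch; the prior scales here are illustrative choices, not from the thread), the very wide Normal(0, 100) priors can be tightened toward the data scale:

with pm.Model() as informative_model:
    π = pm.Dirichlet('π', np.ones(k), testval=np.ones(k)/k)

    # Tighter priors than sd=100, with intercepts centred on the observed mean
    α = pm.Normal('α', mu=Y.mean(), sd=10, shape=(1, k))
    β = pm.Normal('β', mu=0, sd=10, shape=(1, k))
    σ = pm.HalfCauchy('σ', 2, shape=k)

    mu = α + β*X
    likelihood = pm.NormalMixture('likelihood', π, mu, sd=σ, observed=Y)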

I put up a Gist notebook in which the two ways of modelling the mixture regression converge to a similar result: http://nbviewer.jupyter.org/gist/junpenglao/1907bf019906c125f63126ec4bf23880/Mixture_discourse.ipynb. It is important to run multiple chains and check the fit.
It is also quite interesting in itself: it clearly shows that although the marginalized and non-marginalized models give the same posterior, the effective sample size is much better with marginalization.
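
For example (draw and chain counts are arbitrary, assuming a PyMC3 version where pm.sample takes a chains argument):

with Mixture_regression1:
    # Several independent chains make multimodality and label switching visible
    trace = pm.sample(1000, tune=1000, chains=4, random_seed=123)

# The n_eff column reports the effective sample size per parameter; compare
# it between the marginalized and non-marginalized models
print(pm.summary(trace))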


#4

Thank you for your help! That's right, I was using one chain in both setups. The explanations you provided are really helpful!

One more issue I have: when using pm.Mixture I'm getting an error. I guess the way I defined pm.Normal.dist is not right for the regression:


#5

The correct way to do it in this case is likelihood = pm.Mixture('likelihood', π, pm.Normal.dist(mu, sd), observed=Y)
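
In context, that line slots into the marginalized model like this (a sketch reusing the shapes from the compact model above; pm.Mixture with a single (n, k) Normal component should be equivalent to pm.NormalMixture here):

with pm.Model() as Mixture_regression3:
    π = pm.Dirichlet('π', np.ones(k), testval=np.ones(k)/k)

    α = pm.Normal('α', mu=0, sd=100, shape=(1, k))
    β = pm.Normal('β', mu=0, sd=100, shape=(1, k))
    σ = pm.HalfCauchy('σ', 5, shape=k)

    mu = α + β*X  # (n, k) matrix of component means

    # pm.Mixture takes the component distribution explicitly, built with
    # pm.Normal.dist(...) rather than pm.Normal(...)
    likelihood = pm.Mixture('likelihood', π, pm.Normal.dist(mu, sd=σ),
                            observed=Y)
    trace = pm.sample(1000, tune=1000)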


#6

It is working. Thank you very much.