Defining an unpooled mixture model

I want to define an unpooled model consisting of a mixture of 5 Gaussians. I’m having trouble defining the shape of the model components. Here’s my code:

categories = df.unit.unique() # there are 24 categories
idx = pd.Categorical(df.unit, categories=categories).codes # there are 21866 rows of data
coords = {"units": categories, "units_flat": df.unit.values}

with pm.Model(coords=coords) as model:
    p = pm.Dirichlet("p", a=5*[1])
    mu = pm.Normal("mu", mu=[1, 24, 48, 72, 96], sigma=5*[2], shape=5, dims="unit")
    sigma = pm.HalfNormal("sigma", sigma=5, shape=5, dims="units")
    components = pm.TruncatedNormal.dist(mu=mu[idx], sigma=sigma[idx], lower=len(idx)*[1], upper=None)
    
    Y = pm.Mixture("Y", w=p, comp_dists=components, observed=df.hours, dims="unit_flat")

The error message I get states, ValueError: Alloc static input type and target shape are incompatible: Vector(float64, shape=(5,)) vs (21886,)

Why are you indexing into the parsmeters of the components? That looks like you know what component each observation belongs to, which wouldn’t be a mixture scenario.

That way you’re defining 21k components a priori not the five you wanted.

I’ll admit that I don’t fully understand the syntax which defines the unpooled groups. After further examination I believe I found my error. The likelihood distribution should not provide a dims argument. In other words, the likelihood distribution should be Y = pm.Mixture("Y", w=p, comp_dists=components, observed=df.hours).

However, I’ve realized this model isn’t quite what I wish it to be. This is because the Dirichlet distribution is still fully pooled, whereas I would like it to be unpooled. How can I make it unpooled?

Can you explain a bit more about the whole model you have in mind. The mixture part still seems strange. Perhaps you have some code that generates data according to your model?

Sure. The data I am fitting represents the duration spent completing various repairs. A histogram suggests that the data is a mixture of Gaussians. It is multi-modal, where the largest peak occurs at 1 hour, the next largest peak at 24, then 48, 72 & 96. The data is also grouped into 24 groups, corresponding to different facilities that are carrying out this work. I am attempting to fit an unpooled model to this data as a stepping-stone to a partially pooled model. Does that help?

Do you have no predictors you can feed into the model? A model that’s Normal on b0 + x*b1 can look marginally multimodal (say if x comes in roughly 5 levels) the you wouldn’t want to fit it as a Mixture model but a GLM.

If indeed you want a mixture, do you want to have 5 means learned from the data for each group or pooled? If you don’t pool the means it doesn’t make sense to pool the weights (unless there’s more details you are aware of).

The Mixture API doesn’t allow easily to make heterogeneous observations (unless you have the same number of observations per site). So you would need to define different Mixture for each site, each with 5 components.

So far I have resisted incorporating predictors into this model in the spirit of keeping it simple. My plan was to try adding predictors after I had done my best to model the observed Y density.

My plan was to fit an unpooled mixture model to most groups that have the most data, using the resulting group means and standard deviations as a starting point for the hyper priors of the hierarchical model. In both of these models I believe it would be imprudent to fix the ratio of the 5 Gaussians across all groups. Instead I think each group should have its own ratios of Gaussians.

Then the easiest is to have a Mixture for each group instead of trying to combine all in one