Hierarchical Mixture Model


#1

Hi,

I’m new to working with PyMC3 and I’m trying to specify a hierarchical mixture model to cluster house types (ie one story vs two story or something of that nature) based on their sale prices, using the house’s county and town as nested covariates. In my model, each town has mean house price beta_j, but I would like to add mixture priors to the model, where each town’s house prices are drawn from one of K normal distributions. I am able to include the hierarchical portion of the model without issues, but haven’t been successful trying to code the mixture priors for each town.

Below is an example of one of the models I have tried that hasn’t worked. I am getting an error because the number of observations being fed in is different than the number of distributions:

ValueError: Input dimension mis-match. (input[0].shape[0] = 10000, input[1].shape[0] = 3)

model = pm.Model()

with model:
  mu = pm.Normal('mu', mu=GM, sd=SD) # overall mean of house prices
  sigma = pm.HalfCauchy('sigma', beta=1) # overall standard

  alpha = pm.Normal('alpha', mu=mu, sd=sigma, shape=I) # county level mean
  gamma = pm.Uniform('gamma', lower=0, upper=sigma, shape=I) # county level standard deviation

  beta = pm.Normal('beta', # town level mean
                   mu=alpha[county_indices],
                   sd=gamma[county_indices],
                   shape=county_count)
  omega = pm.Uniform('omega', # town level standard deviation
                     lower=0,
                     upper=gamma[county_indices],
                     shape=county_count)

  p = pm.Dirichlet('p', np.ones(K), shape=(town_count, K)) # component mixing proportions
  theta = pm.Normal('theta', # component level means for each town
                    mu=beta[town_indices],
                    sd=omega[town_indices],
                    shape=(K, town_count))
  eta = pm.Uniform('eta', # component level standard deviations for each town
                   lower=0,
                   upper=omega[town_indices],
                   shape=(K, town_count))
  y = pm.NormalMixture('y',
                 p[indexed_transactions.town_idx.values, :],
                 mu=theta[:, indexed_transactions.town_idx.values],
                 sd=eta[:, indexed_transactions.town_idx.values],
                 observed=indexed_transactions.SalePrice)

  # approx = pm.fit(n=50000)
  trace = pm.sample(10000, chains=1)

I suspect that my issues may stem from either how I am specifying the Dirichlet priors, or how I am slicing the component-level mean and standard deviation matrices, but at this point I’m not too sure.

The code I am using the generate the data follows this post and is included below.

import pandas as pd
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt

I = 4 # number of counties
town_mult = 5 # number of towns per county
J = I*town_mult # total number of towns
K = 3 # number of mixture components
GM = 250000 # grand mean of all house prices
SD = 50000 # standard deviation of all house prices
N = 10000 # number of observations
spread = 200000 # difference between component means

comp_centers = np.array([GM-spread, GM, GM+spread])
comp_vals = np.random.randint(0, K, N)
alpha = np.random.normal(loc=comp_centers, scale=SD/2, size=(I, K)) # component mean for each county
county_vals = np.random.randint(0, I, N)
town_vals = np.ones(N)
for i in range(len(county_vals)):
  town_vals[i] = np.random.randint(county_vals[i]*town_mult,
                                   county_vals[i]*town_mult+I)

alpha = np.reshape(alpha, I*K)
beta = np.random.normal(loc=alpha, scale=SD/2, size=(J, I*K)) # component mean for each town
y = np.ones(N)
for i in range(N): # simulate individual house prices
  y[i] = np.random.normal(loc=beta[int(town_vals[i]),
                                   int((county_vals[i]*K)+comp_vals[i])],
                          scale=SD/2)

transactions = pd.DataFrame({'County': county_vals,
                             'Town': town_vals,
                             'SalePrice': y})

county_idx = transactions.groupby(['County']).all().reset_index().reset_index()[['index', 'County']]
county_idx.columns = ['county_idx', 'County']
county_town_idx = transactions.groupby(['County', 'Town']).all().reset_index().reset_index()[['index', 'County', 'Town']]
county_town_idx.columns = ['town_idx', 'County', 'Town']

county_idx_df = pd.merge(county_idx, county_town_idx, how='inner', on='County')

indexed_transactions = pd.merge(transactions,
                                county_idx_df,
                                how='inner',
                                on=['County', 'Town']).reset_index()

county_indices = county_idx_df.county_idx.values
county_count = len(county_indices)

town_indices = county_idx_df.town_idx.values
town_count = len(town_indices)

I appreciate any help you can offer!


#2

You need to pass the correct shape and comp_shape into pm.NormalMixture, because it is a mixture of multivariate normals.

I’m a bit lost by the many index values you have, so I’ll tell you the idea of how to structure the shape and comp_shape values. shape should be the resulting random variable’s shape, this means that it should not contain any information on the number of mixture components. comp_shape should be the shape of the multidimensional normal random variables that go into the mixture, plus the number of mixture components as the last axis.

Without the indexed_transactions.town_idx.values indexing, your mixture should be defined somewhat like this:

  theta = pm.Normal('theta', # component level means for each town
                    mu=beta[town_indices, None],  # Maybe you should tt.stack due to #3147
                    sd=omega[town_indices, None],  # Maybe you should tt.stack due to #3147
                    shape=(town_count, K))  # The mixture components must be the last index
  eta = pm.Uniform('eta', # component level standard deviations for each town
                   lower=0,
                   upper=omega[town_indices, None],  # Maybe you should tt.stack due to #3147
                   shape=(town_count, K))  # The mixture components must be the last index
  y = pm.NormalMixture('y',
                 p,
                 mu=theta,
                 sd=eta,
                 comp_shape=(town_count, K),
                 shape=(town_count,),
                 observed=...)

I hope this helps you figure out how to write up the correct shapes for your use case with the indexed_transactions indexing.


#3

It’s running now. Thanks for your help!