Hi,
I’m new to working with PyMC3 and I’m trying to specify a hierarchical mixture model to cluster house types (ie one story vs two story or something of that nature) based on their sale prices, using the house’s county and town as nested covariates. In my model, each town has mean house price beta_j
, but I would like to add mixture priors to the model, where each town’s house prices are drawn from one of K
normal distributions. I am able to include the hierarchical portion of the model without issues, but haven’t been successful trying to code the mixture priors for each town.
Below is an example of one of the models I have tried that hasn’t worked. I am getting an error because the number of observations being fed in is different than the number of distributions:
ValueError: Input dimension mis-match. (input[0].shape[0] = 10000, input[1].shape[0] = 3)
model = pm.Model()
with model:
mu = pm.Normal('mu', mu=GM, sd=SD) # overall mean of house prices
sigma = pm.HalfCauchy('sigma', beta=1) # overall standard
alpha = pm.Normal('alpha', mu=mu, sd=sigma, shape=I) # county level mean
gamma = pm.Uniform('gamma', lower=0, upper=sigma, shape=I) # county level standard deviation
beta = pm.Normal('beta', # town level mean
mu=alpha[county_indices],
sd=gamma[county_indices],
shape=county_count)
omega = pm.Uniform('omega', # town level standard deviation
lower=0,
upper=gamma[county_indices],
shape=county_count)
p = pm.Dirichlet('p', np.ones(K), shape=(town_count, K)) # component mixing proportions
theta = pm.Normal('theta', # component level means for each town
mu=beta[town_indices],
sd=omega[town_indices],
shape=(K, town_count))
eta = pm.Uniform('eta', # component level standard deviations for each town
lower=0,
upper=omega[town_indices],
shape=(K, town_count))
y = pm.NormalMixture('y',
p[indexed_transactions.town_idx.values, :],
mu=theta[:, indexed_transactions.town_idx.values],
sd=eta[:, indexed_transactions.town_idx.values],
observed=indexed_transactions.SalePrice)
# approx = pm.fit(n=50000)
trace = pm.sample(10000, chains=1)
I suspect that my issues may stem from either how I am specifying the Dirichlet priors, or how I am slicing the component-level mean and standard deviation matrices, but at this point I’m not too sure.
The code I am using the generate the data follows this post and is included below.
import pandas as pd
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
I = 4 # number of counties
town_mult = 5 # number of towns per county
J = I*town_mult # total number of towns
K = 3 # number of mixture components
GM = 250000 # grand mean of all house prices
SD = 50000 # standard deviation of all house prices
N = 10000 # number of observations
spread = 200000 # difference between component means
comp_centers = np.array([GM-spread, GM, GM+spread])
comp_vals = np.random.randint(0, K, N)
alpha = np.random.normal(loc=comp_centers, scale=SD/2, size=(I, K)) # component mean for each county
county_vals = np.random.randint(0, I, N)
town_vals = np.ones(N)
for i in range(len(county_vals)):
town_vals[i] = np.random.randint(county_vals[i]*town_mult,
county_vals[i]*town_mult+I)
alpha = np.reshape(alpha, I*K)
beta = np.random.normal(loc=alpha, scale=SD/2, size=(J, I*K)) # component mean for each town
y = np.ones(N)
for i in range(N): # simulate individual house prices
y[i] = np.random.normal(loc=beta[int(town_vals[i]),
int((county_vals[i]*K)+comp_vals[i])],
scale=SD/2)
transactions = pd.DataFrame({'County': county_vals,
'Town': town_vals,
'SalePrice': y})
county_idx = transactions.groupby(['County']).all().reset_index().reset_index()[['index', 'County']]
county_idx.columns = ['county_idx', 'County']
county_town_idx = transactions.groupby(['County', 'Town']).all().reset_index().reset_index()[['index', 'County', 'Town']]
county_town_idx.columns = ['town_idx', 'County', 'Town']
county_idx_df = pd.merge(county_idx, county_town_idx, how='inner', on='County')
indexed_transactions = pd.merge(transactions,
county_idx_df,
how='inner',
on=['County', 'Town']).reset_index()
county_indices = county_idx_df.county_idx.values
county_count = len(county_indices)
town_indices = county_idx_df.town_idx.values
town_count = len(town_indices)
I appreciate any help you can offer!