Hi all,
This is an updated version of a question I asked earlier (here). I’m trying to simplify it so I wanted to start anew.
I’m using PyMC3 to do a Bayesian ANOVA and would like to move from normal distributions in my likelihood to mixture ones that better match the data.
I have 5 different groups to compare in my data, but I’m reducing it to 2 here for simplicity’s sake. Here’s some sample code that works fine using a gaussian distribution for the likelihood:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
import arviz as az
#####################
# Create a sample dataset
#####################
# Create some data for two groups
ages_group01 = pd.Series(np.round(np.random.normal(scale=5, size=500)) + 55)
ages_group02 = pd.Series(np.round(np.random.normal(scale=5, size=500)) + 35)
# Organize it into a data frame
sampleDF = pd.DataFrame(0, columns=['age', 'group'], index=range(0, 1000))
sampleDF.iloc[0:500, 0] = ages_group01
sampleDF.iloc[0:500, 1] = 1
sampleDF.iloc[500:1000, 0] = ages_group02
sampleDF.iloc[500:1000, 1] = 2
# Add dummy/indicator variables into the data frame
dummyVars = pd.get_dummies(sampleDF['group'])
dummyVars.columns = ['group01', 'group02']
sampleDF = sampleDF.join(dummyVars)
#####################
# Bayesian ANOVA
#####################
DV = 'age'
with pm.Model() as model:
# Define priors
sigma = pm.HalfCauchy('sigma', beta=10)
group01 = pm.Normal('group01', mu=40, sd=10)
group02 = pm.Normal('group02', mu=40, sd=10)
likelihood = pm.Normal(DV,
mu=group01 * sampleDF['group01'] + group02 * sampleDF['group02'],
sd=sigma,
observed=sampleDF[DV])
trace = pm.sample(draws=1000, tune=1000, chains=2, target_accept=0.95, cores=8, return_inferencedata=True)
However, when I change the PyMC model to the following:
#####################
# Bayesian ANOVA
#####################
DV = 'age'
with pm.Model() as model:
# Define priors
sigma = pm.HalfCauchy('sigma', beta=10)
group01 = pm.Normal('group01', mu=40, sd=10)
group02 = pm.Normal('group02', mu=40, sd=10)
# Create two distributions
w = pm.Dirichlet('w', a=np.array([0.3, 1]))
dist1 = pm.HalfNormal.dist(sigma=0.8)
dist2 = pm.Normal.dist(mu=group01 * sampleDF['group01'] + group02 * sampleDF['group02'], sigma=sigma)
# Define likelihood using a misture of the two distributions instead of the gaussian
likelihood = pm.Mixture(DV,
w=w,
comp_dists=[dist1, dist2],
observed=sampleDF[DV])
trace = pm.sample(draws=1000, tune=1000, chains=2, target_accept=0.95, cores=8, return_inferencedata=True)
I end up getting the following error:
I can, however, use similar code to produce the mixture distribution that I’m looking for:
w = pm.Dirichlet('w', a=np.array([0.3, 1]))
dist1 = pm.HalfNormal.dist(sigma=0.8)
dist2 = pm.Normal.dist(mu=10, sigma=1.5)
customDist = pm.Mixture('customDist', w=w, comp_dists = [dist1, dist2])
That makes a distribution something like the following:
I'm assuming the error stems from how the component distributions are being specified in the second model, but I'm looking for help on how to fix it. Any suggestions?