Hello,
I am trying to build a model that can identify healthy and diseased result values in a distribution of laboratory values. I know that each distribution is a gaussian when boxcox transformed.
My models aim is to estimate the mean and std as well as the appropriate lambda for the boxcox transformation for each distribution.
Here is the dummy data set up…
np.random.seed(12345) # set random seed for reproduciblity
k = 2
ndata = 4500
ndata2 = 500
# simulate data from mixture distribution
data = np.random.normal(loc=3.1, scale=np.sqrt(0.014),size=ndata)
data2 = np.random.normal(loc=2.9, scale=np.sqrt(0.022), size=ndata2)
# lambdas for the boxcox transformation
ld_1 = -0.18
ld_2 = -0.26
# Back convert the guassians to the original data scale
data = sci.special.inv_boxcox(data, ld_1)
data2 = sci.special.inv_boxcox(data2, ld_2)
# Make the final array
combi_data = np.concatenate((data, data2), axis=0)
np.random.shuffle(combi_data)
ndata = 5000
This is data generated using parameters estimated from an actual hospital data set for creatinine - if you are interested. (Concordet et al, Clinica Chimica Acta 405 (2009) 43-48.
The issue I have is that the observed data has to be transformed by the boxcox transformation before it can be passed to the mixture model. I had the idea that I could transform the original data by both the estimated lambda values on each itereation, then index which should be pass to likelihood function, a little like this
here is my helper function which returns a 2d array one for each boxcox transformation…
# helper function for boxcox array stacking
def box_cox(data, ld):
a = sci.stats.boxcox(data, ld[0])
b = sci.stats.boxcox(data, ld[1])
return np.stack((a, b), axis=-1)
And heres my model…
model = pm.Model()
with model:
# cluster sizes
p = pm.Dirichlet('p', a=np.array([1., 1.]))
# ensure all clusters have some points
p_min_potential = pm.Potential('p_min_potential', tt.switch(tt.min(p) < .1, -np.inf, 0))
# cluster centers
means = pm.Normal('means', mu=[0, 0], sd=15, shape=2)
# break symmetry
order_means_potential = pm.Potential('order_means_potential', tt.switch(means[1]-means[0] < 0, -np.inf,
0))
# measurement error
sd = pm.Uniform('sd', lower=0, upper=2)
# latent cluster of each observation
category = pm.Categorical('category', p=p, shape=ndata)
# lambda values are hard coded initially, but eventually I want to use something like Laplace
ld= np.array((-0.18, -0.26))
#BoundedLaplace= pm.Bound(pm.Laplace, lower=-1, upper=1)
#ld = BoundedLaplace('ld',mu=0, b=0.1, shape=2)
# likelihood for each observed value
points = pm.Normal('obs', mu=means[category], sd=sd, observed= box_cox(combi_data, ld)[category])
This approach does not work however. Can you suggest any possible ways to refactor this?