Mixture model with Box-Cox transformation

Hello,

I am trying to build a model that can separate healthy and diseased result values in a distribution of laboratory values. I know that each component distribution is Gaussian when Box-Cox transformed.

My model's aim is to estimate the mean and standard deviation, as well as the appropriate lambda for the Box-Cox transformation, for each distribution.
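For reference, here is the transform written out in plain numpy (just for illustration; scipy provides it as sci.stats.boxcox and sci.special.inv_boxcox):

import numpy as np

def boxcox(x, ld):
    # (x**ld - 1) / ld for ld != 0, log(x) for ld == 0
    return np.log(x) if ld == 0 else (x**ld - 1.0) / ld

def inv_boxcox(y, ld):
    # inverse transform: (ld * y + 1) ** (1 / ld), exp(y) for ld == 0
    return np.exp(y) if ld == 0 else (ld * y + 1.0) ** (1.0 / ld)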

Here is the dummy data setup…

import numpy as np
import scipy.stats
import scipy.special
import scipy as sci

np.random.seed(12345)  # set random seed for reproducibility
k = 2  # number of mixture components
ndata = 4500
ndata2 = 500

# simulate data from mixture distribution
data = np.random.normal(loc=3.1, scale=np.sqrt(0.014),size=ndata)
data2 = np.random.normal(loc=2.9, scale=np.sqrt(0.022), size=ndata2)

# lambdas for the boxcox transformation
ld_1 = -0.18
ld_2 = -0.26

# Back-convert the Gaussians to the original data scale
data = sci.special.inv_boxcox(data, ld_1)
data2 = sci.special.inv_boxcox(data2, ld_2)

# Make the final array
combi_data = np.concatenate((data, data2), axis=0)
np.random.shuffle(combi_data)

ndata = 5000  # total number of observations in the combined array

The data are generated using parameters estimated from an actual hospital creatinine data set, if you are interested (Concordet et al., Clinica Chimica Acta 405 (2009) 43–48).
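As a quick sanity check (not part of the model, just to confirm the setup), transforming a component back with its own lambda should recover the Gaussian:

check = sci.stats.boxcox(data, ld_1)
print(check.mean(), check.var())  # roughly 3.1 and 0.014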

The issue I have is that the observed data have to be Box-Cox transformed before they can be passed to the mixture model. My idea was to transform the original data with both estimated lambda values on each iteration, then index which transformed values should be passed to the likelihood function, a little like this.

Here is my helper function, which returns a 2-D array with one column per Box-Cox transformation…

# helper function: stack both Box-Cox transformations column-wise
def box_cox(data, ld):
    a = sci.stats.boxcox(data, ld[0])  # transform with the first lambda
    b = sci.stats.boxcox(data, ld[1])  # transform with the second lambda
    return np.stack((a, b), axis=-1)   # shape (ndata, 2)
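A quick shape check of what this returns (just for illustration):

stacked = box_cox(combi_data, np.array([-0.18, -0.26]))
print(stacked.shape)  # (5000, 2): one column per lambda
# note that stacked[category] indexes along axis 0 (rows),
# not one column per observation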

And here's my model…

import pymc3 as pm
import theano.tensor as tt

model = pm.Model()
with model:
    # cluster sizes
    p = pm.Dirichlet('p', a=np.array([1., 1.]))
    # ensure all clusters have some points
    p_min_potential = pm.Potential('p_min_potential',
                                   tt.switch(tt.min(p) < .1, -np.inf, 0))

    # cluster centers
    means = pm.Normal('means', mu=[0, 0], sd=15, shape=2)
    # break symmetry
    order_means_potential = pm.Potential('order_means_potential',
                                         tt.switch(means[1] - means[0] < 0, -np.inf, 0))

    # measurement error
    sd = pm.Uniform('sd', lower=0, upper=2)

    # latent cluster of each observation
    category = pm.Categorical('category', p=p, shape=ndata)
    
    # lambda values are hard-coded initially, but eventually I want to use something like a Laplace prior
    ld = np.array((-0.18, -0.26))

    #BoundedLaplace= pm.Bound(pm.Laplace, lower=-1, upper=1)  
    #ld = BoundedLaplace('ld',mu=0, b=0.1, shape=2)

    # likelihood for each observed value
    points = pm.Normal('obs', mu=means[category], sd=sd,
                       observed=box_cox(combi_data, ld)[category])

This approach does not work, however. Can you suggest any possible ways to refactor this?

I think it might be better to start with a Mixture model. I spent some time thinking about a smart way to pass a transformation to Normal to model this, but in the end it is probably more straightforward to write a Box-Cox Normal distribution:

import pymc3 as pm
import theano.tensor as tt
from pymc3.distributions.continuous import Continuous
from pymc3.theanof import gradient


class BoxCoxNormal(Continuous):
    """Normal distribution on the Box-Cox transformed scale."""
    def __init__(self, mu=0., sd=1., lmbda=1., **kwargs):
        self.sd = tt.as_tensor_variable(sd)
        self.mu = tt.as_tensor_variable(mu)
        self.lmbda = tt.as_tensor_variable(lmbda)

        super(BoxCoxNormal, self).__init__(**kwargs)

    def inv_boxcox_func(self, x):
        # inverse Box-Cox: (1 + lmbda * x) ** (1 / lmbda)
        return tt.exp(tt.log1p(self.lmbda * x) / self.lmbda)

    def boxcox_func(self, y):
        # Box-Cox: (y ** lmbda - 1) / lmbda
        return tt.expm1(self.lmbda * tt.log(y)) / self.lmbda

    def jacobian_det(self, x):
        # log |d boxcox(x) / dx|, computed via theano autodiff
        x = tt.as_tensor_variable(x)
        grad = tt.reshape(
            gradient(tt.sum(self.boxcox_func(x)), [x]), x.shape)
        return tt.log(tt.abs_(grad))

    def logp(self, value):
        # Normal logp on the transformed scale plus the log-Jacobian correction
        sd = self.sd
        mu = self.mu
        value_ = self.boxcox_func(value)
        return pm.Normal.dist(mu, sd).logp(value_) + self.jacobian_det(value)
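As a quick check of the formulas (not from the notebook, just a sanity test), the tensor versions agree with scipy's reference implementation:

import numpy as np
from scipy import special

lam, y = -0.2, np.linspace(0.5, 5., 10)
manual = np.expm1(lam * np.log(y)) / lam  # same formula as boxcox_func
assert np.allclose(manual, special.boxcox(y, lam))
assert np.allclose(special.inv_boxcox(manual, lam), y)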

I cherry-picked code from the scipy Box-Cox (and its inverse) transformation functions, and the jacobian_det computation from pymc3's transform machinery.

So it works like this:

from scipy import special

mu0, sd0 = 3., .05
y = np.random.normal(mu0, sd0, size=200)
lam = -.2
y_tr = special.inv_boxcox(y, lam)

with pm.Model() as m:
    mu = pm.Normal('mu', 0., 10.)
    sd = pm.HalfNormal('sd', 5.)
    y_latent = BoxCoxNormal('y', mu, sd, lmbda=lam, observed=y_tr)
    trace = pm.sample(5000, tune=1000)

pm.traceplot(trace, lines=dict(mu=mu0, sd=sd0));

It recovers the mu and sd nicely.
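You can verify against mu0 and sd0 with, e.g.:

pm.summary(trace)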

To model the mixture, I go with something like below:

with pm.Model() as m:
    w = pm.Dirichlet('w', a=np.ones(2))
    mus = pm.Normal('mus', 2.8, 5., shape=2)
    sds = pm.HalfNormal('sds', .5, shape=2)
    mix_logp = [BoxCoxNormal.dist(mus[0], sds[0], lmbda=ld_1),
                BoxCoxNormal.dist(mus[1], sds[1], lmbda=ld_2),]
    obs = pm.Mixture('y', w, mix_logp, observed=combi_data)

Inference of mixtures is pretty difficult with NUTS (you can search the discourse for related discussion), but for this model, initializing with find_MAP seems to perform quite well:

with m:
    map1 = pm.find_MAP()
    trace = pm.sample(1000, tune=1000, start=map1)
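One more thing that can help with label switching (my addition, not in the notebook): put an ordered transform on the means instead of relying on initialization alone, e.g.

from pymc3.distributions.transforms import ordered

# constrain mus[0] < mus[1] so the components cannot swap labels
mus = pm.Normal('mus', 2.8, 5., shape=2,
                transform=ordered,
                testval=np.asarray([2.8, 3.1]))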

Also inferring lambda, however, is a bit difficult, as I think the Box-Cox transformation is very sensitive to the lambda parameter. I tried the model below, but the inference is very slow. I think reparameterization, a more informative prior, and/or a dedicated inference method for this parameter is needed.

with pm.Model() as m:
    w = pm.Dirichlet('w', a=np.ones(2))
    mus = pm.Normal('mus', 2.8, 5., shape=2)
    sds = pm.HalfNormal('sds', .5, shape=2)
    lmbdas = pm.InverseGamma('lambdas', 8., 2., shape=2,
                             testval=np.asarray([-ld_1, -ld_2]))
    mix_logp = [BoxCoxNormal.dist(mus[0], sds[0], lmbda=-lmbdas[0]),
                BoxCoxNormal.dist(mus[1], sds[1], lmbda=-lmbdas[1]),]
    obs = pm.Mixture('y', w, mix_logp, observed=combi_data)
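For instance, one could reuse the bounded Laplace prior suggested in the original post as the more informative prior (an untested sketch):

with pm.Model() as m_laplace:
    w = pm.Dirichlet('w', a=np.ones(2))
    mus = pm.Normal('mus', 2.8, 5., shape=2)
    sds = pm.HalfNormal('sds', .5, shape=2)
    # bounded Laplace prior on lambda, as in the original post
    BoundedLaplace = pm.Bound(pm.Laplace, lower=-1., upper=0.)
    lmbdas = BoundedLaplace('lmbdas', mu=-.2, b=.1, shape=2,
                            testval=np.asarray([ld_1, ld_2]))
    mix_logp = [BoxCoxNormal.dist(mus[0], sds[0], lmbda=lmbdas[0]),
                BoxCoxNormal.dist(mus[1], sds[1], lmbda=lmbdas[1])]
    obs = pm.Mixture('y', w, mix_logp, observed=combi_data)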

You can find everything in the notebook here: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/Box-Cox%20transformation.ipynb
