Mixture of continuous and discrete logp

The topic is very much related to my work too. My question may be a bit naive: looking at the code, it seems to me that the value passed into logp_ of mix_mixlogp contains the whole dataset, both normal and lognormal. So I am wondering how comp_dist.logp(value) in the list comprehension knows which portion of the data is relevant to its distribution when evaluating the log probability.

Inspired by this thread, I tried to create a mixture-model classification case with the following piece of code, mixing a binomial and a normal:

    # imports
    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt

    # Generate data: one binomial feature (shifted by 1) and one normal feature
    n, p = 1, .5
    n_data = 500
    data = np.dstack((
        np.random.binomial(n, p, size=n_data) + 1,
        np.random.normal(loc=1, scale=1, size=n_data)
    ))                                  # shape (1, n_data, 2)
    Y = np.random.randint(0, high=3, size=n_data, dtype='int')  # 3 class labels

    # Define the mixed logp: evaluate `value` under every component,
    # then mix the component log-probabilities with the weights `w`
    def mix_mixlogp(w, comp_dists):
        def logp_(value):
            print(value)  # debug: inspect what gets passed in as `value`
            comp_logp = tt.squeeze(tt.stack([comp_dist.logp(value)
                                             for comp_dist in comp_dists], axis=1))
            return pm.math.logsumexp(tt.log(w) + comp_logp, axis=-1)
        return logp_

    # Define model
    with pm.Model() as model:
        nbr = 1
        # mixture components of two features
        # A feature of binomially distributed data
        p = pm.Beta('p', alpha=1, beta=1)
        Xge = pm.Binomial('Xge', p=p, n=n, shape=nbr, observed=data[0, :, 0])
        # A feature of normally distributed data
        Xnos = pm.Normal('Xnos', mu=0, sd=1, shape=nbr, observed=data[0, :, 1])

        # weight vector for the mixtures
        # Assume the mixed features have three distribution components, matching the Y labels.
        mix_w = pm.Dirichlet('mix_w', a=np.array([1] * 3))

        # mixtures
        mixed = pm.DensityDist('mixed', mix_mixlogp(mix_w, [Xge, Xnos]), observed=Y)

I got an error:

    TypeError: can't turn [TensorConstant{[0. 1. 0. .. 2. 2. 0.]}] and {} into a dict. cannot convert dictionary update sequence element #0 to a sequence

It points to the cause in logp_(value):

          4         comp_logp = tt.squeeze(tt.stack([comp_dist.logp(value)
    ----> 5                                          for comp_dist in comp_dists], axis=1))

It seems to try to turn the data sequence into a dictionary. Does my approach make any sense?

Thanks in advance
Chris

Your example doesn't really make sense to me… Y is a discrete variable, but it is a mixture of a binomial and a normal? Also, the mixture components are already observed. I guess you can do comp_dist.distribution.logp(value) in your mix_mixlogp:

    # Define mixed logp
    def mix_mixlogp(w, comp_dists):
        def logp_(value):
            print(value)
            comp_logp = tt.squeeze(tt.stack([comp_dist.distribution.logp(value) 
                                             for comp_dist in comp_dists], axis=1))
            return pm.math.logsumexp(tt.log(w) + comp_logp, axis=-1)
        return logp_

But I am not sure whether it really makes sense.

The advantage of a mixture model is that you don't need to know which portion of the data is from which component. You don't need a discrete latent label, as every data point is evaluated under all components, and the weights (after inference) inform us which component each data point is more likely to belong to.
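To illustrate (with made-up data and a stock NormalMixture, not the model from this thread): every point is scored under every component, and per-point component probabilities can be recovered afterwards from the fitted weights and component densities:

    import numpy as np
    import pymc3 as pm
    from scipy import stats

    # made-up two-component data
    y = np.concatenate([np.random.normal(-2, 1, size=300),
                        np.random.normal(3, 1, size=200)])

    with pm.Model() as m:
        w = pm.Dirichlet('w', a=np.ones(2))
        mu = pm.Normal('mu', mu=0, sd=5, shape=2)
        # no latent labels anywhere: each y_i is evaluated under both components
        pm.NormalMixture('obs', w=w, mu=mu, sd=1., observed=y)
        trace = pm.sample(1000)

    # plug in posterior means; resp[k, i] is roughly p(component k | y_i)
    w_hat, mu_hat = trace['w'].mean(axis=0), trace['mu'].mean(axis=0)
    dens = np.stack([w_hat[k] * stats.norm(mu_hat[k], 1.).pdf(y)
                     for k in range(2)])
    resp = dens / dens.sum(axis=0)  # columns sum to 1 over components

Each column of resp says how strongly each component claims that data point, even though no latent label was ever sampled.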

Thanks for elaborating on both questions. The classification question was about whether a multidimensional space that mixes a discrete subspace and a continuous subspace can support a classification problem. Conceptually, I would say yes: models parametrized over both the discrete and the continuous subspaces could classify data in the space if one observes the predictors (discrete and continuous) and the labels, i.e. p(y=c | x_discrete, x_continuous, theta_discrete, theta_continuous), much like a fixed mixture model except that here the discrete subspace is involved as well. Does that make sense, and is it possible to construct a probabilistic graphical model to do it? Maybe there is a more sensible/correct way to model this problem in PyMC3.
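If it helps to be concrete, here is roughly the kind of model I have in mind, written as a softmax regression whose linear predictor combines both kinds of inputs; all the data, names, and shapes below are made up purely for illustration:

    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt

    n_data, n_class, n_levels = 500, 3, 2
    x_disc = np.random.randint(0, n_levels, size=n_data)  # discrete predictor
    x_cont = np.random.normal(size=n_data)                # continuous predictor
    y = np.random.randint(0, n_class, size=n_data)        # fake labels

    with pm.Model() as clf:
        # one set of class scores per level of the discrete predictor
        a = pm.Normal('a', 0., 1., shape=(n_levels, n_class))
        # class-specific slopes for the continuous predictor
        b = pm.Normal('b', 0., 1., shape=n_class)
        eta = a[x_disc] + b * x_cont[:, None]    # (n_data, n_class)
        p = tt.nnet.softmax(eta)
        pm.Categorical('y', p=p, observed=y)
        trace = pm.sample(1000)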

Thank you
Chris

Well, you can certainly write down such a model in PyMC3, as we don't enforce the observed data type, apart from some distributions where the observed is used to index variables. I would rather ask: why would it make sense for the observed to be a mixture of discrete and continuous? And if so, why not model the discrete observed and the continuous ones separately?

Thank you for your speedy reply. In real applications for classification and/or clustering, the observations often contain both discrete and continuous variables, and there are often causal relations between them. An easy way is to model them separately as you suggested, but that may lose the causal structure that would be important for good estimation in the analysis. Actually, this is something I am interested in and researching. I am trying to find out whether good packages like PyMC3 allow me to model the problem the way I imagine it. I may be applying it clumsily since I am not familiar with PyMC3. Please point it out if you think my proposition is still meaningless.

Regards
Chris

Causal inference is most commonly captured by the model's graph structure (at least in the Bayesian DAGs that PyMC3 can model), so even if you model the discrete and continuous observed separately, you can still model the causal relationship. For example, you can use parameters that are shared between the discrete and the continuous variables.
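For instance, a toy sketch (all names and numbers invented here): a single latent parameter can sit upstream of both a discrete and a continuous observed variable, so the two likelihoods share information through it:

    import numpy as np
    import pymc3 as pm

    disc_obs = np.random.binomial(1, 0.7, size=200)   # fake discrete data
    cont_obs = np.random.normal(0.8, 1.0, size=200)   # fake continuous data

    with pm.Model() as shared:
        z = pm.Normal('z', mu=0., sd=1.)              # shared latent parameter
        # both observed variables depend on the same z
        pm.Bernoulli('d', p=pm.math.sigmoid(z), observed=disc_obs)
        pm.Normal('c', mu=z, sd=1., observed=cont_obs)
        trace = pm.sample(1000)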

I think it will become clearer once you have a concrete setup and data. Feel free to come back and update this thread when you start building your model.

Thanks for your suggestion and kind help. The discrete and continuous variables rarely share parameters, so the causal relation is more implicit/explicit among the discrete and continuous variables themselves. For example, a discrete variable may be blended into a linear function serving as the mean of a normal distribution over a continuous one. In the case I have, there is a categorical diagnosis, continuous prescription doses, and a categorical assessment (class labels) for each record of the predictors (e.g. diagnosis and prescription). This is how I came up with p(y=c | x_discrete, x_continuous, theta_discrete, theta_continuous) as shown before. I could alternatively model prescription as depending on diagnosis, which might make the class labels independent of diagnosis given prescription, but I do have cases in the project where the class labels depend on both the continuous and the discrete variables.

So far this is an estimation-for-classification problem. I would also like to turn it into a clustering problem by ignoring the labels, and then see whether the unlabeled-data clusters and the classification groups overlap reasonably well in distribution. Hopefully this provides a convincing setup for why I have discrete and continuous variables in the same model.
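As a rough sketch of how I imagine this DAG might look in PyMC3 (placeholder data and dimensions only, not my real dataset): diagnosis indexes the mean of the dose distribution, and the assessment is a categorical outcome driven by both diagnosis and dose:

    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt

    n, n_diag, n_class = 400, 4, 3
    diag = np.random.randint(0, n_diag, size=n)   # categorical diagnosis
    dose = np.random.normal(diag * 0.5, 1.0)      # continuous prescription dose
    y = np.random.randint(0, n_class, size=n)     # categorical assessment

    with pm.Model() as med:
        # dose depends on diagnosis: diagnosis indexes the dose mean
        mu_dose = pm.Normal('mu_dose', 0., 5., shape=n_diag)
        pm.Normal('dose', mu=mu_dose[diag], sd=1., observed=dose)
        # assessment depends on both diagnosis and dose
        a = pm.Normal('a', 0., 1., shape=(n_diag, n_class))
        b = pm.Normal('b', 0., 1., shape=n_class)
        p = tt.nnet.softmax(a[diag] + b * dose[:, None])
        pm.Categorical('y', p=p, observed=y)
        trace = pm.sample(1000)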

Regards
Chris