Naive Bayes model with PyMC3

Hello,

I’m new to PyMC3 and, to get started, I tried to implement a Naive Bayes (NB) model as described in this blog post, but for PyMC3 version 3.5. Here’s my initial attempt:

import numpy as np
import pymc3 as pm

K = 3 # number of topics
V = 4 # size of vocabulary

# 13 documents with 5 words each
data = np.array([[0, 0, 0, 0, 0], 
                 [0, 0, 0, 0, 1], 
                 [0, 0, 0, 1, 1], 
                 [0, 0, 0, 0, 1], 
                 [0, 0, 0, 0, 1], 
                 [0, 0, 0, 0, 1], 
                 [0, 0, 0, 0, 0], 
                 [1, 1, 1, 2, 2], 
                 [1, 1, 1, 1, 2], 
                 [1, 1, 1, 2, 1], 
                 [1, 1, 1, 2, 2], 
                 [2, 2, 2, 3, 3], 
                 [2, 2, 2, 3, 3]])

# number of documents
D = data.shape[0]

# Hyper-parameters
alpha = np.ones(K)
beta = np.ones(V)

with pm.Model() as model:
    # Global topic distribution
    theta = pm.Dirichlet("theta", a=alpha)
    
    # Word distributions for K topics
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))
    
    # Topic of documents
    z = pm.Categorical("z", p=theta, shape=D)
    
    # Words in documents
    for i in range(D):
        pm.Categorical(f"w_{i}", 
                       p=phi[z[i]], 
                       observed=data[i])

The results (not shown here) look reasonable, but I’m curious whether this implementation is correct at all and whether there’s a better way to implement it with PyMC3, especially the for loop at the end of the example. I’m aware of more efficient alternatives for estimating the parameters of an NB model, but I thought this would be a good example for getting started with PyMC3.

Thanks in advance for any help!

Best Regards,
Martin

I think your problem is quite similar to latent Dirichlet allocation (LDA) - you can have a look at this doc.

Otherwise, a general suggestion for mixture models like this is not to model the latent categories explicitly (i.e., avoid doing z = pm.Categorical("z", p=theta, shape=D)), but to rewrite it as a marginalized mixture model instead.
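Roughly, marginalizing means the discrete topic never appears in the model: each document’s likelihood becomes a weighted sum over the K topics (using the theta and phi from your model, and writing w_{d,n} for the n-th word of document d),

p(w_d \mid \theta, \phi) = \sum_{k=1}^{K} \theta_k \prod_{n=1}^{5} \phi_{k,\, w_{d,n}}

so the sampler only has to deal with the continuous theta and phi.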

Thanks for your response. I’ve seen the document you mentioned but wanted to have a much simpler example running. The implementation was actually inspired by the LDA example you posted earlier.

Regarding the marginalized mixture model suggestion: do you mean something comparable to the Marginalized Gaussian Mixture Model example but using the general Mixture class instead?

I tried using the Mixture class following the examples in the source code but got stuck applying it to my example here. Can you please show me how to rewrite my example as a marginalized mixture model? I tried several things but had problems matching the shapes of the variables and the observed data.

Yes, you do need to be careful to make the shapes work… Maybe rewriting the observed likelihood as a Potential or DensityDist is a bit easier, since you can reuse the loglike function from the doc and the LDA post.
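Something along these lines (an untested sketch, reusing K, V, D, alpha, beta, and data from your first post) would express the marginal log-likelihood directly as a Potential:

import theano.tensor as tt

with pm.Model() as model_marg:
    # Global topic distribution and per-topic word distributions
    theta = pm.Dirichlet("theta", a=alpha)           # shape (K,)
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))  # shape (K, V)

    # log phi[k, w] for every document, word and topic: shape (D, 5, K)
    log_phi_w = tt.log(phi.T[data])

    # per-document, per-topic log-likelihood plus the topic prior: shape (D, K)
    doc_topic_logp = log_phi_w.sum(axis=1) + tt.log(theta)

    # marginalize the topic out with logsumexp, then sum over documents
    pm.Potential("loglike", pm.math.logsumexp(doc_topic_logp, axis=1).sum())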

To verify my understanding, I implemented a marginalized multivariate GMM. First, I generated some data:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

K = 5
n_samples = 150

X, y = make_blobs(n_samples=n_samples, centers=K)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.viridis, alpha=.25);

[fig-1: scatter plot of the generated blob data, colored by true cluster]

and then fit the following model to data X (shape=(150, 2)):

# fixed identity covariance shared by all components
cov = np.array([[1, 0], 
                [0, 1]])

with pm.Model() as model:
    # mixture weights
    w = pm.Dirichlet('w', np.ones(K))

    components = []

    # one 2-D Gaussian component per cluster, each with its own mean
    for i in range(K):
        mu = pm.Normal('mu%i'%i, 0, 10, shape=2)
        components.append(pm.MvNormal.dist(mu=mu, cov=cov))

    # marginalized mixture likelihood over the K components
    xobs = pm.Mixture('x_obs', w, components, observed=X)

This works as expected, no problems here.
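For reference, this is roughly how I sampled and checked the recovered component means (just a sketch using the variables defined above):

with model:
    trace = pm.sample(1000, tune=1000)

# posterior means of the component locations (label switching aside)
print(np.stack([trace['mu%i' % i].mean(axis=0) for i in range(K)]))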

The marginalized version of the NB model I initially posted has the same structure (except for variable types):

# Definitions of K, V, D, data as in my initial post
components = []

with pm.Model() as model:
    theta = pm.Dirichlet("theta", a=alpha)

    for i in range(K):
        phi = pm.Dirichlet(f"phi_{i}", a=beta)
        components.append(pm.Categorical.dist(p=phi))

    obs = pm.Mixture('obs', w=theta, comp_dists=components, observed=data)        

but in this case I get

ValueError: Input dimension mis-match. (input[0].shape[2] = 3, input[1].shape[2] = 5)

which I do not understand. I can get a working model definition when I define a Mixture for each document:

with pm.Model() as model:
    theta = pm.Dirichlet("theta", a=alpha)

    components = []
    for i in range(K):
        phi = pm.Dirichlet(f"phi_{i}", a=beta)
        components.append(pm.Categorical.dist(p=phi))

    # one Mixture likelihood per document
    for i in range(D):
        pm.Mixture(f'obs_{i}', w=theta, comp_dists=components, observed=data[i])

but in this case the sampling results don’t make sense. For example, the inferred theta looks like this:

[fig-2: sampling results for the inferred theta]

I really would like to understand why

  • obs = pm.Mixture('obs', w=theta, comp_dists=components, observed=data) leads to a dimension mis-match error
  • fitting with the last model definition gives results for theta that do not make sense given the data. (With the non-marginalized NB model version from my initial post I get reasonable values for theta).

Please help me to answer these questions!

I will try using DensityDist later as you suggested, but I’d still like to know why the previous approach using Categorical doesn’t work. The example problem is so simple that I really would like to understand why it can’t be solved with built-in distributions/functions without resorting to DensityDist.

I will try to have a look soon - a bit busy this week :sweat_smile:

No rush, thanks a lot, appreciate it!

So here is how I would approach it.
First, I would rewrite the explicit model so that the observed data is a flattened array - it is easier to handle shape-wise:

# flatten the documents into one long vector of word tokens, and keep
# track of which document each token belongs to (both of length D*5)
data_flatten = np.reshape(data, data.shape[0]*data.shape[1])
data_index = np.repeat(np.arange(data.shape[0]), data.shape[1])

with pm.Model() as model:
    # Global topic distribution
    theta = pm.Dirichlet("theta", a=alpha)
    
    # Word distributions for K topics
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))
    
    # Topic of documents
    z = pm.Categorical("z", p=theta, shape=D)
    
    # Words in documents: pick each document's topic word-distribution
    # for every word token, shape (D*5, V)
    p = phi[z][data_index]
    w = pm.Categorical("w", p=p, observed=data_flatten)
    trace = pm.sample(1000, tune=1000)
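This is equivalent to your original model - phi[z] picks each document’s word distribution and data_index repeats it for every word token - just with a single vectorized Categorical instead of the per-document loop.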

Now, to turn this into a marginalized model, we need to avoid indexing p as p = phi[z] and instead keep the dimension of the latent topics. Computationally, we want the logp to be evaluated for each latent topic (not just for the “true” latent label):

with pm.Model() as model_marg:
    # Word distributions for K topics
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))
    
    # Per-document topic weights (replacing the discrete per-document topic z)
    z = pm.Dirichlet("z", a=alpha/2., shape=(D, K))
    
    # Global topic distribution
    theta = pm.Deterministic("theta", z.mean(axis=0))
    
    # Words in documents
    comp_dists = pm.Categorical.dist(phi)
    w = pm.Mixture("w",
                   w=z[data_index, :],
                   comp_dists=comp_dists,
                   observed=data_flatten)
    trace_marg = pm.sample(1000, tune=1000)

You can check the mixture component shapes to make sure that is the case:

w.distribution.w.tag.test_value.shape
w.distribution._comp_logp(data_flatten).tag.test_value.shape
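Both should come out as (65, 3) here - one row per word token (13 documents × 5 words) and one column per topic - so the mixture weights line up with the per-topic logp of each word.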

Of course, the two models are not identical, and there are inference problems like multi-modality for both of them… Mixture models are difficult, unfortunately.

I put everything in a notebook here: Planet_Sakaar_Data_Science/discourse_2314.ipynb at main · junpenglao/Planet_Sakaar_Data_Science · GitHub


Thanks a lot @junpenglao for taking the time to write this up. It helped me a lot!

Regarding inference problems, I found that results improve significantly when using sparse Dirichlet priors, i.e.

alpha = np.ones(K) * 0.5
beta = np.ones(V) * 0.5

For example, for the first model, the inferred theta matches the actual global topic distribution much better.

I couldn’t see that improvement for the marginalized mixture model though.

A more informative prior almost always helps :wink:

Using sparser Dirichlet priors also helps in the marginal case, I think - you can see the mode switching more clearly.

I used sparse Dirichlet priors here since sparse priors are typically assumed for LDA (which is similar to the example here).