Dependent histogram count data

Hey!

This is perhaps a broad question not limited specifically to PyMC, but I figured this is a good place to ask :slight_smile:

My input data is standard (as in, some vector of covariates for each sample), but my observations are less so. In fact, my observations form a matrix (one vector per sample) of histogram count data.
Specifically, for each sample, the target is a vector of counts for each bin, which are of course somewhat correlated:

index | X | bin_0 | bin_1 | bin_2 | ... | bin_N
------+---+-------+-------+-------+ ... +-------
1     |   | 0     |   0   |   2   | ... | 0
2     |   | 1     |   3   |   5   | ... | 0
...
K     |   | 5     |   2   |   1   | ... | 2

My modelling attempt so far has been to treat each bin as its own target with a Poisson distribution. Since the bins are correlated, I'm trying to tie them together by drawing the vector of per-bin lambdas from a multivariate Normal distribution, where the covariance matrix is fitted and/or depends on the input variables.
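
In other words, what I'm aiming for is roughly y_{k,b} \sim \text{Poisson}(\lambda_{k,b}) for sample k and bin b, with the vector \boldsymbol{\lambda}_k = (\lambda_{k,0}, \dots, \lambda_{k,N}) drawn jointly from \mathcal{N}(\mu(x_k), \Sigma).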

With PyMC, I'm running head-first into dimensionality issues. I'm also not very confident in my model; it feels like overkill.

With that in mind, any suggestions? This is my first time trying to model more than a single response variable for each sample, so I’m not even sure what to search for :thinking:

Thanks!

You sound like you're on the right track. I've fit similar models with both negative binomial and Poisson likelihoods on ~10K sample vectors, each 100-200 dimensions long, with no problems, so there may be some optimization we can help you with.

lambda (for each bin) is drawn from a multivariate Normal distribution,

Usually, we model \log \lambda \sim N(\mu, \sigma) rather than modeling \lambda directly, so that \lambda stays positive. Is that what you're doing?

Yes, exactly, I’m drawing \log \lambda from the multivariate distribution.

Do you have some example code to share on this? I keep running into dimensionality issues.

EDIT: I managed to resolve the dimensionality issues, but I can barely train a model with ~100 samples. This is what I have so far:

with pm.Model(coords={"bin": bins, "index": train_Y.index}) as model:
    # Linear predictor on the log scale: intercept plus one coefficient RV per feature
    mu = pm.Normal("Intercept", mu=0, sigma=10, dims=("index", "bin"))
    for feature_name in features:
        mu += pm.Normal(feature_name, mu=0, sigma=10, dims=("index", "bin")) * train_X[[feature_name]].to_numpy()

    # Correlated per-bin deviations via an LKJ prior on the Cholesky factor
    sd_dist = pm.Exponential.dist(1.0, shape=n_bins)
    chol, corr, stds = pm.LKJCholeskyCov("chol_cov", n=n_bins, eta=2, sd_dist=sd_dist, compute_corr=True)
    vals = pm.MvNormal("vals", mu=mu, chol=chol, dims=("index", "bin"))

    lambdas = pm.math.exp(vals)

    y = pm.Poisson("Poisson", lambdas, observed=train_Y[bins].round())

    results = pm.sample(chains=4, return_inferencedata=True)

This part looks pretty inefficient because of the creation of a new RV for each column in X. Can you instead represent this with a single RV with dims ("features", "index", "bin") and then do a single multiplication with X[all_feature_cols]?

Otherwise, your model looks pretty good. How many bins do you have?

As an added note, you may want to consider parameterizing the covariance matrix with reduced rank to save on the number of parameters. If you think that is an appropriate strategy here, then I’d be happy to provide further advice.
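
To give a flavor of what I mean, here is a rough sketch of one latent-factor parameterization (not tuned to your problem; n_factors, W, and z are made-up names, and the placeholder mu stands for whatever regression mean you build):

n_factors = 3  # rank << number of bins; purely illustrative
coords = {"bin": bins, "index": train_Y.index, "factor": range(n_factors)}
with pm.Model(coords=coords) as low_rank_model:
    mu = 0.0  # placeholder: your (index, bin) regression mean goes here
    W = pm.Normal("W", 0.0, 1.0, dims=("bin", "factor"))      # factor loadings
    z = pm.Normal("z", 0.0, 1.0, dims=("index", "factor"))    # per-sample latent factors
    sigma = pm.Exponential("sigma", 1.0, dims="bin")          # bin-specific residual scale
    # Marginally, log-lambda then has covariance W @ W.T + diag(sigma**2):
    # a rank-n_factors approximation instead of a full bins-by-bins matrix.
    log_lam = pm.Normal("log_lam", mu=mu + pm.math.dot(z, W.T), sigma=sigma, dims=("index", "bin"))
    y = pm.Poisson("y", pm.math.exp(log_lam), observed=train_Y[bins].round())

The trade-off is that you only estimate n_bins * n_factors loadings plus n_bins scales instead of the full Cholesky factor.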

Ah yeah, that’s a good idea! I’ll give it a try.

I have 31 bins in total. I'm interested in hearing more about the reduced rank approach, though I've never tried anything like it, so some additional resources would be helpful too.

Thank you for your help!

EDIT: Making it a single RV as suggested complicated things for me; I'm again running into dimensionality issues. If the dims are ("index", "features", "bin") (i.e. a 3D tensor), then since X[features] is a 2D matrix, I get shape mismatches.
I tried repeating the features so that they're replicated for each bin (with np.repeat), but then the multivariate distribution fails with:

ValueError: operands could not be broadcast together with shapes (315,31) (315,5,31)

To mimic the original behaviour, I then added a mu = pm.math.sum(mu, axis=1). Was this roughly along your suggestion? Then the only thing I’m missing is a bias, which I suppose I can add as a constant feature rather than a separate RV.

Either way, with 5 features and ~300 samples it takes well over 5 h to fit the model. Is that in line with the expected runtime?

For the first tensor, use array[:, None, :] to give it a shape of (315,1,31) which will then broadcast successfully against the second tensor.
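
In numpy terms (dummy arrays standing in for your tensors):

a = np.zeros((315, 31))       # e.g. the (index, bin) tensor
b = np.zeros((315, 5, 31))    # e.g. the (index, feature, bin) tensor
(a[:, None, :] * b).shape     # -> (315, 5, 31); the inserted axis broadcasts over features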

To mimic the original behaviour, I then added a mu = pm.math.sum(mu, axis=1).

Yup, that looks right.

For the first tensor, use array[:, None, :] to give it a shape of (315,1,31) which will then broadcast successfully against the second tensor.

Ah, of course. I tried with np.expand_dims(...) but just picked the wrong axis :man_facepalming:
Thanks again :slight_smile:

Any optimization tricks you could refer me to would be very much appreciated. Those 315 samples are about 1% of my observations, and even that takes an incredibly long time.

To recap, the model now looks like so:

# `bins` is a list of column names with the bin counts
# `train` is a dataframe with ~1% of the data (for now)
# `features` is a list of < 10 column names to use as features
coords = {"bin": bins, "index": train.index, "feature": features}
with pm.Model(coords=coords) as model:
    # One coefficient per (sample, feature, bin), with a tight prior
    mu = pm.Normal("features", dims=("index", "feature", "bin"), sigma=0.1)
    # Broadcast the covariates over bins: (K, F, 1) * (K, F, B)
    mu = mu * train[features].to_numpy()[:, :, None]
    # Sum over the feature axis -> (K, B) linear predictor on the log scale
    mu = pm.math.sum(mu, axis=1)

    # Correlated per-bin deviations via an LKJ prior on the Cholesky factor
    sd_dist = pm.Exponential.dist(1.0, shape=n_bins)
    chol, corr, stds = pm.LKJCholeskyCov("chol_cov", n=n_bins, eta=2, sd_dist=sd_dist, compute_corr=True)
    vals = pm.MvNormal("vals", mu=mu, chol=chol, dims=("index", "bin"))

    lambdas = pm.math.exp(vals)

    y = pm.Poisson("Poisson", lambdas, observed=train[bins].round())
    results = pm.sample(chains=4, return_inferencedata=True)

I've been trying out a couple of variants of the desired model in this Colab notebook, which is still chugging through the sampling-time comparison.

One thing you may want to keep in mind is that every multivariate Gaussian random variable has a representation as a draw from a spherical Gaussian multiplied by a lower-triangular matrix (the LT model in the notebook). This can be faster and more stable, depending on the application.
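
Concretely, the swap looks something like this (a sketch assuming the same mu and chol as in your model; not code lifted from the notebook):

# Inside the same `with pm.Model(...)` block, replacing the MvNormal:
z_raw = pm.Normal("z_raw", 0.0, 1.0, dims=("index", "bin"))   # spherical (uncorrelated) draws
# chol is the lower-triangular factor from LKJCholeskyCov, so each row of
# z_raw @ chol.T has covariance chol @ chol.T, i.e. the same distribution as the MvNormal
vals = pm.Deterministic("vals", mu + pm.math.dot(z_raw, chol.T))
lambdas = pm.math.exp(vals)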

Thank you so much @ckrapu! Your Colab also helped me spot an additional error in my own notebooks, namely that I still had import pymc3 as pm. That fix, plus the use of JAX, reduced the sampling time to something more reasonable.

As an aside, I wonder what you think about a similar model that uses a multinomial distribution. Instead of \log \lambda_i \sim \mathcal{N}_M(\mu, \Sigma) with a Poisson likelihood, the draws would be \log\tilde{p}_i \sim \mathcal{N}_M(\mu, \Sigma), followed by p_i=\frac{\exp(\log\tilde{p}_i)}{\sum_j\exp(\log\tilde{p}_j)}, which is then used directly in the multinomial distribution. Does that make sense?

I wonder what you think about a similar model that utilizes a multinomial distribution.

Yes, that can work quite well, but one thing to keep in mind is that the values of \log\tilde{p}_i are only identified up to an additive constant (adding the same constant to every component leaves the softmax unchanged). Thus, to prevent sampling issues you will need to pick a reference category and fix its \log\tilde{p}_i.

The nice thing about this setup is that you can make \log\tilde{p}_i the sum of several terms representing different influences: a regression term, a spatially correlated latent variable, etc.

To clarify a little more regarding the reference category: the essential point is that, because probability vectors must sum to 1, it only takes K-1 numbers to define a K-dimensional probability vector. Thus, what I usually do is build the model for K-1 bins and then prepend a zero element (if modeling a single count) or a zero vector (when modeling many counts) with pm.math.concatenate, giving a K-dimensional array for \log \lambda or similar.
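
As a rough sketch of what that can look like (illustrative names only; n_samples and counts are placeholders for your data):

# Sketch only; goes inside a pm.Model whose coords include "index" and "bin_free"
# (the K-1 non-reference bins), with numpy imported as np.
eta_free = pm.Normal("eta_free", 0.0, 1.0, dims=("index", "bin_free"))       # K-1 free logits per sample
eta = pm.math.concatenate([np.zeros((n_samples, 1)), eta_free], axis=1)      # prepend the reference bin, pinned at 0
p = pm.math.exp(eta) / pm.math.sum(pm.math.exp(eta), axis=1, keepdims=True)  # softmax over the K bins
y = pm.Multinomial("y", n=counts.sum(axis=1), p=p, observed=counts)          # counts: (n_samples, K) array of bin counts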
