Mixture of Censored iid Normals

Continuing on from here:

where I asked about mixtures of normals, thanks to @jessegrabowski and @ricardoV94 I have come to realize that, due to how the sampling works, a mixture of MvNormals with diagonal covariance is not identical to a mixture of univariate normals: the MvNormal version can actually sample more efficiently (somewhat, if not completely, remedying label mixing).

However, as it turns out, MvNormal is not applicable to the setting where I want to use the mixture model. Following up on @ricardoV94's question, I wanted to open a separate topic for this. A summary of how the data looks in my case:

1- Assume we are given a cluster's center coordinates, an n-dimensional vector with values between 0 and 10.
2- Any data point i around this center is given as center + bias_i + (iid noise on each coordinate)_i. Moreover there is thresholding: any value below 1 is set to 1.

Here is how I would generate it:

import numpy as np
import pymc as pm
import pytensor.tensor as ptt
import matplotlib.pyplot as plt

def plot_clusters(centers, data, max_val):

  nclusters = centers.shape[0]
  fig,ax = plt.subplots(1, nclusters, figsize=(5*nclusters, 5))

  for i in range(nclusters):
    ax[i].plot(centers[i,:], color="red", zorder=1)
    ax[i].plot(data[i].T, color="black", alpha=0.3, zorder=0)
    ax[i].set_ylim(0, max_val+3)
    ax[i].set_yticks(range(0, max_val+3))
    ax[i].grid("on", alpha=0.3)

ndims = 7
nclusters = 5
max_val = 8
lower = 1
bias_sd = 1
noise_sd = 0.5
upper = np.inf
seed = 0

rng = np.random.default_rng(seed)
ndata_per_cluster = rng.integers(20,50, size=nclusters)
ndata = int(np.sum(ndata_per_cluster))

cluster_centers = rng.random((nclusters, ndims))*max_val
data = []

for i in range(nclusters):
   # each datum = cluster center + per-datum bias + iid noise, thresholded at `lower`
   biases = rng.normal(0, bias_sd, (ndata_per_cluster[i],))[:,None]
   noise = rng.normal(0, noise_sd, (ndata_per_cluster[i], ndims))
   cluster_data = cluster_centers[i:i+1,:] + biases + noise
   cluster_data[cluster_data<lower] = lower
   data.append(cluster_data)

plot_clusters(cluster_centers, data,  max_val)
data = np.concatenate(data)

So the model that I tried initially was as follows (normally I would get init and center_mus from k-means or something similar, but for now I will use the true centers):

#normally init would come from spectral or k-means clustering with some
#bias correction on the data
cluster_centers = cluster_centers[np.argsort(cluster_centers[:,0]),:]
cluster_centers[cluster_centers<lower] = lower
init = [cluster_centers[:,0], cluster_centers[:,1:]]
center_mus =  [cluster_centers[:,0], cluster_centers[:,1:]]
mu_sigma = 1
nclusters_fit = nclusters
bias_sd = 2
conc = 10


with pm.Model() as model:

  μ0 = pm.Normal("μ0",
                 mu=center_mus[0],
                 sigma=mu_sigma,
                 shape=nclusters_fit,
                 transform=pm.distributions.transforms.ordered,
                 initval=init[0])

  μ1 = pm.Normal("μ1",
                 mu=center_mus[1],
                 sigma=mu_sigma,
                 shape=(nclusters_fit, ndims-1),
                 initval=init[1])

  μ = pm.Deterministic("μ", ptt.concatenate([μ0[:,None], μ1], axis=1))
  σ = pm.InverseGamma("σ", 2, 1)
  biases = pm.Normal("biases", 0, bias_sd, shape=(ndata))
  weights = pm.Dirichlet("w", conc*np.ones(nclusters_fit))
  components = [pm.Normal.dist(μ[i, :][None, :] + biases[:, None], σ)
                for i in range(nclusters_fit)]
  mix = pm.Mixture.dist(w=weights, comp_dists=components)

  pm.Censored("likelihood", mix, lower=lower, upper=upper, observed=data)

  trace = pm.sample(draws=2000, tune=1000, chains=6, target_accept=0.95)

There is quite a bit of cheating here, I realize, but I did this as a proof of concept. Even though I supplied censored centres as initial values, the model correctly recovers the centre components that are below 1, so censoring is indeed doing what it is supposed to do. Multi-modality of the posteriors is not horrible, just some here and there. I know I am still cheating by supplying the true centres, though in principle centers from k-means or spectral clustering work just as well (after some preprocessing for the bias).

The question is: how do I make this model less prone to centre switching without cheating? I can still supply initial values from k-means etc., but it seems like the non-mixing property of MvNormal would also be quite a boon in this direction.

Note that I am fine with using an MvNormal with a diagonal covariance in which the components are independent (here they are actually iid). In principle I guess one could define a custom censored distribution for such an MvNormal: whenever one component is below the lower bound, the likelihood would be the CDF of that component multiplied by the PDF of the rest (see the sketch after the list below). But

1- I don't really know where to start; I don't know the inner workings of PyMC well enough to understand how to construct this,
2- If I define a custom distribution like this, would I again be defeating the non-mixing property of MvNormal?
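
For reference, here is a minimal sketch of the likelihood idea described above, written as a plain log-probability function rather than a proper PyMC distribution. The name censored_iid_normal_logp and its signature are my own invention (not part of the PyMC API); it only encodes "coordinates at the bound contribute the normal log-CDF at lower, the rest contribute the normal log-PDF", summed over the iid coordinates:

import pymc as pm
import pytensor.tensor as ptt

def censored_iid_normal_logp(value, mu, sigma, lower):
    # value and mu broadcast to (..., ndims); sigma and lower are scalars
    normal = pm.Normal.dist(mu, sigma)
    logcdf_at_lower = pm.logcdf(normal, lower)  # censored coordinates
    logpdf = pm.logp(normal, value)             # uncensored coordinates
    per_coord = ptt.switch(ptt.le(value, lower), logcdf_at_lower, logpdf)
    return per_coord.sum(axis=-1)               # iid coordinates multiply, so logs add

Something like this could presumably be passed to a pm.CustomDist via its logp argument, but I have not worked out how it would interact with pm.Mixture.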

PS: As a side note, if I deal with fewer clusters, such as 3, just supplying the initial values as centers (say from k-means) while keeping the center_mus vague (i.e. all 4) with a large mu_sigma like 5 also works fine when using init='advi+adapt_diag'. I haven't tested without ADVI yet, but based on previous experience I feel it makes a positive impact.

If you are not afraid of trying something more experimental, MarginalModel in pymc-experimental may do what you need: MarginalModel — pymc_experimental 0.0.13 documentation

import pymc as pm
import numpy as np

from pymc_experimental import MarginalModel
from pymc.distributions.transforms import ordered

n_clusters = 5
coords = {
    "cluster": range(n_clusters), 
    "ndim": ("x", "y"), 
    "obs": range(10),
}
with MarginalModel(coords=coords) as m:
    idx = pm.Categorical("idx", p=np.ones(n_clusters) / n_clusters, dims=("obs",))

    mu_x = pm.Normal("mu_x", dims=("cluster",), transform=ordered, initval=np.linspace(-1, 1, n_clusters))
    mu_y = pm.Normal("mu_y", dims=("cluster",))
    mu = pm.math.concatenate([mu_x[..., None], mu_y[..., None]], axis=-1)

    sigma = pm.HalfNormal("sigma")


    def dist(idx, mu, sigma, lower, upper, _):
        # Define a mixture where the output x, y depends on the value of idx
        rv = pm.Censored.dist(pm.Normal.dist(mu[0][:, None], sigma), lower=lower, upper=upper, shape=data.shape)
        for i in range(n_clusters - 1):
            rv = pm.math.where(
                pm.math.eq(idx, i),
                rv,
                pm.Censored.dist(pm.Normal.dist(mu[i + 1][:, None], sigma), lower=lower, upper=upper, shape=data.shape),
            )
        return rv


    lower = -3
    upper = 3
    data = np.array([[-1, -1], [0, 0], [1, 1], [2, 2], [-3, -3]] * 2).T.astype("float")
    censored_mix = pm.CustomDist(
        "censored_mix",
        idx, mu, sigma, lower, upper,
        dist=dist,
        observed=data,
        dims=("ndim", "obs"),
    )

# Marginalize away the idx, creating a Mixture likelihood
m.marginalize(idx)

with m:
    idata = pm.sample()

Note that we explicitly control how the weights are used, so that the x, y pairs are tied to the same idx variable; therefore the logp should be equivalent to that of a "CensoredMvNormal mixture".

Disclaimer: I didn’t try to actually sample.


Cutting edge technology? Count me in. I will let you know if I manage to get it running and get some results. Thanks


Alright, after a hectic month I am back to this problem. I have managed to get the code running without any changes (@ricardoV94 kudos!), however it did not seem to produce the results I wanted. Since it has been a while, I will restate my question:

Can one use multiple pm.Normals in a mixture model and still find a way to leverage the non-mixing sampling property of pm.MvNormal? (This is needed because in the end I have to use censored distributions.)

Therefore I took a step back and simplified the question: does this model work fine when applied to the usual clustering problem (i.e. no censoring)? I added a bit of extra script to generate data, modified the code above to remove pm.Censored.dist, and ran a couple of tests (see code below). My initial runs also produced weird results for anything with more than 2 clusters; it looked like the first n-2 clusters were not being considered at all. Then I realized it was because of

for i in range(n_clusters - 1):
    rv = pm.math.where(
        pm.math.eq(idx, i),
        rv,
        pm.Censored.dist(pm.Normal.dist(mu[i + 1][:, None], sigma), lower=lower, upper=upper, shape=data.shape),
    )

This was essentially discarding every cluster before the last two because of pm.math.eq. The correct expression for the condition is

pm.math.le(idx, i) 
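
A quick NumPy check (my own toy example, with one hypothetical observation per cluster, tracking which cluster's distribution each observation ends up with) shows why: with eq() every observation whose idx is below n_clusters - 2 collapses onto the last clusters, whereas le() keeps the intended assignment.

import numpy as np

n_clusters = 5
idx = np.arange(n_clusters)                     # one observation per cluster
assigned_eq = np.zeros(n_clusters, dtype=int)   # stand-in for the initial rv (cluster 0)
assigned_le = np.zeros(n_clusters, dtype=int)
for i in range(n_clusters - 1):
    assigned_eq = np.where(idx == i, assigned_eq, i + 1)
    assigned_le = np.where(idx <= i, assigned_le, i + 1)

print(assigned_eq)  # [4 4 4 3 4] -> only the last two clusters' distributions survive
print(assigned_le)  # [0 1 2 3 4] -> each observation keeps its own cluster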

With this change it produces generally good results, and quite fast once marginalized. I have compared it to a Dirichlet mixture model with univariate normals, one with multivariate normals, and the same model above without marginalization. The code above outperforms the Dirichlet mixtures of univariate and multivariate normals. Comparing marginalized vs non-marginalized, the two seem to perform similarly in terms of end results, but of course the marginalized version is orders of magnitude faster (interestingly, sampling continuous and discrete parameters together does not seem to have a big negative impact in this single instance). Label mixing is quite obvious with the univariate normals. There doesn't seem to be any mixing with the multivariate normals, but there are more divergences and it is quite slow (about 10x slower than the marginalized univariate version). So Ricardo's code is the winner in this single test instance.

Consider for instance the following cluster centres. One of the clusters is well separated from the others, whereas the last two are not; this setup was chosen in the hope that it would confuse mixing-prone and inefficient models.

array([[3.26044117, 2.27982472],
       [6.84336638, 5.49387813],
       [7.14995669, 4.45807473]])

Note that all of the models above come up with three clusters, one around (7, 5) and two around (3, 2), which is fine given that the last two clusters are barely separated from each other relative to the amount of noise (sd = 0.6), there are not many data points (about 30 per cluster on average), and the priors are quite broad. The goal here is to prevent mixing of the sampled clusters.

Below I test four models using the same cluster centers (all with init="advi+adapt_diag" and no initial points supplied).

The trace plots for the code above (Ricardo's suggestion):

[trace plot: uv_mar]

The trace plots for the non-marginalized version of the above:

[trace plot: uv_nonmar]

The trace plots for the univariate normals (see the code in the topic linked above):

[trace plot: uv]

The trace plots for the multivariate normals (code by @jessegrabowski modified to have a single parameter representing the covariance, also see the link above; note that sigma here actually represents chol^2, not the sd):

[trace plot: mv]

So next I will be looking at the following:

1- How does it fare with censored data and a censored likelihood?
2- Since this is not a mixture model and idx is marginalized out, what is a good way to compute which cluster each data point belongs to? (Maybe just evaluating the likelihood of each data point under each cluster's normal would suffice…) I could potentially compare whatever method I come up with against the idx from the non-marginalized version, which seems to work.

The final code is a minor modification of Ricardo’s code:

import pymc as pm
import numpy as np
import arviz as az
from pymc_experimental import MarginalModel
from pymc.distributions.transforms import ordered

#params
ndims = 2
n_clusters = 3
upper = 8
lower = 0
noise_sd = 0.5
seed = 111111

#generate data
rng = np.random.default_rng(seed)
ndata_per_cluster = rng.integers(20,50, size=n_clusters)
ndata = int(np.sum(ndata_per_cluster))

cluster_centers = rng.random((n_clusters, ndims))*upper
cluster_centers = cluster_centers[np.argsort(cluster_centers[:,0]),:]
data = []

for i in range(n_clusters):
   noise = rng.normal(0, noise_sd, (ndata_per_cluster[i], ndims))
   cluster_data = cluster_centers[i:i+1,:] + noise
   data.append(cluster_data)

data = np.concatenate(data).T

#setup
coords = {
    "cluster": range(n_clusters),
    "ndim": range(data.shape[0]),
    "obs": range(data.shape[1]),
}


sample_args = {"draws":3000, "chains":6, "tune":1000, "cores":6,
               "target_accept":0.95, "init":"advi+adapt_diag"}


with MarginalModel(coords=coords) as m1:

    idx = pm.Categorical("idx", p=np.ones(n_clusters) / n_clusters, dims=("obs",))

    mu_x = pm.Normal("mu_x", np.ones((n_clusters,))*4, 10,
                     dims=("cluster",), transform=ordered,
                     initval=np.linspace(lower, upper, n_clusters))
    mu_y = pm.Normal("mu_y", np.ones((n_clusters,))*4, 10, dims=("cluster",))


    mu = pm.math.concatenate([mu_x[..., None], mu_y[..., None]], axis=-1)

    sigma = pm.HalfNormal("sigma")

    def dist(idx, mu, sigma, _):
        # Define a mixture where the output x, y depends on the value of idx
        rv = pm.Normal.dist(mu[0][:, None], sigma, shape=data.shape)

        for i in range(n_clusters - 1):
            rv = pm.math.where(
                pm.math.le(idx, i),
                rv,
                pm.Normal.dist(mu[i + 1][:, None], sigma, shape=data.shape)
            )
        return rv


    mix = pm.CustomDist(
        "mix",
        idx, mu, sigma,
        dist=dist,
        observed=data,
        dims=("ndim", "obs"),
    )

# Marginalize away the idx, creating a Mixture likelihood
m1.marginalize(idx)

with m1:
  idata1 = pm.sample(**sample_args)

az.plot_trace(idata1)

You can retrieve the posterior probability of any discrete marginalized variable "easily" by evaluating the whole model logp at the different possible values of the marginalized variable, conditioned on the posterior draws of the remaining parameters. There's an old gist here: Unmarginalize logp · GitHub with very little explanation and quite possibly already broken!

@zaxtax has been working on a routine that does this automatically for MarginalModel, which may be coming out some time soon.


Isn’t this also what the example at the end of your prediction blog post does?


Wouldn't that be a lot of evaluations? Are we talking about evaluating the model at all possible values of the idx vector? With three clusters and a hundred data points that would be something like 3**100 possibilities. I will try to implement your linked code, but for my purposes I think I would be OK with evaluating the likelihood of each data point under each cluster's normal and comparing them to decide. That does not take the weights into account, which would be a problem for very imbalanced clusters, so obviously if there is an easy way to unmarginalize, that would be better.

No, because the idxs are conditionally independent, so you only need 3 * 100 evaluations per posterior sample.


Thanks, I don't yet understand this, but I need to sit down with pen and paper and think about it some more.

It's the same reason the marginal logp is feasible. Otherwise evaluating it would also require 3**100 distinct computations.

Thinking about the logp of pymc.Mixture may help: there we are also marginalizing one idx per observation, yet we can compute the logp (and retrieve the marginalized indexes) without running into a combinatorial explosion.

The easiest way to see this is to take the non-marginalized model and check that changing the value of idx1 has no impact on the probability of idx2 evaluated at the same fixed value.
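
To make the 3 * 100 bookkeeping concrete, here is a minimal NumPy/SciPy sketch (my own illustration, not the MarginalModel API) of the per-observation posterior probabilities for the simple iid-normal clustering model in this thread; mu, sigma and weights stand for a single posterior draw pulled from the trace:

import numpy as np
from scipy.stats import norm

def responsibilities(data, mu, sigma, weights):
    """data: (ndim, nobs), mu: (n_clusters, ndim), sigma: scalar, weights: (n_clusters,).
    Returns p(idx_n = k | x_n, theta) with shape (n_clusters, nobs)."""
    # log-likelihood of every observation under every cluster: n_clusters * nobs terms
    loglik = norm.logpdf(data.T[None, :, :], mu[:, None, :], sigma).sum(axis=-1)
    logpost = np.log(weights)[:, None] + loglik
    logpost -= logpost.max(axis=0, keepdims=True)  # numerical stability
    post = np.exp(logpost)
    return post / post.sum(axis=0, keepdims=True)

Repeating this for every posterior draw and averaging gives the posterior probability of each idx, which is essentially what the unmarginalize gist above does through the model logp.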


Hello,

I am back! Thanks to the discussion here I have a better understanding of the situation. I had not previously thought of mixture models as a marginalized set of normals whose indices (idx) are drawn from a categorical variable. With that understanding it is indeed not so hard to sample idx from the model (a method similar to how one would construct labels in a mixture model suffices).

However, while thinking about this, I noticed that Ricardo's model above does not have weights derived from a Dirichlet distribution as the other ones do. So I decided to try my hand at adding that to his model. The only change is:

weights = pm.Dirichlet("w", a=10*np.ones(n_clusters), dims=("cluster",))
idx = pm.Categorical("idx", weights, dims=("obs",))

and sampling args are

sample_args = {"draws":3000, "chains":6, "tune":1000, "cores":6,
                "init":"advi+adapt_diag",
                "metropolis": {"nuts": {"target_accept":0.95}}
               }

For simplicity I tested this on a setup with well defined clusters:

array([[0, 0],
       [1, 1],
       [2, 2]])

and unequal weights (the data counts per cluster are respectively [10, 40, 50]).

One thing that is different is that, with the addition of the weights, the samplers change from NUTS to

>Metropolis: [w]
>Metropolis: [mu_x]
>Metropolis: [mu_y]
>Slice: [σ]

I am seeing Slice for the first time! Does making the marginalized variable depend on another prior make the .marginalize functionality harder to apply? Also the posterior traces that I get now are quite wobbly (despite giving correct results):

Is this a result of these samplers? Would you discourage using weights in this context, or is there a method to improve this? Thanks

Not at all.

The wobbliness may just be due to the lower efficiency of the slice/metropolis samplers. It looks like NUTS should have been used though. Is that the model with idx marginalized?

Also welcome back!

You may be interested to know we are making some improvements to MarginalModel. I actually want to use your model as a test case. It should become easier to write!

We have also merged the functionality to recover marginalized variables automatically (sorry if that's old news). We have an example notebook in the works that shows the functionality: Adding example on marginalizing models by zaxtax · Pull Request #628 · pymc-devs/pymc-examples · GitHub

Thanks for the reply. The exact model (with data generation) is below. It is basically the model you suggested plus Dirichlet weights; I don't think there is any major change to it. I have manually set the cluster centres so that the problem is more identifiable (I wanted to double check that this is not the source of the wobbliness).

import pymc as pm
import numpy as np
import arviz as az
from pymc_experimental import MarginalModel
from pymc.distributions.transforms import ordered


def model(coords, n_clusters, data, lower, upper):

  with MarginalModel(coords=coords) as m1:

      weights = pm.Dirichlet("w", a=10*np.ones(n_clusters), dims=("cluster",))
      idx = pm.Categorical("idx", weights, dims=("obs",))

      mu_x = pm.Normal("mu_x", np.ones((n_clusters,))*4, 10,
                       dims=("cluster",), transform=ordered,
                       initval=np.linspace(lower, upper, n_clusters))
      mu_y = pm.Normal("mu_y", np.ones((n_clusters,))*4, 10, dims=("cluster",))

      mu = pm.math.concatenate([mu_x[..., None], mu_y[..., None]], axis=-1)

      sigma = pm.HalfNormal("σ")


      def dist(idx, mu, sigma, _):
          # Define a mixture where the output x, y depends on the value of idx
          rv = pm.Normal.dist(mu[0][:, None], sigma, shape=data.shape)

          for i in range(n_clusters - 1):
              rv = pm.math.where(
                  pm.math.le(idx, i),
                  rv,
                  pm.Normal.dist(mu[i + 1][:, None], sigma, shape=data.shape)
              )
          return rv


      pm.CustomDist(
          "mix",
          idx, mu, sigma,
          dist=dist,
          observed=data,
          dims=("ndim", "obs"),
      )


  m1.marginalize([idx])

  sample_args = {"draws":3000, "chains":6, "tune":1000, "cores":6,
                  "init":"advi+adapt_diag",
                  }

  with m1:
    idata1 = pm.sample(**sample_args)

  return idata1

if __name__ == "__main__":
  ndims = 2
  n_clusters = 3
  upper = 8
  lower = 0
  noise_sd = 0.5
  seed = 1111111

  #generate data
  rng = np.random.default_rng(seed)
  ndata_per_cluster = [10, 40, 50]
  ndata = int(np.sum(ndata_per_cluster))
  cluster_centers = rng.random((n_clusters, ndims))*upper

  cluster_centers = np.array([[0,0],[1,1], [2,2]])

  cluster_centers = cluster_centers[np.argsort(cluster_centers[:,0]),:]
  data = []

  for i in range(n_clusters):
      noise = rng.normal(0, noise_sd, (ndata_per_cluster[i], ndims))
      cluster_data = cluster_centers[i:i+1,:] + noise
      data.append(cluster_data)

  data = np.concatenate(data).T

  #setup


  coords = {
      "cluster": range(n_clusters),
      "ndim": range(data.shape[0]),
      "obs": range(data.shape[1]),
  }



  idata = model(coords, n_clusters, data, lower, upper)
  ax = az.plot_trace(idata, var_names=["mu_x","mu_y","σ","w"])

Nice to know, thanks for the update! I will be sure to check out the automatic recovery of marginalized variables. My impression so far is that marginalization in this simple case is doing what it is supposed to do, modulo the weird behaviour of changing samplers.

It is your model really, but of course feel free to use it. If you want, I can post the full version with the censored normals here once I sort out the issue with the wobbly posteriors and finalize the model.

@ricardoV94 Since you were surprised that it defaulted to Metropolis and Slice, I tried to see what would happen when samplers are changed. I first tried all Metropolis with

step1 = pm.Metropolis([weights, mu_x, mu_y, sigma])
idata1 = pm.sample(**sample_args, step=step1)

it samples as

Multiprocess sampling (6 chains in 6 jobs)
CompoundStep
>Metropolis: [w]
>Metropolis: [mu_x]
>Metropolis: [mu_y]
>Metropolis: [σ]

with wobbly posteriors still. If I try to force NUTS on all via

step1 = pm.NUTS([weights, mu_x, mu_y, sigma])
idata1 = pm.sample(**sample_args, step=step1)

I get the following rather lengthy NotImplementedError:

Traceback (most recent call last):

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File ~/Dropbox/data_analysis/MODELING/bayesian_clustering/censored mixture/pymc_experimental_clustering.py:128
    idata1 = model(coords, n_clusters, data, lower, upper)

  File ~/Dropbox/data_analysis/MODELING/bayesian_clustering/censored mixture/pymc_experimental_clustering.py:84 in model
    step1 = pm.NUTS([weights, mu_x, mu_y, sigma])

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pymc/step_methods/hmc/nuts.py:180 in __init__
    super().__init__(vars, **kwargs)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pymc/step_methods/hmc/base_hmc.py:109 in __init__
    super().__init__(vars, blocked=blocked, model=self._model, dtype=dtype, **pytensor_kwargs)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pymc/step_methods/arraystep.py:164 in __init__
    func = model.logp_dlogp_function(vars, dtype=dtype, **pytensor_kwargs)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pymc/model/core.py:611 in logp_dlogp_function
    return ValueGradFunction(costs, grad_vars, extra_vars_and_values, **kwargs)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pymc/model/core.py:341 in __init__
    grads = pytensor.grad(cost, grad_vars, disconnected_inputs="ignore")

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:517 in grad
    var_to_app_to_idx = _populate_var_to_app_to_idx(outputs, _wrt, consider_constant)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:961 in _populate_var_to_app_to_idx
    account_for(output)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:957 in account_for
    account_for(ipt)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:957 in account_for
    account_for(ipt)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:957 in account_for
    account_for(ipt)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:957 in account_for
    account_for(ipt)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:957 in account_for
    account_for(ipt)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:957 in account_for
    account_for(ipt)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:933 in account_for
    connection_pattern = _node_to_pattern(app)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:812 in _node_to_pattern
    connection_pattern = node.op.connection_pattern(node)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/compile/builders.py:881 in connection_pattern
    lop_op = self.get_lop_op()

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/compile/builders.py:730 in get_lop_op
    self._recompute_lop_op()

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/configparser.py:47 in res
    return f(*args, **kwargs)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/compile/builders.py:525 in _recompute_lop_op
    gdefaults_l = fn_grad(wrt=local_inputs)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:619 in grad
    _rval: Sequence[Variable] = _populate_grad_dict(

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:1419 in _populate_grad_dict
    rval = [access_grad_cache(elem) for elem in wrt]

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:1419 in <listcomp>
    rval = [access_grad_cache(elem) for elem in wrt]

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:1374 in access_grad_cache
    term = access_term_cache(node)[idx]

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:1049 in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:1049 in <listcomp>
    output_grads = [access_grad_cache(var) for var in node.outputs]

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:1374 in access_grad_cache
    term = access_term_cache(node)[idx]

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/gradient.py:1204 in access_term_cache
    input_grads = node.op.L_op(inputs, node.outputs, new_output_grads)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/graph/op.py:401 in L_op
    return self.grad(inputs, output_grads)

  File ~/miniconda3/envs/pymc_env_experimental/lib/python3.11/site-packages/pytensor/graph/op.py:372 in grad
    raise NotImplementedError()

NotImplementedError

How would I diagnose this further? I have tried forcing NUTS on just one of the variables and it fails on each of them too.

I suspect it’s a bug but I have to investigate. Thanks for investigating

After Expand models supported by automatic marginalization by ricardoV94 · Pull Request #300 · pymc-devs/pymc-experimental · GitHub gets merged, this should work:

import numpy as np
import pymc as pm

from pymc_experimental import MarginalModel
from pymc.distributions.transforms import ordered

data = np.array([[-1., -1.], [0., 0.], [1., 1.], [2., 2.], [-3., -3.]] * 2).T
nobs = data.shape[-1]
n_clusters = 5
coords = {
    "cluster": range(n_clusters), 
    "ndim": ("x", "y"), 
    "obs": range(nobs),
}
with MarginalModel(coords=coords) as m:
    idx = pm.Categorical("idx", p=np.ones(n_clusters) / n_clusters, dims=["obs"])
    mu_x = pm.Normal("mu_x", dims=["cluster"], transform=ordered, initval=np.linspace(-1, 1, n_clusters))
    mu_y = pm.Normal("mu_y", dims=["cluster"])
    mu = pm.math.concatenate([mu_x[None], mu_y[None]], axis=0)
        
    sigma = pm.HalfNormal("sigma")
    y = pm.Censored(
        "y",
        dist=pm.Normal.dist(mu[:, idx], sigma),
        lower=-3, 
        upper=3,
        observed=data,
        dims=["ndim", "obs"],
    )

    # Marginalize away the idx, creating a Mixture likelihood
    m.marginalize(idx)
    idata = pm.sample()

No need for the dummy CustomDist. I added this model as one of the new tests.

I also confirmed it's using NUTS on my machine.

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_x, mu_y, sigma]

I’m able to reproduce your issue with the chosen sampler (in your larger model with Dirichlet).

It has to do with the OpFromGraph gradient failing, which is not a surprise. There were already problems before: Gradient of OpFromGraph fails · Issue #1 · pymc-devs/pytensor · GitHub

Looks nice, thanks! Is there any work being done to solve the sampler issue for OpFromGraph? If not, do you think there is a workaround? I mean, I am OK with working with wobbly posteriors if I know that the root cause is not a bad model.

Anyway, I will soon post a fully working script here with censoring and everything (with tests to check whether censoring behaves the way it should).