Difficulties in using .dist and pm.Potential for inference

Hi all,

Following the remark of @jessegrabowski, I checked if .dist nodes are updated. But it looks like they are not, and I am stuck on how to proceed.

Specifically, I am motivated by the following problem. Suppose that I want distribute N values according to some distribution with the mean \mu and standard deviation \sigma. Then I count how many of them fall in given bins from 0 to D and repeat such experiment T times. I arrive at multinomial likelihood, from which I want to recover the parameters (\mu, \sigma).

I wrote the following code, but it looks like it does not any inference:

import pymc as pm

import aesara as ae
import aesara.tensor as at
import aesara.tensor.math as tt

with pm.Model() as model:    
    # Data
    T = tt.as_tensor_variable(counts.shape[0], dtype="int64") 
    D = tt.as_tensor_variable(counts.shape[1] - 1, dtype="int64") 
    
    counts_matrix = tt.as_tensor_variable(counts, dtype="int64")
    
    # Priors
    μ = pm.HalfNormal('μ', 5.0, initval=2.0)
    σ = pm.HalfNormal('σ', 5.0, initval=2.0)
    
    # Likelihood
    ### Parameters of the lognormal distribution
    σlog = tt.sqrt(tt.log((σ / μ)**2 + 1))
    μlog = tt.log(μ) - σlog**2 / 2
    π, _ = ae.scan(lambda t: tt.exp(pm.logcdf(pm.Lognormal.dist(mu = μlog, sigma = σlog), t + 0.5)) - tt.exp(pm.logcdf(pm.Lognormal.dist(mu = μlog, sigma = σlog), t - 0.5)), 
            sequences = [tt.arange(0, D + 1)])

    llks, _ = ae.scan(lambda t, X: pm.logp(pm.Multinomial.dist(n = tt.sum(X[t, :]), p = π / tt.sum(π)), X[t, :]), 
                     sequences = [tt.arange(0, T)], non_sequences=[counts_matrix])
    pm.Potential('llk', llks)
    trace = pm.sample(tune=1000, chains=4, init='advi')

the gist can be found here 20221015-pymc-question.ipynb · GitHub. You can see that there is no result at the end.

I have done something quite similar using Stan and it worked there, but I have difficulties implementing this in pymc.

functions {
    /* discretized version of gamma, Weibull, and lognormal distributions */
    vector dPMF(real param1, real param2, int K) {
        vector[K] res;
        for (k in 1:K)
             res[k] = lognormal_cdf(k - 0.5 | param1, param2);

        return append_row(res[1] / res[K], (tail(res, K-1) - head(res, K-1)) / res[K]);  
    }
}

data {
    int<lower = 1> D; 
    int<lower = 1> T;
    array[D + 1, T] int<lower=0> n; // counts
}

transformed data {
    array[T] int<lower = 0> n_sum;
    for (t in 1:T)
        n_sum[t] = sum(n[, t]);
        
    real jitter = 1e-9;
}

parameters {
    real<lower=0> mu, sigma;
}

transformed parameters {
    real param1, param2,
        lps = 0.0;
    {
        // Lognormal distribution
        param2 = sqrt(log(square(sigma / mu) + 1.0));
        param1 = log(mu) - square(param2) / 2.0;

        vector[T+1] beta = dPMF(param1, param2, T+1);
        for (t in 1:T-1) if (n_sum[t] > 0) {                
                vector[D+1] beta_current;
                beta_current[1:D] = head(beta, D);
                beta_current[D+1] = max({1.0 - sum(head(beta_current, D)), jitter});
                beta_current /= sum(beta_current);
                    
                lps += multinomial_lpmf(n[, t] | beta_current);
        }
    }
}

model {
    /* priors */
    sigma ~ cauchy(0.0, 5.0);
    mu ~ std_normal();

    /* likelihood */
    target += lps;
}

Many thanks in advance!

Upd1: When I do the same simulation in pymc3, the results are correct. Please see 20221015-pymc3-version.ipynb · GitHub

I suspect that the various pieces connecting \mu and \sigma to the likelihood are not being sampled because they are not registered in the model. @ricardoV94?

I’m curious about the behavior of PyMC here generally, but in your specific model, why do you need the dist at all? It seems like this should work fine, because the parameters of the distributions aren’t actually changing in the scan:

    σlog = tt.sqrt(tt.log((σ / μ)**2 + 1))
    μlog = tt.log(μ) - σlog**2 / 2
    log_norm = pm.Lognormal(mu=μlog, sigma=σlog)
    π, _ = ae.scan(lambda t: tt.exp(pm.logcdf(log_norm, t + 0.5)) - tt.exp(pm.logcdf(log_norm , t - 0.5)), 
            sequences = [tt.arange(0, D + 1)])

If I do that, then it complains that the name should be provided

---> 16 π, _ = ae.scan(lambda t: tt.exp(pm.logcdf(pm.Lognormal(mu = μlog, sigma = σlog), t + 0.5)) - tt.exp(pm.logcdf(pm.Lognormal(mu = μlog, sigma = σlog), t - 0.5)), 
     17         sequences = [tt.arange(0, D + 1)])

TypeError: Distribution.__new__() missing 1 required positional argument: 'name'

Ah yeah I had a typo, you need to give it a name, like: log_norm = pm.Lognormal('log_norm', mu=μlog, sigma=σlog)

Then you should be using the variable log_norm in the scan, not pm.Lognormal.

Something’s wrong anyway. The result has not been changed but there are messages about divergences

Sampling 4 chains for 3_000 tune and 1_000 draw iterations (12_000 + 4_000 draws total) took 39 seconds.
INFO:pymc:Sampling 4 chains for 3_000 tune and 1_000 draw iterations (12_000 + 4_000 draws total) took 39 seconds.
There were 50 divergences after tuning. Increase `target_accept` or reparameterize.
ERROR:pymc:There were 50 divergences after tuning. Increase `target_accept` or reparameterize.
There were 65 divergences after tuning. Increase `target_accept` or reparameterize.
ERROR:pymc:There were 65 divergences after tuning. Increase `target_accept` or reparameterize.
There were 48 divergences after tuning. Increase `target_accept` or reparameterize.
ERROR:pymc:There were 48 divergences after tuning. Increase `target_accept` or reparameterize.
There were 59 divergences after tuning. Increase `target_accept` or reparameterize.
ERROR:pymc:There were 59 divergences after tuning. Increase `target_accept` or reparameterize.

Personally, I can’t interpret this message:

NUTS: [μ, σ, log_norm]
INFO:pymc:NUTS: [μ, σ, log_norm]

What is that value of log_norm there? But probably I am not aware of something

log_norm is the lognormal distribution parameterized by μlog and σlog. It’s a latent variable in your model.

I looked at things more closely, and there are several improvements you can make:

  1. You can dispense with the scans entirely. You are merely iterating over a count variable, so you can accomplish the same thing with broadcasting. These two lines of code are equivalent:
    D_grid = at.arange(0, D+1)
    log_norm = pm.Lognormal('log_norm', mu=μ, sigma=σ)
    
    def cdf(x):
        return at.exp(pm.logcdf(log_norm, x))

# Scan over grid
    π, _ = aesara.scan(lambda t: cdf(t + 0.5) - cdf(t - 0.5), 
                       sequences = [D_grid])

# Broadcast cdf over the grid directly:
    pi = pm.Deterministic('p', cdf(D_grid + 0.5) - cdf(D_grid - 0.5))

You can verify this with np.allclose:

np.allclose(π.eval(), pi.eval())
>>> True
  1. The likelihood function has a similar problem. There should be an alarm bell going off when you call the pm.logp function and then put the result into a pm.Potential – can you use the distribution in question directly?

In you scan you are getting row-wise sums of counts_matrix. You can instead directly pass this into a random variable:

    log_like = pm.Multinomial('obs', 
                              n=counts_matrix.sum(axis=1), 
                              p = pi / pi.sum(),
                              observed=counts_matrix)

After making these changes, the model sampled quickly without divergences. Here is the posterior graph I got with this modified model:

2 Likes

@jessegrabowski, many thanks for checking! I really like your solution.

However, I tried to run a similar code, but I’ve got many divergences 20221016-pymc-question.ipynb · GitHub

Just to confirm, is it not about priors or something like that?

with pm.Model() as model:    
    # Data
    T = tt.as_tensor_variable(counts.shape[0], dtype="int64") 
    D = tt.as_tensor_variable(counts.shape[1] - 1, dtype="int64") 
    D_grid = at.arange(0, D + 1)
    
    counts_matrix = tt.as_tensor_variable(counts, dtype="int64")
    
    # Priors
    μ = pm.HalfNormal('μ', 5.0, initval=2.0)
    σ = pm.HalfNormal('σ', 5.0, initval=2.0)
    
    # Likelihood
    ### Parameters of the lognormal distribution
    σlog = tt.sqrt(tt.log((σ / μ)**2 + 1))
    μlog = tt.log(μ) - σlog**2 / 2
    
    log_norm = pm.Lognormal('log_norm', mu = μlog, sigma = σlog)
    def cdf(x):
        return tt.exp(pm.logcdf(log_norm, x))
    π, _ = ae.scan(lambda t: cdf(t + 0.5) - cdf(t - 0.5), sequences = [D_grid])
    
    pm.Deterministic('p', π)

    log_like = pm.Multinomial('obs', 
                              n=counts_matrix.sum(axis=1), 
                              p = π / π.sum(),
                              observed=counts_matrix)
    
    trace = pm.sample(tune=3000, chains=4, init='advi')

You should not be using a scan to compute pi, as noted above. Broadcast tensor operations should always be preferred to scans for optimization reasons. You can also dispense with the transformation of μ and σ into μlog and σlog, and instead let the log normal distribution handle the transformation of the parameters. I also used init = 'jitter+adapt_diag_grad when I ran pm.sample, although this might not matter after you make the other changes.

1 Like

Thank you! Your solution has indeed worked!

The only confusion I have is that the divergences appear when I used “ae.scan” (see 20221018-pymc-question.ipynb · GitHub). I thought that the use of ae.scan is not recommended because it is slower, but not because it is wrong. For example, if simulations use some external code, then the scan could be an option.

You can also dispense with the transformation of μ and σ into μlog and σlog

I am going to implement the mixture model later with the mean and SD would be shared parameters among all three. That is why I am choosing μ and σ as the starting point in my example.

You’re right that scan should work fine in principle. The performance is also very very good when scanning over 1d vectors in my testing, so I wouldn’t worry about that. If you play around with it more please report back.

My immediate suspicion is that it’s related to the backward pass. The gradient of scan also involves scanning, so things can get complicated. I work a lot with time series models, and I can usually implement fast and efficient forward passes, but the gradient computations take a long time and end up fragile. No idea if this is true for your case.

1 Like

Thanks - I am worried that it worked in pymc3+theano, and aesara is supposed to be a successor of theano. So it is a bit confusing if that’s not the case