AssertionError in PyMC3 code using Deterministic distributions

Hello, I have found this Discourse on the PyMC Gitter. This is a repost from SO since I did not get any luck so far :slight_smile:

I have been trying for the past days to get accustomed to PyMC to eventually do some MCMC sampling of distributions from some models I have direct codes for (I’m ultimately interested in parameters estimation).

To my knowledge there aren’t so many examples out there of people showing their codes (like an external C or FORTRAN code) that they successfully make work with PyMC3. I found only bribes of information about this so far.

So to start with simple problems, I tried to reproduce with PyMC3 some results from existing Python codes doing MCMC with ‘complicated’ (read: more than the doc examples) distributions not using PyMC, namely this simple result which runs in 2 seconds for 10000 samples on my laptop.

To define Deterministic variables in PyMC3, one should use the @theano.compile.ops.as_op decorator (it was @deterministic in PyMC, which is now deprecated in PyMC3), and this is what I did.

Here is my code (I use Spyder and so IPython), inspired by the PyMC documentation, in which I encounter an AssertionError after the second cell (in the Estimation process, so before sampling).
I have been trying to solve this for the last two days but I really do not understand what the error can be. I believe it should be a some kind of PyMC or Theano trick that I did not catch yet, as I believe I’m really close.

    #%% Define model
    import numpy,math
    import matplotlib.pyplot as plt
    import random as random

    import theano.tensor as t
    import theano

    random.seed(1)  # set random seed

    # copy-pasted function from the specific model used in the source and adapted with as_op 
    @theano.compile.ops.as_op(itypes=[t.iscalar, t.dscalar, t.fscalar, t.fscalar],otypes=[t.dvector])
    def sampleFromSalpeter(N, alpha, M_min, M_max):        
        log_M_Min = math.log(M_min)
        log_M_Max = math.log(M_max)
        maxlik = math.pow(M_min, 1.0 - alpha)
        Masses = []
        while (len(Masses) < N):            
            logM = random.uniform(log_M_Min,log_M_Max)
            M    = math.exp(logM)            
            likelihood = math.pow(M, 1.0 - alpha)            
            u = random.uniform(0.0,maxlik)
            if (u < likelihood):
                Masses.append(M)
        return Masses

    # SAME function as above, used to make test data (so no Theano here)
    def sampleFromSalpeterDATA(N, alpha, M_min, M_max):        
        log_M_Min = math.log(M_min)
        log_M_Max = math.log(M_max)        
        maxlik = math.pow(M_min, 1.0 - alpha)        
        Masses = []        
        while (len(Masses) < N):            
            logM = random.uniform(log_M_Min,log_M_Max)
            M    = math.exp(logM)            
            likelihood = math.pow(M, 1.0 - alpha)            
            u = random.uniform(0.0,maxlik)
            if (u < likelihood):
                Masses.append(M)
        return Masses

    # Generate toy data.
    N      = 1000000  # Draw 1 Million stellar masses.
    alpha  = 2.35
    M_min  = 1.0
    M_max  = 100.0
    Masses = sampleFromSalpeterDATA(N, alpha, M_min, M_max)

    #%% Estimation process
    import pymc3 as pm
    basic_model = pm.Model()
    with basic_model:

        # Priors for unknown model parameters
        alpha2 = pm.Normal('alpha2', mu=3, sd=10)
    
        N2=t.constant(1000000)
        M_min2  = t.constant(1.0)
        M_max2  = t.constant(100.0)

        # Expected value of outcome
        m =  sampleFromSalpeter(N2, alpha2, M_min2, M_max2)

        # Likelihood (sampling distribution) of observations
        Y_obs = pm.Normal('Y_obs', mu=m, sd=10, observed=Masses)

    #%% Sample
    with basic_model:
        step = pm.Metropolis()
        trace = pm.sample(10000, step=step)

    #%% Plot posteriors
    _ = pm.traceplot(trace)
    pm.summary(trace)

Here is the error log:

    Traceback (most recent call last):

  File "<ipython-input-2-8f4f5678fc24>", line 17, in <module>
    m =  sampleFromSalpeter(N2, alpha2, M_min2, M_max2)

  File "D:\myname\AppData\Local\Continuum\Anaconda3\lib\site-packages\theano\gof\op.py", line 674, in __call__
    required = thunk()

  File "D:\myname\AppData\Local\Continuum\Anaconda3\lib\site-packages\theano\gof\op.py", line 872, in rval
    r = p(n, [x[0] for x in i], o)

  File "D:\myname\AppData\Local\Continuum\Anaconda3\lib\site-packages\theano\compile\ops.py", line 543, in perform
    assert len(outs) == len(outputs)

AssertionError

You are almost there, you just need to change the last line of the function definition to return [np.array(Masses)]. Basically, the otypes=[tt.dvector] promises that your function returns a list containing one f64 numpy array.

Thanks a lot, it actually worked ! I knew I was close, but not that close :slight_smile:
Now, it is running, yet it is telling me that it is going to take about 300h…any idea why it is so much slower ?
I actually had tried really simple distributions earlier and using as_op, and noticed a considerable slowdown but not even by that much…
I guess the original code works directly on the logp, is it what makes the whole difference?

The original function sampleFromSalpeterDATA itself is quite slow to evaluate (in python3 at least), so it is not very surprising that it is even slower when wrapped in theano.as_op.

The best way forward I think, it’s to rewrite sampleFromSalpeter into a vectorized numpy function (without the while loop), and then extend it into a theano function.

This is true. I actually realize now that this example of a ‘complicated distribution’ was probably not the best example since it is stated on the page where I took the code (see first post) that “The data only enters via […] which is completely independent of the fit parameter alpha and can be computed once at beginning. Therefore, handling huge datasets such as a Gaia catalogue is no problem at all!”

This is definitely not the path I took, since as you mention, the function sampleFromSalpeterDATA is called each time, hence the time needed to compute. In the original code they use the logp, which in that case does not depend on the fit parameter alpha.

This brings me to ask, like I mentioned in the first post, if there are examples out there of people running external codes with PyMC, just using Metropolis (since no gradients are then available). I just found this and this so far.

Is it even the best idea to use PyMC to just use Metropolis-Hastings on an external model?

Yes, in your example above, you could just rewrite the model with a theano logp function:

def loglike(alpha, N, M_min, M_max):
    def logfunc(value):
        D = tt.mean(tt.log(value))*N
        c = (1.0 - alpha)/(tt.pow(M_max, 1.0-alpha)
                        - tt.pow(M_min, 1.0-alpha))
        return N*tt.log(c) - alpha*D
    return logfunc

with pm.Model() as basic_model:

    # Priors for unknown model parameters
    alpha2 = pm.Normal('alpha2', mu=3, sd=10)

    N2 = t.constant(1000000)
    M_min2  = t.constant(1.0)
    M_max2  = t.constant(100.0)
    
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.DensityDist('Y_obs', loglike(alpha2, N2, M_min2, M_max2), observed=Masses)
    trace = pm.sample(1000)

As for cases where likelihood function is not available, there are not yet good solutions to it. You can modify the SMC sampler, as discussed in this github issue.

In general, these kinds of problem are better with a likelihood-free inference approach such as approximate bayesian computation (ABC). There are some libraries allow you to do that (e.g., GitHub - elfi-dev/elfi: ELFI - Engine for Likelihood-Free Inference), and we will likely implement something similar in the near future.

I would say no, as you cannot evaluate the likelihood that is essential for the MH acceptance step.

Just to clarify the issue of integrating external code into a pymc3 model (which is different from the question about likelihood-free approaches): I think this is an entirely valid and useful thing to do. But for larger models (say more than ~20 parameters) you pretty much need the gradients. Metropolis just won’t cut it. How hard they are to come by can be very different depending on the external code, and you also need some familiarity with theano and often quite a bit of (multivariate) calculus to get that to work. We are definitely missing good examples for this.

Thanks a lot, it works indeed at the same speed as the original code (reassuring for PyMC!) with this implementation of of the log likelihood, I was not quite sure of how to implement that, even though I had seen it mentioned.

I had found this the issue #1925 in Github also, but even though I tried yesterday and today to implement the SMC sampler, I couldn’t get it to work with my example.

I have found this post but when I run it, I get this Error :

recover_existing_results() missing 1 required positional argument: ‘n_jobs’

I checked the Github and it seems the SMC sampler has been worked on a lot recently, so I was wondering if there could be a version difference between the PyMC used on the example and the one in the above mentioned post?

Is there a way to execute the test_smc.py present in the test folder (I have the last version of PyMC3, the 3.1) ? Or any other working example?
Because I tried some other ways with my example code here, like:

import pymc3 as pm
from pymc3.step_methods import smc

with basic_model:
    mtrace = smc.sample_smc(1000)

or

import pymc3 as pm
from pymc3.step_methods import smc

with basic_model:
    step=pm.smc()
    mtrace = pm.sample(1000, step=step)

but none of these worked, I got TypeError: Argumenthomepathshould be path to result_directory. and TypeError: 'module' object is not callable for the second case.
I even tried to put new code from the PyMC/step_methods/smc.py file on the Github in my version, but it didn’t work either.

Yes there are some recent change of the SMC. Please update PyMC3 to master and check the current doc

If you want to run the test, you can use pytest, more information in our Guidelines for Contributing. But I think you might find this code more helpful (it is a mixed effect model):

Hi,

I’m having a similar AssertionError issue when using the logcdf method for a Beta, but strangely not for a Normal. Here is my Code:

# Initialize random number generator
np.random.seed(123)

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
data = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma
>
basic_model = pm.Model()

with basic_model:

    a = pm.Normal('a', mu=2, sd=2)
    b = pm.Normal('b', mu=2, sd=2)
    s = pm.HalfNormal('sigma', sd=0.01)

    m1 = np.exp(pm.Beta.dist(alpha=a, beta=b).logcdf(0.1))
  
    Y_obs = pm.Normal('Y_obs1', mu=m1, sd=s, observed=data)


map_estimate = pm.find_MAP(model=basic_model)
print map_estimate

with basic_model:
    # Sample from the posterior
      trace = pm.sample(draws=1000, chains=2, tune=500, discard_tuned_samples=True)

      pm.plot_posterior(trace)

Hi @pcolo, could you please open a new topic and also paste the error message?

[edit] never mind, I saw your post.