How to troubleshoot a sampling error with a custom, non-differentiable likelihood?

I’m modeling a PDE for wildfire spread and want to use PyMC to recover its parameters. I’m using a custom likelihood function that computes the Intersection over Union (IoU) metric comparing the predicted and observed burned areas. I want to map these metrics onto a Beta(5, 1) distribution and use that as a likelihood estimate. The code should be something like what’s shown below. However, when I try to run it, I get an error that the initial evaluation of the model at the starting point failed. These are the results given:

k_log__ -2.31
A_log__ -5.52
B_log__ -5.52
C_log__ 8.29
Cs_log__ -2.02
likelihood -inf

Any ideas for how to approach troubleshooting this would be appreciated.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

def get_IoUs(T_preds, T_obs, burn=600):
    '''
    Calculates the IoU score between observed and predicted
    fire perimeters for each time step and maps it to a likelihood.
    '''
    n_per, n_data = T_obs.shape
    burn_pred = T_preds >= burn
    burn_obs = T_obs >= burn
    IoUs = tt.zeros(n_per)
    for p in range(n_per):
        pred = burn_pred[p*n_data:(p+1)*n_data]
        obs = burn_obs[p].flatten()

        intersection = (pred * obs).sum()
        union = pred.sum() + obs.sum() - intersection
        # A plain Python `if` can't branch on a symbolic theano expression,
        # so use tt.switch; an empty union (no fire anywhere) counts as a perfect match.
        IoU = tt.switch(tt.gt(union, 0), intersection / union, 1.0)
        IoUs = tt.set_subtensor(IoUs[p], IoU)
    return IoUs

with pm.Model() as model:
    # Set prior distributions.
    k = pm.Lognormal("k", mu=np.log(1), sigma=4, shape=(1,)) # Collision frequency
    A = pm.Lognormal("A", mu=np.log(9000.), sigma=100, shape=(1,)) # Pre-exponential factor (rate)
    B = pm.Lognormal("B", mu=np.log(300.), sigma=100, shape=(1,)) # Proportionality Coefficient/Activation Energy
    C = pm.Lognormal("C", mu=np.log(1/10000), sigma=1/10000, shape=(1,)) # Scaled Heat Transfer Coefficient
    Cs = pm.Lognormal("Cs", mu=np.log(3), sigma=3, shape=(1,)) # Fuel Relative Disappearance Rate
    
    # Run the model.
    T_preds = theano_fem_solver(k, A, B, C, Cs)
    
    # Calculate the likelihood.
    IoUs = pm.Deterministic("IoUs", get_IoUs(T_preds, Ts_observed))
    likelihood = pm.Beta('likelihood', alpha=5, beta=1, observed=IoUs)
    
    # MCMC:
    trace = pm.sample(tune=5, draws=5)

To directly answer your question, it might be that the initial point chosen automatically when you call pm.sample doesn’t produce valid values. You can set initial values for all of your distributions by adding a testval= argument. Sampling will begin from that point, so you can pick an initial point you’re sure will work. Note that this argument was renamed to initval= in PyMC v4, if you decide to upgrade (which is recommended).

It might also be related to the fact that the beta distribution disallows the end points for certain combinations of parameters. So if your IoUs contain values of exactly 0.0 or 1.0 (which is very likely for the IoU of random samples), the likelihood will have a logpdf of -inf and you will get this error. Testing a Beta(5, 1), the logpdf of 0.0 is indeed -inf.
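You can check this directly, e.g. (PyMC3 / theano, so the expressions need an explicit .eval()):

import pymc3 as pm

# Beta(5, 1) puts zero density on 0.0, so its log-probability there is -inf.
print(pm.Beta.dist(alpha=5, beta=1).logp(0.0).eval())  # -inf
print(pm.Beta.dist(alpha=5, beta=1).logp(0.5).eval())  # finite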

Something about the approach seems more fundamentally off to me, though, like you’re mixing together two concepts. Why model the data generating process of a model evaluation metric, like IoU, rather than the process that generated the data itself? It’s an unnecessary extra step. I don’t know what the output of theano_fem_solver is, but you could assume the process is centered on the dynamics of the PDE plus some iid normally distributed noise (or what have you), then use Ts_observed directly as data. Once you have a data generating process, you get the distribution over model IoU “for free” from posterior predictive samples.
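As a very rough sketch of what I mean, assuming for the moment that Ts_observed were continuous temperatures (which I don’t actually know):

# Hypothetical sketch: treat the PDE output as the mean and add iid Gaussian noise.
sigma = pm.HalfNormal("sigma", sigma=50.0)
pm.Normal("obs", mu=T_preds, sigma=sigma, observed=Ts_observed)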

Since geography is involved, you might also use the PDE as the mean of some kind of spatial process, so the likelihood would be a multivariate normal with some covariance function.


@jessegrabowski, thanks for the response!
I’ve tried setting initial values that I know should produce positive, finite values, and I’ve tried catching IoUs of exactly 0 or 1 and changing them to 0.001 or 0.999. Neither of these approaches changes the likelihood from -inf. I don’t know what else about the process could be causing this error.
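The clamping was essentially just this, applied right before get_IoUs returns:

IoUs = tt.clip(IoUs, 0.001, 0.999)  # keep IoU strictly inside (0, 1)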

More importantly, I really appreciate your insights on my fundamental approach to the problem, but I don’t really understand them. I’ll lay out in more detail what I’m trying to accomplish and then hopefully you could help me better understand what you mean, if you don’t mind.

I have a PDE that models how temperature changes over a spatial domain corresponding to a burning wildfire. The data I have are masks of the fire perimeter in the area at specific times: just binary indicators of whether fire is present or not. My metric for comparing how well the predicted fire matches my data is described in get_IoUs(); I consider temperatures above 600 degrees in my predictions to be on fire, find the intersection between the predicted and actual burning areas, and divide it by the union of the two geographical areas. I would like to recover the parameters of the PDE that produce predicted temperatures at specific times that best match the true wildfire according to the IoU score, in the hope of predicting how the fire will develop in the future.

So the output of theano_fem_solver is a vector describing the temperature at each coordinate of our finite element mesh in 2D space, at each time for which we have a true fire perimeter to compare against. I don’t understand how Ts_observed can be used directly as data. The IoU metric (or adaptations of it) is the only reliable performance metric I know of, and I’d like my likelihood function to be based on it. However, if I’m missing something fundamental here, I really want to understand it. What would it look like if I implemented what you’re describing? Using the PDE as the mean of a spatial process intrigues me; however, the actual temperature at individual coordinates of the mesh matters less than how much of the fire is correctly predicted and how badly it is overestimated, which is what IoU captures, so I don’t understand how this would be done.

Thanks so much!


It sounds like the PDE outputs a latent variable – temperature – which somehow determines the probability of whether a fire is observed (either True or False) at a given coordinate. So the actual data generating process could be modeled as a 2D matrix of Bernoulli random variables – fire or no fire – with probability parameter p \propto f(\text{temp}). Here f would be some linear function wrapped in a sigmoid, like \sigma(\alpha + \beta \cdot \text{temp}), so the probability stays between 0 and 1. This would match your data, as the model will now just output a 2D mask of 1’s and 0’s, since that’s what a Bernoulli does.

I think this might be similar to another question that was asked about golf. Maybe it could be helpful?


@jessegrabowski The discussion in the golf question is helpful for my general understanding, but there’s a very significant difference in how I’m identifying success. The problem I’m seeing is that I don’t care that much about individual coordinates being correct if they aren’t contextualized in some way, such as by Intersection over Union. For example, if I count how many coordinates are correctly classified as burning or not, then simply extending the geographical area I’m considering will drastically improve the model’s performance, because every new coordinate will just be non-burning. It would be ideal to still be able to use IoU in the data assimilation process, but I don’t understand how to do it. I’ve tried passing the IoUs to the log pdf of the Beta distribution, taking the sum, and using that as a likelihood factor with pm.Potential(). The code runs, but it can’t seem to find any acceptable samples. Is this likelihood just too non-smooth, creating a distribution that’s too difficult to sample?
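The pm.Potential() attempt looked roughly like this (same priors and get_IoUs as in my first post):

# Clamp away the endpoints, then add the summed Beta(5, 1) log-density as a potential.
IoUs = pm.Deterministic("IoUs", get_IoUs(T_preds, Ts_observed))
iou_logp = pm.Beta.dist(alpha=5, beta=1).logp(tt.clip(IoUs, 0.001, 0.999)).sum()
pm.Potential("iou_potential", iou_logp)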

Perhaps there’s a misunderstanding about what the Bayesian approach is? You’re not optimizing an objective function; you’re generating samples from a fixed posterior distribution. There’s no “improvement” going on as you run the MCMC algorithm, in the typical machine learning sense. It’s mapping out a complex probability distribution that is fixed as soon as you write down the model and provide the data.

This last point is important: you must show the model fixed data. The IoU is not a fixed quantity; it is a function of the current samples. I don’t even think your model as written above will run in PyMC v4, because PyMC now checks that data given to a likelihood function doesn’t link to any random variable nodes in the computation graph. I hope you can see why this would be problematic: you have an infinite regress. The likelihood depends on samples, which depend on the likelihood, which depends on samples… If you solved the problem of bad initial values, I suspect this “moving target” problem is related to your current woes.

This is my point when I say you should be modeling your data directly, instead of a model evaluation metric. What does the IoU capture that you think is essential? If it’s spatial dependence, you could run a 2D convolution over the temperature, or use a multivariate normal with some covariance function (or just model the entire covariance matrix). If it’s something else, reflect on what that something else is and explicitly include it in the latent probability parameter of the Bernoulli distribution.
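To sketch the convolution idea (this assumes your mesh values can be arranged on a regular n_rows x n_cols grid, which may not hold for an unstructured FEM mesh):

# Average each node with its 8 neighbours before converting to a burn probability,
# so the probability at (i, j) depends on the local neighbourhood rather than a single node.
# The 600/50 threshold and scale are just placeholder values.
T_grid = T_preds.reshape((1, 1, n_rows, n_cols))   # (batch, channel, rows, cols)
kernel = tt.ones((1, 1, 3, 3)) / 9.0
T_smooth = tt.nnet.conv2d(T_grid, kernel, border_mode="half")
probs = tt.nnet.sigmoid((T_smooth.flatten() - 600.0) / 50.0)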

Or, if you’re really convinced that you must optimize IoU, drop the Bayesian thing and just chuck it into a neural net with an IoU loss function. But I think that reasoning through what you believe IoU is capturing, and trying to build that directly into the data generating process, is a worthwhile exercise.


Thanks a ton! I think you’re right that I’ve had a fundamental misapprehension about the likelihood function and the Bayesian approach in general. This is really helpful! I’ll give some more focused thought toward how to directly model the data and what that means for my likelihood.


@jessegrabowski I’m still dwelling on how to capture the information I expect to be most important in a likelihood function, but I want to try your idea of using a 2D matrix of Bernoulli variables, and I’ve implemented it as follows:

with pm.Model() as model:
    # Set prior distributions.
    k = pm.Lognormal("k", mu=np.log(2), sigma=4, shape=(1,)) # Collision frequency
    A = pm.Lognormal("A", mu=np.log(9800.), sigma=1000, shape=(1,)) # Pre-exponential factor (rate)
    B = pm.Lognormal("B", mu=np.log(320.), sigma=100, shape=(1,)) # Proportionality Coefficient/Activation Energy
    C = pm.Lognormal("C", mu=np.log(1.5/10000), sigma=1/10000, shape=(1,)) # Scaled Heat Transfer Coefficient
    Cs = pm.Lognormal("Cs", mu=np.log(3.5), sigma=3, shape=(1,)) # Fuel Relative Disappearance Rate
    
    # Run the model.
    T_preds = theano_fem_solver(k, A, B, C, Cs)
    
    # Transform temperature to probabilities.
    probs = tt.nnet.sigmoid((T_preds-500)/50)
    
    # Calculate the likelihood.
    like = pm.Bernoulli('likelihood', p=probs, observed=obs)

where obs is a mask of where fire is burning, generated from a noisy run of the solver:

k_noise = fa.Constant(1.0 + np.random.normal(scale=.2))
A_noise = fa.Constant(10000. + np.random.normal(scale=1000))
B_noise = fa.Constant(300. + np.random.normal(scale=30))
C_noise = fa.Constant((1 + np.random.normal(scale=.2))/10000)
Cs_noise = fa.Constant(3. + np.random.normal(scale=.5))
Ts_noise = solve_pde(k_noise, A_noise, B_noise, C_noise, Cs_noise)
obs = to_numpy(Ts_noise)
obs = obs > 600
obs = obs.astype(int)

However, when I sample I’m getting some odd errors while NUTS is trying to initialize. The errors have to do with the fenics_pymc3 library and have been difficult to troubleshoot so far, so I wanted to ask: is this what you meant about using Bernoulli variables? Is there something obviously wrong with my implementation that you can point to?

If you get the chance, I also wonder if you could shed some light on what you meant when you said “you could assume the process is centered on the dynamics of the PDE.” I don’t understand what is meant by that.

Thanks!

Yeah, that’s the gist of it. You get the output of the PDE, transform it into probabilities, and pass it into the Bernoulli. There might be a logit_p parameter in pm.Bernoulli you could use instead of p, to avoid having to call tt.nnet.sigmoid yourself (I think logit_p does some other stability tricks under the hood, but I don’t remember if it’s available in v3).
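If it is available in your version, it would just be something like:

# Same model, but letting the Bernoulli apply the sigmoid internally.
like = pm.Bernoulli('likelihood', logit_p=(T_preds - 500) / 50, observed=obs)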

All I meant by “centered on the PDE output” was that you can take T_preds and do additional modeling with it. For example, you could say that a fire at location (i, j) depends on 1) the predicted temperature at (i, j), and 2) the predicted temperatures in the 8 adjacent locations. So then probs could come from a Multivariate Normal with mu=T_pred (that’s the “centering” I’m referring to), where the covariance matrix has the appropriate structure, maybe with just one diagonal variance and one off-diagonal adjacency parameter to estimate.
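A rough sketch of one way to set that up (hypothetical names: adjacency would be a precomputed 0/1 matrix marking neighbouring mesh nodes and n the number of nodes; this simple covariance isn’t guaranteed to be positive definite for every rho, it’s only meant to show the structure):

sigma2 = pm.HalfNormal("sigma2", sigma=1.0)      # diagonal variance
rho = pm.HalfNormal("rho", sigma=0.5)            # shared covariance between adjacent nodes
cov = sigma2 * tt.eye(n) + rho * adjacency
latent_T = pm.MvNormal("latent_T", mu=T_preds, cov=cov, shape=n)
probs = tt.nnet.sigmoid((latent_T - 500) / 50)   # then into the Bernoulli as before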

You could also combine T_pred with other features of locations (i, j) you might have.

Anyway, I guess none of that is helpful given your current troubles. Can you be more specific about the types of errors you are getting? I don’t know anything about fenics_pymc3 specifically, but checking the GitHub repo it looks like it’s not actively maintained anymore. Have you looked into sunode, which is compatible with PyMC 4.0 and actively maintained?


Okay, that’s helpful, thanks. Right now I’m experimenting with forcing other step methods so it doesn’t try NUTS at all. It seems to run but has yet to find a sample.

I looked into sunode early on, but it seems to only support ODEs, and I need to solve a PDE. fenics_pymc3 is the only suitable package I’ve found, and I couldn’t get it to play nicely with PyMC 4, which is the only reason I’m using PyMC3.

Ah I see. @aseyboldt do you know of any PDE solvers for v4?

I don’t know of any packages, unfortunately. I’m also not really sure what a package like this would look like; there are a lot of choices you can make if you want to include a PDE solution in a model.

If I had a problem involving a PDE right now, I’d probably start thinking about it along these lines:

  • Does the PDE have a time component? If so, maybe the method of lines combined with an ODE solver could be an option. sunode could do this, but unfortunately I still haven’t implemented support for aesara right-hand-side functions there, so you either have to write the right-hand side manually in numba or work with sympy. The sympy right-hand side might lead to compile-time problems if the dimensionality gets too large, though, which can easily happen with the method of lines. (Or you implement an aesara backend for the right-hand-side function in sunode and make me very happy with a PR :slight_smile: )

  • How do I want to solve the pde in the first place: Finite elements/finite diff/finite volume? What solver? If the problem is small enough, and there is no time component, maybe the PDE could be discretized and solved directly in aesara? If not, fenics is probably a good choice.

  • How large is the parameter space? If it is small (say < 20 parameters), maybe an SMC sampler could be a good choice? That way you don’t need gradients (a minimal call is sketched after this list).

  • If you want to sample with nuts, first make a rough estimate to see if that is even realistic. If you are very lucky, you’ll need 10_000 gradient evaluations, more if you have bad geometry. Is that realistic? If not, you could start to look into surrogate models? Basically, you train some machine learning model to predict the logp and gradient, and use that during sampling. You might have to retrain it for the actual posterior though, so this would be something of a research project. :slight_smile:
    The tuning improvements in nutpie might also reduce the number of gradient evals, but not by orders of magnitude.

  • How do you get at the gradient? There used to be a package (fenics-adjoint, I think) that automatically computes derivatives of PDE solutions using fenics; I don’t know if that is still well supported though (fenics recently did an API-breaking release). Is that fast enough, or is it worth the effort to write and solve the adjoint PDE manually? Either way, you then need to wrap the result in aesara.
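The SMC option mentioned in the list is just a different call to the sampler, something like this (PyMC3 API; the number of draws is a placeholder):

with model:
    # SMC only needs logp evaluations of the PDE solve, no gradients.
    trace = pm.sample_smc(draws=1000)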

It seems, however, that you already have a fenics implementation, including the gradient? Whether that is wrapped using aesara (pymc) or the old theano (pymc3) probably doesn’t make much of a difference; that seems to me to be the easiest part of the whole exercise. To switch to PyMC version 4 you should mainly have to change the imports in the implementation of theano_fem_solver from theano to aesara. :slight_smile:

If you have problems with initial points as you seem to have, it might help to start with really tight priors, reparameterized so that they are centered at zero with std 1, so something like this:

    k_raw = pm.Normal("k_raw", mu=0, sigma=1)
    k = pm.Deterministic("k", tt.exp(k_raw * 4 + np.log(1)))

    A_raw = pm.Normal("A_raw", mu=0, sigma=1)
    A = pm.Deterministic("A", tt.exp(A_raw * 100 + np.log(9000)))
    ...

but probably tighten the priors if that makes sense (I don’t know what the parameters are).

I also removed the shape=(1,), because that looks strange to me. You are saying that those are arrays of length 1, when they really look like they are just scalars (ie shape=()).


@aseyboldt Thanks a ton.

I may look into an SMC sampler. I don’t expect NUTS to actually work, since computing my gradient isn’t really feasible; I was just trying to figure out why it was exiting instead of just saying initialization failed and trying other step functions. I think I accidentally messed up my environment somewhere, because code that was running last week is broken now.

I definitely want to learn more about everything you’ve written here, but wrapping my fenics implementation in aesara seems like a quick way to see whether PyMC v4 will give me fewer issues. Is this something that would be easy for a beginner to take on? Looking at fenics_pymc3, the package is tiny and the author just uses a few functions from theano.tensor, theano.gradient, theano.graph.op and fecr to wrap around fenics. Do you think I would easily be able to find equivalent functions in aesara to do the same thing and try to use pymc v4?

Tightening priors is a great tip, and I don’t understand the reason for the shape=(1,) thing, but it makes fenics_pymc3 work.

Looking at fenics_pymc3, the package is tiny and the author just uses a few functions from theano.tensor, theano.gradient, theano.graph.op and fecr to wrap around fenics. Do you think I would easily be able to find equivalent functions in aesara to do the same thing and try to use pymc v4?

I think so. A few import locations might have changed, but the functions and classes are still almost always the same ones. If you can’t find the correct function for something, feel free to ask.
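As a rough illustration of the kind of renaming involved, here is a forward-pass-only skeleton (the real wrapper also needs the gradient, which fenics_pymc3 gets through fecr, and run_fenics_solver is just a placeholder for your existing solve):

import numpy as np
import aesara.tensor as at
from aesara.graph.op import Op  # was theano.graph.op.Op in fenics_pymc3

class FemSolverOp(Op):
    """Wraps a black-box FEM solve so it can sit inside an aesara / PyMC v4 graph."""
    itypes = [at.dvector] * 5  # k, A, B, C, Cs
    otypes = [at.dvector]      # predicted temperatures

    def perform(self, node, inputs, outputs):
        k, A, B, C, Cs = inputs
        # Call the existing fenics solve and hand the result back to aesara as a numpy array.
        outputs[0][0] = np.asarray(run_fenics_solver(k, A, B, C, Cs), dtype="float64")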

Tightening priors is a great tip, and I don’t understand the reason for the shape=(1,) thing, but it makes fenics_pymc3 work.

Ah, sounds to me like fenics_pymc3 just expects all parameters to be vectors? You could also call it as

T_preds = theano_fem_solver(k[None], A[None], B[None], C[None], Cs[None])

in that case, so that the strangeness stays close to the source. It won’t make much of a difference though, it just keeps your final trace object a bit cleaner. :slight_smile:

By the way, I also think changing the likelihood as described by @jessegrabowski here sounds like something worth investigating.
