# Custom likelihood computation with no direct model-data comparison

Hello, this is perhaps a very trivial question but I could not find a clear anwser yet so I thought i’d ask anyway.

I am running Bayesian inference on a problem where I have a set of about 50 variables. The variables are defined on uniform priors centered around a given value. I then define a custom likelihood to compute the posterior. Each time a new set of parameters is proposed, the custom likelihood function compares the realization to give out the log-likelihood.

Posterior
Now the “tricky” part is that I am not directly comparing to any data. The input data gives me the center and width of the uniform prior, but that’s it. Every time I create a parameter realization, I run a forward model where the each proposed parameter is compared (after some mathematical computations) to sub-sets of the remaining proposed parameters. The comparison in the end is made by taking the difference and assuming a Normal distribution. So my likelihood function gives a number based on this comparison and assuming normally distributed errors between parameter estimates, but I only use the priors during the sampling stage and not explicitly as it appears in Bayes’ formula.

Technical info:
I define my custom likelihood function (similar to here) and use pm.Potential, and 50 chains with pm.DMetropolisZ as a step. I add the code in the end if it helps.

Questions
Based on the description above I have a few questions:

1. The likelihood between the proposed model parameters should follow a normal distribution, and pyMC needs the log-likelihood, then I simply return `-np.sum((error)**2)` where error is the difference between proposed model parameters and their respective subsets. Is this correct? I know there is also a constant factor of (0.5/sigma**2) in between that I omit here because I assume the same sigma for all parameters.

2. It is not clear to me where the prior comes into play, since I do not explicity multiply prior and likelihood at any step in my calculation. Am I missing something here?

3. I tried using pm.Deterministic instead of pm.Potential so that the likelihood value is also stored during the inference, but even though the code runs, it does not save any draws. The progressbar is showing and running, but when I try to extract the inference data I get 0 draws. Any hint why this could be?

4. When I load the inference data in arviz. I only get posterior and sample_stats, which I suppose is normal for pm.Potential. But in the posterior structure I notice that there are not only the variables I defined, but also a new set of variables (of the same size) that have the same name with an appended _interval__dim_0 in the end. Is this expected behavior or am I doing something wrong here?

Thank you for the feedback! I attach my code below (using mcbackend).

``````import random
from bedretto import data, geometry, OUTPUT_DIR
from bedretto.core.experiments import probabilities
import pandas as pd
import os
import pytensor.tensor as at
import numpy as np
import pymc as pm
import cloudpickle
import clickhouse_driver
import mcbackend
import datetime

output_directory = OUTPUT_DIR + 'bayes-fault' + os.sep

borehole_data = data.BoreholeData()
logging_data = data.LoggingData(borehole_data)

# initialize classes
p_samp = probabilities.ProbabilisticSampling(borehole_data)
borehole_geometry = geometry.BoreholeGeometry(borehole_data)
logging_geometry = geometry.LoggingGeometry(logging_data)

logging_df = pd.read_csv(output_directory + 'priors' + os.sep + 'logging_structures.csv')
logging_df = logging_df[['Borehole', 'Depth (m)', 'Dip (deg)', 'Azimuth (deg)', 'Structure ID']]
boreholes = logging_df['Borehole'].unique()
borehole_dict = borehole_data.boreholes_kde_dict

def return_log_likelihood(vector_1,
vector_2,
lik_type: str = 'gaussian',
variable: str = None,
scale: float = 1):
if variable == 'azimuth':
vector_1 = np.minimum(360 - vector_1, vector_1)
vector_2 = np.minimum(360 - vector_2, vector_2)
distance = np.abs(np.subtract.outer(vector_1, vector_2))
if lik_type == 'exponential':
likelihood = - distance / scale
if lik_type == 'gaussian':
likelihood = - (distance / scale) ** 2
if lik_type == 'uniform':
likelihood = np.log(distance)
likelihood[likelihood == -np.inf] = np.nan
likelihood[likelihood == np.inf] = np.nan
return likelihood, distance

def compute_likelihood(dips: np.array,
azimuths: np.array,
logging_dataframe: pd.DataFrame,
weight_distance: float = 1,
weight_dip: float = 1,
weight_azimuth: float = 1,
scale_with_distance: bool = True,
# sample logging structures - basically get the normal vector for each structure given deviations
# in dip and azimuth that are sampled from pymc
updated_logging_structures = p_samp.sample_logging_structures(structures_df=logging_dataframe,
azimuth_deviations=azimuths,
dip_deviations=dips)

intersection_df = borehole_geometry.plane_intersection_with_boreholes(boreholes=boreholes,
borehole_trajectories=borehole_dict,
surface_df=updated_logging_structures,
plane_names=updated_logging_structures[
'Structure ID'].to_list())
# use mask to only compare values where intersecting borehole is equal to the borehole where a structure exists
mask = updated_logging_structures['Borehole'].to_numpy()[:, None] == intersection_df[
'Intersecting borehole'].to_numpy()

depth_lik, distances = return_log_likelihood(updated_logging_structures['Depth (m)'].to_numpy(),
intersection_df['Intersection depth (m)'].to_numpy(),
lik_type='gaussian',
variable='depth',
scale=1)
dip_lik, dips = return_log_likelihood(updated_logging_structures['Dip (deg)'].to_numpy(),
intersection_df['Dip (deg)'].to_numpy(),
lik_type='gaussian',
scale=3)
azimuth_lik, azimuths = return_log_likelihood(updated_logging_structures['Azimuth (deg)'].to_numpy(),
intersection_df['Azimuth (deg)'].to_numpy(),
lik_type='gaussian',
variable='azimuth',
scale=3)

log_lik = weight_distance * depth_lik + \
weight_dip * dip_lik + \
weight_azimuth * azimuth_lik
if scale_with_distance:
distance_scaling = 1 / intersection_df['Distance (m)'].to_numpy()
log_lik = log_lik * distance_scaling
likelihood = pd.DataFrame(index=updated_logging_structures['Structure ID'],
columns=intersection_df['Structure ID'],
data=log_lik)

return likelihood, intersection_df, distances, dips, azimuths
else:
return likelihood

# define an aesara Op for our likelihood function
class LogLike(at.Op):
"""
Specify what type of object will be passed and returned to the Op when it is
called. In our case we will be passing it a vector of values (the parameters
that define our model) and returning a single "scalar" value (the
log-likelihood)
"""

itypes = [at.dvector, at.dvector]  # expects a vector of parameter values when called
otypes = [at.dscalar]  # outputs a single scalar value (the log likelihood)

def __init__(self, custom_function, observations, sigma):
"""
Initialise the Op with various things that our log-likelihood function
requires. Below are the things that are needed in this particular
example.

Parameters
----------
custom_function:
The log-likelihood (or whatever) function we've defined
observations:
The "observed" data that our log-likelihood function takes in
sigma:
The noise standard deviation that our function requires.
"""

# add inputs as class attributes
self.likelihood = custom_function
self.data = observations
self.sigma = sigma

def perform(self, node, inputs, outputs):
# the method that is used when calling the Op
(dips, azimuths) = inputs  # this will contain my variables
# call the log-likelihood function
logl = self.likelihood(dips, azimuths, self.data, self.sigma)
outputs[0][0] = np.array(logl)  # output the log-likelihood

def log_likelihood(dips: np.array,
azimuths: np.array,
logging_data: pd.DataFrame,
sigma: float = 1,
testing: bool = False):
if testing:
return random.random()
else:
model = compute_likelihood(dips=dips,
azimuths=azimuths,
logging_dataframe=logging_data)
val = model.sum(skipna=True).sum(skipna=True)
return val

def sample_posterior(logging_df: pd.DataFrame,
inf_mode: str = 'sample',
prior: str = 'uniform',
prior_range: float = 5):
dimension_names = logging_df['Structure ID'].to_list()
dip_names = ['dip_' + name for name in dimension_names]
azimuth_names = ['azi_' + name for name in dimension_names]

with pm.Model(coords={"dip": dip_names,
"azimuth": azimuth_names}) as model:
# initialize parameter vectors
if prior == 'normal':
dips = pm.Normal("dips", mu=0, sigma=prior_range, dims='dip')
azimuths = pm.Normal("azimuths", mu=0, sigma=prior_range, dims='azimuth')
elif prior == 'uniform':
dips = pm.Uniform("dips", upper=prior_range, lower=-prior_range, dims='dip')
azimuths = pm.Uniform("azimuths", upper=prior_range, lower=-prior_range, dims='azimuth')
else:
print("Select prior distribution as normal or uniform")
return
dips = at.as_tensor_variable(dips)
azimuths = at.as_tensor_variable(azimuths)
log_l = LogLike(log_likelihood, logging_df, sigma=1)
pm.Potential("likelihood", log_l(dips, azimuths))
start_date = str(datetime.date.today())
if inf_mode == 'sample':
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)
trace = mcbackend.pymc.TraceBackend(backend)
idata = pm.sample(draws=100000,
step=pm.DEMetropolisZ(),
chains=48,
cores=48,
trace=trace,
tune=10000,
progressbar=True,
model=model,
random_seed=100722,
compute_convergence_checks=False)
output_dir = OUTPUT_DIR + 'bayes-fault' + os.sep + 'clickhouse' + os.sep
idata.to_dataframe().to_csv(output_dir + 'clickhouse_dataframe_' + start_date + '.csv')
with open(output_dir + 'clickhouse_dictionary_' + start_date + '.pkl', mode='wb') as file:
cloudpickle.dump(idata.to_dict(), file)

if __name__ == '__main__':
sample_posterior(logging_df=logging_df,
inf_mode='sample',
prior='uniform',
prior_range=20)

``````

Your written logic sounds correct, I haven’t looked at the code to check if it does what you explained.

If you want to store the potential logp you can use both a `Deterministic` and a `Potential` (not one or the other).

``````with pm.Model() as m:
potential_term = ...
pm.Deterministic("potential_term", potential_term)
pm.Potential("potential", potential_term)
``````

About the `__interval` suffix, Uniform variables are by default transformed from the constrained space to an unconstrained one using an IntervalTransform. What you see are probably the values proposed on this unconstrained space, or at least the dims are based on the name of the variable on this space. The default PyMC traces/ InferenceData conversion would not show you these variables (or the dims with these names).

About the Prior question. The prior of each Uniform variable is considered automatically by whatever sampler is responsible for those variables (likely NUTS). The Potential is then added on top of the prior terms when the samplers considers accepting or rejecting a proposal.

Hi Ricardo, thank you for the quick response! You covered pretty much all of it.

point 1. I guess this is trivial and I only need to supply the log-likelihood, in the Gaussian case this is just the sum of the square of the differences with a minus sign in front, omitting the scaling.

point 2. OK, so when I call pm.sample and give the log-likelihood, the posteriors are computed by taking into account the priors, so proper bayesian way.

point 3. I tried running what you suggest, basically the following:
log_l = LogLike(log_likelihood, logging_df, sigma=1)
pm.Deterministic(“likelihood_term”, log_l(dips, azimuths))
pm.Potential(“likelihood”, log_l(dips, azimuths))
but I still do not get the likelihood term in my inference_data object…

point 4. I see, so I just need to omit the `__interval` suffix variables, got it!

Many thanks. Maybe if you can just elaborate a bit more on how to use both pm.Deterministic and pm.Potential together, if you have the time!

I suggest you try just a few samples with the default PyMC trace. I am confident the Deterministic will show in the trace obtained that way. Then you have to figure out why it is not showing up with your custom trace. CC @michaelosthege

I ran into the same problem of deterministics not showing up in the McBackend trace a few days ago.
I didn’t have the time to look into it yet, but I pushed the WIP branch: Optional McBackend support by michaelosthege · Pull Request #6510 · pymc-devs/pymc · GitHub

The tests will most likely fail with the same error.

I think the problem is in this part where it determines which models variables will be traced:

Somehow it’s skipping the deterministics.

1 Like