Custom likelihood computation with no direct model-data comparison

Hello, this is perhaps a very trivial question, but I could not find a clear answer yet, so I thought I'd ask anyway.

I am running Bayesian inference on a problem with a set of about 50 variables. The variables are defined on uniform priors centered around a given value. I then define a custom likelihood to compute the posterior. Each time a new set of parameters is proposed, the custom likelihood function evaluates the proposed realization and returns its log-likelihood.

Posterior
Now the “tricky” part is that I am not directly comparing to any data. The input data only gives me the center and width of the uniform priors, nothing more. Every time I create a parameter realization, I run a forward model in which each proposed parameter is compared (after some mathematical computations) to sub-sets of the remaining proposed parameters. The comparison is made in the end by taking the differences and assuming they follow a Normal distribution. So my likelihood function returns a number based on this comparison, assuming normally distributed errors between parameter estimates, but the priors only enter through the sampling stage and never appear explicitly as they do in Bayes’ formula.
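Schematically, the comparison boils down to something like this (a minimal sketch; forward_model and subset_of are placeholders for the actual computations, and sigma is assumed to be the same for every parameter):

def pairwise_log_likelihood(theta, sigma=1.0):
    # compare each proposed parameter to a sub-set of the remaining proposed parameters
    log_lik = 0.0
    for i, theta_i in enumerate(theta):
        prediction = forward_model(theta_i)   # placeholder forward model
        reference = subset_of(theta, i)       # placeholder sub-set of the other parameters
        error = prediction - reference
        # Gaussian log-likelihood of the differences, up to an additive constant
        log_lik += -0.5 * np.sum(error ** 2) / sigma ** 2
    return log_lik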

Technical info:
I define my custom likelihood function (similar to here) and use pm.Potential, with 48 chains and pm.DEMetropolisZ as the step method. I add the code at the end if it helps.

Questions
Based on the description above I have a few questions:

  1. The errors between the proposed model parameters should follow a normal distribution, and since PyMC needs the log-likelihood, I simply return -np.sum((error)**2), where error is the difference between the proposed model parameters and their respective subsets. Is this correct? I know there is also a constant factor of (0.5/sigma**2) in front that I omit here, because I assume the same sigma for all parameters.

  2. It is not clear to me where the prior comes into play, since I do not explicitly multiply prior and likelihood at any step in my calculation. Am I missing something here?

  3. I tried using pm.Deterministic instead of pm.Potential so that the likelihood value is also stored during the inference, but even though the code runs, it does not save any draws. The progress bar shows and runs, but when I try to extract the inference data I get 0 draws. Any hint as to why this could be?

  4. When I load the inference data in ArviZ, I only get posterior and sample_stats, which I suppose is normal for pm.Potential. But in the posterior structure I notice that there are not only the variables I defined, but also a new set of variables (of the same size) with the same names but with _interval__dim_0 appended at the end. Is this expected behavior, or am I doing something wrong here?

Thank you for the feedback! I attach my code below (using mcbackend).

import random
from bedretto import data, geometry, OUTPUT_DIR
from bedretto.core.experiments import probabilities
import pandas as pd
import os
import pytensor.tensor as at
import numpy as np
import pymc as pm
import cloudpickle
import clickhouse_driver
import mcbackend
import datetime

output_directory = OUTPUT_DIR + 'bayes-fault' + os.sep

# Load data
borehole_data = data.BoreholeData()
logging_data = data.LoggingData(borehole_data)

# initialize classes
p_samp = probabilities.ProbabilisticSampling(borehole_data)
borehole_geometry = geometry.BoreholeGeometry(borehole_data)
logging_geometry = geometry.LoggingGeometry(logging_data)

logging_df = pd.read_csv(output_directory + 'priors' + os.sep + 'logging_structures.csv')
logging_df = logging_df[['Borehole', 'Depth (m)', 'Dip (deg)', 'Azimuth (deg)', 'Structure ID']]
logging_df = logging_geometry.add_center_and_normal_to_structures(logging_df, borehole_data.boreholes_kde_df)
boreholes = logging_df['Borehole'].unique()
borehole_dict = borehole_data.boreholes_kde_dict


def return_log_likelihood(vector_1,
                          vector_2,
                          lik_type: str = 'gaussian',
                          mask: np.array = None,
                          variable: str = None,
                          scale: float = 1):
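    # fold azimuths onto [0, 180] deg so that values just below 360 and just above 0 are treated as close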
    if variable == 'azimuth':
        vector_1 = np.minimum(360 - vector_1, vector_1)
        vector_2 = np.minimum(360 - vector_2, vector_2)
    distance = np.abs(np.subtract.outer(vector_1, vector_2))
    distance[~mask] = np.nan
    if lik_type == 'exponential':
        likelihood = - distance / scale
    if lik_type == 'gaussian':
        likelihood = - (distance / scale) ** 2
    if lik_type == 'uniform':
        likelihood = np.log(distance)
    likelihood[likelihood == -np.inf] = np.nan
    likelihood[likelihood == np.inf] = np.nan
    return likelihood, distance


def compute_likelihood(dips: np.array,
                       azimuths: np.array,
                       logging_dataframe: pd.DataFrame,
                       weight_distance: float = 1,
                       weight_dip: float = 1,
                       weight_azimuth: float = 1,
                       scale_with_distance: bool = True,
                       return_metadata: bool = False):
    # sample logging structures - basically get the normal vector for each structure given deviations
    # in dip and azimuth that are sampled from pymc
    updated_logging_structures = p_samp.sample_logging_structures(structures_df=logging_dataframe,
                                                                  azimuth_deviations=azimuths,
                                                                  dip_deviations=dips)

    intersection_df = borehole_geometry.plane_intersection_with_boreholes(boreholes=boreholes,
                                                                          borehole_trajectories=borehole_dict,
                                                                          surface_df=updated_logging_structures,
                                                                          plane_names=updated_logging_structures[
                                                                              'Structure ID'].to_list())
    # use mask to only compare values where intersecting borehole is equal to the borehole where a structure exists
    mask = updated_logging_structures['Borehole'].to_numpy()[:, None] == intersection_df[
        'Intersecting borehole'].to_numpy()

    depth_lik, distances = return_log_likelihood(updated_logging_structures['Depth (m)'].to_numpy(),
                                                 intersection_df['Intersection depth (m)'].to_numpy(),
                                                 mask=mask,
                                                 lik_type='gaussian',
                                                 variable='depth',
                                                 scale=1)
    dip_lik, dips = return_log_likelihood(updated_logging_structures['Dip (deg)'].to_numpy(),
                                          intersection_df['Dip (deg)'].to_numpy(),
                                          mask=mask,
                                          lik_type='gaussian',
                                          scale=3)
    azimuth_lik, azimuths = return_log_likelihood(updated_logging_structures['Azimuth (deg)'].to_numpy(),
                                                  intersection_df['Azimuth (deg)'].to_numpy(),
                                                  mask=mask,
                                                  lik_type='gaussian',
                                                  variable='azimuth',
                                                  scale=3)

    log_lik = weight_distance * depth_lik + \
              weight_dip * dip_lik + \
              weight_azimuth * azimuth_lik
    if scale_with_distance:
        distance_scaling = 1 / intersection_df['Distance (m)'].to_numpy()
        log_lik = log_lik * distance_scaling
    likelihood = pd.DataFrame(index=updated_logging_structures['Structure ID'],
                              columns=intersection_df['Structure ID'],
                              data=log_lik)

    if return_metadata:
        return likelihood, intersection_df, distances, dips, azimuths
    else:
        return likelihood


# define a pytensor Op for our likelihood function
class LogLike(at.Op):
    """
    Specify what type of object will be passed and returned to the Op when it is
    called. In our case we will be passing it a vector of values (the parameters
    that define our model) and returning a single "scalar" value (the
    log-likelihood)
    """

    itypes = [at.dvector, at.dvector]  # expects a vector of parameter values when called
    otypes = [at.dscalar]  # outputs a single scalar value (the log likelihood)

    def __init__(self, custom_function, observations, sigma):
        """
        Initialise the Op with various things that our log-likelihood function
        requires. Below are the things that are needed in this particular
        example.

        Parameters
        ----------
        custom_function:
            The log-likelihood (or whatever) function we've defined
        observations:
            The "observed" data that our log-likelihood function takes in
        sigma:
            The noise standard deviation that our function requires.
        """

        # add inputs as class attributes
        self.likelihood = custom_function
        self.data = observations
        self.sigma = sigma

    def perform(self, node, inputs, outputs):
        # the method that is used when calling the Op
        (dips, azimuths) = inputs  # this will contain my variables
        # call the log-likelihood function
        logl = self.likelihood(dips, azimuths, self.data, self.sigma)
        outputs[0][0] = np.array(logl)  # output the log-likelihood


def log_likelihood(dips: np.array,
                   azimuths: np.array,
                   logging_data: pd.DataFrame,
                   sigma: float = 1,
                   testing: bool = False):
    if testing:
        return random.random()
    else:
        model = compute_likelihood(dips=dips,
                                   azimuths=azimuths,
                                   logging_dataframe=logging_data)
        val = model.sum(skipna=True).sum(skipna=True)
        return val

def sample_posterior(logging_df: pd.DataFrame,
                     inf_mode: str = 'sample',
                     prior: str = 'uniform',
                     prior_range: float = 5):
    dimension_names = logging_df['Structure ID'].to_list()
    dip_names = ['dip_' + name for name in dimension_names]
    azimuth_names = ['azi_' + name for name in dimension_names]

    with pm.Model(coords={"dip": dip_names,
                          "azimuth": azimuth_names}) as model:
        # initialize parameter vectors
        if prior == 'normal':
            dips = pm.Normal("dips", mu=0, sigma=prior_range, dims='dip')
            azimuths = pm.Normal("azimuths", mu=0, sigma=prior_range, dims='azimuth')
        elif prior == 'uniform':
            dips = pm.Uniform("dips", upper=prior_range, lower=-prior_range, dims='dip')
            azimuths = pm.Uniform("azimuths", upper=prior_range, lower=-prior_range, dims='azimuth')
        else:
            print("Select prior distribution as normal or uniform")
            return
        dips = at.as_tensor_variable(dips)
        azimuths = at.as_tensor_variable(azimuths)
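        # wrap the custom Python log-likelihood in a pytensor Op and add it to the
        # model's joint log-probability via a Potential (there is no observed variable here)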
        log_l = LogLike(log_likelihood, logging_df, sigma=1)
        pm.Potential("likelihood", log_l(dips, azimuths))
        start_date = str(datetime.date.today())
        if inf_mode == 'sample':
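            # store the draws in a ClickHouse database via McBackend instead of the default in-memory trace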
            ch_client = clickhouse_driver.Client("localhost")
            backend = mcbackend.ClickHouseBackend(ch_client)
            trace = mcbackend.pymc.TraceBackend(backend)
            idata = pm.sample(draws=100000,
                              step=pm.DEMetropolisZ(),
                              chains=48,
                              cores=48,
                              trace=trace,
                              tune=10000,
                              progressbar=True,
                              model=model,
                              random_seed=100722,
                              discard_tuned_samples=False,
                              compute_convergence_checks=False)
            output_dir = OUTPUT_DIR + 'bayes-fault' + os.sep + 'clickhouse' + os.sep
            idata.to_dataframe().to_csv(output_dir + 'clickhouse_dataframe_' + start_date + '.csv')
            with open(output_dir + 'clickhouse_dictionary_' + start_date + '.pkl', mode='wb') as file:
                cloudpickle.dump(idata.to_dict(), file)


if __name__ == '__main__':
    sample_posterior(logging_df=logging_df,
                     inf_mode='sample',
                     prior='uniform',
                     prior_range=20)

Your written logic sounds correct; I haven’t looked at the code to check whether it does what you explained.

To answer some of your questions:

If you want to store the potential logp, you can use both a Deterministic and a Potential (not one or the other):

with pm.Model() as m:
  potential_term = ...
  pm.Deterministic("potential_term", potential_term)
  pm.Potential("potential", potential_term)

About the __interval suffix: Uniform variables are by default transformed from the constrained space to an unconstrained one using an IntervalTransform. What you see are probably the values proposed on this unconstrained space, or at least the dims are based on the name of the variable on this space. The default PyMC traces / InferenceData conversion would not show you these variables (or the dims with these names).
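For what it’s worth, the interval transform for a Uniform(lower, upper) variable corresponds roughly to this mapping (a sketch of the idea, not the exact PyMC implementation):

y = np.log((x - lower) / (upper - x))            # forward: (lower, upper) -> real line
x = lower + (upper - lower) / (1 + np.exp(-y))   # backward: real line -> (lower, upper)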

About the prior question: the prior of each Uniform variable is considered automatically by whatever sampler is responsible for those variables (in your case pm.DEMetropolisZ). The Potential is then added on top of the prior terms when the sampler considers accepting or rejecting a proposal.
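As a minimal, self-contained illustration (a toy model, not yours), the joint log-probability the sampler works with already combines the prior terms and the Potential:

import pymc as pm

with pm.Model() as m:
    x = pm.Uniform("x", lower=-1, upper=1)
    pm.Potential("pot", -x**2)
    logp_fn = m.compile_logp()

# evaluated at the unconstrained (interval-transformed) value of x; the result includes
# the Uniform prior term, the transform's Jacobian, and the Potential
print(logp_fn({"x_interval__": 0.0}))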

Hi Ricardo, thank you for the quick response! You covered pretty much all of it.

point 1. I guess this is trivial: I only need to supply the log-likelihood, which in the Gaussian case is just the sum of the squared differences with a minus sign in front, omitting the scaling.
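(In symbols, up to an additive constant: log L = -0.5 * sum(error**2) / sigma**2, so dropping the 0.5/sigma**2 factor is the same as fixing sigma = 1/sqrt(2).)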

point 2. OK, so when I call pm.sample and supply the log-likelihood, the posteriors are computed taking the priors into account, i.e. in the proper Bayesian way.

point 3. I tried running what you suggest, basically the following:
log_l = LogLike(log_likelihood, logging_df, sigma=1)
pm.Deterministic("likelihood_term", log_l(dips, azimuths))
pm.Potential("likelihood", log_l(dips, azimuths))
but I still do not get the likelihood term in my inference_data object…

point 4. I see, so I just need to omit the __interval suffix variables, got it!

Many thanks. Maybe you could just elaborate a bit more on how to use pm.Deterministic and pm.Potential together, if you have the time!

I suggest you try just a few samples with the default PyMC trace. I am confident the Deterministic will show up in the trace obtained that way. Then you have to figure out why it is not showing up with your custom trace. CC @michaelosthege
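Something along these lines (an untested sketch, reusing your model and the likelihood_term Deterministic from above):

with model:
    idata_test = pm.sample(draws=10, tune=10, chains=1, step=pm.DEMetropolisZ())
print(idata_test.posterior["likelihood_term"])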

I ran into the same problem of deterministics not showing up in the McBackend trace a few days ago.
I didn’t have the time to look into it yet, but I pushed the WIP branch: Optional McBackend support by michaelosthege · Pull Request #6510 · pymc-devs/pymc · GitHub

The tests will most likely fail with the same error.

I think the problem is in this part where it determines which model variables will be traced:

Somehow it’s skipping the deterministics.
