Hello, this is perhaps a very trivial question, but I could not find a clear answer yet, so I thought I’d ask anyway.
I am running Bayesian inference on a problem with a set of about 50 variables. The variables are defined on uniform priors centered around a given value. I then define a custom likelihood to compute the posterior: each time a new set of parameters is proposed, the custom likelihood function evaluates the realization and returns the log-likelihood.
Posterior
Now the “tricky” part is that I am not directly comparing to any data. The input data only gives me the center and width of the uniform priors. Every time a parameter realization is created, I run a forward model in which each proposed parameter is compared (after some mathematical computations) to sub-sets of the remaining proposed parameters. The comparison is made by taking the difference and assuming it is normally distributed. So my likelihood function returns a number based on this comparison, assuming normally distributed errors between the parameter estimates, but the priors are only used implicitly during the sampling stage and never appear explicitly as they do in Bayes’ formula.
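To make the idea concrete, here is a minimal sketch of the kind of comparison I mean; the function name and the subset selection are made up for illustration and are much simpler than my actual forward model:

import numpy as np

def toy_log_likelihood(theta, sigma=1.0):
    # theta: one proposed realization of all ~50 parameters
    log_lik = 0.0
    for i in range(len(theta)):
        others = np.delete(theta, i)       # the sub-set of remaining parameters
        error = theta[i] - others.mean()   # placeholder for the forward-model comparison
        log_lik += -0.5 * (error / sigma) ** 2
    return log_lik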
Technical info:
I define my custom likelihood function (similar to here) and use pm.Potential, with 50 chains and pm.DEMetropolisZ as the step method. I attach the code at the end in case it helps.
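Condensed, the model setup looks like this (a stripped-down excerpt of the full code attached at the end):

with pm.Model(coords={"dip": dip_names, "azimuth": azimuth_names}) as model:
    # uniform priors centered on zero, half-width taken from the input data
    dips = pm.Uniform("dips", lower=-prior_range, upper=prior_range, dims="dip")
    azimuths = pm.Uniform("azimuths", lower=-prior_range, upper=prior_range, dims="azimuth")
    # black-box log-likelihood wrapped in a pytensor Op and added through pm.Potential
    log_l = LogLike(log_likelihood, logging_df, sigma=1)
    pm.Potential("likelihood", log_l(dips, azimuths))
    idata = pm.sample(step=pm.DEMetropolisZ(), chains=48, cores=48)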
Questions
Based on the description above I have a few questions:
- Since the likelihood between the proposed model parameters should follow a normal distribution and PyMC needs the log-likelihood, I simply return
-np.sum((error)**2)
where error is the difference between the proposed model parameters and their respective subsets. Is this correct? I know there is also a factor of 0.5/sigma**2 multiplying the sum, which I omit here because I assume the same sigma for all parameters (see the sketch after this list).
- It is not clear to me where the prior comes into play, since I do not explicitly multiply prior and likelihood at any step in my calculation. Am I missing something here?
- I tried using pm.Deterministic instead of pm.Potential so that the likelihood value is also stored during the inference, but even though the code runs, it does not save any draws. The progress bar is showing and running, but when I try to extract the inference data I get 0 draws. Any hint why this could be?
- When I load the inference data in ArviZ, I only get posterior and sample_stats, which I suppose is normal for pm.Potential. But in the posterior structure I notice that there are not only the variables I defined, but also a new set of variables (of the same size) that have the same name with _interval__dim_0 appended at the end. Is this expected behavior or am I doing something wrong here?
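For the first question, this is how I understand the full Gaussian log-likelihood versus the simplified version I actually return (assuming a single common sigma for all comparisons):

# full log-likelihood of independent Gaussian errors with a common sigma
def full_gaussian_loglik(error, sigma=1.0):
    n = error.size
    return -0.5 * np.sum((error / sigma) ** 2) - n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi)

# what I currently return (omits the 0.5/sigma**2 factor and the constant terms)
def simplified_loglik(error):
    return -np.sum(error ** 2)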
Thank you for the feedback! I attach my code below (using mcbackend).
import random
from bedretto import data, geometry, OUTPUT_DIR
from bedretto.core.experiments import probabilities
import pandas as pd
import os
import pytensor.tensor as at
import numpy as np
import pymc as pm
import cloudpickle
import clickhouse_driver
import mcbackend
import datetime
output_directory = OUTPUT_DIR + 'bayes-fault' + os.sep
# Load data
borehole_data = data.BoreholeData()
logging_data = data.LoggingData(borehole_data)
# initialize classes
p_samp = probabilities.ProbabilisticSampling(borehole_data)
borehole_geometry = geometry.BoreholeGeometry(borehole_data)
logging_geometry = geometry.LoggingGeometry(logging_data)
logging_df = pd.read_csv(output_directory + 'priors' + os.sep + 'logging_structures.csv')
logging_df = logging_df[['Borehole', 'Depth (m)', 'Dip (deg)', 'Azimuth (deg)', 'Structure ID']]
logging_df = logging_geometry.add_center_and_normal_to_structures(logging_df, borehole_data.boreholes_kde_df)
boreholes = logging_df['Borehole'].unique()
borehole_dict = borehole_data.boreholes_kde_dict
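# pairwise comparison of two parameter vectors: returns a matrix of (unnormalised) log-likelihood
# terms and the matrix of absolute differences, with non-comparable pairs masked out as NaN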
def return_log_likelihood(vector_1,
vector_2,
lik_type: str = 'gaussian',
mask: np.array = None,
variable: str = None,
scale: float = 1):
if variable == 'azimuth':
vector_1 = np.minimum(360 - vector_1, vector_1)
vector_2 = np.minimum(360 - vector_2, vector_2)
distance = np.abs(np.subtract.outer(vector_1, vector_2))
distance[~mask] = np.nan
if lik_type == 'exponential':
likelihood = - distance / scale
if lik_type == 'gaussian':
likelihood = - (distance / scale) ** 2
if lik_type == 'uniform':
likelihood = np.log(distance)
likelihood[likelihood == -np.inf] = np.nan
likelihood[likelihood == np.inf] = np.nan
return likelihood, distance
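# forward model: perturb the logged structures by the proposed dip/azimuth deviations, intersect the
# resulting planes with the boreholes, and combine the depth/dip/azimuth log-likelihood terms into one DataFrame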
def compute_likelihood(dips: np.array,
azimuths: np.array,
logging_dataframe: pd.DataFrame,
weight_distance: float = 1,
weight_dip: float = 1,
weight_azimuth: float = 1,
scale_with_distance: bool = True,
return_metadata: bool = False):
# sample logging structures - basically get the normal vector for each structure given deviations
# in dip and azimuth that are sampled from pymc
updated_logging_structures = p_samp.sample_logging_structures(structures_df=logging_dataframe,
azimuth_deviations=azimuths,
dip_deviations=dips)
intersection_df = borehole_geometry.plane_intersection_with_boreholes(boreholes=boreholes,
borehole_trajectories=borehole_dict,
surface_df=updated_logging_structures,
plane_names=updated_logging_structures[
'Structure ID'].to_list())
# use mask to only compare values where intersecting borehole is equal to the borehole where a structure exists
mask = updated_logging_structures['Borehole'].to_numpy()[:, None] == intersection_df[
'Intersecting borehole'].to_numpy()
depth_lik, distances = return_log_likelihood(updated_logging_structures['Depth (m)'].to_numpy(),
intersection_df['Intersection depth (m)'].to_numpy(),
mask=mask,
lik_type='gaussian',
variable='depth',
scale=1)
dip_lik, dips = return_log_likelihood(updated_logging_structures['Dip (deg)'].to_numpy(),
intersection_df['Dip (deg)'].to_numpy(),
mask=mask,
lik_type='gaussian',
scale=3)
azimuth_lik, azimuths = return_log_likelihood(updated_logging_structures['Azimuth (deg)'].to_numpy(),
intersection_df['Azimuth (deg)'].to_numpy(),
mask=mask,
lik_type='gaussian',
variable='azimuth',
scale=3)
log_lik = weight_distance * depth_lik + \
weight_dip * dip_lik + \
weight_azimuth * azimuth_lik
if scale_with_distance:
distance_scaling = 1 / intersection_df['Distance (m)'].to_numpy()
log_lik = log_lik * distance_scaling
likelihood = pd.DataFrame(index=updated_logging_structures['Structure ID'],
columns=intersection_df['Structure ID'],
data=log_lik)
if return_metadata:
return likelihood, intersection_df, distances, dips, azimuths
else:
return likelihood
# define a pytensor Op for our likelihood function
class LogLike(at.Op):
"""
Specify what type of object will be passed and returned to the Op when it is
called. In our case we will be passing it a vector of values (the parameters
that define our model) and returning a single "scalar" value (the
log-likelihood)
"""
itypes = [at.dvector, at.dvector] # expects a vector of parameter values when called
otypes = [at.dscalar] # outputs a single scalar value (the log likelihood)
def __init__(self, custom_function, observations, sigma):
"""
Initialise the Op with various things that our log-likelihood function
requires. Below are the things that are needed in this particular
example.
Parameters
----------
custom_function:
The log-likelihood (or whatever) function we've defined
observations:
The "observed" data that our log-likelihood function takes in
sigma:
The noise standard deviation that our function requires.
"""
# add inputs as class attributes
self.likelihood = custom_function
self.data = observations
self.sigma = sigma
def perform(self, node, inputs, outputs):
# the method that is used when calling the Op
(dips, azimuths) = inputs # this will contain my variables
# call the log-likelihood function
logl = self.likelihood(dips, azimuths, self.data, self.sigma)
outputs[0][0] = np.array(logl) # output the log-likelihood
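# wrapper passed to the LogLike Op: runs the forward model and returns the total log-likelihood summed
# over all structure pairs (the testing flag short-circuits with a random number for quick checks)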
def log_likelihood(dips: np.array,
azimuths: np.array,
logging_data: pd.DataFrame,
sigma: float = 1,
testing: bool = False):
if testing:
return random.random()
else:
model = compute_likelihood(dips=dips,
azimuths=azimuths,
logging_dataframe=logging_data)
val = model.sum(skipna=True).sum(skipna=True)
return val
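# build the PyMC model (uniform or normal priors on the dip/azimuth deviations plus the black-box
# likelihood added via pm.Potential) and sample with DEMetropolisZ, streaming the trace to ClickHouse via mcbackend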
def sample_posterior(logging_df: pd.DataFrame,
inf_mode: str = 'sample',
prior: str = 'uniform',
prior_range: float = 5):
dimension_names = logging_df['Structure ID'].to_list()
dip_names = ['dip_' + name for name in dimension_names]
azimuth_names = ['azi_' + name for name in dimension_names]
with pm.Model(coords={"dip": dip_names,
"azimuth": azimuth_names}) as model:
# initialize parameter vectors
if prior == 'normal':
dips = pm.Normal("dips", mu=0, sigma=prior_range, dims='dip')
azimuths = pm.Normal("azimuths", mu=0, sigma=prior_range, dims='azimuth')
elif prior == 'uniform':
dips = pm.Uniform("dips", upper=prior_range, lower=-prior_range, dims='dip')
azimuths = pm.Uniform("azimuths", upper=prior_range, lower=-prior_range, dims='azimuth')
else:
print("Select prior distribution as normal or uniform")
return
dips = at.as_tensor_variable(dips)
azimuths = at.as_tensor_variable(azimuths)
log_l = LogLike(log_likelihood, logging_df, sigma=1)
pm.Potential("likelihood", log_l(dips, azimuths))
start_date = str(datetime.date.today())
if inf_mode == 'sample':
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)
trace = mcbackend.pymc.TraceBackend(backend)
idata = pm.sample(draws=100000,
step=pm.DEMetropolisZ(),
chains=48,
cores=48,
trace=trace,
tune=10000,
progressbar=True,
model=model,
random_seed=100722,
discard_tuned_samples=False,
compute_convergence_checks=False)
output_dir = OUTPUT_DIR + 'bayes-fault' + os.sep + 'clickhouse' + os.sep
idata.to_dataframe().to_csv(output_dir + 'clickhouse_dataframe_' + start_date + '.csv')
with open(output_dir + 'clickhouse_dictionary_' + start_date + '.pkl', mode='wb') as file:
cloudpickle.dump(idata.to_dict(), file)
if __name__ == '__main__':
sample_posterior(logging_df=logging_df,
inf_mode='sample',
prior='uniform',
prior_range=20)