Saving intermediate results using MCMC in PyMC 4

Hello,

There is a similar question asked here, but it remained unanswered; it is also relevant to Metropolis sampling.

I am sampling a custom likelihood function that takes about 20 seconds per evaluation, so a full inference takes days (there is no way to make the likelihood faster, since I run a physics-based forward solver that is mesh dependent). I would like to save intermediate results, say every 300 iterations or so. Right now I am using Sequential Monte Carlo, but I was wondering if there is a general way to do this for all MCMC samplers.

I would imagine it looking like this, based on the tagged example above:

import cloudpickle

runs = 10  # how many runs to do, saving each intermediate run
for i1 in range(runs):
    with pm.Model() as model:
        X = {initialize variables}
        log_likelihood = {call custom likelihood}

        if i1 == 0:
            # first iteration: initialize the variables
            trace = pm.sample_smc(draws=300, ...)
        else:
            # give the previous trace as a starting point
            trace = pm.sample_smc(draws=300, ...)
        # save the iteration
        with open('iteration' + str(i1) + '.pkl', mode='wb') as file:
            cloudpickle.dump(trace, file)

Is there a way to generalize this for all MCMC samplers? Thank you for the feedback!


I would check out the documentation for iter_sample() and see if that gets you what you need. I’m not sure how it interacts with SMC, but I think it’s the recommended way of doing things “during” sampling. Someone else can chime in if there are other approaches.

Thanks, that's a great suggestion. I think I got it to work; I'm just not sure if I am properly supplying the starting point for the iter_sample() method. The documentation on this is not great:

start : dict
    Starting point in parameter space (or partial point).

I have the following code, which saves every 10th iteration. It does not use the actual likelihood I want; it is just a self-contained example taken from tutorials and previous examples.

import aesara.tensor as at
import numpy as np
import pymc as pm
import cloudpickle
from pymc import iter_sample

n = 4

mu1 = np.ones(n) * (1.0 / 2)
mu2 = -mu1

stdev = 0.1
sigma = np.power(stdev, 2) * np.eye(n)
isigma = np.linalg.inv(sigma)
dsigma = np.linalg.det(sigma)

w1 = 0.1  # one mode with 0.1 of the mass
w2 = 1 - w1  # the other mode with 0.9 of the mass


def two_gaussians(x):
    log_like1 = (
            -0.5 * n * at.log(2 * np.pi)
            - 0.5 * at.log(dsigma)
            - 0.5 * (x - mu1).T.dot(isigma).dot(x - mu1)
    )
    log_like2 = (
            -0.5 * n * at.log(2 * np.pi)
            - 0.5 * at.log(dsigma)
            - 0.5 * (x - mu2).T.dot(isigma).dot(x - mu2)
    )
    return pm.math.logsumexp([at.log(w1) + log_like1, at.log(w2) + log_like2])

with pm.Model() as model:
    X = pm.Uniform(
        "X",
        shape=n,
        lower=-2.0 * np.ones_like(mu1),
        upper=2.0 * np.ones_like(mu1),
        initval=-1.0 * np.ones_like(mu1),
    )
    llk = pm.Potential("llk", two_gaussians(X))
    step = pm.Metropolis()
    iter_counter = 0
    for trace in iter_sample(draws=100,
                             start=model.initial_point(),
                             step=step,
                             model=model):
        iter_counter += 1
        if iter_counter % 10 == 0:
            print('Saving trace for iteration ' + str(iter_counter))
            with open('iteration' + str(iter_counter) + '.pkl', mode='wb') as file:
                cloudpickle.dump(trace, file)

Check out mcbackend, it was designed to solve exactly this problem.
The README explains how you can use McBackend to stream MCMC draws to a ClickHouse database and pull them in real-time.

Note that the idea of restarting/continuing an MCMC run sounds much, much easier than it is. To do it properly, one needs to capture not only the internal state of the samplers, but also all random number generators…
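To illustrate just the random-number part of that warning, here is a minimal sketch (assuming NumPy generators and stdlib pickle; PyMC's own sampler state is more involved):

```python
import pickle
import numpy as np

# NumPy Generator objects are picklable, so their exact state
# can be saved alongside a trace.
rng = np.random.default_rng(8927)
rng.normal(size=10)  # the sampler consumes part of the stream

state = pickle.dumps(rng)       # persist the generator state...
restored = pickle.loads(state)  # ...and restore it on restart

# the restored generator continues the identical random stream
assert restored.normal() == rng.normal()
```

Without this, a "continued" run draws from a fresh stream and is not a continuation of the same chain.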


Just a comment here: saving with cloudpickle works, but saving with arviz does not. If I replace the cloudpickle save:

with open('iteration' + str(iter_counter) + '.pkl', mode='wb') as file:
    cloudpickle.dump(trace, file)

with

arviz.InferenceData.to_netcdf(trace, filename='trace_' + str(iter_counter) + '.netcdf')

Then I get the following error:

AttributeError: 'MultiTrace' object has no attribute '_groups_all'

Thanks, that’s great. I’ll try to implement this. Do you know if this approach increases the computation time, since it saves the results as it goes?

In mcbackend PR #20 I added speed tests.

Here are the results for the ClickHouseBackend with a server running locally in Docker:

title           bytes_per_draw  append_speed                 description
big_variables   44400           10.9 MiB/s (257.3 draws/s)   One chain with 3 variables of shapes (100,), (1000,) and (100, 100).
many_draws      56              1.1 MiB/s (21288.2 draws/s)  One chain of (), (3,) and (5,2) float32 variables.
many_variables  3600            3.8 MiB/s (1116.0 draws/s)   One chain with 300 variables of shapes (), (3,) and (5,2).
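As a sanity check on those numbers, bytes_per_draw is just the float32 payload per draw; for big_variables:

```python
import numpy as np

# big_variables: one chain with three float32 variables of
# shapes (100,), (1000,) and (100, 100)
shapes = [(100,), (1000,), (100, 100)]
itemsize = np.dtype("float32").itemsize  # 4 bytes

bytes_per_draw = sum(int(np.prod(s)) * itemsize for s in shapes)
print(bytes_per_draw)  # 44400, matching the table
```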

I tried mcbackend and got it to work, but I still have some questions. I adapted the instructions from the git project and used this inference example to create a test case. I use the ClickHouse backend and would like to access the inference results while my model is running (a proper inference takes several days, almost a week, for me).

Is this only possible with a ClickHouse backend, or can I also do it with a NumPy backend? The problem is that I am working on a server without sudo access, so I cannot start a local ClickHouse server (the command on the git project requires sudo).

Thanks for the help!

Here’s the example that works:

import clickhouse_driver
import mcbackend
import pymc as pm
import arviz as az
import numpy as np

# Initialize random number generator
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variables
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0] * X1 + beta[1] * X2 + np.random.randn(size) * sigma

# 1. Create any kind of backend
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)

with pm.Model():
    # Priors for unknown model parameters
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Expected value of outcome
    mu = alpha + beta[0] * X1 + beta[1] * X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

    # 4. Wrap the PyMC adapter around an `mcbackend.Backend`
    #    This generates and prints a short `trace.run_id` by which
    #    this MCMC run is identified in the (database) backend.
    trace = mcbackend.pymc.TraceBackend(backend)

    pm.sample(trace=trace)

@shakasaki you’ll need the ClickHouse backend, or you could implement another kind of backend that saves to a file periodically. The NumPy backend is only for in-memory situations.

If you can’t run a ClickHouse database on the server where the MCMC is running, you could still launch it on another machine, as long as your MCMC run can reach the port.
In any case, I would recommend running ClickHouse in a Docker container and taking the usual steps for running a database in Docker: a persistent volume so you don’t lose the data, password authentication…

Great, thank you for clarifying. I think the two solutions suggested here are both relevant. Since I cannot mark two posts as solutions, I summarize them here so that readers can follow both:

  1. Solution of using iter_sample() and cloudpickle to save intermediate runs, proposed by @cluhmann

  2. Using the mcbackend and a Clickhouse database, proposed by @michaelosthege

Thanks to both of you for the feedback!
