Saving intermediate results using MCMC in PyMC 4

Hello,

There is a similar question asked here, but it remained unanswered; it is also relevant to Metropolis sampling.

I am sampling a custom likelihood function that takes about 20 seconds per evaluation, so a full inference takes days (there is no way to make the likelihood faster, since I run a physics-based forward solver that is mesh dependent). I would like to save intermediate results, say every 300 iterations or so. Right now I am using Sequential Monte Carlo, but I was wondering if there is a general way to do this for all MCMC samplers.

I would imagine it looking like this, based on the tagged example above:

import cloudpickle

runs = 10  # how many runs to do, saving each intermediate run
for i1 in range(runs):
    with pm.Model() as model:
        X = {initialize variables}
        log_likelihood = {call custom likelihood}

        if i1 == 0:
            # first iteration: initialize the variables
            trace = pm.sample_smc(draws=300, ...)
        else:
            # give the previous trace as a starting point
            trace = pm.sample_smc(draws=300, ...)
        # save the iteration
        with open('iteration' + str(i1) + '.pkl', mode='wb') as file:
            cloudpickle.dump(trace, file)

Is there a way to generalize this for all MCMC samplers? Thank you for the feedback!


I would check out the documentation for iter_sample() and see if that gets you what you need. I’m not sure how it interacts with SMC, but I think it’s the recommended way of doing things “during” sampling. Someone else can chime in if there are other approaches.

Thanks, that's a great suggestion. I think I got it to work; I'm just not sure if I am properly supplying the starting point for the iter_sample() method. The documentation on this is not great:

start : dict
    Starting point in parameter space (or partial point).

I have the following code, which saves every 10th iteration. It does not use the actual likelihood I want; it is just a self-contained example taken from tutorials and previous examples.

import aesara.tensor as at
import numpy as np
import pymc as pm
import cloudpickle
from pymc import iter_sample

n = 4

mu1 = np.ones(n) * (1.0 / 2)
mu2 = -mu1

stdev = 0.1
sigma = np.power(stdev, 2) * np.eye(n)
isigma = np.linalg.inv(sigma)
dsigma = np.linalg.det(sigma)

w1 = 0.1  # one mode with 0.1 of the mass
w2 = 1 - w1  # the other mode with 0.9 of the mass


def two_gaussians(x):
    log_like1 = (
            -0.5 * n * at.log(2 * np.pi)
            - 0.5 * at.log(dsigma)
            - 0.5 * (x - mu1).T.dot(isigma).dot(x - mu1)
    )
    log_like2 = (
            -0.5 * n * at.log(2 * np.pi)
            - 0.5 * at.log(dsigma)
            - 0.5 * (x - mu2).T.dot(isigma).dot(x - mu2)
    )
    return pm.math.logsumexp([at.log(w1) + log_like1, at.log(w2) + log_like2])

with pm.Model() as model:
    X = pm.Uniform(
        "X",
        shape=n,
        lower=-2.0 * np.ones_like(mu1),
        upper=2.0 * np.ones_like(mu1),
        initval=-1.0 * np.ones_like(mu1),
    )
    llk = pm.Potential("llk", two_gaussians(X))
    step = pm.Metropolis()
    iter_counter = 0
    for trace in iter_sample(draws=100,
                             start=model.initial_point(),
                             step=step,
                             model=model):
        iter_counter += 1
        if iter_counter % 10 == 0:
            print('Saving trace for iteration ' + str(iter_counter))
            with open('iteration' + str(iter_counter) + '.pkl', mode='wb') as file:
                cloudpickle.dump(trace, file)

Check out mcbackend, it was designed to solve exactly this problem.
The README explains how you can use McBackend to stream MCMC draws to a ClickHouse database and pull them in real-time.

Note that the idea of restarting/continuing an MCMC run sounds much, much easier than it is. To do it properly, one needs to capture not only the internal state of the samplers, but also all random number generators…
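To illustrate just the random-number part of that warning, here is a minimal sketch (assuming NumPy generators and stdlib pickle; PyMC's own sampler state is more involved):

```python
import pickle
import numpy as np

# NumPy Generator objects are picklable, so their exact state
# can be saved alongside a trace.
rng = np.random.default_rng(8927)
rng.normal(size=10)  # the sampler consumes part of the stream

state = pickle.dumps(rng)       # persist the generator state...
restored = pickle.loads(state)  # ...and restore it on restart

# the restored generator continues the identical random stream
assert restored.normal() == rng.normal()
```

Without this, a "continued" run draws from a fresh stream and is not a continuation of the same chain.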


Just a comment here: saving with cloudpickle works, but saving with arviz does not. If I replace the cloudpickle save:

with open('iteration' + str(iter_counter) + '.pkl', mode='wb') as file:
    cloudpickle.dump(trace, file)

with

arviz.InferenceData.to_netcdf(trace, filename='trace_' + str(iter_counter) + '.netcdf')

Then I get the following error:

AttributeError: 'MultiTrace' object has no attribute '_groups_all'

Thanks, that’s great. I’ll try to implement this. Do you know if this approach increases the computation time, since it saves the results as it goes?

In mcbackend PR #20 I added speed tests.

Here are the results for the ClickHouseBackend with a server running locally in Docker:

title           bytes_per_draw  append_speed                 description
big_variables   44400           10.9 MiB/s (257.3 draws/s)   One chain with 3 variables of shapes (100,), (1000,) and (100, 100).
many_draws      56              1.1 MiB/s (21288.2 draws/s)  One chain of (), (3,) and (5,2) float32 variables.
many_variables  3600            3.8 MiB/s (1116.0 draws/s)   One chain with 300 variables of shapes (), (3,) and (5,2).
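As a sanity check on those numbers, bytes_per_draw is just the float32 payload per draw; for big_variables:

```python
import numpy as np

# big_variables: one chain with three float32 variables of
# shapes (100,), (1000,) and (100, 100)
shapes = [(100,), (1000,), (100, 100)]
itemsize = np.dtype("float32").itemsize  # 4 bytes

bytes_per_draw = sum(int(np.prod(s)) * itemsize for s in shapes)
print(bytes_per_draw)  # 44400, matching the table
```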

I tried mcbackend and got it to work, but I still have some questions. I adapted the instructions from the git project and used this inference example to create a test case. I use the ClickHouse backend and would like to access the inference results while my model is running (a proper inference takes several days, almost a week, for me).

Is this only possible with a ClickHouse backend, or can I also do it with a NumPy backend? The problem is that I am working on a server without sudo access, so I cannot start a local ClickHouse server (the command on the git project requires sudo).

Thanks for the help!

Here’s the example that works:

import clickhouse_driver
import mcbackend
import pymc as pm
import arviz as az
import numpy as np

# Initialize random number generator
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variables
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0] * X1 + beta[1] * X2 + np.random.randn(size) * sigma

# 1. Create any kind of backend
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)

with pm.Model():
    # Priors for unknown model parameters
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Expected value of outcome
    mu = alpha + beta[0] * X1 + beta[1] * X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

    # 4. Wrap the PyMC adapter around an `mcbackend.Backend`
    #    This generates and prints a short `trace.run_id` by which
    #    this MCMC run is identified in the (database) backend.
    trace = mcbackend.pymc.TraceBackend(backend)

    pm.sample(trace=trace)

@shakasaki you’ll need the ClickHouse backend, or you could implement another kind of backend that saves to a file periodically. The NumPy backend is only for in-memory situations.

If you can’t run a ClickHouse database on the server where the MCMC is running, you could still launch it on another machine, as long as your MCMC run can reach the port.
In any case, I would recommend running ClickHouse in a Docker container and taking the usual steps for running a database in Docker: a persistent volume so you don’t lose the data, password authentication…

Great, thank you for clarifying. I think the two solutions suggested here are both relevant. Since I cannot mark two posts as solutions, I summarize them here so that readers can follow both:

  1. Solution of using iter_sample() and cloudpickle to save intermediate runs, proposed by @cluhmann

  2. Using the mcbackend and a Clickhouse database, proposed by @michaelosthege

Thanks to both of you for the feedback!
