Hello,
A similar question was asked here, but it remained unanswered; it is also relevant to Metropolis sampling.
I am sampling a custom likelihood function that takes about 20 seconds per evaluation, so a full inference takes days (there is no way to make the likelihood faster, since I run a physics-based forward solver that is mesh dependent). I would like to save intermediate results, say every 300 iterations or so. Right now I am using Sequential Monte Carlo, but I was wondering whether there is a general way to do this for all MCMC samplers.
I would imagine it looking like this, based on the tagged example above:
runs = 10  # how many runs to do, saving each intermediate run
for i1 in range(runs):
    with pm.Model() as model:
        X = {initialize variables}
        log_likelihood = {call custom likelihood}
        if i1 == 0:
            # first iteration
            trace = pm.sample_smc(draws=300, ...)  # add arguments here to initialize variables
        else:
            trace = pm.sample_smc(draws=300, ...)  # give the previous trace as a starting point
    # save the iteration
    with open('iteration' + str(i1) + '.pkl', mode='wb') as file:
        cloudpickle.dump(trace, file)
Is there a way to generalize this for all MCMC samplers? Thank you for the feedback!
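Stripped of PyMC specifics, this is just a checkpointing loop: run a chunk, pickle the result, feed it back in as the next starting point. Here is a minimal, self-contained sketch of that pattern, where `run_chunk` is a stand-in placeholder for whatever sampler call you use (the file naming is arbitrary):

```python
import pickle

def run_chunk(start, draws):
    """Placeholder for a sampler call such as pm.sample_smc().

    Here it just accumulates a counter so the example is runnable.
    """
    return {"last_value": start.get("last_value", 0) + draws}

runs = 3            # number of chunks
draws_per_chunk = 300
state = {}          # empty start for the first chunk
for i1 in range(runs):
    state = run_chunk(state, draws_per_chunk)
    # Save after every chunk so a crash only loses the current chunk.
    with open(f"iteration{i1}.pkl", mode="wb") as file:
        pickle.dump(state, file)

# Restart: load the latest checkpoint and continue from it.
with open(f"iteration{runs - 1}.pkl", mode="rb") as file:
    state = pickle.load(file)
print(state["last_value"])  # → 900
```

Whether this is a *correct* continuation of the chain is a separate question (see the note about sampler and RNG state further down the thread); the sketch only shows the save/restore mechanics.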
I would check out the documentation for iter_sample()
and see if that gets you what you need. I'm not sure how it interacts with SMC, but I think it's the recommended way of doing things "during" sampling. Someone else can chime in if there are other approaches.
Thanks, that's a great suggestion. I think I got it to work; I'm just not sure whether I am properly supplying the starting point for the iter_sample() method. The documentation on this is sparse:
start dict
Starting point in parameter space (or partial point).
I have the following code, which actually saves every 10th iteration. It's not the actual likelihood I want to use, just a self-sufficient example taken from tutorials and previous examples.
import aesara.tensor as at
import numpy as np
import pymc as pm
import cloudpickle
from pymc import iter_sample
n = 4
mu1 = np.ones(n) * (1.0 / 2)
mu2 = -mu1
stdev = 0.1
sigma = np.power(stdev, 2) * np.eye(n)
isigma = np.linalg.inv(sigma)
dsigma = np.linalg.det(sigma)
w1 = 0.1 # one mode with 0.1 of the mass
w2 = 1 - w1 # the other mode with 0.9 of the mass
def two_gaussians(x):
    log_like1 = (
        -0.5 * n * at.log(2 * np.pi)
        - 0.5 * at.log(dsigma)
        - 0.5 * (x - mu1).T.dot(isigma).dot(x - mu1)
    )
    log_like2 = (
        -0.5 * n * at.log(2 * np.pi)
        - 0.5 * at.log(dsigma)
        - 0.5 * (x - mu2).T.dot(isigma).dot(x - mu2)
    )
    return pm.math.logsumexp([at.log(w1) + log_like1, at.log(w2) + log_like2])

with pm.Model() as model:
    X = pm.Uniform(
        "X",
        shape=n,
        lower=-2.0 * np.ones_like(mu1),
        upper=2.0 * np.ones_like(mu1),
        initval=-1.0 * np.ones_like(mu1),
    )
    llk = pm.Potential("llk", two_gaussians(X))
    # create the step method inside the model context
    step = pm.Metropolis()

iter_counter = 0
for trace in iter_sample(draws=100,
                         start=model.initial_point(),
                         step=step,
                         model=model):
    iter_counter += 1
    if iter_counter % 10 == 0:
        print('Saving trace for iteration ' + str(iter_counter))
        with open('iteration' + str(iter_counter) + '.pkl', mode='wb') as file:
            cloudpickle.dump(trace, file)
Check out mcbackend, it was designed to solve exactly this problem.
The README explains how you can use McBackend to stream MCMC draws to a ClickHouse database and pull them in real time.
Note that the idea of restarting/continuing an MCMC run sounds much, much easier than it is. To do it properly one needs to capture not only the internal state of the samplers, but also all random number generators…
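To illustrate the point about random number generators: a minimal sketch (using NumPy's `Generator` API, independent of PyMC) of capturing and restoring a generator's full internal state, so that a resumed run continues the exact same random stream instead of a fresh one:

```python
import numpy as np

# Start a run and draw some numbers.
rng = np.random.default_rng(1234)
first_half = rng.standard_normal(5)

# Checkpoint: capture the generator's full internal state.
saved_state = rng.bit_generator.state

# ... imagine the process is killed and restarted here ...

# Resume: build a fresh generator and restore the saved state.
rng_resumed = np.random.default_rng()
rng_resumed.bit_generator.state = saved_state

# The resumed generator continues the exact same stream.
second_half_resumed = rng_resumed.standard_normal(5)
second_half_direct = rng.standard_normal(5)
assert np.allclose(second_half_resumed, second_half_direct)
```

A checkpoint that pickles only the trace, but not the sampler and RNG state, will not reproduce the chain you would have gotten from an uninterrupted run.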
Just a comment here: saving with cloudpickle works, but saving with arviz doesn't. If I replace the cloudpickle save:
with open('iteration' + str(iter_counter) + '.pkl', mode='wb') as file:
cloudpickle.dump(trace, file)
with
arviz.InferenceData.to_netcdf(trace, filename='trace_' + str(iter_counter) + '.netcdf')
Then I get the following error:
AttributeError: 'MultiTrace' object has no attribute '_groups_all'
Thanks, that's great. I'll try to implement this. Do you know if this approach increases the computation time, since it saves the results on the go?
In mcbackend PR #20 I added speed tests.
Here are the results for the ClickHouseBackend with a server running locally in Docker:
| title | bytes_per_draw | append_speed | description |
|---|---|---|---|
| big_variables | 44400 | 10.9 MiB/s (257.3 draws/s) | One chain with 3 variables of shapes (100,), (1000,) and (100, 100). |
| many_draws | 56 | 1.1 MiB/s (21288.2 draws/s) | One chain of (), (3,) and (5,2) float32 variables. |
| many_variables | 3600 | 3.8 MiB/s (1116.0 draws/s) | One chain with 300 variables of shapes (), (3,) and (5,2). |
I tried mcbackend and got it to work, but I still have some questions. I adapted the instructions from the git project and used this inference example to create a test case. I use the ClickHouse backend and would like to access the inference results while my model is running (a proper inference takes several days, almost a week, for me).
Is this only possible with a ClickHouse backend, or can I also do this with a NumPy backend? The problem is that I am working on a server without sudo access, so I cannot start a local ClickHouse server (the command in the git project requires sudo).
Thanks for the help!
Here's the example that works:
import clickhouse_driver
import mcbackend
import pymc as pm
import arviz as az
import numpy as np

# Initialize random number generator
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variables
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0] * X1 + beta[1] * X2 + np.random.randn(size) * sigma

# 1. Create any kind of backend
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)

with pm.Model():
    # Priors for unknown model parameters
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Expected value of outcome
    mu = alpha + beta[0] * X1 + beta[1] * X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

    # 4. Wrap the PyMC adapter around an `mcbackend.Backend`
    # This generates and prints a short `trace.run_id` by which
    # this MCMC run is identified in the (database) backend.
    trace = mcbackend.pymc.TraceBackend(backend)
    pm.sample(trace=trace)
@shakasaki you'll need the ClickHouse backend, or you could implement another kind of backend that saves to a file periodically. The NumPy backend is only for in-memory situations.
If you can't run a ClickHouse database on the server where the MCMC is running, you could still launch it on another machine, as long as your MCMC run can reach the port.
In any case I would recommend running ClickHouse in a Docker container and taking the usual steps for running a database in Docker: a persistent volume so you don't lose the data, password authentication…
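For reference, a sketch of that Docker setup (the image name and port 9000 are ClickHouse defaults; the volume path and password are placeholders you would choose yourself):

```shell
# Run ClickHouse in Docker with a persistent volume and a password.
# /path/to/clickhouse-data and "choose-a-password" are placeholders.
docker run -d \
  --name clickhouse \
  -p 9000:9000 \
  -v /path/to/clickhouse-data:/var/lib/clickhouse \
  -e CLICKHOUSE_PASSWORD="choose-a-password" \
  clickhouse/clickhouse-server
```

Port 9000 is the native TCP protocol that `clickhouse_driver.Client` connects to; the volume mount keeps the stored draws if the container is removed.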
Great, thank you for clarifying. I think the two solutions suggested here are both relevant. Since I can't mark two posts as solutions, I summarize them here so that readers can follow both:

- Using iter_sample() and cloudpickle to save intermediate runs, proposed by @cluhmann
- Using mcbackend with a ClickHouse database, proposed by @michaelosthege

Thanks to both for the feedback!