Using mcbackend to store samples

mamacneil · December 23, 2023, 1:54pm

Hi folks -

I’m running a big model that runs out of memory and crashes mid-sampling (killed 9 errors), so I am trying to move to mcbackend (thanks @michaelosthege for it). However, while getting things working with a minimal model on my machine, the to_inferencedata function comes up empty, even though get_run has values. In short, I could use help getting an example going so I can try it on my big model. My example (basic analytics model from @fonnesbeck):

import arviz as az
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt
import clickhouse_driver
import mcbackend

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

RANDOM_SEED = 42

print(f"Running on PyMC v{pm.__version__}")

baseball_data = pd.read_csv('https://raw.githubusercontent.com/fonnesbeck/hierarchical_models_sports_analytics/main/data/stats_by_player_team.csv')

baseball_data['label'] = baseball_data.batter_name + ' (' + baseball_data.name_abbrev + ') ' + baseball_data.season.astype(str)
baseball_data = baseball_data.rename(columns={'name_abbrev': 'team'})
fitting_subset = baseball_data[baseball_data.season<2023].dropna()

pa, hr = fitting_subset[['pa', 'hr']].astype(int).values.T
coords = {'batter':fitting_subset.label.values}

with pm.Model(coords=coords) as uninformative_prior_model:
    
    p = pm.Uniform('p', 0, 1, dims='batter')
    
    y = pm.Binomial('y', n=pa, p=p, observed=hr, dims='batter')

# Create clickhouse backend
ch_client = clickhouse_driver.Client("localhost")
ch_backend = mcbackend.ClickHouseBackend(ch_client)

with uninformative_prior_model:
    pm.sample(draws=100, tune=100, cores=4, chains=4, random_seed=RANDOM_SEED, trace=ch_backend)

This all goes fine in the sense that I can retrieve the run (note that there is no trace from which to access the rid, as in @michaelosthege’s code example at GitHub - pymc-devs/mcbackend: A backend for storing MCMC draws, so I have used get_runs):

# Fetch the most recent run from the database 
model_run = ch_backend.get_run(ch_backend.get_runs().index[-1])

ch_trace = model_run.to_inferencedata()

and with model_run.get_chains()[0].get_draws('p') I get:

array([[0.31278036, 0.52744033, 0.27108869, ..., 0.48431582, 0.45812994,
        0.29023263],
       [0.31278036, 0.52744033, 0.27108869, ..., 0.48431582, 0.45812994,
        0.29023263],
       [0.08720657, 0.31698537, 0.12071515, ..., 0.14424979, 0.34796451,
        0.13165968],
       ...,
       [0.0802733 , 0.03178383, 0.06493201, ..., 0.02980206, 0.03055315,
        0.01064518],
       [0.07095099, 0.06097259, 0.05025255, ..., 0.03906811, 0.02481613,
        0.01069069],
       [0.08256711, 0.06653404, 0.08217616, ..., 0.01457395, 0.0243532 ,
        0.02938725]])

while with ch_trace.posterior.p I get:

<xarray.DataArray 'p' (chain: 4, draw: 0, batter: 1494)>
array([], shape=(4, 0, 1494), dtype=float64)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64
  * batter   (batter) <U33 'Pujols, Albert (STL) 2022' ... ...
Indexes: (3)
Attributes: (0)

which has zero draws and an empty xarray. Any thoughts on what’s going on here? Thanks much,
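
Since get_draws does return values for each chain, one possible stopgap while to_inferencedata comes back empty is to stack the per-chain arrays by hand. The following is only a minimal sketch under assumptions: stack_chain_draws is a hypothetical helper (not part of mcbackend), FakeChain and FakeRun are stand-ins so the snippet runs without a ClickHouse server, and the only mcbackend calls assumed are get_chains() and get_draws(), as used above. The stacked array may still contain tuning draws, which would need to be discarded separately.

```python
import numpy as np


def stack_chain_draws(run, var_name):
    """Hypothetical helper: stack per-chain draws into (chain, draw, *var_shape)."""
    # get_chains() and get_draws() are the same calls used earlier in this post.
    per_chain = [np.asarray(chain.get_draws(var_name)) for chain in run.get_chains()]
    # Chains may differ in length if sampling was interrupted, so trim to the shortest.
    n_draws = min(len(draws) for draws in per_chain)
    return np.stack([draws[:n_draws] for draws in per_chain])


# Hypothetical stand-ins that mimic only the two methods used above.
class FakeChain:
    def __init__(self, draws):
        self._draws = draws

    def get_draws(self, var_name):
        return self._draws[var_name]


class FakeRun:
    def __init__(self, chains):
        self._chains = chains

    def get_chains(self):
        return self._chains


fake_run = FakeRun([
    FakeChain({"p": np.zeros((5, 3))}),
    FakeChain({"p": np.ones((4, 3))}),
])
stacked = stack_chain_draws(fake_run, "p")
print(stacked.shape)  # (2, 4, 3)
```

With a real run, the call would presumably be stack_chain_draws(model_run, 'p'), giving one array per variable that can be inspected directly with NumPy.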

AlexAndorra · January 16, 2024, 9:32pm

Mmmh, that’s weird indeed
Maybe @michaelosthege will have some ideas, as I seem to remember he worked on this. Otherwise, maybe @OriolAbril ?

michaelosthege · January 17, 2024, 11:35pm

Thanks for pinging me!
This was indeed a bug, caused by a refactor (by me) of how the "tune" stat is emitted from PyMC runs.

I fixed it and made a patch release: Release v0.5.2 · pymc-devs/mcbackend · GitHub

mamacneil · January 19, 2024, 6:09pm

This works on a small example (thanks again). However, on the full large example it brings up another memory-related issue, so I’ll set up a new question for that.

Baba_Yara_Fahiz · July 22, 2024, 5:47pm

I ran into the same problem today.
It seems the sample function still runs the convergence test even when you specify that it should not.
