Using mcbackend to store samples

Hi folks -

I’m running a big model that runs out of memory and crashes mid-sampling ("killed 9" errors), so I am trying to switch to mcbackend (thanks @michaelosthege for it). However, while getting things working with a minimal model on my machine, the to_inferencedata function comes up empty, even though get_run returns values. In short, I could use help getting an example going so I can try it on my big model. My example (basic analytics model from @fonnesbeck):

import arviz as az
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt
import clickhouse_driver
import mcbackend

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

RANDOM_SEED = 42

print(f"Running on PyMC v{pm.__version__}")

baseball_data = pd.read_csv('https://raw.githubusercontent.com/fonnesbeck/hierarchical_models_sports_analytics/main/data/stats_by_player_team.csv')

baseball_data['label'] = baseball_data.batter_name + ' (' + baseball_data.name_abbrev + ') ' + baseball_data.season.astype(str)
baseball_data = baseball_data.rename(columns={'name_abbrev': 'team'})
fitting_subset = baseball_data[baseball_data.season<2023].dropna()

pa, hr = fitting_subset[['pa', 'hr']].astype(int).values.T
coords = {'batter':fitting_subset.label.values}

with pm.Model(coords=coords) as uninformative_prior_model:
    
    p = pm.Uniform('p', 0, 1, dims='batter')
    
    y = pm.Binomial('y', n=pa, p=p, observed=hr, dims='batter')

# Create clickhouse backend
ch_client = clickhouse_driver.Client("localhost")
ch_backend = mcbackend.ClickHouseBackend(ch_client)

with uninformative_prior_model:
    pm.sample(draws=100, tune=100, cores=4, chains=4, random_seed=RANDOM_SEED, trace=ch_backend)

This all runs fine, in the sense that I can retrieve the run (note that I have no trace object from which to read the rid, as @michaelosthege’s code example in the pymc-devs/mcbackend GitHub repository does, so I have used get_runs instead):

# Fetch the most recent run from the database 
model_run = ch_backend.get_run(ch_backend.get_runs().index[-1])

ch_trace = model_run.to_inferencedata()

and with model_run.get_chains()[0].get_draws('p') I get:

array([[0.31278036, 0.52744033, 0.27108869, ..., 0.48431582, 0.45812994,
        0.29023263],
       [0.31278036, 0.52744033, 0.27108869, ..., 0.48431582, 0.45812994,
        0.29023263],
       [0.08720657, 0.31698537, 0.12071515, ..., 0.14424979, 0.34796451,
        0.13165968],
       ...,
       [0.0802733 , 0.03178383, 0.06493201, ..., 0.02980206, 0.03055315,
        0.01064518],
       [0.07095099, 0.06097259, 0.05025255, ..., 0.03906811, 0.02481613,
        0.01069069],
       [0.08256711, 0.06653404, 0.08217616, ..., 0.01457395, 0.0243532 ,
        0.02938725]])

while with ch_trace.posterior.p I get:

<xarray.DataArray 'p' (chain: 4, draw: 0, batter: 1494)>
array([], shape=(4, 0, 1494), dtype=float64)
Coordinates:
    chain    (chain) int64 0 1 2 3
    draw     (draw) int64
    batter   (batter) <U33 'Pujols, Albert (STL) 2022' ... ...
Indexes: (3)
Attributes: (0)

which has zero draws and an empty array. Any thoughts on what’s going on here? Thanks much,
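P.S. For anyone wanting to reproduce the comparison, here is a rough sketch of the check I am doing by hand. The two helper names (draws_per_chain, posterior_draws) are made up for illustration; they only rely on get_chains, get_draws and the posterior group of the InferenceData used above. My assumption is that the backend stores every iteration, tuning included, so the two numbers need not match exactly, but the posterior count should not be zero.

```python
def draws_per_chain(run, var_name):
    """Number of stored iterations per chain, read directly from the backend."""
    return [len(chain.get_draws(var_name)) for chain in run.get_chains()]


def posterior_draws(idata, var_name):
    """Length of the 'draw' dimension after conversion to InferenceData."""
    return idata.posterior[var_name].sizes["draw"]


# Usage with the objects from the example above (needs a running ClickHouse server):
# print(draws_per_chain(model_run, "p"))
# print(posterior_draws(ch_trace, "p"))
```

In my case the first call reports non-empty chains while the second reports 0, which is the mismatch described above.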


Mmmh, that’s weird indeed :thinking:
Maybe @michaelosthege will have some ideas, as I seem to remember he worked on this. Otherwise, maybe @OriolAbril ?

Thanks for pinging me!
This was indeed a bug, caused by a refactor (by me) of how the "tune" stat is emitted from PyMC runs.

I fixed it and made a patch release: Release v0.5.2 · pymc-devs/mcbackend · GitHub
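As a rough illustration of how a misreported "tune" stat can lead to zero posterior draws, here is a toy sketch. It is not mcbackend's actual code: the function name split_by_tune is made up, and it simply assumes that the conversion separates warmup from posterior iterations with a per-iteration boolean flag.

```python
def split_by_tune(draws, tune):
    """Split per-iteration draws into (warmup, posterior) using a boolean tune flag."""
    warmup = [d for d, is_tune in zip(draws, tune) if is_tune]
    posterior = [d for d, is_tune in zip(draws, tune) if not is_tune]
    return warmup, posterior


draws = list(range(200))  # stand-in for 100 tuning + 100 posterior iterations

# Correct flags: the first 100 iterations are tuning.
good_tune = [True] * 100 + [False] * 100
warmup, posterior = split_by_tune(draws, good_tune)
print(len(warmup), len(posterior))  # 100 100

# Faulty flags: every iteration is marked as tuning, so no posterior draws remain.
bad_tune = [True] * 200
warmup, posterior = split_by_tune(draws, bad_tune)
print(len(warmup), len(posterior))  # 200 0
```

With flags like the second case, the raw chains still hold all iterations while the posterior group ends up with a draw dimension of length 0, which matches the symptom reported above.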


This works on a small example (thanks again). However, on the full large example it brings up another memory-related issue; I’ll set up a new question for that.