Pymc5 out of sample

In the end it just comes down to preference I suppose. You can still iterate over the columns of a MutableData object using range. I’d split df into features and controls before modeling, make each its own data container then do the looping as follows (warning: untested code)

df_features  = df[feature_vars]
n_obs, n_features = df_features.shape
coords = {'features':feature_vars, 
          'all_vars':feature_vars}

if control_vars:
    df_controls  = df[control_vars]
    _, n_controls = df_controls.shape
    coords.update({'controls':control_vars,
                   'all_vars':feature_vars + control_vars})


with pm.Model(coords=coords) as basic_model:
    X = pm.MutableData('feature_data', df_features.values)
    y = pm.MutableData('targets', df.target.values)

    n_obs = X.shape[0]
    
    betas = pm.HalfNormal('beta', sigma = 2, dims=['features'])
    decays = pm.Beta('decay', alpha=3, beta=3, dims=['features'])

    contributions = pt.zeros((n_obs, n_features + n_controls))
    
    for i in range(n_features):
       x = geometric_adstock_tt(X[:, i], decays[i]))*beta[i]
       contributions = pt.set_subtensor(contributions[:, i], x)
   
    if n_controls > 0:
        Z = pm.MutableData('control_data', df_controls.values)
        control_betas = pm.Normal('control_beta', sigma = 2, dims=['controls'])
        pt.set_subtensor(contributions[:, n_features:],  Z @ control_betas)
    
    pm.Deterministic("contributions", contributions)
     
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat=  pm.Normal("y_hat", mu=contributions.sum(axis=-1), sigma=sigma, observed=y, shape=X.shape[0])
     idata = pm.sample(idata_kwargs={'dims':{'contributions':[None, 'all_vars']}})

It’s basically the same as yours but I use the data containers. I also wanted to use coords to tag the different dimensions instead of f-strings – this is helpful when plotting with arviz, because it makes it easy to ask for a subset of coordinates. One wrinkle here was that I don’t know a way to only label a subset of the dimensions – for contributions, we want to label only the 2nd dimension (since the 1st dimension is a batch dim). As a result, I had to include the idata_kwargs argument in pm.sample to get the right coords on contributions.

Again, this is all just a matter of preference. Another way to do it would be to just include pm.MutableData in your loops, and use an f-string to name each mutable data object. The downside of this approach is that you will need to include each variable individually when you call pm.set_data, though admittedly you could write a function to ease that pain point.

I also didn’t look very carefully at the geometric_adstock_tt function, but I’m sure you could vectorize it with a bit of work. That would be my personal approach, but yet again, I emphasize that these are mostly style choices.

Also, have you checked out pymc-marketing? It’s a new project that is focusing on mixed-media models. It might be useful in your case.

1 Like