Shape mismatch error for out-of-sample inference

Hello, I know there are a few questions on this topic, and I believe I have tried all the suggested solutions, but none work for me.

I’m running a simple Bayesian linear model and want to perform inference on a 2D data frame that has a different number of rows from my training data. This doesn’t work: if the two have the same shape it’s all fine, but as soon as I change the number of rows it fails with a shape mismatch.

As far as I can tell I have tried everything: coords, mutable coords, and adding the observed values as a Data argument and resetting them during inference (even though I don’t know these values, of course!), but nothing works. Any help or pointers would be massively appreciated. I feel this shouldn’t be so difficult?!

Minimal working example:

import numpy as np
import pandas as pd
import pymc as pm

covariates = pd.DataFrame(np.random.randint(1,5,(38, 9)), columns=[f'feature_{i}' for i in range(9)])

coords = {'obs': covariates.index, 'features':covariates.columns}
coords_mutable = {'obs': np.arange(len(covariates))}

with pm.Model(coords=coords, coords_mutable=coords_mutable) as base_model:

    covars_data = pm.Data("covars_data", covariates, dims=['obs', 'features'])

    mu = pm.Normal("mu", 0, sigma=1)
    sigma = pm.HalfCauchy("sigma", beta=10)
    n = covariates.shape[0]
    n_cov = covariates.shape[1]
    
    intercept = pm.Normal("intercept", mu=mu, sigma=sigma, shape=n)

    beta = pm.Normal("beta", mu=0, sigma=1, shape=n_cov, dims="features")
    mean_val = intercept + covars_data @ beta

    data = pd.DataFrame(np.random.randint(1,5,(38, 5)), columns=['a','b','c','d','e'])

    endpt1_data = pm.Data(
        "endpt1_data", data.a
    )
    se1_data = pm.Data("se1_data", data.b)

    sigma_data = pm.Data("sigma_data", data.c, dims='obs')
    obs2_data = pm.Data('obs2_data', data.e, dims='obs')

    # Meta-analytic likelihood of endpoint 2
    est2_dist = pm.Normal(
        "endpt2",
        mu=mean_val,
        sigma=sigma_data,
        observed=obs2_data,
        dims='obs',
    )

    idata = pm.sample(100)


### prediction
covariates = pd.DataFrame(np.random.randint(1,5,(1, 9)), columns=[f'feature_{i}' for i in range(9)])

with base_model:
            
    pm.set_data({"covars_data": covariates}, coords={'obs': np.arange(len(covariates))})
    # pm.set_data({"obs2_data": [0]}, coords={'obs': np.arange(len(covariates))}) # this seems nonsense?? 
    # pm.set_data({"sigma_data": [1]}, coords={'obs': np.arange(len(covariates))}) 

    ppc = pm.sample_posterior_predictive(idata)

If I just delete est2_dist from the model training this runs, and even if I keep it and just delete observed= it still runs … so that seems to be the issue!? But then how do I solve this?

And error is:
ValueError: Shape mismatch: A.shape[0] != y.shape[0]
Apply node that caused the error: CGemv{no_inplace}(intercept, 1.0, covars_data, beta, 1.0)
Toposort index: 0
Inputs types: [TensorType(float64, shape=(38,)), TensorType(float64, shape=()), TensorType(float64, shape=(None, None)), TensorType(float64, shape=(9,)), TensorType(float64, shape=())]
Inputs shapes: [(38,), (), (1, 9), (9,), ()]
Inputs strides: [(8,), (), (72, 8), (8,), ()]
Inputs values: ['not shown', array(1.), 'not shown', 'not shown', array(1.)]

Thank you!

Are you sure your intercept should have shape=n?

wow … that may have fixed it - are you able to explain why? Agreed I only need a single intercept, but just for my pymc understanding, why would this have broken anything?

The reason it broke is that you defined it with a static shape based on covariates (shape=n, which was fixed at 38 when the model was built) instead of a dynamic one based on covars_data, so PyMC wouldn’t resize it after you updated the MutableData.

But it wouldn’t make sense for an RV’s size to change between sampling and posterior predictive anyway, as it would be unclear how to reuse the posterior inferences. (PyMC would end up sampling the posterior predictive of that RV as well, which in your case would be a prior draw from the distribution.)
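The shape arithmetic behind this can be sketched in plain NumPy. This is only an illustration of the broadcasting problem, not PyMC itself: the names `intercept_per_row`, `intercept_scalar`, `X_train` and `X_new` are made up for the example, and five new rows are used instead of the single row in the post, because NumPy would silently broadcast a length-1 result where the compiled PyMC graph raises an error.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_new, n_features = 38, 5, 9
beta = rng.normal(size=n_features)

# Static shape: one intercept per training row, fixed at length 38
# (the analogue of shape=n in the model above).
intercept_per_row = rng.normal(size=n_train)

# Scalar intercept: does not depend on the number of rows at all.
intercept_scalar = rng.normal()

X_train = rng.integers(1, 5, size=(n_train, n_features))
X_new = rng.integers(1, 5, size=(n_new, n_features))

# With training-sized data the per-row intercept lines up fine.
mean_train = intercept_per_row + X_train @ beta  # shape (38,)

# With a different number of rows, only the scalar intercept still works.
mean_new = intercept_scalar + X_new @ beta  # shape (5,)

try:
    intercept_per_row + X_new @ beta  # (38,) + (5,) cannot be broadcast
    static_shape_failed = False
except ValueError:
    static_shape_failed = True

print(mean_train.shape, mean_new.shape, static_shape_failed)
```

In the model itself, the equivalent change would be to drop shape=n from the intercept so that it is a single scalar parameter, which is what resolved the error in this thread.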

Sorry, I have one further question. Weirdly, this now only works for me if I set obs2_data and sigma_data to some dummy values … I read that in some other post here, but that cannot be correct, right?! Because how would it then differentiate between these and actual variables I have observed and need to update (such as covars_data here)? Appreciate the help!