How to find a shape mismatch on out-of-sample (OOS) data?

Hello,

When I run OOS data through my model, I get a dimension mismatch error.

Here is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/aesara/compile/function/types.py in __call__(self, *args, **kwargs)
    975                 self.vm()
--> 976                 if output_subset is None
    977                 else self.vm(output_subset=output_subset)

ValueError: Input dimension mismatch. One other input has shape[0] = 7690, but input[2].shape[0] = 27085.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_77223/2168918377.py in <module>
     60                     'month': test_month_idx})
     61         print("sampling test ppc...")
---> 62         test_ppc = pm.sample_posterior_predictive(idata)
     63 
     64         #Adding columns to the test dataset dataframe

/opt/conda/lib/python3.7/site-packages/pymc/sampling.py in sample_posterior_predictive(trace, samples, model, var_names, keep_size, random_seed, progressbar, return_inferencedata, extend_inferencedata, predictions, idata_kwargs, compile_kwargs)
   1955                 param = _trace[idx % len_trace]
   1956 
-> 1957             values = sampler_fn(**param)
   1958 
   1959             for k, v in zip(vars_, values):

/opt/conda/lib/python3.7/site-packages/pymc/util.py in wrapped(**kwargs)
    364     def wrapped(**kwargs):
    365         input_point = {k: v for k, v in kwargs.items() if k in ins}
--> 366         return core_function(**input_point)
    367 
    368     return wrapped

/opt/conda/lib/python3.7/site-packages/aesara/compile/function/types.py in __call__(self, *args, **kwargs)
    990                     node=self.vm.nodes[self.vm.position_of_error],
    991                     thunk=thunk,
--> 992                     storage_map=getattr(self.vm, "storage_map", None),
    993                 )
    994             else:

/opt/conda/lib/python3.7/site-packages/aesara/link/utils.py in raise_with_op(fgraph, node, thunk, exc_info, storage_map)
    532         # Some exception need extra parameter in inputs. So forget the
    533         # extra long error message in that case.
--> 534     raise exc_value.with_traceback(exc_trace)
    535 
    536 

/opt/conda/lib/python3.7/site-packages/aesara/compile/function/types.py in __call__(self, *args, **kwargs)
    974             outputs = (
    975                 self.vm()
--> 976                 if output_subset is None
    977                 else self.vm(output_subset=output_subset)
    978             )

ValueError: Input dimension mismatch. One other input has shape[0] = 7690, but input[2].shape[0] = 27085.
Apply node that caused the error: Elemwise{Composite{(i0 + i1 + (i2 * i3) + (i4 * i5) + (i6 * i7) + (i8 * i9) + (i10 * i11) + (i12 * i13) + (i14 * i15))}}[(0, 0)](AdvancedSubtensor.0, AdvancedSubtensor.0, AdvancedSubtensor1.0, promotion, AdvancedSubtensor1.0, cannibalization, AdvancedSubtensor1.0, dc_discount, AdvancedSubtensor1.0, free_fin, AdvancedSubtensor1.0, pvbv, AdvancedSubtensor1.0, giftset, InplaceDimShuffle{x}.0, month)
Toposort index: 9
Inputs types: [TensorType(float64, (None,)), TensorType(float64, (None,)), TensorType(float64, (None,)), TensorType(int32, (None,)), TensorType(float64, (None,)), TensorType(int32, (None,)), TensorType(float64, (None,)), TensorType(int32, (None,)), TensorType(float64, (None,)), TensorType(int32, (None,)), TensorType(float64, (None,)), TensorType(int32, (None,)), TensorType(float64, (None,)), TensorType(int32, (None,)), TensorType(float64, (1,)), TensorType(int32, (None,))]
Inputs shapes: [(7690,), (7690,), (27085,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (1,), (7690,)]
Inputs strides: [(8,), (8,), (8,), (4,), (8,), (4,), (8,), (4,), (8,), (4,), (8,), (4,), (8,), (4,), (8,), (4,)]
Inputs values: ['not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', array([-0.55082063]), 'not shown']
Outputs clients: [[normal_rv{0, (0, 0), floatX, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F63207A97D0>), TensorConstant{[]}, TensorConstant{11}, Elemwise{Composite{(i0 + i1 + (i2 * i3) + (i4 * i5) + (i6 * i7) + (i8 * i9) + (i10 * i11) + (i12 * i13) + (i14 * i15))}}[(0, 0)].0, sigma)]]

HINT: Re-running with most Aesara optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the Aesara flag 'optimizer=fast_compile'. If that does not work, Aesara optimizations can be disabled with 'optimizer=None'.
HINT: Use the Aesara flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

However, when I print the shapes of all the new OOS data, I get the following:

print(test_promo_pvbv_idx.shape,
      test_giftset_idx.shape,
      test_free_fin_idx.shape,
      test_dc_idx.shape,
      test_cann_idx.shape,
      test_promo_idx.shape,
      test_location_idx.shape,
      test_item_idx.shape,
      test_month_idx.shape,
      test_time_idx.shape,
      df_test['residual'].shape)
Output:
(7690,) (7690,) (7690,) (7690,) (7690,) (7690,) (7690,) (7690,) (7690,) (7690,) (7690,)

So I’m not sure where the error is getting the shape 27085.

Code used:

    test_time_idx, test_times = pd.factorize(df_test.index.get_level_values(0))
    test_month_idx, test_month = pd.factorize(df_test['month'])
    test_item_idx = np.array(list(map(item_to_idx_dict.get, df_test.index.get_level_values(1))))
    test_location_idx, test_locations = pd.factorize(df_test.index.get_level_values(2))
    test_promo_idx, test_promo = pd.factorize(df_test['promo_status_metric_measure'])
    test_cann_idx, test_cannibalization = pd.factorize(df_test['cannibalized'])
    test_dc_idx, test_dc_discount = pd.factorize(df_test['promo_desc_dcdiscount'])
    test_free_fin_idx, test_free_fin = pd.factorize(df_test['promo_desc_freefinancing'])
    test_giftset_idx, test_giftset = pd.factorize(df_test['promo_desc_giftset'])
    test_promo_pvbv_idx, test_promo_pvbv = pd.factorize(df_test['promo_desc_pvbv'])

    # bring in new data
    with constant_model:
        pm.set_data({'loc_idx': test_location_idx,
                     'item_idx': test_item_idx,
                     'time_idx': test_time_idx,
                     'observed_eaches': df_test['residual'],
                     't': t_test,
                     'promotion': test_promo_idx,
                     'cannibalization': test_cann_idx,
                     'dc_discount': test_dc_idx,
                     'free_fin': test_free_fin_idx,
                     'pvbv': test_promo_pvbv_idx,
                     'giftset': test_giftset_idx,
                     'month': test_month_idx})
        print("sampling test ppc...")
        test_ppc = pm.sample_posterior_predictive(idata)

Does anyone have an idea of how I can hunt this error down?


What is the shape of your “in sample” data? Is it 27085 by any chance? If so, it might suggest that you are mixing some “in sample” elements with some “out of sample” elements.
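One quick way to check is to print the length of every array you are about to pass to pm.set_data and flag anything that still matches the in-sample length. A rough sketch, reusing the names from your post (27085 is just the suspected training length taken from the error message):

train_len = 27085              # suspected in-sample length, taken from the error message
expected_len = len(df_test)    # should be 7690

new_data = {'loc_idx': test_location_idx,
            'item_idx': test_item_idx,
            'time_idx': test_time_idx,
            'observed_eaches': df_test['residual'],
            't': t_test,
            'promotion': test_promo_idx,
            'cannibalization': test_cann_idx,
            'dc_discount': test_dc_idx,
            'free_fin': test_free_fin_idx,
            'pvbv': test_promo_pvbv_idx,
            'giftset': test_giftset_idx,
            'month': test_month_idx}

for name, value in new_data.items():
    n = len(value)
    status = "<-- still training-sized!" if n == train_len else ("ok" if n == expected_len else "<-- unexpected length")
    print(name, n, status)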


You are on it today! Thank you. I was mixing in-sample and out-of-sample data on one line of code.


If I seem wise, it’s only because I recognize the many, many mistakes I have made before. :grimacing:

I thought that was it, but I still don't see where the training data is getting through. Question:

In this part of the error trace:
Inputs shapes: [(7690,), (7690,), (27085,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (7690,), (1,), (7690,)],
it shows that the third shape is what is throwing off the OOS sampling. Is there a way to figure out what object is driving that shape?

There are 16 objects there, but I only have 10 objects that are set to mutable=True. How can I figure out which one is driving that shape?

Making some data constant isn't going to help with the shape problems. For example, if I do this, it will throw the same kind of error:

import numpy as np
import pymc as pm

rng = np.random.default_rng()

d1 = rng.random(size=100)
d2 = rng.random(size=100)

with pm.Model() as model:
    x = pm.MutableData("x", value=d1)
    y = pm.ConstantData("obs", value=d2)
    a = pm.Normal("a")
    b = pm.Normal("b", mu=a * x, sigma=1, observed=y)
    idata = pm.sample()

with model:
    pm.set_data({"x": rng.random(size=10)})
    # 10 x values, but still 100 y values
    test_ppc = pm.sample_posterior_predictive(idata)
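As an aside, and just a minimal sketch rather than part of the point above: I believe the mismatch in this toy case goes away if the observed data also lives in a mutable container and gets resized along with x, since posterior predictive sampling only needs the shape of the observations, not their values (the new names are only there to avoid clobbering the model above):

with pm.Model() as model_mutable_obs:
    x = pm.MutableData("x", value=d1)
    y = pm.MutableData("obs", value=d2)   # observed data in a mutable container too
    a = pm.Normal("a")
    b = pm.Normal("b", mu=a * x, sigma=1, observed=y)
    idata_mut = pm.sample()

with model_mutable_obs:
    # resize both containers; the new "obs" values are placeholders, only their shape matters here
    pm.set_data({"x": rng.random(size=10), "obs": rng.random(size=10)})
    test_ppc = pm.sample_posterior_predictive(idata_mut)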

There may be a way to inspect the shapes of things more directly (someone else would have to chime in), but not necessarily. In the example above, the shapes of x and y are both "known" to the model (because we wrapped them in pm.Data objects). But we didn't have to do that to get the same error:

d1 = rng.random(size=100)
d2 = rng.random(size=100)

with pm.Model() as model:
    x = pm.MutableData("x", value=d1)
    # y = pm.ConstantData("obs", value=d2)
    a = pm.Normal("a")
    b = pm.Normal("b", mu=a * x, sigma=1, observed=d2)
    idata = pm.sample()

with model:
    pm.set_data({"x": rng.random(size=10)})
    # 10 x values, but still 100 y values
    test_ppc = pm.sample_posterior_predictive(idata)

To check any registered variables (e.g., pm.Data containers, RVs like pm.Normal, etc.), you can inspect the plate notation generated by pm.model_to_graphviz(model). But if you have other model components whose shapes the model doesn't know ahead of time (e.g., my second example above), it won't help much.
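For the earlier question about which object is driving the 27085: note that the 16 inputs listed in the traceback are the inputs to that single Elemwise node, and several of them (the AdvancedSubtensor outputs) are intermediate results rather than your data containers, which is why the count doesn't match your 10 mutable variables. As a rough sketch (it only covers things the model has registered), you can also dump the current shape of every named variable; data containers are shared variables, so .get_value() shows the array they hold right now:

for name, var in constant_model.named_vars.items():
    if hasattr(var, "get_value"):
        # pm.MutableData containers are shared variables holding an actual array
        print(name, var.get_value().shape)
    else:
        # other registered variables only expose their static type shape
        # (None means the size is not known until runtime)
        print(name, var.type.shape)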