Unable to predict using set_value with an errors-in-variables model

I have a model, based on this answer, that I am using to perform a linear regression between environmental observations. The model works fine, but it does not allow me to predict on test data of a different length, for both of the reasons suggested here.

Removing the deterministic part seems fine, but removing the explicit shape does not. How could I reformulate this model so that it still captures the errors-in-variables structure but also allows prediction? Any help would be much appreciated.

Training data:

import numpy as np

# True parameter values
alpha_true = 0
beta_true = 2

# Size of dataset
size = 100

# True data
x_true = np.linspace(-5, 5, size)
y_true = alpha_true + beta_true * x_true

# Add noise to data
x = x_true + np.random.normal(loc=0, scale=1, size=size)
y = y_true + np.random.normal(loc=0, scale=1, size=size)

Model:

import pymc3 as pm
from theano import shared

x_in = shared(x)
y_in = shared(y)

with pm.Model() as model:
    
    err_odr = pm.HalfNormal('err_odr', 5.)
    err_param = pm.HalfNormal('err_param', 5.)
    a = pm.Normal('intercept', 0, err_param)
    b = pm.Normal('slope', 0, err_param)
    
    x_lat = pm.Normal('x_lat', 0, 5., shape=x.shape[0])
    x_obs = pm.Normal('x_obs', mu=x_lat, sd=err_odr, observed=x_in, shape=x.shape[0])

    y_lat = a + b * x_lat
    y_obs = pm.Normal('y_obs', mu=y_lat, sd=err_odr, observed=y_in)

    trace = pm.sample(2000, tune=2000, cores=1)

Hi William,
IIUC, you need to change the shape of x_lat and x_obs? In that case you should be able to use a Theano shared variable, as you do for x_in and y_in. You can also use PyMC’s Data container.
Hope this helps :vulcan_salute:

Hi Alex, thanks for your reply!

I have tried both suggestions without success.

First I tried adding B_len = shared(len(x)) and changing shape=x.shape[0] to shape=int(B_len.get_value()), since shape requires a concrete integer.

I also tried…

with pm.Model() as model:    
    x_in = pm.Data('x_in', x)
    y_in = pm.Data('y_in', y)
    B_len = pm.Data('B_len', len(x))
    ...

and

x_new = np.linspace(0,20,200)
with model:
    pm.set_data({'x_in': x_new})
    pm.set_data({'B_len': 200})
    ppc = pm.sample_posterior_predictive(trace, samples=1000)

It looks like the values do change in model['x_in'].get_value() and in B_len, but sample_posterior_predictive still gives predictions on the original data. Is this what you were suggesting?


Hi William,
Yeah, that’s exactly what I was talking about, and I can replicate your issue – i.e. pm.set_data doesn’t update the shape of x_lat:

with pm.Model() as model:
    a = pm.Normal('intercept', 0., 1.)
    b = pm.Normal('slope', 0., 1.)
    
    x_len = pm.Data('x_len', x.shape[0])
    x_in = pm.Data('x_in', x)
    x_lat = pm.Normal('x_lat', 0., 5., shape=x_len.get_value().astype(int))
    x_obs = pm.Normal('x_obs', mu=x_lat, sd=1., observed=x_in)

    y_lat = a + b * x_lat
    y_in = pm.Data('y_in', y)
    y_obs = pm.Normal('y_obs', mu=y_lat, sd=1., observed=y_in)

    trace = pm.sample(2000, tune=2000)

This samples correctly and gives:

model['x_len'].get_value(), model['x_in'].get_value().shape, model['x_lat'].tag.test_value.shape
(array(100.), (100,), (100,))

But updating the data with a new shape doesn’t trickle down to x_lat:

x_new = np.linspace(0, 2, 200)
with model:
    pm.set_data({'x_len': 200, 'x_in': x_new})
    ppc = pm.fast_sample_posterior_predictive(trace, var_names=["x_lat", "x_obs"])

model['x_len'].get_value(), model['x_in'].get_value().shape, ppc["x_lat"].shape
(array(200.), (200,), (8000, 100))

You can see above that the shape of x_lat does not update.
@junpenglao @lucianopaz I think you know more about this than I do: is it even possible to update the shape of an RV for posterior predictive sampling? Should this issue be reported on our GitHub?

Hi All,

I am new to PyMC3 and have found it to be wonderful software, but I have observed the same behaviour, even though I have tried every pointer I could find on the web for predicting with X_Test, Y_Test of a different length from X_Train, Y_Train. There are many situations in system identification (at least in my experience) where the training and test datasets have different lengths. So it would be nice if a solution to this issue could be found, or at least a workaround suggested.

Thanks,
Rishi

set_data does not always work when the inputs change shape, especially if the shape is specified or hard-coded in the model. The best approach in this case is to re-initialize the same model but condition it on the test input; for example, a good practice is to have:

def generate_model(x, y):
    with pm.Model() as m:
        ...
    return m

and then call it as:

train_model = generate_model(train_x, train_y)
with train_model:
    trace = pm.sample(...)

For posterior predictive samples conditioned on the test set, you can then do:

test_model = generate_model(test_x, test_y)
with test_model:
    ppc = pm.sample_posterior_predictive(trace, ...)
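
For a concrete, self-contained sketch of this pattern (a plain linear regression for illustration, with x_train/y_train and x_test/y_test as assumed placeholder arrays, not the errors-in-variables model from the original post):

import numpy as np
import pymc3 as pm

def generate_model(x, y):
    # Everything that depends on the data length is derived from the
    # arrays passed in, so no shape is hard-coded inside the model.
    with pm.Model() as m:
        a = pm.Normal('intercept', 0., 1.)
        b = pm.Normal('slope', 0., 1.)
        sigma = pm.HalfNormal('sigma', 1.)
        pm.Normal('y_obs', mu=a + b * x, sd=sigma, observed=y)
    return m

# Fit on the training data, then rebuild the model around the test data
# and reuse the same trace for posterior predictive sampling.
train_model = generate_model(x_train, y_train)
with train_model:
    trace = pm.sample(1000, tune=1000)

test_model = generate_model(x_test, y_test)
with test_model:
    ppc = pm.sample_posterior_predictive(trace)

Because the variable names match between the two models, the trace from the training model plugs straight into sample_posterior_predictive in the test model.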

Thanks for the pointer and the prompt reply @junpenglao…I will definitely try this…

Cheers

Thanks Junpeng :ok_hand: IIUC, this means we need to recompile the model in this case (basically using two different although related models)?

Exactly - usually for small models it is fine, but if the model compilation itself is slow, it can add serious overhead.


Hello! I tried this solution, but without luck. My problem is that the x input of generate_model(x, y) is not a single matrix in my data. In fact, I have several inputs for x (as numpy arrays), and in the pm.Model I have something like log_rr = Deterministic('rr', np.log(nominator/denominator)) and y = pm.Normal('y', mu=log_rr, sigma=SE, observed=log_rr_array). The nominator and denominator involve numpy arrays as input. I can run the model on the original training data, but when I switch in the test dataset, sampling keeps giving me shape errors such as this:
ValueError: Input dimension mis-match. (input[0].shape[0] = 9003, input[1].shape[0] = 82)
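
In case it helps, here is a purely hypothetical sketch of how the factory pattern above could be extended when several arrays feed the model: pass every array that enters the likelihood through the factory together, so the training and test shapes can never mix. The argument names and the scale parameter below are placeholders, not the actual model:

import numpy as np
import pymc3 as pm

def generate_model(num_arr, den_arr, log_rr_array, SE):
    # All arrays come from the same train or test split, so every
    # shape inside the model is consistent by construction.
    with pm.Model() as m:
        scale = pm.HalfNormal('scale', 1.)  # placeholder parameter
        log_rr = pm.Deterministic('rr', scale * np.log(num_arr / den_arr))
        pm.Normal('y', mu=log_rr, sigma=SE, observed=log_rr_array)
    return m

As above, one would then build a training and a test version from the respective arrays and pass the training trace to pm.sample_posterior_predictive inside the test model.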