Dealing with multivariate x, errors in both x and y, and posterior predictive for y

I am fitting an artificial neural network to a set of of observed x and y, using the following linear regression model as my guide:

with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    true_x = pm.Normal('true_x', mu=0, sd=20, shape=len(x)) 
    likelihood_x = pm.Normal('x', mu=true_x, sd=x_err, observed=x)
    true_y = pm.Deterministic('true_y', intercept + gradient * true_x)
    likelihood_y = pm.Normal('y', mu=true_y, sd=y_err, observed=y)

I still have some questions:

  1. How would you deal with multiple x variables?
  2. If I don’t know much about true_x, what is the best prior to use?
  3. If I would like to calculate the posterior predictive distribution given new observed x, there is now no new observed y, does this mean I only update likelihood_x but not likelihood_y? Is this a problem if new observed x has a different sample size?


  1. You can represent multiple x variables (e.g., repeated measurement) as a matrix, and do * true_x) for matrix multiplication. Of course, you need to be careful of the shape in gradient and true_x

  2. It is hard to say, in general use a normal distribution with large standard deviation (e.g., 10x std of x) is a good starting point

  3. You can have a look at this blogpost and related discussion in OOS predictions with missing input values for some inspiration.

1 Like

Thanks so much! In particular for the link to the missing input values. I think my problem was letting go of the idea of ‘inputs’ and ‘outputs’, and just think of observables.

1 Like

I used to implement my model but am still having some issues! I think it’s to do with how I define the shape of some of the random variables. To explain my model briefly, I have 12 input random variables, and one output random variable that I estimate through a neural network. This results in 13 observed variables each with measured noise. My training set has ~1000 (nobs) sets of measurements of these 13 observed variables. I would like to fit the training set, and then examine the performance on a further test set, before using the Bayesian neural network to calculate posterior predictive distributions for new observations for which the 13th observable(i.e. the output) is missing. From your example, it seemed that I don’t need to give the information on the number of measurements, but in that case I don’t get the right distribution of weights. When I do give the nobs information, I get posteriors that look right but cannot then apply the model to new datasets with different numbers of measurements. Here’s my code:

# Number of input variables (fixed)
ninputs = inputsTrainScale.shape[1]
# Number of neurons on each hidden layer (fixed)
neurons_per_hiddenlayer = 5

# Set observed random variables to be shared (initially set to training set)
inputsShared     = theano.shared(inputsTrainScale)
errInputsShared  = theano.shared(errInputsTrainScale)
targetShared     = theano.shared(targetTrain)
errTargetShared  = theano.shared(errTargetTrain)

# Initialize random weights on neurons
init_hid = np.random.randn(inputsTrainScale.shape[1],neurons_per_hiddenlayer)
init_out = np.random.randn(neurons_per_hiddenlayer)

with pm.Model() as neural_network:
    nobs = len(targetShared.eval())
    # Weights from input to hidden layer
    wts_in_hid = pm.Normal('wts_in_hid',mu=0,sd=1,shape=(ninputs,neurons_per_hiddenlayer),testval=init_hid)
    # Weights from hidden layer to output
    wts_hid_out = pm.Normal('wts_hid_out',mu=0,sd=1,shape=(neurons_per_hiddenlayer,),testval=init_out)
    # Deterministic spectral parameters (mean zeros assuming they have been scaled, and std 1 means similar to measured values)
    true_x  = pm.Normal('true_x', mu=0,sd=1, shape=(ninputs)) 
    # Deterministic mass from neural network (assuming sigmoid and linear activation functions for the hidden and outer layers)
    act_hid   = TT.nnet.sigmoid(,wts_in_hid))
    act_out   =,wts_hid_out)
    true_mass = pm.Deterministic('true_mass',act_out)

    # Define likelihoods using observations
    likelihood_x    = pm.Normal('likelihood_x',mu=true_x,sd=errInputsShared,observed=inputsShared, shape=(ninputs))
    likelihood_mass = pm.Normal('likelihood_mass',mu=true_mass,sd=errTargetShared,observed=targetShared, shape=(nobs))

You should avoid specifying the ninputs in the shape, which is unnecessary as the computation should be broadcasted on that dimension (same idea as the batch size of using minibatch). Say that in your neural net you have the inputX shape=(1000, 12) and outputY shape = (1000, 1), you should specify the weight with the shapes such as (12, a), (a, b), (b, c) … (k, 1), and after a chain of matrix multiplication and non-linear transformation, you arrive at a Y-hat with shape = (1000, 1). Notice that in the model specification the shape of the weight (i.e., the RVs in pm.Model) does not contain 1000, the number of observations.

In my blog post, instead of inputing inputX I use a latent variable inputX_obs = pm.Normal('X', mu=Xmu, sd=1., observed=inputX). The matrix multiplication between the input and the weights at each layer of the NN is then operate on inputX_obs instead of inputX. Here the prior Xmu will have a shape of (12, 1) in your case. Notices that if there is no missing value in inputX, doing so it is equivalent as inputX_obs = inputX.

Back to your code, instead of
act_hid = TT.nnet.sigmoid(,wts_in_hid))
you should do
act_hid = TT.nnet.sigmoid(,wts_in_hid)),
and add additional Random variable to the model for posterior prediction.

1 Like

Ok I completely missed that I should be inputting inputX rather than inputX_obs. It now works as expected for predicting posterior distributions for the seen observables, but only when inputsTrainScale and errInputsTrainScale are not theano.shared. I’m not sure what sharing them with theano changes to stop it from working. The other thing I’m not sure about is what to set the sd parameter in likelihood_mass for new measurements of inputX. In the original model I assign it to the measured noise. Obviously for new inputX I don’t have a measurement of the noise in Y. Should I treat it like a missing variable?

Hmm, if you dont have any missing value (ie., the input is not a numpy mask array), setting it to theano.shared should still works. But if you are adding a new node for posterior prediction you dont need theano.shared the input as you are not going to reset the value.

I would use the measurement noise of the original Y.

Ok I think that all works now - thank you!

1 Like

I’ve come back to this after a few months. I realized there is still one thing that confuses me. What is the difference between posterior predictive checks and prediction ( It seems that the difference lies in the specification of the size parameter. In my estimates of the posterior distributions for new data predicted by the neural network, I would like to carry through the parameter errors, and the errors in new data. Doesn’t this mean I should specify size too? When I do, I get a shape error.

Posterior prediction check usually is used in the context that after you fitted the model, you generate prediction from the model to and see how the observed overlapped with the prediction from the fit. One typical way to do is plot the (smoothed) histogram and plot the real data on top. For example, see figure 6 in Visualization in Bayesian workflow.

Notice that the observed here could be the training set or the test set.

As for the shape error when specifying the size, you can try creating a new node with a specific shape and do sample_ppc(...,vars=[new_node],...).