How Does the Sampler Use Covariate Observed Data?

Hi everyone,
I am trying to understand how the PyMC sampler draws from an observed random variable that acts as a covariate in a regression. I am still a beginner in statistical modeling, so I apologize for any terminological issues in my question.
Let’s say I have the following model, where I want to put priors (x_mu, x_sigma) on the observed data of the covariate (x) in order to account for potential measurement error:

import pymc as pm

# x_observed, y_observed: your covariate and outcome data arrays
with pm.Model() as model_full:
    # priors on the covariate to account for measurement error
    x_mu = pm.Normal("x_mu", 0, 1)
    x_sigma = pm.HalfNormal("x_sigma", 1)
    x = pm.Normal("x", x_mu, x_sigma, observed=x_observed)

    a = pm.Normal("alpha", 0, 1)
    b = pm.Normal("beta", 0, 1)

    y_mu = pm.Deterministic("y_mu", a + b * x)
    y_sigma = pm.HalfNormal("y_sigma", 1)
    y = pm.Normal("y", y_mu, y_sigma, observed=y_observed)

My understanding is that, depending on the implementation, the sampler could:

  1. use the observed data from X directly to compute the posterior of Y, essentially ignoring the mu and sigma terms for the downstream inference, while separately computing the posterior of X from the given prior/observed data combo. In this scenario, the posterior of X is the same regardless of the values of Y.

  2. jointly compute the posterior distributions of both X and Y based on their observed data/priors. In this scenario, the posterior of X depends on the values of Y.

  3. do something else that I have not understood.

I have generally assumed that #2 is what happens under the hood. Empirically, however, when I run a model like the one below (computing X without actually doing the regression), the posterior of X is the same as in the model above that includes the regression, which suggests to me that something closer to #1 is happening.

with pm.Model() as model_x_only:
    x_mu = pm.Normal("x_mu", 0, 1)
    x_sigma = pm.HalfNormal("x_sigma", 1)
    x = pm.Normal("x", x_mu, x_sigma, observed=x_observed)
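
To make the comparison concrete, here is a minimal sketch of the check (model_full and model_x_only are the context names used in the two snippets above; the seed is arbitrary):

import arviz as az

with model_full:
    trace_full = pm.sample(random_seed=1)
with model_x_only:
    trace_x_only = pm.sample(random_seed=1)

# if scenario #1 holds, these summaries match up to sampling noise
print(az.summary(trace_full, var_names=["x_mu", "x_sigma"]))
print(az.summary(trace_x_only, var_names=["x_mu", "x_sigma"]))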

If someone could help me understand, or let me know whether my question is completely missing the point, that would be great! I’m also happy to read any existing documentation that might give me a clue.


In general, observed variables and unobserved variables are treated differently. In a standard regression setup, you would have a, b, and y_sigma as your parameters (not directly observed), and you would take x and y as observed:

import numpy as np
import pymc as pm

# observed data
x = np.array([17, 7, 2, 21, 42])
y = np.array([21, 23, 45, 31, 101])

with pm.Model():
    # model parameters
    a = pm.Normal("a")
    b = pm.Normal("b")
    y_sigma = pm.HalfNormal("y_sigma", 1)

    # likelihood
    likelihood = pm.Normal("likelihood", a + (b * x), y_sigma, observed=y)

If you model x_mu and x_sigma, then you are doing something closer to an error-in-variables model, which is a thing, but is probably not what you’re trying to do.


I am not sure how the sampler uses the observed value in conjunction with the regression downstream, but one suggested way to add noise to such a variable is a bit different from what you are doing: create a latent variable, as in the answer here:

So in your case it would be:

import pymc as pm
import arviz as az

x_observed = [17, 7, 2, 21, 42]
y_observed = [21, 23, 45, 31, 101]

with pm.Model() as model:
    x_mu = pm.Normal("x_mu", 0, 1)
    x_sigma = pm.HalfNormal("x_sigma", 1)
    x_err = pm.HalfNormal("x_err", 1)
    # one latent "true" value per observation
    x_lat = pm.Normal("x_lat", x_mu, x_sigma, shape=len(x_observed))
    x = pm.Normal("x", x_lat, x_err, observed=x_observed)

    a = pm.Normal("alpha", 0, 1)
    b = pm.Normal("beta", 0, 1)

    y_mu = pm.Deterministic("y_mu", a + b * x_lat)
    y_sigma = pm.HalfNormal("y_sigma", 1)
    y = pm.Normal("y", y_mu, y_sigma, observed=y_observed)

    trace = pm.sample()

The posterior I get from here is:

[posterior plot]

whereas if I run this

with pm.Model() as model:
    x_mu = pm.Normal("x_mu", 0, 1)
    x_sigma = pm.HalfNormal("x_sigma", 1)
    x_err = pm.HalfNormal("x_err", 1)
    x_lat = pm.Normal("x_lat", x_mu, x_sigma, shape=len(x_observed))
    x = pm.Normal("x", x_lat, x_err, observed=x_observed)

    trace = pm.sample()

the posterior I get is:

[posterior plot]


You are correct that posterior sampling corresponds to #1. In general, every model RV is given a value during sampling; for observed RVs, this value is fixed to the observed data.

Prior and posterior predictive sampling will still have uncertainty flowing through x, but not posterior sampling, where those values are fixed.
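
For example, here is a minimal sketch of that difference, reusing the model_full context from the question (the variable names are those defined there):

with model_full:
    idata = pm.sample()
    # "x" is absent from idata.posterior: during posterior sampling it is
    # fixed to x_observed (it lives in idata.observed_data instead)
    pm.sample_posterior_predictive(
        idata, var_names=["x", "y"], extend_inferencedata=True
    )

# predictive draws of "x" do vary, reflecting x_mu / x_sigma uncertainty
print(idata.posterior_predictive["x"])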

The model @iavicenna described is a valid alternative, but both are fine; it’s a modeling choice.


@junpenglao’s talk on multivariate imputation touches on this question as well: Partial Missing Multivariate Observation and What to Do With Them by Junpeng Lao

Interesting to know that the model where an observed RV is used for x is a valid choice that just replaces the input x; I had never thought of it from that perspective before. I updated my answer from "the suggested way" to "one suggested way" 🙂


This was super helpful. Thanks to everyone who offered help!


Hi @ricardoV94, thanks again for your helpful response here. I have a quick follow-up question if you have a moment: what if the covariate X has missing values? My assumption is that the non-missing values of X will be passed into the regression as fixed values, and the missing values will be treated as unobserved random variables drawn from the prior put on X, but I wasn’t sure and haven’t figured out how to test this empirically yet.

Yes, that’s what should happen. I suggest writing a very simple model and looking at model.to_graphviz() to confirm.
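
For instance, a minimal sketch of such a check might look like the following (the masked-array trick and the resulting x_observed / x_unobserved node names reflect my understanding of how recent PyMC versions split a partially observed RV):

import numpy as np
import pymc as pm

# mark the third covariate value as missing via a masked array
x_data = np.ma.masked_invalid([17.0, 7.0, np.nan, 21.0, 42.0])
y_data = [21, 23, 45, 31, 101]

with pm.Model() as model:
    x_mu = pm.Normal("x_mu", 0, 1)
    x_sigma = pm.HalfNormal("x_sigma", 1)
    # PyMC splits x into a fixed observed part and a sampled unobserved part
    x = pm.Normal("x", x_mu, x_sigma, observed=x_data)

    a = pm.Normal("alpha", 0, 1)
    b = pm.Normal("beta", 0, 1)
    y_sigma = pm.HalfNormal("y_sigma", 1)
    y = pm.Normal("y", a + b * x, y_sigma, observed=y_data)

# the graph should show x_observed as fixed data and x_unobserved as a free RV
model.to_graphviz()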


Thanks!