Linear Regression - Difference between these two models?


#1

Hello,

I’m doing linear regression and I’m wondering which is the correct / better way of observing independent variable (weight in my case, the dependent variable is height).

One way is to model independent variable as a (normal) distribution and conditioning on observed data:

m = pm.Model()

with m:

alpha = pm.Uniform('alpha', lower=0, upper=300)
beta = pm.Normal('beta', mu=0, sd=50)
sigma_height = pm.Uniform('sigma_height', lower=0, upper=50)

weight = pm.Normal('weight', mu=np.mean(d2['weight']), sd=np.std(d2['weight']), observed=d2['weight'])
mu_height = alpha + beta * (weight-np.mean(d2['weight']))

height = pm.Normal('height', mu=mu_height, sd=sigma_height, observed=d2['height'])

But I discovered that I can simply add weight directly from dataframe (without first modeling it as a normal distribution)

m = pm.Model()

with m:

alpha = pm.Uniform('alpha', lower=0, upper=300)
beta = pm.Normal('beta', mu=0, sd=50)
sigma_height = pm.Uniform('sigma_height', lower=0, upper=50)

mu_height = alpha + beta * (d2['weight']-np.mean(d2['weight']))

height = pm.Normal('height', mu=mu_height, sd=sigma_height, observed=d2['height'])

My questions are:

  • How are these two models treated by PyMC3 during MCMC inference? What’s the difference? (Specifically for the second one, since dataframe (d2[‘weight’]) has a list of values, how is mu_height calculated (since it is scalar)? During sampling, Is dataframe also sampled (uniformly?) to arrive at mu_height?
  • Which one should I be using for regression? Are there any pros and cons of each approach?

By the way, both of the approaches give similar estimations for unknowns (alpha, beta, sigma_height).


#2

The two model will be treated the same in MCMC, because

weight = pm.Normal('weight', mu=np.mean(d2['weight']), sd=np.std(d2['weight']), observed=d2['weight'])

Adds a constant to the model logp, thus has not effect during MCMC. Moreover, weight will be a tensor with value set to d2['weight'].

You should use the first one, because the second one does not add any new information and could be confusing. But if you have uncertainty in the covariates you want to model you can also do:

mu_weight = pm.Normal(...)
sd_weight = pm.HalfNormal(...)
weight = pm.Normal('weight', mu=mu_weight, sd=sd_weight, observed=d2['weight'])

#3

Thanks. A quick follow up question on your comment:

Moreover, weight will be a tensor with value set to d2['weight'] .

Since, mu_height is determined by:

mu_height = alpha + beta * (d2[‘weight’]-np.mean(d2[‘weight’]))

I’m assuming mu_height will also be a tensor (not tensor) during sampling? Is my understanding correct?

In that case, what does mu=mu_height imply in:

height = pm.Normal(‘height’, mu=mu_height, sd=sigma_height, observed=d2[‘height’])

My assumption was that mu took a scalar value. How is mu_height tensor treated? (Also sigma_height isn’t a tensor and I’m not sure why that won’t cause a problem)


#4

Yes

It will be a vector tensor with the shape same as the observed. Its value is propagated from other free parameters, so it is essentially invisible to the sampler (as it is not a free parameter).

It will be convert into a tensor constant during run time


#5

Thanks.

So this means if mu_height is of shape (100) containing 100 different data points, sigma_height will also be recasted as shape of (100) with the values replicated 100 times?


#6

It will be broadcasted during the computation of logp. But will not be a shape (100) tensor the same sense as mu_height.