I am just making the jump to PyMC3. Loving it so far, but struggling to work out something I suspect is rather basic: how to include measurement uncertainties on all variables in multiple regression. For example, in a toy system:
Where x_{n,i} and y_i are measurements of samples with known uncertainties (different for each datum). I want to find p_0 and p_1, accounting for the uncertainties in all variables. Some toy data to go with the problem:
import numpy as np
np.random.seed(123)
n = 50
x0 = np.random.uniform(-1, 1, n)
x0_std = 0.01 + np.random.exponential(0.05, n)
x1 = np.random.uniform(-1, 1, n)
x1_std = 0.01 + np.random.exponential(0.05, n)
p0_true = 0.54
p1_true = -0.73
y = p0_true * x0 + p1_true * x1 + np.random.normal(0, 0.15)
y_std = 0.01 + np.random.exponential(0.05, n)
I’m unsure how to model this correctly in PyMC3. What I’ve come up with is modelling the x_n variables as distributions that are sampled in each iteration:
import pymc3 as pm
with pm.Model() as m_toy:
# model uncertainteis in the x variables as distributions
mx0 = pm.Normal('x0', x0, x0_std, shape=len(x0))
mx1 = pm.Normal('x1', x1, x1_std, shape=len(x1))
p0 = pm.Normal('p0', 0, 1)
p1 = pm.Normal('p1', 0, 1)
mu = p0 * mx0 + p1 * mx1
sigma = pm.Exponential('sigma', 1) # estimate the population sigma: y ~ N(mu_i, [sigma])
y_pred = pm.Normal('y_pred', mu, y_std, shape=len(y)) # include the y uncertainty on the prediction here
likelihood = pm.Normal('likelihood', y_pred, sigma, observed=y) # calculate likelihood from y_pred and sigma
trace_toy = pm.sample(1000)
This produces something that looks sane:
# posterior predictive
isamples = np.random.randint(0, len(trace_toy) - 1, 1000)
post_pred = (x0.reshape(-1,1) * trace_toy['p0'][isamples] + x1.reshape(-1,1) * trace_toy['p1'][isamples] + np.random.normal(0, trace_toy['sigma'][isamples])).T
But is my approach correct/valid, or have I completely misunderstood something? If the latter, how should I implement this correctly? In either case, how might I vectorise this to increase efficiency when dealing with many more variables?
Thanks in advance!