Measurement Uncertainties in Multiple Regression

I am just making the jump to PyMC3. Loving it so far, but struggling to work out something I suspect is rather basic: how to include measurement uncertainties on all variables in multiple regression. For example, in a toy system:

$$
\begin{align}
y_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= p_0 x_{0,i} + p_1 x_{1,i} \\
p_0 &\sim \mathrm{Normal}(0, 1) \\
p_1 &\sim \mathrm{Normal}(0, 1)
\end{align}
$$

where $x_{n,i}$ and $y_i$ are measurements of samples with known uncertainties (different for each datum). I want to find $p_0$ and $p_1$, accounting for the uncertainties in all variables. Some toy data to go with the problem:

import numpy as np
np.random.seed(123)
n = 50
x0 = np.random.uniform(-1, 1, n)
x0_std = 0.01 + np.random.exponential(0.05, n)

x1 = np.random.uniform(-1, 1, n)
x1_std = 0.01 + np.random.exponential(0.05, n)

p0_true = 0.54
p1_true = -0.73

y = p0_true * x0 + p1_true * x1 + np.random.normal(0, 0.15, n)  # one noise draw per datum
y_std = 0.01 + np.random.exponential(0.05, n)

I’m unsure how to model this correctly in PyMC3. What I’ve come up with is modelling the x_n variables as distributions that are sampled in each iteration:

import pymc3 as pm
with pm.Model() as m_toy:
    # model uncertainties in the x variables as distributions
    mx0 = pm.Normal('x0', x0, x0_std, shape=len(x0))
    mx1 = pm.Normal('x1', x1, x1_std, shape=len(x1))
    
    p0 = pm.Normal('p0', 0, 1)
    p1 = pm.Normal('p1', 0, 1)
    
    mu = p0 * mx0 + p1 * mx1
    sigma = pm.Exponential('sigma', 1)  # estimate the population sigma: y ~ N(mu_i, [sigma])

    y_pred = pm.Normal('y_pred', mu, y_std, shape=len(y))  # include the y uncertainty on the prediction here
    likelihood = pm.Normal('likelihood', y_pred, sigma, observed=y) # calculate likelihood from y_pred and sigma
        
    trace_toy = pm.sample(1000)

This produces something that looks sane:

# posterior predictive
isamples = np.random.randint(0, len(trace_toy), 1000)
post_pred = (x0.reshape(-1, 1) * trace_toy['p0'][isamples]
             + x1.reshape(-1, 1) * trace_toy['p1'][isamples]
             + np.random.normal(0, trace_toy['sigma'][isamples])).T

[image: posterior predictive samples]

But is my approach correct/valid, or have I completely misunderstood something? If the latter, how should I implement this correctly? In either case, how might I vectorise this to increase efficiency when dealing with many more variables?

Thanks in advance!

This looks mostly right to me, although the model needs a few alterations, which I’ve made below.

Here’s the part for generating fake data:

import numpy as np
np.random.seed(123)
n = 50
p = 10 # Number of coefficients

# Create PxN matrix of covariates
X     = np.random.uniform(-1,1,size=(p,n))
X_std = 0.01 + np.random.exponential(0.05,size=(p,n))

# Create vector of coefficients with P elements
coefs = np.random.randn(p)

# Create observation with non-measurement noise added
y = coefs.dot(X) + np.random.normal(0, 0.15, n)

# Create observation measurement noise
y_std = 0.01 + np.random.exponential(0.05, n)

Then here’s the PyMC3 model. I rearranged the order of the model variables a bit and, rather than using the observed keyword, expressed everything with latent variables. This makes it a little easier to understand, in my opinion. We model some true but unobserved y_true, which is then used in the standard likelihood calculation (and which should incorporate sigma, not y_std).

import pymc3 as pm
with pm.Model() as m_toy:
    
    X_pm     = pm.Normal('X_pm', mu=X, sd=X_std, shape=(p, n))
    coefs_pm = pm.Normal('coefs_pm', mu=0., sd=10, shape=p)

    mu = pm.math.dot(coefs_pm, X_pm)
    sigma = pm.Exponential('sigma', 1)

    # latent true response; our measurements y +/- y_std act as its prior
    y_true = pm.Normal('y_true', y, y_std, shape=n)

    # couple mu and y_true through a Normal term centred on their difference
    likelihood = pm.Normal('likelihood', mu - y_true, sigma, shape=n)
        
    trace_toy = pm.sample(1000)

Using this setup, I was able to draw about 60 samples/s across 4 cores, though some of the chains had divergences. I’m interested to see how other people would solve this problem.
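If the divergences are a problem, one thing I’d try (untested here, and the 0.95 is just a common starting point rather than a tuned value) is raising NUTS’s target acceptance rate, which forces smaller step sizes:

```python
with m_toy:
    # smaller steps tend to reduce divergences, at some cost in sampling speed
    trace_toy = pm.sample(1000, tune=2000, target_accept=0.95)
```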

Thanks!

I’d parsed out the variables because in my real case some of the uncertainties have different distributions, but your formulation is definitely more elegant!
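For concreteness, here’s the sort of thing I mean (a rough sketch only: the Student-t errors on x1 and the nu=4 value are made up for illustration, and I’ve dropped the y uncertainty for brevity):

```python
import pymc3 as pm

with pm.Model() as m_mixed:
    # normally distributed measurement errors on x0...
    mx0 = pm.Normal('x0_true', mu=x0, sd=x0_std, shape=n)
    # ...but heavier-tailed Student-t errors on x1
    mx1 = pm.StudentT('x1_true', nu=4, mu=x1, sd=x1_std, shape=n)

    p0 = pm.Normal('p0', 0, 1)
    p1 = pm.Normal('p1', 0, 1)
    sigma = pm.Exponential('sigma', 1)

    pm.Normal('likelihood', p0 * mx0 + p1 * mx1, sigma, observed=y)
```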

Out of interest, is there a reason you calculate the likelihood using mu - y_true rather than using mu and setting observed=y_true? I’d always assumed that setting observed was necessary to tell the model what the target was… how would this model know what data it’s fitting?

My impression is that the underlying computation won’t change much between our two cases, because the model’s log posterior density depends on the difference between mu and y_true, and that difference enters the calculation the same way under the hood either way.

By having uncertain measurements, you’re really saying that you have a prior belief as to what the true values are, so including that prior distribution in the model is the same thing as saying you have some noisily observed data. The model “sees” all the information that you have - it’s been supplied the y and the y_std. I wrote it the way I did largely out of personal preference. I find it easier to follow the flow of the code if all the information about the noisy measurements is contained in the single line y_true = pm.Normal('y_true', y, y_std, shape=n) and then I can forget about it when designing the rest of the model.
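For comparison, here’s what I think the observed-style version you describe would look like (a sketch along the same lines as my model above, not something I’ve benchmarked against it):

```python
import pymc3 as pm

with pm.Model() as m_observed:
    X_pm     = pm.Normal('X_pm', mu=X, sd=X_std, shape=(p, n))
    coefs_pm = pm.Normal('coefs_pm', mu=0., sd=10, shape=p)

    mu = pm.math.dot(coefs_pm, X_pm)
    sigma = pm.Exponential('sigma', 1)

    # latent noise-free response
    y_latent = pm.Normal('y_latent', mu, sigma, shape=n)

    # measurement process: y_latent is observed through known noise y_std
    pm.Normal('obs', y_latent, y_std, observed=y)
```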

I will note that this shows off how PyMC3 is useful not just for sampling from complicated posterior distributions given observed data (the usual use case), but also for sampling from prior distributions with complicated structure. In the latter case there’s no update from prior to posterior, yet the calculations can still be done and we can draw samples.
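As an aside, on a reasonably recent PyMC3 (3.5 or later, if I remember right) you can also take forward draws from the prior without running MCMC at all:

```python
with m_toy:
    # forward-sample every variable from its prior; no NUTS involved
    prior_draws = pm.sample_prior_predictive(samples=500)
```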