Measurement Uncertainties in Multiple Regression

This looks mostly right to me, although the model needs a few alterations. I’ve made some changes below:

Here’s the part for generating fake data:

import numpy as np
np.random.seed(123)
n = 50
p = 10 # Number of coefficients

# Create p x n matrix of covariates
X     = np.random.uniform(-1, 1, size=(p, n))
# Per-element measurement uncertainty (standard deviation) of X
X_std = 0.01 + np.random.exponential(0.05, size=(p, n))

# Create vector of coefficients with P elements
coefs = np.random.randn(p)

# Create observation with non-measurement noise added
y = coefs.dot(X) + np.random.normal(0, 0.15, n)

# Create observation measurement noise
y_std = 0.01 + np.random.exponential(0.05, n)
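As a quick sanity check on this fake data (just an illustration, not part of the model): plain least squares that ignores both X_std and y_std should already land near the true coefficients, since the extra noise here is mild, so the interesting question is what properly modelling the uncertainties adds. A sketch:

```python
import numpy as np

# Re-create the fake data from above (same seed, same draw order)
np.random.seed(123)
n, p = 50, 10
X     = np.random.uniform(-1, 1, size=(p, n))
X_std = 0.01 + np.random.exponential(0.05, size=(p, n))
coefs = np.random.randn(p)
y     = coefs.dot(X) + np.random.normal(0, 0.15, n)

# Ordinary least squares, ignoring the measurement uncertainties entirely
ols, *_ = np.linalg.lstsq(X.T, y, rcond=None)
max_err = np.abs(ols - coefs).max()
```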

Then here’s the PyMC3 model. I rearranged the order of the model variables a bit and, rather than passing the measured y directly as observed, modelled the measurement through latent variables, which I find a little easier to understand. We model a true but unobserved y_true, which is then used in the standard likelihood calculation (and that likelihood should incorporate sigma, not y_std).

import pymc3 as pm
with pm.Model() as m_toy:
    
    X_pm     = pm.Normal('X_pm', mu=X, sd=X_std, shape=(p, n))
    coefs_pm = pm.Normal('coefs_pm', mu=0., sd=10., shape=p)
    
    mu = pm.math.dot(coefs_pm, X_pm)
    sigma = pm.Exponential('sigma', 1)

    # Latent "true" observations, anchored to the measured y via y_std
    y_true = pm.Normal('y_true', y, y_std, shape=n)

    # Constrain mu to match y_true up to the non-measurement noise sigma.
    # The observed zeros are needed here: without an observed array this
    # node would just be another free latent and the data would never
    # constrain the coefficients.
    likelihood = pm.Normal('likelihood', mu - y_true, sigma, observed=np.zeros(n))
        
    trace_toy = pm.sample(1000)
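One thing worth noting about mu = pm.math.dot(coefs_pm, X_pm): it follows NumPy dot semantics, so a length-p vector times a (p, n) matrix yields a length-n vector, matching y. The same shape logic in plain NumPy (the _demo names are just stand-ins for the model variables):

```python
import numpy as np

p, n = 10, 50
coefs_demo = np.random.randn(p)      # plays the role of coefs_pm
X_demo     = np.random.randn(p, n)   # plays the role of X_pm
mu_demo    = coefs_demo.dot(X_demo)  # (p,) . (p, n) -> (n,)
```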

Using this setup, I was able to draw about 60 samples/s across 4 cores, though some of the chains had divergences. I’m interested to see how other people would solve this problem.
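One way to lighten the load on the sampler (and often to reduce divergences) is to marginalize y_true out of the observation side analytically: in the usual formulation, if y_true ~ Normal(mu, sigma) and the measurement is Normal(y_true, y_std), then the measured y is Normal(mu, sqrt(sigma² + y_std²)), so the latent vector can be dropped from the y side entirely. A quick NumPy Monte Carlo check of that variance identity (standalone, not PyMC3 code; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, y_std = 1.0, 0.3, 0.2
n_draws = 200_000

# Simulate the two-stage process: latent truth, then noisy measurement
y_true = rng.normal(mu, sigma, n_draws)
y_meas = rng.normal(y_true, y_std)

# The marginal standard deviation should be sqrt(sigma**2 + y_std**2)
expected_sd  = np.sqrt(sigma**2 + y_std**2)
empirical_sd = y_meas.std()
```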
