Multiple regression caveats

phildias · July 23, 2018, 3:12pm

Hi there!

I’m trying to get a multiple regression off the ground, but I’m running into some problems.

The general structure of the model is like so:

y_{i,1} = \mathbf{x}_{i,1} \cdot \mathbf{\beta}_{1}' + \epsilon_{i,1}
y_{i,2} = \mathbf{x}_{i,2} \cdot \mathbf{\beta}_{2}' + \epsilon_{i,2}

dim(\mathbf{x}_{i,1})=dim(\mathbf{\beta}_{i,1})=(k_1,1)
dim(\mathbf{x}_{i,2})=dim(\mathbf{\beta}_{i,2})=(k_2,1)
dim(y_{i,1})=dim(y_{i,2})=(1,1)

\bf{Likelihood}
\mathbf{\epsilon}_i = (\epsilon_{i,1},\epsilon_{i,2}) \stackrel{iid}{\sim} MVN(mean=\mathbf{0}_2,cov=\mathbf{\Sigma_\epsilon})

\bf{Priors}
(\beta_{1,1},...,\beta_{{k_1},1}) \stackrel{iid}{\sim} Normal(mean=0,stdev=10)
(\beta_{1,2},...,\beta_{{k_2},2}) \stackrel{iid}{\sim} Normal(mean=0,stdev=10)
\mathbf{sd}=(sd_1,sd_2) \stackrel{iid}{\sim} HalfCauchy(beta=2.5)
Chol(\mathbf{\Sigma}_\epsilon) \sim LKJCholeskyCov(eta=2, n=2, sd=\mathbf{sd})

In the example I’m giving above, I have 2 regression lines, \mathbf{y}_1 which is explained by variables \mathbf{x}_1 and \mathbf{y}_2 which is explained by variables \mathbf{x}_2. Notice that this form is general enough that \mathbf{x}_1 and \mathbf{x}_2 can be constituted by completely different columns.

What is making my life a bit difficult is translating this into PyMC3. Let me show you my example using some fake data

import pymc3 as pm
import numpy as np
import pandas as pd
import theano
import theano.tensor as tt

# Reading in dataset
dataset = pd.read_csv("fake_multiple_multivariate_regression_data_covmtx.csv")

num_rows = len(dataset)

# Naming the dependent categorical variables
dep_vars = ["y1_cont","y2_cont"]

num_dep_vars = len(dep_vars)

# Exogenous variables impacting each depvar. 
indep_vars = [["uno","x1","x2","x3","x4"],
              ["uno","x3","x4","x5","x6","x7"]]

indep_vars_y1 = len(indep_vars[0])
indep_vars_y2 = len(indep_vars[1])

x_data_y1 = dataset[indep_vars[0]].values
x_data_y2 = dataset[indep_vars[1]].values

y_data = dataset[dep_vars].values

with pm.Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
    # Priors for betas
    betas_y1 = pm.Normal('Betas_Y1', 0, sd=20, shape=indep_vars_y1)
    betas_y2 = pm.Normal('Betas_Y2', 0, sd=20, shape=indep_vars_y2)

    # Priors for error covariance matrix
    sd_dist = pm.HalfCauchy.dist(beta=2.5, shape=num_dep_vars)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=2, n=num_dep_vars, sd_dist=sd_dist)
    chol = pm.expand_packed_triangular(num_dep_vars, packed_chol, lower=True)
    
    y1_hat = tt.dot(x_data_y1, betas_y1)
    y2_hat = tt.dot(x_data_y2, betas_y2)

    # Here is my main problem!!!
    mu = pm.Deterministic('mu', tt.stack([y1_hat,y2_hat])) 
    
    # Define a new MvNormal with the given covariance
    vals = pm.MvNormal('vals', mu=mu, chol=chol, observed=y_data)

In the example above, I’m having a hard time re-packing y1_hat and y2_hat in a single array-type element so I can use the MvNormal.

Does anyone have any insight on how to do so?

Thank you!
.

bwengals · July 23, 2018, 3:50pm

All you need is a small fix:

mu = pm.Deterministic('mu', tt.stack([y1_hat, y2_hat], axis=1)

tt.stack by default uses axis=0, which takes your two arrays horizontally, so the output shape is (2, 5000) instead of (5000,2).

phildias · July 23, 2018, 4:13pm

Thanks, that worked like a charm =)

For some reason, though, another issue popped up: when I try to sample from this model, pymc3 just hangs on me.

So when I call this:

with model:
    trace = pm.sample()

Python then hangs on this screen after calculating the first quarter of the job:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [chol_cov, Betas_Y2, Betas_Y1]
Sampling 4 chains:  25%|██▌       | 1000/4000 [03:00<01:02, 47.98draws/s]

Any thoughts?

Thanks again for the help!!!

junpenglao · July 23, 2018, 4:31pm

Try using a more informative prior with lighter tail might help:

sd_dist = pm.HalfNormal.dist(sd=1, shape=num_dep_vars)

phildias · July 23, 2018, 4:54pm

Thanks for the tip, @junpenglao!

Unfortunately, though, it still hangs… =/

bwengals · July 23, 2018, 5:31pm

Might be a problem with the multiprocess sampling. This is likely fixed in the newest version of PyMC3. Before you update, try setting the cores=1.

with model:
    trace = pm.sample(1000, chains=4, cores=1)

phildias · July 23, 2018, 8:21pm

Yeah, @bwengals, it really does seem to be a problem with the multiprocess sampling. When I run the snippet you suggested, things work out fine: all four chains get sampled sequentially and no problems are encountered.

I hope the multiprocess issue gets solved in the next version!

Thanks =)

Topic		Replies	Views
Multivariant linear regression with uncertainty (error propagation) Questions modeling	2	1528	September 7, 2022
Multiple Linear Regression Questions	2	664	April 26, 2019
Multiple priors in regression model Questions	2	943	January 22, 2021
Setting up a multiple input multiple output regularized Bayesian regression Questions	5	696	February 17, 2022
Multiple output regression with cross-dependent variables Questions	0	496	October 5, 2020

Multiple regression caveats

Related topics