Multiple regression caveats

Hi there!

I’m trying to get a multiple regression off the ground, but I’m running into some problems.

The general structure of the model is like so:

y_{i,1} = \mathbf{x}_{i,1} \cdot \mathbf{\beta}_{1}' + \epsilon_{i,1}
y_{i,2} = \mathbf{x}_{i,2} \cdot \mathbf{\beta}_{2}' + \epsilon_{i,2}

dim(\mathbf{x}_{i,1})=dim(\mathbf{\beta}_{i,1})=(k_1,1)
dim(\mathbf{x}_{i,2})=dim(\mathbf{\beta}_{i,2})=(k_2,1)
dim(y_{i,1})=dim(y_{i,2})=(1,1)

\bf{Likelihood}
\mathbf{\epsilon}_i = (\epsilon_{i,1},\epsilon_{i,2}) \stackrel{iid}{\sim} MVN(mean=\mathbf{0}_2,cov=\mathbf{\Sigma_\epsilon})

\bf{Priors}
(\beta_{1,1},...,\beta_{{k_1},1}) \stackrel{iid}{\sim} Normal(mean=0,stdev=10)
(\beta_{1,2},...,\beta_{{k_2},2}) \stackrel{iid}{\sim} Normal(mean=0,stdev=10)
\mathbf{sd}=(sd_1,sd_2) \stackrel{iid}{\sim} HalfCauchy(beta=2.5)
Chol(\mathbf{\Sigma}_\epsilon) \sim LKJCholeskyCov(eta=2, n=2, sd=\mathbf{sd})

In the example I’m giving above, I have 2 regression lines, \mathbf{y}_1 which is explained by variables \mathbf{x}_1 and \mathbf{y}_2 which is explained by variables \mathbf{x}_2. Notice that this form is general enough that \mathbf{x}_1 and \mathbf{x}_2 can be constituted by completely different columns.

What is making my life a bit difficult is translating this into PyMC3. Let me show you my example using some fake data

import pymc3 as pm
import numpy as np
import pandas as pd
import theano
import theano.tensor as tt

# Reading in dataset
dataset = pd.read_csv("fake_multiple_multivariate_regression_data_covmtx.csv")

num_rows = len(dataset)

# Naming the dependent categorical variables
dep_vars = ["y1_cont","y2_cont"]

num_dep_vars = len(dep_vars)

# Exogenous variables impacting each depvar. 
indep_vars = [["uno","x1","x2","x3","x4"],
              ["uno","x3","x4","x5","x6","x7"]]

indep_vars_y1 = len(indep_vars[0])
indep_vars_y2 = len(indep_vars[1])

x_data_y1 = dataset[indep_vars[0]].values
x_data_y2 = dataset[indep_vars[1]].values

y_data = dataset[dep_vars].values

with pm.Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
    # Priors for betas
    betas_y1 = pm.Normal('Betas_Y1', 0, sd=20, shape=indep_vars_y1)
    betas_y2 = pm.Normal('Betas_Y2', 0, sd=20, shape=indep_vars_y2)

    # Priors for error covariance matrix
    sd_dist = pm.HalfCauchy.dist(beta=2.5, shape=num_dep_vars)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=2, n=num_dep_vars, sd_dist=sd_dist)
    chol = pm.expand_packed_triangular(num_dep_vars, packed_chol, lower=True)
    
    y1_hat = tt.dot(x_data_y1, betas_y1)
    y2_hat = tt.dot(x_data_y2, betas_y2)

    # Here is my main problem!!!
    mu = pm.Deterministic('mu', tt.stack([y1_hat,y2_hat])) 
    
    # Define a new MvNormal with the given covariance
    vals = pm.MvNormal('vals', mu=mu, chol=chol, observed=y_data)

In the example above, I’m having a hard time re-packing y1_hat and y2_hat in a single array-type element so I can use the MvNormal.

Does anyone have any insight on how to do so?

Thank you!
.

All you need is a small fix:

mu = pm.Deterministic('mu', tt.stack([y1_hat, y2_hat], axis=1)

tt.stack by default uses axis=0, which takes your two arrays horizontally, so the output shape is (2, 5000) instead of (5000,2).

1 Like

Thanks, that worked like a charm =)

For some reason, though, another issue popped up: when I try to sample from this model, pymc3 just hangs on me.

So when I call this:

with model:
    trace = pm.sample()

Python then hangs on this screen after calculating the first quarter of the job:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [chol_cov, Betas_Y2, Betas_Y1]
Sampling 4 chains:  25%|██▌       | 1000/4000 [03:00<01:02, 47.98draws/s] 

Any thoughts?

Thanks again for the help!!!

Try using a more informative prior with lighter tail might help:

sd_dist = pm.HalfNormal.dist(sd=1, shape=num_dep_vars)

Thanks for the tip, @junpenglao!

Unfortunately, though, it still hangs… =/

Might be a problem with the multiprocess sampling. This is likely fixed in the newest version of PyMC3. Before you update, try setting the cores=1.

with model:
    trace = pm.sample(1000, chains=4, cores=1)
1 Like

Yeah, @bwengals, it really does seem to be a problem with the multiprocess sampling. When I run the snippet you suggested, things work out fine: all four chains get sampled sequentially and no problems are encountered.

I hope the multiprocess issue gets solved in the next version!

Thanks =)