I am trying to perform an analysis on a multivariate data set that I will describe below.
My goal is to account for the covariance structure within some of the parameters (“predictors” which influence the observables, e.g. age, temperature, etc.; maybe sometimes referred to as “covariates”) which I acquire along with my measurements of interest (“observables”). The predictor estimates shall then serve as the slopes in a subsequent linear model to infer the observables. I am not fully certain whether that makes sense and if so, how to construct the model.
Your advice to give me a start for this is much appreciated!
I figured that there is different strategies of decomposing the covariance matrix. In the following examples, I found two different groups of model structures in pymc3, using either
(2) Notebook 12 from video session 6 of the recent presentation by @junpenglao
(3) an older blog post by Austin Rochford
(4) this and this stackoverflow question are related, especially the latter one, but it does not use the values for the predictors to estimate their correlation (see below).
In all of these examples, my trouble starts when attempting to use the observed values of the predictors for modelling of the observables.
A hierarchical GLM with independent predictors would be no problem for me. However, I assume that covariance of the predictors is relevant (but please correct me), based on the notion that the sampling procedure will depend on the coupling of variables (see, for example, here this blog entry by @twiecki ).
The data structure is as follows:
- there are multiple observables (multivariate data), but for simplicity let’s start trying to model only one outcome variable
- there are
n_predictor(e.g. 3) predictors that are correlated and potentially influence the observables
- there are groups of measurements
For simplicity, let’s ignore hierarchy and reduce to a single observable.
Here is toy data:
import numpy as np import pandas as pd import matplotlib.pyplot as plt import pymc3 as pm import theano.tensor as tt np.random.seed(42) pd.set_option('precision', 2) #_______________________________________________________________________________ ### generate data observations_per_group = 100 # we assessed values of three predictors # which are correlated predictor_names = ['pred1','pred2','pred3'] mu_actual = np.array([1, -2, 0.5]) n_predictors = len(mu_actual) covmatrix_actual = np.array([ \ [ 1.0, -0.3, 0.0] \ , [-0.3, 1.0, 0.6] \ , [ 0.0, 0.6, 1.0] \ ]) # data was measured from certain groups, giving it hierarchical structure. n_groups = 2**4 intercepts = np.random.normal(1.,0.4,n_groups) # print (intercepts) # we obtained two measures of interest. n_measures = 2 data =  for grp_idx in range(n_groups): x = np.random.multivariate_normal(mu_actual, covmatrix_actual, size = observations_per_group) x = pd.DataFrame( x \ , columns = predictor_names \ , index = grp_idx*observations_per_group + np.arange(observations_per_group) \ ) x['group'] = 'g%02.0f' % (grp_idx+1) for meas in range(n_measures): x['measure%i' % (meas+1)] = np.random.normal(intercepts[grp_idx], 0.1, observations_per_group) \ + meas * np.random.normal(0, 0.1, 1) # slight difference of measures data.append(x.loc[:, ['group'] \ + predictor_names \ + ['measure%i' % (meas+1) for meas in range(n_measures)] \ ]) data = pd.concat(data, axis = 0) data['group'] = pd.Categorical(data['group'], ordered = False) groups = list(data['group'].cat.categories) data['group_index'] = data['group'].cat.codes.values print (data) n_observations = data.shape
Here is my best attempt on the model, following (1):
#_______________________________________________________________________________ ### model definition with pm.Model() as model: ### following https://docs.pymc.io/notebooks/LKJ.html packed_chol = pm.LKJCholeskyCov('packed_chol', n=n_predictors, eta=1., sd_dist=pm.HalfCauchy.dist(2.5)) chol = pm.expand_packed_triangular(n_predictors, packed_chol) mu = pm.Normal('mu', 0., 10., shape = n_predictors \ , testval=data.loc[:, predictor_names].mean(axis=0) \ ) predictors = pm.MvNormal('predictors', mu, chol=chol \ , observed=data.loc[:, predictor_names].values \ ) ### glm model assembly intercept = pm.Normal('intercept', mu = 0, sd = 10) residual = pm.HalfCauchy('residual', 5) estimator = intercept for pred_idx, pred_name in enumerate(predictor_names): estimator = estimator \ + mu[pred_idx] \ * data[pred_name].values ### likelihood likelihood = pm.Normal( 'likelihood' \ , mu = estimator \ , sd = residual \ , observed = data['measure1'].values \ ) #_______________________________________________________________________________ ### sample and inspect with model: trace = pm.sample(1000, tune = 1000, cores = 4, random_seed = 42) pm.traceplot(trace) plt.show()
The trace plot shows
mu's of approximately
[-0.18, -0.06, 0.15] (actual was:
[1, -2, 0.5]), whereas when commenting out the sections of “glm model assembly” and “likelihood”, the inference is correct.
The problem is that observed data is entered twice (in
predictors = ... for the model to estimate their correlation, and in
likelihood = ... to fit the observables), and both influence the mean of the predictor values. Conceptually, I guess that the posterior distribution of the predictor should instead only be influenced by the measurements of the predictors themselves (independent of the observables).
I assume that the distinction between predictors and observables meaningful, because I need an intercept, and I could think of having multiple, independent blocks within the predictors that separately enter the GLM.
There were slightly other issues in other implementations, but all failed.
I am unsure whether to use
LKJCholeskyCov (seems to be the more recent implementation) or
LKJCorr (apparently more general) to generate priors for the covariance matrix, but both seem to be interchangeable to a certain degree (see alternative model below). Also, it could be that it is unnecessary to use
MvNormal, as it was sometimes done in example (2).
So my rather general questions:
- Does it make sense to model the predictors covariance as I intend to?
- Then, which parametrization and which prior specification are useful here?
- (How) Is it possible to decouple inference for predictors and observables?
Later, I will attempt to extend to hierarchical structure and multivariate observables (which I’ve done before, so that should be easier), but first things first.
Finally, if someone is interested, here is an alternative model definition, using
LKJCorr, following (4):
with pm.Model() as model: mu = pm.Normal('mu', mu=0., sd=10, shape=n_predictors) sigma = pm.HalfCauchy('sigma', 5, shape=n_predictors) C_triu = pm.LKJCorr('C_triu', n=n_predictors, eta = 1.) C = tt.fill_diagonal(C_triu[np.zeros((n_predictors,n_predictors), 'int')], 1.) sigma_diag = tt.nlinalg.diag(sigma) cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag) # tau = tt.nlinalg.matrix_inverse(cov) predictors = pm.MvNormal('predictors', mu=mu, cov=cov, observed = data.loc[:, predictor_names].values ) ### glm model assembly intercept_population = pm.Normal('intercept_population', mu = 0, sd = 10) residual = pm.HalfCauchy('residual', 5) estimator = intercept_population for pred_idx, pred_name in enumerate(predictor_names): estimator = estimator \ + mu[pred_idx] \ * data[pred_name].values ### likelihood likelihood = pm.Normal( 'likelihood' \ , mu = estimator \ , sd = residual \ , observed = data['measure1'].values \ )