Dear all,
I am trying to perform an analysis on a multivariate data set that I will describe below.
My goal is to account for the covariance structure within some of the parameters (“predictors” which influence the observables, e.g. age, temperature, etc.; maybe sometimes referred to as “covariates”) which I acquire along with my measurements of interest (“observables”). The predictor estimates shall then serve as the slopes in a subsequent linear model to infer the observables. I am not fully certain whether that makes sense and if so, how to construct the model.
Your advice to give me a start for this is much appreciated!
Cheers,
Falk
starting point
I figured that there is different strategies of decomposing the covariance matrix. In the following examples, I found two different groups of model structures in pymc3, using either LKJCorr
or LKJCholeskyCov
.
(1) https://docs.pymc.io/notebooks/LKJ.html
(2) Notebook 12 from video session 6 of the recent presentation by @junpenglao
(3) an older blog post by Austin Rochford
(4) this and this stackoverflow question are related, especially the latter one, but it does not use the values for the predictors to estimate their correlation (see below).
In all of these examples, my trouble starts when attempting to use the observed values of the predictors for modelling of the observables.
A hierarchical GLM with independent predictors would be no problem for me. However, I assume that covariance of the predictors is relevant (but please correct me), based on the notion that the sampling procedure will depend on the coupling of variables (see, for example, here this blog entry by @twiecki ).
data structure
The data structure is as follows:
- there are multiple observables (multivariate data), but for simplicity let’s start trying to model only one outcome variable
- there are
n_predictor
(e.g. 3) predictors that are correlated and potentially influence the observables - there are groups of measurements
For simplicity, let’s ignore hierarchy and reduce to a single observable.
Here is toy data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm
import theano.tensor as tt
np.random.seed(42)
pd.set_option('precision', 2)
#_______________________________________________________________________________
### generate data
observations_per_group = 100
# we assessed values of three predictors
# which are correlated
predictor_names = ['pred1','pred2','pred3']
mu_actual = np.array([1, -2, 0.5])
n_predictors = len(mu_actual)
covmatrix_actual = np.array([ \
[ 1.0, -0.3, 0.0] \
, [-0.3, 1.0, 0.6] \
, [ 0.0, 0.6, 1.0] \
])
# data was measured from certain groups, giving it hierarchical structure.
n_groups = 2**4
intercepts = np.random.normal(1.,0.4,n_groups)
# print (intercepts)
# we obtained two measures of interest.
n_measures = 2
data = []
for grp_idx in range(n_groups):
x = np.random.multivariate_normal(mu_actual, covmatrix_actual, size = observations_per_group)
x = pd.DataFrame( x \
, columns = predictor_names \
, index = grp_idx*observations_per_group + np.arange(observations_per_group) \
)
x['group'] = 'g%02.0f' % (grp_idx+1)
for meas in range(n_measures):
x['measure%i' % (meas+1)] = np.random.normal(intercepts[grp_idx], 0.1, observations_per_group) \
+ meas * np.random.normal(0, 0.1, 1) # slight difference of measures
data.append(x.loc[:, ['group'] \
+ predictor_names \
+ ['measure%i' % (meas+1) for meas in range(n_measures)] \
])
data = pd.concat(data, axis = 0)
data['group'] = pd.Categorical(data['group'], ordered = False)
groups = list(data['group'].cat.categories)
data['group_index'] = data['group'].cat.codes.values
print (data)
n_observations = data.shape[0]
example model
Here is my best attempt on the model, following (1):
#_______________________________________________________________________________
### model definition
with pm.Model() as model:
### following https://docs.pymc.io/notebooks/LKJ.html
packed_chol = pm.LKJCholeskyCov('packed_chol', n=n_predictors,
eta=1., sd_dist=pm.HalfCauchy.dist(2.5))
chol = pm.expand_packed_triangular(n_predictors, packed_chol)
mu = pm.Normal('mu', 0., 10., shape = n_predictors \
, testval=data.loc[:, predictor_names].mean(axis=0) \
)
predictors = pm.MvNormal('predictors', mu, chol=chol \
, observed=data.loc[:, predictor_names].values \
)
### glm model assembly
intercept = pm.Normal('intercept', mu = 0, sd = 10)
residual = pm.HalfCauchy('residual', 5)
estimator = intercept
for pred_idx, pred_name in enumerate(predictor_names):
estimator = estimator \
+ mu[pred_idx] \
* data[pred_name].values
### likelihood
likelihood = pm.Normal( 'likelihood' \
, mu = estimator \
, sd = residual \
, observed = data['measure1'].values \
)
#_______________________________________________________________________________
### sample and inspect
with model:
trace = pm.sample(1000, tune = 1000, cores = 4, random_seed = 42)
pm.traceplot(trace)
plt.show()
The trace plot shows mu
's of approximately [-0.18, -0.06, 0.15]
(actual was: [1, -2, 0.5]
), whereas when commenting out the sections of “glm model assembly” and “likelihood”, the inference is correct.
The problem is that observed data is entered twice (in predictors = ...
for the model to estimate their correlation, and in likelihood = ...
to fit the observables), and both influence the mean of the predictor values. Conceptually, I guess that the posterior distribution of the predictor should instead only be influenced by the measurements of the predictors themselves (independent of the observables).
I assume that the distinction between predictors and observables meaningful, because I need an intercept, and I could think of having multiple, independent blocks within the predictors that separately enter the GLM.
question
There were slightly other issues in other implementations, but all failed.
I am unsure whether to use LKJCholeskyCov
(seems to be the more recent implementation) or LKJCorr
(apparently more general) to generate priors for the covariance matrix, but both seem to be interchangeable to a certain degree (see alternative model below). Also, it could be that it is unnecessary to use MvNormal
, as it was sometimes done in example (2).
So my rather general questions:
- Does it make sense to model the predictors covariance as I intend to?
- Then, which parametrization and which prior specification are useful here?
- (How) Is it possible to decouple inference for predictors and observables?
Later, I will attempt to extend to hierarchical structure and multivariate observables (which I’ve done before, so that should be easier), but first things first.
Finally, if someone is interested, here is an alternative model definition, using LKJCorr
, following (4):
with pm.Model() as model:
mu = pm.Normal('mu', mu=0., sd=10, shape=n_predictors)
sigma = pm.HalfCauchy('sigma', 5, shape=n_predictors)
C_triu = pm.LKJCorr('C_triu', n=n_predictors, eta = 1.)
C = tt.fill_diagonal(C_triu[np.zeros((n_predictors,n_predictors), 'int')], 1.)
sigma_diag = tt.nlinalg.diag(sigma)
cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)
# tau = tt.nlinalg.matrix_inverse(cov)
predictors = pm.MvNormal('predictors', mu=mu, cov=cov, observed = data.loc[:, predictor_names].values )
### glm model assembly
intercept_population = pm.Normal('intercept_population', mu = 0, sd = 10)
residual = pm.HalfCauchy('residual', 5)
estimator = intercept_population
for pred_idx, pred_name in enumerate(predictor_names):
estimator = estimator \
+ mu[pred_idx] \
* data[pred_name].values
### likelihood
likelihood = pm.Normal( 'likelihood' \
, mu = estimator \
, sd = residual \
, observed = data['measure1'].values \
)