Correlated samples in factor/cross-classified model

I am trying to create and fit a factor model. Here is the idea (reproducing code below):
Let’s assume we have a dataset with samples for 9 different instances. Each instance is a combination of two characteristics, type and size, where we have three types (A, B, C) and three sizes (0, 1, 2). For each instance we have 100 samples, generated from a Normal distribution with mean mu and standard deviation sigma.

  • sigma is specific to each individual.
  • the mean of each individual is the sum of its type mean and its size mean, plus a small individual deviation.

Here is the generation of the synthetic data:

import numpy as np
import pymc3 as pm
import pandas as pd
import string
import arviz as az

N_types = 3
N_sizes = 3
N = N_types * N_sizes
N_samples = 100
# Define structure
types = np.repeat(np.arange(N_types, dtype=int), N_sizes)
sizes = np.array(list(range(N_sizes)) * N_types)
type_labels = np.array(list(string.ascii_uppercase[:N_types]))
labels = [f"{type_labels[types[i]]}{sizes[i]}" for i in range(N)]
# Type- and size-level means
intercept_types = np.array([0.5, 5., 8.])
intercept_sizes = np.array([2., 4., 3.])
# Individual deviation of the mean and per-individual noise scale
intercept_individual = np.random.normal(size=N) * 0.01
sigma = np.random.exponential(size=N)
# Compute individual parameters
intercept = intercept_types[types] + intercept_sizes[sizes] + intercept_individual

# Create data
mu = intercept
data = np.random.normal(loc=mu[None], scale=sigma[None], size=(N_samples, N))
y = pd.DataFrame(data, columns=labels)

# DataFrame to define structure
structure = pd.DataFrame({'type': type_labels[types], 'size': sizes}, index=labels, dtype="category")

Thus, the samples reflect the type/size structure of the dataset, and I’d like to represent this in a factor model where we fit parameters for the levels of each characteristic.

Here is my model definition:

# Define factor model
with pm.Model() as model:
    y_data = pm.Data('y', y)

    # Parameters at factor level
    intercept_size = pm.Normal('intercept_size', 0.0, 1.0, shape=len(structure['size'].cat.categories))
    intercept_type = pm.Normal('intercept_type', 0.0, 1.0, shape=len(structure['type'].cat.categories))

    # Map each individual onto its type and size level and add the level effects
    j_size = structure['size'].cat.codes.values
    j_type = structure['type'].cat.codes.values
    intercept = pm.Deterministic('intercept', intercept_type[j_type] + intercept_size[j_size])
    sigma = pm.Exponential('sigma', 1.0, shape=y.shape[1])

    pm.Normal('obs', intercept, sigma, observed=y_data)

When I fit the model to the data:

with model:
    trace = pm.sample(target_accept=0.99, return_inferencedata=True)

I obtained highly correlated samples for the parameters intercept_size and intercept_type (the correlation is essentially perfect, ~ -0.99). At first this might not be surprising, because there is some ambiguity. On the other hand, we have 9 individuals and 6 free parameters (3 for size and 3 for type), so from a linear-algebra perspective it should be possible to identify the parameters unambiguously.
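
For illustration, here is one way to look at it (not the only way; any pairwise scatter of the posterior samples shows the same thing):

# Pairwise scatter of the factor-level intercepts; the near-perfect
# negative correlations show up as thin diagonal ridges
az.plot_pair(trace, var_names=['intercept_type', 'intercept_size'])

# e.g. correlation between intercept_type[0] and intercept_size[0]
np.corrcoef(
    trace.posterior['intercept_type'].values[..., 0].ravel(),
    trace.posterior['intercept_size'].values[..., 0].ravel(),
)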

My two questions are thus:

  • Is there some mistake in the model definition?
  • Is there a conceptual flaw in the model or in my thought that it should be possible to identify the parameters with the 9 individuals and 6 parameters?

If you think about it, it makes perfect sense that the two intercepts are perfectly correlated. Your model is fitting 9 independent normal distributions with $\mu_i = \mu_{t_i} + \mu_{s_i}$ and $\sigma = \sigma_i$. Clearly, for any $\mu_{t_i}$ you can take $\mu_{s_i} = \mu_i - \mu_{t_i}$, and you end up with those perfect lines you see in the trace plots. I computed lines of best fit between the traces for combinations of $\mu_t$ and each $\mu_s$, and found that indeed the slopes are all -1 and each line’s intercept recovers the corresponding individual sample mean exactly, e.g. intercept_types[0] + intercept_sizes[0] = 2.5. That is, each of those correlated relationships you see in the pair plots is showing you all the infinite ways to get what you asked for. The model parameters are definitely not identified.
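
Roughly, that check looks like this (a sketch using your trace from above, not exactly what I ran):

# Fit a line through the joint posterior samples of intercept_type[0] and
# intercept_size[0]: the slope should be ~ -1 and the intercept should
# recover the sample mean of individual A0 (~2.5)
t0 = trace.posterior['intercept_type'].values[..., 0].ravel()
s0 = trace.posterior['intercept_size'].values[..., 0].ravel()
slope, line_intercept = np.polyfit(t0, s0, deg=1)
print(slope, line_intercept, y['A0'].mean())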

You could solve this problem by attacking it from a different angle. Using dummy variables and slope parameters, for instance, will identify the effects you want very easily. I am also curious why you are estimating one standard error per individual. This suggests they are all from totally separate populations, so you need to fit 9 independent distributions, with no mechanism for information sharing between them. That doesn’t quite seem right.

Thank you for your reply! I think I am missing something obvious but let me reply with some questions:

If you think about it, it makes perfect sense that the two intercepts are perfectly correlated.

I don’t understand that, because I have 9 individuals, each composed of a type and a size, i.e. I can write this down as a system of linear equations:

mu_A0 = mu_A + mu_0
mu_A1 = mu_A + mu_1

(How do you make those nice equations?)
Then I have 6 unknown quantities on the right, but 9 equations so there shouldn’t be any ambiguity, should there? Sorry, I think I am missing something obvious here, but I can’t quite see it.

Using dummy variables and slope parameters, for instance, will identify the effects you want very easily.

What exactly do you mean here? Do you mean writing this e.g. like:
mu_i = mu + \alpha * type + \beta * size
Would this work if the types and sizes are not ordered categories?

I am also curious why you are estimating one standard error per individual. This suggests they are all from totally separate populations, so you need to fit 9 independent distributions, with no mechanism for information sharing between them. That doesn’t quite seem right.

By construction of my synthetic data, all 9 individuals have different standard errors, right? That is why I fit individual parameters here. Perhaps I misunderstood your statement.

Thanks for your reply and help.

You make nice equations by putting dollar signs around normal LaTeX commands.

Let’s cast your system into a linear system, with 2 sizes and 2 types so I can type less:

$$\begin{bmatrix} \mu_{A,0} \\ \mu_{A,1} \\ \mu_{B,0} \\ \mu_{B,1} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_A \\ \mu_B \\ \mu_0 \\ \mu_1 \end{bmatrix}$$

What you care about is the rank of the design matrix, which np.linalg.matrix_rank tells me is 3. So indeed the system is not identified.
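
In case it’s useful, that check is just this (a minimal sketch of the 2×2 example above):

# Design matrix for the 2-type / 2-size example: columns are
# [mu_A, mu_B, mu_0, mu_1], rows are individuals A0, A1, B0, B1
D = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]])
np.linalg.matrix_rank(D)  # 3, not 4: one direction is not identified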

If you cast the problem into a dummy-variable setup, you can see the collinearity a bit better. The model is going to look like this:

$$y_i \sim N(\mu_i, \sigma)$$
$$\sigma \sim \text{Exponential}(\cdot)$$
$$\mu_i = \alpha + \sum_{t \in \text{Type}} \beta_t [t = \text{Type}_i] + \sum_{s \in \text{Size}} \beta_s [s = \text{Size}_i]$$

So we get 6 indicator variables, one per type and one per size, each equal to 0 or 1. Maybe it’s easier to see in code (y is from your code above):

X = pd.get_dummies(y.melt()
                    .assign(constant=1)
                    .assign(obs_type=lambda x: x.variable.str[0])
                    .assign(obs_size=lambda x: x.variable.str[1])
                    .drop(columns=['variable', 'value'])
                   )

Here’s some random rows from this matrix:

     constant  obs_type_A  obs_type_B  obs_type_C  obs_size_0  obs_size_1  obs_size_2
126         1           1           0           0           0           1           0
206         1           1           0           0           0           0           1
410         1           0           1           0           0           1           0
284         1           1           0           0           0           0           1
719         1           0           0           1           0           1           0

This matrix is perfectly collinear, because knowing that obs_type_A is 0 and obs_type_B is 0 tells you what obs_type_C is (numpy tells me the rank is 5, which is what we expect).
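
That rank check is just (X being the dummy matrix built above):

np.linalg.matrix_rank(X.to_numpy(dtype=float))  # 5 = constant + 2 free type columns + 2 free size columns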

So you could do estimation with this setup:

import theano.tensor as tt

coords = {'betas': ['constant', 'offset_B', 'offset_C', 'offset_1', 'offset_2']}

with pm.Model(coords=coords) as model:
    betas = pm.Normal('parameters', mu=0.0, sigma=1.0, dims=['betas'])

    # Drop one reference level per factor (type A, size 0) so the design matrix has full rank
    mu = tt.dot(X.drop(columns=['obs_type_A', 'obs_size_0']), betas)
    sigma = pm.Exponential('sigma', 1.0)

    obs = pm.Normal('obs', mu, sigma, observed=y.melt()['value'])
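
Sampling and summarizing is then the usual thing, roughly:

with model:
    trace = pm.sample(return_inferencedata=True)
az.summary(trace)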

Here’s the results:

                       mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
parameters[constant]  2.793  0.092   2.619    2.957      0.002    0.001    2196.0    2137.0    1.0
parameters[offset_B]  3.614  0.100   3.411    3.792      0.002    0.001    2927.0    2715.0    1.0
parameters[offset_C]  7.098  0.100   6.910    7.276      0.002    0.001    3070.0    2969.0    1.0
parameters[offset_1]  1.957  0.099   1.774    2.145      0.002    0.001    2756.0    2797.0    1.0
parameters[offset_2]  2.431  0.099   2.248    2.619      0.002    0.001    2929.0    2811.0    1.0
sigma                 1.240  0.029   1.184    1.295      0.000    0.000    3767.0    2600.0    1.0

It’s close but not dead on. Bear in mind that the offsets tell you how much larger the effect of being type B is relative to being type A. The effects of the reference levels, type A and size 0, are absorbed into the constant. I don’t want to spend too much time refining it because we’re just working on simulated data.
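
To make that concrete, here is a sketch of how the cell means are recovered in this parameterization (using the coords defined above; not the only way to do it):

# The constant is the reference cell (type A, size 0); any other cell mean
# is the constant plus the relevant offsets, e.g. type B, size 1
betas_mean = trace.posterior['parameters'].mean(dim=('chain', 'draw'))
mu_A0 = betas_mean.sel(betas='constant')
mu_B1 = (betas_mean.sel(betas='constant')
         + betas_mean.sel(betas='offset_B')
         + betas_mean.sel(betas='offset_1'))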

As for the sigmas: I don’t know what objects you are studying, but it might be beneficial to think through the data-generating process you have in mind. Consider mice. We might have three species of mice (types) and three diet treatments (sizes), and we measure the weight. So mice of a given species or on a certain diet come from their own distributions, described by the $\mu$ parameters. But after we control for these effects, all mice also have their own individual variation around the group means. This is the sigma on the likelihood function. Does it make sense that each diet/species combination of mouse has idiosyncratic variation that is different from that of any other? Maybe, or maybe not. Lurking around here, looking at other people’s models, and running my own, I’ve found the answer is usually not. If you have reason to believe there is group variation left at the end, you should think of a way to explicitly model it at the group level, before you get to the individual level.

Awesome, thank you so much, this is incredibly helpful. Also, I am slightly embarrassed about the linear algebra fallacy, thanks for explaining.

Yeah glad I could help a bit.

There may be a way to avoid the dummy variable approach and do a more direct decomposition on real data using a hierarchical model. If I understand it correctly, the partial pooling of parameters can help achieve identification in situations like these. It doesn’t work on your simulated data, but it would be worth a try on your actual dataset.

I want to ask a follow-up question about your last remark:

There may be a way to avoid the dummy variable approach and do a more direct decomposition on real data using a hierarchical model. If I understand it correctly, the partial pooling of parameters can help achieve identification in situations like these.

I’d like to understand better what you are suggesting. First, in the model you’re suggesting, we’re already doing partial pooling, aren’t we? For instance, parameters[offset_1] pools across all units that are of size 1 (A1, B1, and C1), and parameters[offset_B] pools across all units of type B (B0, B1, B2).

Regarding a hierarchical model: Do you mean that in my original model, we would set up a hyperprior for those intercept_types and intercept_sizes, much like this:

# Define factor model
with pm.Model() as model:
    y_data = pm.Data('y', y)

    # Hyperpriors for the size- and type-level intercepts
    intercept_size_mean = pm.Normal('intercept_size_mean', 0.0, 1.0)
    intercept_type_mean = pm.Normal('intercept_type_mean', 0.0, 1.0)
    intercept_size_scale = pm.Exponential('intercept_size_scale', 1.0)
    intercept_type_scale = pm.Exponential('intercept_type_scale', 1.0)

    intercept_size = pm.Normal('intercept_size', intercept_size_mean, intercept_size_scale,
                               shape=len(structure['size'].cat.categories))
    intercept_type = pm.Normal('intercept_type', intercept_type_mean, intercept_type_scale,
                               shape=len(structure['type'].cat.categories))
    
    j_size = structure['size'].cat.codes.values
    j_type = structure['type'].cat.codes.values
    intercept = pm.Deterministic('intercept', intercept_type[j_type] + intercept_size[j_size])
    sigma = pm.Exponential('sigma', 1.0, shape=y.shape[1])

    pm.Normal('obs', intercept, sigma, observed=y_data)

By the way, I ran your dummy-variable model and observed that the correlation between the parameter samples is lower but not negligible, around -0.55:

# Flatten (chain, draw) into one axis and look at the correlation of the
# first parameter (the constant) with all five
intercepts_fitted = np.array(trace.posterior['parameters']).reshape(-1, 5)
np.corrcoef(intercepts_fitted.T)[0]

yielding

array([ 1.        , -0.54289372, -0.57471233, -0.5392847 , -0.55153031])

Wouldn’t that be regarded as problematic?

Correlation between the parameters isn’t bad per se; you really just might have correlated parameters. You have to decide if it makes sense in your context. In a model of stock returns, for example, if you fit a simple linear trend to excess returns, $(r_{i,t} - r^f_t) = \alpha_i + \beta_i t$, you would expect $\alpha_i$ and $\beta_i$ to be negatively correlated: stocks that start the sample period with abnormally high or low returns (captured by $\alpha$) would be expected to revert to the sample mean over time ($\beta < 0$ when $\alpha > 0$, and vice versa).

Correlation between parameters is a problem from a sampling perspective because most MCMC samplers (Metropolis-Hastings, Slice, Gibbs) don’t make “diagonal” steps; they update one dimension at a time and take axis-aligned steps. So if you have highly correlated parameters, the posterior distribution you want to sample is very skinny in all the “basis” directions, so most proposals get rejected and you never make your way up the long diagonal. NUTS exists exactly to solve this problem, so if you don’t have divergences there isn’t a problem, per se.
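
For example, with return_inferencedata=True you can count divergent transitions directly from the trace:

# Number of divergences across all chains; zero (or near zero) means NUTS
# handled the correlated posterior fine
int(trace.sample_stats['diverging'].sum())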

Identification, however, is a separate issue you have to work through theoretically. There’s no statistical test to know if your model accurately isolates the phenomenon you are interested in.

Partial pooling occurs when the group parameters are drawn from a common underlying distribution, so the different groups can share information. I found this introduction to hierarchical models video helpful in explaining the mechanism. The key point is that when the parameters all come from different distributions, as in your first model and in my dummy-variable model, there is no mechanism for the estimates of one group to enter the estimates of another group, so there is no “pooling”.

Your original model could be said to have a form of incomplete pooling, because each individual is a combination of a size and a type, so variance from type A can enter into the estimates of size 1, but ONLY via individuals that are combination A1. In a hierarchical model, all variance from all individuals is propagated up through the model and helps to inform everything. I found this video lecture to be an excellent introduction to the theory of hierarchical models, their differences from classical panel models (the dummy-variable “fixed effect” model is also discussed), and how they work. There is also a nice discussion of pooling and partial pooling in the Radon case study notebook, which has the benefit of using PyMC in the examples.

Thanks a lot for the explanation. In other words: if samples of parameters are correlated, that can actually tell us something about the model and the data, rather than necessarily being a problem per se.

Thanks for the link to the video lecture as well, very helpful.

Just wanted to report here that I managed to identify the parameters, at least approximately and with uncertainty, with the following model, which uses partial pooling for the intercepts (via a non-centered parameterization):

# Define factor model
with pm.Model() as model:
    y_data = pm.Data('y', y)

    # Hyperpriors for the size- and type-level intercepts
    intercept_size_mean = pm.Normal('intercept_size_mean', 0.0, 1.0)
    intercept_type_mean = pm.Normal('intercept_type_mean', 0.0, 1.0)
    intercept_size_scale = pm.Exponential('intercept_size_scale', 1.0)
    intercept_type_scale = pm.Exponential('intercept_type_scale', 1.0)

    # Non-centered parameterization: sample standardized offsets and
    # shift/scale them by the group-level mean and scale
    intercept_size_unscaled = pm.Normal('intercept_size_unscaled', 0, 1, shape=len(structure['size'].cat.categories))
    intercept_size = pm.Deterministic('intercept_size', intercept_size_unscaled * intercept_size_scale + intercept_size_mean)
    intercept_type_unscaled = pm.Normal('intercept_type_unscaled', 0, 1, shape=len(structure['type'].cat.categories))
    intercept_type = pm.Deterministic('intercept_type', intercept_type_unscaled * intercept_type_scale + intercept_type_mean)
    
    j_size = structure['size'].cat.codes.values
    j_type = structure['type'].cat.codes.values
    intercept = pm.Deterministic('intercept', intercept_type[j_type] + intercept_size[j_size])
    sigma = pm.Exponential('sigma', 1.0, shape=y.shape[1])

    pm.Normal('obs', intercept, sigma, observed=y_data)

Using this model, I could get some satisfactory results:

[visualization: fitted intercept parameters compared to their true values]

(red dots: true value, blue dots: mean of posterior distribution, vertical lines indicate 3-97 percentile intervals).
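
For reference, a sketch of one way to build such a comparison (assuming the model above was sampled into trace; not necessarily the exact code behind the figure):

import matplotlib.pyplot as plt

# Posterior summaries of the factor-level intercepts vs. the true values
# used to simulate the data
summary = pd.concat([
    az.summary(trace, var_names=['intercept_type']),
    az.summary(trace, var_names=['intercept_size']),
])
truth = np.concatenate([intercept_types, intercept_sizes])
x = np.arange(len(truth))
plt.errorbar(x, summary['mean'],
             yerr=[summary['mean'] - summary['hdi_3%'],
                   summary['hdi_97%'] - summary['mean']],
             fmt='o', label='posterior mean')
plt.scatter(x, truth, color='red', zorder=3, label='true value')
plt.xticks(x, summary.index, rotation=45)
plt.legend()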

I think this kind of model is roughly what you meant above, right, @jessegrabowski?

Yep, that’s exactly what I meant by a partially pooled model. Looks like you got it figured out, congrats!
