I am trying to create and fit a factor model. Here is the idea (reproducible code below):
Let’s assume we have a dataset with samples for 9 different individuals. Each individual is a combination of two characteristics, type and size, where we have three types (A, B, C) and three sizes (0, 1, 2). For each individual we have 100 samples drawn from a Normal distribution with mean mu and standard deviation sigma.
- sigma is specific to each individual.
- the mean of each individual is the sum of its type mean and its size mean, plus a small individual deviation.
Here is the generation of the synthetic data:
import numpy as np
import pymc3 as pm
import pandas as pd
import string
import arviz as az
N_types = 3
N_sizes = 3
N = N_types * N_sizes
N_samples = 100
# Define structure
types = np.repeat(np.arange(N_types, dtype=int), N_sizes)
sizes = np.tile(np.arange(N_sizes), N_types)
type_labels = np.array(list(string.ascii_uppercase[:N_types]))
labels = [f"{type_labels[types[i]]}{sizes[i]}" for i in range(N)]
# Type-level parameters
intercept_types = np.array([0.5, 5., 8.])
intercept_sizes = np.array([2., 4., 3.])
# Individual deviations
intercept_individual = np.random.normal(size=N) * 0.01
sigma = np.random.exponential(size=N)
# Compute individual parameters
intercept = intercept_types[types] + intercept_sizes[sizes] + intercept_individual
# Create data
mu = intercept
data = np.random.normal(loc=mu[None], scale=sigma[None], size=(N_samples, N))
y = pd.DataFrame(data, columns=labels)
# DataFrame to define structure
structure = pd.DataFrame({'type': type_labels[types], 'size': sizes}, index=labels, dtype="category")
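As a quick sanity check (rebuilt here so the snippet is self-contained, matching the construction above), the categorical codes in structure recover the integer indices of each factor level, which the model below uses for indexing:

```python
import pandas as pd

# Rebuild the structure DataFrame from above so this snippet stands alone
labels = ["A0", "A1", "A2", "B0", "B1", "B2", "C0", "C1", "C2"]
structure = pd.DataFrame(
    {"type": [lab[0] for lab in labels], "size": [int(lab[1]) for lab in labels]},
    index=labels,
    dtype="category",
)

# Categories are sorted, so the codes run 0..2 for both factors
print(structure["type"].cat.codes.values)  # [0 0 0 1 1 1 2 2 2]
print(structure["size"].cat.codes.values)  # [0 1 2 0 1 2 0 1 2]
```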
Thus, the samples reflect the structure of the dataset in type and size and I’d like to represent this in a factor model, where we fit parameters for the levels of each characteristic.
Here is my model definition:
# Define factor model
with pm.Model() as model:
    y_data = pm.Data('y', y)
    # Parameters at factor level
    intercept_size = pm.Normal('intercept_size', 0.0, 1.0, shape=len(structure['size'].cat.categories))
    intercept_type = pm.Normal('intercept_type', 0.0, 1.0, shape=len(structure['type'].cat.categories))
    # Integer codes mapping each individual to its factor levels
    j_size = structure['size'].cat.codes.values
    j_type = structure['type'].cat.codes.values
    intercept = pm.Deterministic('intercept', intercept_type[j_type] + intercept_size[j_size])
    # Individual-level noise
    sigma = pm.Exponential('sigma', 1.0, shape=y.shape[1])
    pm.Normal('obs', intercept, sigma, observed=y_data)
When I fit the model to the data:
with model:
    trace = pm.sample(target_accept=0.99, return_inferencedata=True)
I obtained highly correlated samples for the parameters intercept_size and intercept_type (the correlation is essentially perfect, about -0.99). At first this might not seem surprising, because there is some ambiguity: adding a constant to all type intercepts and subtracting it from all size intercepts leaves every individual mean unchanged. On the other hand, we have 9 individuals and 6 free parameters (3 for size and 3 for type), so from a linear-algebra perspective I would expect it to be possible to identify the parameters unambiguously.
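To make that linear-algebra argument concrete, here is a minimal sketch of the design matrix I have in mind (one indicator column per type level and per size level), whose rank can be checked directly:

```python
import numpy as np

N_types, N_sizes = 3, 3
N = N_types * N_sizes
types = np.repeat(np.arange(N_types), N_sizes)  # [0 0 0 1 1 1 2 2 2]
sizes = np.tile(np.arange(N_sizes), N_types)    # [0 1 2 0 1 2 0 1 2]

# 9 x 6 design matrix: 3 type indicator columns followed by 3 size indicator columns
X = np.zeros((N, N_types + N_sizes))
X[np.arange(N), types] = 1.0
X[np.arange(N), N_types + sizes] = 1.0

print(np.linalg.matrix_rank(X))
```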
My two questions are thus:
- Is there some mistake in the model definition?
- Is there a conceptual flaw in the model, or in my assumption that the parameters should be identifiable given 9 individuals and only 6 parameters?
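For reference, the correlation figure I quote was computed from the paired posterior draws. Here is a self-contained illustration of that computation using synthetic stand-in draws (the names and values below are made up for illustration, not taken from the real trace):

```python
import numpy as np

# Synthetic stand-ins for two sets of posterior draws that share a common
# unidentified offset; the real draws came from trace.posterior.
rng = np.random.default_rng(0)
offset = rng.normal(size=4000)
intercept_type_draws = 0.5 + offset + 0.05 * rng.normal(size=4000)
intercept_size_draws = 2.0 - offset + 0.05 * rng.normal(size=4000)

# Pairwise correlation of the draws, near -1 as in the actual trace
print(np.corrcoef(intercept_type_draws, intercept_size_draws)[0, 1])
```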