Observed RVs of Different Lengths in a Linear Model


First, thank you for the amazing work on PyMC3.

I’m trying to build a linear model where the independent variables are not the same length. Using the actual values or creating ObservedRVs gives me an Input dimension mis-match. As a workaround, a colleague suggested creating two parameters, one that is observed and one that isn’t that share hyper-parameters, then use the UnobservedRVs in the linear model. This works to some extent, as the trace for the UnobservedRV matches the sample_ppc for the ObservedRV, but I can’t get the hyper-parameters to converge and I get some pretty wonky traceplots.

Is this the correct way to go about building a linear model with different length independent variables? Is there another option I can try?

Here’s a snippet of how I’m sharing parameters between the RVs, I can provide the full code if that will help

with pm.Model() as model:
    # Priors for gene expression
    hyper_shape = len(genes) * len(datasets)
    hyper_sigma = pm.InverseGamma('hyper_sigma', 2.1, 1, shape=hyper_shape)
    hyper_mu = pm.Normal('hyper_mu', 0, 1, shape=hyper_shape)
    # RVs for Gene expression
    # X is observed and Y isn't, but they share priors
    X, Y = {}, {}
    i = 0
    for gene in genes:
        for name, dataset in datasets:
            key = f'{gene}-{name}'
            X[key] = pm.Normal(f'X-{key}', mu=hyper_mu[i], sd=hyper_sigma[i], observed=dataset[gene])
            Y[key] = pm.Normal(f'Y-{key}', mu=hyper_mu[i], sd=hyper_sigma[i])
    i += 1

I then use the UnobservedRV Y when building the linear model.

Here’s an example of a traceplot I get with just one independent variable.


Usually the recommended way is to unpack/reshape the observed into a 1d vector:
Say your observed is

y = [[1, 2, 4, 6], 
     [0, 1],
     [5] ...]

you turn it into a vector, with the index:

y_new = [1, 2, 4, 6, 0, 1, 5, ...]
y_idx = [0, 0, 0, 0, 1, 1, 2, ...]

And in the model, you do:

pm.Normal('observed', hyper_mu[y_idx], sd=hyper_sigma[y_idx], observed=y_new)

Hope this helps.


Ah, thank you for the reply Junpeng. I should’ve realized that’s the method used in the radon GLM hierarchical model in the PyMC3 docs for county codes.