Form of index for multi-dimensional shape with observed data

Hi there,

I can’t seem to find any examples of how to pass the index for observed multidimensional RVs. Here’s a toy example:

# 100 Genes, 2 Datasets
shape = (100, 2)
with pm.Model() as model:
    mu = pm.Normal('mu', 0, 10, shape=shape)
    sd = pm.Uniform('sd', 1, 10, shape=shape)
    y = pm.Normal('y', mu[index], sd[index], shape=shape, observed=obs)

My primary question is: what is the form of index? Based on examples when shape is scalar, index is something like [0, 0, 1, 2, 0, 1, 2, ....]. For multidimensional shape I assumed it would be [ (0, 0), (0, 5), (1, 2), ...] if obs is a 1d vector of values.

My current approach is to flatten shape to a (200,) vector (in this case) and then use reshape when I need to do a pm.math.dot downstream, but I wasn’t sure if that was “efficient” or the “correct way” to do it.

Cheers and thanks again for PyMC3! Such a fantastic tool.

2 Likes

Hi John,
I think you don’t need to flatten your shape tuple, otherwise you’ll loose the 2D-aspect in your distribution. Just pass it to the distribution as a 2D array, and index mu and sd as you would index a numpy array: mu[:, 0] for instance, to select all lines but only the first column.

Also, I don’t think it’s necessary to pass shape to your likelihood, as there is already an observed.
Here is a NB where you’ll find lots of examples of shape handling in complex models.
Hope this helps :vulcan_salute:

1 Like

Hi Alex,

Thank you for the reply! I understand how I would index the RV if I wanted to index it to some specific input in the 2D array, like you describe mu[:, 0], but what I want to do is different (or I’m just thick and missing something):

In my case, I want one RV for each gene / dataset pairing:

|    | D1        | D2        | D3        | D4        |
|----|-----------|-----------|-----------|-----------|
| G1 | $X_{1,1}$ | $X_{1,2}$ | $X_{1,3}$ | $X_{1,4}$ |
| G2 | $X_{2,1}$ | $X_{2,2}$ | $X_{2,3}$ | $X_{2,4}$ |
| G3 | $X_{3,1}$ | $X_{3,2}$ | $X_{3,3}$ | $X_{3,4}$ |

The number of samples in each dataset is different, so I may have 127 samples in dataset 1 and 75 samples in dataset 2. So for 100 genes and 2 datasets I have 127 * 100 + 75 * 100 = 21,500 values I want to pass in as observations for these (100, 2) RVs.

If the shape of y is scalar, then index is just vector the same length as obs that map to the observations. Now that my RVs are in a 2D matrix, I’m trying to figure out how to pass in these 21,500 observations and map them to the the appropriate RV.

For context, here is the complete model with the reshape approach. I think having two different observed may indicate I should break it up into separate models?

with pm.Model() as model:
    # Constants
    training_genes = list(sample.index)
    N = bgdf.tissue.nunique()
    M = len(training_genes)
    MN = M * N

    # Gene expression priors
    sd = pm.InverseGamma('gm_sd', 1, 1, shape=MN)
    mu = pm.Normal('gm_mu', 0, 10, shape=MN)

    # Model gene expression
    x_hat = pm.Normal('x_hat', mu=mu[ix], sd=sd[ix], shape=MN, observed=obs)
    x = pm.Normal('x', mu=mu, sd=sd, shape=MN)

    # Likelihood priors
    eps = pm.InverseGamma('eps', 1, 1)
    beta =  pm.Dirichlet("beta", a=np.ones(N))

    # Likelihood
    y_hat = pm.Deterministic('y_hat', pm.math.dot(x.reshape((M, N)), beta))
    y = pm.Laplace('y', mu=y_hat[six], b=eps, observed=sample.values)

(3 datasets, 125 genes)

And what’s the error message you’re getting?