How indexing works in pymc

Hi everyone. I’ve been having trouble understanding how pymc handles indexing. Is there a document somewhere that explains how pymc manages shapes and indexing?

Let me give an example of the kind of problem I’m having. Suppose we want to fit two logistic regressions and do partial pooling between their slopes. I have one predictor (time) and two outcome variables (k). I’ve stored the outcomes in a 2x22 array, along with the n’s in another 2x22 array. Each row represents one cluster and each column represents one point in time. Here’s the data-generating process:

# synthetic data
# one predictor variable 
# representing time

x = np.arange(22)

# n is a 2x22 matrix representing
# the maximums at each point in time

n = np.random.randint(100,size=(2,22))

# apply the inverse logit transformation on
# x twice with two different parameters

a0 = -4
b0 = 0.1

p0 = np.exp(a0 + b0*x) / (1 + np.exp(a0 + b0*x))

a1 = -2
b1 = 0.08

p1 = np.exp(a1 + b1*x) / (1 + np.exp(a1 + b1*x))

# generate outcome variable,
# two count variables from
# binomial distribution

k0 = stats.binom(n=n[0],p=p0).rvs()
k1 = stats.binom(n=n[1],p=p1).rvs()

# combine to get 2 x 22 outcome array

k = np.array([k0,k1])

So this pymc code works, but it’s ugly: it just enumerates both logistic regressions with hard-coded indexes on the parameters and then uses another batch of hard-coded indexes to pull out the right data.

with pm.Model() as partial_pool:
    hyper_b_mu = pm.Normal('hbmu',0,1)
    hyper_b_std = pm.HalfNormal('hbstd',0.5)
    
    a = pm.Normal('a',0,1,shape=2)
    b = pm.Normal('b',hyper_b_mu,hyper_b_std,shape=2)

    p0 = pm.invlogit(a[0]+b[0]*x)
    p1 = pm.invlogit(a[1]+b[1]*x)
    
    y0 = pm.Binomial('y0',n=n[0],p=p0,observed=k[0])
    y1 = pm.Binomial('y1',n=n[1],p=p1,observed=k[1])
    
    trace_partial_pool = pm.sample()

I thought I could just replace my hard-coded subscripts with an array, called cluster below, that takes on the same values (0, 1). But this pymc code doesn’t work: aesara throws “AssertionError: Could not broadcast dimensions” when it tries to compile the model. I’ve tried what feels like a dozen permutations of this indexed pymc code and all of them give me the same error.

# added cluster to manage indexing
cluster = np.array([0,1])

with pm.Model() as partial_pool:
    hyper_b_mu = pm.Normal('hbmu',0,1)
    hyper_b_std = pm.HalfNormal('hbstd',0.5)
    
    a = pm.Normal('a',0,1,shape=2)
    b = pm.Normal('b',hyper_b_mu,hyper_b_std,shape=2)

    p = pm.invlogit(a[cluster]+b[cluster]*x)
    
    y = pm.Binomial('y',n=n[cluster],p=p[cluster],observed=k[cluster])
    
    trace_partial_pool = pm.sample()

So anyway, it would be wonderful if someone could explain why the first version works but the second doesn’t when, to my mind, they do the same thing.

Indexing is supposed to work exactly like NumPy. What version of PyMC are you using? And can you share the full code to reproduce the issue?

You can call var.eval() on intermediate variables to get a sample value. This is useful to check if shapes match your expectations.
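
For example, something like this (a minimal sketch, not from your model) shows the shape that an indexed parameter actually has:

import numpy as np
import pymc as pm

cluster = np.array([0, 1])

with pm.Model():
    a = pm.Normal('a', 0, 1, shape=2)
    # .eval() draws a concrete value so you can inspect its shape
    print(a[cluster].eval().shape)  # (2,)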

I’m on pymc version 4.1.3.

I think the only thing missing was import statements. If you run this, it should throw the broadcasting error I mentioned.

import pymc as pm
import arviz as az
import numpy as np
from scipy import stats

# synthetic data
# one predictor variable 
# representing time

x = np.arange(22)

# n is a 2x22 matrix representing
# the maximums at each point in time

n = np.random.randint(100,size=(2,22))

# apply the inverse logit transformation on
# x twice with two different parameters

a0 = -4
b0 = 0.1

p0 = np.exp(a0 + b0*x) / (1 + np.exp(a0 + b0*x))

a1 = -2
b1 = 0.08

p1 = np.exp(a1 + b1*x) / (1 + np.exp(a1 + b1*x))

# generate outcome variable,
# two count variables from
# binomial distribution

k0 = stats.binom(n=n[0],p=p0).rvs()
k1 = stats.binom(n=n[1],p=p1).rvs()

# combine to get 2 x 22 outcome array

k = np.array([k0,k1])

cluster = np.array([0,1])

with pm.Model() as partial_pool:
    hyper_b_mu = pm.Normal('hbmu',0,1)
    hyper_b_std = pm.HalfNormal('hbstd',0.5)
    
    a = pm.Normal('a',0,1,shape=2)
    b = pm.Normal('b',hyper_b_mu,hyper_b_std,shape=2)

    p = pm.invlogit(a[cluster]+b[cluster]*x)
    
    y = pm.Binomial('y',n=n[cluster],p=p[cluster],observed=k[cluster])
    
    trace_partial_pool = pm.sample()

The short version of the error looks like:

ERROR (aesara.graph.opt): Optimization failure due to: constant_folding
ERROR (aesara.graph.opt): node: Assert{msg=Could not broadcast dimensions}(ScalarConstant{22}, TensorConstant{False})
ERROR (aesara.graph.opt): TRACEBACK:
ERROR (aesara.graph.opt): Traceback (most recent call last):
  File "C:\Users\Daniel Saunders\anaconda3\envs\pymc_env\lib\site-packages\aesara\graph\opt.py", line 1861, in process_node
    replacements = lopt.transform(fgraph, node)
  File "C:\Users\Daniel Saunders\anaconda3\envs\pymc_env\lib\site-packages\aesara\graph\opt.py", line 1066, in transform
    return self.fn(fgraph, node)
  File "C:\Users\Daniel Saunders\anaconda3\envs\pymc_env\lib\site-packages\aesara\tensor\basic_opt.py", line 2785, in constant_folding
    required = thunk()
  File "C:\Users\Daniel Saunders\anaconda3\envs\pymc_env\lib\site-packages\aesara\graph\op.py", line 559, in rval
    r = p(n, [x[0] for x in i], o, params)
  File "C:\Users\Daniel Saunders\anaconda3\envs\pymc_env\lib\site-packages\aesara\raise_op.py", line 96, in perform
    raise self.exc_type(self.msg)
AssertionError: Could not broadcast dimensions

This .eval() trick is really helpful, thank you! I think my problem is connected to a column vector vs. row vector thing. When I transpose my n, k, x matrices into shape 22x2, the pymc model compiles. It then gives me a new error about the initial evaluation failing. I need to play around with it, but I’ll write back tomorrow and let you know if this clears things up.
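
For anyone following along, here’s a minimal numpy sketch of what I think is going on with the shapes (my reading of it, not confirmed):

import numpy as np

a = np.zeros(2)      # stands in for a[cluster], shape (2,)
x = np.arange(22)    # shape (22,)

# a + x              # fails: shapes (2,) and (22,) cannot broadcast
y = a + x[:, None]   # works: (2,) against (22, 1) broadcasts to (22, 2)
print(y.shape)       # (22, 2)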


There are also specific rules for how distribution parameters must be aligned shape-wise. This guide may help: Distribution Dimensionality — PyMC 4.1.7 documentation.
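
For instance, here’s a sketch of one way to vectorize the model so the parameter shapes line up (assuming n and k stay as 2x22 arrays; this isn’t the only way to write it):

with pm.Model() as partial_pool:
    hyper_b_mu = pm.Normal('hbmu', 0, 1)
    hyper_b_std = pm.HalfNormal('hbstd', 0.5)

    a = pm.Normal('a', 0, 1, shape=2)
    b = pm.Normal('b', hyper_b_mu, hyper_b_std, shape=2)

    # a[:, None] has shape (2, 1), so adding b[:, None] * x gives
    # p a shape of (2, 22), matching n and k
    p = pm.invlogit(a[:, None] + b[:, None] * x)

    y = pm.Binomial('y', n=n, p=p, observed=k)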


The other recommendation would be to switch from shapes to coordinates/dimensions. These allow dimensions to be named with semantically meaningful labels and often help with keeping track of shape information. See Oriol’s blog post here and my shorter transition guide here.
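
As a rough sketch of what that could look like for this model (the coordinate names are placeholders I made up):

coords = {'cluster': [0, 1], 'time': np.arange(22)}

with pm.Model(coords=coords) as partial_pool:
    hyper_b_mu = pm.Normal('hbmu', 0, 1)
    hyper_b_std = pm.HalfNormal('hbstd', 0.5)

    a = pm.Normal('a', 0, 1, dims='cluster')
    b = pm.Normal('b', hyper_b_mu, hyper_b_std, dims='cluster')

    # p still needs shape (2, 22) to match the named dims of y
    p = pm.invlogit(a[:, None] + b[:, None] * x)

    y = pm.Binomial('y', n=n, p=p, observed=k, dims=('cluster', 'time'))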


In addition (yes, I am replying to my own reply), it is very common to flatten your observed variables (e.g., convert your data into “long” format), at which point you can typically work with 1D parameter arrays. See here for just one example of such a guide (it’s also illustrated in Oriol’s post).
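
A rough sketch of the long-format version of this model (the variable names are just illustrative):

# flatten the 2x22 arrays into length-44 vectors
cluster_idx = np.repeat([0, 1], 22)  # which cluster each observation belongs to
x_long = np.tile(x, 2)
n_long = n.reshape(-1)
k_long = k.reshape(-1)

with pm.Model() as partial_pool_long:
    hyper_b_mu = pm.Normal('hbmu', 0, 1)
    hyper_b_std = pm.HalfNormal('hbstd', 0.5)

    a = pm.Normal('a', 0, 1, shape=2)
    b = pm.Normal('b', hyper_b_mu, hyper_b_std, shape=2)

    # everything is 1D of length 44, so fancy indexing on the
    # parameters lines up with the flattened data
    p = pm.invlogit(a[cluster_idx] + b[cluster_idx] * x_long)

    y = pm.Binomial('y', n=n_long, p=p, observed=k_long)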


These were just the guides I was looking for. Thank you Christian and Ricardo, much appreciated.