Shape Parameter for Multiple Treatment Groups

In Chapter 1 of “Bayesian Methods for Hackers” there’s an example that infers the expected rate of text messages on either side of a single switchpoint, along with the location of that switchpoint in time. I’ve been trying to generalise this so that, given count data for multiple people, we could compute the parameters for each individual separately and simultaneously, without having to create separate likelihood functions (using “fancy indexing” as Eric J Ma calls it), in a style similar to this:

with pm.Model() as model:
    n_persons = fake_count_data['person_id'].nunique()
    n_count_data = len(fake_count_data) // n_persons  # integer division so tau's upper bound is an int
    alpha = 1.0/fake_count_data['n'].mean()
    lambda_1 = pm.Exponential("lambda_1", alpha, shape=n_persons) # Added shape=n here
    lambda_2 = pm.Exponential("lambda_2", alpha, shape=n_persons) # Added shape=n here
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1, shape=n_persons) # Added shape=n here

    idx = np.arange(n_count_data)
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2) # This line fails

    observation = pm.Poisson("obs", mu=lambda_[fake_count_data['person_id']], observed=fake_count_data['n']) # Added [fake_count_data['person_id']] here

However I get an error: ValueError: Input dimension mis-match. (input[0].shape[0] = 21, input[1].shape[0] = 17)

There seems to be an incompatibility between the dimensions of idx (17 units of time) and tau (21 people) in the tau > idx comparison; every person should have their own tau, but idx should simply be a sequence describing the length of the period.
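The mismatch can be reproduced with plain NumPy (a minimal sketch using 17 time steps and 21 people, matching the shapes in the error above):

```python
import numpy as np

# Two 1-D arrays of different lengths: 17 time steps vs. 21 people.
idx = np.arange(17)                      # shape (17,)
tau = np.random.randint(0, 17, size=21)  # shape (21,), one tau per person

# Element-wise comparison requires broadcastable shapes, so this fails,
# just like the switch condition in the model.
try:
    tau > idx
except ValueError as err:
    print(err)
```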

Moreover, I would have thought that adding shape=1 to all three variables (lambda_1, lambda_2 and tau, leaving mu=lambda_ as is in the likelihood function observation) would have no effect on the original code. Why is my intuition wrong? This time I get an error on sampling: TypeError: Cannot convert Type TensorType(int64, vector) (of Variable tau_shared__) into Type TensorType(int64, (True,)). You can try to manually convert tau_shared__ into a TensorType(int64, (True,)).

Would love some help with this!

You could change that to

idx = np.arange(n_count_data)[:, None]

This should allow the condition in the switch to broadcast properly, and the fancy indexing you do in the likelihood (lambda_[fake_count_data['person_id']]) should still work.

TIL that np.arange(n_count_data)[:, None] works! It looks to be equivalent to np.expand_dims(np.arange(n_count_data), -1) (both turn the index into an n_count_data × 1 column vector instead of a 1-D array).
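For anyone wanting to verify the equivalence, here is a quick NumPy-only sanity check of the shapes (nothing PyMC-specific, using 17 time steps and 3 people as an example):

```python
import numpy as np

n_count_data = 17
idx_col = np.arange(n_count_data)[:, None]
idx_exp = np.expand_dims(np.arange(n_count_data), -1)

# Both produce the same (n_count_data, 1) column vector,
# which broadcasts cleanly against a length-n_persons tau.
assert idx_col.shape == (n_count_data, 1)
assert (idx_col == idx_exp).all()

tau = np.array([5, 10, 3])        # e.g. one switchpoint per person
print((tau > idx_col).shape)      # (17, 3): one column per person
```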

Thanks for the response! Almost there but not quite: the final line with the likelihood now raises the error. I’m generating the fake data like so:

count_data = pd.DataFrame({
    "n": np.random.poisson(lam=15, size=74*3),
    "person_id": np.repeat([0, 1, 2], 74)
})
and get a ValueError: Input dimension mis-match. (input[0].shape[1] = 222, input[1].shape[1] = 3). As a reminder, I’m using 3 people, each with 74 items of data.

and as above, the likelihood function is

 observation = pm.Poisson("obs", mu=lambda_[count_data['person_id']], observed=count_data['n'])

My suggestion is to separate the person and iid observation dimensions into separate axes, like this:

"n": np.random.poisson(lam=15, size=(74, 3)), "person_id": np.repeat([[0, 1, 2]], 74, axis=0)

and finally set the shape of the observations RV to shape=n_persons. The idea behind this is that you want the model to be well specified in the absence of observed data. By well specified I mean that sample_prior_predictive should work, and its output for the observations RV should have the same shape as your fake data when you ask for 74 samples.


Got it working, thanks for your help lucianopaz! So it seems that with pm.math.switch it’s better to pass data in a wide format rather than a tall format (tall formats work fine with other distributions using NumPy fancy indexing).

Curious to know the best practice if you have different numbers of observations for each treatment group (NaNs give the warning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.)?

Final code if it will help others:

fake_count_data = pd.DataFrame(np.random.poisson(lam = 15, size = (74, 3)))
fake_count_data.columns = ['p0', 'p1', 'p2']

with pm.Model() as model:

    n_persons = fake_count_data.shape[1]
    n_count_data = len(fake_count_data)
    alpha = 1.0/fake_count_data.mean().mean()
    lambda_1 = pm.Exponential("lambda_1", alpha, shape=n_persons)
    lambda_2 = pm.Exponential("lambda_2", alpha, shape=n_persons)
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1, shape=n_persons)

    idx = np.arange(n_count_data)[:, None] # Thanks lucianopaz!
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

    observation = pm.Poisson("obs", mu=lambda_, observed=fake_count_data)
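To see why the final model broadcasts correctly, here is a pure-NumPy sketch of the shapes involved (no PyMC; the tau and lambda values are made-up placeholders for 3 people and 74 time points, as above):

```python
import numpy as np

n_persons, n_count_data = 3, 74
tau = np.array([10, 40, 60])             # shape (3,): one switchpoint per person
lambda_1 = np.array([12.0, 14.0, 16.0])  # shape (3,): pre-switchpoint rates
lambda_2 = np.array([18.0, 15.0, 13.0])  # shape (3,): post-switchpoint rates

idx = np.arange(n_count_data)[:, None]             # shape (74, 1)
lambda_ = np.where(tau > idx, lambda_1, lambda_2)  # broadcasts to (74, 3)

# This matches the wide-format observed data: one column per person.
print(lambda_.shape)  # (74, 3)
```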