Shape Parameter for Multiple Treatment Groups

In Chapter 1 of “Bayesian Methods for Hackers” there’s an example that infers the expected rate of text messages on either side of a single switchpoint, along with the location of that switchpoint in time. I’ve been trying to generalise this so that, given count data for multiple people, we could compute the parameters for each individual separately and simultaneously, without having to create separate likelihood functions (using “fancy indexing” as Eric J Ma calls it), in a style similar to this:

with pm.Model() as model:
    n_persons = fake_count_data['person_id'].nunique()
    n_count_data = len(fake_count_data) // n_persons  # integer division so tau's upper bound is an int
    alpha = 1.0/fake_count_data['n'].mean()
    lambda_1 = pm.Exponential("lambda_1", alpha, shape=n_persons) # Added shape=n here
    lambda_2 = pm.Exponential("lambda_2", alpha, shape=n_persons) # Added shape=n here
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1, shape=n_persons) # Added shape=n here

    idx = np.arange(n_count_data)
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2) # This line fails

    observation = pm.Poisson("obs", mu=lambda_[fake_count_data['person_id']], observed=fake_count_data['n']) # Added [fake_count_data['person_id']] here

However I get an error: ValueError: Input dimension mis-match. (input[0].shape[0] = 21, input[1].shape[0] = 17)

There seems to be an incompatibility between the dimensions of idx (17 units of time) and tau (21 people) in the tau > idx comparison; every person should have their own tau, but idx should simply be a sequence describing the length of the period.
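The mismatch can be reproduced with plain NumPy (a minimal sketch using 17 time steps and 21 people, matching the shapes in the error above):

```python
import numpy as np

# Two 1-D arrays of different lengths: 17 time steps vs. 21 people.
idx = np.arange(17)                      # shape (17,)
tau = np.random.randint(0, 17, size=21)  # shape (21,), one tau per person

# Element-wise comparison requires broadcastable shapes, so this fails,
# just like the switch condition in the model.
try:
    tau > idx
except ValueError as err:
    print(err)
```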

Moreover, I would have thought that adding shape=1 to all three variables (lambda_1, lambda_2 and tau, leaving mu=lambda_ as is in the likelihood function observation) would have no effect on the original code. Why is my intuition wrong? This time I get an error on sampling: TypeError: Cannot convert Type TensorType(int64, vector) (of Variable tau_shared__) into Type TensorType(int64, (True,)). You can try to manually convert tau_shared__ into a TensorType(int64, (True,)).

Would love some help with this!

You could change that to

idx = np.arange(n_count_data)[:, None]

This should allow the condition in the switch to broadcast properly, and the fancy indexing you do in the likelihood (lambda_[fake_count_data['person_id']]) should still work.

TIL that np.arange(n_count_data)[:, None] works! It looks to be equivalent to np.expand_dims(np.arange(n_count_data), -1) (both turn the index into an n_count_data × 1 column vector instead of a 1-D array).
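For anyone wanting to verify the equivalence, here is a quick NumPy-only sanity check of the shapes (nothing PyMC-specific, using 17 time steps and 3 people as an example):

```python
import numpy as np

n_count_data = 17
idx_col = np.arange(n_count_data)[:, None]
idx_exp = np.expand_dims(np.arange(n_count_data), -1)

# Both produce the same (n_count_data, 1) column vector,
# which broadcasts cleanly against a length-n_persons tau.
assert idx_col.shape == (n_count_data, 1)
assert (idx_col == idx_exp).all()

tau = np.array([5, 10, 3])        # e.g. one switchpoint per person
print((tau > idx_col).shape)      # (17, 3): one column per person
```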

Thanks for the response! Almost there but not quite: the final line with the likelihood now raises the error. I’m generating the fake data like so:

count_data = pd.DataFrame({
    "n": np.random.poisson(lam=15, size=74*3),
    "person_id": np.repeat([0, 1, 2], 74)
})
and get a ValueError: Input dimension mis-match. (input[0].shape[1] = 222, input[1].shape[1] = 3). As a reminder, I’m using 3 people, each with 74 items of data.

and as above, the likelihood function is

 observation = pm.Poisson("obs", mu=lambda_[count_data['person_id']], observed=count_data['n'])

My suggestion is to separate the person and iid observation dimensions into separate axes, like this:

"n": np.random.poisson(lam=15, size=(74, 3)), "person_id": np.repeat([[0, 1, 2]], 74, axis=0)

and finally set the shape of the observations RV to shape=n_persons. The idea behind this is that you want the model to be well specified in the absence of observed data. By well specified I mean that sample_prior_predictive should work, and its output for the observations RV should have the same shape as your fake data when you ask for 74 samples.


Got it working, thanks for your help lucianopaz! So it seems that with pm.math.switch it’s better to pass data in a wide format rather than a tall format (tall formats work fine with other distributions using NumPy fancy indexing).

Curious to know the best practice if you have different numbers of observations for each treatment group (NaNs give the warning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.)?

Final code if it will help others:

fake_count_data = pd.DataFrame(np.random.poisson(lam = 15, size = (74, 3)))
fake_count_data.columns = ['p0', 'p1', 'p2']

with pm.Model() as model:

    n_persons = fake_count_data.shape[1]
    n_count_data = len(fake_count_data)
    alpha = 1.0/fake_count_data.mean().mean()
    lambda_1 = pm.Exponential("lambda_1", alpha, shape=n_persons)
    lambda_2 = pm.Exponential("lambda_2", alpha, shape=n_persons)
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1, shape=n_persons)

    idx = np.arange(n_count_data)[:, None] # Thanks lucianopaz!
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

    observation = pm.Poisson("obs", mu=lambda_, observed=fake_count_data)
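To see why the final model broadcasts correctly, here is a pure-NumPy sketch of the shapes involved (no PyMC; the tau and lambda values are made-up placeholders for 3 people and 74 time points, as above):

```python
import numpy as np

n_persons, n_count_data = 3, 74
tau = np.array([10, 40, 60])             # shape (3,): one switchpoint per person
lambda_1 = np.array([12.0, 14.0, 16.0])  # shape (3,): pre-switchpoint rates
lambda_2 = np.array([18.0, 15.0, 13.0])  # shape (3,): post-switchpoint rates

idx = np.arange(n_count_data)[:, None]             # shape (74, 1)
lambda_ = np.where(tau > idx, lambda_1, lambda_2)  # broadcasts to (74, 3)

# This matches the wide-format observed data: one column per person.
print(lambda_.shape)  # (74, 3)
```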