Hello,
I’ve started reading the Statistical Rethinking (2022) book and it has been great so far. I want to make sure I understand how to compute predictions for a specific model, in this case the multinomial/categorical model from Chapter 11.
Here is my simulated data:
import numpy as np
from scipy.special import softmax

N = 500
# simulate family incomes for each individual
family_income = np.random.rand(N)
# assign a unique coefficient to each type of event (career)
b = np.array([-2.0, 0.0, 2.0])
# per-individual probabilities: intercepts plus income effects, normalized over the 3 careers
p = softmax(np.array([0.5, 1.0, 1.5])[:, None] + np.outer(b, family_income), axis=0).T
# draw one career per individual and convert the one-hot draws to category indices
career = np.asarray([np.random.multinomial(1, pp) for pp in p])
career = np.where(career == 1)[1]
career[:11]
I’m fitting a model that assigns an intercept and a family_income coefficient to each career category, with the last category pivoted to a score of zero:
import pymc as pm
import aesara.tensor as at
import arviz as az

with pm.Model() as m11_14:
    a = pm.Normal("a", 0.0, 1.5, shape=2)  # intercepts for the two non-pivot careers
    b = pm.Normal("b", 0.0, 1.0, shape=2)  # coefficients on family income
    s0 = a[0] + b[0] * family_income
    s1 = a[1] + b[1] * family_income
    s2 = np.zeros(N)  # pivot category, score fixed at zero
    s = pm.math.stack([s0, s1, s2]).T
    p_ = at.nnet.softmax(s)  # row-wise softmax over the three careers
    career_obs = pm.Categorical("career", p=p_, observed=career)
    trace_11_14 = pm.sample(1000, tune=2000, target_accept=0.9, random_seed=RANDOM_SEED)

az.summary(trace_11_14, round_to=2)
I want to compute the implied predictions based on the posterior distributions of the coefficients a and b. I’ve tried this:
post_11_14 = az.extract_dataset(trace_11_14["posterior"])
s0 = post_11_14["a"][0, :] + post_11_14["b"][0, :] * family_income
The issue is that the posterior for each coefficient has 4,000 samples (4 chains of 1,000 draws each), while family_income has 500 observations, so the two don’t broadcast. What is the best practice for computing predictions for each observation? Do we do some sort of aggregation at the (chain, draw) level?
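To make the question concrete, here is a rough sketch of the kind of computation I think I’m after (assuming az.extract, or the older az.extract_dataset, returns each coefficient with shape (coefficient, sample)): broadcast every posterior draw against all 500 observations, softmax the scores over the three careers, and only then aggregate over the samples. Is something along these lines the intended approach, or is there a more idiomatic way (e.g. pm.sample_posterior_predictive)?

# Rough sketch of what I have in mind (not sure this is the recommended way)
post = az.extract(trace_11_14)                  # "a" and "b" with dims (coefficient, sample)
a_post = post["a"].values                       # shape (2, 4000)
b_post = post["b"].values                       # shape (2, 4000)

# scores per posterior draw and per observation: shape (4000, 500)
s0 = a_post[0][:, None] + b_post[0][:, None] * family_income
s1 = a_post[1][:, None] + b_post[1][:, None] * family_income
s2 = np.zeros_like(s0)                          # pivot category

# probabilities per draw and per observation: shape (4000, 500, 3)
p_post = softmax(np.stack([s0, s1, s2], axis=-1), axis=-1)

# e.g. posterior mean probability of each career for every observation: shape (500, 3)
p_mean = p_post.mean(axis=0)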
Here is the link to the entire chapter code.