How `observed` data are subset in hierarchical modelling

I just started fiddling around with PyMc and so far it has been a pleasure! However, how the library deals with observed data has not clicked yet. For example, let’s consider the varying intercept model in the radon example; here, we have one a value for each county. But when I pass the log_radon series of data in the likelihood, how are the data subset into counties’ values? Are the data selected with fancy indexing based on index of the array a?

edit: formatting

You should not think that the observed log_radon are being indexed into counties - you can understand it this way in this particular case, but in general, it does not apply.

In a hierarchical linear model, you have some kind of function (e.g., a linear function) that maps the input to some output. In this case, indexing is also a function (remember, functions are mapping) that map some \hat{y} to the observed y.

For example, you can have 10 counties, each with its own intercept \beta_i, and you have the observed as:

County index observed
0 .13
0 1.5
1 .23

when you model the \beta_i as beta = pm.Normal(..., shape=10), and do observed = pm.Normal(..., mu=beta[index], ..., observed=y). You are essitentally mapping the beta to your observed.

I hope this clear up things a bit.

Ok, thank you for the prompt response! I think I am getting closer. Just to make sure I understand:

  1. Does the county index need to contain only values from 0 up to some number with an increment of 1 then? (for example I could use as index the output of pd.Categorical(...).codes.) Do the values in the table you posted need to be sorted in ascending order with respect to the county index?
  2. Is it possible to deal with sub-indexes (an index for districts nested into county) in the same way? How could I do that? For example, let’s have the following table
County index districts index observed
0 0 .13
0 1 .15
0 2 .23
0 3 .24
1 0 .43
1 1 .25
1 2 .33
1 3 .4

and I want to have a multilevel model where each county parameter is from a common group distribution, and further each district depends on the county.

It doesnt need to be sorted. Getting the index using pd.Categorical(...).codes is fine.

The best way is to model it as a hierarchical model, maybe you would find GitHub - junpenglao/GLMM-in-Python: Generalized linear mixed-effect model in Python useful

1 Like

Great. I will look into GLMM examples. Thank you!