How `observed` data are subset in hierarchical modelling

Jack_Caster · March 21, 2018, 3:55pm

I just started fiddling around with PyMc and so far it has been a pleasure! However, how the library deals with observed data has not clicked yet. For example, let’s consider the varying intercept model in the radon example; here, we have one a value for each county. But when I pass the log_radon series of data in the likelihood, how are the data subset into counties’ values? Are the data selected with fancy indexing based on index of the array a?

edit: formatting

junpenglao · March 21, 2018, 4:23pm

You should not think that the observed log_radon are being indexed into counties - you can understand it this way in this particular case, but in general, it does not apply.

In a hierarchical linear model, you have some kind of function (e.g., a linear function) that maps the input to some output. In this case, indexing is also a function (remember, functions are mapping) that map some \hat{y} to the observed y.

For example, you can have 10 counties, each with its own intercept \beta_i, and you have the observed as:

County index	observed
0	.13
0	1.5
1	.23

when you model the \beta_i as beta = pm.Normal(..., shape=10), and do observed = pm.Normal(..., mu=beta[index], ..., observed=y). You are essitentally mapping the beta to your observed.

I hope this clear up things a bit.

Jack_Caster · March 21, 2018, 5:47pm

Ok, thank you for the prompt response! I think I am getting closer. Just to make sure I understand:

Does the county index need to contain only values from 0 up to some number with an increment of 1 then? (for example I could use as index the output of pd.Categorical(...).codes.) Do the values in the table you posted need to be sorted in ascending order with respect to the county index?
Is it possible to deal with sub-indexes (an index for districts nested into county) in the same way? How could I do that? For example, let’s have the following table

County index	districts index	observed
0	0	.13
0	1	.15
0	2	.23
0	3	.24
1	0	.43
1	1	.25
1	2	.33
1	3	.4

and I want to have a multilevel model where each county parameter is from a common group distribution, and further each district depends on the county.

junpenglao · March 21, 2018, 8:04pm

It doesnt need to be sorted. Getting the index using pd.Categorical(...).codes is fine.

The best way is to model it as a hierarchical model, maybe you would find GitHub - junpenglao/GLMM-in-Python: Generalized linear mixed-effect model in Python useful

Jack_Caster · March 21, 2018, 8:11pm

Great. I will look into GLMM examples. Thank you!

Topic		Replies	Views
Hierarchical model and pandas dataframe Questions	7	2394	July 29, 2021
Managing indices/data shape in a multi-level linear regression for incremental model updating Questions	0	372	September 10, 2020
Understanding coords, indexation, Data, ..., for multilevel models v5 modeling	1	3692	April 29, 2022
Indexing in hierarchical modeling version agnostic	0	68	July 2, 2024
Sampling a specific country from the hierarchical model example Questions	3	536	May 24, 2019

How `observed` data are subset in hierarchical modelling

Related topics