Modelling groups with different number of observations

dpananos · August 18, 2018, 9:45pm

I have 10 subjects I am studying. I’ve observed each subject and counted how many times they perform some action. I have observed some subjects multiple times, and observed others only once. Here is a sample of my data (with ID as rows, observation number as columns, and counts as entries).

SubID	1	2	3	4
12347.0	70.0	63.0		
12348.0	70.0			
12352.0	13.0	0.0	16.0	5.0
12356.0	80.0			
12359.0	26.0	115.0		
12362.0	69.0			
12365.0	0.0			
12372.0	83.0			
12383.0	60.0	32.0	53.0	
12388.0	21.0	30.0

I’d like to model the counts as coming from a hierarchical negative binomial distribution. Here is the setup I have so far

N = df.SubID.nunique()

with pm.Model() as model:
    
    alpha = pm.HalfCauchy('alpha',1)
    beta = pm.HalfCauchy('beta',1)
    
    mu = pm.HalfCauchy('mu',1, shape = N)
    a = pm.HalfCauchy('a',1, shape = N)

I’m unsure how to approach the observed part of the model since some subjects have only 1 observation while others have many more.

What is the proper way to structure my data and feed it into my model?

JWarmenhoven · August 19, 2018, 5:14am

You say you want to use a hierarchical model. How do you want to model the grouping/hierarchy of subjects? With regards to your data; what defines the grouping?

falk · August 19, 2018, 9:37am

I think two things are confused here.
@JWarmenhoven I think you understood “counts” as how often a subject was observed. But @dpananos also counted occurrences of the behaviour, which are the actual values in the columns.

@dpananos it is a bit unclear what you want to model and why. Here some thoughts.

data structure

First of all, your IDs and counts are integer, so you can change the data type.
Also, I think you data is better structured as unique rows with the same column number, like so:

SubID;n_observed_behaviour
12347;70
12347;63
12348;70
12352;13
12352;0
12352;16
12352;5
12356;80
12359;26
12359;115
12362;69
12365;0
12372;83
12383;60
12383;32
12383;53
12388;21
12383;30

The SubID is the column that groups the observations. The number of observations is implicitly covered by the number of rows (you can easily recover it by making your data a pandas DataFrame and using groupby and agg).

distribution

The question is: why? A binomial distribution would be applicable if there is a maximum of trials that a subject could have participated in. In that case, you get percentages, and could as well use continuous distributions such as pm.Beta or pm.Kumaraswamy.

However, such a total number is not given in your example data. So, your data remain “open end” integers, i.e. discrete. In that case, pm.Poisson or pm.DiscreteWeibull might apply (unfortunately I have no experience with the latter).

hierarchical model

If the only reason for multiple observation is your sampling, i.e. there is no behavioural bias that would require you to sum all the counts per subject, then indeed a hierarchical structure is justified.

Check out this blog by @twiecki for how observations are brought in.
I’ll sketch it below, but code is untested, lacking data. I chose Poisson distribution for simplicity, lambda will be the average number of observations per subject.
this requires pandas (pd) and the data structure that I suggested above.

df['group'] = pd.Categorical(data['SubID'], ordered = False)
group_index = data['group'].cat.codes.values

with pm.Model() as model:
    lambda_population = pm.HalfCauchy('lambda_population', 1.)

    lambda_subject = pm.HalfCauchy('lambda_subject' \
                                    , lambda_population \
                                    , shape = N \
                                    )


    estimate = lambda_subject[group_index] # takes the lambda corresponding to each subject
    likelihood = pm.Poisson('likelihood' \
                            , estimate \
                            , observed = df['n_observed_behaviour'].values \
                            )

I think lambda_population would be the value you are looking for.

Hope this helps!

Falk

JWarmenhoven · August 19, 2018, 7:58pm

I understand the count data, but I wonder about the grouping. Perhaps the columns 1-4 represent counts of the observed action in specific and identical (across subjects) conditions. In that case you could reshape the data like this…

(df.melt(id_vars='SubID',
        value_vars=['1','2','3','4'],
        var_name='Condition',
        value_name='Count')
 .dropna()
 .sort_values(['SubID', 'Condition'])
 .reset_index(drop=True))

	SubID	Condition	Count
0	12347.0	1	70.0
1	12347.0	2	63.0
2	12348.0	1	70.0
3	12352.0	1	13.0
4	12352.0	2	0.0
5	12352.0	3	16.0
6	12352.0	4	5.0
7	12356.0	1	80.0

…and perhaps have the condition as a group_index in the model.

Topic		Replies	Views
Hierarchical model for counts with categorical data - help structuring Questions	2	949	January 18, 2022
Hierarchical model, categorical outcome Questions	0	538	September 15, 2020
Data / model structure with hierarchical model and covariates at individual and group level Questions	5	940	October 6, 2021
Out of sample/model predictions in hierarchical model with new observed data v5 modeling	9	252	October 16, 2024
Multi-Multilevel Modeling Questions	9	2771	February 15, 2023

Modelling groups with different number of observations

data structure

distribution

hierarchical model

Related topics