Modelling groups with different number of observations

I have 10 subjects I am studying. I’ve observed each subject and counted how many times they perform some action. I have observed some subjects multiple times, and observed others only once. Here is a sample of my data (with ID as rows, observation number as columns, and counts as entries).

SubID	1	2	3	4
12347.0	70.0	63.0		
12348.0	70.0			
12352.0	13.0	0.0	16.0	5.0
12356.0	80.0			
12359.0	26.0	115.0		
12362.0	69.0			
12365.0	0.0			
12372.0	83.0			
12383.0	60.0	32.0	53.0	
12388.0	21.0	30.0		

I’d like to model the counts as coming from a hierarchical negative binomial distribution. Here is the setup I have so far

N = df.SubID.nunique()

with pm.Model() as model:
    alpha = pm.HalfCauchy('alpha',1)
    beta = pm.HalfCauchy('beta',1)
    mu = pm.HalfCauchy('mu',1, shape = N)
    a = pm.HalfCauchy('a',1, shape = N)

I’m unsure how to approach the observed part of the model since some subjects have only 1 observation while others have many more.

What is the proper way to structure my data and feed it into my model?

You say you want to use a hierarchical model. How do you want to model the grouping/hierarchy of subjects? With regards to your data; what defines the grouping?

I think two things are confused here.
@JWarmenhoven I think you understood “counts” as how often a subject was observed. But @dpananos also counted occurrences of the behaviour, which are the actual values in the columns.

@dpananos it is a bit unclear what you want to model and why. Here some thoughts.

data structure

First of all, your IDs and counts are integer, so you can change the data type.
Also, I think you data is better structured as unique rows with the same column number, like so:


The SubID is the column that groups the observations. The number of observations is implicitly covered by the number of rows (you can easily recover it by making your data a pandas DataFrame and using groupby and agg).


The question is: why? A binomial distribution would be applicable if there is a maximum of trials that a subject could have participated in. In that case, you get percentages, and could as well use continuous distributions such as pm.Beta or pm.Kumaraswamy.

However, such a total number is not given in your example data. So, your data remain “open end” integers, i.e. discrete. In that case, pm.Poisson or pm.DiscreteWeibull might apply (unfortunately I have no experience with the latter).

hierarchical model

If the only reason for multiple observation is your sampling, i.e. there is no behavioural bias that would require you to sum all the counts per subject, then indeed a hierarchical structure is justified.

Check out this blog by @twiecki for how observations are brought in.
I’ll sketch it below, but code is untested, lacking data. I chose Poisson distribution for simplicity, lambda will be the average number of observations per subject.
this requires pandas (pd) and the data structure that I suggested above.

df['group'] = pd.Categorical(data['SubID'], ordered = False)
group_index = data['group']

with pm.Model() as model:
    lambda_population = pm.HalfCauchy('lambda_population', 1.)

    lambda_subject = pm.HalfCauchy('lambda_subject' \
                                    , lambda_population \
                                    , shape = N \

    estimate = lambda_subject[group_index] # takes the lambda corresponding to each subject
    likelihood = pm.Poisson('likelihood' \
                            , estimate \
                            , observed = df['n_observed_behaviour'].values \

I think lambda_population would be the value you are looking for.

Hope this helps!


1 Like

I understand the count data, but I wonder about the grouping. Perhaps the columns 1-4 represent counts of the observed action in specific and identical (across subjects) conditions. In that case you could reshape the data like this…

 .sort_values(['SubID', 'Condition'])
SubID Condition Count
0 12347.0 1 70.0
1 12347.0 2 63.0
2 12348.0 1 70.0
3 12352.0 1 13.0
4 12352.0 2 0.0
5 12352.0 3 16.0
6 12352.0 4 5.0
7 12356.0 1 80.0

…and perhaps have the condition as a group_index in the model.