I think two things are confused here.
@JWarmenhoven I think you understood “counts” as how often a subject was observed. But @dpananos also counted occurrences of the behaviour, which are the actual values in the columns.
@dpananos it is a bit unclear what you want to model and why. Here some thoughts.
data structure
First of all, your IDs and counts are integer, so you can change the data type.
Also, I think you data is better structured as unique rows with the same column number, like so:
SubID;n_observed_behaviour
12347;70
12347;63
12348;70
12352;13
12352;0
12352;16
12352;5
12356;80
12359;26
12359;115
12362;69
12365;0
12372;83
12383;60
12383;32
12383;53
12388;21
12383;30
The SubID
is the column that groups the observations. The number of observations is implicitly covered by the number of rows (you can easily recover it by making your data a pandas DataFrame and using groupby
and agg
).
distribution
The question is: why? A binomial distribution would be applicable if there is a maximum of trials that a subject could have participated in. In that case, you get percentages, and could as well use continuous distributions such as pm.Beta
or pm.Kumaraswamy
.
However, such a total number is not given in your example data. So, your data remain “open end” integers, i.e. discrete. In that case, pm.Poisson
or pm.DiscreteWeibull
might apply (unfortunately I have no experience with the latter).
hierarchical model
If the only reason for multiple observation is your sampling, i.e. there is no behavioural bias that would require you to sum all the counts per subject, then indeed a hierarchical structure is justified.
Check out this blog by @twiecki for how observations are brought in.
I’ll sketch it below, but code is untested, lacking data. I chose Poisson distribution for simplicity, lambda will be the average number of observations per subject.
this requires pandas (pd
) and the data structure that I suggested above.
df['group'] = pd.Categorical(data['SubID'], ordered = False)
group_index = data['group'].cat.codes.values
with pm.Model() as model:
lambda_population = pm.HalfCauchy('lambda_population', 1.)
lambda_subject = pm.HalfCauchy('lambda_subject' \
, lambda_population \
, shape = N \
)
estimate = lambda_subject[group_index] # takes the lambda corresponding to each subject
likelihood = pm.Poisson('likelihood' \
, estimate \
, observed = df['n_observed_behaviour'].values \
)
I think lambda_population
would be the value you are looking for.
Hope this helps!
Falk