Multiple observations per outcome (group-level likelihood)

tjburch · February 18, 2021, 10:31pm

I have data where each row corresponds to a single review of an item. Items can be reviewed multiple times (anywhere from 1 to ~20), but each item has only one final outcome. An example subset:

ID	Reviewer	p1	p2	p3	outcome
1	A	1.50	1.6	0.9	2.4
1	B	1.45	1.35	1.46	2.4
1	C	1.55	1.51	1.51	2.4
2	B	1.05	1.02	1.00	1.5
2	C	1.06	1.16	0.80	1.5
3	C	0.86	1.5	0.46	0.8

I’m interested in doing a regression at the ID/outcome level. In principle, I could just do a groupby("ID").mean(), but a lot of information is lost in doing so. It’d be nice to capture the idea that many reviews with similar values ought to make us more confident in the value. Is there a standard way to create a likelihood at a “group” level without having to roll up to a summary statistic? (The long-term goal would be a hierarchical model that captures the reviewer-level effect, since reviewers would be similar but might have some variance between them, but one step at a time.)
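To make the "information lost" point concrete, here is a minimal sketch (standard library only, using just the p1 column from the table above) that keeps the count, standard deviation and standard error per ID next to the mean. The variable names are made up for illustration, and the standard error is only one possible way to express "more agreeing reviews means more confidence".

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Rows from the example table: (ID, reviewer, p1, outcome).
rows = [
    (1, "A", 1.50, 2.4),
    (1, "B", 1.45, 2.4),
    (1, "C", 1.55, 2.4),
    (2, "B", 1.05, 1.5),
    (2, "C", 1.06, 1.5),
    (3, "C", 0.86, 0.8),
]

# Collect the p1 values of every review under its item ID.
p1_by_id = defaultdict(list)
for item_id, _reviewer, p1, _outcome in rows:
    p1_by_id[item_id].append(p1)

# A plain mean keeps only "mean"; n, sd and se are what it throws away.
summary = {}
for item_id, values in p1_by_id.items():
    n = len(values)
    sd = stdev(values) if n > 1 else None  # undefined for a single review
    se = sd / sqrt(n) if sd is not None else None
    summary[item_id] = {"n": n, "mean": mean(values), "sd": sd, "se": se}

for item_id, stats in summary.items():
    print(item_id, stats)
```

ID 3 has a single review, so its spread is undefined here. That is exactly the case where a model with a prior on the spread would help more than a summary table.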

There is an old, semi-related thread, but I couldn’t extract anything applicable from it: Modelling groups with different number of observations - #3 by falk

RavinKumar · February 20, 2021, 4:37pm

There are a couple of approaches. You could run an independent regression per group, a multilevel model with pooled or unpooled parameters, or a hierarchical regression like you said. The different approaches are covered in great detail in this notebook. Do any look like the right match for your problem?

https://docs.pymc.io/notebooks/multilevel_modeling.html

OriolAbril · February 21, 2021, 12:14am

I also want to point out that you can do model comparison at the group level even if you have run the model with all the data. We have a WIP notebook about that on the ArviZ resources repo: link to semi-stale PR and link to nicely rendered notebook

tjburch · February 23, 2021, 4:51am

Thanks for the responses @RavinKumar and @OriolAbril

I’ve been thinking about this for a bit and think it’s closest to the original rugby prediction example notebook, where they infer team-level parameters using game-level data, like here -

The main difference between the two is that the example notebook uses per-game data, so there are several observations per team. In my data, there’s only one outcome per ID (a movie could have any number of reviewers, but it’s only going to have one final box office gross).

ricardoV94 · February 25, 2021, 7:05am

Looks to me like you need to write a likelihood of the box office gross in terms of fixed reviews as predictors. You can start with something very simple, like using the mean and the standard deviation of the reviews, and check if you get anything useful. Since you don’t seem to have any more information about the quality (or identity) of the reviewers or the movies, there is not really much more you can do.

It will be mostly about finding a likelihood that goes reasonably well from the number of reviews and their values to the box office gross.
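As a rough sketch of what "a likelihood in terms of the reviews" could look like in its simplest form: a Normal likelihood for the outcome whose mean is a linear function of the per-ID review mean. The linear form, the parameter names (intercept, slope, sigma) and the trial values below are all assumptions made for illustration, not something fitted or taken from this thread. The per-ID means of p1 come from the example table.

```python
from math import log, pi

# Per-ID mean of p1 and number of reviews (from the example table),
# plus the single outcome for each ID.
items = [
    {"mean_p1": 1.50, "n": 3, "outcome": 2.4},
    {"mean_p1": 1.055, "n": 2, "outcome": 1.5},
    {"mean_p1": 0.86, "n": 1, "outcome": 0.8},
]


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return -0.5 * log(2 * pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)


def log_likelihood(intercept, slope, sigma, data):
    """Sum of per-ID log densities: one term per outcome, not per review."""
    total = 0.0
    for item in data:
        mu = intercept + slope * item["mean_p1"]
        total += normal_logpdf(item["outcome"], mu, sigma)
    return total


# Compare two hypothetical parameter settings; higher is a better fit.
print(log_likelihood(0.0, 1.0, 0.5, items))
print(log_likelihood(-1.35, 2.5, 0.5, items))
```

The point of the sketch is that the likelihood has one term per ID, so each outcome is counted once no matter how many reviews the item received. The number of reviews (n) is carried along but not used here; it could enter through sigma or as an extra predictor.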

Do the reviews even predict the box office? Since you don’t have a model yet, it may be useful to do some exploratory data analysis of reviews vs. outcome. This can give you some ideas about the functional form and noisiness of your system.

tjburch · February 25, 2021, 3:42pm

fixed reviews

What do you mean by “fixed” in this context?

It will be mostly about finding a likelihood that goes reasonably well from the number of reviews and their values to the box office gross.

I’ll have to think more about this. I think I want to inform the model that more reviews → less uncertainty. I think that means I’ll have to say reviews inform us about some underlying “quality” parameters where the sd is a function of number of reviews. Reviews are weird in that they’re almost like aggregating priors of other people.
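One way to see "more reviews → less uncertainty" is the textbook Normal-Normal update: if each review is treated as a noisy measurement of a latent quality, the posterior standard deviation of that quality shrinks as reviews accumulate. The sketch below assumes a known review noise and a Normal prior; the default values (prior_mean, prior_sd, review_sd) are hypothetical placeholders, and in a real model they would be parameters to infer.

```python
from math import sqrt


def quality_posterior(reviews, prior_mean=1.0, prior_sd=0.5, review_sd=0.3):
    """Posterior mean and sd of a latent quality given noisy reviews.

    Assumes quality ~ Normal(prior_mean, prior_sd) and
    review ~ Normal(quality, review_sd) with review_sd known.
    """
    n = len(reviews)
    # Precisions (1 / variance) add: one from the prior, n from the reviews.
    precision = 1 / prior_sd**2 + n / review_sd**2
    post_mean = (prior_mean / prior_sd**2 + sum(reviews) / review_sd**2) / precision
    return post_mean, sqrt(1 / precision)


# p1 values from the example table: ID 3 has one review, ID 1 has three.
print(quality_posterior([0.86]))
print(quality_posterior([1.50, 1.45, 1.55]))
```

With these assumptions the posterior sd depends only on the number of reviews, not on how much they agree. Capturing agreement as well would require treating review_sd as unknown.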

Do the reviews even predict the box office?

Yeah. By a correlation analysis, they’re weakly predictive. Independently, each has anywhere from a 5-20% correlation with the outcome. The reviews are intended to be on different metrics of the movie, so they’re not too correlated with one another.
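For reference, a per-predictor correlation check of this kind can be sketched as below, using only the six example rows from the first post. With so few rows the numbers say nothing about the real data (where the correlations are reported as 5-20%); the snippet only shows the mechanics, and the pearson helper is written out by hand rather than taken from a library.

```python
from math import sqrt

# Example rows from the first post: (p1, p2, p3, outcome).
rows = [
    (1.50, 1.60, 0.90, 2.4),
    (1.45, 1.35, 1.46, 2.4),
    (1.55, 1.51, 1.51, 2.4),
    (1.05, 1.02, 1.00, 1.5),
    (1.06, 1.16, 0.80, 1.5),
    (0.86, 1.50, 0.46, 0.8),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


outcome = [row[3] for row in rows]
correlations = {}
for index, name in enumerate(["p1", "p2", "p3"]):
    column = [row[index] for row in rows]
    correlations[name] = pearson(column, outcome)

for name, r in correlations.items():
    print(name, round(r, 3))
```

Note that this treats every review row as its own observation, so IDs with more reviews count more often. Correlating per-ID means with the outcome instead would weight each ID once.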
