# Modeling Sampling Error

I have a question related to modeling sampling error.

My observed values are actually aggregated data coming from a sample. For each observed value I know the sample size used to derive the observed value.

I want to use that sample size information to consider the uncertainty of the observed value. My initial thoughts for this were to assume each observed value comes from a normal distribution with standard deviation dependent upon the sample size and have the mean be the observed value.
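That initial idea can be sketched directly. Here `sigma0` is a made-up per-respondent spread (not something known from the data), and the aggregated value is assumed to tighten like 1/sqrt(n):

```python
import numpy as np

# Hypothetical observed values and the sample sizes behind them
obs = np.array([0.0405, 0.0147])
n = np.array([148, 44])

# Central-limit-style assumption: the aggregated value has
# standard deviation sigma0 / sqrt(n) for some baseline spread sigma0
sigma0 = 0.1
sd = sigma0 / np.sqrt(n)
```

The value measured with 148 people then gets a tighter distribution than the one measured with 44, which is exactly the kind of sample-size-dependent trust described above.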

I would then like to feed this information into a Gamma distribution, but it’s not clear how to do this since the values from that Normal distribution are no longer observed.

So my question is how can I include the uncertainty related to the sample size?

The model that I am using with the observed values is here:

I was just going to suggest you open a new post. When you say the observed values are coming from a sample, do you mean something like a Binomial observation, where you observe the data as pairs (n, k)?

The observed values are weighted, normalized count data. So for each observation I know both the base number of people sampled to form the calculated value and the calculated value itself.

To clarify, I am looking at subpopulations in a sample, but it’s not as clear as a Binomial observation.

I am still a bit unsure I understand: you are observing some values (the weighted, normalized count data), which are a transformation of some base number (the number of people sampled). I would say that if this transformation is deterministic, then these two values (the observed and the base value) contain the same information, and you can just model one of them. If the transformation is stochastic (e.g., `k ~ Binomial(n, p)`) or has unknown parameters (e.g., a linear function with unknown coefficients), then you can model the parameters of the transformation in the model.
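The stochastic case can be sketched like this (the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

p_true = 0.04                  # the unknown proportion a model would recover
n = 34                         # base number of people sampled
k = rng.binomial(n, p_true)    # stochastic transformation: k ~ Binomial(n, p)

# k and n no longer carry the same information, so the parameter of
# the transformation (p) has to be modeled explicitly
```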

Sorry, by normalized I mean taking the observed weighted data and dividing by the size of the population. The contribution of each member of a sample towards the observed data is unknown and different for each member.

As an example, for an event I am told that out of a population of 1 million people, a sample of 34 people attending the event was taken, and that after weighting there was an average of 40,500 people of that population in attendance per minute of the event. The value I am told was observed for the event is then 40,500 / 1,000,000 = 0.0405. Thus, without knowing the weight each person in the sample contributed to the value, I want to place uncertainty around the value of 0.0405 proportional to the number of people used to form the sample.

Hopefully that’s clearer.

@junpenglao, does my clarification make sense?

I see. In this case I would try to model the observed count directly (i.e., 34). Since this number is the output of a linear function (observed = total population * unknown proportion p * fraction of time sampled), you have:

- total population = 1 million; this is itself an estimate, so you can also represent it as a prior distribution
- unknown proportion p: 0.0405 in your example, but unknown; you can model it as a Beta distribution
- fraction of time sampled = a known value. For example, if in a 10-hour event you sample for 5 minutes, this value is 5/(10 * 60) ≈ 0.00833

and you can model your observed count as a Poisson: observed ~ Poisson(total population * p * fraction of time sampled).
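Numerically, that model looks like the following sketch (the event length and sampling window are the ones from the 10-hour example above):

```python
import numpy as np

rng = np.random.default_rng(1)

total_population = 1_000_000      # could also be given a prior
p = 0.0405                        # the unknown proportion; Beta prior in the model
sampling_frac = 5 / (10 * 60)     # 5 minutes sampled out of a 10-hour event

rate = total_population * p * sampling_frac
observed = rng.poisson(rate)      # observed ~ Poisson(pop * p * frac)
```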

Hope this makes sense.

It sounds like you are saying to model each event as a Poisson process with rate proportional to the population proportion of the event times the length of the event.

I don’t think this is quite what I want but let me see if I can summarize your above thoughts.

In the above, we are given the observed value r = 0.0405 and the knowledge of the population total and the duration of the event in minutes, say p and t, respectively. We know that 0.0405 * p * t = 40,500 * t is the sum of the minutes of attendance calculated in some way for the sample of n = 34 people. Let s_i be the total minutes of attendance for person i in the sample. Then we know that sum_{i=1}^34 s_i = 40,500 * t and we want to place uncertainty around the quantity sum_{i=1}^34 s_i.
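A quick check of that bookkeeping (t is left symbolic in the post; the value below is an assumed stand-in):

```python
r = 0.0405
p = 1_000_000
t = 120          # assumed event duration in minutes, for illustration only
n = 34

# r * p * t should equal 40,500 * t, the total minutes of attendance
total_attendance_minutes = r * p * t
```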

Are you saying to model n ~ Poisson(p * r * t), or in pymc3, `n = pm.Poisson('n', mu=p * r * t, observed=n)`? Isn’t the rate of that Poisson distribution much, much greater than the observed variable n?

Yep, that’s exactly what I meant. I don’t think the rate of the Poisson would be much higher than n, as it would just drive down the unknowns in `p * r * t` (i.e., r, and p if you place a prior on it).

@junpenglao

Thank you for all of the help.

I am attaching a toy example to illustrate what I believe are problems with this approach. Note that t, n, and p are assumed to be known, i.e., we know the time in seconds of sampling, the number of people sampled, and the true number of people in the population. What we do not know is the number of viewing seconds (viewing_seconds) derived from the sample and the true average proportion of the population for each event (r).

toy_events_data.csv (2.1 KB)

toy_example.py (2.8 KB)

From the correlation plot, we see some strong correlation between n and viewing_seconds since t and p are essentially constant.

So I can use either the recorded viewing seconds or n as the observed, set up a Poisson likelihood with rate r * p * t, and place a Gamma prior on r (I have some prior knowledge of how these r’s behave).
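A Gamma prior on r is conjugate to a Poisson likelihood with exposure p * t, so the posterior on r is available in closed form. A sketch of that update, with all numbers assumed for illustration:

```python
from scipy import stats

# All numbers here are assumptions, not values from the toy data
a, b = 2.0, 50.0            # Gamma prior on r: shape a, rate b
p, t = 1_000_000, 0.00833   # population size and time exposure
exposure = p * t
y = 34                      # observed count for the event

# Conjugate update: r | y ~ Gamma(a + y, b + exposure)
posterior = stats.gamma(a=a + y, scale=1.0 / (b + exposure))
post_mean = posterior.mean()
```

This makes it easy to see how the posterior spread on r depends on the exposure and the observed count.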

The problem comes from interpreting the trace for r after sampling with Metropolis. If I use the measured viewing seconds as the observed, I get the correct interpretation of r (the average proportion of the population in attendance), but there is no uncertainty around these values. If I instead use n as the observed, I get uncertainty around r, but in the posterior predictive check I lose the ability to understand the inferred viewing seconds of the event, since n is not directly proportional to the viewing seconds. Worse, I lose the interpretation of the r value.

I would prefer if it is somehow possible to use the measured viewing seconds in the likelihood but place uncertainty around it proportional to the number of people used in the sample.

To clarify, in the first two records, I have r values of 0.037624706 and 0.014726273. But the first r value was measured with a sample of 148 people as compared to 44 people for the second. I want to encapsulate the belief that I trust that r value measured using 148 people more than the one measured using 44 people.
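One standard way to encode that trust, assuming each r behaves roughly like a Binomial proportion k/n (a flat Beta(1, 1) prior is also an assumption here): the Beta posterior over the proportion tightens automatically as n grows.

```python
from scipy import stats

def posterior_sd(r_hat, n, a=1.0, b=1.0):
    """Sd of a Beta posterior for a proportion measured as k/n successes."""
    k = round(r_hat * n)
    return stats.beta(a + k, b + n - k).std()

# First record: r = 0.037624706 from n = 148; second: r = 0.014726273 from n = 44
sd_large = posterior_sd(0.037624706, 148)
sd_small = posterior_sd(0.014726273, 44)
```

The r measured with 148 people ends up with a narrower posterior than the one measured with 44, which is the asymmetry of trust being described.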

Does that make sense?

There is no uncertainty around `r` in the first model because there is no uncertainty in the observed `subset['viewing_seconds'] = subset['p'] * subset['r'] * subset['t']`.

I think I still don’t get what you are trying to model. Maybe it would help to write down the generative process, generate some fake data, and see if you can build a model that recovers the true parameters. It might also help to consult some papers in ecology on related topics.

I’m trying to find the true latent value of r, which is the average proportion of a population that was attending/viewing at any given second of an event. The measured r is derived as (sum of viewing seconds / duration of event) / population size, i.e., r = (sum_viewing_seconds / t) / p. I don’t know how they come up with sum_viewing_seconds, as it depends on each sampled person’s weight and actual attendance/viewing. Note that the weight of a person is the number of people that person is meant to represent. The only thing I know is the number of people used in the sample.
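One possible generative story for that measurement, with the weighting mechanism entirely made up (per-second Binomial viewing and roughly-equal weights are assumptions, since the real weights are unknown):

```python
import numpy as np

rng = np.random.default_rng(2)

p = 1_000_000     # population size (known)
t = 3600          # event duration in seconds (known)
n = 34            # number of people sampled (known)
r_true = 0.0405   # latent average proportion viewing at any second

# Assumed: each sampled person views Binomial(t, r_true) seconds and
# carries a noisy weight averaging p / n people represented
viewing = rng.binomial(t, r_true, size=n)
weights = rng.normal(p / n, p / (10 * n), size=n)

sum_viewing_seconds = float(weights @ viewing)
r_measured = (sum_viewing_seconds / t) / p
```

Simulating many events like this with varying n would show how far r_measured scatters around r_true, which is exactly the sample-size-dependent uncertainty in question.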

Since there are events measured that use far fewer people in the sample, I’m skeptical of the measured sum_viewing_seconds and therefore of the measured r. Further, I know generally how r should behave based on the interests of the population and the event in question. I was hoping there would be some standard way of incorporating the sample size into the uncertainty, but it doesn’t sound like that is the case.

I’ll try and see if there are any papers in ecology that measure population proportions of events using samples that I can borrow from. Thanks!

@junpenglao

Additionally, how come, if I change the magnitude of p (i.e., express it in millions and carry that through the other calculations), the trace plot doesn’t show point estimates for r anymore and each r has a distribution?

That’s just a visualization problem, I think: in the original magnitude the random walk is too small compared to the range of the data. If you just plot r[i], they should have the same shape.

@junpenglao

Here are the resulting traces for one event:

You are right that the shape is the same, but there is much, much less uncertainty in the latter, where the value for p is in the original scale.

Does that match expectations?

I would say it is reasonable, as scaling the input changes the measure in the parameter space: remember that the probability volume stays the same under a different parameterization, which means the uncertainty might scale differently.

But in your case above, the y-axes of the histograms are different, and the number of samples is not the same, so the picture might change if you sample more in the first case.

The sampling in the second case comes from using Metropolis, since NUTS fails with the large parameter space. I can try thinning the trace.

Using fewer samples didn’t change the outcome.