PyMC3 - Differences in ways observations are passed to model -> difference in results?

DBCerigo · October 26, 2017, 12:12pm

I’m trying to understand whether there is any meaningful difference between the ways of passing data into a model: either aggregated or as single trials (note this is only a sensible question for certain distributions, e.g. Binomial).

The setting: predicting p for a yes/no trial, using a simple model with a Binomial distribution.

What is the difference in the computation/results of the following models (if any)?

I chose the two extremes, either passing in a single trial at a time (which reduces to Bernoulli) or passing in the sum of the entire series of trials, to illustrate what I mean, though I am also interested in the cases in between these extremes.

import numpy as np
import pymc3 as pm
import scipy.stats

# set up constants
p_true = 0.1
N = 3000
observed = scipy.stats.bernoulli.rvs(p_true, size=N)

Model 1: combining all observations into a single data point

with pm.Model() as binomial_model1:
    p = pm.Uniform('p', lower=0, upper=1)
    observations = pm.Binomial('observations', N, p, observed=np.sum(observed))
    trace1 = pm.sample(40000)

Model 2: using each observation individually

with pm.Model() as binomial_model2:
    p = pm.Uniform('p', lower=0, upper=1)
    observations = pm.Binomial('observations', 1, p, observed=observed)
    trace2 = pm.sample(40000)

There isn’t any noticeable difference in the traces or posteriors in this case. I tried digging into the PyMC3 source code to see how the observations are processed, but couldn’t find the right part.
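For the cases in between the two extremes, one option would be to group the trials into equal-sized chunks and pass one success count per chunk. This is only a sketch and was not run as part of this thread: the chunk size of 100, the seed, and the name chunk_counts are illustrative assumptions, and NumPy's random generator stands in for the scipy call so the snippet is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
p_true, N = 0.1, 3000
observed = rng.binomial(1, p_true, size=N)  # 0/1 outcomes, one per trial

chunk_size = 100  # hypothetical value; must divide N evenly for this reshape
# One row per chunk, then count the successes in each row
chunk_counts = observed.reshape(-1, chunk_size).sum(axis=1)

# chunk_counts holds N // chunk_size counts and could be passed to a model as
# pm.Binomial('observations', chunk_size, p, observed=chunk_counts)
print(chunk_counts.shape, chunk_counts.sum() == observed.sum())
```

With chunk_size = 1 this is the same data as Model 2, and with chunk_size = N it is the single count used in Model 1.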

Possible expected answers:

PyMC3 aggregates the observations under the hood for Binomial anyway, so there is no difference.
The resulting posterior surface (which is explored during sampling) is identical in each case, so there is no meaningful/statistical difference between the two models.
There are differences in the resulting statistics because of this and that…

junpenglao · October 26, 2017, 12:32pm

The logp of the two models differs, but only by a constant. For example, in the first model observations.logp(binomial_model1.test_point) and observations.logp_elemwise(binomial_model1.test_point) are the same, array(-1159.4138508309436), because there is only a single observation, while in the second model you have:

observations.logp(binomial_model2.test_point)
array(-2079.4415416797233)
# and
observations.logp_elemwise(binomial_model2.test_point)
array([-0.69314718, -0.69314718, -0.69314718, ..., -0.69314718,
       -0.69314718, -0.69314718])

observations.logp_elemwise(binomial_model2.test_point).sum()
-2079.4415416798361

Here logp_elemwise is the logp of each individual observation.
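The "only up to a constant" point can be checked outside PyMC3. The sketch below uses scipy.stats directly rather than PyMC3's own logp, so it illustrates the statistics, not PyMC3 internals; the seed, the grid of p values, and all variable names are assumptions made for the example. It compares the single Binomial term with the sum of Bernoulli terms over a range of p, and checks that the gap is the log binomial coefficient log C(N, k), which does not depend on p.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
N = 3000
observed = rng.binomial(1, 0.1, size=N)  # 0/1 outcomes, one per trial
k = int(observed.sum())                  # total number of successes

p_grid = np.linspace(0.01, 0.99, 50)     # candidate values of p

# Aggregated form: a single Binomial(N, p) term evaluated at k
logp_aggregated = stats.binom.logpmf(k, N, p_grid)

# Per-trial form: sum of N Bernoulli(p) terms
logp_per_trial = np.array(
    [stats.bernoulli.logpmf(observed, p).sum() for p in p_grid]
)

# Gap between the two forms at every p
diff = logp_aggregated - logp_per_trial

# log C(N, k), computed via log-gamma to avoid overflow
log_binom_coeff = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)

print(np.allclose(diff, log_binom_coeff))
```

Because the gap is the same for every p, it cancels when the posterior is normalised, which is consistent with the two models giving indistinguishable traces.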

For the MCMC sampler, PyMC3 compiles a logp function by taking the sum of the logp of all free RVs in the model. So that’s how it aggregates the observations under the hood. In a way you can say that

DBCerigo · October 26, 2017, 2:30pm

Thanks a lot @junpenglao
