# PyMC3 - Differences in ways observations are passed to model -> difference in results?

I’m trying to understand whether there is any meaningful difference between the ways data can be passed into a model — either aggregated or as single trials (note this question only makes sense for certain distributions, e.g. Binomial).

I am predicting p for a yes/no trial, using a simple model with a Binomial distribution.

What is the difference in the computation/results of the following models (if any)?

I chose the two extremes, either passing in each trial individually (reducing the Binomial to a Bernoulli) or passing in the sum of the entire series of trials, to illustrate my meaning, though I am also interested in the cases in between.

```python
# set up imports and constants
import numpy as np
import pymc3 as pm
import scipy.stats

p_true = 0.1
N = 3000
observed = scipy.stats.bernoulli.rvs(p_true, size=N)
```

Model 1: combining all observations into a single data point

```python
with pm.Model() as binomial_model1:
    p = pm.Uniform('p', lower=0, upper=1)
    # all N trials collapsed into a single Binomial count
    observations = pm.Binomial('observations', N, p, observed=np.sum(observed))
    trace1 = pm.sample(40000)
```

Model 2: using each observation individually

```python
with pm.Model() as binomial_model2:
    p = pm.Uniform('p', lower=0, upper=1)
    # each trial as its own Binomial(1, p), i.e. a Bernoulli
    observations = pm.Binomial('observations', 1, p, observed=observed)
    trace2 = pm.sample(40000)
```

There isn’t any noticeable difference in the traces or posteriors in this case. I attempted to dig into the pymc3 source code to see how the observations are processed, but couldn’t find the relevant part. Possible answers I can think of:

- pymc3 aggregates the observations under the hood for Binomial anyway, so there is no difference
- the resultant posterior surface (which is explored during sampling) is identical in each case, so there is no meaningful/statistical difference between the two models
- there are differences in the resultant statistics because of this and that…
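For reference, the two likelihoods can be written out directly. With $k = \sum_i x_i$ successes out of $N$ trials, the aggregated Binomial log-likelihood and the sum of per-trial Bernoulli log-likelihoods are:

```latex
\log p(k \mid N, p) = \log \binom{N}{k} + k \log p + (N - k) \log(1 - p)
```

```latex
\sum_{i=1}^{N} \log p(x_i \mid p) = k \log p + (N - k) \log(1 - p)
```

The two differ only by $\log \binom{N}{k}$, which does not depend on $p$, so the posterior over $p$ is the same in both cases.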

The logp of the two models differs, but only by a constant. For example, in the first model `observations.logp(binomial_model1.test_point)` and `observations.logp_elemwise(binomial_model1.test_point)` are the same: `array(-1159.4138508309436)`, while in the second model you have:

```python
observations.logp(binomial_model2.test_point)
# array(-2079.4415416797233)

observations.logp_elemwise(binomial_model2.test_point)
# array([-0.69314718, -0.69314718, -0.69314718, ..., -0.69314718,
#        -0.69314718, -0.69314718])

observations.logp_elemwise(binomial_model2.test_point).sum()
# -2079.4415416798361
```

where `logp_elemwise` gives the logp of each individual observation.
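The constant offset can also be checked without pymc3 at all, using scipy directly. This is a standalone sketch (variable names are illustrative, not from the models above): for any value of p, the aggregated Binomial logp and the summed Bernoulli logps differ by exactly log C(N, k).

```python
import numpy as np
import scipy.stats

# simulate the same kind of data as in the question
N = 3000
observed = scipy.stats.bernoulli.rvs(0.1, size=N, random_state=0)
k = int(observed.sum())

for p in (0.05, 0.1, 0.5):
    logp_binom = scipy.stats.binom.logpmf(k, N, p)               # model 1 style
    logp_bern = scipy.stats.bernoulli.logpmf(observed, p).sum()  # model 2 style
    # the offset is log(C(N, k)), independent of p
    print(p, logp_binom - logp_bern)
```

Because the offset is constant in p, it shifts the whole log-posterior surface without changing its shape.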

And for the MCMC sampler, PyMC3 compiles a logp function by taking the sum of the logp of all the RVs in the model. That is how it aggregates the observations under the hood. In a way you can say that your first two bullet points are both correct: the observations are effectively aggregated, and since the two logp surfaces differ only by a constant, the posterior the sampler explores is identical.
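The equivalence can also be confirmed analytically: a Uniform(0, 1) prior is a Beta(1, 1), and in both models the likelihood contributes the same factor p^k (1-p)^(N-k), so by conjugacy both posteriors are Beta(1 + k, 1 + N - k). A small sketch with scipy (names here are illustrative):

```python
import scipy.stats

# same data-generating setup as in the question
N = 3000
observed = scipy.stats.bernoulli.rvs(0.1, size=N, random_state=0)
k = int(observed.sum())

# conjugate posterior under a Uniform (= Beta(1, 1)) prior,
# identical for the aggregated and per-trial models
posterior = scipy.stats.beta(1 + k, 1 + N - k)
print(posterior.mean())  # close to p_true = 0.1 for large N
```

The sampled traces from both models should match this analytic posterior up to Monte Carlo error.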

Thanks a lot @junpenglao