# How to model historical data in the right distribution

I have observations to analyse for which I also have actual historical data. I think it would make a lot of sense to use this data as a prior, since it literally is prior information.

However, I’m having some trouble modelling this and generating the right prior distributions. For a beta distribution I can simply use SciPy’s `beta.fit` function on the data to get estimates for alpha and beta. But for a zero-inflated distribution I’m not sure how to get the right estimates of the parameters I need.
For example, with the ZeroInflatedPoisson I think I need theta and psi, but I’m not sure how to derive them from the actual data.

Any suggestions?

I had this issue as well. Check out this thread: *What is the best way to estimate theta and psi for ZeroInflatedPoisson?* (Questions / version agnostic, PyMC Discourse)

Thanks, I actually found that one, but I’m a bit confused about what is meant by logits there and how to determine theta.
Furthermore, I’m not sure how to come up with a psi value based on my data. If I know what proportion of the data are 0’s, do I just use that as the value? Or 1 minus that? (It was said in that thread that choosing 0 for psi will give you all zeros.)
As you can tell, I’m still learning here and feel a bit in over my head when it comes to the statistical details of all this.
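For what it’s worth, theta and psi can be backed out from the data directly with a method-of-moments calculation: for a zero-inflated Poisson (in the parameterisation where psi is the probability of the Poisson component, so psi = 0 means all zeros), the mean is psi·theta and the variance/mean ratio pins down theta. A minimal sketch, assuming your observations are in a NumPy array (the simulated data below is just for illustration):

```python
import numpy as np

def zip_moment_estimates(x):
    """Method-of-moments estimates for a zero-inflated Poisson.

    Returns (theta_hat, psi_hat) where psi is the probability of the
    Poisson component: psi = 0 gives all zeros, psi = 1 a plain Poisson.
    """
    x = np.asarray(x, dtype=float)
    m, s2 = x.mean(), x.var()
    theta_hat = m + s2 / m - 1.0   # Poisson rate of the non-zero-inflated part
    psi_hat = m / theta_hat        # probability of the Poisson component
    return theta_hat, psi_hat

# Simulated example with known truth: psi = 0.7, theta = 5
rng = np.random.default_rng(42)
n = 100_000
is_poisson = rng.random(n) < 0.7
data = np.where(is_poisson, rng.poisson(5.0, size=n), 0)

theta_hat, psi_hat = zip_moment_estimates(data)
```

These point estimates can then serve as starting values or to centre weakly informative priors; the fully Bayesian route is to estimate theta and psi inside the model, as discussed below.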

If you have historical data and a model you want to fit (e.g., a zero-inflated Poisson), you don’t want a prior, you want a posterior! You should estimate these parameters.

Not sure if I fully understand your remark. I kinda have a posterior that I want to use as a prior for the test. So basically:
1: have a prior for the historical data
2: calculate the posterior using the historical data
3: use this posterior as prior for my actual test
4: calculate the posterior for the test.

I thought to combine steps 1,2,3 by estimating/fitting a distribution to the historical data.
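For a conjugate model, steps 1–3 really do collapse into simple arithmetic, which may help build intuition. A sketch using a Beta-Binomial model with made-up counts (all numbers here are hypothetical):

```python
from scipy import stats

# Hypothetical counts: successes out of trials, historical then test period
hist_success, hist_n = 120, 1000
test_success, test_n = 75, 500

# 1: a weak prior for the historical data
a0, b0 = 1.0, 1.0

# 2: posterior given the historical data (Beta-Binomial conjugate update)
a_hist = a0 + hist_success
b_hist = b0 + hist_n - hist_success

# 3 & 4: that posterior becomes the prior for the test; update again
a_post = a_hist + test_success
b_post = b_hist + test_n - test_success

posterior = stats.beta(a_post, b_post)
```

For non-conjugate models like the zero-inflated Poisson this shortcut isn’t available, which is where the two-likelihood approach described further down comes in.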

Bit more context:
I’m analysing the effect of a treatment on a group (g2) by comparing it to a control group (g1). I have historical data (without the treatment) for both groups.
The delta (the effect of treatment) is then simply the posterior of g2 minus the posterior of g1.
So my reasoning is that it would be best to create priors for both groups separately, so I include the possibility that there is a pre-existing difference for any reason. (In theory they should be the same, but in reality there will always be some difference.)
Therefore I want to estimate the parameters of the distribution that best fits/reflects the historical data.

Or can I somehow follow the steps above and use the posterior of the historical data as the prior for my test? (how would I do this?)

EDIT:
Example:
Let’s say we have 10,000 users for which we have data about the number of steps they take per day.
I want to do an intervention and see what its effect is. I randomise these people into two groups of 5,000. Group 2 gets a daily adjusted step goal; group 1 still gets the fixed goal like before.
We continue measuring the daily step count for all users.
Now I’m interested in the effect of my intervention/treatment.

I could ignore the historical data and just look at the difference in step count between group 1 and group 2 and call it a day.
However, I think we can and should do better in this case, since we have the historical data. There could (and will) be a ‘pre-existing difference’ in the step count between the two groups. This can be expected to be quite small, but we can’t rule out the possibility that it will skew our results.
Let’s say the difference is -1% before the test. We then run the risk of concluding there is no effect when the true effect is +1%, or even that there is a negative effect when there is none.
Since we have the historical data, we can take this difference into account by using it as a prior for our analysis. This way we incorporate the pre-existing difference and get closer to the ‘real effect’.
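A quick simulation of why this matters, with made-up step-count numbers (everything here is hypothetical): a -1% pre-existing difference masks a +1% true effect in the naive after-only comparison, while comparing each group’s change from its own baseline recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical daily step counts; group 2 starts 1% lower for no good reason
g1_before = rng.normal(8_000, 2_000, n)
g2_before = rng.normal(7_920, 2_000, n)            # pre-existing -1%

true_effect = 0.01                                 # intervention lifts steps by 1%
g1_after = g1_before + rng.normal(0, 500, n)       # control: no effect, just noise
g2_after = g2_before * (1 + true_effect) + rng.normal(0, 500, n)

# Naive after-only comparison: pre-existing gap cancels the true effect
naive = g2_after.mean() / g1_after.mean() - 1

# Difference-in-differences: each group's change from its own baseline
did = (g2_after.mean() - g2_before.mean()) - (g1_after.mean() - g1_before.mean())
```

Here `naive` is close to zero while `did` is close to the true +1% lift (about 79 steps on a 7,920-step baseline).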

Hope this makes sense. If I’m faulty in my reasoning, I’m more than happy to learn.

You have longitudinal data, i.e. the same users tracked over time, before and after the intervention? If so, a hierarchical prior over the individual constants would account for time-invariant individual idiosyncrasies (i.e., unobserved personality traits that would affect their responsiveness to the treatment). You would probably want to include an interaction term between these effects and the treatment.

At the group level, if the groups are randomly assigned, there’s no reason to model an interaction between the individual characteristics and the group assignment, because you know by construction cov(group, individual_characteristics) = 0.

In broad strokes, though, you seem to be describing something like a difference-in-differences experimental design (see chapter 5.2 of *Mostly Harmless Econometrics* for the canonical treatment; they estimate the model using frequentist methods, but nothing stops you from implementing it in PyMC). There’s a ton of literature on this you could consult for guidance on how to estimate effects in your setup, and I’m sure there are more dedicated Bayesian treatments of experiment designs like this. Sadly, however, I can’t point you to them.

I’m doing DID analysis whenever we are not able to do a randomized experiment.
In this case the main focus should be the difference between the two (randomized) groups.

However, we have historical data for most (!) of the users in the experiment. So our priors could (should) be more informative I think.
To be concrete, the thing I want to prevent is something I’ve seen happen in my simulations:
concluding that the effect observed during the test period between group 1 and group 2 is due to the intervention. I’ve seen cases where the pre-existing difference was around -4% and the difference during the experiment was -2%. If we do not take the pre-existing difference into account, we conclude that the effect of our intervention was negative (-2% < 0), while in fact it was positive given the pre-existing difference between the groups (-2% > -4%).

So that’s why I’m looking into how I should incorporate this into my prior.

I guess my question was a bit convoluted; basically I have two questions:
1: how to incorporate the historical data in the best way.
2: how to model the right distribution (in the case of ratios / zero-inflated data).

I think I now understand that I should model the parameters of my prior based on the historical data. I’m just wondering/struggling with how to check this and how to plot the distribution generated by PyMC.
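One way to check fitted parameters without any PyMC machinery is to simulate from them and compare the statistics you care about against the data. A sketch with made-up theta/psi values (inside PyMC itself, `pm.sample_prior_predictive` plus ArviZ plotting functions serves the same purpose):

```python
import numpy as np

def sample_zip(theta, psi, size, rng):
    """Draw from a zero-inflated Poisson (psi = probability of the Poisson part)."""
    is_poisson = rng.random(size) < psi
    return np.where(is_poisson, rng.poisson(theta, size=size), 0)

rng = np.random.default_rng(1)

# Stand-in for the real historical data
data = sample_zip(theta=5.0, psi=0.7, size=50_000, rng=rng)
# Fresh draws from the fitted parameters (identical here, so the check should pass)
sim = sample_zip(theta=5.0, psi=0.7, size=50_000, rng=rng)

# A reasonable fit should reproduce at least the mean and the zero fraction
checks = {
    "mean": (data.mean(), sim.mean()),
    "zero fraction": ((data == 0).mean(), (sim == 0).mean()),
}
```

Plotting the two histograms on top of each other (e.g. with matplotlib) gives the visual version of the same check.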

As for the best approach to using the knowledge of the pre-existing difference:
I wonder whether it would make more sense (or a difference?) to not care about the actual values so much, but rather about the difference, i.e., model the expected difference based on the pre-existing difference and compare this with the observed difference.
Or should I just stick to creating both posteriors and only then compare?

Going back to your original question: if all you want to do is estimate the parameters of some distribution, then use those parameters as priors in the estimation of another distribution, I believe there is nothing stopping you from having two likelihood functions in your model. Put hyper-priors over the parameters of whatever distribution you are interested in, then estimate that distribution using your historical data as observed.

Then, in the same model block, take those estimated parameters and use them as inputs to another layer of parameters, which inform your post-treatment model. In pseudo-code:

```python
with pm.Model() as model:
    prior_mu = pm.Normal("prior_mu", ...)
    prior_sigma = pm.HalfCauchy("prior_sigma", ...)

    historical_loglike = pm.Normal(
        "historical", mu=prior_mu, sigma=prior_sigma, observed=historical_data
    )

    post_intervention_mu = pm.Normal("post_intervention_mu", mu=prior_mu, ...)
    post_intervention_sigma = pm.HalfNormal("post_intervention_sigma", ...)
    intervention_loglike = pm.Normal(
        "intervention",
        mu=post_intervention_mu,
        sigma=post_intervention_sigma,
        observed=post_intervention_data,
    )
```


But I really, really strongly suspect all of this is equivalent to something like \mu_{i,t} = \alpha_i + \beta \cdot \text{intervention}_{i,t} + \alpha_i \beta \cdot \text{intervention}_{i,t}, where intervention = 1 if t >= T else 0, T is the date of the intervention, and i indexes the individuals in the study.

The other place where I’m struggling with what you want is that the groups are randomized. By construction, a priori, the expected group difference is zero. Everything related to deviation from that expectation is captured by the spread in your posterior estimates of the intervention effect.

In the example you give, there is no interpretation where we would conclude the effect of the intervention is negative, because the spread between the groups decreased. The linear relationship I wrote above would capture this change, as would a DID regression.