How best to model risk ratio

Hi, I’m cross-posting from stackoverflow since I just realised I might be better off asking here. https://stackoverflow.com/questions/56928544/how-can-i-calculate-risk-ratio-using-pymc3

I’d like to model the risk ratio for data in a survey, but I’m having difficulty framing the problem to pymc3. Risk ratio is the ratio of probability of some outcome given an exposure, to the probability without exposure.

So as I understand it, risk ratio is: (incidence in exposed group) / (incidence in notExposed group)

I’ve tried posing the problem as shown in the sample code below:

import pymc3

exposed_num_observations = 25
exposed_observed = 11
notExposed_num_observations = 853
notExposed_observed = 202

with pm.Model() as rr_mdoel:
    # Prior
    p_exposed = pm.Uniform("p_exposed", 0 ,1)
    p_notExposed = pm.Uniform("p_notExposed", 0 ,1)
    sd = pm.Uniform("sd", 0,5)

    # Likelihood
    exposed = pm.Binomial("exposed", 
                      p=p_exposed, 
                      n=exposed_num_observations,  
                      observed=exposed_observed)
    notExposed = pm.Binomial("notExposed", 
                             p=p_notExposed, 
                             n=notExposed_num_observations,  
                             observed=notExposed_observed)

    risk_ratio = pm.Normal("risk_ratio", mu=(p_exposed/p_notExposed), sd=sd)

    # Inference Run/ Markov chain Monte Carlo 
    trace = pm.sample(10000, chains=2, tune=1000)

I expected the risk_ratio to always be above 0 (because you cant have a negative risk ratio), and I expected sd to be fit to some value. The results (as shown by az.plot_posterior(trace, round_to=3) for example) show some negative values for risk_ratio and sd is still uniform.

Any tips here are greatly appreciated!

The risk ratio is a strictly deterministic function of the probabilities. This means it should be defined as

risk_ratio = pm.Deterministic('risk_ratio', p_exposed/p_notExposed)

That way when you sample from the joint posterior of "exposed" and "notExposed", they are converted into the (univariate) distribution of "risk_ratio". If you want to estimate the mean and standard deviation, you can pull it out of

np.std(trace['risk_ratio'])

By contrast, what this line does:

risk_ratio = pm.Normal("risk_ratio", mu=(p_exposed/p_notExposed), sd=sd)

is define a new Normal random variable with mean p_exposed/p_notExposed and deviation sd. Because there aren’t any observations of this random variable, there is no information to constrain sd, which is why the posterior of sd exactly matches the prior.

Fantastic - that seems to work as expected! Thanks for an informative response

edit: feel free to leave that answer on stackoverflow too if you use it