Observed data dependent on random variables

Dear PyMC team,

is it possible to use random variables in the observed variables?

The data is assumed to be Poisson distributed, and the unfolded value is determined by nuisances parameters obtained with pm.Normal(): pm.Poisson(‘unfolded’, mu=unfolded, observed=data)

However, not only the unfolded value depends on the nuisance parameters, but also the observed data will be modified. Is it a valid approach to simply define data in terms of the RVs also used for unfolded?

Thank you very much in advance!
Alexander

Hi @Alexander,

Could you perhaps clarify a bit better what you are trying to do? Where is that pm.Normal() being used? I imagine your model is something like this (correct me if its not):

mu = pm.Normal("mu", ...)
unfolded = pm.Poisson("unfolded", mu=mu, observed=data)

If that is the case there is nothing wrong with it (except you have to be careful to make sure that negative values do not appear in mu).

But I am confused by this sentence:

However, not only the unfolded value depends on the nuisance parameters, but also the observed data will be modified.

What do you mean when you say the observed data will be modified?

1 Like

Hi @ricardoV94,
Thank you very much for your answer! Yes, I am sorry, I should have explained it better.

My model is something like this:

mc = pm.Normal(“mc”,…)
np = pm.Normal(“np”,…)
mu = mc * (1+np)
data = rawdata * (1-np)
unfolded = pm.Poisson(“unfolded”, mu=mu, observed=data)

So pm.Normal() is used to get both the simulated value mc and the nuisance parameter np, to get an expected mu for the poisson distribution. The “rawdata” gets modified by np as well. Thus, the observed data depends also on the random variable np. Is this allowed?

That seems odd to me. To answer your immediate question, I don’t think it is possible (nor logical) to have variable observed values. Also the part with the (1+np) and (1-np), it seems that the only logical solution for np would be zero… otherwise your mu and data would always diverge.

But most of all I am very confused by what you are trying to do. Do you have any resources you can share that discuss what you have in mind?

Hi @ricardoV94,
Thank you for your answer! I think the problem with the logical solution for np to be zero comes from the simplified version of the model.

Actually, this problem arises in particle physics when there are detector uncertainties. For some uncertainties, we smear the simulation mc by (1+np) if np > 0, while we smear the data by (1+np) if np < 0. So it would rather look like

mc = pm.Normal(“mc”,…)
np = pm.Normal(“np”,…)
mu = mc * (1+theano.tensor.clip(np,-10.0,0.0))
data = rawdata * (1+theano.tensor.clip(np,0.0,10.0))
unfolded = pm.Poisson(“unfolded”, mu=mu, observed=data)

Unfortunately, there are no public resources I could share on this.

Thanks for the clarification.

Surprisingly running a toy model where the observed data data is modified by a random variable does not seem to raise any Errors. I did not know this was possible… (and I have no idea if it works as you would expect). Hopefully someone else in here will be able to give you better help than I could.

Hi again,

I have created a minimal working example to clarify the issue:

  data = array(20.0)
  with pm.Model() as model:
      truth = pm.Uniform("truth",lower=0.0,upper=100.0)
      np = pm.Normal("np",mu=0.0,sigma=0.1)
      unfold = truth * (1.+np)
      smeareddata = data * (1.-np)
      unfolded = pm.Poisson("unfolded", mu=unfold, observed=smeareddata)
      trace = pm.sample(100000, tune=10000, cores=1, chains=1,progressbar=True)

I want to estimate the truth value given the observed data. Due to measurement uncertainties, both the observed data and the simulated truth value is smeared with np. In reality, data is an array, and there hundreds of nuisance parameters, and the unfolded is related with truth by a response matrix, but this simple example should capture the essential problem.

Will the above code work? I do not see any errors when running it, but I am not sure if it really does what I think.

Thank you very much in advance!

Carefully with reassigning a second variable to unfolded. Other than that I still don’t see the point of smudging the observed data, but you can always simulate fake data and see if you recover the truth and np parameters.

lf you are just trying to model measurement noise there are more straightforward ways.

Have a look at this pymcon talk: Microbial Cell Counting in a Noisy Environment by Cameron Davidson-Pilon

Or this case study in STAN: https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html

Maybe it is in line with what you have in mind?

I am sorry if I am nudging you away from something sensible, it’s just that I have never seen such approach. Some references would certainly help.

Hi,

the reassignment of a second variable to unfolded was a typo, sorry!

Thank you for your references, I will check them!