Prior predictive sampling with transformed RV

I was playing with a Dirichlet distribution with an uninformative prior, as follows:

import numpy as np
import pymc3 as pm

with pm.Model() as histogram_model:
    weights = pm.HalfNormal('weights', shape=10, testval=np.array([0.1] * 10), transform=pm.transforms.sum_to_1)
    pm.Dirichlet('obs', a=weights, shape=10)

When I sample from the prior predictive, it seems like the transform is not applied:

with histogram_model:
    prior_trace = pm.sample_prior_predictive(var_names=['weights'])
prior_trace['weights'][0]

returns, in one example:

array([0.32092762, 0.80871371, 0.14739512, 0.4890916 , 0.31236516,
       0.88778043, 1.18220203, 2.7092601 , 1.45828804, 0.76331364])

Am I doing something wrong? Or is this to be expected - that the transform won't be applied?

And if the transform is not going to be applied, is there some way to evaluate a model like this to debug it?

You are correct: the transform is only applied to the latent "free" variable, but not to the random method, as the random method is assumed to produce random samples already in the right domain.

I think the only way is to do pm.sample() directly.
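For example, a minimal sketch of that check (assuming the model above) could be:

with histogram_model:
    trace = pm.sample(draws=500, tune=500)

# Draws produced by the sampler go through the transform, so every row
# of `weights` should sum to 1 here:
print(trace['weights'].sum(axis=-1))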

Should we treat this as a bug in sample_prior_predictive and sample_posterior_predictive? It really seems wrong to me.

If you agree, I will make an issue for this.

No, I wouldn't treat this as a bug from sample prior predictive's perspective. I think this is a consequence of the sometimes contradictory behavior of transform. I think that we should document somewhere the fact that passing transforms like this will only change sample but won't affect prior predictives.

@lucianopaz But doesn't this make the prior predictive wrong as a tool for debugging a model? It is going to give results that are outside the support of the distribution, correct?

Even worse, it seems to me, is what will happen if we use posterior predictive sampling for out-of-sample predictions (machine learning applications). In this case we are going to return predictions that are simply wrong, aren't we?


It will return wrong samples in this case. But it will return the correct samples in many other cases, like the HalfNormal, HalfCauchy, or even the Uniform distribution. In my opinion, transform is at fault here.

I see your point about the transforms being at fault, but I still think that not applying the transform works against the idea of Bayesian modeling and against the principle of understandable software.

  1. Bayesian modeling is based on using generative models and inverting them to reason from evidence. Having generative models that don't generate properly seems wrong to me.

  2. Programmers have every reason to expect that a transformed random variable will yield transformed samples under all circumstances. Having them generate transformed samples under only some circumstances is akin to having the multiplication operator sometimes do addition instead.

Changing it might be difficult, but I still think that what PyMC3 does here is wrong, and should be fixed. I can't think of a principled argument - as opposed to an expedient argument - for the current behavior.

I don't think we should change this. FWIW, what you are doing in

weights = pm.HalfNormal('weights', shape=10, testval=np.array([0.1] * 10), transform=pm.transforms.sum_to_1)

is incorrect, as HalfNormal does not produce a vector that sums to 1. What transform does is map a (vector of) values in $\mathbb{R}$ into the domain defined by the distribution.

In fact, the transformation in PyMC3 is defined in this direction only. It is different from packages like TFP, where a transformation is defined in both directions and you can apply both the forward and the inverse transformation.
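To make the direction concrete, here is a small sketch using the default log transform of positive distributions (assuming the PyMC3 3.x transforms API, where forward_val is the numpy counterpart of forward):

import numpy as np
import pymc3 as pm

log_tr = pm.transforms.log
z = log_tr.forward_val(np.array([0.5, 2.0]))  # constrained -> R (log)
x = log_tr.backward(z).eval()                 # R -> constrained (exp)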

> What transform does is map a (vector of) values in $\mathbb{R}$ into the domain defined by the distribution.
>
> In fact, the transformation in PyMC3 is defined in this direction only. It is different from packages like TFP, where a transformation is defined in both directions and you can apply both the forward and the inverse transformation.

I'm sorry: I don't follow this argument. I'm asking for the transforms to be applied in sample_posterior_predictive and sample_prior_predictive. This is only applying the transform forward, as you say, so why would it be wrong to do this?

It seems like the argument for doing this is clear: a programmer that builds a model with transformed variables, and then asks PyMC3 to generate samples from that model, will expect to see transformed variables (e.g., no truncated normal samples below the lower bound).

With all due respect, I can't see how any argument could be made for the current behavior.

What I am trying to say is that the current behavior was never intended to modify the random method of a distribution, as the random method is always assumed to produce random samples that satisfy the constraint. Thus users should not expect transform to be applied in sample_posterior_predictive and sample_prior_predictive.
It is, however, a valid feature request to have a transform applied to a distribution so that it does the correct forward and inverse transformations on both samples and the log_prob computation; for that we could introduce a new transformed distribution class like the one in TFP.
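For reference, a rough sketch of what that looks like in TFP, where the bijector affects both sampling and log_prob:

import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

# exp(Normal) really is a LogNormal here: samples are forward-transformed
# and log_prob includes the Jacobian correction.
dist = tfd.TransformedDistribution(distribution=tfd.Normal(loc=0., scale=1.),
                                   bijector=tfb.Exp())
samples = dist.sample(5)
log_probs = dist.log_prob(samples)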

I see. It seems like the prior and posterior predictive sampling are not "first class citizens" (a term that programming languages people use) in the PyMC3 language.

One thing remains unclear to me: if we have "internal" transformed variables (e.g., hyper-parameters in a hierarchical model), will predictive samples of the observed variables be only untransformed, or will they also be wrong?

E.g., if one has a truncated normal RV as a Free RV, and it is used to give the mean of a child variable, will that child variable's value be wrong, also? Or does the theano network ensure that the value that goes to the child variable will be transformed?

Like this:
I = [1, 2, 3]
μ ~ TruncatedNormal(μ=I, σ=1)
o ~ TruncatedNormal(μ=μ, σ=1)

[sorry for such a stupid example]
Now even if I apply the transform to "o" myself, the answer would still be wrong if μ is not transformed.
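For concreteness, a runnable sketch of that toy model (the lower bound of 0 and the explicit shape are my own hypothetical choices, since the example above doesn't specify them):

import numpy as np
import pymc3 as pm

I = np.array([1., 2., 3.])
with pm.Model() as toy_model:
    mu = pm.TruncatedNormal('mu', mu=I, sigma=1., lower=0., shape=3)
    o = pm.TruncatedNormal('o', mu=mu, sigma=1., lower=0., shape=3)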

I don't know if the value that Theano transmits from μ to o is going to be the transformed value, or the raw value (in this case presumably a sample from an un-truncated Normal). Which is it?

If Theano transforms variables like μ, then this can be worked around relatively easily. But if it does not, then the results of sampling could be not just a little wrong, but arbitrarily wrong.

It's kind of hard to answer just with writing; I think that a schematic will help explain what's going on. In your example with the HalfNormal, the transform will get in the way and make a separate FreeRV called __log_{your_halfnormal_name}, and the HalfNormal itself will become a TransformedRV. The dependency graph from theano will look something like this:

[schematic: theano dependency graph for the HalfNormal example]

and in general it would look like this:

[schematic: general dependency graph for transformed RVs]

In your example with the sum_to_1 transform you can ask theano to print the weights tensor computational graph like this:

>>> theano.printing.debugprint(weights)
ViewOp [id A] 'weights'   
 |Join [id B] ''   
   |TensorConstant{0} [id C]
   |Subtensor{::} [id D] ''   
   | |weights_sumto1__ [id E]
   |Elemwise{sub,no_inplace} [id F] ''   
     |InplaceDimShuffle{x} [id G] ''   
     | |TensorConstant{1} [id H]
     |InplaceDimShuffle{x} [id I] ''   
       |Sum{axis=[0], acc_dtype=float64} [id J] ''   
         |Subtensor{::} [id K] ''   
           |weights_sumto1__ [id E]

So, as you can see, pymc3 basically always puts an RV higher up the hierarchy. When you call sample, the transformed variable (weights_sumto1__) will be used as the FreeRV, and the weights value will be computed as a Deterministic. So in the end, the sampled trace will have the samples of both weights_sumto1__ and weights. In sample posterior predictive nothing very special happens; the values for each of the nodes are grabbed straight out of the trace.
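A quick way to see this is to look at what ends up in the trace (a sketch, reusing the model from the question):

with histogram_model:
    trace = pm.sample(draws=200, tune=200)

# The trace holds both the free variable and the deterministic:
print(trace.varnames)  # includes 'weights_sumto1__' and 'weights'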

The complicated part happens in sample prior predictive. What goes on is that we first try to draw a value from weights_sumto1__, which will raise a MissingInputError because we didn't provide the computational graph's input value. After that, we simply call the weights random method and return its value as it came out, because we assume that the samples from weights already lie in the proper sample space.

To get what you want (weights that are constrained to sum to one), you have to be a bit more explicit:

import pymc3 as pm
from theano import tensor as tt


with pm.Model() as histogram_model:
    weights_raw = pm.HalfNormal('weights', shape=10)
    weights = weights_raw / tt.sum(weights_raw, axis=-1, keepdims=True)
    pm.Dirichlet('obs', a=weights, shape=10)

If you're interested in storing the values of weights when you sample from your model (or draw prior predictive samples), you just have to wrap it with a Deterministic.
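As a sketch, the wrapped version of the model above could look like this (note the HalfNormal needs a different name so that 'weights' can go to the Deterministic):

import pymc3 as pm
from theano import tensor as tt


with pm.Model() as histogram_model:
    weights_raw = pm.HalfNormal('weights_raw', shape=10)
    weights = pm.Deterministic(
        'weights', weights_raw / tt.sum(weights_raw, axis=-1, keepdims=True))
    pm.Dirichlet('obs', a=weights, shape=10)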


That is very correct, and it is also the main reason why we have so many headaches with shape issues right now.

It will be correct. Consider a subgraph in a Bayesian graph (i.e., a Markov blanket):
… → A ~ distribution_1 → B ~ distribution_3(A, C) → …
C ~ distribution_2 -------------/

for the log_prob computation, you replace A in the graph so that:
… + distribution_1.log_prob(input_A) + distribution_3(input_A, input_C).log_prob(…) + …

As you can see, A is replaced by user/programmatic input in both the distribution and the value, so as long as the input is transformed into the proper constrained space defined by distribution_1 and distribution_2, the computation is valid. Note how this differs from the forward computation, where distribution_1 does not depend on the value of A, whereas in the inverse (log_prob) computation it does.
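In PyMC3 terms, this substitution is what happens when you evaluate the model's joint log-density at a point; a minimal sketch with the model from the question:

# Every free variable gets a user/programmatic input value, which is plugged
# into its own log_prob term and into the log_prob terms of its children.
point = histogram_model.test_point   # a dict of free-variable values
print(histogram_model.logp(point))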

Thank you, @junpenglao! That is a relief to know!

So the work-around here is to take the predictive trace and for any transformed RV, apply the transform. Not ideal, but definitely feasible.
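For the sum-to-one model above, one simple version of that work-around (hand-normalizing the raw draws rather than literally calling the transform's methods) might be:

with histogram_model:
    prior = pm.sample_prior_predictive()

# Post-hoc fix: apply the intended constraint to the raw HalfNormal draws.
weights = prior['weights']
weights = weights / weights.sum(axis=-1, keepdims=True)  # now sums to 1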

It would be nice to see predictive sampling become a first class citizen in PyMC4.

Thanks for the explanation, @lucianopaz.

I'm not sure how this relates to @junpenglao's answer, though.

As I understood his answer, the only point at which we have to worry about getting un-transformed values is at the bottom of the theano graph.

But if I understand you correctly, that is not true: when sampling from the prior predictive, we could end up with random samples of untransformed RVs where we expect transformed ones, and those untransformed values could be used to compute values of children. If that is the case, then any use of a transformed RV, anywhere in a PyMC3 model, can cause prior predictive sampling to fail.

For posterior predictive, things will be correct if the transformed values are captured in the trace generated by pm.sample(). But sometimes this cannot be done for out-of-sample predictions, if variables' shapes are functions of data.

If I am summarizing correctly, then any use of predictive sampling in PyMC3 is potentially risky.

Oh, and now I think I understand @junpenglao better: the transform, as he says, acts only "backwards" from the Transformed RV up to the underlying RV in the theano graph.

TBH, I still don't understand why it's the Transformed RV that gets the random method instead of the Free RV; I guess that's just the most convenient way to do things for sampling that is driven by observations.

@lucianopaz @junpenglao: I'm left with two questions:

  1. If a transformed RV does not support predictive sampling, why don't we signal an exception when the programmer tries to sample from such a distribution? We could easily make it a special exception so that the user who understands the limitation could set a handler that accepts the un-transformed samples. I'm kind of disturbed that I have been reporting results that are inaccurate for months without realizing it.

  2. Could this be fixed in draw_values? Could we check to see if a variable is transformed, and if so, collect the un-transformed value from the random method and transform it?

  1. The TransformedRV has its random method, and it is used in predictive sampling. What's wrong is the notion that passing a transform parameter to a distribution will change the values returned from said RV's random.

  2. It shouldn't be draw_values' responsibility to deal with this. We should fix the way in which TransformedRVs are built and clearly distinguish between the two cases at hand: automatically unconstraining the values of an RV in pm.sample, and transforming the values drawn from a given distribution. The first case is what we currently call transform, and I think that's wrong and confusing. The second case is what you want to get; it would apply the transform during inference and also in predictive sampling.

So am I right in thinking that this is something that should be fixed by fixing the random() method of the transformed RV?

I assume right now we just pass the call to random through to a TransformedRV's distribution, which is the un-transformed distribution. Is that correct?

Actually, I just looked and it looks like the TruncatedNormal, at least, has a special random() method that looks like it will do the right thing (at least it consults the upper and lower properties). So maybe at least some of the transformed RVs work?
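A quick empirical sketch of that check, drawing prior predictive samples from a bounded TruncatedNormal:

import pymc3 as pm

with pm.Model():
    x = pm.TruncatedNormal('x', mu=0., sigma=1., lower=-1., upper=1.)
    prior = pm.sample_prior_predictive(samples=1000)

print(prior['x'].min(), prior['x'].max())  # should stay within [-1, 1]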

I think there are still some concepts getting confused here:

  • The default transform does the right thing, because it is mapping values in the unconstrained space to the right domain
  • The default random also does the right thing, as it is producing random sample regardless of what transformation in the inference time it is doing, as long as it is volume preserved.

The problem you are describing is a user-specified transformation that changes the domain of the distribution, for example, a sum-to-1 transformation on a Uniform distribution. In this case, the random method would be wrong, as the transformation is modifying the domain instead of just unbounding the free parameter.

In a way, we are speaking about two operations here, and the current transform kwarg only deals with [1] (unbounding a constrained parameter), but is not intended to solve [2] (transforming a distribution into another), although we could use it to partially solve [2].
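A small sketch of the two operations side by side:

import pymc3 as pm

# [1] unbounding: Uniform's default interval transform lets the sampler work
#     in R, but the sample space is still (0, 1).
u = pm.Uniform.dist(0., 1.)

# [2] transforming a distribution: exp(Normal) is a genuinely different
#     distribution, with a matching log_prob -- in PyMC3 that is Lognormal,
#     not Normal plus a transform kwarg.
ln = pm.Lognormal.dist(mu=0., sigma=1.)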
