I am trying to fit my simple linear model with the Gamma distribution likelihood function which requires mu ranges to be 0 to 1.

My simple code is:

def fn(a,b,x):
y_hat = a + b*x
return y_hat
with pm.Model() as model:
a = pm.HalfNormal('a', 10)
b = pm.HafNormal('b', 10)
mu = fn(a,b,x)
pm.Gamma('obs', mu = mu, sigma = sigma, observed=y)
pm.sample(4000, tune=2000, random_seed=rng, chains=2, target_accept=0.99 , init="advi")

With this model setting, if my mu, the fn(a,b,x) produces larger than 1 or smaller than 0, it fails to sample.

Could you please advise how to prevent this possible domain error?
Is there any method that I can penalize loss whenever fn produces a larger than 1 and smaller than 0 value? Something like:

The pm.Potential approach you propose works, but it will introduce discontinuities into the geometry of the likelihood function. This makes it harder on the NUTS sampler. So I would use that as a last resort.

If you know you want mu to be between 0 and 1, you could just apply the sigmoid (inverse logit) function to the output of fn, like:

As a rule of thumb, it’s better to let the sampler propose unconstrained values, then transform those values to meet your constraints, rather than trying to build in hard constraints.

I see. for pm.invlogit(mu), do I also need to transform y with the sigmoid function?

My question is, if we do not transform observed y with sigmoid, how can we make an inference with our parameters?

For example,

case 1)
with pm.Normal(‘obs’, mu=ax, 1, observed=y), it is easy to make an inference with estimated a. If a is around 0.5, we can say that y is always a half much as x.

case 2)
On the other hand, if we make, pm.Normal(‘obs’, mu=pm.invlogit(ax), 1, observed=y), it may be impossible to make an inference like case 2 (am I wrong?)

case 3)
If we do pm.Normal(‘obs’, mu=pm.invlogit(ax),1, observed=sigmoid_y), we may be able to make the same inference as case 1).

Am I misunderstanding the process?

To make a better inference and fit the model with a better likelihood function, do I need to pursue the case 3)?

In all cases, you can answer these questions by taking partial derivatives of the expected value of \hat{y_i} given your model.

In the normal case, the EV of \hat{y_i} is just \mu, so if you have a linear function for \mu, then \frac{\partial \hat y_i}{\partial x_i} = a, so a unit change in x_i corresponds to a change of size a to your prediction for the i-th object in your data.

If you apply a sigmoid, you end up with \mathbb E[\hat y_i] = \frac{1}{1+e^{-ax_i}}, so the partial derivative w.r.t x_i will end up depending on y_i: \frac{\partial \hat y_i}{\partial x_i} = \frac{e^{-ax_i}}{(1 + e^{-ax_i})^2} = a \hat y_i \frac{1}{1 + e^{ax_i}}

Here it’s not impossible to interpret the output, but it is difficult, because 1) the size of the effect of a change in x_i depends on the value of \hat{y}_i, and 2) the effect of a change in x_i is non-linear. So it’s not so neat and tidy, but people do it. You could evaluate it at meaningful points (mean?, median?, mode?), or for values of x_i corresponding to particular objects of interest in your data set.

Note that transforming the data has nothing to do with any of this. Once you leave linear relationships, interpreting coefficients gets more difficult and there’s not a lot you can do about it.

That being said, you might explore different transformations of your data to make them more normal and avoid these issues all together.

Thanks, Jesse. Yes, I agreed that the trade-off between transformation and interpretability is common. But as you are aware, in many cases, transforming data into the Normal dist. is almost impossible.

A follow-up question arose with the case that we cannot transform our observed data into the Normal dist. For example, if we use different likelihood functions, such as Beta distribution for the likelihood function, the mean would be \alpha\over(\alpha + \beta). In this case, would the EV of \hat{y}_i be the \alpha\over(\alpha + \beta)?
In addition, with the Beta distribution, (without any data transformation), the partial derivative w.r.t. x_i is still {\partial \hat{y}_i \over \partial x_i} =a, so in Beta distribution likelihood case, how can we interpret a? Can we still assert that “a unit change in x_i corresponds to a change of size a to my prediction for the i-th object in my data” with the Beta likelihood distribution?

If your likelihood is beta distributed, you are in the so-called Beta-regression case from the GLM literature. You could look at, for example, research like this for some insight on how to interpret the parameters.