As you can see, I have a hierarchical model in which a Dirichlet process is the prior for the weight distribution of a Gaussian mixture model.
When I calculate the gradient of logp_elemwiset wrt beta, it returns a value that is independent of the value of alpha.
This is a bit puzzling. According to https://docs.pymc.io/theano.html, in the section “How pymc3 uses Theano?”, the logp value of a variable includes its prior, so the prior term in beta's logp should be a function of alpha. In this model, alpha is a free RV, so I would expect beta.logp to raise a missing input error.
Even if beta is specified, the prior term logP(beta) in logp(beta), as defined by:
is a function of alpha.
Can someone explain what value of alpha is being assumed under the hood?
My understanding is that it does not include its prior. The logp function of a single RV (element-wise or not) is \pi(y \mid \theta), which depends only on its immediate input \theta. If \theta itself follows some distribution \theta \sim \pi(\theta \mid \gamma), you need to add that term to \pi(y \mid \theta) explicitly.
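Written out for this two-level case (a sketch using the symbols above; the joint notation on the left is my own shorthand, not something taken from the PyMC3 docs):

```latex
\log \pi(y, \theta \mid \gamma) = \log \pi(y \mid \theta) + \log \pi(\theta \mid \gamma)
```

Only the second term depends on \gamma, so a gradient with respect to \gamma is available only once that term has been added.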
From a computational perspective in PyMC3, as shown in the docs (lightly rewritten for clarity):
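# For illustration only, these functions don't actually exist
# in exactly this way!
import numpy as np
import pymc3 as pm
import theano.tensor as tt

data = np.array([0.1, -0.3, 0.7])  # placeholder observations, made up for this example
model = pm.Model()
mu = tt.scalar('mu')
mu2 = tt.scalar('mu2')
logp_mu = pm.Normal.dist(0, 1).logp(mu)       # prior term for mu
logp_mu2 = pm.Normal.dist(mu, 1).logp(mu2)    # term for mu2 given mu
logp_obs = pm.Normal.dist(mu2, 1).logp(data)  # likelihood term, depends on mu2 only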
As you can see, logp_obs depends only on mu2, which is a Theano scalar. You cannot take the gradient of logp_obs wrt mu unless you chain the logp terms together, e.g. logp_mu2 + logp_obs. This is different from, say, mu2 = function(mu), where mu2 would be a deterministic function of mu and the dependence would already be part of the graph.
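To make the point concrete without needing Theano, here is a minimal plain-Python sketch that checks the same thing with finite differences. The helper names (normal_logp, d_dmu, chained) and the data values are made up for this illustration and are not PyMC3 API.

```python
import math

def normal_logp(value, mu, sigma=1.0):
    """Log density of Normal(mu, sigma) evaluated at value."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((value - mu) / sigma) ** 2

data = [0.3, -0.1, 0.8]  # made-up observations

def logp_obs(mu, mu2):
    # Likelihood term: uses mu2 only; mu is accepted but never used.
    return sum(normal_logp(y, mu2) for y in data)

def logp_mu2(mu, mu2):
    # Term for mu2 given mu: this is the only place mu enters.
    return normal_logp(mu2, mu)

def chained(mu, mu2):
    # Chaining the two terms makes the result depend on mu.
    return logp_mu2(mu, mu2) + logp_obs(mu, mu2)

def d_dmu(f, mu, mu2, eps=1e-6):
    # Central finite difference with respect to mu.
    return (f(mu + eps, mu2) - f(mu - eps, mu2)) / (2 * eps)

grad_obs = d_dmu(logp_obs, 0.5, 1.0)      # no dependence on mu
grad_chained = d_dmu(chained, 0.5, 1.0)   # analytically mu2 - mu
print(grad_obs, grad_chained)
```

The derivative of logp_obs alone with respect to mu is zero because mu never enters it. Once logp_mu2 is added, the derivative becomes mu2 - mu.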
Another way to think of it: priors act as regularizers. Just as with regularization, if you want to regularize something you need to add the regularization term to your loss function. The same applies here: if you want to take the gradient with respect to the parameters that specify the regularization (for optimization), you first need to add the regularization term (the prior) to the loss function (the logp).
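As a small illustration of the analogy, the sketch below checks that a zero-mean Gaussian prior on a weight behaves like an L2 penalty with strength 1 / (2 * sigma^2), up to an additive constant. The names l2_penalty, w_a, and w_b and the numeric values are hypothetical, chosen only for this example.

```python
import math

def normal_logp(value, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at value."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((value - mu) / sigma) ** 2

def l2_penalty(w, lam):
    """Plain L2 regularization term."""
    return lam * w ** 2

sigma = 2.0
lam = 1.0 / (2.0 * sigma ** 2)  # penalty strength implied by the prior width

# Comparing two weights cancels the normalizing constant of the prior,
# so the change in negative log-prior should match the change in penalty.
w_a, w_b = 0.5, 1.5
diff_prior = -normal_logp(w_b, 0.0, sigma) + normal_logp(w_a, 0.0, sigma)
diff_penalty = l2_penalty(w_b, lam) - l2_penalty(w_a, lam)
print(diff_prior, diff_penalty)
```

Since lam is a function of sigma, any gradient with respect to sigma has to come from this penalty term. If the term is left out of the loss, there is nothing for that gradient to act on.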