Poor inference results with hierarchical model


after playing around with PyMC for a while I am now trying to apply PyMC v5 to a real-world problem. However, even if I think the model is pretty simple, no matter what I do, sampling always fails (one or more variables have RSS values in the range 1.1 to 3, effective sample sizes are sometimes <<100, I see patterns in trace plots and a lot divergences using NUTS).

The problem can be stated as follows: Find the parameters of a quadratic regression function


The prior information I want to encode:

  • The target variable y does not have any uncertainty, but the independent variable x has.
  • x\in\left[0,1\right], y may be unconstraint.
  • The vertex x_{0} of the graph must be in the interval \left[\frac{1}{2},1\right] and it must be the global maximum and there are no observations expected with x values greater than the vertex. (I.e., if x_{0} is 0.7, all possible x-values are expected within \left[0,0.7\right].)
  • This is a hierarchical model: Each group of data has its own vector of coefficients, \beta=\left(\beta_{0},\beta_{1},\beta_{2}\right), as well vertex x_{0}.

The following relations can be derived from this:

  • \beta_{2}<0
  • \beta_{1}=-2x_{0}\beta_{2}

The current state of the model looks like this, which tries to encode the prior information via the variables x_{0} and \beta_{2}.

q = 0.5
n_obs = len(x)
n_groups = len(grps)

# initial values
beta_2_init, beta_1_init, beta_0_init = np.polyfit(x, y, 2)
x0_init = -0.5*beta_1_init/beta_2_init
x1_init = x0_init + 0.5*h1/beta_2_init

with pm.Model() as model:
    # x0
    theta_0_alpha = pm.Uniform("theta_0_alpha", lower=0.1, upper=100, initval=16.7631)
    theta_0_beta = pm.Uniform("theta_0_beta", lower=2, upper=100, initval=5.5099)
    theta_0 = pm.Beta("theta_0", alpha=theta_0_alpha, beta=theta_0_beta, shape=n_groups)
    x0 = pm.Deterministic("x0", q + theta_0*(1 - q))
    x0_ = x0 - x_offset

    # beta_2
    mu_beta_2 = pm.Uniform("mu_beta_2", lower=np.max([0, -0.5*beta_2_init]), upper=-1.5*beta_2_init)
    sigma_beta_2 = pm.HalfNormal("sigma_beta_2", sigma=50)
    beta_2_neg = pm.InverseGamma("beta_2_neg", mu=mu_beta_2, sigma=sigma_beta_2, shape=n_groups)
    beta_2 = pm.Deterministic("beta_2", -beta_2_neg)

    # x
    sigma_sigma_x = 0.05
    sigma_x = pm.HalfNormal("sigma_x", sigma_sigma_x)
    x_lat = pm.Uniform("x_lat", lower=-x_offset*np.ones(n_obs) - sigma_sigma_x, upper=x0_[grp_idx] + sigma_sigma_x)
    x_obs = pm.Normal("x_obs", mu=x_lat, sigma=sigma_x, observed=x)

    # y
    sigma_beta_0 = pm.HalfNormal("sigma_beta_0", 100)
    beta_1 = pm.Deterministic("beta_1", -2*x0_*beta_2)
    beta_0 = pm.Normal("beta_0", mu=beta_0_init, sigma=sigma_beta_0, shape=n_groups)
    y_pred = pm.Deterministic("y_pred", beta_0[grp_idx] + beta_1[grp_idx] * x_lat + beta_2[grp_idx] * x_lat**2)
    y_obs = pm.Normal("y_obs", mu=y_pred, sigma=5, observed=y)

Here x_offset is the mean of the original x values of all observations and groups. I.e., I the model works on the shifted x values x - x_offset in order to be able to set the prior mean of \beta_{0} to zero (otherwise I would have no specific prior knowledge about this parameter).

Please note that I could not find out how to encode that y is actually noise-free, so I decided to give it at least a bit of Gaussian noise. This seems to be important for sampling, because setting the variance of y_pred to, e.g., 0.001, the results are even worse.


with model:
    trace = pm.sample(3000, tune=6000, chains=4, cores=4, nuts_sampler_kwargs={"target_accept": 0.99})

I don’t know what else to try and would appreciate any help. Thanks.

I tried playing with the problem for a little bit. I don’t think I’ll be able to make much progress unless I have a clearer idea of the underlying data or, at least, a way to synthetically generate it.

I will say that one possible reason you might be getting loads of divergences is that Uniform distributions can be pretty rough on your sampler. NUTS likes it when the posterior is shaped like a bowl (or a high dimensional bowl). A uniform distribution is more like a pit. There isn’t any natural prior information to slow your sampler as it gets close to the boundaries. You might try soft-encoding your prior information but setting up normal distributions where the bulk of the prior mass falls in the ranges you like but not all it. Or gamma distributions if the parameter has to be positive.


In addition to @daniel-saunders-phil 's comments, I’d add that I don’t understand what it means for y to be deterministic if it’s a function of random variables. If it’s really true, you can just stop your model at the y_pred = ... line, removing the likelihood term related to y, since it isn’t a random variable.

1 Like

Thank you both for your responses.

@daniel-saunders-phil: I followed your suggestion and replaced all uniform distributions with either gamma or beta distributions. The results are much better now, but still I get a lot of divergences. Also, tuning with 6000 and sampling 3000 values takes ~20 minutes on an i7-8750H (but actually I don’t now what sampling times to expect here if everything runs perfectly.) I have prepared a data set and a complete Jupyter notebook (please see attachment). (My setup with this notebook running without any problems: Win10, VS Code, PyMC v5.3.0.)

data.csv (36.6 KB)
Hierarchical_Model_Forum.csv (573.6 KB)

@jessegrabowski I wrote the posterior as

\begin{array}{ccc} p\left(\beta,\psi|y,x\right) & \propto & p\left(x,y|\beta,\psi\right)p\left(\beta,\psi\right)\\ & = & p\left(y|x,\beta\right)p\left(x|\psi\right)p\left(\beta,\psi\right) \end{array}

where \beta is the vector of coefficients I am particularly interested in and \psi represents all other variables. Since y is determinstic, I wanted to write something like a delta “function”:


Otherwise, I wouldn’t know how to integrate the observed y values into the model specification.

I would recommend using sample_prior_predictive to show what sort of constraints you are putting on your model. Also, its never a bad idea to normalize your data inputs.