ArviZ plot_hdi for a logarithmic predictor variable in log scale

Hi everyone,

I wanted to apply Bayesian linear regression to data where the predictor variable is log-scaled. For the result, I wanted to see the uncertainty in the mean, also using az.plot_hdi().

Using az.plot_hdi(), I don’t get the results I would have expected.

Here are my attempts to understand it:

Linear set-up based on the example: GLM: Linear regression — PyMC 5.9.1 documentation
I started by trying az.plot_hdi() in addition to the az.plot_lm() already used there.
Basically this worked, and I got a result I would have expected:
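The combination I used there was roughly the following (a sketch; x and idata are the variables from the linked example, where y_model has already been added to the posterior):

import arviz as az
import matplotlib.pyplot as plt

# sketch: x and idata come from the linked linear-regression example,
# where idata.posterior["y_model"] = Intercept + slope * x has already been added
fig, ax = plt.subplots(figsize=(7, 5))
az.plot_lm(idata=idata, y="y", x=x, y_model="y_model", axes=ax)
az.plot_hdi(x, idata.posterior["y_model"], hdi_prob=0.95, color="red",
            fill_kwargs={"label": "95% HDI"}, ax=ax)
ax.legend()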

Then I wanted to work with the logarithmic data and changed the example set-up a little bit:

size_ln = 50 # smaller dataset for more uncertainty around mean
true_intercept_ln = 1
true_slope_ln = 2

x_ln = np.random.lognormal(size=size_ln)  # instead of np.linspace(), for nicer data spacing
# y = a + b*log10(x)
true_regression_line_ln = true_intercept_ln + true_slope_ln * np.log10(x_ln)
# add noise like in example
y_ln = true_regression_line_ln + rng.normal(scale=0.5, size=size_ln)

data_ln = pd.DataFrame(dict(x=x_ln, y=y_ln))

With the following slightly changed model (logarithm in the likelihood):

with Model() as model_ln:  # model specifications in PyMC are wrapped in a with-statement
    # Define priors
    sigma = HalfCauchy("sigma", beta=10)
    intercept = Normal("Intercept", 0, sigma=20)
    slope = Normal("slope", 0, sigma=20)

    # Define likelihood
    likelihood = Normal("y", mu=intercept + slope * np.log10(x_ln), sigma=sigma, observed=y_ln)

    # Inference!
    # draw 3000 posterior samples using NUTS sampling
    idata_ln = sample(3000)

For the result, I wanted to use the same kind of graphic as above. For my use cases we usually show the x-axis log-scaled, so I tried that too; this is displayed on the left side. On the right side I left the axis linearly scaled and plotted the x values as log values (reduced code below).

# log-scaled x-axis
az.plot_lm(idata=idata_ln, x=x_ln, y="y", num_samples=30, axes=ax[0], y_model="y_model",
           y_kwargs={"marker": "x", "markersize": 2},
           y_model_mean_kwargs={"color": "green", "lw": 2}, y_model_fill_kwargs={"alpha": 0.01})
az.plot_hdi(x_ln, idata_ln.posterior["y_model"], color="red",
            fill_kwargs={"label": "95% HDI"}, hdi_prob=0.95, ax=ax[0])
ax[0].set_xscale("log")

# linear x-axis, x values plotted as log10(x)
az.plot_lm(idata=idata_ln, x=np.log10(x_ln), y="y", num_samples=30, axes=ax[1], y_model="y_model",
           y_kwargs={"marker": "x", "markersize": 2},
           y_model_mean_kwargs={"color": "green", "lw": 2}, y_model_fill_kwargs={"alpha": 0.01})
az.plot_hdi(np.log10(x_ln), idata_ln.posterior["y_model"], color="red",
            fill_kwargs={"label": "95% HDI"}, hdi_prob=0.95, ax=ax[1])

I would have expected a similar result in both panels. Therefore I tried working with the posterior myself, and there the two panels do look similar:

ida = az.extract(idata_ln, group="posterior", combined=True, num_samples=30)
x = np.array([0.1, 10])  # two points are enough for a straight line in log10(x)

# log-scaled x-axis
for i in range(30):
    ax[0].plot(x, ida["Intercept"][i].values + ida["slope"][i].values * np.log10(x),
               color="black", lw=0.5)

mu = idata_ln.posterior["Intercept"].mean().values + idata_ln.posterior["slope"].mean().values * np.log10(x)
ax[0].plot(x, mu, color="green", lw=2)
ax[0].set_xscale("log")

# linear x-axis, x values plotted as log10(x)
for i in range(30):
    ax[1].plot(np.log10(x), ida["Intercept"][i].values + ida["slope"][i].values * np.log10(x),
               color="black", lw=0.5)
ax[1].plot(np.log10(x), mu, color="green", lw=2)

I worked with:
PyMC v5.6.1
ArviZ v0.16.1

Maybe I have made a fundamental thinking error regarding this transformation set-up, or my use of the methods is not appropriate. I really hope someone can spot my mistake(s).
Thanks in advance!

I’m not 100% sure what your intended output is, but maybe knowing a bit about how to do DIY plot_lm and plot_hdi will be helpful?

To make nice plots, it can be helpful to swap out the data you fit the model on with sorted dummy data. An example of this is shown in the data containers how-to.

In your case I made some small changes to your model:

  • I set x_data to be a pm.MutableData.
  • I linked the observed variable’s shape to the shape of the input data.
  • I saved mu as a pm.Deterministic, because I knew I’d want to use it later.

I also re-defined x_ln to be draws from rng.normal instead of draws from rng.lognormal, because otherwise the variable should be called x_exp, and it was driving me nuts to take the log of a variable whose name already says log.
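Concretely, the data-generation change was something like this (a sketch; I’m assuming y_ln is regenerated from the new x_ln directly, since the model below no longer takes a log):

# sketch: x is now already on the "log" scale, so the model uses it directly
x_ln = rng.normal(size=size_ln)

# assumption: regenerate y_ln from the new x_ln (no log10 needed anymore)
true_regression_line_ln = true_intercept_ln + true_slope_ln * x_ln
y_ln = true_regression_line_ln + rng.normal(scale=0.5, size=size_ln)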

Here’s the model after my changes:

with pm.Model() as model_ln:  # model specifications in PyMC are wrapped in a with-statement
    x_data = pm.MutableData('x', x_ln)
    
    # Define priors
    sigma = pm.HalfCauchy("sigma", beta=10)
    intercept = pm.Normal("Intercept", 0, sigma=20)
    slope = pm.Normal("slope", 0, sigma=20)

    mu = pm.Deterministic('mu', intercept + slope * x_data)
    
    # Define likelihood
    likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y_ln,
                           shape=x_data.shape[0])

    # Inference!
    # draw 3000 posterior samples using NUTS sampling
    idata_ln = pm.sample(3000)

Now you have the parameter estimates. Next, we want to make an evenly-spaced grid of data and generate predictions on that. Use pm.set_data:

with model_ln:    
    x_grid = np.linspace(x_ln.min(), x_ln.max(), 100)
    pm.set_data({'x':x_grid})
    idata_spaced = pm.sample_posterior_predictive(idata_ln, predictions=True)

Now we’re ready to plot. We need the mean and HDIs of idata_spaced. I don’t like to use az.extract here, because az.hdi requires the chain and draw dimensions, which extract flattens.

One trick I like with ax.fill_between is to use *hdi.values.T. The array returned by az.hdi has dimensions (y_obs_dim_1, hdi_low_high), i.e. shape (n_obs, 2). So if you transpose and unpack it, you get two arrays, one for the lower bound and one for the upper bound, which is exactly what ax.fill_between expects. I also use this “trick” to get rng.choice to sample some regression lines for me (choice samples from the left-most dimension if you give it an array).

mu = idata_spaced.predictions.y.mean(dim=['chain', 'draw'])
hdi = az.hdi(idata_spaced.predictions).y

# I use .stack here, could also use `az.extract` to do the stacking
reg_line_samples = rng.choice(idata_spaced.predictions.stack(sample=['chain', 'draw']).mu.values.T,
                              size=30)

fig, ax = plt.subplots(1, 2, figsize=(14,4))
for axis, scale in zip(fig.axes, ['linear', 'log']):
    axis.plot(x_grid, mu, label='Posterior Mean', zorder=100, lw=2)
    axis.fill_between(x_grid, *hdi.values.T, alpha=0.25, label='HDI 94%')
    axis.scatter(x_ln, y_ln, label='Observed')
    for sample in reg_line_samples:
        line, = axis.plot(x_grid, sample, color='0.5', lw=0.5)

    # Trick to set only 1 entry in the legend (if you set label in the loop you'll get 30)
    line.set_label('Posterior Regression Sample')
    axis.set(xscale=scale, title=scale.title())
    
ax[0].legend()
plt.show()

Output:


Thank you very much for your (very fast) answer! Unfortunately I have only now had time to play around with the code.

It is not exactly what I wanted but it helped me anyway. You were right that it was helpful to know how to DIY the plots.

I work with some data (laboratory results) for which there are relationships in the literature. The relationships are based on a linear regression where the independent variable is logarithmically transformed (such as here: https://kenbenoit.net/assets/courses/me104/logmodels2.pdf).

Therefore I assume for my observed y: Y_i = α + β·log(X_i) + ε_i.
So the linearly scaled plot shows a nonlinear relationship, while the log-linear scaled plot shows a linear relationship.
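In code, the model form I’m after looks roughly like this (a sketch under my assumptions: x_ln stays on its original scale, the log10 is taken inside the model via pytensor, and model_loglin / idata_loglin are just placeholder names):

import pymc as pm
import pytensor.tensor as pt

# sketch: log-linear regression, Y_i = alpha + beta * log10(X_i) + eps_i
with pm.Model() as model_loglin:
    x_data = pm.MutableData("x", x_ln)

    sigma = pm.HalfCauchy("sigma", beta=10)
    intercept = pm.Normal("Intercept", 0, sigma=20)
    slope = pm.Normal("slope", 0, sigma=20)

    # the log10 transform now lives inside the model
    mu = pm.Deterministic("mu", intercept + slope * pt.log10(x_data))

    likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y_ln,
                           shape=x_data.shape[0])

    idata_loglin = pm.sample(3000)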

I adapted your code for this model and it worked for me (see graphic below). But I think I had to add var_names=["mu", "y"] when sampling the posterior predictive, otherwise the code for reg_line_samples did not work.
Additionally, in analogy, I added hdi_mu = az.hdi(idata_spaced.predictions).mu for the credible interval of the mean, and it matched the sampled regression lines.
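Roughly, the extra pieces were the following (a sketch; it reuses the grid and fill_between ideas from your answer, and model_loglin / idata_loglin are the placeholder names from the model sketch above):

# sketch: keep "mu" in the predictions so reg_line_samples and hdi_mu work
with model_loglin:
    x_grid = np.linspace(x_ln.min(), x_ln.max(), 100)
    pm.set_data({"x": x_grid})
    idata_spaced = pm.sample_posterior_predictive(
        idata_loglin, var_names=["mu", "y"], predictions=True
    )

# credible interval of the mean, analogous to the HDI of y
hdi_mu = az.hdi(idata_spaced.predictions).mu
# e.g. axis.fill_between(x_grid, *hdi_mu.values.T, alpha=0.25, label="HDI 94% (mean)")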

When I tried to get this right-hand graphic with the az.plot_hdi method, I got the strange result displayed on the left side of the second graphic in my first question.

But I think I will use the method you suggested in the future. Thank you very much!
