How to correctly create a GLM with PyMC

Hello,
I’m new to PyMC and I’m trying to perform a simple task: fitting a linear model in a Bayesian way. Since I’m a beginner, forgive me if this is a naive question.

I’ve been studying the documentation and came across several examples like Example 1, Example 2, Example 3, Example 4.

While I’ve made progress, some aspects still confuse me, especially how to structure my model so that I can modify the observed data later for out-of-sample predictions, using coordinates and mutable data. Let me explain:
I initially created my model following Example 1, which I understood. It looks like this:

# mu_A, std_A, mu_B, std_B, x_new, y_new and rng are defined earlier in my script
with pm.Model() as linear_model_s_t:
    # 1 - Prior knowledge

    # Intercept
    intercept = pm.LogNormal('Intercept', mu=mu_B, sigma=std_B)

    # Slope
    slope = pm.LogNormal('Slope', mu=mu_A, sigma=std_A)

    # Standard deviation of the likelihood Y ~ N(A*x + B, sigma**2)
    sigma = pm.HalfNormal('sigma', sigma=10)

    # 2 - Estimate the mean, which will be my Y
    # Y = A*x + B
    mu = slope * x_new + intercept

    # 3 - Define the likelihood
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y_new)

    # 4 - Sampling
    trace = pm.sample(2000, tune=2000, chains=4, cores=2, random_seed=rng)

    # 5 - Generate prior and posterior predictive samples
    prior_pred = pm.sample_prior_predictive(samples=1000, random_seed=rng)
    post_pred = pm.sample_posterior_predictive(trace, random_seed=rng)

Then I came across Example 2, which uses the “babies” dataset and seems more similar to what I intend to do. However, it raised some questions:

with pm.Model(
    coords_mutable={"obs_idx": np.arange(len(data))}, coords={"parameter": ["intercept", "slope"]}
) as model_babies:
    mean_params = pm.Normal("mean_params", sigma=10, dims=["parameter"])
    sigma_params = pm.Normal("sigma_params", sigma=10, dims=["parameter"])
    month = pm.MutableData("month", data.Month.values.astype(float), dims=["obs_idx"])

    mu = pm.Deterministic("mu", mean_params[0] + mean_params[1] * month**0.5, dims=["obs_idx"])
    sigma = pm.Deterministic("sigma", sigma_params[0] + sigma_params[1] * month, dims=["obs_idx"])

    length = pm.Normal("length", mu=mu, sigma=sigma, observed=data.Length, dims=["obs_idx"])

    idata_babies = pm.sample(random_seed=RANDOM_SEED)
  • I understand that I set the mutable coordinate, 'obs_idx', and the one that will not change, 'parameter'. But I don't understand how PyMC knows which parameter is which in mean_params = pm.Normal("mean_params", sigma=10, dims=["parameter"]). How does it determine whether an entry is the 'intercept' or the 'slope'?
  • I don't understand the roles of mean_params and sigma_params. Based on mu, they seem to correspond to the intercept and slope.
  • I don't understand why there are two sigma parameters.
  • How should I set the sigma values? I know they should be positive, so using a half-normal distribution is okay. But what about the specific value of sigma? I've set it to 10, but only because all the examples use 10. How can I determine this value correctly? I understand the sigma for the slope and intercept, but not the one for mu.

Nothing is ever simple! Here are some (attempted) answers:

Coords

When you use coords, PyMC will infer the shape of an RV based on the length of the coords. So in the example, mean_params has shape 2, because it has dims=["parameter"], and len(coords['parameter']) is 2.
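For instance, a minimal illustration of the shape inference:

import pymc as pm

with pm.Model(coords={"parameter": ["intercept", "slope"]}) as m:
    mean_params = pm.Normal("mean_params", sigma=10, dims=["parameter"])

print(mean_params.eval().shape)  # (2,) -- inferred from len(coords["parameter"])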

Once PyMC knows the shape, the rest is just labeling. Consider the example where you don’t use dims, like this:

    mean_params = pm.Normal("mean_params", sigma=10, shape=(2,)) 

In this case, how does PyMC “know” which param is which? It doesn’t – they are what they do. Looking down at the mu computation:

    mu = pm.Deterministic("mu", mean_params[0] + mean_params[1] * month**0.5, dims=["obs_idx"])

We see that the first parameter is used as an intercept, and the second is used as a slope coefficient. So that’s what they are. The same is true if you label them with dims; the labels just give you an extra way to organize and keep track of things, especially when doing post-estimation graphing/analysis.
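For example, with dims and coords attached, you can select a parameter by its label instead of its index after sampling:

    idata_babies.posterior["mean_params"].sel(parameter="slope")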

Params

In this model, there are two linear models: one for the mean, and one for the standard deviation. This is an example of a heteroskedastic model. Basically, we are saying that we want the model to be less and less certain about its predictions as the month variable increases.
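For contrast, a sketch of the homoskedastic version of the same model, with a single sigma shared by every observation (reusing data from the example above; model_homo is just an illustrative name):

with pm.Model(
    coords_mutable={"obs_idx": np.arange(len(data))}, coords={"parameter": ["intercept", "slope"]}
) as model_homo:
    mean_params = pm.Normal("mean_params", sigma=10, dims=["parameter"])
    sigma = pm.HalfNormal("sigma", sigma=10)  # one sigma for all observations
    month = pm.MutableData("month", data.Month.values.astype(float), dims=["obs_idx"])

    mu = pm.Deterministic("mu", mean_params[0] + mean_params[1] * month**0.5, dims=["obs_idx"])

    length = pm.Normal("length", mu=mu, sigma=sigma, observed=data.Length, dims=["obs_idx"])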

Sigma

See above for why there are two sigma_params. To reiterate: instead of having a single sigma value that describes all the data (the homoskedastic case), we want the errors of the model to vary along the month dimension. Perhaps you can think of sigma_params[0] as the usual homoskedastic sigma, and sigma_params[1] * month as an uncertainty offset that grows with month?

Setting priors

It’s quite hard, eh? First, you should use scientific knowledge. The mean is perhaps easier to think about. Your model is a + b * month**0.5, so you can think about the largest value of month you’re going to feed into the model, then think about values of a and b that would lead to reasonable lengths for babies. Here we set sigma=10, which implies that both a and b will have 95% of their prior mass between about -20 and 20; among other things, this means we can have babies with negative length. Pretty bad!
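If you want to check an interval like that directly, scipy can do it:

    from scipy import stats
    stats.norm.interval(0.95, loc=0, scale=10)  # approximately (-19.6, 19.6)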

In more complex models, reasoning through all of this by hand isn’t feasible, so you should use pm.sample_prior_predictive to run prior predictive simulations. These show you fake data generated from your priors alone, and help you tune the priors to keep the implied data in a reasonable range.
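With the babies model above, that might look like this:

import arviz as az

with model_babies:
    prior_pred = pm.sample_prior_predictive(samples=500, random_seed=RANDOM_SEED)

az.plot_ppc(prior_pred, group="prior")  # fake lengths implied by the priors alone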

Remember, though, that the goal is to get the model to produce plausible data, not your data. It’s not a fitting exercise! Doing prior simulations to get your model into a reasonable parameter space is an important step in the Bayesian workflow (but one that often gets skipped; ask me how I know…)
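Out-of-sample predictions

Finally, since your original goal was modifying the observed data for out-of-sample predictions: because month is MutableData and obs_idx is a mutable coord, you can swap in new inputs after sampling and draw from the posterior predictive. A minimal sketch (the new_months values are made up):

new_months = np.array([12.0, 18.0, 24.0])  # hypothetical new inputs

with model_babies:
    pm.set_data({"month": new_months}, coords={"obs_idx": np.arange(len(new_months))})
    preds = pm.sample_posterior_predictive(idata_babies, predictions=True, random_seed=RANDOM_SEED)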


Your answer is so comprehensive! Thank you! The section about setting priors helped me a lot, and I feel fortunate to have had you answer my question. I’m still deciding whether to opt for the heteroskedastic model or keep things simple, but it’s really useful to grasp both concepts. Once again, thank you.
