Inconsistencies w/ R in Discrete Choice and Random Utility PyMC example?

Greetings,

I am relatively new to pymc and am trying to construct a discrete choice model of heating system adoption. I found this very relevant and helpful example. However, I noticed that the parameter estimates in this example end up being quite different from those of the R mlogit package (here), which uses the same dataset.

In particular, the Basic Model in the pymc example generates a positive coefficient value for the installed cost predictor (beta_ic=0.002, sd=0 in pymc vs. beta_ic=-0.00623187, std. error=0.00035277 in the analogous R result). Even after improving the model specification in the Improved Model, the coefficient remains positive and close to zero (while continuing to be negative in the mlogit example). The author attempts to justify the result, but I suspect something is wrong with the model specification, because it doesn’t make sense for increased equipment installed cost to translate into increased probability of adoption (and again, this is not the case in the mlogit model).

Can anyone help diagnose what is going on, and whether this model might be misspecified? (I noted another thread in which model misspecification resulted in similar issues, but I have been unable to translate the fix to this case.)

Thank you in advance!

Welcome! Perhaps @Nathaniel_Forde can help as he developed this notebook.

Hi JTL,

Thanks for the feedback on the notebook. You’re correct that the results in the early models are in contradiction with theory. The idea of the notebook was to show escalating complexity in the models and arrive at the point in the final model where you use priors to constrain the estimates more effectively.

I suspect you could probably improve things for the earlier models by having tighter priors on the beta parameters. I was using Normal(0, 1) in this context, and that seems quite wide in retrospect.

Alternatively, you could use a negatively constrained prior, as I show in the last model, where you force the price effect to be realised as a negative value.
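
For instance, a minimal sketch of what each option could look like (the scales here are my own illustrative choices, not the values used in the notebook):

import pymc as pm

with pm.Model():
    # Option 1: a much tighter prior than Normal(0, 1) on the price coefficient,
    # shrinking the estimate towards zero rather than letting it drift positive.
    beta_ic_tight = pm.Normal("beta_ic_tight", 0, 0.01)

    # Option 2: one way to force a negative installed-cost effect is to negate a
    # HalfNormal, so the prior only has support on (-inf, 0].
    beta_ic_neg = pm.Deterministic("beta_ic_neg", -pm.HalfNormal("beta_ic_raw", 0.01))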

Finally, for a Bayesian model I would generally focus more on the range of the posterior than on a comparison of point estimates with R. The estimation methods are different, and the Bayesian fit incorporates the uncertainty I’ve embedded in the prior setting. So to resolve this, I’d play around with how you set the priors on the price coefficient and perform a kind of sensitivity analysis.
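
As a rough sketch of what that sensitivity analysis could look like (this assumes wide_heating_df is loaded as in the notebook; the helper function and prior scales below are just illustrative):

import arviz as az
import pandas as pd
import pymc as pm

observed = pd.Categorical(wide_heating_df["depvar"]).codes
alts = ["ec", "er", "gc", "gr", "hp"]


def fit_basic_model(prior_sd):
    """Refit the basic utility model with a given prior sd on the price coefficients."""
    with pm.Model():
        beta_ic = pm.Normal("beta_ic", 0, prior_sd)
        beta_oc = pm.Normal("beta_oc", 0, prior_sd)
        utilities = [
            beta_ic * wide_heating_df[f"ic.{a}"] + beta_oc * wide_heating_df[f"oc.{a}"]
            for a in alts
        ]
        p = pm.math.softmax(pm.math.stack(utilities).T, axis=1)
        pm.Categorical("y_cat", p=p, observed=observed)
        return pm.sample(random_seed=101)


# Compare how the posterior for the installed-cost coefficient moves as the prior tightens.
for sd in [1.0, 0.1, 0.01]:
    idata = fit_basic_model(sd)
    print(f"prior sd = {sd}")
    print(az.summary(idata, var_names=["beta_ic", "beta_oc"]))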

Actually, on second thought, I think there is an error in the model specification that needs a fix.

The issue relates to the zeroed-out data for the hp alternative. Model identification requires that we zero out parameters that make it hard for the model to sample, but since the first model here only has fixed beta coefficients for each of the covariates, that is not an issue and we can include the hp alternative.

import pandas as pd
import pymc as pm

# wide_heating_df is the wide-format heating dataset, loaded as in the example notebook
N = wide_heating_df.shape[0]
observed = pd.Categorical(wide_heating_df["depvar"]).codes
coords = {
    "alts_probs": ["ec", "er", "gc", "gr", "hp"],
    "obs": range(N),
}

with pm.Model(coords=coords) as model_1:
    beta_ic = pm.Normal("beta_ic", 0, 1)
    beta_oc = pm.Normal("beta_oc", 0, 1)

    ## Construct Utility matrix and Pivot
    u0 = beta_ic * wide_heating_df["ic.ec"] + beta_oc * wide_heating_df["oc.ec"]
    u1 = beta_ic * wide_heating_df["ic.er"] + beta_oc * wide_heating_df["oc.er"]
    u2 = beta_ic * wide_heating_df["ic.gc"] + beta_oc * wide_heating_df["oc.gc"]
    u3 = beta_ic * wide_heating_df["ic.gr"] + beta_oc * wide_heating_df["oc.gr"]
    u4 = beta_ic * wide_heating_df["ic.hp"] + beta_oc * wide_heating_df["oc.hp"]
    s = pm.math.stack([u0, u1, u2, u3, u4]).T

    ## Apply Softmax Transform
    p_ = pm.Deterministic("p", pm.math.softmax(s, axis=1), dims=("obs", "alts_probs"))

    ## Likelihood
    choice_obs = pm.Categorical("y_cat", p=p_, observed=observed, dims="obs")

    idata_m1 = pm.sample_prior_predictive()
    idata_m1.extend(
        pm.sample(nuts_sampler="numpyro", idata_kwargs={"log_likelihood": True}, random_seed=101)
    )
    idata_m1.extend(pm.sample_posterior_predictive(idata_m1))

pm.model_to_graphviz(model_1)

This gives us:

Then, when we get to the expanded model with the alternative-specific intercepts, we have to zero out the problematic parameters, which in this case is just the alternative-specific intercept for the hp alternative:

N = wide_heating_df.shape[0]
observed = pd.Categorical(wide_heating_df["depvar"]).codes
coords = {
    "alts_probs": ["ec", "er", "gc", "gr", "hp"],
    "alts_intercepts": ["ec", "er", "gc", "gr"],
    "obs": range(N),
}

with pm.Model(coords=coords) as model_2:
    beta_ic = pm.Normal("beta_ic", 0, 1)
    beta_oc = pm.Normal("beta_oc", 0, 1)
    alphas = pm.Normal("alpha", 0, 1, dims="alts_intercepts")

    ## Construct Utility matrix and Pivot
    u0 = alphas[0] + beta_ic * wide_heating_df["ic.ec"] + beta_oc * wide_heating_df["oc.ec"]
    u1 = alphas[1] + beta_ic * wide_heating_df["ic.er"] + beta_oc * wide_heating_df["oc.er"]
    u2 = alphas[2] + beta_ic * wide_heating_df["ic.gc"] + beta_oc * wide_heating_df["oc.gc"]
    u3 = alphas[3] + beta_ic * wide_heating_df["ic.gr"] + beta_oc * wide_heating_df["oc.gr"]
    u4 = 0 + beta_ic * wide_heating_df["ic.hp"] + beta_oc * wide_heating_df["oc.hp"]  # hp intercept fixed at 0 for identification
    s = pm.math.stack([u0, u1, u2, u3, u4]).T

    ## Apply Softmax Transform
    p_ = pm.Deterministic("p", pm.math.softmax(s, axis=1), dims=("obs", "alts_probs"))

    ## Likelihood
    choice_obs = pm.Categorical("y_cat", p=p_, observed=observed, dims="obs")

    idata_m2 = pm.sample_prior_predictive()
    idata_m2.extend(
        pm.sample(nuts_sampler="numpyro", idata_kwargs={"log_likelihood": True}, random_seed=101)
    )
    idata_m2.extend(pm.sample_posterior_predictive(idata_m2))

pm.model_to_graphviz(model_2)

which gives us an estimate more in line with mlogit, as you report.
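
For reference, a quick way to compare the posterior against the mlogit point estimates (assuming ArviZ is available; idata_m2 is the trace sampled in the block above):

import arviz as az

# Posterior means and HDIs for the coefficients and intercepts; beta_ic should
# now come out negative, much closer to the mlogit estimate of roughly -0.0062.
az.summary(idata_m2, var_names=["beta_ic", "beta_oc", "alpha"])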

Thanks so much for flagging this! I’ll make a PR to have it fixed in the example notebook.

@Nathaniel_Forde thanks so much for the follow up, and my apologies for not responding to your first message – I was going to ask about why an uninformative prior wouldn’t end up producing similar estimates to the mlogit example, but then ran into a configuration issue with pymc after updating my OS and hadn’t had a chance to rerun the example. I’ll work through your updated approach here and try to reproduce your results. Thanks again!

PS: I suspect you’ve already made the revisions to the example b/c the results for the first model look similar to what you have above. Just noting that the subsequent paragraph still refers to a positive installed cost coefficient:

Our parameter estimates differ slightly from the reported estimates, but we agree the model is inadequate. We will show a number of Bayesian model checks to demonstrate this fact, but the main call out is that the parameter values for installation costs should probably be negative. It’s counter-intuitive that a βic increase in price would increase the utility generated by the installation, even marginally as here, although we might imagine that some kind of quality assurance comes with price, which drives satisfaction with higher installation costs. The coefficient for repeat operating costs is negative, as expected. Putting this issue aside for now, we’ll show below how we can incorporate prior knowledge to adjust for this kind of conflict with theory.


Thanks JTL, yeah, I updated the code and missed that reference. The code should be good now, though. Thanks again for flagging it.

Could it be something as simple as the R package giving you a maximum likelihood estimate while PyMC gives you a Bayesian estimate? These can differ by much more than in the example you showed.

How do you get sd = 0 in a PyMC model? Is that just rounding to zero?

This is easy to see with a simple Bernoulli model. Suppose I have observations $y = [1 \ 0 \ 1 \ 0 \ 0]$ and a model with likelihood $y \sim \text{Bernoulli}(\theta)$ and prior $\theta \sim \text{Beta}(1, 1)$ (i.e., uniform). The MLE is going to be $\theta^* = 0.4$. The Bayesian posterior is $\text{Beta}(3, 4)$, with the mean as the standard Bayesian estimator, $\widehat{\theta} = \frac{3}{3 + 4} \approx 0.43$. So you get different estimates. As the size of $y$ goes to infinity, the estimates will converge.
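
A minimal sketch of that comparison in PyMC (purely illustrative; the Beta(3, 4) posterior is available analytically, so sampling here only confirms the point):

import numpy as np
import pymc as pm

y = np.array([1, 0, 1, 0, 0])

# MLE for a Bernoulli theta is just the sample mean: 0.4.
mle = y.mean()

# Analytic posterior under a Beta(1, 1) prior: Beta(1 + 2, 1 + 3) = Beta(3, 4), mean 3/7.
post_mean_analytic = 3 / 7

# The same model in PyMC; the sampled posterior mean should land near 3/7 (about 0.43).
with pm.Model():
    theta = pm.Beta("theta", 1, 1)            # uniform prior
    pm.Bernoulli("obs", p=theta, observed=y)  # Bernoulli likelihood
    idata = pm.sample(random_seed=1)

print(mle, post_mean_analytic, idata.posterior["theta"].mean().item())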

The Bayesian estimate is nice because, if the model is correct, it minimizes the expected squared error of the estimate. All we know about the MLE is that it makes the data most likely and converges to the right answer asymptotically (as does the Bayesian estimate).