Categorical predictor variable, categorical response variable

I will reply to my own post, in case someone wants another example of a simple Bayesian linear model that has both a categorical predictor and a categorical response variable. There don't seem to be many examples of these at the time of this writing, so maybe it will be useful to someone. The version below works for me, using pymc3 version 3.8 and python 3.7.7, with the data attached above:

    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt
    import arviz as az
    import pandas as pd
    from patsy import dmatrices

    landUseForesttypeDF = pd.read_csv('landUseForesttypeDF.csv', index_col='site')
    y_sMe = pd.Categorical(landUseForesttypeDF['forestType']).codes
    _, X = dmatrices('forestType ~ landUse', data=landUseForesttypeDF, return_type='dataframe')
    X.drop(columns=['Intercept'], inplace=True)
    x_sMe = X.values
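As a side note on the data prep: `pd.Categorical(...).codes` integer-encodes the response, and patsy's `dmatrices` builds treatment-coded dummy columns for the predictor (one land-use level is absorbed as the reference, and the `Intercept` column is dropped since `α` plays that role in the model). A minimal sketch of the same idea using only pandas, on made-up land-use and forest-type labels (the column values here are hypothetical, not from the real data):

```python
import pandas as pd

# Made-up stand-in for the real data
df = pd.DataFrame({
    'landUse': ['pasture', 'crop', 'pasture', 'urban'],
    'forestType': ['oak', 'pine', 'oak', 'mixed'],
})

# Integer codes for the response; categories are sorted alphabetically,
# so mixed=0, oak=1, pine=2
y = pd.Categorical(df['forestType']).codes
print(y)  # [1 2 1 0]

# Dummy (one-hot) columns for the predictor; drop_first mimics
# patsy's treatment coding by dropping the reference level
X = pd.get_dummies(df['landUse'], drop_first=True)
print(X.columns.tolist())  # ['pasture', 'urban']
```

`pd.get_dummies` is used here only for illustration; the patsy route in the post additionally handles the model formula and keeps the design matrix aligned with the response.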

    with pm.Model() as model_sMe:
        # One intercept per forest-type category
        α = pm.Normal('α', mu=0, sd=5, shape=4)
        # One coefficient per land-use dummy column, per forest-type category
        β = pm.Normal('β', mu=0, sd=5, shape=(4, 4))
        # Linear predictor (logits), one row per site
        μ = pm.Deterministic('μ', α + pm.math.dot(x_sMe, β))
        # Softmax turns the logits into category probabilities
        θ = tt.nnet.softmax(μ)
        yl = pm.Categorical('yl', p=θ, observed=y_sMe)
        trace_sMe = pm.sample(2000, init='adapt_diag')

Still following @aloctavodia’s example, we can run a couple of quick checks on the model’s performance. The trace contains the linear predictor μ, whose values are unnormalized log-probabilities (logits); applying the softmax converts them back to probabilities, which we can use to pick the most likely forest type for each site according to the model:

    data_predMe = trace_sMe['μ'].mean(0)
    y_predMe = [np.exp(point)/np.sum(np.exp(point), axis=0) for point in data_predMe]
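The list comprehension above is just a softmax applied to each site's logits. A minimal numpy sketch on made-up logits, including the usual max-subtraction trick that keeps `np.exp` from overflowing when logits are large (the example values are hypothetical):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max gives the same result but avoids
    # overflow in np.exp for large logit values
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 0.0])  # made-up logits for 4 forest types
p = softmax(logits)
print(p.sum())       # 1.0 — valid probabilities
print(np.argmax(p))  # 0 — the most likely category
```

The plain `np.exp(point)/np.sum(np.exp(point))` version in the post works fine here because the posterior-mean logits are small; the stabilized form only matters for extreme values.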

Then we take the forest type with the highest probability for each site (the model’s predicted forest type) and see how often the prediction matches the forest type that was actually observed in the field:

    predictedForestType = pd.Series(np.argmax(y_predMe, axis=1), index=landUseForesttypeDF.index)
    observedForestType = pd.Series(y_sMe, index=landUseForesttypeDF.index)
    (predictedForestType == observedForestType).sum()/len(predictedForestType)

    0.6557377049180327

The model picks the correct forest type about 66% of the time. We can also do a quick check of the variance explained with ArviZ’s r2_score function:

    ppc = pm.sample_posterior_predictive(trace_sMe, samples=5000, model=model_sMe)
    az.r2_score(y_sMe, ppc['yl'])

    r2        0.640125
    r2_std    0.047982
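For context, the ~66% accuracy above is worth comparing against the majority-class baseline, i.e. always predicting the most common forest type. A minimal sketch with made-up observed codes (the actual baseline would use `y_sMe`):

```python
import pandas as pd

# Made-up observed forest-type codes for 10 sites
observed = pd.Series([0, 1, 1, 2, 1, 0, 1, 3, 1, 1])

# Accuracy of always predicting the most common class
majority_class = observed.value_counts().idxmax()
baseline = (observed == majority_class).mean()
print(baseline)  # 0.6 here; the model should beat this to be adding value
```

If the classes are heavily imbalanced, a high raw accuracy can be less impressive than it looks, so this one-liner is a useful sanity check alongside the r2_score.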
