Categorical predictor variable, categorical response variable

Hello PyMC community,

I have two categorical variables, one predictor variable and one response variable.

The predictor variable, called “land use”, has 5 levels, which categorize how a site was used 30 years ago. The response variable is a set of forest habitat types, I-IV, reflecting the type of trees that have grown in after 30 years. This forest-type response variable is also categorical with no ranking. Each site can have only one historical land use type, and one current-day forest type.

The data in table form look like this; the counts are numbers of sites, and “marg” marks the marginal totals.

        BC   BS   CLB  RCA  RG   marg
I        0    0    0    4    5     9
II       0    0    0    0   15    15
III      6    7    8    0    0    21
IV       6    5    4    0    1    16
marg    12   12   12    4   21    61
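For anyone who would rather start from the site-level data, the table above can be reproduced with `pd.crosstab`. This sketch reconstructs the 61 site rows from the counts; the column names `landUse` and `forestType` are my assumption about the attached CSV's layout:

```python
import pandas as pd

# Rebuild one row per site from the (land use, forest type) counts in the table
counts = {
    ('RCA', 'I'): 4, ('RG', 'I'): 5,
    ('RG', 'II'): 15,
    ('BC', 'III'): 6, ('BS', 'III'): 7, ('CLB', 'III'): 8,
    ('BC', 'IV'): 6, ('BS', 'IV'): 5, ('CLB', 'IV'): 4, ('RG', 'IV'): 1,
}
rows = [{'landUse': lu, 'forestType': ft}
        for (lu, ft), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows)

# Cross-tabulate with margins=True to get the marginal totals as well
table = pd.crosstab(df['forestType'], df['landUse'], margins=True)
print(table)
```

`margins=True` adds the row/column totals (labelled `All` by pandas, the “marg” row and column above).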

(A CSV with individual observations as rows is attached, if anybody wants to go the extra mile.)

My first instinct is to make a linear model using multinomial regression, after reshaping the predictor variable (“land use”) into binary dummy variables. I am trying to adapt @aloctavodia’s example of softmax/multinomial regression from his great book, so I have coded the model like this:

import numpy as np
import pymc3 as pm
import theano.tensor as tt
import pandas as pd
import patsy

# Dummy-code both variables with patsy
Y, X = patsy.dmatrices('forestType ~ landUse', data=landUseForestType, return_type='dataframe')
y_s = Y.astype('int')                                 # dummy-coded response matrix
x_s = X.iloc[:, 1:]                                   # drop the intercept column
x_s = (x_s - x_s.mean(axis=0)) / x_s.std(axis=0)      # standardize the dummy predictors
x_s.columns = ['BS', 'CLB', 'RCA', 'RG']

with pm.Model() as model_s:
    α = pm.Normal('α', mu=0, sd=5, shape=4)
    β = pm.Normal('β', mu=0, sd=5, shape=(4, 4))
    μ = pm.Deterministic('μ', α + pm.math.dot(x_s, β))
    θ = tt.nnet.softmax(μ)
    yl = pm.Categorical('yl', p=θ, observed=y_s)
    trace_s = pm.sample()  # sampling is where the error below is raised

This returns the following error:

Bad initial energy, check any log probabilities that are inf or -inf, nan or very small:
yl -inf

Why am I getting negative-infinity/ bad initial energy errors?

I am aware that there is an instance of perfect separation in this table: one of the predictor categories ends up 100% in a single response category (RCA → Type I forest). Could that be the issue here? Or should I perhaps be using a different sampler besides NUTS?

The data:
landUseForesttypeDF.csv (585 Bytes)

I will reply to my own post, in case someone wants another example of a simple Bayesian linear model that has both a categorical predictor and a categorical response variable. There don’t seem to be very many examples of these at the time of this writing, so maybe it will be useful to someone. The version below works for me with pymc3 3.8 and Python 3.7.7, using the data attached above:

import numpy as np
import pymc3 as pm
import theano.tensor as tt
import arviz as az
import pandas as pd
from patsy import dmatrices

# Integer-code the response: pm.Categorical expects category codes, not dummy columns
y_sMe = pd.Categorical(landUseForesttypeDF['forestType']).codes
_, X = dmatrices('forestType ~ landUse', data=landUseForesttypeDF, return_type='dataframe')
X.drop(columns=['Intercept'], inplace=True)   # drop the intercept; α plays that role
x_sMe = X.values

with pm.Model() as model_sMe:
    α = pm.Normal('α', mu=0, sd=5, shape=4)
    β = pm.Normal('β', mu=0, sd=5, shape=(4, 4))
    μ = pm.Deterministic('μ', α + pm.math.dot(x_sMe, β))
    θ = tt.nnet.softmax(μ)
    yl = pm.Categorical('yl', p=θ, observed=y_sMe)
    trace_sMe = pm.sample()

Still following @aloctavodia’s example, we can do a couple of quick checks on the performance. The model produces an array of unnormalized log-probabilities (logits) that we can pass through a softmax to recover probabilities, and use those to pick the most likely forest type for each site according to the model:

data_predMe = trace_sMe['μ'].mean(0)   # posterior-mean logits, one row per site

# Softmax each site's logits back into four forest-type probabilities
y_predMe = [np.exp(point) / np.sum(np.exp(point), axis=0) for point in data_predMe]
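As a quick sanity check on that softmax step (with made-up logits, not values from the trace):

```python
import numpy as np

point = np.array([2.0, 0.5, -1.0, 0.3])        # illustrative logits for forest types I-IV
probs = np.exp(point) / np.sum(np.exp(point))  # softmax: exponentiate, then normalize
print(probs, probs.sum())                      # four probabilities that sum to 1
```

Whichever logit is largest gets the largest probability, which is why `np.argmax` on these arrays gives the predicted forest type.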

Then we take the forest type with the highest probability for each site (these are the model's predicted forest types) and see how often they match the forest type that was actually observed in the field:

predictedForestType = pd.Series(np.argmax(y_predMe, axis=1), index=landUseForesttypeDF.index)
observedForestType = pd.Series(y_sMe, index=landUseForesttypeDF.index)
(predictedForestType == observedForestType).sum()/len(predictedForestType)

0.6557377049180327
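For context, a naive baseline that always predicts the most common forest type (III, with 21 of the 61 sites per the table above) would be right only about a third of the time:

```python
# Majority-class baseline accuracy from the marginal counts in the table
baseline = 21 / 61
print(round(baseline, 3))  # 0.344
```

Beating this ~34% baseline by a wide margin suggests historical land use carries real signal about the current forest type.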

The model picks the correct forest type about 66% of the time. We can also do a quick check of the variance explained with ArviZ’s az.r2_score function:

ppc = pm.sample_posterior_predictive(trace_sMe, samples=5000, model=model_sMe)
az.r2_score(y_sMe, ppc['yl'])

r2 0.640125
r2_std 0.047982
