Categorical predictor variable, categorical response variable

Hello PyMC community,

I have two categorical variables, one predictor variable and one response variable.

The predictor variable, called “land use”, has 5 levels, which categorize how a site was used 30 years ago. The response variable is a set of forest habitat types, I–IV, reflecting the type of trees that have grown in after 30 years. This forest-type response variable is also categorical, with no ranking. Each site has exactly one historical land-use type and one current-day forest type.

The data in table form look like this (cell values are numbers of sites):

            BC  BS  CLB  RCA  RG  marg
    I        0   0    0    4   5     9
    II       0   0    0    0  15    15
    III      6   7    8    0   0    21
    IV       6   5    4    0   1    16
    marg    12  12   12    4  21    61

(A CSV with individual observations as rows is attached, if anybody wants to go the extra mile.)
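For anyone following along, the marginal table above can be reproduced from the CSV (which, per the model code below, has columns site, landUse and forestType) with a quick crosstab:

    import pandas as pd

    df = pd.read_csv('landUseForesttypeDF.csv', index_col='site')
    # rows = forest type, columns = historical land use, plus marginal totals
    pd.crosstab(df['forestType'], df['landUse'], margins=True, margins_name='marg')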

My first instinct is to build a linear model using multinomial regression, after reshaping the predictor variable (“land use”) into binary dummy variables. I am trying to adapt @aloctavodia’s example for softmax/multinomial regression from his great book, and have coded the model as follows:

    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt
    import pandas as pd
    import patsy

    landUseForestType = pd.read_csv('landUseForesttypeDF.csv', index_col='site')
    # dummy-code both sides of the formula into design matrices
    Y, X = patsy.dmatrices('forestType ~ landUse', data=landUseForestType, return_type='dataframe')
    y_s = Y.astype('int')
    x_s = X.iloc[:, 1:]                               # drop the intercept column
    x_s = (x_s - x_s.mean(axis=0)) / x_s.std(axis=0)  # standardize the dummy variables
    x_s.columns = ['BS', 'CLB', 'RCA', 'RG']

    with pm.Model() as model_s:
        α = pm.Normal('α', mu=0, sd=5, shape=4)       # one intercept per forest type
        β = pm.Normal('β', mu=0, sd=5, shape=(4, 4))  # 4 dummies x 4 forest types
        μ = pm.Deterministic('μ', α + pm.math.dot(x_s, β))
        θ = tt.nnet.softmax(μ)
        yl = pm.Categorical('yl', p=θ, observed=y_s)
        trace_s = pm.sample(2000, init='adapt_diag')

This returns the following error:

    Bad initial energy, check any log probabilities that are inf or -inf, nan or very small:
    yl   -inf

Why am I getting negative-infinity / bad-initial-energy errors?

I am aware that there is an instance of perfect separation in this table: one of the predictor categories ends up 100% in a single response category (RCA → Type I forest). Could this be the issue here? Or should I perhaps be using a different sampler besides NUTS?
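In case it helps with diagnosis, pymc3 can report the log probability of each variable at the sampler’s starting point (a minimal check using Model.check_test_point()):

    # log-probability of every variable at the initial point;
    # yl is the one showing -inf, matching the error above
    print(model_s.check_test_point())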

Thanks in advance.

The data:
landUseForesttypeDF.csv (585 Bytes)

I will reply to my own post, in case someone wants another example of a simple Bayesian linear model with both a categorical predictor and a categorical response variable. There don’t seem to be very many examples of these at the time of this writing, so maybe it will be useful to someone. The version below works for me with pymc3 version 3.8 and Python 3.7.7, using the data attached above:

    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt
    import arviz as az
    import pandas as pd
    from patsy import dmatrices

    landUseForesttypeDF = pd.read_csv('landUseForesttypeDF.csv', index_col='site')
    # encode the response as integer codes (0-3):
    # pm.Categorical expects category indices, not one-hot columns
    y_sMe = pd.Categorical(landUseForesttypeDF['forestType']).codes
    # dummy-code the predictor; patsy drops one level (BC) as the reference
    _, X = dmatrices('forestType ~ landUse', data=landUseForesttypeDF, return_type='dataframe')
    X.drop(columns=['Intercept'], inplace=True)
    x_sMe = X.values

    with pm.Model() as model_sMe:
        α = pm.Normal('α', mu=0, sd=5, shape=4)       # one intercept per forest type
        β = pm.Normal('β', mu=0, sd=5, shape=(4, 4))  # 4 dummies x 4 forest types
        μ = pm.Deterministic('μ', α + pm.math.dot(x_sMe, β))
        θ = tt.nnet.softmax(μ)                        # per-site class probabilities
        yl = pm.Categorical('yl', p=θ, observed=y_sMe)
        trace_sMe = pm.sample(2000, init='adapt_diag')

Still following @aloctavodia’s example, we can do a couple of quick checks on the performance. The model stores the linear predictor μ, an array of logits per site that we can push through the softmax (exponentiate and normalize) to recover probabilities, and use those to pick the most likely forest type for each site according to the model:

    # posterior mean of the linear predictor: one row of 4 logits per site
    data_predMe = trace_sMe['μ'].mean(0)

    # softmax each row back into forest-type probabilities
    y_predMe = [np.exp(point)/np.sum(np.exp(point), axis=0) for point in data_predMe]
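(Equivalently, if you have scipy ≥ 1.2 available, its softmax does the same row-wise normalization in one call; just an alternative, not what I ran above:)

    from scipy.special import softmax

    y_predMe = softmax(data_predMe, axis=1)  # row-wise softmax over the 4 forest types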

Then we take the forest type with the highest probability in each row; these are the model’s predicted forest types for each site. Comparing them to the field observations shows how often the model predicts the forest type that was actually observed:

    # most probable forest type per site, compared with the observed one
    predictedForestType = pd.Series(np.argmax(y_predMe, axis=1), index=landUseForesttypeDF.index)
    observedForestType = pd.Series(y_sMe, index=landUseForesttypeDF.index)
    (predictedForestType == observedForestType).sum() / len(predictedForestType)

which gives:

    0.6557377049180327

The model picks the correct forest type about 66% of the time. We can also do a quick check of the variance explained with ArviZ’s r2_score function:

    ppc = pm.sample_posterior_predictive(trace_sMe, samples=5000, model=model_sMe)
    az.r2_score(y_sMe, ppc['yl'])

which returns:

    r2        0.640125
    r2_std    0.047982


Hi. Your question and self-answer are very useful because, as you mention, there are really few examples with both a categorical predictor and a categorical outcome.

I’m new to both PyMC and this kind of analysis. I’m trying to replicate your code, since I’m facing the same problem but with multiple categorical regressors. However, I cannot figure out how you went from your initial table, shown above, to the data you use to run the model with the command

    Y, X = patsy.dmatrices('forestType ~ landUse', data=landUseForestType, return_type='dataframe')

because the table has neither a column named “site” nor the variables “forestType” and “landUse”. Could you kindly explain how to create the “Y” and “X” variables needed to reproduce your model, so I can adapt it to a multinomial with multiple categorical predictors?
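(My guess is that the attached CSV is in long format, one row per site, something like the following, with invented values:)

    site,forestType,landUse
    1,I,RCA
    2,III,BS
    3,IV,CLB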

In addition, I can’t figure out how you defined a reference category for the model.

Thanks, and all the best.
Mauricio.

    import pymc3 as pm

Hi. Quick note… I would encourage upgrading from pymc3. We’re on 5.20.0 now with a whole bunch of improvements.
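For example, the working model above would translate roughly like this in modern PyMC (a sketch, untested: sd becomes sigma, tt.nnet.softmax becomes pm.math.softmax, and sampling returns an ArviZ InferenceData):

    import pymc as pm

    with pm.Model() as model_sMe:
        α = pm.Normal('α', mu=0, sigma=5, shape=4)
        β = pm.Normal('β', mu=0, sigma=5, shape=(4, 4))
        μ = pm.Deterministic('μ', α + pm.math.dot(x_sMe, β))
        θ = pm.math.softmax(μ, axis=-1)   # replaces theano's tt.nnet.softmax
        yl = pm.Categorical('yl', p=θ, observed=y_sMe)
        idata = pm.sample()               # returns InferenceData by default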