Hello PyMC community,
I have two categorical variables, one predictor variable and one response variable.
The predictor variable, called “land use”, has 5 levels, which categorizes how a site was used 30 years ago. The response variable is variable is set of forest habitat types, I-IV, reflecting the type of trees that have grown in after 30 years. This forest-type response variable is also categorical with no ranking. Each site can have only one historical land use type, and one current day forest type.
The data in table form looks like this, abundances are numbers of sites.
BC | BS | CLB | RCA | RG | marg | |
---|---|---|---|---|---|---|
I | 0 | 0 | 0 | 4 | 5 | 9 |
II | 0 | 0 | 0 | 0 | 15 | 15 |
III | 6 | 7 | 8 | 0 | 0 | 21 |
IV | 6 | 5 | 4 | 0 | 1 | 16 |
marg | 12 | 12 | 12 | 4 | 21 | 61 |
(A CSV with individual observations as rows attached if anybody wants to go the extra mile)
My first instinct is make a linear model, using a multinomial regression, after reshaping the predictor variable (“land use”) into binary dummy variables. I am trying to adapt @aloctavodia’s example for softmax/multinomial regression in his great book. So I have coded the model as such:
import numpy as np
import pymc3 as pm
import theano.tensor as tt
import pandas as pd
import patsy
landUseForestType = pd.read_csv('landUseForesttypeDF.csv', index_col='site')
Y, X = patsy.dmatrices('forestType ~ landUse', data=landUseForestType, return_type='dataframe')
y_s = Y.astype('int')
x_s = X.iloc[:,1:]
x_s = (x_s - x_s.mean(axis=0)) / x_s.std(axis=0)
x_s.columns = ['BS','CLB','RCA', 'RG']
with pm.Model() as model_s:
α = pm.Normal('α', mu=0, sd=5, shape=4)
β = pm.Normal('β', mu=0, sd=5, shape=(4,4))
μ = pm.Deterministic('μ', α + pm.math.dot(x_s, β))
θ = tt.nnet.softmax(μ)
yl = pm.Categorical('yl', p=θ, observed=y_s)
trace_s = pm.sample(2000, init='adapt_diag')
This returns the following error:
Bad initial energy, check any log probabilities that are inf or -inf, nan or very small:
yl -inf
Why am I getting negative-infinity/ bad initial energy errors?
I am aware that there is an instance of perfect separation in this table. As in, one of the predictor categories is 100% ending up in a response category ( RCA → Type I forest). Could this be an issue here? Or perhaps I should be using a different sampler beside the nuts sampler?
Thanks in advance.
The data:
landUseForesttypeDF.csv (585 Bytes)