Sampling a hierarchical ordered logistic regression

I am sampling a hierarchical ordered logistic regression. Each observation has a response y in 0–11, along with some covariates and a geographic variable, state. The intercept for each state is represented by the "a" variable.

I have a few concerns and questions. I’ve read a bunch of the pymc3 docs and forum posts, but I’m not sure how to address these issues.

Here is the code (model diagram omitted).

import pymc3

# Assumes x is the design matrix with the state indicator columns last,
# y holds the 0-11 responses, and num_states is the number of states.
with pymc3.Model() as model:
    # State level priors.
    # Mean across states.
    mu_a = pymc3.Normal('mu_a', mu=8, sigma=2)
    # Variation across states.
    sigma_a = pymc3.HalfCauchy('sigma_a', beta=1)
    
    # State-level intercept.
    a = pymc3.Normal('a', mu=mu_a, sigma=sigma_a, shape=num_states)      
    
    # Covariates.
    beta = pymc3.Normal('beta', mu=0, sigma=2, shape=x.shape[1] - num_states)
    
    # Combine the state level parameters with the covariates.
    mu = pymc3.math.dot(x[:, :-num_states], beta) + pymc3.math.dot(x[:, -num_states:], a)

    # Apply the inverse-logit transformation.
    theta = pymc3.math.invlogit(mu)
    
    # Prior on cutpoints, one between each consecutive outcome option.
    cutpoints_prior = [b+.5 for b in sorted(set(y))[:-1]]    
    # Cutpoints are ordered from least to greatest.
    cutpoints = pymc3.Normal("cutpoints",
                             mu=cutpoints_prior,
                             sigma=0.1,  # broadcast across all cutpoints
                             shape=len(cutpoints_prior),
                             transform=pymc3.distributions.transforms.ordered)

    y_ = pymc3.OrderedLogistic("y", cutpoints=cutpoints, eta=theta, shape=x.shape[0], observed=y)

The test point has a very extreme y value.

The log-probability of y at the test point is hugely negative. Is there any reason to worry about that, or to try to get it closer to 0?

model.check_test_point()
mu_a                         -1.61
sigma_a_log__                -1.14
a                           -46.87
beta                         -4.84
cutpoints_ordered__          13.84
y                     -70721145.93
Name: Log-probability of test_point, dtype: float64

The prior-predictive y values are not close to 8.

>>> prior_predictive['y'].mean(axis=0).mean()
1.2820398066533978

I would’ve expected it to be around 8.0, as specified in the mu_a prior, but most of the values are close to 1.0. I guess that means the prior predictive draws have no state (and so they don’t get the boost of the a coefficient). But that’s an unrealistic observation – every real observation comes from a real state. Is there a way to get prior predictions that account for that information?
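
For reference, a minimal sketch of how prior predictive draws like the ones above are typically generated (the sample count here is an assumption):

with model:
    # Draw y (and all unobserved variables) from the priors alone.
    prior_predictive = pymc3.sample_prior_predictive(samples=500)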

Sampling takes a long time.

%%time 

with model:
    trace = pymc3.sample(2000, tune=2000, step=pymc3.NUTS())

Multiprocess sampling (4 chains in 4 jobs)
NUTS: [cutpoints, beta, a, sigma_a, mu_a]
Sampling 4 chains: 1%|▏ | 216/16000 [03:31<4:20:46, 1.01draws/s]

It can go for four hours or more. Is this to be expected? It’s difficult to iterate when I have to wait four hours between attempts.

I tried a shorter run, but the chains didn’t mix well. A selection from the trace plot:

[trace plot omitted]

ADVI raises a FloatingPointError

The discussions I’ve read on the forum and blogs mention that ADVI can be faster, so I tried it but it raised an exception.

with model:
    # Note: pymc3.fit returns an approximation object, not a trace.
    approx = pymc3.fit(method='advi')

---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<timed exec> in <module>

~/tools/miniconda/envs/project/lib/python3.7/site-packages/pymc3/variational/inference.py in fit(n, local_rv, method, model, random_seed, start, inf_kwargs, **kwargs)
    788                         'or Inference instance' %
    789                         set(_select.keys()))
--> 790     return inference.fit(n, **kwargs)

~/tools/miniconda/envs/project/lib/python3.7/site-packages/pymc3/variational/inference.py in fit(self, n, score, callbacks, progressbar, **kwargs)
    132         with tqdm.trange(n, disable=not progressbar) as progress:
    133             if score:
--> 134                 state = self._iterate_with_loss(0, n, step_func, progress, callbacks)
    135             else:
    136                 state = self._iterate_without_loss(0, n, step_func, progress, callbacks)

~/tools/miniconda/envs/project/lib/python3.7/site-packages/pymc3/variational/inference.py in _iterate_with_loss(self, s, n, step_func, progress, callbacks)
    216                     except IndexError:
    217                         pass
--> 218                     raise FloatingPointError('\n'.join(errmsg))
    219                 scores[i] = e
    220                 if i % 10 == 0:
FloatingPointError: NaN occurred in optimization. 
The current approximation of RV `a`.ravel()[0] is NaN.
The current approximation of RV `a`.ravel()[1] is NaN.
...

Question

Is it possible to try more things out without waiting many hours between attempts? How would you suggest I proceed? Thank you.

Hi Mina,
I don’t know your use case and data, and I’m not very experienced with ordered logistic regression, but here are some pointers:

  • Have you tried with different priors to see if this affects your prior predictive checks?
  • Sampling and ADVI issue: have you standardized your predictors? Predictors that are on a very large scale often mess up MCMC sampling (see the sketch after this list).
  • You might find this notebook useful – it’s an adaptation to PyMC3 of McElreath’s Rethinking chapter 11 (first edition).
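
Standardizing would look something like this – a minimal sketch, assuming numpy is imported as np and (as in your dot products above) that the state dummy columns are the last num_states columns and should be left untouched:

x_cont = x[:, :-num_states]
# Center and scale each continuous predictor column.
x_cont = (x_cont - x_cont.mean(axis=0)) / x_cont.std(axis=0)
x = np.concatenate([x_cont, x[:, -num_states:]], axis=1)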

Sorry I can’t give you a more detailed answer, but I hope this already helps you 🖖

Hi AlexAndorra,

  • Have you tried with different priors to see if this affects your prior predictive checks?

I tried adjusting the priors a bit but didn’t see any notable differences.

  • Sampling and ADVI issue: have you standardized your predictors? Predictors that are on a very large scale often mess up MCMC sampling.

The predictors have values between 0 and 10, so I didn’t think those scale issues would apply here.

  • You might find this notebook useful – it’s an adaptation to PyMC3 of McElreath’s Rethinking chapter 11 (first edition).

Thanks. I think my code is similar to that code, but the hierarchical aspect in my model does complicate things.

I tried using OLS instead of the ordered logit, and that seemed to work better – at least it converges. I can stick with OLS for now, but if anybody has suggestions on how to make the ordered logit work, I remain interested.
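
For concreteness, the OLS-style fallback might look like the sketch below. This assumes the same design matrix and hierarchical intercepts as the original model, with only the likelihood swapped for a Normal one; sigma_y is a name introduced here:

with pymc3.Model() as linear_model:
    mu_a = pymc3.Normal('mu_a', mu=8, sigma=2)
    sigma_a = pymc3.HalfCauchy('sigma_a', beta=1)
    a = pymc3.Normal('a', mu=mu_a, sigma=sigma_a, shape=num_states)
    beta = pymc3.Normal('beta', mu=0, sigma=2, shape=x.shape[1] - num_states)
    mu = pymc3.math.dot(x[:, :-num_states], beta) + pymc3.math.dot(x[:, -num_states:], a)
    # Treat the 0-11 response as continuous, with a hypothetical noise scale.
    sigma_y = pymc3.HalfNormal('sigma_y', sigma=2)
    y_ = pymc3.Normal('y', mu=mu, sigma=sigma_y, observed=y)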

I realise I’m two years too late, but I’m posting for visibility: I was running into these issues myself and managed to get a hierarchical ordered logistic regression working and sampling in PyMC (4.0.0b1), so it may be helpful for others. I’ve been working through a problem in the book noted above, chapter 12, problem 12H2, which asks the reader to fit a hierarchical ordered logistic regression model to the Trolley data set from the book. See the code below:

import numpy as np
import pandas as pd
import pymc as pm
from aesara import shared  # PyMC 4.0.0b1 is backed by aesara

d = pd.read_csv('Data/Trolley.csv', delimiter=';')
d['id_factorised'] = pd.factorize(d.id)[0]
action = shared(d['action'].values)
intention = shared(d['intention'].values)
contact = shared(d['contact'].values)
N_ids = d['id'].unique().shape[0]

with pm.Model() as m_12h2_null:
    a = pm.Normal(
        'a', 0., 10.,
        transform=pm.distributions.transforms.ordered,
        shape=6, initval=np.arange(6) - 2.5)  # testval was renamed initval in PyMC 4
    
    resp_obs = pm.OrderedLogistic(
        'resp_obs', 0., a,
        observed=d.response.values - 1
    )
    trace_m12h2_null = pm.sample(2000)

with pm.Model() as m_12h2_flexible:
    sigma = pm.HalfCauchy('sigma', 1.)
    a_id = pm.Normal('a_id', 0, sigma, shape=N_ids)
    
    a = pm.Normal(
        'a', 0., 10.,
        transform=pm.distributions.transforms.ordered,
        shape=6,
        initval=trace_m12h2_null['posterior']['a'].mean(dim=['chain', 'draw']).values
    )
    
    bA = pm.Normal('bA', 0., 10.)
    bI = pm.Normal('bI', 0., 10.)
    bC = pm.Normal('bC', 0., 10.)
    phi  = a_id[d.id_factorised.values] + bA * action + bI * intention + bC * contact 
 
    resp_obs = pm.OrderedLogistic(
        'resp_obs', phi, a,
        observed=d.response.values - 1
    )
    trace_m12h2_flexible = pm.sample(4000, tune=1500)

The key thing to get it sampling was to set a sensible initval for the a prior. I did this by fitting a null model to the data and using the posterior mean of a from that null model as the initial value. Sampling takes about 9 minutes with this set-up, and the model probably needs to be tweaked to improve sampling efficiency, but otherwise this will hopefully be a good start for anyone wanting to try and fit one of these models.
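
On the efficiency point, one standard tweak for hierarchical models like this (a common suggestion, not something tested in this thread) is a non-centered parameterization of the per-id intercepts:

sigma = pm.HalfCauchy('sigma', 1.)
a_id_raw = pm.Normal('a_id_raw', 0., 1., shape=N_ids)
# Same model as a_id ~ Normal(0, sigma), but often easier geometry for NUTS.
a_id = pm.Deterministic('a_id', sigma * a_id_raw)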