Ordinal model does not find true means and variance

Hi all

I am trying to set up a model to analyze ordinal survey data. The survey questions ask about proportions (e.g. 'for what proportion of X did you Y?), and the response options are intervals of proportions (e.g. ‘less than 5% of X’, ‘between 5% and 25% of X’, etc.).

There is a similar ordinal model in chapter 23 of Kruscke’s DBDA 2, which someone has implemented in pymc3. I’ve changed this model to fit to my data and purposes, and get the following:

nYlevels = 6
thresh = np.array([0.05,0.25,0.5,0.75,0.95])

#a function to calculate the probabilities for all the thresholds

@as_op(itypes=[tt.dvector, tt.dvector], otypes=[tt.dmatrix])
def outcome_probabilities( alpha, beta):
    out = np.empty(((nYlevels), int(12)), dtype=np.float64)
    n = stats.beta(alpha,beta)    #I changed this one from stats.norm to stats.beta, and now estimate alpha and beta in the model
    out[0] = n.cdf(theta[0]) 
    out[1] = n.cdf(theta[1]) - n.cdf(theta[0])
    out[2] = n.cdf(theta[2]) - n.cdf(theta[1])
    out[3] = n.cdf(theta[3]) - n.cdf(theta[2])
    out[4] = n.cdf(theta[4]) - n.cdf(theta[3])
    out[5] = 1 - n.cdf(theta[4])
    return out

#the model

with pm.Model() as ordinal_model:    

    theta = thresh
    mu = pm.Uniform('mu',0,1,shape=int(12))
    var = pm.Uniform('var',0,(mu*(1-mu)),shape=int(12))

    alpha = ((1 - mu) / var - 1 / mu) * (mu ** 2)
    beta = alpha * (1 / mu - 1)
          
    pr = outcome_probabilities( alpha, beta) 
        
    y = pm.Categorical('y', pr.T, observed=all_samples.T)

To test this model, I sampled data from random beta distributions, and then transformed this into ordinal data using the thresholds from the response options (‘thresh’ in the code above). The model ostensibly runs well with the simulated data, but is sometimes very far off the true means and variances. In particular, the 94%HDI of the posterior is very far away from the true mean when the true mean is relatively high (say 0.4 and up). In those cases, the model generally vastly underestimates the mean (but also not always). For many random pairs of alpha and beta, however, the estimates for the mean and var are very accurate.

I’m not sure what might be the reason that the model works so well for some pairs of alpha and beta, but not for others. This is what I’ve looked at:

  • The sampling looks fine, as far as I can tell, but is very slow (perhaps because I cannot use NUTS because of the custom Theano Operation to calculate the thresholded cumulative probabilities?).
  • I’ve tried the same model but without priors for ‘mu’ and ‘var’, and with various exponential or halfnormal priors for ‘alpha’ and ‘beta’, but that leads to very similar results.
  • I’ve tried using varying sample sizes for the simulated data. This seems to make a difference, with fewer random pairs of alpha and beta for which the estimate of the mean is off, and when the estimates are off they are off a bit less (but the true mean is still very far out of the 90% HDI). However, even with sample sizes of 20k (which is far less than the survey I’m hoping to run), the model still strongly underestimates the mean for pairs of alpha and beta that lead to a high mean.

Is there anything I can change, check or try?Or is it simply very difficult to estimate the mean from this kind of ordinal data?

Thanks!

EDIT: Here’s how I simulated the data:

# get 12 samples from beta distributions with random alpha's and beta's between 0 and 20

samples_dct = {} 

for i in range(1,13):
    alpha = random.uniform(0.0001, 20)
    beta = random.uniform(0.0001, 20)
    samples_dct[i] = np.random.beta(alpha, beta,size=500)

# the thresholds, and the category that corresponds to them

ordinal_scores = {0.05:1,0.25:2,0.5:3,0.75:4,0.95:5, 1:6}

#transform the samples into ordinal data

ordinal_data = copy.deepcopy(samples_dct)   

samples = list(ordinal_data.values())

for sample in samples:
    for key,value in ordinal_scores.items():
        sample[sample < key ] = value

#look at first sample to check

first_sample = pd.DataFrame(data = [samples[0], samples_dct[1]], index = ['ordinal score','true value']).T

#turn into array of categorical pd.Series 

all_samples = []

for i in samples:
    sample = pd.Series(i).astype('category').cat.codes.values
    all_samples.append(sample)
    
all_samples = np.array(all_samples)

Can you provide your data simulation code?

Sorry, should have done that from the start. I’ve edited the post and added the data simulation code.

I poked around just a bit, but I would suggest taking a look through this notebook. Not only does it provide a few different scenarios, it also uses a pure theano implementation which should be substantially faster.

3 Likes

Thanks, this is precisely what I needed, and much faster too probably. I will try this later today!

1 Like

So I’ve tried using the approach explained in that notebook (which matches perfectly with what I wanted to do). Unfortunately, I’m assuming that the underlying distribution is a Beta distribution and it seems that the logcdf of pymc3.Beta can only handle scalar values at the moment. In order to bypass that, I tried splitting up the thresholds, replacing:

probs1 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(d1))

by:

    prob1 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.05))
    prob2 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.25))
    prob3 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.50)) 
    prob4 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.75)) 
    prob5 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.95))
   
    probs1 = tt.stack([prob1,prob2,prob3,prob4,prob5],axis=1)

(the values are the particular thresholds of my study).

Unfortunately, I think that ran into this error. So I had to use the as_op decorator that I used in my original model, as follows:

@as_op(itypes=[tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def outcome_probabilities( alpha, beta):
    out = np.empty(((6)), dtype=np.float64)
    n = stats.beta(alpha,beta)  
    out[0] = n.cdf(theta[0]) 
    out[1] = n.cdf(theta[1]) - n.cdf(theta[0])
    out[2] = n.cdf(theta[2]) - n.cdf(theta[1])
    out[3] = n.cdf(theta[3]) - n.cdf(theta[2])
    out[4] = n.cdf(theta[4]) - n.cdf(theta[3])
    out[5] = 1 - n.cdf(theta[4])
    return out


with pm.Model() as model1:

    theta = d1
    mu = pm.Uniform('mu',0,1)
    sigma = pm.Uniform('sigma',0,(mu*(1-mu)))
    alpha = ((1 - mu) / sigma - 1 / mu) * (mu ** 2)
    
    beta = alpha * (1 / mu - 1)

    probs1 = outcome_probabilities(alpha,beta)
    
    probs1 = pm.Deterministic("probs1", probs1)
    
    pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values)

This works much better than the model I started with, both with respect to posterior predictive checks and recovery of the true parameter values. The sampling is still quite slow, because it still isn’t a pure theano implementation, but at least the estimates are much better now. Thanks again for the pointer, @cluhmann !

1 Like

I checked in with @ricardoV94 and the gradient issue described with pymc3.distributions.dist_math.incomplete_beta should be fixed in v4. So if you are interested in trying out the v4 beta that is currently available, it might help. Just trying to get you moving quicker!

Thanks, that’s great to hear – I’ll definitely look into that!

I didn’t know that notebook was visible yet. There will be some edits, but the ordinal regression stuff there won’t change.

FYI. I’m going to be working on another example notebook very soon to show off the new distributions pm.OrderedProbit and pm.OrderedLogit. Will try to remember to post it here when it’s done, or at least when I submit the PR.

1 Like

@drbenvincent , I know it’s been a while since you responded to this, but any chance that the new example notebook is up already? The reason I’m asking is that I’m trying to turn this into a hierarchical model, with various categorical predictors but I’m not quite sure how to do this. pm.Multinomial requires an array of counts as observations, and I don’t know how to do the indexing for the categorical predictors if I cannot put the data in long format. Any hints on how I can go about this, or resources I can take a look at? Thanks!

Hi. Maybe you’re looking up for a probit model, as described in your initial post. But perhaps you’ll find an OrderedLogistic model useful as well. Usually they provide very close results (In my experience they are a bit easier to sample). If you’re interested in multilevel (aka hierarchical) models with latent variables, I have this repo implementing one: https://github.com/SimonErnesto/IRT_GRM_Bayesian .

I also tried out simulating random responses to the ordered categories you described on your first post but adding items and categories, so to run a multilevel model over those variables (just adaptive variance over items in this example, for simplicity). I’ve got very decent results with this model:

# -*- coding: utf-8 -*-

import numpy as np
import pymc3 as pm
from numpy import random
import pandas as pd
from scipy.special import expit
from tqdm import tqdm
import arviz as az
import matplotlib.pyplot as plt
import os

path = ""

os.chdir(path)

samples_dct = {} 

for i in range(1,13):
    alpha = random.uniform(0.0001, 20)
    beta = random.uniform(0.0001, 20)
    samples_dct[i] = np.random.beta(alpha, beta,size=500)

# the thresholds, and the category that corresponds to them

ordinal_scores = {0.05:1,0.25:2,0.5:3,0.75:4,0.95:5, 1:6}

response = ["less than 5%", "5% to 25%", "25% to 50%", "50% to 75%", "75% to 95%"]

scores = []
items = []
resp = []

I = 20 #number of items
P = 300 #number of participants


for c in range(3): #number of categories
    for i in range(I): #number of items
        for p in range(P): #number of participants
            s = np.random.choice([0,1,2,3,4])
            scores.append(s)
            items.append(i)
            resp.append(response[s])
                
data = pd.DataFrame({'score':scores, 'item':items, 'response':resp})

cats = np.array([np.repeat(c, len(data)/3) for c in range(3)]).flatten()

data['category'] = cats


C = len(data.category.unique()) #number of ategories
K = len(response) 

scores = data.score.values

ii = pd.factorize(data['item'])[0].astype('int32')
cj = pd.factorize(data['category'])[0].astype('int32')

#### Multilevel Ordered Logistic Model
with pm.Model() as model:
    s = pm.HalfNormal('s', 1)
    e = pm.Normal('e', 0, 1, shape=C)
    k = pm.Normal('k', mu=[-2,-1,1,2], sigma=s, transform=pm.distributions.transforms.ordered,
                      shape=(I, K-1), testval=np.arange(K-1))  
    y = pm.OrderedLogistic('y', cutpoints=k[ii], eta=e[cj], observed=scores)

Prior predictives are a bit off (sorry I don’t leave an example here, just to avoid cramming the post too much), but posteriors are very precise:

with model:
    trace = pm.sample(1000)

def pordlog(a):
    pa = expit(a)
    p_cum = np.concatenate(([0.], pa, [1.]))
    return p_cum[1:] - p_cum[:-1]

cuts = trace['k'].T

colors = ['crimson', 'slateblue', 'forestgreen']

props = []

item_data = data.groupby('item', as_index=False).sum()
for i in tqdm(range(I)):# range over items
    for c in tqdm(range(C)):# range over category
        color = colors[c]
        item_name = 'item'+str(item_data.item[i])
        prob = cuts[:,i,:] - trace['e'][:,c]
        posts = np.array([pordlog(prob.T[s]) for s in range(len(prob.T))]).T
        prop = data[data.item==data.item.unique()[i]]
        prop = [len(prop[prop.score==s])/len(prop) for s in [0,1,2,3,4]]
        props.append(prop)
        pmeans = [m.mean() for m in posts]
        h5s = [az.hdi(h, hdi_prob=0.9)[0] for h in posts]
        h95s = [az.hdi(h, hdi_prob=0.9)[1] for h in posts]
        plt.plot(pmeans, color=color, linewidth=2, label='Posterior Mean')
        plt.fill_between([0,1,2,3,4],h5s,h95s, color=color, alpha=0.2, label='90% HDI')
        plt.plot(prop, color='slategray', linewidth=2, linestyle=':', label='Observed Score')
        plt.suptitle("Posterior")
        plt.title(item_name, size=11)
        plt.grid(alpha=0.1)
        plt.xticks(range(5))
        plt.legend(prop={'size':10})
        plt.xlabel('Score')
        plt.ylabel('Probability')
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.savefig(path+'posteriors/'+item_name+'cat'+str(c)+'_prior_pred.png', dpi=300)
        plt.close()

Here’s an example of the posteriors of item 7 at category 2:

I know you’re looking up for the mean responses, but that can be obtained from the posterior (maybe this is better to avoid losing info?). The distribution does not come from CDF as in the probit model, but the means of posteriors across categories or participants should be similarly informative for log-odds/probabilities (personally I find a bit easier to understand probabilities rather than normal or log-odds outcomes). I hope this is useful in some way. Please let me know if I misunderstood something.

Thanks, @Simon , this looks incredibly useful. However, I can’t seem to find the repo you’re referring to (https://github.com/SimonErnesto/IRT_GRM_Bayesian). Any chance the link is wrong, or it’s been moved somewhere else?

I’m so sorry, I forgot to make it public :sweat_smile:
Does it work now?: https://github.com/SimonErnesto/IRT_GRM_Bayesian

Ah yes, works now, thank you!

I’ve finally had the time to look at this again, and I think your solution works well for me, @Simon . Thanks so much!

I’ve kept the basic structure of your model, but adapted it a bit to my purposes (and on the basis of McElreath’s section on ordered logistic models). I’ve added hyperpriors for the cutpoints (the different ‘Items’ in your example), and a few more categorical predictors (experience, gender, field). I’ve also added a parameter for ‘participants’, as I worry that a lot of the variation in responses (question responses) may be due to differences between the respondents (and not due to the other predictors). It looks like this now, and runs well if I put target_accept high enough (I may have to use a non-centered version of this model):

with pm.Model(coords=coords) as hierarchical_survey:

    # hyperpriors for the cutpoints
    cutpoints_h = pm.Normal('cutpoints_h',0,1, dims = 'cutpoint_names')
    cutpoints_s = pm.HalfNormal('s', 1)
    
    # hyperpriors for field, experience and participant
    s_p = pm.HalfNormal('p_s', 1)
    s_f = pm.HalfNormal('s_f', 1)
    s_a = pm.HalfNormal('s_a', 1)
    
    # cutpoints for the different questions
    cutpoints = pm.Normal('cutpoints', mu=cutpoints_h, sigma=cutpoints_s, transform=pm.distributions.transforms.ordered,
                      dims = ('question','cutpoint_names'), testval=np.arange(6))

    
    # predictors
    
    genderv = pm.Normal("genderv", 0.0, 1, dims = 'gender')
    fieldv = pm.Normal("fieldv", 0.0, s_f, dims = 'field')
    experiencev = pm.Normal("experiencev", 0.0, s_a, dims = 'experience')
    participantv = pm.Normal("participantv", 0.0, s_p, dims = 'participants')


    # Data, for pps
    
    G = pm.Data("G", gender_idx)
    F = pm.Data("F", field_idx)
    E = pm.Data("E", experience_idx)
    

    phi = pm.Deterministic("phi", genderv[G] +  participantv[participant_idx] + fieldv[F] + experiencev[E]) # 
    y = pm.OrderedLogistic("y", phi, cutpoints[question_idx], observed=genders.scores.values - 1)

Does that make sense? One thing I wondered about, is what difference it would make if ‘Items’ (‘questions’ in my code) were added as a separate parameter in the linear model, instead of estimating all the different sets of cutpoints. I’m mainly interested in the overall effect of gender (i.e. independently of question, field, experience and participant), and I suspect that doing it that way wouldn’t change much, but I don’t think I understand this well enough to be sure.

I also hope the model (with relatively many parameters, as there is one for each respondent) will still run well when the data set gets bigger (the pilot was only 130, but I’m expecting around 2000).

Thanks once again for your advice, it keeps amazing me how helpful everyone here is even to people who are inexperienced and new to this like me!

I’m glad it was useful. Let me try to address your questions one by one:

  1. You can indeed add items a separate parameter (e.g. a varying intercept). That may not work in all cases, though, as each item is rated once by each participant (though this may not be true for your specific case) there is a strong dependency between items and cutpoints. Thus, adding items as an independent varying intercept (for instance) may be more problematic than adding them as part of the varying cutpoints, but you can try both and see which works best for your specific case. (Always better to follow a Bayesian workflow approach in my view).

  2. The problem with all logisitc and similar models is that the likelihoods or sampling distributions do not have a variance parameter, so all variance goes through the estimate and that makes all parameters conflated, so you won’t be able to identify each causal effect independently. Maybe some other approach such as a factor analysis or path analysis could help with that, given you use a Gaussian model. But that defeats the purpose of treating ratings appropriately as an ordered variable. I’m not sure whether there is an escape to this catch 22, but maybe just reporting interpretations of parameters cautiously and making very clear that the effect of gender (or sex?) is not independent of your other variables, or comparing with other models, or maybe model averaging, etc., could help a bit.

  3. If the model converges well now (i.e. R_hats below or equal to1.01, ESS over 5-10% of your draws, etc.), then it should probably converge well with more data (just will take longer).

No worries, we’re all here to learn (= . So, please, take all my points above with a pinch of salt. Maybe someone more knowledgeable than me will see them and correct some things I may have gotten wrong. Good luck with the main study.

Thanks so much, that’s very clear and helpful!

I understand that it’s not possible to identify effects entirely independently, but I’ll look into getting the average marginal effect or something similar. There’s this thread on here, and there are quite some resources for R (like this one, and it’s even part of brms, but sadly nothing similar exists for python/pymc as far as I can see). I suppose that’s the best estimate of ‘independent’ effects that I can get, which I can then interpret cautiously as you suggest.