Ordinal model does not recover true means and variances

Hi all

I am trying to set up a model to analyze ordinal survey data. The survey questions ask about proportions (e.g. 'For what proportion of X did you Y?'), and the response options are intervals of proportions (e.g. 'less than 5% of X', 'between 5% and 25% of X', etc.).

There is a similar ordinal model in chapter 23 of Kruschke's DBDA2, which someone has implemented in pymc3. I've adapted that model to my data and purposes, and ended up with the following:

import numpy as np
import pymc3 as pm
import theano.tensor as tt
from scipy import stats
from theano.compile.ops import as_op

nYlevels = 6
thresh = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

#a function to calculate the probabilities for all the thresholds

@as_op(itypes=[tt.dvector, tt.dvector], otypes=[tt.dmatrix])
def outcome_probabilities(alpha, beta):
    # theta holds the category thresholds (set to `thresh` inside the model below)
    out = np.empty((nYlevels, 12), dtype=np.float64)
    n = stats.beta(alpha, beta)    # I changed this from stats.norm to stats.beta, and now estimate alpha and beta in the model
    out[0] = n.cdf(theta[0])
    out[1] = n.cdf(theta[1]) - n.cdf(theta[0])
    out[2] = n.cdf(theta[2]) - n.cdf(theta[1])
    out[3] = n.cdf(theta[3]) - n.cdf(theta[2])
    out[4] = n.cdf(theta[4]) - n.cdf(theta[3])
    out[5] = 1 - n.cdf(theta[4])
    return out

#the model

with pm.Model() as ordinal_model:    

    theta = thresh
    mu = pm.Uniform('mu', 0, 1, shape=12)
    var = pm.Uniform('var', 0, mu * (1 - mu), shape=12)

    # moment-matching: convert the mean and variance to the Beta distribution's alpha and beta
    alpha = ((1 - mu) / var - 1 / mu) * (mu ** 2)
    beta = alpha * (1 / mu - 1)

    pr = outcome_probabilities(alpha, beta)

    y = pm.Categorical('y', p=pr.T, observed=all_samples.T)
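
(For anyone wondering about the alpha and beta expressions above: they are the usual moment-matching solution for a Beta distribution with mean mu and variance var, which requires var < mu*(1-mu). A quick numerical sanity check, with made-up values rather than anything from the survey:)

# sanity check of the mean/variance -> (alpha, beta) conversion used in the model
mu_, var_ = 0.3, 0.01
a = ((1 - mu_) / var_ - 1 / mu_) * mu_ ** 2
b = a * (1 / mu_ - 1)
print(a / (a + b))                           # mean of Beta(a, b) -> 0.3
print(a * b / ((a + b) ** 2 * (a + b + 1)))  # variance of Beta(a, b) -> 0.01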

To test this model, I sampled data from random beta distributions and then transformed it into ordinal data using the thresholds from the response options ('thresh' in the code above). The model ostensibly runs fine on the simulated data, but its estimates are sometimes very far off the true means and variances. In particular, the 94% HDI of the posterior is often far away from the true mean when the true mean is relatively high (say 0.4 and up); in those cases the model generally underestimates the mean badly (though not always). For many other random pairs of alpha and beta, the estimates of the mean and variance are very accurate.

I’m not sure what might be the reason that the model works so well for some pairs of alpha and beta, but not for others. This is what I’ve looked at:

  • The sampling looks fine, as far as I can tell, but it is very slow (perhaps because the custom Theano Op that calculates the thresholded cumulative probabilities rules out NUTS?).
  • I’ve tried the same model without priors on ‘mu’ and ‘var’, using various exponential or half-normal priors on ‘alpha’ and ‘beta’ instead, but that leads to very similar results.
  • I’ve tried varying sample sizes for the simulated data. This does make a difference: there are fewer random pairs of alpha and beta for which the estimate of the mean is off, and when the estimates are off, they are off by a bit less (though the true mean is still far outside the 90% HDI). However, even with sample sizes of 20k (which is far less than the survey I’m hoping to run), the model still strongly underestimates the mean for pairs of alpha and beta that produce a high mean.

Is there anything I can change, check or try? Or is it simply very difficult to estimate the mean from this kind of ordinal data?

Thanks!

EDIT: Here’s how I simulated the data:

import copy
import random

import numpy as np
import pandas as pd

# draw 12 samples of size 500 from Beta distributions with random alphas and betas between 0 and 20

samples_dct = {} 

for i in range(1,13):
    alpha = random.uniform(0.0001, 20)
    beta = random.uniform(0.0001, 20)
    samples_dct[i] = np.random.beta(alpha, beta,size=500)

# the thresholds, and the category that corresponds to them

ordinal_scores = {0.05: 1, 0.25: 2, 0.5: 3, 0.75: 4, 0.95: 5, 1: 6}

#transform the samples into ordinal data

ordinal_data = copy.deepcopy(samples_dct)   

samples = list(ordinal_data.values())

for sample in samples:
    # relies on ordinal_scores being in ascending threshold order, so each value
    # is replaced by the code of the first threshold it falls below
    for key, value in ordinal_scores.items():
        sample[sample < key] = value

#look at first sample to check

first_sample = pd.DataFrame(data = [samples[0], samples_dct[1]], index = ['ordinal score','true value']).T

#turn into array of categorical pd.Series 

all_samples = []

for i in samples:
    # note: cat.codes numbers only the categories that actually occur in the sample
    sample = pd.Series(i).astype('category').cat.codes.values
    all_samples.append(sample)
    
all_samples = np.array(all_samples)
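
(Side note on the binning step: the loop above relies on the already-assigned codes never falling below the remaining thresholds. A vectorized alternative that produces the 0–5 category codes directly from the raw draws, regardless of which categories happen to occur, would be something like the sketch below, assuming numpy and the same thresholds:)

# sketch of a vectorized alternative to the binning loop + cat.codes step:
# np.digitize returns, for each value, the index of the interval it falls into (0-5)
thresholds = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
all_samples = np.array([np.digitize(s, thresholds) for s in samples_dct.values()])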

Can you provide your data simulation code?

Sorry, should have done that from the start. I’ve edited the post and added the data simulation code.

I poked around just a bit, but I would suggest taking a look through this notebook. Not only does it provide a few different scenarios, it also uses a pure theano implementation which should be substantially faster.

Thanks, this is precisely what I needed, and probably much faster too. I will try this later today!

So I’ve tried the approach explained in that notebook (which matches perfectly with what I wanted to do). Unfortunately, I’m assuming the underlying distribution is a Beta, and it seems that the logcdf of pymc3.Beta can currently only handle scalar values. To get around that, I tried splitting up the thresholds, replacing:

probs1 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(d1))

by:

prob1 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.05))
prob2 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.25))
prob3 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.50))
prob4 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.75))
prob5 = tt.exp(pm.Beta.dist(alpha=alpha, beta=beta).logcdf(0.95))

probs1 = tt.stack([prob1, prob2, prob3, prob4, prob5], axis=1)

(the values are the particular thresholds of my study).

Unfortunately, I think that ran into this error. So I had to use the as_op decorator that I used in my original model, as follows:

@as_op(itypes=[tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def outcome_probabilities(alpha, beta):
    # probability of each ordinal category under a Beta(alpha, beta), given the thresholds in theta
    out = np.empty(6, dtype=np.float64)
    n = stats.beta(alpha, beta)
    out[0] = n.cdf(theta[0])
    out[1] = n.cdf(theta[1]) - n.cdf(theta[0])
    out[2] = n.cdf(theta[2]) - n.cdf(theta[1])
    out[3] = n.cdf(theta[3]) - n.cdf(theta[2])
    out[4] = n.cdf(theta[4]) - n.cdf(theta[3])
    out[5] = 1 - n.cdf(theta[4])
    return out


with pm.Model() as model1:

    theta = d1    # the threshold values, as in the linked notebook
    mu = pm.Uniform('mu', 0, 1)
    sigma = pm.Uniform('sigma', 0, mu * (1 - mu))    # this is the variance of the Beta, as in my original model

    alpha = ((1 - mu) / sigma - 1 / mu) * (mu ** 2)
    beta = alpha * (1 / mu - 1)

    probs1 = outcome_probabilities(alpha, beta)
    probs1 = pm.Deterministic("probs1", probs1)

    pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values)

This works much better than the model I started with, both in terms of posterior predictive checks and recovery of the true parameter values. Sampling is still quite slow, since it still isn't a pure theano implementation, but the estimates are much better now. Thanks again for the pointer, @cluhmann!

I checked in with @ricardoV94, and the gradient issue with pymc3.distributions.dist_math.incomplete_beta described above should be fixed in v4. So if you are interested in trying out the v4 beta that is currently available, it might help. Just trying to get you moving quicker!
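
In case it helps, here is a rough, untested sketch of what the pure-tensor version could look like on v4. The names are my guesses at the v4 API (package imported as pymc, aesara.tensor, pm.logcdf), the thresholds are the ones from your survey, and c1 is the same observed count vector as in your model above, so treat it as a starting point rather than working code:

import aesara.tensor as at
import pymc as pm  # v4 beta

thresholds = [0.05, 0.25, 0.50, 0.75, 0.95]

with pm.Model() as model_v4:
    mu = pm.Uniform("mu", 0, 1)
    var = pm.Uniform("var", 0, mu * (1 - mu))
    alpha = ((1 - mu) / var - 1 / mu) * mu ** 2
    beta = alpha * (1 / mu - 1)

    # Beta CDF at each threshold; the hope is that in v4 logcdf works on a
    # vector of values and the incomplete-beta gradient is fixed, so NUTS can be used
    cdf = at.exp(pm.logcdf(pm.Beta.dist(alpha=alpha, beta=beta), thresholds))
    probs = at.concatenate([cdf[:1], cdf[1:] - cdf[:-1], 1 - cdf[-1:]])

    pm.Multinomial("counts", p=probs, n=c1.sum(), observed=c1.values)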

Thanks, that’s great to hear – I’ll definitely look into that!

I didn’t know that notebook was visible yet. There will be some edits, but the ordinal regression stuff there won’t change.

FYI. I’m going to be working on another example notebook very soon to show off the new distributions pm.OrderedProbit and pm.OrderedLogistic. Will try to remember to post it here when it’s done, or at least when I submit the PR.
