Hi all
I am trying to set up a model to analyze ordinal survey data. The survey questions ask about proportions (e.g. ‘for what proportion of X did you Y?’), and the response options are intervals of proportions (e.g. ‘less than 5% of X’, ‘between 5% and 25% of X’, etc.).
There is a similar ordinal model in chapter 23 of Kruschke’s DBDA 2, which someone has implemented in pymc3. I’ve adapted that model to my data and purposes, and end up with the following:
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from scipy import stats
from theano.compile.ops import as_op

nYlevels = 6
thresh = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

# a function to calculate the probabilities of all six response categories;
# it reads the thresholds from the module-level 'theta' (set to 'thresh' in the model block below)
@as_op(itypes=[tt.dvector, tt.dvector], otypes=[tt.dmatrix])
def outcome_probabilities(alpha, beta):
    out = np.empty((nYlevels, 12), dtype=np.float64)  # 6 categories x 12 survey items
    # I changed this one from stats.norm to stats.beta, and now estimate alpha and beta in the model
    n = stats.beta(alpha, beta)
    out[0] = n.cdf(theta[0])
    out[1] = n.cdf(theta[1]) - n.cdf(theta[0])
    out[2] = n.cdf(theta[2]) - n.cdf(theta[1])
    out[3] = n.cdf(theta[3]) - n.cdf(theta[2])
    out[4] = n.cdf(theta[4]) - n.cdf(theta[3])
    out[5] = 1 - n.cdf(theta[4])
    return out
# the model
with pm.Model() as ordinal_model:
    theta = thresh                                       # fixed thresholds from the response options
    mu = pm.Uniform('mu', 0, 1, shape=12)
    var = pm.Uniform('var', 0, mu * (1 - mu), shape=12)  # a beta's variance must be below mu*(1-mu)
    # moment matching: convert mean and variance to the beta distribution's alpha and beta
    alpha = ((1 - mu) / var - 1 / mu) * (mu ** 2)
    beta = alpha * (1 / mu - 1)
    pr = outcome_probabilities(alpha, beta)
    y = pm.Categorical('y', p=pr.T, observed=all_samples.T)
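As a quick sanity check of the mean/variance to alpha/beta conversion outside of the model (scipy only, with arbitrary values; the *_check names exist only in this snippet):

from scipy import stats

mu_check, var_check = 0.3, 0.01   # arbitrary, with var_check < mu_check * (1 - mu_check)
a_check = ((1 - mu_check) / var_check - 1 / mu_check) * mu_check ** 2
b_check = a_check * (1 / mu_check - 1)
print(stats.beta(a_check, b_check).mean(), stats.beta(a_check, b_check).var())
# prints ~0.3 and ~0.01, so the conversion itself seems to be correct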
To test this model, I sampled data from random beta distributions and then transformed it into ordinal data using the thresholds from the response options (‘thresh’ in the code above). The model ostensibly runs well on the simulated data, but is sometimes very far off the true means and variances. In particular, the 94% HDI of the posterior is very far away from the true mean when the true mean is relatively high (say 0.4 and up). In those cases the model generally underestimates the mean by a lot (though not always). For many random pairs of alpha and beta, however, the estimates of the mean and variance are very accurate.
I’m not sure what might be the reason that the model works so well for some pairs of alpha and beta, but not for others. This is what I’ve looked at:
- The sampling looks fine, as far as I can tell, but it is very slow (perhaps because the custom Theano Op for the thresholded cumulative probabilities has no gradient, so I cannot use NUTS?).
- I’ve tried the same model without the priors on ‘mu’ and ‘var’, placing various Exponential or HalfNormal priors directly on ‘alpha’ and ‘beta’ instead, but that leads to very similar results.
- I’ve tried varying sample sizes for the simulated data. This seems to make a difference: with larger samples there are fewer random pairs of alpha and beta for which the estimate of the mean is off, and when the estimates are off they are off by a bit less (but the true mean is still far outside the 90% HDI). However, even with sample sizes of 20k (which is far smaller than the survey I’m hoping to run), the model still strongly underestimates the mean for pairs of alpha and beta that lead to a high mean.
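One further check I could run is to compare, for one simulated sample, the category probabilities implied by the true beta distribution with the empirical category frequencies (a scipy-only sketch; true_alpha and true_beta are placeholders for the parameters used to generate all_samples[0], which the simulation code in the EDIT below does not currently store):

import numpy as np
from scipy import stats

edges = np.concatenate(([0.0], thresh, [1.0]))                    # 0, .05, .25, .5, .75, .95, 1
implied = np.diff(stats.beta(true_alpha, true_beta).cdf(edges))   # P(category) under the true parameters
observed = np.bincount(all_samples[0], minlength=6) / all_samples.shape[1]
print(np.round(implied, 3))
print(np.round(observed, 3))

If these agree even for the problematic pairs, that would point towards the binning itself (everything above 0.95 ending up in a single category) rather than a bug as the reason the mean is hard to recover.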
Is there anything I can change, check or try? Or is it simply very difficult to estimate the mean from this kind of ordinal data?
Thanks!
EDIT: Here’s how I simulated the data:
import copy
import random
import numpy as np
import pandas as pd

# get 12 samples from beta distributions with random alphas and betas between 0 and 20
samples_dct = {}
for i in range(1, 13):
    alpha = random.uniform(0.0001, 20)
    beta = random.uniform(0.0001, 20)
    samples_dct[i] = np.random.beta(alpha, beta, size=500)

# the thresholds, and the ordinal category that corresponds to them
ordinal_scores = {0.05: 1, 0.25: 2, 0.5: 3, 0.75: 4, 0.95: 5, 1: 6}

# transform the samples into ordinal data (this works because the assigned categories 1-6
# are all >= 1 and are therefore never caught again by the later, larger thresholds)
ordinal_data = copy.deepcopy(samples_dct)
samples = list(ordinal_data.values())
for sample in samples:
    for key, value in ordinal_scores.items():
        sample[sample < key] = value

# look at the first sample to check the binning
first_sample = pd.DataFrame(data=[samples[0], samples_dct[1]],
                            index=['ordinal score', 'true value']).T

# turn into an array of categorical codes (0-5)
all_samples = []
for i in samples:
    sample = pd.Series(i).astype('category').cat.codes.values
    all_samples.append(sample)
all_samples = np.array(all_samples)
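(As an aside, the binning plus recoding above could also be done in one vectorised step with np.digitize; all_samples_alt is just a name for this sketch. Unlike .cat.codes, this always maps to the same 0-5 codes, even if a sample happens to contain no values in one of the intervals.)

# codes 0-5 straight from the raw proportions, no pandas round trip
all_samples_alt = np.array([np.digitize(s, thresh) for s in samples_dct.values()])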