[Conceptual] Bayesian Chi-Squared estimation

(This is approximately a mirror of my question on Stack Exchange. I’m using PyMC3 for inference, and I thought maybe this community would have a higher density of experienced opinions :wink: )

I’m trying to develop a replacement for a \chi^2 test that uses Bayesian parameter estimation in a hierarchical model instead of a frequentist point estimate (analogous to replacing a t-test with BEST).

The original problem involves an experiment with two groups of participants, each of whom performs a repeated-measures task on which three types of error can be made. I want to estimate the between-group difference in the distribution of error types.

The model structure I’m using:

\mu_{population, i} \sim \mathcal{N}(0, 100)

\sigma_{population, i} \sim \mathcal{HalfCauchy}(5)

\mu_{group, i} \sim \mathcal{N}(\mu_{population, i}, \sigma_{population, i})

\sigma_{group, i} \sim \mathcal{HalfCauchy}(5)

\mu_{subject, i} \sim \mathcal{N}(\mu_{group, i}, \sigma_{group, i})

\theta_{subject, i} = \frac {e^{\mu_{subject, i}}} {\sum_{\hat{i}} e^{\mu_{subject, \hat{i}}}}

\vec{k}_{subject} \sim \mathcal{Multinomial}(\vec{\theta}_{subject}, n_{subject})

Two comments on the choices I’ve made; would love to hear any further criticisms:

  1. Groups and subjects are genuinely hierarchical in this experiment, an assumption that, as far as I know, can’t be encoded in a standard \chi^2 test. All subjects come from the same originating pool, so it makes sense to have the population-level prior.

  2. It’s difficult to work with priors over Dirichlet parameters, so instead of using the Dirichlet conjugate prior for multinomial distributions, I’ve followed a suggestion I’ve seen from Gelman and others to use a softmax operation on normally distributed variables to produce the multinomial parameter.

    A. Interestingly, when the per-group parameters are normally distributed, the sum of their rescaled squared differences is \chi^2 distributed, which makes a very natural contrast to sample the posterior over, as it mirrors the procedure and assumptions involved in the frequentist \chi^2 test: \sum_j { \frac {(\mu_{0, j} - \mu_{1, j})^2} {\sigma_{0, j}^2 + \sigma_{1, j}^2} }
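For concreteness, here’s a quick simulation check of the claim in 2.A (plain NumPy/SciPy; the \sigma values are arbitrary). Under a shared mean, the contrast should follow a \chi^2 distribution with one degree of freedom per error type:

import numpy as np
from scipy import stats

np.random.seed(0)
sigma0 = np.array([0.5, 1.0, 2.0])
sigma1 = np.array([1.5, 0.7, 1.2])

# Under a shared mean, (mu0 - mu1) ~ N(0, sigma0^2 + sigma1^2) per error type,
# so each rescaled squared difference is a squared standard normal.
mu0 = np.random.normal(0.0, sigma0, size=(100000, 3))
mu1 = np.random.normal(0.0, sigma1, size=(100000, 3))
contrast = ((mu0 - mu1) ** 2 / (sigma0 ** 2 + sigma1 ** 2)).sum(axis=1)

# Kolmogorov-Smirnov comparison against a chi^2 with 3 degrees of freedom;
# the p-value should be large if the claim holds.
print(stats.kstest(contrast, stats.chi2(df=3).cdf))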

So my intent is to sample the posterior over this contrast. This gets me to question 1: have I made appropriate choices in this procedure?

The second question is more directly related to PyMC3 and HMC. When sampling from this model, I run into convergence issues, even when using the non-centered (reparameterized) form of the normal distributions. I’m using a multinomial distribution as the final output likelihood; would using a categorical distribution instead help with convergence when the number of participants is relatively small (< 50)?

Model code:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as chisq_model:
    # Population-level location and scale for each error type
    mu_population = pm.Normal('mu_population', 
                              mu=0, sd=100,
                              shape=num_types)
    
    sigma_population = pm.HalfCauchy('sigma_population', 
                                     5,
                                     shape=num_types)

    # Non-centered parameterization of the group-level means
    mu_group_offset = pm.Normal('mu_group_offset', 
                                mu=0, sd=1, 
                                shape=(num_groups, num_types))
    
    mu_group = pm.Deterministic('mu_group', 
                                mu_population + sigma_population * mu_group_offset)
    
    sigma_group = pm.HalfCauchy('sigma_group', 
                                5,
                                shape=(num_groups, num_types))
    
    # Non-centered parameterization of the subject-level means
    mu_subject_offset = pm.Normal('mu_subject_offset', 
                                  mu=0, sd=1, 
                                  shape=(num_subjects, num_types))
    
    mu_subject = pm.Deterministic('mu_subject',
                                  mu_group[groups] + mu_subject_offset * sigma_group[groups])
    
    # Softmax over error types for each subject
    theta_subject = pm.Deterministic('theta_subject',
        tt.exp(mu_subject) / tt.exp(mu_subject).sum(axis=1, keepdims=True))
    
    k_subject = pm.Multinomial('k_subject', 
                               n=ns, p=theta_subject[subjects],
                               observed=y)
    
    # Tracked contrasts: raw squared difference and the chi^2-style effect size
    difference_in_means = pm.Deterministic('difference in means',
                                           ((mu_group[0] - mu_group[1]) ** 2).sum())
    
    effect_size = pm.Deterministic('effect size',
                                   (((mu_group[0] - mu_group[1]) ** 2) 
                                     / (sigma_group[0] ** 2 + sigma_group[1] ** 2)).sum())
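One implementation note: the exp/sum form of the softmax line can overflow for large mu_subject. Theano ships a row-wise softmax that computes the same quantity but subtracts the row max before exponentiating, so it may be worth swapping in:

# Numerically stabler equivalent of the exp/sum softmax line above
theta_subject = pm.Deterministic('theta_subject', tt.nnet.softmax(mu_subject))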

Have you tried doing this with more informative priors? N(0, 100) seems enormous if your data is normalized; even the HalfCauchy(5) might be a bit over-the-top.

Can you explain a little why you want a posterior over the weighted sum of squares? Does it have a particular meaning in your system? If I wanted a sense of whether the per-group heterogeneity is warranted by the data, I would compute a Bayes factor against the alternative model in which the group means are fixed at the observed mean of the data. (Or you could run two samplings and compute model likelihoods to take the ratio.)
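For completeness, a rough sketch of the two-samplings route: recent PyMC3 versions can estimate each model’s marginal likelihood as a by-product of SMC sampling. The exact attribute and function names have moved around between releases, so check your version’s docs; full_model and pooled_model here are hypothetical stand-ins for the two competing models.

import numpy as np
import pymc3 as pm

with full_model:        # the hierarchical model as posted
    pm.sample_smc(2000)
with pooled_model:      # alternative: group means fixed at the pooled mean
    pm.sample_smc(2000)

# SMC stores an evidence estimate on the model object (version-dependent)
log_bf = (np.log(full_model.marginal_likelihood)
          - np.log(pooled_model.marginal_likelihood))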

Yeah, I’ve tried more informative/less broad priors (tighter normals, and uniforms instead of HalfCauchy). I think I borrowed those values from a tutorial originally, but changing them didn’t seem to improve anything. Definitely not married to them, though.


My goal is to approach all my analyses as parameter estimation. For some related between-groups analyses, I’m using hierarchical models with a binomial outcome variable to estimate group-level differences in outcome likelihoods (these are working well; pretty much textbook models taken straight from Kruschke’s book).

I could also approach those as model comparison and compute Bayes factors, but my view is that reporting Bayes factors is less intuitive and less informative than reporting the entire posterior. For these reasons plus consistency with other analyses, I’d like to replace a \chi^2 test with an appropriate parameter estimation technique, the same way I’ve replaced binomial tests/t-tests elsewhere.

I’d also like the analysis to be accessible to a more classically trained audience. The \chi^2 test compares the test statistic \sum_i {\frac {(n_i - np_i)^2} {np_i}} to a reference \chi^2 distribution derived under the null hypothesis that the differences are due to chance (i.e., that the RVs are independent).
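A toy version of that test, with made-up counts (rows = groups, columns = error types), using scipy.stats.chi2_contingency:

import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([[30, 45, 25],
                   [50, 30, 20]])
chi2, p, dof, expected = chi2_contingency(counts)
print(chi2, p, dof)  # dof = (2 - 1) * (3 - 1) = 2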

The main reason for using softmax regression for the categorical/multinomial parameter is tractability: it’s difficult to deal with priors on Dirichlet parameters. But the analogy it exposes with the frequentist \chi^2 test is an attractive property.


Bracketing questions about the procedure itself, I think the implementation problem here may have to do with something in Categorical/Multinomial. For instance, when I try to use Categorical (I have ~20k observations), chisq_model.check_test_point() returns -inf for k_subject, and I get a Bad Energy error if I try to sample.

(This is true even if I fix the categorical parameter, as in Categorical('k_subject', p=[.3, .3, .4], observed=y); y is a single ~20k-element vector containing values in {0, 1, 2}.)
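A minimal standalone version of that check, plus the first things I’d inspect (assuming y is the flat vector of per-trial codes):

import numpy as np
import pymc3 as pm

# If a fixed-p Categorical still gives -inf, the observed vector itself is the
# prime suspect: codes outside {0, 1, 2}, NaNs, or a float dtype.
print(np.unique(y), y.dtype)
y_int = np.asarray(y, dtype='int64')

with pm.Model() as repro_model:
    pm.Categorical('k_subject', p=np.array([.3, .3, .4]), observed=y_int)

# A finite value here would point at y's dtype/values as the culprit
print(repro_model.check_test_point())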

Just a quick idea for debugging this model. Does it work if you replace

pm.Multinomial('k_subject', n=ns, p=theta_subject[subjects], observed=y)

with

pm.Normal('k_subject', mu=ns*theta_subject[subjects],
          sd=tt.sqrt(ns*theta_subject[subjects]), observed=y)

or even

pm.Normal('k_subject', mu=ns*theta_subject[subjects],
          sd=pm.HalfNormal('sd_', np.sqrt(ns)), observed=y)

?

Ah, I thought this was a pretty clever idea, but it had the same issues as the multinomial likelihood (max tree depth and target accept level warnings). At least it sampled much faster :wink:
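One knob still on my list before changing the model: raising the limits those warnings name. The kwarg spelling has moved around between PyMC3 releases, so treat this as a sketch:

# Raise the NUTS limits mentioned in the max-tree-depth / target-accept warnings
with chisq_model:
    trace = pm.sample(2000, tune=2000,
                      nuts_kwargs={'target_accept': 0.95, 'max_treedepth': 12})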

Not sure it’s clear enough to be interesting, but here are the traces from that run: [condensed traceplot image]

In the meantime, I have the example model from Kruschke’s book, ch. 22 (Poisson regression for contingency tables), working just fine on this dataset. I might modify the model setup code (replace some shape kwargs with loops over variable construction) and see if I can get it working with a Poisson likelihood…
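For anyone curious, here is roughly the shape of that log-linear / Poisson contingency-table approach. This is a sketch with hypothetical names: counts would be the num_groups × num_types table of summed error counts, not the per-subject data. (Kruschke additionally recenters the effects to sum to zero after sampling; that step is omitted here for brevity.)

import pymc3 as pm

with pm.Model() as poisson_model:
    # Exponential-ANOVA decomposition of the log cell rates
    baseline = pm.Normal('baseline', mu=0, sd=5)
    group_fx = pm.Normal('group_fx', mu=0, sd=2, shape=num_groups)
    type_fx = pm.Normal('type_fx', mu=0, sd=2, shape=num_types)
    interaction = pm.Normal('interaction', mu=0, sd=2,
                            shape=(num_groups, num_types))

    log_rate = (baseline
                + group_fx[:, None]   # broadcast over error types
                + type_fx[None, :]    # broadcast over groups
                + interaction)        # carries the group x type differences

    pm.Poisson('counts_obs', mu=pm.math.exp(log_rate), observed=counts)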

Yeah, sigma_population looks like it’s causing the problems.

It parametrizes the variance between group means, and it looks like you have < 10 groups. I don’t think that’s enough group-level observations to overcome the massive tail of a HalfCauchy(5). Even with the divergences, it looks like a HalfNormal(3) might be a bit better; see if you get a happier sampler that way.
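Concretely, that swap would look like:

# In place of the HalfCauchy(5) prior in the model above
sigma_population = pm.HalfNormal('sigma_population', sd=3, shape=num_types)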

@junpenglao I know we’ve switched to ArviZ in the new PyMC3, but I find the condensed plots above really useful for spot-checking like this. Any chance we could have something like

pm.traceplot(trace, style='condensed')

to bring back this old plotting behavior?

Re multinomial regression, you should take a look at Multinomial hierarchical regression with multiple observations per group ("Bad energy issue")

Re the PyMC3 style of trace plot, there is an open issue: https://github.com/arviz-devs/arviz/issues/94, but nobody is actively working on it yet…