Hi,
I’m experimenting with a hierarchical logistic regression model to evaluate the differences between various pooling schemes. At the moment I’m trying to understand the results from the no-pooling setup, where each group is modelled independently.
I’m getting unexpected results, however, that seem to depend on whether multiple groups are modelled in parallel in the same pm.Model
or not, and not on whether there is pooling, shared priors, or some other form of information sharing between the group models. That is not something I would expect to matter.
The data is very simple: there is just one binary variable/feature called “exposure”.
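In case it helps, here is a toy version of how such data could be simulated (the group names, sizes, and rates below are made up purely for illustration; my real data just has the same columns):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# purely illustrative: two groups, a binary exposure, and a binary target
# whose rate depends on group and exposure
rows = []
for grp, (p_unexposed, p_exposed) in {"group_1": (0.2, 0.6), "group_2": (0.4, 0.8)}.items():
    exposure = rng.integers(0, 2, size=500)
    p = np.where(exposure == 1, p_exposed, p_unexposed)
    rows.append(pd.DataFrame({"group": grp, "exposure": exposure, "target": rng.binomial(1, p)}))
data = pd.concat(rows, ignore_index=True)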
To demonstrate the issue that’s confusing me, I’ve coded two models. The first is a single model that takes as input a dataframe containing only one group:
import aesara.tensor as at
import numpy as np
import pandas as pd
import pymc as pm

def create_single_model(X, y):
    with pm.Model() as model:
        # define data
        regressors = pm.ConstantData("regressors", X)
        approval = pm.ConstantData("approval", y)
        # define priors
        intercept = pm.Normal("intercept", mu=0, sigma=5)
        betas = pm.Normal("betas", mu=0, sigma=5, shape=X.shape[1])
        # calc logit
        logit = intercept + (regressors * betas).sum(axis=-1)
        # get probabilities
        p_i = pm.Deterministic("p_i", pm.math.invlogit(logit))
        # define the bernoulli likelihood
        y_observed = pm.Bernoulli("y_obs", p=p_i, observed=approval)
        trace = pm.sample(
            1000, tune=1000, cores=2, nuts=dict(max_treedepth=15, target_accept=0.97)
        )
    return dict(model=model, trace=trace)
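For reference, I call it on one group at a time, roughly like this (using the column names described above):

grp1 = data[data.group == "group_1"]
single_result = create_single_model(grp1[["exposure"]], grp1["target"])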
The other function iterates over the groups to create N independent sub-models inside a single pm.Model, which are then sampled together:
def create_no_pooling_iter_model(
    data, model_features, samples=1000, burn_in=1000, target_accept=0.97, n_chains=2
):
    group_idxs, groups = pd.factorize(data.group)
    n_features = len(model_features)
    coords = {
        "group": groups,
        "group_id": np.arange(len(group_idxs)),
        "feature_id": np.arange(len(model_features)),
    }
    with pm.Model(coords=coords) as model:
        intercepts = {}
        betas = {}
        logits = []
        targets = []
        for name, group in data.groupby("group"):
            grp_data = pm.ConstantData(name, group[model_features])
            targets.append(group.target)
            # independent priors per group -- nothing is shared between groups
            intercepts[name] = pm.Normal(f"intercept_{name}", mu=0, sigma=5)
            betas[name] = pm.Normal(f"betas_{name}", mu=0, sigma=5, shape=n_features)
            # calc logit for this group
            ind_logit = intercepts[name] + (grp_data * betas[name]).sum(axis=-1)
            logits.append(ind_logit)
        # concatenate the per-group logits and targets in the same (groupby) order
        logit = at.concatenate(logits)
        target = pm.ConstantData("target", pd.concat(targets))
        # get probabilities
        p_i = pm.Deterministic("p_i", pm.math.invlogit(logit))
        # define the bernoulli likelihood
        y_observed = pm.Bernoulli("y_obs", p=p_i, observed=target)
        trace = pm.sample(
            samples,
            tune=burn_in,
            cores=n_chains,
            nuts=dict(max_treedepth=15, target_accept=target_accept),
        )
    return dict(model=model, trace=trace)
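The parallel version is then called on the whole dataframe in one go:

parallel_result = create_no_pooling_iter_model(data, model_features=["exposure"])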
Here I’ve simply grouped by the group and exposure variables, showing the mean of the trace summary of p_i alongside the mean of the actual observed variable, target.
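Concretely, that comparison is built roughly like this (shown for the single-group model, reusing grp1 and single_result from the call above, where the row order of the data matches the order of p_i):

import arviz as az

# posterior means of p_i, in the same row order as grp1
p_i_means = az.summary(single_result["trace"], var_names=["p_i"])["mean"].to_numpy()
comparison = (
    grp1.assign(p_i=p_i_means)
    .groupby(["group", "exposure"])[["p_i", "target"]]
    .mean()
)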
When running the first model on group_1, we get the following results for the p_i probability variable. The model is essentially just matching the data, as should be expected given the simplicity of the conditional distribution of the target given the exposure.
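To spell out why I expect that: with a single binary regressor the model has one free probability per exposure level,

$$
P(\text{target}=1 \mid \text{exposure}=0) = \operatorname{invlogit}(\alpha), \qquad
P(\text{target}=1 \mid \text{exposure}=1) = \operatorname{invlogit}(\alpha + \beta),
$$

so with priors this wide the posterior means of p_i should land very close to the observed frequencies.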
When I run the “parallel” model, however, I get the following results:
If we look at the results for group 1, the mean posterior probability of the target given the exposure is nowhere near the actual observed probability, unlike in the previous model.
I don’t understand why this is. Given that the intercepts and coefficients for each group’s model are completely independent, I would expect these results to mirror the first model’s fairly closely.
What’s going on here?