Building the linear equation of a logistic regression model

It was mentioned in the post below that I am building the linear equation part of the model inefficiently: Access probability p parameter from pm.BetaBinomial distribution during sample_ppc

The post above contains the code used to construct the linear equation part, but I’ll also provide it below:

with pm.Model() as model:        
    intercept = pm.Normal('intercept', mu=0.0, sd=5, shape=1)

    link_argument = intercept

    for covariate in covariates:
        if covariates[covariate]['type'] == 'categorical':
            shape = covariates[covariate]['encoder'].classes_.size
        elif covariates[covariate]['type'] in ['metric', 'binary']:
            shape = 1

        sigma = pm.HalfCauchy(f'{covariate}_coeff_sigma', beta=5)
        offset = pm.Normal(f'{covariate}_coeff_offset', mu=0, sd=1, shape=shape)
        coeff = pm.Deterministic(f'{covariate}_coeff', 0.0 + offset * sigma)

        if shape > 1:
            link_argument += coeff[model_variables[covariate]]
        else:
            link_argument += coeff * model_variables[covariate]

    omega = pm.Deterministic('omega', pm.invlogit(link_argument))
    kappa = pm.Exponential('kappa', lam=1e-4)

    alpha = pm.Deterministic('alpha', omega * kappa + 1)     
    beta = pm.Deterministic('beta', (1 - omega) * kappa + 1)     
    likelihood = pm.BetaBinomial(
        'likelihood', alpha=alpha, beta=beta, n=model_variables['n'], observed=model_variables['y_obs']
    )

I have two questions as I start to productionize this model:

1.) What is a more efficient way to build this model than the for loop, one that still takes the categorical variables into account?

2.) Does the inefficient construction of the model affect the Theano calculations that take place when sampling the posterior distribution?

If the inefficiency only affects building the model itself, it is less of a concern to me.

Thanks for the help!

As a pymc3 n00b, I can’t answer all your questions, but for question 1:
You could create offset (and sigma) without a loop as:
offset = pm.Normal('offset', mu=0, sd=1, shape=n_covariates)
sigma = pm.HalfCauchy('sigma', beta=5, shape=n_covariates)
and then take the dot product with your data:
pm.math.dot(offset * sigma, model_variables)

You would have to create a column of all ones in your data for the intercept.
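A minimal sketch of that idea, assuming the covariates have already been stacked into a NumPy matrix X with a leading column of ones for the intercept (the dummy data and names here are only illustrative):

import numpy as np
import pymc3 as pm

# Dummy data: 100 observations, an intercept column plus 5 covariate columns
X = np.column_stack([np.ones(100), np.random.randn(100, 5)])
n_covariates = X.shape[1]

with pm.Model() as vectorized_model:
    sigma = pm.HalfCauchy('coeff_sigma', beta=5, shape=n_covariates)
    offset = pm.Normal('coeff_offset', mu=0, sd=1, shape=n_covariates)
    coeff = pm.Deterministic('coeff', offset * sigma)  # non-centered coefficients

    # One matrix-vector product replaces the per-covariate for loop
    link_argument = pm.math.dot(X, coeff)
    omega = pm.Deterministic('omega', pm.math.invlogit(link_argument))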

Not sure on #2, but why don’t you test it? :wink:

This actually does not handle the categorical variables properly: it would assign one coefficient to the whole categorical covariate instead of a separate coefficient for each category, if I’m reading it correctly.

Ah yeah, that’s fair. I guess you can dummy code your categorical variables then.
The coding can be done outside the model block as a data preprocessing step to keep the model code a bit cleaner.
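For example, the dummy coding could look something like this as a preprocessing step (the data and column names here are made up):

import pandas as pd

# Hypothetical raw data: one categorical and one metric covariate
df = pd.DataFrame({
    'machine': ['a', 'b', 'c', 'a', 'b'],  # categorical
    'load': [0.1, 0.5, 0.3, 0.9, 0.2],     # metric
})

# One indicator column per category, plus the metric column and an intercept column
design = pd.get_dummies(df, columns=['machine'])
design.insert(0, 'intercept', 1.0)
X = design.values.astype('float64')  # shape (n_obs, n_columns), ready for pm.math.dot(X, coeff)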

TBH, I don’t think you are likely to get any speedup from a dot product for a model as small as this one, compared to the way you have it now.

For larger models (and generally for pymc3 model specification), matrix multiplication > for loops.

For more complicated models where you think you might need a for loop, especially when porting from other MCMC packages, this notebook in the docs has been helpful: https://docs.pymc.io/notebooks/PyMC3_tips_and_heuristic.html

@junpenglao, do you have any suggestion on how to proceed?

I tried vectorizing the above operations into a dot product of the design matrix and coefficient vectors, but that actually resulted in longer computation time, I think because it blew up the graph size.

Here is how I vectorized:

def construct_model(model_parameters):
    with pm.Model() as model:
        # The bounds of this distribution correspond to machine 0 and machine 1 on logit scale.
        # This means that the baseline number is bounded on [0, 1].
        intercept = pm.Normal('intercept', mu=0, sd=5.0, shape=1)

        link_argument = intercept

        design_matrix = []
        coefficients = []

        for parameter in model_parameters:
            model_label = model_parameters[parameter].get('model_label', parameter)

            tensor = model_parameters[parameter]['shared']

            if model_parameters[parameter]['parameter'] != 'covariate':
                continue

            if model_parameters[parameter]['covariate_type'] == 'categorical':
                shape = model_parameters[parameter]['encoder'].classes_.size

                matrix = []
                for pos in range(shape):
                    # Create a vector that is 1 if equal to the position and 0 otherwise, i.e. (one-hot-encode)
                    vector = tt.switch(tt.eq(tensor, np.array(pos)), np.array(1.0), np.array(0.0))
                    matrix.append(vector)
                matrix = tt.stack(matrix, axis=1)
                print(matrix.eval())
                print(tensor.eval())

            elif model_parameters[parameter]['covariate_type'] in ['metric', 'binary']:
                shape = 1

                # This turns an array into a vector of dimension Nx1.
                matrix = tensor.dimshuffle(0, 'x')

            logger.debug(f'Using parameter {parameter} to create coefficient {model_label}')

            sigma = pm.HalfNormal(f'{model_label}_coeff_sigma', sd=1.0)
            offset = pm.Normal(f'{model_label}_coeff_offset', mu=0, sd=1.0, shape=shape)
            coeff = pm.Deterministic(f'{model_label}_coeff', 0.0 + offset * sigma)

            design_matrix.append(matrix)
            coefficients.append(coeff.dimshuffle(0, 'x'))

        design_matrix = tt.concatenate(design_matrix, axis=1)
        coefficients = tt.concatenate(coefficients, axis=0)

        # omega parameter is "mode" of probability p and kappa parameter is "certainty"
        omega = pm.Deterministic('omega', pm.invlogit(tt.dot(design_matrix, coefficients)))
        offset = pm.Exponential('kappa_offset', lam=1, shape=1)
        kappa = pm.Deterministic('kappa', (1 / 1e-3) * offset)

        alpha = omega * kappa + 1
        beta = (1 - omega) * kappa + 1

        observed = model_parameters['impressions_target']['shared']
        n = model_parameters['impressions_anchor']['shared']

        likelihood = pm.BetaBinomial('likelihood', alpha=alpha, beta=beta, n=n, observed=observed)

    return model

For context, the original way of building the model resulted in gradient calculations of 3.97 ms ± 116 µs per loop. With the version just posted, it took ~6 s per loop.

Is 3.97 ms per loop too slow, or is that a decently fast calculation speed?
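(As a sketch of how timings like these can be gathered: PyMC3’s Model.profile helper can profile the log-posterior and its gradient, which is what NUTS evaluates at every leapfrog step. Here model refers to the model object built above.)

from pymc3.theanof import gradient

# Profile the log-posterior and its gradient through Theano's profiler
model.profile(model.logpt).summary()
model.profile(gradient(model.logpt, model.vars)).summary()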

Hmm, you are not really vectorizing it; you still have a for loop, which is likely quite inefficient.
For a vectorized parameterization, you have something like a predictor matrix X and you do X.dot(beta).
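Spelled out, that looks something like the following sketch (X is a made-up predictor matrix with the intercept column already included; the priors just mirror the earlier code):

import numpy as np
import pymc3 as pm

# Made-up predictor matrix: a leading column of ones for the intercept plus 5 covariate columns
X = np.column_stack([np.ones(100), np.random.randn(100, 5)])

with pm.Model() as flat_model:
    beta = pm.Normal('beta', mu=0., sd=1., shape=X.shape[1])
    eta = pm.math.dot(X, beta)  # the whole linear predictor in one matrix-vector product
    omega = pm.Deterministic('omega', pm.math.invlogit(eta))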

Got it. I’m on it! Thank you.

OK, so building the design matrix with numpy before the model context and then wrapping it in a shared tensor is giving me a similar ~6 s per loop. I am still building the coefficients the same way, since it’s not clear how else to do it.

Also, is it possible to get below that 3.97 ms per loop with vectorization, or should I pursue other routes?

The long and short of it is that training takes roughly 15 minutes, and I would like to reduce that time as much as possible.

You should do something like beta = pm.Normal('beta', mu=0., sd=1., shape=(X.shape[1], 1))

But wouldn’t that fail to capture the hierarchical nature of the model?

You still can; the “trick” is all in the construction of the fixed- and random-effects design matrices.
You might find e.g. https://github.com/junpenglao/GLMM-in-Python/blob/master/Playground.py helpful.
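One way to keep the per-covariate hierarchy while still using a single dot product is to index a small vector of group-level sigmas by a column-to-covariate mapping. A sketch with made-up shapes and names (X, col_to_cov, and the dummy data are illustrative only):

import numpy as np
import pymc3 as pm

X = np.random.randn(100, 7)                    # dummy-coded design matrix (no intercept column)
col_to_cov = np.array([0, 0, 0, 1, 1, 2, 2])   # which original covariate each column came from
n_covariates = col_to_cov.max() + 1

with pm.Model() as hierarchical_model:
    intercept = pm.Normal('intercept', mu=0., sd=5.)

    # one sigma per original covariate, shared by all of its dummy columns
    sigma = pm.HalfCauchy('coeff_sigma', beta=5, shape=n_covariates)
    offset = pm.Normal('coeff_offset', mu=0., sd=1., shape=X.shape[1])
    coeff = pm.Deterministic('coeff', offset * sigma[col_to_cov])

    link_argument = intercept + pm.math.dot(X, coeff)
    omega = pm.Deterministic('omega', pm.math.invlogit(link_argument))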

So, even without the hierarchical effect, building a vector of coefficients and multiplying by the design matrix is still not as fast as the initial version.

def construct_model(model_parameters):
    covariates = [parameter for parameter in model_parameters if model_parameters[parameter]['parameter'] == 'covariate']
    design_matrix = np.column_stack([model_parameters[covariate]['array'] for covariate in covariates])

    shape = design_matrix.shape[1]
    design_matrix = shared(design_matrix)

    with pm.Model() as model:
        # The bounds of this distribution correspond to machine 0 and machine 1 on logit scale.
        # This means that the baseline number is bounded on [0, 1].
        intercept = pm.Normal('intercept', mu=0, sd=5.0, shape=1)

        coefficients = pm.Normal('beta', mu=0., sd=1., shape=(shape, 1))

        # omega parameter is "mode" of probability p and kappa parameter is "certainty"
        omega = pm.Deterministic('omega', pm.invlogit(intercept + design_matrix.dot(coefficients)))
        offset = pm.Exponential('kappa_offset', lam=1, shape=1)
        kappa = pm.Deterministic('kappa', (1 / 1e-3) * offset)

        alpha = omega * kappa + 1
        beta = (1 - omega) * kappa + 1

        observed = model_parameters['impressions_target']['shared']
        n = model_parameters['impressions_anchor']['shared']

        likelihood = pm.BetaBinomial('likelihood', alpha=alpha, beta=beta, n=n, observed=observed)

    return model

FYI, there are only 6 features, but they expand to 351 covariate columns, i.e. the shape of the design matrix is (5473, 351).