Speeding up hierarchical ordered logistic regression

I have a dataset of 1M+ book ratings between 1 and 5, covering thousands of books. One book can have from 1 to a couple of hundred ratings. I want to find the best-rated books using hierarchical ordered logistic regression. My model:

import numpy as np
import pymc as pm

# df has one row per rating, with columns book_id (integer codes) and rating (1-5)
with pm.Model() as model:
    mu0 = pm.Normal('mu0', 0, 0.5)
    sigma0 = pm.Exponential('sigma0', 1)
    b = pm.Normal("b", mu0, sigma0, shape=df.book_id.nunique())
    cutpoints = pm.Normal("cutpoints", mu=0, sigma=1, shape=4,
                          transform=pm.distributions.transforms.univariate_ordered,
                          initval=np.linspace(-1, 1, 4))

    rating = pm.OrderedLogistic("rating", eta=b[df.book_id.values], cutpoints=cutpoints,
                                observed=df.rating - 1, compute_p=False) 

    idata = pm.sample(draws=1000, tune=500, chains=4, cores=4, nuts_sampler='numpyro')

Is there a way to speed up my model? It takes at least a couple of hours to sample (I haven’t actually waited until the end). It’s okay if the posterior is only approximated; I’m mostly interested in the mean value of the intercept per book_id. Thanks!

Try using a non-centered parameterization: Why hierarchical models are awesome, tricky, and Bayesian — While My MCMC Gently Samples
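For concreteness, here is a minimal sketch of a non-centered version of the model above (same df as in the question; only the prior on b changes: it implies the same distribution but usually samples much better when there are many groups):

import numpy as np
import pymc as pm

with pm.Model() as model_nc:
    mu0 = pm.Normal('mu0', 0, 0.5)
    sigma0 = pm.Exponential('sigma0', 1)
    # standardized offsets; b = mu0 + z * sigma0 has the same prior as Normal(mu0, sigma0)
    z = pm.Normal('z', 0, 1, shape=df.book_id.nunique())
    b = pm.Deterministic('b', mu0 + z * sigma0)
    cutpoints = pm.Normal("cutpoints", mu=0, sigma=1, shape=4,
                          transform=pm.distributions.transforms.univariate_ordered,
                          initval=np.linspace(-1, 1, 4))
    rating = pm.OrderedLogistic("rating", eta=b[df.book_id.values], cutpoints=cutpoints,
                                observed=df.rating - 1, compute_p=False)
    idata = pm.sample(draws=1000, tune=500, chains=4, cores=4, nuts_sampler='numpyro')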

It might also be a good use case for the Pathfinder variational method in pymc_experimental?
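If you want to try it, a minimal sketch (assuming pymc_experimental is installed; the available keyword arguments may differ between versions):

import pymc_experimental as pmx

# Fit the model defined above with Pathfinder instead of NUTS;
# this gives an approximate posterior rather than exact MCMC draws.
with model:
    idata_pf = pmx.fit(method="pathfinder")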

Thanks! To be honest, sampling time was already a problem even without the hierarchical part. But I’ll try it anyway, thanks!

I guess you have millions of data points? That’s definitely up there. Pathfinder is worth a try, or sampling on the GPU perhaps. Or ADVI.

Thanks! Yes, I have almost a million observations. I tried Pathfinder, but I’m getting NaNs for everything in az.summary. Any advice?

Does it work on a smaller subset of the data?

If you’re going to use an approximation like Pathfinder, I’d encourage you to simulate some data from your priors and do some parameter-recovery exercises to make sure it can give the right answer. You can also scale up the quantity of data you generate to see whether you even need all million of those data points.

This exercise would also let you know whether the NaNs are due to a bug in the model, a bug in Pathfinder, prior scaling, outliers in the data, or something else.
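A minimal sketch of such a recovery exercise (the sizes and true values below are made up, not your real data):

import arviz as az
import numpy as np
import pymc as pm

# Simulate ratings from known per-book intercepts and cutpoints,
# refit the model, and check that the posterior means track the truth.
rng = np.random.default_rng(0)
n_books = 200
true_b = rng.normal(0, 0.5, n_books)                # true per-book intercepts
true_cutpoints = np.array([-1.5, -0.5, 0.5, 1.5])

book_idx = rng.integers(0, n_books, size=20_000)
cum = 1 / (1 + np.exp(-(true_cutpoints - true_b[book_idx, None])))  # P(rating <= k)
p = np.column_stack([cum[:, 0], np.diff(cum, axis=1), 1 - cum[:, -1]])
rating = np.array([rng.choice(5, p=row) for row in p])

with pm.Model():
    sigma = pm.Exponential("sigma", 1)
    b = pm.Normal("b", 0, sigma, shape=n_books)
    cutpoints = pm.Normal("cutpoints", mu=0, sigma=1, shape=4,
                          transform=pm.distributions.transforms.univariate_ordered,
                          initval=np.linspace(-1, 1, 4))
    pm.OrderedLogistic("rating", eta=b[book_idx], cutpoints=cutpoints,
                       observed=rating, compute_p=False)
    idata = pm.sample()  # swap in Pathfinder/ADVI here to test the approximation

recovered = az.summary(idata, var_names=["b"])["mean"].to_numpy()
print(np.corrcoef(true_b, recovered)[0, 1])  # should be close to 1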

My wild guess is it’s some kind of boundary issue with zero probability in pm.OrderedLogistic.


Thanks! For a smaller dataset I get NaNs only for the cutpoints, so their prior might be problematic.
Is there a way to fit the model on aggregated data (columns would then be book_id, rating, count) instead of book_id, rating? That would decrease the number of rows by about a factor of 10. I can also open a separate question for that. Maybe @twiecki can also help?

@mihagazvoda do you have a simulated dataset that looks like yours? I think I have an idea.


Thomas, here’s code for the simulated data. My idea was to aggregate and, that way, potentially speed up the computation. I also have a few covariates that I didn’t include in the simulation. Hope that’s okay:

import numpy as np
import pandas as pd
import pymc as pm
import arviz as az

def calc_rating(book_id_intercept):
    b01, b02, b03, b04 = -1, -0.5, 0.5, 1

    # Calculate log odds
    logodds1 = book_id_intercept - b01
    logodds2 = book_id_intercept - b02
    logodds3 = book_id_intercept - b03
    logodds4 = book_id_intercept - b04

    # Convert log odds to probabilities
    inv_logit = lambda logit: np.exp(logit) / (1 + np.exp(logit))
    prob_2to5 = inv_logit(logodds1)
    prob_3to5 = inv_logit(logodds2)
    prob_4to5 = inv_logit(logodds3)
    prob_5 = inv_logit(logodds4)

    # Calculate individual category probabilities
    prob_1 = 1 - prob_2to5
    prob_2 = prob_2to5 - prob_3to5
    prob_3 = prob_3to5 - prob_4to5
    prob_4 = prob_4to5 - prob_5

    return np.random.choice([1, 2, 3, 4, 5], p=[prob_1, prob_2, prob_3, prob_4, prob_5])

np.random.seed(42)
num_books = 50000

df_agg = pd.DataFrame({
    'book_id': [i for i in range(num_books)],
    'a': np.random.normal(0, 1, num_books),
    'n': np.random.exponential(20, num_books).astype('int') + 1
})

df = df_agg.loc[df_agg.index.repeat(df_agg['n'])].reset_index(drop=True).drop(columns='n')
df['rating'] = df['a'].apply(calc_rating)

with pm.Model() as model:
    abar = pm.Normal('abar', 0, 1)
    sigma = pm.Exponential('sigma', 3)
    za = pm.Normal('za', 0, 1, shape=df.book_id.nunique())
    a = pm.Deterministic('a', abar + za * sigma)
    cutpoints = pm.Normal("cutpoints", mu=0, sigma=1, shape=4,
                          transform=pm.distributions.transforms.univariate_ordered,
                          initval=np.linspace(-1, 1, 4))

    rating = pm.OrderedLogistic("rating", eta=a[df.book_id.values], cutpoints=cutpoints,
                                observed=df.rating - 1, compute_p=False) 

    idata = pm.sample(draws=1000, tune=500, chains=2, cores=2, nuts_sampler='blackjax')

Thanks in advance!

I had a very similar issue last year - see here

Minibatch ADVI dramatically sped up runtime (by a factor of 20+), and the parameter estimates (that was all I needed) were pretty much indistinguishable from a full NUTS MCMC run.
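For reference, a minimal Minibatch ADVI sketch for the non-centered model in this thread (assuming the same df as above; the batch size and number of iterations are placeholders you would want to tune):

import numpy as np
import pymc as pm

book_idx = df.book_id.to_numpy()         # integer codes, one per observation
rating_obs = (df.rating - 1).to_numpy()  # shift 1-5 ratings to 0-4

# Draw random minibatches of (book index, rating) pairs jointly
mb_book, mb_rating = pm.Minibatch(book_idx, rating_obs, batch_size=512)

with pm.Model() as mb_model:
    abar = pm.Normal("abar", 0, 1)
    sigma = pm.Exponential("sigma", 1)
    za = pm.Normal("za", 0, 1, shape=df.book_id.nunique())
    a = abar + za * sigma  # non-centered per-book intercepts
    cutpoints = pm.Normal("cutpoints", mu=0, sigma=1, shape=4,
                          transform=pm.distributions.transforms.univariate_ordered,
                          initval=np.linspace(-1, 1, 4))
    pm.OrderedLogistic("rating", eta=a[mb_book], cutpoints=cutpoints,
                       observed=mb_rating, total_size=len(rating_obs),
                       compute_p=False)
    approx = pm.fit(n=50_000, method="advi")  # minibatch ADVI
    idata_advi = approx.sample(1000)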


Sorry for the slow response. I think I have something that works better. It uses the fact that you have multiple observations for each book and rating value.

Full code:

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt


from pymc.sampling.jax import sample_numpyro_nuts


def calc_rating(book_id_intercept):
    b01, b02, b03, b04 = -1, -0.5, 0.5, 1

    # Calculate log odds
    logodds1 = book_id_intercept - b01
    logodds2 = book_id_intercept - b02
    logodds3 = book_id_intercept - b03
    logodds4 = book_id_intercept - b04

    # Convert log odds to probabilities
    inv_logit = lambda logit: np.exp(logit) / (1 + np.exp(logit))
    prob_2to5 = inv_logit(logodds1)
    prob_3to5 = inv_logit(logodds2)
    prob_4to5 = inv_logit(logodds3)
    prob_5 = inv_logit(logodds4)

    # Calculate individual category probabilities
    prob_1 = 1 - prob_2to5
    prob_2 = prob_2to5 - prob_3to5
    prob_3 = prob_3to5 - prob_4to5
    prob_4 = prob_4to5 - prob_5

    return np.random.choice([1, 2, 3, 4, 5], p=[prob_1, prob_2, prob_3, prob_4, prob_5])

np.random.seed(42)
num_books = 200 # Only 200 to test quicker, but you can try with the full data size

df_agg = pd.DataFrame({
    'book_id': [i for i in range(num_books)],
    'a': np.random.normal(0, 1, num_books),
    'n': np.random.exponential(20, num_books).astype('int') + 1
})

df = df_agg.loc[df_agg.index.repeat(df_agg['n'])].reset_index(drop=True).drop(columns='n')
df['rating'] = df['a'].apply(calc_rating)

# Count number of observations for every book_id and rating
df_agg = df.groupby(["book_id", "rating"]).size().to_frame("count").reset_index()


# Create the model, using a Potential so we can weight the log-likelihood by the counts (a custom distribution could be used too)
with pm.Model() as model:
    abar = pm.Normal("abar", 0, 1)
    sigma = pm.Gamma("sigma", alpha=2, beta=5)  # changed so the prior doesn't put its mode at zero
    za = pm.Normal("za", 0, 1, shape=df["book_id"].nunique())
    a = abar + za * sigma  # avoid a Deterministic; `a` is recomputed from the posterior after sampling

    cutpoints = pm.Normal(
        "cutpoints", mu=0, sigma=1, shape=4,
        transform=pm.distributions.transforms.ordered,
        initval=np.linspace(-1, 1, 4)
    )
    
    eta = a[df_agg["book_id"].to_numpy()]
    cumulative_probs = pm.math.sigmoid(cutpoints - pt.shape_padright(eta))
    p = pt.concatenate(
        [
            pt.shape_padright(cumulative_probs[..., 0]),
            cumulative_probs[..., 1:] - cumulative_probs[..., :-1],
            pt.shape_padright(1 - cumulative_probs[..., -1]),
        ],
        axis=-1,
    )
    pm.Potential(
        "likelihood", 
        df_agg["count"].to_numpy() * pm.Categorical.logp((df_agg["rating"].to_numpy() - 1), p)
    )
    idata = sample_numpyro_nuts(draws=1000, tune=1000, chains=2, random_seed=1234)
idata.posterior["a"] = idata.posterior["abar"] + idata.posterior["za"] * idata.posterior["sigma"]
az.plot_trace(idata, var_names="a");
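If you just need the posterior mean of the intercept per book (the original goal), you can pull it from the recomputed a, for example:

book_means = idata.posterior["a"].mean(dim=("chain", "draw"))  # posterior mean intercept per book_id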

Edit

Last updated: Mon Oct 02 2023

Python implementation: CPython
Python version       : 3.10.10
IPython version      : 8.12.0

numpy   : 1.24.2
pymc    : 5.8.2
pandas  : 1.5.3
pytensor: 2.16.3
arviz   : 0.15.1

Watermark: 2.4.3

Edit 2

For the record, on my computer, using 1000 books, the original model takes ~10 minutes to sample, while the new implementation takes about 2 minutes 15 seconds.
