Transitioning from pm.sample() to pm.fit()

Hi,

I am trying to switch my model from pm.sample() to pm.fit() for performance reasons. I have gotten my model to sample well with pm.sample() on a subset of my data: 0 divergences, all r_hats under 1.01, and reasonable posteriors. But my full dataset is much too large for pm.sample() to be a viable option long term, unless there's some performance boost I'm missing. I am using nutpie, and with ~12k records pm.sample() takes about 6.5 minutes, but my full dataset is 100x that size.
So, when I try pm.fit(method="advi"), I get the following error immediately for several of my parameters:
The current approximation of RV parameter_x.ravel()[0] is NaN.
I'm not understanding how this could be happening when I'm able to use pm.sample() without issue. I have tried using the start parameter, but that doesn't seem to be doing anything, nor does setting the initvals in the model. Is there a specific reparametrization required when switching from pm.sample() to pm.fit()?
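For reference, here's roughly the shape of the call I've been trying; the start value shown is just a placeholder to illustrate the argument, not my actual numbers:

import pymc as pm

with model:
    # same model that samples fine with pm.sample()
    approx = pm.fit(
        method="advi",
        n=50_000,
        start={"parameter_x": 0.5},  # placeholder initial value, just to show the shape of the argument
    )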

Thanks

Do you get the error with the full dataset, or the same one you tried sample() with?

I can report that I’ve also experienced this :slight_smile:

Meaning the same dataset is less stable than with MCMC, or only with a larger dataset?

May need smaller learning rates?

Same dataset.

I assumed it’s a feature of NaN handling in our VI code vs MCMC. MCMC will just give the sample -inf probability if it’s nan, whereas VI fails outright (as reported).

Well, MCMC can just stay where it was and try a new trajectory; not sure VI can do anything? It's a deterministic optimization, no?

It could backtrack and adjust the learning rate, or try a new batch (in a minibatch setting). Optax, for example, has apply_if_finite, which would be nice for us to have in combination with other stochastic-optimization niceties like gradient clipping and learning rate scheduling.
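For reference, the Optax pattern I mean looks roughly like this (a sketch of what Optax provides, not something our VI exposes today; the specific values are just examples):

import optax

# Decay the learning rate over time, clip gradients, and skip non-finite updates
# instead of letting a single NaN step poison the optimization.
schedule = optax.exponential_decay(init_value=1e-2, transition_steps=1_000, decay_rate=0.9)
optimizer = optax.apply_if_finite(
    optax.chain(
        optax.clip_by_global_norm(1.0),        # gradient clipping
        optax.adam(learning_rate=schedule),    # learning rate scheduling
    ),
    max_consecutive_errors=5,  # give up only after 5 non-finite steps in a row
)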

@ricardoV94 It occurs on the same dataset that I use pm.sample(). I haven’t tried it out on any larger set.
@jessegrabowski Do you know of any workarounds / ways to troubleshoot? I'm not really understanding the NaN issue when my dataset doesn't have any NaNs.

The NaN happens during the optimization, probably because the gradients get too extreme.

Is there a typical workflow for trying to deal with that?

Try a different optimizer, and/or adjust the learning rate. For optimizers with momentum parameters, you can play around with those as well.
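Concretely, something along these lines; the optimizer choice and learning rate here are just illustrative starting points to tune:

with model:
    approx = pm.fit(
        method="advi",
        n=100_000,
        # pass a custom optimizer with a smaller step size than the default
        obj_optimizer=pm.adam(learning_rate=1e-3),
        callbacks=[pm.callbacks.CheckParametersConvergence(diff="absolute")],
    )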

As I mentioned, we don’t have fancier tools like gradient clipping/learning rate scheduling/retry on nan (yet!), so help wanted.


Will follow up if I find something that works :+1:

In my (admittedly limited) experience, I’ve found RMSProp to be a good one to try first.
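Something like this, where the learning rate and rho are just example values to play with:

with model:
    approx = pm.fit(
        method="advi",
        obj_optimizer=pm.rmsprop(learning_rate=1e-3, rho=0.9),
    )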

I think I may be out of luck for the time being. I've tried running through the different optimizers and their parameters, but haven't gotten anything to run. I wonder if you've ever tried anything like splitting the dataset into batches and running multiple processes in parallel with pm.sample()? Of course, this assumes my model would sample successfully on datasets that could have slight underlying differences.

Can you post your model, or something similar to it? I'm curious to see if there are any exotic features that would cause ADVI to just totally fail.

Sure. For some context, each record in my dataset is a customer transaction. Each customer will have 1+ records, and each record has a value and a transaction type. I have implemented (or have tried to implement):

  • a random effect per customer (epsilon)
  • a transaction type effect (tau)

I used Beta distributions for these to ensure the positivity of the gamma likelihood's parameters. I multiplied each of them by 2 to get distributions that somewhat resemble normal distributions centered at 1. Those get multiplied by the alpha parameter. This felt a bit hacky, but I got it to work. I am relatively new to this stuff and certainly open to suggestions.

The model code is here:

import polars as pl
import pymc as pm

# factorize is a user-defined helper returning (integer codes, unique values) for a column
customer_idx, customers = factorize(train, "customer_id")
transaction_idx, transactions = factorize(train, "transaction_id")
transaction_type_idx, transaction_types = factorize(train, "transaction_type")

values = train.select(pl.col("values")).to_numpy().squeeze()

coords = {
    "customers": customers,
    "transactions": transactions,
    "transaction_types": transaction_types
}

with pm.Model(coords=coords) as model:

    # population level params
    pop_mu_1 = pm.Normal("pop_mu_1", mu=8.25, sigma=0.25)
    pop_mu_2 = pm.Normal("pop_mu_2", mu=2.5, sigma=0.25)
    
    # customer level beta
    customer_beta = pm.Normal(
        'customer_beta',
        mu=pop_mu_2,
        sigma=0.2,
        dims="customers"
    )
    
    # Random effect
    alpha_e = pm.Normal("alpha_e", mu=2.5, sigma=0.2)
    beta_e = pm.Normal("beta_e", mu=4.7, sigma=0.3)
    epsilon = pm.Beta("epsilon", alpha=alpha_e, beta=beta_e, dims="customers")
    
    # transaction type level
    # non-centering alpha_t
    m_alpha_t = pm.Normal('m_alpha_t', mu=4, sigma=0.4)
    s_alpha_t = pm.Normal('s_alpha_t', mu=4, sigma=0.4)
    z = pm.Beta('z', alpha=3.5, beta=3.5, dims='transaction_types')
    alpha_t = pm.Deterministic('alpha_t', m_alpha_t + s_alpha_t * z, dims="transaction_types")
    
    beta_t = pm.Normal('beta_t', mu=5.6, sigma=0.25)
    
    tau = pm.Beta(
        'tau',
        alpha=alpha_t,
        beta=beta_t,
        dims=("customers", "transaction_types")
    )  
    
    transaction_alpha = pm.Deterministic(
        'transaction_alpha',
        pop_mu_1 * (2*epsilon[customer_idx]) * (2*tau[customer_idx, transaction_type_idx]),
        dims="transactions"
    )
    
    # Likelihood
    y_gamma = pm.Gamma(
        'y_gamma',
        alpha=transaction_alpha[transaction_idx],
        beta=customer_beta[customer_idx],
        observed=values,
        dims="transactions"
    )