Transitioning from pm.sample() to pm.fit()

Hi,

I am trying to switch my model from pm.sample() to pm.fit() for performance reasons. I have gotten my model to sample well with pm.sample() on a subset of my data: 0 divergences, all r_hats under 1.01, and reasonable posteriors. But my full dataset is much too large for pm.sample() to be a viable option long term, unless there's some performance boost I'm missing. I am using nutpie, and with ~12k records pm.sample() takes about 6.5 minutes, but my full dataset is 100x that size.
So, when I try pm.fit(method="advi"), I get the following error immediately for several of my parameters:
The current approximation of RV parameter_x.ravel()[0] is NaN.
I'm not understanding how this could be happening when I'm able to use pm.sample() without issue. I have tried using the start parameter, but that doesn't seem to be doing anything, nor does setting the initvals in the model. Is there a specific reparametrization required when switching from pm.sample() to pm.fit()?
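For reference, here's roughly the shape of the call I've been trying; the start value shown is just a placeholder to illustrate the argument, not my actual numbers:

import pymc as pm

with model:
    # same model that samples fine with pm.sample()
    approx = pm.fit(
        method="advi",
        n=50_000,
        start={"parameter_x": 0.5},  # placeholder initial value, just to show the shape of the argument
    )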

Thanks

Do you get the error with the full dataset, or the same one you tried sample() with?

I can report that I’ve also experienced this :slight_smile:

Meaning the same dataset is less stable than with MCMC, or only with a larger dataset?

May need smaller learning rates?

Same dataset.

I assumed it’s a feature of NaN handling in our VI code vs MCMC. MCMC will just give the sample -inf probability if it’s nan, whereas VI fails outright (as reported).

Well, MCMC can just stay where it was and try a new trajectory; not sure VI can do anything? It's a deterministic optimization, no?

It could backtrack and adjust the learning rate, or try a new batch (in a minibatch setting). Optax, for example, has apply_if_finite, which would be nice for us to have in combination with other stochastic-optimization niceties like gradient clipping and learning rate scheduling.
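For reference, the Optax pattern I mean looks roughly like this (a sketch of what Optax provides, not something our VI exposes today; the specific values are just examples):

import optax

# Decay the learning rate over time, clip gradients, and skip non-finite updates
# instead of letting a single NaN step poison the optimization.
schedule = optax.exponential_decay(init_value=1e-2, transition_steps=1_000, decay_rate=0.9)
optimizer = optax.apply_if_finite(
    optax.chain(
        optax.clip_by_global_norm(1.0),        # gradient clipping
        optax.adam(learning_rate=schedule),    # learning rate scheduling
    ),
    max_consecutive_errors=5,  # give up only after 5 non-finite steps in a row
)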

@ricardoV94 It occurs on the same dataset that I use pm.sample(). I haven’t tried it out on any larger set.
@jessegrabowski Do you know of any workarounds / ways to troubleshoot? I'm not really understanding the NaN issue when my dataset doesn't have any NaNs.

The NaN happens during the optimization, probably because the gradients get too extreme.

Is there a typical workflow for trying to deal with that?

Try a different optimizer, and/or adjust the learning rate. For optimizers with momentum parameters, you can play around with those as well.
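Concretely, something along these lines; the optimizer choice and learning rate here are just illustrative starting points to tune:

with model:
    approx = pm.fit(
        method="advi",
        n=100_000,
        # pass a custom optimizer with a smaller step size than the default
        obj_optimizer=pm.adam(learning_rate=1e-3),
        callbacks=[pm.callbacks.CheckParametersConvergence(diff="absolute")],
    )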

As I mentioned, we don’t have fancier tools like gradient clipping/learning rate scheduling/retry on nan (yet!), so help wanted.


Will follow up if I find something that works :+1:

In my (admittedly limited) experience, I’ve found RMSProp to be a good one to try first.
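Something like this, where the learning rate and rho are just example values to play with:

with model:
    approx = pm.fit(
        method="advi",
        obj_optimizer=pm.rmsprop(learning_rate=1e-3, rho=0.9),
    )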

I think I may be out of luck for the time being. I've tried running through the different optimizers and their parameters, but haven't gotten anything to run. I wonder if you've ever tried anything like splitting the dataset into batches and running multiple processes in parallel with pm.sample()? Of course, this assumes my model would sample successfully on datasets that could have slight underlying differences.

Can you post your model, or something similar to it? I'm curious to see if there are any exotic features that would cause ADVI to just totally fail.

Sure. For some context, each record in my dataset is a customer transaction. Each customer will have 1+ records, and each record has a value and a transaction type. I have implemented (or have tried to implement):

  • a random effect per customer (epsilon)
  • a transaction type effect (tau)

I used Beta distributions for these to ensure the positivity of the gamma likelihood's parameters. I multiplied each of them by 2 to get distributions that somewhat resemble normal distributions centered at 1. Those get multiplied by the alpha parameter. This felt a bit hacky, but I got it to work. I am relatively new to this stuff and certainly open to suggestions.

The model code is here:

import polars as pl
import pymc as pm

# factorize is a user-defined helper returning (integer codes, unique values) for a column
customer_idx, customers = factorize(train, "customer_id")
transaction_idx, transactions = factorize(train, "transaction_id")
transaction_type_idx, transaction_types = factorize(train, "transaction_type")

values = train.select(pl.col("values")).to_numpy().squeeze()

coords = {
    "customers": customers,
    "transactions": transactions,
    "transaction_types": transaction_types
}

with pm.Model(coords=coords) as model:

    # population level params
    pop_mu_1 = pm.Normal("pop_mu_1", mu=8.25, sigma=0.25)
    pop_mu_2 = pm.Normal("pop_mu_2", mu=2.5, sigma=0.25)
    
    # customer level beta
    customer_beta = pm.Normal(
        'customer_beta',
        mu=pop_mu_2,
        sigma=0.2,
        dims="customers"
    )
    
    # Random effect
    alpha_e = pm.Normal("alpha_e", mu=2.5, sigma=0.2)
    beta_e = pm.Normal("beta_e", mu=4.7, sigma=0.3)
    epsilon = pm.Beta("epsilon", alpha=alpha_e, beta=beta_e, dims="customers")
    
    # transaction type level
    # non-centering alpha_t
    m_alpha_t = pm.Normal('m_alpha_t', mu=4, sigma=0.4)
    s_alpha_t = pm.Normal('s_alpha_t', mu=4, sigma=0.4)
    z = pm.Beta('z', alpha=3.5, beta=3.5, dims='transaction_types')
    alpha_t = pm.Deterministic('alpha_t', m_alpha_t + s_alpha_t * z, dims="transaction_types")
    
    beta_t = pm.Normal('beta_t', mu=5.6, sigma=0.25)
    
    tau = pm.Beta(
        'tau',
        alpha=alpha_t,
        beta=beta_t,
        dims=("customers", "transaction_types")
    )  
    
    transaction_alpha = pm.Deterministic(
        'transaction_alpha',
        pop_mu_1 * (2*epsilon[customer_idx]) * (2*tau[customer_idx, transaction_type_idx]),
        dims="transactions"
    )
    
    # Likelihood
    y_gamma = pm.Gamma(
        'y_gamma',
        alpha=transaction_alpha[transaction_idx],
        beta=customer_beta[customer_idx],
        observed=values,
        dims="transactions"
    )