Variational methods in PyMC

import arviz as az
import pymc as pm

# X_train (with columns 'feature_1', 'feature_2') and y_train are assumed
# to be defined already, e.g. from a train/test split.
with pm.Model() as logistic_model:
    # Weakly informative priors on the intercept and the two coefficients
    beta_0 = pm.Normal('beta_0', 0, 4)
    beta_1 = pm.Normal('beta_1', 0, 4)
    beta_2 = pm.Normal('beta_2', 0, 4)

    # Data containers, so the inputs can be swapped out later (e.g. for test data)
    feature_1 = pm.Data("feature_1", value=X_train['feature_1'], mutable=True)
    feature_2 = pm.Data("feature_2", value=X_train['feature_2'], mutable=True)
    label = pm.Data("label", value=y_train, mutable=True)

    # Bernoulli likelihood with a logistic (sigmoid) link
    observed = pm.Bernoulli("binary_label", pm.math.sigmoid(beta_0 + beta_1 * feature_1 + beta_2 * feature_2), observed=label)



with logistic_model:
    # Fit a mean-field ADVI approximation, then draw samples from it
    mean_field = pm.fit(n=100000, method='advi')
    trace = mean_field.sample(2000)
    az.plot_trace(trace)



Here I am building a logistic regression model in PyMC, and I am using variational methods to sample the coefficients. I want to ask whether variational methods can produce multiple sample chains, like NUTS does. In this plot I got one chain for each parameter.

Your question is a reasonable one, but it turns out the answer to this:

I want to ask whether variational methods can produce multiple sample chains, like NUTS does.

is a little more complicated.

In short, there are no “chains” in the mean-field variational inference (MFVI) method that PyMC uses. Chains appear with the No-U-Turn Sampler because NUTS belongs to the class of Markov chain Monte Carlo (MCMC) methods, in which each chain really is a sequence of autocorrelated draws. MFVI does not produce autocorrelated draws, hence no chains.
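To make the difference concrete, here is a small sketch (assuming the logistic_model and mean_field objects defined above, and recent PyMC versions where both samplers return InferenceData): NUTS produces a (chain, draw) layout, while draws from the fitted approximation come back as a single chain.

with logistic_model:
    # NUTS: 4 independent Markov chains of autocorrelated draws
    nuts_trace = pm.sample(draws=1000, chains=4)

# Draws from the ADVI approximation fitted above arrive as one "chain"
advi_trace = mean_field.sample(1000)

print(nuts_trace.posterior.sizes)   # {'chain': 4, 'draw': 1000}
print(advi_trace.posterior.sizes)   # {'chain': 1, 'draw': 1000}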

Variational inference works differently: it first posits a (potentially multivariate) Gaussian approximation to the posterior, then runs many optimization steps to fit the parameters of that approximation. Once the optimization is done, we generate samples by drawing directly from the fitted Gaussian. Those draws are independent of one another; there is no sequence of intermediate states as with MCMC. So to get 1000 independent samples under the MFVI approximation, we simply take 1000 Gaussian draws, whereas NUTS would typically need well over 1000 actual draws to yield the equivalent of 1000 independent samples. Multiple “chains” would add nothing for MFVI, because its samples are independent to begin with and sequential correlation never enters the picture.
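One way to see this in practice is through the effective sample size. Continuing the sketch above, az.ess on the ADVI draws should come out very close to the nominal number of draws, while the NUTS ESS is typically lower than the total draw count:

# ESS should be close to 1000 for every parameter, because advi_trace
# holds independent Gaussian draws rather than a Markov chain.
print(az.ess(advi_trace))

# For comparison: the NUTS ESS is usually below 4 * 1000, since consecutive
# draws within each chain are autocorrelated.
print(az.ess(nuts_trace))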

What’s more, MFVI will typically converge to the same local optimum regardless of the starting conditions. Here’s an animation I made that shows how the VI approximation (the green blob) fits the true distribution (black contours and points) even from multiple starting points.
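You can run a similar experiment in code. Here is a rough sketch (assuming pm.fit’s start argument, which recent PyMC versions accept) that fits the same model from two deliberately different initializations and compares the results:

with logistic_model:
    fit_a = pm.fit(n=50000, method='advi',
                   start={'beta_0': -3.0, 'beta_1': 3.0, 'beta_2': -3.0})
    fit_b = pm.fit(n=50000, method='advi',
                   start={'beta_0': 3.0, 'beta_1': -3.0, 'beta_2': 3.0})

# Both runs should land on essentially the same approximate posterior mean
# (values are reported in the model's unconstrained parameter space).
print(fit_a.mean.eval())
print(fit_b.mean.eval())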

There are blends of variational and MCMC methods explored in the research literature, but that’s a very deep rabbit hole, and they aren’t much used in PyMC. One notable exception is using MFVI as an initialization routine to choose NUTS’ starting point.
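PyMC exposes that initialization directly through pm.sample’s init argument; for example:

with logistic_model:
    # Run ADVI first, then use the fitted approximation to pick NUTS'
    # starting points and adapt a diagonal mass matrix.
    trace = pm.sample(draws=1000, chains=4, init='advi+adapt_diag')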
