What's the difference between pm.fit(..., method='advi') and pm.ADVI()?

Hi!

I am new to PyMC and there are a few questions concerning ADVI that I cannot figure out by myself.

I was using PSIS to diagnose ADVI's performance, following the method proposed in this paper. My model is a GMM, and the implementation of PSIS is based on Junpeng's notebook. You can find my notebook here. Junpeng only implemented the mean-field ADVI case, so I extended his implementation to pm.FullRankADVI(). The core part of the code is listed below:

#PSIS
import numpy as np
import scipy.stats as st
import pymc3 as pm
from pymc3.math import logsumexp
from scipy.special import logsumexp as sp_logsumexp
from pymc3.distributions.dist_math import rho2sd

def log_important_ratio(approx, nsample):
    logp_func = approx.model.logp

    # in ADVI there is only one approximation group
    approx_group = approx.groups[0]
    if approx.short_name == "mean_field":
        # q(theta) is a diagonal Gaussian parameterized by (mu, rho)
        mu_q = approx_group.params[0].eval()
        std_q = rho2sd(approx_group.params[1]).eval()
        logq_func = st.norm(mu_q, std_q)
    elif approx.short_name == "full_rank":
        # q(theta) is a multivariate Gaussian parameterized by
        # (packed Cholesky factor, mu)
        packed_chol_q = approx_group.params[0]
        mu_q = approx_group.params[1].eval()
        dim = mu_q.shape[0]
        chol_q = pm.expand_packed_triangular(dim, packed_chol_q, lower=True).eval()
        cov_q = np.dot(chol_q, chol_q.T)
        logq_func = st.multivariate_normal(mu_q, cov_q)

    dict_to_array = approx_group.bij.map

    p_theta_y = []
    q_theta = []
    samples = approx.sample_dict_fn(nsample)  # type: dict
    points = ({name: records[i] for name, records in samples.items()}
              for i in range(nsample))

    for point in points:
        p_theta_y.append(logp_func(point))
        q_theta.append(np.sum(logq_func.logpdf(dict_to_array(point))))
    p_theta_y = np.asarray(p_theta_y)
    q_theta = np.asarray(q_theta)
    # log importance ratios: log p(theta, y) - log q(theta)
    return p_theta_y - q_theta

However, I found that using pm.fit(..., method='advi') and pm.ADVI() gives me different k values. It is the same case with pm.fit(..., method='fullrank_advi') and pm.FullRankADVI(). The k values are as follows:

nsample = 10000
ratio_advi_mf = log_important_ratio(advi_mf.approx, nsample) # via pm.ADVI()
_, k = pm.stats._psislw(ratio_advi_mf[:, None], 1)
k # array([0.17293144])

ratio_advi_mf2 = log_important_ratio(advi_mf2, nsample) # via pm.fit(..., method='advi')
_, k2 = pm.stats._psislw(ratio_advi_mf2[:, None], 1)
k2 # array([0.42639058])

ratio_advi_fr = log_important_ratio(advi_fr.approx, nsample) # via pm.FullRankADVI()
_, k_fr = pm.stats._psislw(ratio_advi_fr[:, None], 1)
k_fr # array([0.9365361])

ratio_advi_fr2 = log_important_ratio(advi_fr2, nsample) # via pm.fit(..., method='fullrank_advi')
_, k_fr2 = pm.stats._psislw(ratio_advi_fr2[:, None], 1)
k_fr2 # array([1.37102098])
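
For reference, the four approximations above were obtained roughly like this (just a sketch; the iteration count is a placeholder, and model is the GMM from my notebook):

with model:
    # object-based API: the fitted approximation lives on .approx
    advi_mf = pm.ADVI()
    advi_mf.fit(50000)
    advi_fr = pm.FullRankADVI()
    advi_fr.fit(50000)

    # functional API: pm.fit returns the approximation directly
    advi_mf2 = pm.fit(50000, method='advi')
    advi_fr2 = pm.fit(50000, method='fullrank_advi')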

Furthermore, with pm.fit we can set obj_optimizer, but there is no such option in pm.ADVI(). This made me doubt whether pm.fit and pm.ADVI() give the same result, and which one follows the original ADVI algorithm.
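
For example (just a sketch; pm.adam is only a placeholder for whatever optimizer one wants to use):

# obj_optimizer can be passed to pm.fit(...)
approx = pm.fit(50000, method='advi', obj_optimizer=pm.adam(learning_rate=0.01))

# but the pm.ADVI() constructor takes no obj_optimizer argument
advi = pm.ADVI()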

Lastly, could you please check whether my implementation of the FullRankADVI case is correct, since the algorithm is giving me a k value larger than 1?

Many thanks!

I think it is more likely due to the stochasticity in model fitting. PSIS was very sensitive when I was playing around with it. Do you see any difference in the estimation?

The estimation gives me similar results on the toy data sets. But in general, pm.fit() seems to give a larger k value than pm.ADVI() / pm.FullRankADVI() after running it several times.

BTW, sometimes pm.stats._psislw() gives me a negative k value, but isn't the shape parameter of the Pareto distribution always positive?

Hmmm, my reasoning would be that a different optimizer is used, which returns a different estimate from the approximation.
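
One way to check that (a sketch, assuming both code paths accept the same obj_optimizer keyword and that pm.adagrad_window is exposed at the top level as the shared default optimizer):

# pin the optimizer and the number of iterations on both paths, then re-check khat
opt = pm.adagrad_window(learning_rate=1e-2)

with model:
    approx_fit = pm.fit(50000, method='advi', obj_optimizer=opt)
    inference = pm.ADVI()
    approx_obj = inference.fit(50000, obj_optimizer=opt)

nsample = 10000
_, k_fit = pm.stats._psislw(log_important_ratio(approx_fit, nsample)[:, None], 1)
_, k_obj = pm.stats._psislw(log_important_ratio(approx_obj, nsample)[:, None], 1)

If the two k values still differ a lot with the optimizer pinned, then the remaining difference should just be the stochasticity of the fit.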