Help with copula model with gamma marginals

theano

#1

Let’s say I’ve generated a bivariate gamma from a gaussian copula model, such as:

import numpy as np
import pymc3 as pm
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns


sigma=np.array([[1,-.8],[-.8,1]])
alphas=np.array([1,2])
betas=np.array([2,2])
N=1000

u=np.random.multivariate_normal((0,0),sigma,size=N)
c=sp.stats.norm.cdf(u)
generated=sp.stats.gamma.ppf(c,alphas,betas)

joint-plot for reference:

sns.jointplot(generated[:, 0],generated[:, 1], kind='kde', stat_func=None)

image

Now I want to fit a copula model with pymc3. In my understanding this would mean doing something like:

with pm.Model() as model:
    
    chol_packed = pm.LKJCorr('chol_packed',n=2, eta=2)
    chol = pm.expand_packed_triangular(2, chol_packed)
    sigma = pm.Deterministic('sigma', chol.dot(chol.T))
    us_dist = pm.MvNormal.dist(mu=np.zeros(2), chol=chol,shape=2)
    
    alphas=pm.HalfNormal('alphas',1,shape=2)
    betas=pm.HalfNormal('betas',1,shape=2)
    
    ps=gamma_cdf(generated,alphas,betas)
    
    logp=pm.Potential('logp',us_dist.logp(pm.math.probit(ps)).sum())

where gamma_cdf is a theano-functioning gamma cumulative density function implementation which unfortunately there’s not any (both in theano and pymc3). Moreover I have to mention that I need to fit this model in a large dataset with ADVI therefore i need also the grads.

So it all boils down to have the (lower) incomplete gamma function implemented as a theano ops in order to calculate the cdf:
image
This can be done (quite) easily wrapping up the scipy special funciton gammainc but the big issue is around getting the derivatives in. Some reference here; as you can see the partial derivative wrt s involves a Meijer G-function. This is where I got stuck…
Is there any way out?
The nearest thing already implemented in pymc3 is the beta incomplete function that is somehow related (but I haven’t found much reference about this).

Any help would be appreciated.
Thank you


#2

theano have imcomplete gamma function and log-gamma implemented, you can probably write the cdf of gamma distribution in theano directly.

FYI, you might also find some inspiration of inferencing copula in this WIP notebook: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/WIP/vector_transformation_copulas.ipynb


#3

Thank you for the answer, the notebook is very useful! I wish i found that before :smiley:

Could you get me any pointer to the theano implementation of the incomplete gamma function?
The only this that i found is hidden somewere here but to me it seems just c code without gradients and not exposed in python.

Thank you


#4

Do you think it is possible to workaround it using theano.tensor.gamma or theano.tensor.gammaln?


#5

I’m afraid it’s not (at least with my current knowledge of math).
Moreover I was looking at this pull request in which gamma CDF was not implemented due to missing a incomplete gamma function implementation.

There are two possible way to handle this:

  • Wrap mpmath package incomplete gamma/maijer G function implementation in a theano op, but I expect this to be very slow (functions don’t support vectorization/broadcasting natively)
  • Use lognormal distribution instead of gamma

I guess the second option is the most feasible one.


#6

So I did the lognormal model and kinda works. The results are disappointing though.

Data generation:

import numpy as np
import pymc3 as pm
import scipy.special as ss
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import theano.tensor as tt

corr=.8
_sigma=np.array([[1,corr],[corr,1]])
_mus=np.array([-1,2])
_sigmas=np.array([.5,.5])
N=10000

u=np.random.multivariate_normal((0,0),_sigma,size=N)
c=sp.stats.norm.cdf(u)
generated=sp.stats.lognorm.ppf(c,_sigmas,scale=np.exp(_mus))

image

Model:

def lognormal_cdf(values,mu,sigma):
    return pm.math.invprobit((tt.log(values)-mu)/sigma)

with pm.Model() as model:
    
    #hackish way to create cholesky mat. that yields a covariance mat. with 1s on the diagonal and corr anywhere else
    corr = pm.LKJCorr('corr',n=2, eta=3.5)[0]
    chol = tt.set_subtensor(tt.eye(2,2)[([1,1], [0,1])],[corr,tt.sqrt(1-tt.pow(corr,2))]) 

#     r =  pm.Uniform('r',lower=-1, upper=1)
#     cov = pm.Deterministic('cov', 
#                            tt.stacklists([[1., r],
#                                           [r, 1.]]))
    
    mus=pm.Normal('mus',0,1,shape=2)
    sigmas=pm.HalfNormal('sigmas',1,shape=2)
    
    ps=lognormal_cdf(generated,mus,sigmas)
    cs=pm.math.probit(ps)
    
    us_dist = pm.MvNormal('copula',mu=np.zeros(2), chol=chol,observed=cs)
    obs= pm.Lognormal('obs',mu=mus,sd=sigmas,observed=generated)
    
    trace=pm.sample(1000)

Trace:

As you can see correlation is higher than true value (~0.92 vs 0.8), also sigma is higher (~0.85 vs 0.5)

If i simulate the posterior bivariate lognormals I get:

sample=pm.sample_ppc(trace,model=model,vars=[us_dist,mus,sigmas],samples=10000)
sumulated_c=sp.stats.norm.cdf(sample['copula'])
sumulated=sp.stats.lognorm.ppf(sumulated_c,sample['sigmas'],scale=np.exp(sample['mus']))

image

I’ve tried to use LKJCorr with high eta instead of uniform for correlation but results were similar.
Any idea on how to improve results?

@junpenglao : Do you remember why (in your notebook) you choose to input the observed values twice in the models? (i.e. once the transformed values as an observed multivariate normal, once the untrasformed as the observed marginals).
It makes sense and helps a lot the model (I tried without doing this trick and the model is even worse) but i guess the likelihood of the model becomes a “mixture” between a proper copula one and a model that has independent marginals. Am i missing something?


#7

I think what I am doing is observed on the marginal, and also correct for the correlation using pm.Potential, you do need both of them to work, and the marginal does helps a lot of model convergence as you can see in the models with marginal only.

In your code you were not correcting for the Jacobian? you do need it for the estimation to be accurate.


#8

Thank you @junpenglao, with the jacobian adjustment the model fits correctly.

Unfortunately i’ve come across this bug, as I try to sample from the the posterior predictive distribution (which requires to sample from the multivariate (that works) and from the mus and sigmas (that dont)). I’ll post my error on the same issue.

Working model for reference:

import numpy as np
import pymc3 as pm
import scipy.special as ss
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import theano.tensor as tt
from theano import shared

corr=0.6
_sigma=np.array([[1,corr],[corr,1]])
_mus=np.array([-1,2])
_sigmas=np.array([.5,.5])
N=10000

u=np.random.multivariate_normal((0,0),_sigma,size=N)
c=sp.stats.norm.cdf(u)
generated=sp.stats.lognorm.ppf(c,_sigmas,scale=np.exp(_mus))

def lognormal_cdf(values,mu,sigma):
    return pm.math.invprobit((tt.log(values)-mu)/sigma)

def jacobian_det(f_inv_x, x):
    grad = tt.reshape(pm.theanof.gradient(tt.sum(f_inv_x), [x]), x.shape)
    return tt.log(tt.abs_(grad))

with pm.Model() as model:
    
    data=shared(generated)
    
    #hackish way to create cholesky mat. that yields a covariance mat. with 1s on the diagonal and corr anywhere else
    corr = pm.LKJCorr('corr',n=2, eta=2)[0]
    chol = tt.set_subtensor(tt.eye(2,2)[([1,1], [0,1])],[corr,tt.sqrt(1-tt.pow(corr,2))]) 

#     r =  pm.Uniform('r',lower=-1, upper=1)
#     cov = pm.Deterministic('cov', 
#                            tt.stacklists([[1., r],
#                                           [r, 1.]]))
    
    mus=pm.Normal('mus',0,1,shape=2)
    sigmas=pm.HalfCauchy('sigmas',beta=2,shape=2)
    
    ps=lognormal_cdf(data,mus,sigmas)
    cs=pm.math.probit(ps)
    
    us_dist = pm.MvNormal('copula',mu=np.zeros(2), cov=cov,observed=cs)
    pm.Potential('jacob_det', jacobian_det(cs, data))

    trace=pm.sample(1000)
    

When I then try to sample:

sample=pm.sample_posterior_predictive(trace,model=model,vars=[us_dist,mus,sigmas],samples=10000)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-14dbe60ba96f> in <module>()
----> 1 sample=pm.sample_posterior_predictive(trace,model=model,vars=[us_dist,mus,sigmas],samples=10000)
      2 
      3 simulated_c=sp.stats.norm.cdf(sample['copula'])
      4 simulated=np.exp(sp.stats.norm.ppf(simulated_c)*sample['sigmas']+sample['mus'])

~\AppData\Local\Continuum\Anaconda3\envs\newroot\lib\site-packages\pymc3\sampling.py in sample_posterior_predictive(trace, samples, model, vars, size, random_seed, progressbar)
   1118     # draw once to inspect the shape
   1119     var_values = list(zip(varnames,
-> 1120                           draw_values(vars, point=model.test_point, size=size)))
   1121     ppc_trace = defaultdict(list)
   1122     for varname, value in var_values:

~\AppData\Local\Continuum\Anaconda3\envs\newroot\lib\site-packages\pymc3\distributions\distribution.py in draw_values(params, point, size)
    310     while to_eval or missing_inputs:
    311         if to_eval == missing_inputs:
--> 312             raise ValueError('Cannot resolve inputs for {}'.format([str(params[j]) for j in to_eval]))
    313         to_eval = set(missing_inputs)
    314         missing_inputs = set()

ValueError: Cannot resolve inputs for ['copula']

Thank again @junpenglao for the help!