How to add an L1 regularization on the likelihood when using pymc3 to sample with MCMC

When I use pymc3 to do a sparse regression, I would like to add an L1 regularization to the objective function. How can I achieve this in pymc3?

Thanks a lot!

It depends on how you want to do this.

A Bayesian way: if you place a Laplace prior on the parameters (the regression coefficients), you get essentially an L1-regularized regression.
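For example, a minimal sketch (assuming X is your design matrix and y your observed response; the variable names here are only illustrative):

import pymc3 as pm

X = ...  # design matrix, shape (n, p)
y = ...  # observed response, shape (n,)
b_scale = 1.0  # Laplace scale; the L1 penalty strength is lam = 1 / b_scale

with pm.Model() as laplace_model:
    # Laplace(0, b) prior: its log-density contributes -|beta| / b,
    # which is exactly an L1 penalty with lambda = 1 / b
    beta = pm.Laplace('beta', mu=0.0, b=b_scale, shape=X.shape[1])
    sigma = pm.HalfCauchy('sigma', beta=1.0)
    pm.Normal('obs', mu=pm.math.dot(X, beta), sd=sigma, observed=y)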

A non-Bayesian way: add the L1 regularization using pm.Potential

import pymc3 as pm

X = ...  # Your design matrix
y = ...  # Observed response
lam = ...  # constant controlling the sparsity
with pm.Model() as m:
    betas = pm.Flat('betas', shape=X.shape[1])  # <-- don't do this usually; it is only here to replicate plain L1-regularized regression
    # Potentials are added to the model logp, so both terms are negated:
    # maximizing the logp then minimizes SSE + lam * (L1 norm of betas)
    pm.Potential('sum_of_square_error', -pm.math.sum(pm.math.sqr(y - pm.math.dot(X, betas))))
    pm.Potential('l1', -lam * pm.math.sum(pm.math.abs_(betas)))

If you optimize the model m using pm.find_MAP, the result is the sparse (L1-regularized least-squares) regression solution.
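For example, continuing the model m above (a minimal sketch):

with m:
    map_estimate = pm.find_MAP()

sparse_betas = map_estimate['betas']  # point estimate of the coefficients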


Thank you so much! :smile:

I am also very interested in L1 regularization, but I still don't get it. So when you are saying "If you place a Laplace prior on the parameters (regression coefficients)", do you essentially mean something like this?

# L1 regression
# where x is shape (10, m) and y is shape (10, 1).
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as l1_model:
    λ = 1e-10  # is this our penalization control parameter?
    l1 = pm.Laplace("l1", 0, λ)

    β = pm.Normal("beta", l1, 1, shape=(x.shape[1], 1))
    ϵ = pm.Normal("epsilon", 0, 1)
    σ = pm.HalfCauchy("sigma", 1)

    y_hat = tt.dot(x, β) + ϵ
    ar = pm.Normal("ar", y_hat, sd=σ, observed=y)

I am not so sure if this works as expected, since my betas look very similar to the "normal" model:

with pm.Model() as model:
    β = pm.Normal("beta", 0, 100, shape=(x.shape[1], 1))
    ϵ = pm.Normal("epsilon", 0, 1)
    σ = pm.HalfCauchy("sigma", 1)

    y_hat = tt.dot(x, β) + ϵ
    ar = pm.Normal("ar", y_hat, sd=σ, observed=y)

What @junpenglao meant was to put the Laplace prior over beta. Consider the following model:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as l1_model:
    lam = 1e-10  # this is the penalization control parameter

    β = pm.Laplace("beta", 0, 1/lam, shape=(x.shape[1], 1))
    epsilon = pm.Laplace("epsilon", 0, 1/lam)

    y_hat = tt.dot(x, β) + epsilon
    ar = pm.Normal("ar", y_hat, sd=1, observed=y)

The model's logp can be written as

-\frac{1}{2}\sum_i (y_i-\hat{y}_i)^{2} - \lambda \left(\sum_j |\beta_j| + |\epsilon|\right) + \text{const}

This is just like minimizing the sum of squared errors while adding an L1 penalty on the regression parameters. That's why @junpenglao said that if you use find_MAP, you get the maximum likelihood parameters with the L1 regularization added, i.e. the L1-regularized least-squares solution.
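To see where this comes from, write out the two log-densities used in the model (up to additive constants):

\log \mathcal{N}(y_i \mid \hat{y}_i, 1) = -\frac{1}{2}(y_i - \hat{y}_i)^2 + \text{const}

\log \mathrm{Laplace}(\beta_j \mid 0, 1/\lambda) = -\lambda |\beta_j| + \text{const}

Summing the first over the data points and the second over the parameters (β and ϵ) gives the logp above, so maximizing it is the same as minimizing \frac{1}{2}\sum_i (y_i - \hat{y}_i)^2 + \lambda \left(\sum_j |\beta_j| + |\epsilon|\right).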


As a nice side note, if you had placed Gaussian priors instead, you would have gotten something similar to L2 (ridge) regularization. As with the L1 case, you just have to look at the resulting logp that find_MAP would maximize.
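For instance, a minimal sketch of the L2 analogue (assuming the same x and y as in the models above; the choice sd = 1/sqrt(2*lam) is just to make the Normal log-density contribute exactly -lam * beta**2):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

lam = 1.0  # ridge penalty strength

with pm.Model() as l2_model:
    # Normal(0, sd) prior: logp is -beta**2 / (2 * sd**2) + const = -lam * beta**2 + const
    β = pm.Normal("beta", 0, sd=1/np.sqrt(2*lam), shape=(x.shape[1], 1))
    y_hat = tt.dot(x, β)
    pm.Normal("obs", y_hat, sd=1, observed=y)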

Beyond these two, there are many more analogies between Bayesian priors and different forms of regularization. @brandonwillard is much more knowledgeable than me on this subject.