Additive Gaussian Process Priors

nickresnick · July 26, 2018, 5:04pm

I’m trying to apply a similar model to the one defined in section 2.2 of this paper. Here is the stan code for the model, if that helps.

My data has the following columns: user_id, t, l, y, total_revenue. Each row is a daily discretized observation window. t is calendar time of that day, encoded as days-since-some-start-date. l is the cumulative day count since the first observation for that user (i.e. lifetime). y is whether the user made a purchase on that day. and total_revenue is the cumulative revenue observed for that user up through that day.

The data has 10,000 users and 57,89,283 rows.

Here’s my model:

with pm.Model():
    # GP kernel hyperpriors for calendar time effects
    etasq_short = pm.Normal("etasq_short", 0, 5)
    etasq_long = pm.Normal("etasq_long", 0, 5)
    rhosq_cal = pm.Normal("rhosq_cal", T/2, T)
    
    # hyperprior for user lifetime effect
    etasq_life = pm.Normal("etasq_life", 0, 5)
    rhosq_life = pm.Normal("rhosq_life", T/2, T)
    
    # hyperprior for total cumulative revenue effect
    etasq_rev = pm.Normal("etasq_rev", 0, 5)
    rhosq_rev = pm.Normal("rhosq_rev", 20, 10)
    
    # mean function hyperprior for long-term calendar effect
    mu = pm.Normal("mu", 0, 5)
    
    # linear mean function hyperpriors for lifetime and total revenue effects 
    lambda_life1 = pm.Normal("lambda_life1", 0, 5)
    lambda_rev1 = pm.Normal("lambda_rev1", 0, 5)
    
    # Individual user intercepts
    sigma_sq = pm.Normal("sigma_sq", 0, 2.5)
    delta = pm.Normal("delta", 0, sd=sigma_sq, shape=N)
    
    # Unidemensional GPPs
    alpha_short = pm.gp.Latent(cov_func=etasq_short**2 * gp.cov.ExpQuad(1, ls=rhosq_cal))
    alpha_long = pm.gp.Latent(pm.gp.mean.Constant(mu), cov_func=etasq_long**2 * pm.gp.cov.ExpQuad(1, ls=rhosq_cal))
    alpha_life = pm.gp.Latent(pm.gp.mean.Linear(coeffs = lambda_life1), cov_func=etasq_life**2 * pm.gp.cov.ExpQuad(1, ls=rhosq_life))
    alpha_rev = pm.gp.Latent(pm.gp.mean.Linear(coeffs = lambda_rev1), cov_func=etasq_rev**2 * pm.gp.cov.ExpQuad(1, ls=rhosq_rev))
    
    # Additive GPP
    alpha = alpha_short + alpha_long + alpha_life + alpha_rev
    
    # Latent function
    f = alpha.prior("f", X=gpp_data[["t", "t", "l", "total_revenue"]].as_matrix())
    
    # Latent function + individual intercept
    y = f + delta[user_id]
    
    # Logistic likelihood
    likelihood = pm.Bernoulli("likelihood", pm.invlogit(y), observed = gpp_data["y"].values)
    
    trace = pm.sample(2000)

This produces an error:

ValueError: Input dimension mis-match. (input[0].shape[1] = 5789283, input[1].shape[1] = 4)

when trying to add the mean functions together. When I remove all the mean functions (i.e. assume they’re all zero), I get a MemoryError.

So 1. Curious what I’m doing wrong re: mean functions, and 2. Curious if I’m supplying incorrect, or unnecessarily large, data to the model, because 57mm rows for 10k users seems a little ridiculous. Answering the second question requires reading up on the model, so I understand if that’s off-topic for this discourse.

bwengals · July 26, 2018, 5:45pm

n is not really 6 million right? You’ll need to repeatedly construct and invert an n \times n matrix, which would be rough… How big is the data set used in the paper? I think they must have subsampled it somehow. If you can use a data set where n is in the thousands, you should be alright. There is a pull request here (https://github.com/pymc-devs/pymc3/pull/2951) which is an approximation of Latent for larger data sets, but it won’t get you to n=6 million. If n is 6 million, your best bet is to check GPFlow, specifically their SVGP class, which uses a variational approximation, inducing points, and can optimize the hyperparameters using stochastic gradient descent. Depending on the complexity of the dataset, the model may suffer from how a loss of flexibility due to the approximation.

When using the additive functionality, you need to make sure that X is the same for all the mean functions and covariance functions. This is the purpose of the input_dim and active_dims arguments. Say your model is a sum of two GPs, each with different input variables, say x1 and x2. You’ll need to form x1 and x2 into a single n \times 2 matrix X, and do

cov1 = pm.gp.cov.ExpQuad(2, ..., active_dims=[0])  # input_dim=2, active_dims=[0], the first column of X
cov2 = pm.gp.cov.ExpQuad(2, ..., active_dims=[1])  # input_dim=2, active_dims=[1], the second column of X

This is likely the source of the input dimension mismatch error.

Also, your eta and rho parameters need to be non-negative for these covariance functions to be valid. I’d recommend Gamma or HalfNormal distributions. A lot of other GP implementations take as input log(\eta) or \log(\rho), so normal priors are OK. PyMC3 covariance functions take in \eta and \rho directly.

Some other stuff I’m noticing:

You probably wan’t to do with pm.Model() as model: so you can do predictions later.
Do you mean to have t repeated here? X=gpp_data[["t", "t", "l", "total_revenue"]]

nickresnick · July 27, 2018, 7:08pm

Hi Bill, thank you for your help!

Are there input_dum and active_dims arguments for mean functions as well? From the source code it doesn’t look like it. The two Linear mean functions are the problem-- it looks like they expect coeffs to be of length 4, since X has 4 columns, but of course I only want it to act on one column of the matrix.

Regarding your questions about N: If you look at the stan code (lines 115 to 127), it looks like the covariance matrices themselves only need to be UxU where U is the unique number of inputs for the given GPP. For example, the covariance matrix for the GPPs that depend on calendar time t would be TxT, where T is the number of periods observed in the data. In my case, that’s around 1950. It’s the bernoulli likelihood that needs to be computed for N data points.

So I’m thinking the fix is to define separate priors for each GP and specify only the data that that GPP is concerned with, and then add the priors together? i.e.

f_short = alpha_short.prior("f_short", X=gpp_data[["t"]].drop_duplicates().as_matrix())
f_long = alpha_long.prior("f_long", X=gpp_data[["t"]].drop_duplicates().as_matrix())
f_life = alpha_life.prior("f_life", X=gpp_data[["l"]].drop_duplicates().as_matrix())
f_rev = alpha_life.prior("f_rev", X=gpp_data[["total_revenue"]].drop_duplicates().as_matrix())
    
# likelihood theta: latent functions + individual intercept
theta = f_short[t] + f_long[t] + f_life[l] + f_rev[total_rev] + delta[user_id]

Where t, l, and total_rev are the actual values from the data. In stan, the likelihood theta is:

// gppm likelihood
for(m in 1:M)
  theta[m] <- alpha_long[t[m]]+alpha_short[t[m]]+alpha_life[l[m]]+alpha_rev[total_rev[m]]+delta[id[m]];

where their M is our N.

Getting an error now: IndexError: index 575 is out of bounds for size 455 implying I’m not doing the right thing indexing into the latent functions, but haven’t quite figured that out yet.

Regarding your other comments:

Good call on eta and rho, thanks!
Yes, should use as model, was just trying to get the sampling working
I do actually want two "t"s. I think this is more apparent in the code snippet in this post. Both the short term and long term GPPs act on calendar time, i.e. t.

nickresnick · July 27, 2018, 8:37pm

Yeah, this is a bit of a sidebar, but I guess I don’t totally understand how indexing works in pymc3.

For example, if you have string user ids and want per-user intercepts, most online resources say to do intercept[user_id] where I thought user_id is the list of user ids in the data, but it looks like instead user_id should be a set of integer indices? It would be easy enough to create monotonically increasing ids, but this is harder for something like total_revenue where the distance between two values matters.

bwengals · July 30, 2018, 3:55am

No… maybe there should be though. To get around it for Linear, you can make a single X with all variables in the columns. Then, zero out the coefficients for predictors that aren’t part of different mean functions. Another way is to define two custom mean functions that take another argument (which is basically equivalent to active_dims) to slice the inputs during __call__.

Ahhh ok I think I see now. Sorry I totally missed that at first. How the GP lib is set up now, when you add two gp objects, internally each of their covariance matrices is added before the Cholesky decomp, which breaks when they aren’t the same size, as in your case.

The way you’re doing your fix is the right idea, and is equivalent to what is done in the stan code. It’s hard to see where your indexing is failing without being able to inspect t, l, total_rev etc. The indices should be integers. PyMC3 indexing uses Theano entirely, so the Theano docs will have the info you need for the indexing rules. Stan indexes using integers too though? If their total_rev weren’t an integer I think the stan code should fail too right?

If you’re still stuck, feel free to post some fake data / notebook example

Topic		Replies	Views
Hierarchical Gaussian Process Questions	1	639	February 19, 2021
Understanding intuition behind the gaussian process regression in pymc blogpost "moddeling changes marketing effectiveness over time" version agnostic	3	590	October 13, 2022
Vectorized autoregressive model Questions	5	2309	January 2, 2020
General approach to convergence issues version agnostic gaussian_process , modeling	2	559	June 16, 2023
Avoiding looping when using GP prior on latent variables version agnostic gaussian_process	8	594	March 30, 2022

Additive Gaussian Process Priors

Related topics