Cox Process in PyMC3

I have some D-dimensional count data that I’m trying to model with a Cox process in PyMC3. The observed counts are Poisson distributed, with the rate equal to an exponentiated GP. For D=2, my data looks like:

x_data = np.array([[135, 145], [300, 286],...,[311, 312]])
x_data.shape is (281, 2)

I’m defining the model in PyMC3 as follows:

N, D = x_data.shape
with pm.Model() as model:
    cov_func = pm.gp.cov.ExpQuad(D, ls=0.1*np.ones(D))
    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior('f', X=x_data)

    obs = pm.Poisson('obs', mu=np.exp(f), observed=x_data)

Notice that x_data is D=2 dimensional; however, when I check the output dimension of f = gp.prior, I get:

f.tag.test_value.shape is (281,)

As a result, I get an error at the following line:

---> obs = pm.Poisson('obs', mu=np.exp(f), observed=x_data)
ValueError: Input dimension mis-match. (input[0].shape[1]=2, input[1].shape[1]=281)

I would expect the GP output f to have dimensions (281, 2), the same as x_data. Does anyone have an idea how to fix the dimension mismatch?

I might be misunderstanding something, so let me know if I’m off somewhere. You compute the covariance function over x_data and also use x_data as your observed counts. This doesn’t make sense. The locations that the rate mu is a function of should be passed as X in gp.prior, and the observed counts at each of those X locations should be given to observed= in pm.Poisson.
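To make that concrete, here is a rough sketch of the structure I mean (X_locations and y_counts are placeholder names with made-up values, standing in for the coordinates and the counts observed at those coordinates):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# placeholder data: N locations in 2D and the count observed at each one
X_locations = np.random.uniform(0, 400, size=(281, 2))
y_counts = np.random.poisson(3.0, size=281)

N, D = X_locations.shape
with pm.Model() as model:
    cov_func = pm.gp.cov.ExpQuad(D, ls=0.1 * np.ones(D))
    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior('f', X=X_locations)   # one latent value per location, shape (N,)

    obs = pm.Poisson('obs', mu=tt.exp(f), observed=y_counts)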

I was trying to convert the Cox process example from Edward to PyMC3. The original example can be found at: https://github.com/blei-lab/edward/blob/master/examples/cox_process.py

In the inference step, x_data is fed in both as the input to the covariance matrix and as the observed Poisson counts:

inference = ed.KLqp({f: qf}, data={x: x_data, x_ph: x_data})

where x_ph is an input to RBF and x are observations. I tried the example in Edward and the posterior predictive check generates data from the posterior qf very close to the original x_data. However, I’m not sure if giving x_data as an input to gp.prior is the right thing to do.

I can’t say I understand their example… I see they have a somewhat related paper, so I’d like to look into it further. The values in x_data look like counts to me, but I don’t see where the location information comes in. The GP code in PyMC3 would expect something like:

X: 2D coordinates of each shot, e.g. a stack of (x, y) pairs.
y: number of shots made at each coordinate.

So if X had dimension (100, 2), y would have dimension (100,). You could additionally index on player if you had multiple players’ shots.
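For instance (made-up numbers, just to show the layout the GP code would expect):

import numpy as np

# hypothetical shot data: court coordinates and shots made at each location
X = np.array([[2.5, 1.0],
              [0.3, 4.2],
              [3.1, 3.1]])          # shape (n, 2): (x, y) coordinates
y = np.array([5, 2, 7])             # shape (n,): shots made at each coordinate

# with several players, an extra index column could be appended to X
player = np.array([0, 0, 1])[:, None]
X_multi = np.hstack((X, player))    # shape (n, 3)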

We also have a Cox process model in the docs: https://github.com/pymc-devs/pymc3/blob/master/docs/source/notebooks/cox_model.ipynb
(warning: this notebook is sparsely documented…)

Thank you for your comments. I realized two things: (1) a GP models a scalar-valued function, so the output shape of gp.prior is correct, and (2) my dataset is not well suited for use with a Cox process.

I’m using the basketball shot data from Kaggle: https://www.kaggle.com/dansbecker/nba-shot-logs/
which does not have the coordinates of where the shots were taken. I’m going to find a new toy dataset to play with and update this thread. Regarding vector-valued GPs, I posted a question on Stack Overflow.

GPs can certainly be used to model vector outputs! For a lot of detail, check out this review.

The simplest way to model a vector output with a GP is to include an additional predictor that indexes the output.

Here’s some code to generate one draw from a GP whose output is two vectors. Each is a function of the first column of X, and the two are correlated with each other.

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

np.random.seed(20)

# 50 input points, repeated once per output
t = np.linspace(0, 10, 50)[:, None]
output_1 = np.ones((50, 1))        # index column marking output 1
output_2 = 2 * np.ones((50, 1))    # index column marking output 2
X = np.hstack((np.vstack((t, t)), np.vstack((output_1, output_2))))

l1 = 2.0; l2 = 1.0
# l1 is the lengthscale of variation over t, l2 controls the similarity between output 1 and output 2
cov = pm.gp.cov.ExpQuad(2, ls=[l1, l2])
K = cov(X).eval()

# draw 1 sample from the zero-mean GP prior
y = pm.MvNormal.dist(mu=np.zeros(100), cov=K).random(size=1).flatten()

plt.plot(t, y[:50], 'r', label="output 1");
plt.plot(t, y[50:], 'b', label="output 2");
plt.legend();

[figure: one draw from the GP prior, showing the two correlated outputs]

I got the Cox process to work on a new dataset of California redwoods, available here (extracted from the R spatstat package). The dataset contains the locations of redwood trees in a square sampling region. I constructed the training data by discretizing the region and creating two arrays: y_data, which contains the tree counts per bin, and x_data, which contains the (x, y) coordinates of the 2D bin centers. The experimental results and a brief description of the Cox process are summarized in the following IPython notebook. The complete code is listed below:

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import pymc3 as pm

import theano
import theano.tensor as tt

#load data
print("loading data...")
train_df = pd.read_csv('./data/redwoods.csv')

plt.figure()
plt.scatter(train_df['redwoodfull.x'], train_df['redwoodfull.y'])
plt.title('California Redwoods')
plt.xlabel('X1')
plt.ylabel('X2')
plt.savefig('./figures/redwoods_dataset.png')

#discretize spatial data
D = 2 #dimension
num_bins = 64
hist, xedges, yedges = np.histogram2d(train_df['redwoodfull.x'].values, train_df['redwoodfull.y'].values, bins=num_bins)
xcenters = xedges[:-1] + 0.5 * (xedges[1:] - xedges[:-1])
ycenters = yedges[:-1] + 0.5 * (yedges[1:] - yedges[:-1])

plt.figure()
plt.imshow(hist, cmap='hot')
plt.savefig('./figures/redwoods_hist.png')

#form input arrays
y_data = np.zeros(num_bins * num_bins) #observations (counts per bin)
x_data = np.zeros((num_bins * num_bins, 2)) #locations (bin centers)

cnt = 0
for i in range(num_bins):
    for j in range(num_bins):
        y_data[cnt] = hist[i][j]
        x_data[cnt, :] = np.hstack((xcenters[i], ycenters[j]))
        cnt += 1
    #end for
#end for

print("creating the model...")
with pm.Model() as model:

    #hyper-parameter priors
    #l = pm.Gamma('l', alpha=2, beta=2)
    #eta = pm.HalfCauchy('eta', beta=3)
    #cov_func = eta**2 * pm.gp.cov.ExpQuad(D, ls=l*np.ones(D))

    #Gaussian Process
    cov_func = pm.gp.cov.ExpQuad(D, ls=0.1*np.ones(D))
    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior('f', X=x_data)

    obs = pm.Poisson('obs', mu=tt.exp(f), observed=y_data)

print("model inference...")
with model:
    trace = pm.sample(1000) #burn-in and thinning are applied manually below

#generate plots
plt.figure()
pm.traceplot(trace, varnames=['f'])
plt.savefig('./figures/redwoods_trace.png')

ftrace = np.mean(trace['f'][500::2, :], axis=0) #discard the first 500 draws and thin by 2
ftrace = np.reshape(ftrace, (num_bins, num_bins))
latent_rate = np.exp(ftrace)

plt.figure()
plt.imshow(latent_rate)
plt.savefig('./figures/redwoods_latent_rate.png')

Here are the two plots of the tree histogram (on the left) and the inferred latent rate function (on the right):

We can see that the latent rate heatmap matches the histogram: it identifies the regions of high tree density along the diagonal, as well as the patches of trees above and below the diagonal. This could be useful, for example, in analyzing the land features that give rise to high tree growth.

I tried including priors on the GP hyper-parameters, but sampling took much longer to complete, so I left them commented out. I welcome any comments or suggestions for improvement. I feel like this is a neat application of GPs, and perhaps the material in the IPython notebook could be added to the GP examples. On the other hand, it may not be complex enough :slight_smile:


This looks pretty nice!

But 100%|██████████| 1500/1500 [64:46:58<00:00, 155.48s/it] wow! Does it really take this long?!
I would be interested to see whether it could be sped up using:
1. A more informative prior
2. A sparse approximation
3. ADVI, to see how the result compares (a rough sketch is below)
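For point 3, a minimal sketch of what I mean (it reuses the model from the code above; the number of iterations and draws are arbitrary):

with model:
    # fit a mean-field variational approximation instead of running NUTS
    approx = pm.fit(n=30000, method='advi')
    trace_advi = approx.sample(1000)   # draws from the fitted approximation

ftrace_advi = np.mean(trace_advi['f'], axis=0)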


Also, that’s a really nice project you are working on - are you planning to port all the pymc2 code to pymc3?

That looks really great! It definitely seems to do a good job. I for one would like to see this added to the GP examples notebook. Way too slow, though! I saw the same speed issue when I ran your notebook.

You may have noticed the work being done by @jordan-melendez over in another thread. One way to get a speed boost here is to use the MatrixNormal model he’s working on. Instead of modeling the full covariance matrix (which is 64^2 x 64^2!), you could use a separate covariance over the x and y dimensions, i.e. one 64 x 64 covariance matrix per dimension. This would be much, much faster. What I’m thinking of is described in this paper, in particular Sec. 4.7.
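To illustrate the idea, here is a rough sketch (not @jordan-melendez’s implementation; it assumes pm.MatrixNormal is available in your PyMC3 version and reuses num_bins, xcenters, ycenters, and hist from the code above):

with pm.Model() as kron_model:
    # one small covariance matrix per grid axis instead of one 4096 x 4096 matrix
    cov_x = pm.gp.cov.ExpQuad(1, ls=0.1)
    cov_y = pm.gp.cov.ExpQuad(1, ls=0.1)
    Kx = cov_x(xcenters[:, None]) + 1e-6 * tt.eye(num_bins)
    Ky = cov_y(ycenters[:, None]) + 1e-6 * tt.eye(num_bins)

    # latent log-rate over the 64 x 64 grid with Kronecker-structured covariance
    f = pm.MatrixNormal('f', mu=np.zeros((num_bins, num_bins)),
                        rowcov=Kx, colcov=Ky,
                        shape=(num_bins, num_bins))
    obs = pm.Poisson('obs', mu=tt.exp(f), observed=hist)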


I also experimented with a 2D GP on a similar smoothing problem a few months back; here is the notebook: https://github.com/junpenglao/Bayesian_Smoothing_EyeMovement/blob/master/Fixation_Mapping.ipynb
Maybe you will also find it useful (it has the bonus of comparing GPy, GPflow, and pm.gp).

In general, binning is really not ideal: the resolution of the grid is arbitrary, and even a slightly higher resolution makes the complexity of the problem increase dramatically (the covariance matrix alone grows as O(n^2)). One solution I tried is to sample some zeros from the space (e.g., double the data input, with zeros for y_data and the corresponding x_data drawn randomly from the empty space).
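If I understand the suggestion correctly, it looks roughly like this (a sketch only; it uses the raw tree coordinates from train_df, draws the zero locations uniformly for simplicity, and the proportions and coordinate range are arbitrary):

tree_xy = train_df[['redwoodfull.x', 'redwoodfull.y']].values
n = tree_xy.shape[0]

# pseudo-observations: random locations with zero counts (ideally drawn only
# from regions with no trees; uniform sampling is used here for simplicity)
zeros_xy = np.random.uniform(0, 1, size=(n, 2))

x_aug = np.vstack((tree_xy, zeros_xy))               # locations
y_aug = np.concatenate((np.ones(n), np.zeros(n)))    # counts: 1 at tree locations, 0 elsewhere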

Also, in higher-dimensional problems it is crucial to use informative priors. This is something I mention in the notebook, and it also comes up in Michael Betancourt’s recent case study: https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html
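For instance, something along these lines for the redwoods model (a sketch only; the particular distributions and values are placeholders, not recommendations):

with pm.Model() as model_informative:
    # lengthscale prior that keeps mass away from 0 and from very large values
    ls = pm.InverseGamma('ls', alpha=5, beta=5)
    eta = pm.HalfNormal('eta', sd=2)
    cov_func = eta**2 * pm.gp.cov.ExpQuad(2, ls=ls)
    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior('f', X=x_data)
    obs = pm.Poisson('obs', mu=tt.exp(f), observed=y_data)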


I’m working on the same log Gaussian Cox process and I’m having trouble figuring out how to get estimates of the GP posterior mean evaluated at new coordinates. In my case, I want to use a smaller number of bins so that the dimension of the GP is not so high during fitting, but I want to be able to quickly predict at new points. Ideally, I would use gp.predict(Xnew, diag=True) but this isn’t available for the Latent class. Any ideas here?

Can you use gp.conditional?

I had thought that doing so would be too computationally expensive because I’d like to predict at m new points after fitting at n observations; my thinking was that this was going to be O(m^3). I’ve been looking at the docs recently and it looks like the distribution from gp.conditional actually doesn’t require inverting any extra matrices beyond the ones that are used to fit the GP to begin with. Does that sound right?

Yep, that’s right. Predicting at new points is memory intensive, but not as computationally intensive. You can reuse the n x n matrix from fitting, but you need to compute the n x m matrix between the new points and the observed ones. Is m much bigger than n?
Also, you define your GP, then sample, then add the conditional and sample from it with pm.sample_posterior_predictive, so you don’t need to refit to get predictions at new X points.
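In code, the workflow described above looks roughly like this (a sketch, assuming the redwoods model, gp, and trace from earlier in the thread and hypothetical new locations x_new; the keyword for selecting variables may differ across PyMC3 versions):

x_new = np.random.uniform(0, 1, size=(500, 2))   # hypothetical prediction locations

with model:
    f_new = gp.conditional('f_new', Xnew=x_new)
    pred = pm.sample_posterior_predictive(trace, var_names=['f_new'], samples=200)

rate_new = np.exp(pred['f_new']).mean(axis=0)    # posterior mean rate at the new points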


Great, that’s good to know. m is indeed quite large, but I’m not as worried about memory usage for this project. Thanks for the advice!
