I got the Cox Process to work on a new dataset of California redwoods available here (which was extracted from R spatstat package). The dataset contains the locations of redwood trees in a square sampling region. I constructed the dataset by discretizing the region and creating two training sets: y_data that contains the tree counts and x_data that contains (x, y) coordinates of the 2D-bin centers. The experimental results and a brief description of the Cox process are summarized in the following ipython notebook. The complete code is listed below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pymc3 as pm
import theano
import theano.tensor as tt
#load data
print “loading data…”
train_df = pd.read_csv(‘./data/redwoods.csv’)
plt.figure()
plt.scatter(train_df[‘redwoodfull.x’], train_df[‘redwoodfull.y’])
plt.title(‘California Redwoods’)
plt.xlabel(‘X1’)
plt.ylabel(‘X2’)
plt.savefig(‘./figures/redwoods_dataset.png’)
#discretize spatial data
D = 2 #dimension
num_bins = 64
hist, xedges, yedges = np.histogram2d(train_df[‘redwoodfull.x’].values, train_df[‘redwoodfull.y’].values, bins=num_bins)
xcenters = xedges[:-1] + 0.5 * (xedges[1:] - xedges[:-1])
ycenters = yedges[:-1] + 0.5 * (yedges[1:] - yedges[:-1])
plt.figure()
plt.imshow(hist, cmap=‘hot’)
plt.savefig(‘./figures/redwoods_hist.png’)
#form input arrays
y_data = np.zeros(num_bins * num_bins) #observations (counts)
x_data = np.zeros((num_bins * num_bins, 2)) #locations
cnt = 0
for i in range(num_bins):
for j in range(num_bins):
y_data[cnt] = hist[i][j]
x_data[cnt,:] = np.hstack((xcenters[i], ycenters[j]))
cnt += 1
#end for
#end for
print “creating the model…”
with pm.Model() as model:
#hyper-parameter priors
#l = pm.Gamma('l', alpha=2, beta=2)
#eta = pm.HalfCauchy('eta', beta=3)
#cov_func = eta**2 * pm.gp.cov.ExpQuad(D, ls=l*np.ones(D))
#Gaussian Process
cov_func = pm.gp.cov.ExpQuad(D, ls=0.1*np.ones(D))
gp = pm.gp.Latent(cov_func=cov_func)
f = gp.prior('f', X=x_data)
obs = pm.Poisson('obs', mu=tt.exp(f), observed=y_data)
print “model inference…”
with model:
trace = pm.sample(1000, burn=500, thin=2)
#generate plots
plt.figure()
pm.traceplot(trace, varnames=[‘f’])
plt.savefig(‘./figures/redwoods_trace.png’)
ftrace = np.mean(trace[‘f’][500::2,:], axis=0)
ftrace = np.reshape(ftrace, (num_bins, num_bins))
latent_rate = np.exp(ftrace)
plt.figure()
plt.imshow(latent_rate)
plt.savefig(‘./figures/redwoods_latent_rate.png’)
Here are the two plots of the tree histogram (on the left) and the inferred latent rate function (on the right):
We can see that the latent rate heatmap matches the histogram: it identifies regions of high tree density along the diagonal as well as patches of trees above and below the diagonal. This can be useful in for example analyzing land features that give rise to high tree growth.
I tried including priors on GP hyper-parameters, however, the sampling would take longer to complete, so I left it in comments. I welcome any comments or suggestions for improvement. I feel like this is a neat application of GPs and perhaps the material in the ipython notebook could be added as part of GP examples. On the other hand, it may not be complex enough