Indexing and interpolating data grids using pymc variables

Firstly, let me thank Chris, Colin and Eric for their help during the Cleveland pycon. Also thanks to Junpeng Lao for this very detailed solution which I have tried to apply to my problem:

I work in astronomy and most of our fittings depend on external data grids. I am trying to update some of my indexing and interpolating operations from pymc2 to pymc3 but I am having troubles with it.

Let’s hope this example helps: let’s say I want to use the ideal gas law (PV=nRT) to measure the specific gas constant (Rs) from some local temperature measurements. Under these conditions the equation can be rearranged to T = P / (D * Rs), where (T, P, D) are the air temperature, pressure and density. Le’ts assume the pressure remains pretty constant and that the local density magnitude can be calculated from a 2 dimensional grid which depends on two variables (x and y).

Looking at this question solution I tried to do a marginalized approach:

import os
os.environ[“MKL_THREADING_LAYER”] = “GNU”
import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm
import matplotlib.pyplot as plt

Constants

R_air_true = 286.986 # J / (kg * K)

#Street measurements
P_true = 101325 # kg * m * s**-2
n_obs = 10
T_obs = np.random.uniform(low=10, high=12, size=n_obs) + 273
T_sigma = 1 # K

Model ingredients data:

axis_length = 10
X_coords = np.linspace(5, 40, axis_length)
Y_coords = np.linspace(10, 20, axis_length)
density_grid = np.linspace(0.5, 2, 100).reshape((axis_length, axis_length))

Converting to theano variables

X_coords_t = theano.shared(X_coords)
Y_coords_t = theano.shared(Y_coords)
density_grid_t = theano.shared(density_grid)

Control the sparsity?

Dirprior = 1./axis_length

Model using street measurements

with pm.Model() as outside_temperature:

# Priors for the air gas constant
R_air = pm.Normal('R_air', mu=300, sd=200)

# Priors for the grid indeces
beta1_x = pm.Gamma('beta1_x', alpha=Dirprior, beta=1., shape=axis_length)
beta2_y = pm.Gamma('beta2_y', alpha=Dirprior, beta=1., shape=axis_length)

# Variables for marginalized vectors
x_idx_vector = pm.Deterministic('x', beta1_x/tt.sum(beta1_x))
y_idx_vector = pm.Deterministic('y', beta2_y/tt.sum(beta2_y))

# Values for X and Y coordinates (not needed in this example)
x = x_idx_vector.dot(X_coords_t)
y = y_idx_vector.dot(Y_coords_t)

# Computing interpolated value from grid
grid_column = density_grid_t.dot(x_idx_vector)
rho_air = grid_column.dot(y_idx_vector)

# Simulated expectation value
T_synth = P_true / (R_air * rho_air)

# Declare the model likelihood:
Y_obs = pm.Normal('Y_obs', mu=T_synth, sd=T_sigma, observed=T_obs)

# Run NUTS sampler
trace = pm.sample(4000, tune=2000)

Display summary

print pm.summary(trace)

Display trace

pm.traceplot(trace)
plt.show()

However, I am not getting a good solution and the sampling seems very slow:

Then tried generate two priors for the x and y variables indexes from which I do three interpolations: 2 for the x and y intervals and one bilinear interpolation for the grid:

import os
import numpy as np
os.environ[“MKL_THREADING_LAYER”] = “GNU”
import theano
import theano.tensor as tt
import pymc3 as pm
import matplotlib.pyplot as plt

Constants

R_air_true = 286.986 # J / (kg * K)

#Street measurements
P_true = 101325 # kg * m * s**-2
n_obs = 10
T_obs = np.random.uniform(low=10, high=12, size=n_obs) + 273
T_sigma = 1 # K

Model ingredients data:

axis_length = 10
X_coords = np.linspace(5, 40, axis_length)
Y_coords = np.linspace(10, 20, axis_length)
density_grid = np.linspace(0.5, 2, 100).reshape((axis_length, axis_length))

Converting to theano variables

X_coords_t = theano.shared(X_coords)
Y_coords_t = theano.shared(Y_coords)
density_grid_t = theano.shared(density_grid)

Model using street measurements

with pm.Model() as outside_temperature:

# Priors for the air gas constant
R_air = pm.Normal('R_air', mu=300, sd=100)

# Priors for the grid indeces
x_idx = pm.Uniform('x_idx0', lower=0, upper=axis_length)
y_idx = pm.Uniform('y_idx0', lower=0, upper=axis_length)

# Use floor operaton to get the closest points in the grid
x_idx1, y_idx1 = tt.cast(pm.math.floor(x_idx), 'int64'), tt.cast(pm.math.floor(y_idx), 'int64')
x_idx2, y_idx2 = x_idx1 + 1, y_idx1 + 1

# Interpolate the values for X and Y coordinates
x_in = X_coords_t[x_idx1] + (x_idx - x_idx1) * (X_coords_t[x_idx2] - X_coords_t[x_idx1])/(x_idx2 - x_idx1)
y_in = Y_coords_t[y_idx1] + (y_idx - y_idx1) * (Y_coords_t[y_idx2] - Y_coords_t[y_idx1])/(y_idx2 - y_idx1)

# Neighbouring points for the grid for X and Y coordinates:
x1, x2 = X_coords_t[x_idx1], X_coords_t[x_idx2]
y1, y2 = Y_coords_t[y_idx1], Y_coords_t[y_idx2]
z11, z12 = density_grid_t[x_idx1, y_idx1], density_grid_t[x_idx1, y_idx2]
z21, z22 = density_grid_t[x_idx2, y_idx1], density_grid_t[x_idx2, y_idx2]

#Bilinear interpolation of the grid
rho_air = (z11 * (x2 - x_in) * (y2 - y_in) +
        z21 * (x_in - x1) * (y2 - y_in) +
        z12 * (x2 - x_in) * (y_in - y1) +
        z22 * (x_in - x1) * (y_in - y1)) / ((x2 - x1) * (y2 - y1))

# Simulated expectation value
T_synth = P_true / (R_air * rho_air)

# Declare the model likelihood:
Y_obs = pm.Normal('Y_obs', mu=T_synth, sd=T_sigma, observed=T_obs)

# Run NUTS sampler
trace = pm.sample(4000, tune=2000)

Display summary

print pm.summary(trace)

Display trace

pm.traceplot(trace)
plt.show()

However, this code crashes very fast with this error:

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
File “/home/vital/PycharmProjects/thesis_pipeline/spectrum_fitting/pymc3_testing.py”, line 211, in
y1, y2 = Y_coords_t[y_idx1], Y_coords_t[y_idx2]

HINT: Use the Theano flag ‘exception_verbosity=high’ for a debugprint and storage map footprint of this apply node.

I wonder if anyone would please give me any advice on the right way to explore external data grids in a pymc3 model so I can get started in the right track. Also, I have read one of the Junpeng Lao solutions advising to use the Interpolated distribution. I have not found any example of this distribution: How could it be applied here?

Thanks

Hmm, I dont think the indexing model you are doing here is the same as in Indexing using a free random variable

Your interpolation of the data grid model is more similar to a Gaussian process model. Maybe it would be more clear if you can write down your indexing model direct?

Hi Junpeng Lao! Thank you very much for your reply.

The little I understand about gaussian processes is like they are like a generalization of a lookup table but without indexing, you just through random values and get some values back… that sounds a bit like what I am doing but I thought gaussian processes are only for cases in which you have lots of data… In these models I am considering model the number of inputs is of the same order or smaller than the length of the axes grid you want to interpolate from… So I am not sure if it can work.

Writting my own indexing is something I have been trying to do. Using theano functions I can do something like this to get the index of the current X free variable value:

tt.argmin(tt.abs(X_coords - x)

In pymc3 I did not find the argmin function. Do you think these operations could be replaced with any of the logic functions available in pymc3?

Thanks again.