2d Gaussian process IndexError

This is an example similar to an example for applying 2d gaussian process by @fonnesbeck. X_.shape=(10970, 2) and Y_.shape = (10970,). After running it I get this error at the line related to posterior predictive:

if samples.shape[0] == 1 and size == 1:

IndexError: tuple index out of range

Wondering if you could help me to resolve it?thanks!

#%%
X_shared=th.shared(X_)
Y_shared=th.shared(Y_)
#%%
nd = 30
z1, z2 = np.meshgrid(np.linspace(X_[:,0].min(), X_[:,0].max(), nd), np.linspace(X_[:,1].min(), X_[:,1].max(), nd))
Z = np.concatenate([z1.reshape(nd*nd, 1), z2.reshape(nd*nd, 1)], 1)

nd = 15
xu1, xu2 = np.meshgrid(np.linspace(X_[:,0].min(), X_[:,0].max(), nd), np.linspace(X_[:,1].min(), X_[:,1].max(), nd))
Xu = np.concatenate([xu1.reshape(nd*nd, 1), xu2.reshape(nd*nd, 1)], 1)
#%%
with pm.Model() as spatial_model:
    
    l = pm.HalfCauchy("l", beta=3, shape=(2,))
    sf2 = pm.HalfCauchy("sf2", beta=3)
    sn2 = pm.HalfCauchy("sn2", beta=3)

#    K = pm.gp.cov.ExpQuad(2, l) * sf2**2
    K = pm.gp.cov.Matern32(2,l) * sf2**2
    
    gp_spatial = pm.gp.MarginalSparse(cov_func=K, approx="FITC")
    obs = gp_spatial.marginal_likelihood("obs", X=X_shared, Xu=Xu, y=Y_shared, noise=sn2)

    trace = pm.sample(cores=1)
#%%

with spatial_model:
    f_pred = gp_spatial.conditional('f_pred2', Z)
    samples = pm.sample_posterior_predictive([trace], vars=[f_pred], samples=100)
#%%

To add more to this question here is what I get after running the code:
1_pic

You would really have to post where your data comes from so anyone that wants to help could actually run your code.

@Gon_FThank you for the message, but the problem should not be the data, I explained the data shape in my explanation. To explain more, X_ , input is longitude and altitude, and output is a measurement for each of them. You can assume a grid that y is 2X_[:,0]+3X_[:,1]+noise. If it really matters I would be happy to share the actual data as well.

I just mentioned it because it gives me peace of mind to know the error is not coming from the data : P , and it allows me to work along with the code and run it, instead of just abstractly analyzing it.

1 Like

data.csv (1.5 MB)
@Gon_F here is the data, there are some missing values and I used (pandas dropna) to discard them.
Thanks!

I am wondering if the code looks good? or there is an error in it that I couldn’t fine? What is the reason for the error?

The model does run on my system, using current master (macOS and Python 3.7.3). Can you try that (or the release candidate that was just pushed)?

Hey,
Sorry I missed your message, and thanks so much for your attention; after seeing it I updated my pymc3. For some reason I couldn’t use the developer version using git clone. So I have just updated it to PyMC3 version v.3.7 from v.3.6.
The code now works, but I still have two problems after this update:
1- this warning still is there, and I do not know how to solve it:
`

C:\Users\xxxxx\AppData\Local\Continuum\anaconda3\envs\pymc3env\lib\site-packages\theano\tensor\basic.py:6611: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.

`
2-

File “C:\Users\xxxxx\AppData\Local\Continuum\anaconda3\envs\pymc3env\lib\site-packages\pymc3\plots_init_.py”, line 22, in call
“ArviZ is not installed. In order to use {0.attr}:\npip install arviz”.format(self)

ImportError: ArviZ is not installed. In order to use plot_forest:

This Arviz error is really annoying! Last time when this happened I had to re install whole Anaconda. Even when I install it, and installation is done successfully, it keep giving me this error.

So, using the following code on Pymc3 3.7, everything works out without any errors if I run the code on a subset of the first 400 points in the dataset.

Apologies for taking so long to reply: I kept trying to run the whole model but even on my fast computer, it kept taking 20+ hrs., and my python session would quit before that based on some settings I could not change.

If the code below also works out for you, I imagine the sheer size of your dataset stretches the GP methods to their limits, and whatever breaks first is the error you end up getting.

Data import + pre-preprocessing:

data_discourse = pd.read_csv('data_discourse.csv')

np.shape(data_discourse)

data_discourse.head(10)

data_discourse.dropna(inplace=True)

data_discourse.reset_index(inplace=True, drop=True)

np.shape(data_discourse)

data_discourse.head(10)

X_ = data_discourse.iloc[:, :2].values
Y_ = data_discourse.iloc[:, 2].values

X_[:5]
Y_[:5]

Data preprocessing:

import theano as th
import numpy as np

#%%
X_shared=th.shared(X_)
Y_shared=th.shared(Y_)

nd = 30
z1, z2 = np.meshgrid(np.linspace(X_[:,0].min(), X_[:,0].max(), nd), np.linspace(X_[:,1].min(), X_[:,1].max(), nd))
Z = np.concatenate([z1.reshape(nd*nd, 1), z2.reshape(nd*nd, 1)], 1)

nd = 15
xu1, xu2 = np.meshgrid(np.linspace(X_[:,0].min(), X_[:,0].max(), nd), np.linspace(X_[:,1].min(), X_[:,1].max(), nd))
Xu = np.concatenate([xu1.reshape(nd*nd, 1), xu2.reshape(nd*nd, 1)], 1)

Model:

with pm.Model() as spatial_model:
    
    l = pm.HalfCauchy("l", beta=3, shape=(2,))
    sf2 = pm.HalfCauchy("sf2", beta=3)
    sn2 = pm.HalfCauchy("sn2", beta=3)

#    K = pm.gp.cov.ExpQuad(2, l) * sf2**2
    K = pm.gp.cov.Matern32(2,l) * sf2**2
    
    gp_spatial = pm.gp.MarginalSparse(cov_func=K, approx="FITC")
    obs = gp_spatial.marginal_likelihood("obs", X=X_shared[:200], Xu=Xu, y=Y_shared[:200], noise=sn2)

    trace = pm.sample(cores=2, chains=2)

with spatial_model:
    f_pred = gp_spatial.conditional('f_pred2', Z)
    samples = pm.sample_posterior_predictive([trace], vars=[f_pred], samples=100)

Thanks so much. Actually it worked for me after updating pymc3. One of the things I need to do is to manually write a code to generate posterior predictive for this model. Since at the end of the day I will pass this to people who don’t have pymc3. For simple models like Multiple regression, this is easy for me to construct, but for this model is more confusing. Specially I see instead of changing x_shared, we used “gp_spatial.conditional('f_pred2' , Z)”. I am wondering if you can advice me how to write posterior predictive using my posterior for this model? Is there any Layman reference? ~Thanks!

To continue the discussion above, the parameters are sn2, sf2, l. Can I claim “sn2” is something like standard deviation of residuals? and how can I make posterior predictive (without using pymc3) if I have posterior samples for sn2, sf2, l from pymc3? Is there any easy read reference?

I am confused by what you mean just now, but if you want to write A) an analytical expression for your posterior, that would be difficult, and I would hopefully direct you to Rasmussen’s GP book, or B) create predictions without Pymc3, after gaining sampled chains from Pymc3, that would be recreating a complex component manually, which imho would not be easy, and I have no good reference for that.

The best I would hope is maybe one of the developers can chime in on what posterior predicting based on new data precisely does in Pymc3.

Thanks! You got it. So I want to use chains I have for sn2 and sf2 and l to generate a posterior predictive if someone ask me what is the function value for new x1 and x2. However, I do not want to use pymc3 anymore, I just want to get chains for sn2 and sf2 and l from pymc3 and then write a code to use them to generate posterior predictive. This is easy for regression models, but I am not sure how it gonna be exactly for Gaussian process.