Broadcasting problem with Metropolis Sampling

I’m experiencing what seems to be the same problem highlighted in several other posts (#1983, #1083, #1304), although none of the proposed solutions have worked for my model.

I am running a Gaussian process regression in a for loop: on each iteration the target ‘y’ variable changes while the model structure stays the same. The code runs fine (although significantly more slowly) when the step method is pm.NUTS(), but fails when it is Metropolis, with the error: “The broadcastable pattern of the input is incorrect for this op. Expected (True,), got (False,).”

I’ve tried both flattening and reshaping my priors, and also setting patternbroadcast to True on each of them, but have had no luck. I would appreciate any help, and any comments on the general structure of the code are also welcome. I’ve been trying to set this up correctly for some time, as running in a for loop has been causing memory-leak problems (#1825), so I hope my current setup is memory efficient. On that note, I’ve found I have to give the conditional distribution a new name on every loop iteration, otherwise I get an error that the variable already exists. Is this normal? Many thanks.
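For reference, the patternbroadcast workaround I tried looked roughly like the fragment below (a sketch only, from inside the model block; the prior shown is just one example, and I applied the same pattern to each prior):

logℓ = pm.Normal('log(ℓ)', mu=ℓ_mu, sd=ℓ_sd, shape=1)
# mark the single length-1 dimension as broadcastable (Theano's patternbroadcast)
logℓ = tt.patternbroadcast(logℓ, (True,))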

Below is my code (I’ve just randomly generated some fake data for the example here):

import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt

# 'data' (see below) is a 3D matrix; each (i, j) entry will be predicted.
# The 3rd dimension of 'data' has size n.
# X is the same for each prediction and has shape n x N.
# Xnew is a single prediction input with shape 1 x N.

np.random.seed(1)
data = np.random.rand(60, 60, 30)
X = np.random.rand(30, 100)
Xnew = np.random.rand(1, 100)

# Shared variable holding the current y vector, so the same model can be reused
shared_data = theano.shared(data[0, 0, :], borrow=True)
        
with pm.Model() as model:
    # hyperpriors
    ℓ_mu = pm.Normal('ℓ_mu', mu=-10, sd=2)
    ℓ_sd = pm.Gamma('ℓ_sd', alpha=10, beta=10)
    σf_mu = pm.Normal('σf_mu', mu=-1, sd=2)
    σf_sd = pm.Gamma('σf_sd', alpha=5, beta=10)
    σn_mu = pm.Normal('σn_mu', mu=-1, sd=2)
    σn_sd = pm.Gamma('σn_sd', alpha=5, beta=10)

    # priors
    logℓ = pm.Normal('log(ℓ)', mu=ℓ_mu, sd=ℓ_sd)
    logσf = pm.Normal('log(σf)', mu=σf_mu, sd=σf_sd)
    logσn = pm.Normal('log(σn)', mu=σn_mu, sd=σn_sd)
    ℓ = tt.exp(logℓ)
    σf = tt.exp(logσf)
    σn = tt.exp(logσn)
    μprior = pm.gp.mean.Zero()
    Σprior = σf * pm.gp.cov.ExpQuad(input_dim=X.shape[1], active_dims=np.arange(X.shape[1]), ls=ℓ)

    # GP function
    gp = pm.gp.Marginal(mean_func=μprior, cov_func=Σprior)
    y_ = gp.marginal_likelihood('y_', X=X, y=shared_data, noise=σn)

for i in range(60):
    for j in range(60): #dimensions of the 'data' matrix above
        ID = i*60 + j
        
        shared_data.set_value(data[i,j,:])

        with model:
            db = pm.backends.Text('./backends/' + str(ID))
            pm.sample(draws=1000, chains=5, step=pm.Metropolis(), random_seed=1, progressbar=False, trace=db)
            trace = pm.backends.text.load('./backends/' + str(ID))

            # Posterior-mean point estimate for each variable, discarding the first half as burn-in
            point = {}
            for obj in trace[-1]:
                array = np.asarray([np.nanmean(np.nanmean(trace.get_values(str(obj), burn=1000//2, combine=False), axis=1))])
                point[str(obj)] = array

            # The conditional needs a unique name on every iteration, hence str(ID)
            func = gp.conditional(str(ID), Xnew=Xnew, pred_noise=True)
            μpost, Σpost = gp.predict(Xnew, point=point, pred_noise=True)

I got this working with Metropolis sampling in the end. I changed my priors to make them more ‘informative’ after reading Robust Gaussian Processes in Stan, and somehow this solved the broadcasting issue.

I also realised that it is not necessary to define the conditional distribution, as the prediction function calls this internally. That speeds things up a bit. Updated code below:

import numpy as np
import pymc3 as pm
import theano

np.random.seed(1)
data = np.random.rand(60, 60, 30)
X = np.random.rand(30, 100)
Xnew = np.random.rand(1, 100)

shared_data = theano.shared(data[0, 0, :], borrow=True)
        
with pm.Model() as model:
    ℓ_alpha = pm.HalfNormal('ℓ_alpha', sd=3)
    ℓ_beta = pm.HalfNormal('ℓ_beta', sd=4e-7)
    σf_sd = pm.HalfNormal('σf_sd', sd=0.1)
    σn_sd = pm.HalfNormal('σn_sd', sd=0.01)

    ℓ = pm.InverseGamma('ℓ', alpha=ℓ_alpha, beta=ℓ_beta)
    σf = pm.HalfNormal('σf', sd=σf_sd)
    σn = pm.HalfNormal('σn', sd=σn_sd)
    μprior = pm.gp.mean.Zero()
    Σprior = σf * pm.gp.cov.ExpQuad(input_dim=X.shape[1], active_dims=np.arange(X.shape[1]), ls=ℓ)

    # GP function
    gp = pm.gp.Marginal(mean_func=μprior, cov_func=Σprior)
    y_ = gp.marginal_likelihood('y_', X=X, y=shared_data, noise=σn)

for i in range(60):
    for j in range(60): #dimensions of the 'data' matrix above
        ID = i*60 + j
        
        shared_data.set_value(data[i,j,:])

        with model:
            db = pm.backends.Text('./backends/' + str(ID))
            pm.sample(draws=1000, chains=5, step=pm.Metropolis(), random_seed=1, progressbar=False, trace=db)
            trace = pm.backends.text.load('./backends/' + str(ID))

            # Posterior-mean point estimate for each variable, discarding the first half as burn-in
            point = {}
            for obj in trace[-1]:
                array = np.asarray([np.nanmean(np.nanmean(trace.get_values(str(obj), burn=1000//2, combine=False), axis=1))])
                point[str(obj)] = array

            # gp.predict builds the conditional internally, so no explicit gp.conditional call is needed
            μpost, Σpost = gp.predict(Xnew, point=point, pred_noise=True)

But you really shouldn’t be using Metropolis, especially with a GP: this is a high-dimensional problem, and MH will most likely give you biased or even wrong results.
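One quick sanity check is to compare effective sample size and R-hat between your Metropolis trace and a short NUTS run for a single grid cell; a rough sketch, reusing the model defined above (the trace names here are just illustrative):

with model:
    # Metropolis, as in your loop, vs. a shorter run with the default NUTS step
    trace_mh = pm.sample(draws=1000, chains=5, step=pm.Metropolis(), random_seed=1)
    trace_nuts = pm.sample(draws=500, chains=2, random_seed=1)

# Metropolis typically yields far fewer effective samples on correlated GP hyperparameters
print(pm.effective_n(trace_mh))
print(pm.effective_n(trace_nuts))
print(pm.gelman_rubin(trace_mh))  # values well above 1 indicate non-convergence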

Thanks for your response!

Ideally I would use NUTS for sampling, but I just don’t know whether it’s feasible. In this instance I have 3600 (60x60) predictions to make, and with Metropolis sampling this takes approximately 30 hours to run. I understand that NUTS is more efficient at generating effective samples, but in terms of pure run time Metropolis appears to be significantly faster (I haven’t run this all the way through with NUTS, but I think it could take up to and beyond 5 times as long). Furthermore, I eventually plan to run this on a higher-resolution dataset where I have >100,000 predictions to make (I am predicting fields of geospatial data).

In any case, I will check the Metropolis result and see how it performs. If it turns out I need to switch to NUTS, I think the best approach will be to divide the data into swaths and run them in parallel on my university’s computer servers; a rough sketch of the splitting I have in mind is below. Not ideal, but hey.
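The splitting itself could be as simple as chunking the grid rows and submitting one chunk per job (an illustrative sketch only; the chunk count is arbitrary):

import numpy as np

# Split the 60 grid rows into 8 chunks, one per job/server node (numbers illustrative)
row_swaths = np.array_split(np.arange(60), 8)

for job_id, rows in enumerate(row_swaths):
    # each job would rebuild the model and loop over only its own rows
    print(job_id, rows[0], rows[-1])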

Thanks again.