GP speedup for astrophysics model

I am working on a model to determine physical parameters of Active Galactic Nuclei. The model solves a deconvolution problem: the data I have in a given band is the result of a convolution between an X-ray driving function and a transfer function. We have no data for the X-ray driving function, but we know that it is described by a GP with a specific covariance function, namely an exponential one. The transfer function for each band is the sum of two lognormals, which describe different parts of the object. I define the transfer function on an input array of delay times (tau), and for the GP I place a prior at points evenly spaced over the time range of my data. I interpolate the resulting convolution to get values at the same time points as my data, so that I can feed them into my likelihood function and compare with the observations. This is done for every data band I have.
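In symbols (my notation, matching the code below): each band's flux is the driving GP convolved with that band's transfer function, and each transfer-function component is a shifted lognormal in the delay $\tau$,

$$F_\mathrm{band}(t) = \int_0^{\tau_\mathrm{max}} X(t-\tau)\,\Psi_\mathrm{band}(\tau)\,d\tau, \qquad \Psi_\mathrm{band} = \Psi_\mathrm{DT} + \Psi_\mathrm{AD},$$

$$X \sim \mathcal{GP}\!\left(0,\ \eta^2 e^{-|\Delta t|/\ell}\right), \qquad \Psi(\tau) \propto \frac{1}{(\tau-\theta)\,\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{\ln^2\!\big((\tau-\theta)\,e^{-m}\big)}{2\sigma^2}\right) \quad \text{for } \tau > \theta,\ \text{else } 0,$$

where the proportionality constant is the blackbody or power-law amplitude for that band.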

Here is an example of my data:

[example light-curve plot]

and the model I am using:

import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

nGP = 879
XGP = np.linspace(np.min(X_tot), np.max(X_tot), nGP)  # evenly spaced points over the data range
XGP = np.reshape(XGP, (len(XGP), 1))
ntau = 879
tau = np.linspace(0.0, 879.0, ntau)  # delay times in days

with pm.Model() as convmodel:
    ############################################
    # Define the driving function as a Gaussian Process
    ############################################
    # TODO: find a way to use the g band as a first guess of the value
    ℓ = pm.TruncatedNormal('ℓ', mu=np.sqrt(5.0/2.0), sigma=np.sqrt(5.0/2.0), lower=0.0, upper=np.sqrt(10.0/2.0))  # variation timescale of the driving function; of order days for X-rays
    # REMEMBER: the timescale is 2*ℓ^2, so rewrite as ℓ_true = 2*ℓ^2 afterwards
    η = pm.TruncatedNormal('η', mu=5.0, sigma=5.0, lower=0.0, upper=10.0)  # long-term standard deviation of the driving function
    cov = η**2 * pm.gp.cov.Exponential(1, ls=ℓ)  # same covariance as used for the light-curve interpolation
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=XGP)
    f = f.reshape((1, 1, len(XGP), 1))  # conv2d expects (batch, channels, rows, cols)
    ##############
    # Define priors
    ##############
    # Universal Dusty Torus parameters for the uniform-temperature DT
    sigma_DT = pm.TruncatedNormal('sigma_DT', mu=1.17609126, sigma=1.17609126, lower=0.0, upper=3.40119738)  # needs a source for the scale
    m_DT = pm.Uniform('m_DT', lower=0.0, upper=np.log(100.0))  # we expect several tens to hundreds of days from the Nature letter
    theta_DT = pm.Uniform('theta_DT', lower=0.0, upper=np.max(tau)/1.2)  # refine later, once the simple model is stable
    # Accretion Disk parameters
    # J band
    Jsigma_AD = pm.TruncatedNormal('Jsigma_AD', mu=1.17609126, sigma=1.17609126, lower=0.0, upper=3.40119738)  # Shappee 2014 suggests somewhere between 0-30 days, so log of that
    Jtheta_AD = pm.Normal('Jtheta_AD', mu=0.0, sigma=10.0)  # refine later
    Jm_AD = pm.Uniform('Jm_AD', lower=0.1, upper=np.log(100.0))  # the AD has 3-5 times smaller lags than the DT
    # H band
    Hsigma_AD = pm.TruncatedNormal('Hsigma_AD', mu=1.17609126, sigma=1.17609126, lower=0.0, upper=3.40119738)  # Shappee 2014 suggests somewhere between 0-30 days, so log of that
    Htheta_AD = pm.Normal('Htheta_AD', mu=0.0, sigma=10.0)  # refine later
    Hm_AD = pm.Uniform('Hm_AD', lower=0.1, upper=np.log(100.0))  # the AD has 3-5 times smaller lags than the DT
    # K band
    Ksigma_AD = pm.TruncatedNormal('Ksigma_AD', mu=1.17609126, sigma=1.17609126, lower=0.0, upper=3.40119738)  # Shappee 2014 suggests somewhere between 0-30 days, so log of that
    Ktheta_AD = pm.Normal('Ktheta_AD', mu=0.0, sigma=10.0)  # refine later
    Km_AD = pm.Uniform('Km_AD', lower=0.1, upper=np.log(100.0))  # the AD has 3-5 times smaller lags than the DT
    # BB and power-law parameters
    T = pm.TruncatedNormal('T', mu=tt.log(1400.0), sigma=0.22, lower=tt.log(1000.0), upper=tt.log(2300.0))  # taken from the Nature letter
    K_0 = pm.TruncatedNormal('K_0', mu=1.0, sigma=1.0, lower=0.0, upper=10.0)  # is it BB/powr or powr/BB?
    index = pm.Normal('index', mu=1.5, sigma=0.5)  # sign depends on the diffmag definition; change to -2 to -1
    # Note on index: the transformation from F_nu to F_lambda is taken into account in the index value.
                     
    # Different wavelengths for different bands; not a free parameter
    # REMIR filters in nm
    Jwav = 1250.0
    Hwav = 1625.0
    Kwav = 2150.0
    # Sloan filters for ROSS2 in nm
    # gwav = 475.4
    # rwav = 620.4
    # iwav = 769.8
    # zwav = 966.5

    # Define constants
    wav_0 = 1122.4   # reference wavelength in nm; use 500?
    h = 6.626e-34    # Planck's constant in J*s
    c = 299792458.0  # speed of light in m/s
    k_B = 1.38e-23   # Boltzmann's constant in J/K (renamed from k so it is not shadowed by the noise factor below)

    # Peak of the black body at the uniform torus temperature (Wien's law)
    wav_peak = 2.898e6 / tt.exp(T)  # peak wavelength in nm
    b_max = 4.965  # Wien's constant, h*c/(1e-9*wav_peak*k_B*tt.exp(T)) ≈ 4.965
    BB_max = 1.0 / ((wav_peak**5) * (tt.exp(b_max) - 1.0))

    # Universal lognormal for the Dusty Torus
    exp_DT = -((tt.log((tau - theta_DT) / tt.exp(m_DT)))**2 / (2 * sigma_DT**2))
    front_DT = 1.0 / ((tau - theta_DT) * sigma_DT * tt.sqrt(2 * np.pi))
    lognorm_DT = front_DT * tt.exp(exp_DT)
    lognorm_DT = tt.switch(tt.isnan(lognorm_DT), 0.0, lognorm_DT)  # zero out tau <= theta_DT, where the log is undefined

    # Dusty Torus transfer equation for the J band
    Jb = h * c / (1e-9 * Jwav * k_B * tt.exp(T))
    JBB = (1.0 / (Jwav**5 * (tt.exp(Jb) - 1.0))) / BB_max
    JPsi_DT = JBB * lognorm_DT

    # Dusty Torus transfer equation for the H band
    Hb = h * c / (1e-9 * Hwav * k_B * tt.exp(T))
    HBB = (1.0 / (Hwav**5 * (tt.exp(Hb) - 1.0))) / BB_max
    HPsi_DT = HBB * lognorm_DT

    # Dusty Torus transfer equation for the K band
    Kb = h * c / (1e-9 * Kwav * k_B * tt.exp(T))
    KBB = (1.0 / (Kwav**5 * (tt.exp(Kb) - 1.0))) / BB_max
    KPsi_DT = KBB * lognorm_DT

    # Accretion Disk transfer equation for the J band
    Jpowr = K_0 * (Jwav / wav_0)**index
    Jexp_AD = -((tt.log((tau - Jtheta_AD) / tt.exp(Jm_AD)))**2 / (2 * Jsigma_AD**2))
    Jfront_AD = 1.0 / ((tau - Jtheta_AD) * Jsigma_AD * tt.sqrt(2 * np.pi))
    Jlognorm_AD = Jfront_AD * tt.exp(Jexp_AD)
    Jlognorm_AD = tt.switch(tt.isnan(Jlognorm_AD), 0.0, Jlognorm_AD)
    JPsi_AD = Jpowr * Jlognorm_AD

    # Accretion Disk transfer equation for the H band
    Hpowr = K_0 * (Hwav / wav_0)**index
    Hexp_AD = -((tt.log((tau - Htheta_AD) / tt.exp(Hm_AD)))**2 / (2 * Hsigma_AD**2))
    Hfront_AD = 1.0 / ((tau - Htheta_AD) * Hsigma_AD * tt.sqrt(2 * np.pi))
    Hlognorm_AD = Hfront_AD * tt.exp(Hexp_AD)
    Hlognorm_AD = tt.switch(tt.isnan(Hlognorm_AD), 0.0, Hlognorm_AD)
    HPsi_AD = Hpowr * Hlognorm_AD

    # Accretion Disk transfer equation for the K band
    Kpowr = K_0 * (Kwav / wav_0)**index
    Kexp_AD = -((tt.log((tau - Ktheta_AD) / tt.exp(Km_AD)))**2 / (2 * Ksigma_AD**2))
    Kfront_AD = 1.0 / ((tau - Ktheta_AD) * Ksigma_AD * tt.sqrt(2 * np.pi))
    Klognorm_AD = Kfront_AD * tt.exp(Kexp_AD)
    Klognorm_AD = tt.switch(tt.isnan(Klognorm_AD), 0.0, Klognorm_AD)
    KPsi_AD = Kpowr * Klognorm_AD

    #########################
    # Full transfer equations
    #########################
    Jtransfer = JPsi_DT + JPsi_AD
    # JPsi = pm.Deterministic('JPsi', Jtransfer)
    Jtransfer = Jtransfer.reshape((1, 1, len(tau), 1))
    Htransfer = HPsi_DT + HPsi_AD
    # HPsi = pm.Deterministic('HPsi', Htransfer)
    Htransfer = Htransfer.reshape((1, 1, len(tau), 1))
    Ktransfer = KPsi_DT + KPsi_AD
    # KPsi = pm.Deterministic('KPsi', Ktransfer)
    Ktransfer = Ktransfer.reshape((1, 1, len(tau), 1))

    # The convolutions
    ######################################################################
    # 'half': pad the input with a symmetric border of filter_rows // 2
    # rows and filter_cols // 2 columns, then perform a valid convolution.
    # For filters with an odd number of rows and columns, this leads to
    # the output shape being equal to the input shape. Here f and the
    # filters are all (1, 1, 879, 1), so the output is also (1, 1, 879, 1).
    ######################################################################
    Jconvol = theano.tensor.nnet.conv2d(f, Jtransfer, border_mode='half')
    Jcomp = interpolate(XGP[:, 0], Jconvol[0, 0, :, 0], XJ[:, 0])
    Hconvol = theano.tensor.nnet.conv2d(f, Htransfer, border_mode='half')
    Hcomp = interpolate(XGP[:, 0], Hconvol[0, 0, :, 0], XH[:, 0])
    Kconvol = theano.tensor.nnet.conv2d(f, Ktransfer, border_mode='half')
    Kcomp = interpolate(XGP[:, 0], Kconvol[0, 0, :, 0], XK[:, 0])

    # Define likelihoods
    k = pm.TruncatedNormal('k', mu=1.0, sigma=1.0, lower=0.0, upper=3.0)  # noise boost factor
    Jlikelihood = pm.Normal('yJ', mu=Jcomp, sigma=k * yJerr, observed=yJ)
    Hlikelihood = pm.Normal('yH', mu=Hcomp, sigma=k * yHerr, observed=yH)
    Klikelihood = pm.Normal('yK', mu=Kcomp, sigma=k * yKerr, observed=yK)
    ######################################################################
    # max_treedepth, default=10:
    # the maximum tree depth; trajectories are stopped when this depth is reached.
    # early_max_treedepth, default=8:
    # the maximum tree depth during the first 200 tuning samples.
    ######################################################################

    tracetransfer = pm.sample(4000, tune=1000, init='advi+adapt_diag', chains=2, cores=8)
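For reference, `interpolate` above is not a PyMC/Theano function but my own helper; it does plain linear interpolation in Theano ops. A minimal sketch of what it does (the exact implementation may differ; this version assumes `x_ref` is sorted and the query points lie inside its range):

import theano.tensor as tt

def interpolate(x_ref, y_ref, x_new):
    x_ref = tt.as_tensor_variable(x_ref)           # accept numpy or Theano inputs
    y_ref = tt.as_tensor_variable(y_ref)
    x_new = tt.as_tensor_variable(x_new)
    idx = tt.extra_ops.searchsorted(x_ref, x_new)  # index of the right-hand neighbour
    idx = tt.clip(idx, 1, x_ref.shape[0] - 1)      # keep both neighbours in range
    x0, x1 = x_ref[idx - 1], x_ref[idx]
    y0, y1 = y_ref[idx - 1], y_ref[idx]
    w = (x_new - x0) / (x1 - x0)                   # fractional position between neighbours
    return y0 + w * (y1 - y0)                      # linear interpolation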

Ideally the convolution would be between two continuous functions, but that is not possible, since we would have to know the value of the GP at all times; instead, I specify the GP at regularly spaced points. Running the model with that many GP variables (879 latent points, so an 879x879 covariance matrix) slows it down considerably, to the point where just the initialization takes 15 hours or more. Does anyone have ideas on how to improve the speed of this model? Can the GP be defined in a different way that is faster? Thank you for your time.