HMC and NUTS are slow compared to Metropolis-Hastings

I am trying to model the dwell times of buses. For this I have five parameters and two covariates.
The parameters are beta_0, beta_1_ON, beta_1_OFF, beta_2_ON and beta_2_OFF.
The covariates are ON_shared and OFF_shared which are shared theano variables.
The log-likelihood is defined by a logistic distribution on the log of the dwell time:

self.alpha = self.beta_1_ON*self.ON_shared + self.beta_1_OFF*self.OFF_shared + self.beta_0
self.beta = self.beta_2_ON*self.ON_shared + self.beta_2_OFF*self.OFF_shared
self.mu = pm.math.log(self.alpha)
self.s = tt.power(self.beta, -1)
log_DWELLS_obs = pm.Logistic('log_DWELLS_obs_{0}'.format(index), mu=self.mu, s=self.s, observed=pm.math.log(dwell_time))
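For concreteness, the log-density this model assigns to a single observation can be sketched in plain numpy/scipy; the parameter and covariate values below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical parameter draws and covariates for one stop (illustration only).
beta_0, beta_1_on, beta_1_off = 1.0, 3.0, 3.0
beta_2_on, beta_2_off = 1.0, 1.0
on, off = 4.0, 2.0          # ON/OFF covariate values
dwell_time = 30.0           # observed dwell time in seconds

alpha = beta_1_on * on + beta_1_off * off + beta_0
beta = beta_2_on * on + beta_2_off * off
mu = np.log(alpha)          # location of the logistic on the log scale
s = 1.0 / beta              # scale

# Log-density of log(dwell_time) under Logistic(mu, s)
logp = stats.logistic.logpdf(np.log(dwell_time), loc=mu, scale=s)
```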

The initial priors are as follows,

self.beta_1_ON = pm.Normal('beta_1_ON_{0}'.format(index), mu=3, sd=1)
self.beta_1_OFF = pm.Normal('beta_1_OFF_{0}'.format(index), mu=3, sd=1)
self.beta_2_ON = pm.Normal('beta_2_ON_{0}'.format(index), mu=1, sd=0.5)
self.beta_2_OFF = pm.Normal('beta_2_OFF_{0}'.format(index), mu=1, sd=0.5)
self.beta_0 = pm.Normal('beta_0_{0}'.format(index), mu=1, sd=0.5)
self.use_prior = False

For subsequent iterations I discard the burn-in samples and then use a from_posterior function to build a prior distribution from those samples, like so:

self.beta_1_ON = from_posterior('beta_1_ON_{0}'.format(index), self.beta_1_ON_samples[index][self.beta_1_ON_samples[index] > 0].values)
self.beta_1_OFF = from_posterior('beta_1_OFF_{0}'.format(index), self.beta_1_OFF_samples[index][self.beta_1_OFF_samples[index] > 0].values)
self.beta_2_ON = from_posterior('beta_2_ON_{0}'.format(index), self.beta_2_ON_samples[index][self.beta_2_ON_samples[index] > 0].values)
self.beta_2_OFF = from_posterior('beta_2_OFF_{0}'.format(index), self.beta_2_OFF_samples[index][self.beta_2_OFF_samples[index] > 0].values)
self.beta_0 = from_posterior('beta_0_{0}'.format(index), self.beta_0_samples[index][self.beta_0_samples[index] > 0].values)

The index here is just the current datapoint in our database of bus dwell times. The from_posterior function is defined as follows:

import numpy as np
import pymc3 as pm
import theano.tensor as tt
from scipy import stats
from theano.compile.ops import as_op

def from_posterior(param, samples):
    # Create a posterior distribution for variable param using the given samples.
    # Returns an instance of From_posterior. Calling its logp method returns the
    # log probability of observing a particular value.
    class From_posterior(pm.Continuous):
        def __init__(self, *args, **kwargs):
            self.from_posterior_logp = _from_posterior_logp()
            super(From_posterior, self).__init__(*args, **kwargs)

        def logp(self, value):
            return self.from_posterior_logp(value)

    class From_posterior_logp:
        def __init__(self, samples):
            smin, smax = np.min(samples), np.max(samples)
            # Use 100 points from the smallest to the largest given sample.
            self.x = np.linspace(smin, smax, 100)
            # Evaluate the Gaussian kernel density estimate (KDE) of the
            # samples over the points in self.x.
            self.y = stats.gaussian_kde(samples)(self.x)

        def from_posterior_logp(self, value):
            # Return the log probability of observing value.
            x, y = self.x, self.y
            # y0 = np.min(y) / 100  # what was never sampled should have a small
            #                       # probability but not 0
            # Zero density for values outside [smin, smax]; log(0) = -inf, so
            # proposals outside the observed sample range are always rejected.
            y0 = 0
            return np.array(np.log(np.interp(value, x, y, left=y0, right=y0)))

    from_posterior_logp = From_posterior_logp(samples)

    def _from_posterior_logp():
        @as_op(itypes=[tt.dscalar], otypes=[tt.dscalar])
        def logp(value):
            return from_posterior_logp.from_posterior_logp(value)
        return logp

    return From_posterior(param, testval=np.median(samples))
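On the "This works, not sure why" comment: with y0 = 0 the interpolated density is zero outside the sample range, so its log is -inf and the sampler simply rejects any proposal out there. A standalone sketch of the same interpolation, using synthetic samples purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=2000)  # stand-in posterior draws

x = np.linspace(samples.min(), samples.max(), 100)
y = stats.gaussian_kde(samples)(x)

def kde_logp(value):
    # Density is 0 outside [x[0], x[-1]], so the log is -inf there and a
    # Metropolis step always rejects proposals outside the observed range.
    with np.errstate(divide='ignore'):
        return np.log(np.interp(value, x, y, left=0.0, right=0.0))

inside = kde_logp(3.0)     # finite log-density near the sample mean
outside = kde_logp(100.0)  # -inf: value never seen in the samples
```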

I then make use of the Metropolis-Hastings sampler. This takes about 10 seconds per iteration. To compare other samplers against it, I decided to try NUTS and HamiltonianMC as well. Since these are gradient-based methods, I defined a new from_posterior function that uses the Interpolated class.

def from_posterior(param, samples):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)
    # x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    x = np.concatenate([[x[0]], x + 1e-10, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return pm.Interpolated(param, x, y)
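As I understand it, the reason Interpolated is needed here at all is that NUTS/HMC require d(logp)/d(value), which pm.Interpolated exposes through theano, whereas a numpy function wrapped with as_op provides no gradient. A toy sketch of the quantity the gradient-based samplers need, on a made-up piecewise-linear density (values are for illustration only):

```python
import numpy as np

# Toy piecewise-linear density on a grid (made-up values, illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.5, 0.3, 0.1])

def logp(v):
    # Log of the linearly interpolated density, as Interpolated evaluates it.
    return np.log(np.interp(v, x, y))

# d(logp)/dv exists almost everywhere for a piecewise-linear density; a
# central finite difference recovers it. HMC/NUTS need exactly this quantity,
# which an as_op-wrapped numpy logp cannot supply to theano.
eps = 1e-6
grad = (logp(1.5 + eps) - logp(1.5 - eps)) / (2 * eps)
```

At v = 1.5 the density is 0.4 with slope -0.2, so d(logp)/dv = -0.2 / 0.4 = -0.5.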

The problem I am facing is that both NUTS and HMC are significantly slower than Metropolis-Hastings: they take about ~120 seconds and ~40 seconds per iteration, respectively. Since these methods are gradient-based, I was under the impression that they would be more efficient and faster than Metropolis-Hastings. I suspect my approach might be wrong, and since I am new to this library, I was hoping you could let me know whether I am using it correctly.
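From what I have read (assuming PyMC3's default settings), part of the gap may just be raw cost per sample: a Metropolis step evaluates logp once per proposal, while each NUTS sample builds a doubling tree of leapfrog steps, each needing a gradient (and logp) evaluation. A back-of-the-envelope comparison:

```python
# Back-of-the-envelope evaluation counts per sample (assumed PyMC3 defaults,
# for illustration only; actual tree depth adapts per sample).
metropolis_evals_per_sample = 1              # one logp evaluation per proposal
max_treedepth = 10                           # PyMC3's default for NUTS
nuts_max_leapfrog = 2 ** max_treedepth - 1   # leapfrog steps in a fully doubled tree

# Each leapfrog step costs one gradient evaluation, so a single NUTS sample
# can be up to ~1000x more expensive than a Metropolis proposal in raw
# evaluations, even though each NUTS sample typically explores far more.
ratio = nuts_max_leapfrog / metropolis_evals_per_sample
```

So slower wall time per iteration does not by itself mean the gradient-based samplers are less efficient per effective sample, though in my case the slowdown still seems larger than I expected.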