Implementing a new distribution (loop over x and y in logp)

Is there any way to use theano.scan inside logp? I am trying to loop over the observed variable x and the error covariance, which are both fixed and have the same shape[0], but I cannot make this work.

import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

mu = tt.vector("mu")
cov = tt.dmatrix("cov")
covErr = tt.dmatrix("covErr")  # stack of per-observation error covariances
x = tt.dmatrix("x")

def map_func(i, x, mu, cov, covErr):
    # logp of the i-th observation under a MvNormal whose covariance
    # is the shared cov plus the i-th error covariance
    return pm.MvNormal.dist(mu=mu, cov=cov + covErr[i]).logp(x[i])

result, updates = theano.scan(fn=map_func,
                              sequences=[tt.arange(x.shape[0])],
                              non_sequences=[x, mu, cov, covErr])

final_result = tt.sum(result)

calculate_logp = theano.function(inputs=[x, mu, cov, covErr],
                                 outputs=final_result, updates=updates)

class MvNormalwErr2(pm.distributions.Continuous):

    def __init__(self, mu, cov, covErr, *args, **kwargs):
        super(MvNormalwErr2, self).__init__(*args, **kwargs)
        self.mu = mu = tt.as_tensor_variable(mu)
        self.cov = cov = tt.as_tensor_variable(cov)
        self.covErr = covErr = tt.as_tensor_variable(covErr)
        self.mean = self.median = self.mode = mu

    def logp(self, x):
        # this is the part that breaks: calculate_logp is a compiled
        # function, but logp is called with symbolic tensors
        return calculate_logp(x, self.mu, self.cov, self.covErr)

PS: I have tried something like the following in my model, but it is super slow; I am trying to get to the speed of MvNormal. With 200 observations it already struggles to run, so it will not scale to 100k observations.

with pm.Model() as model3:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2,
                                 eta=2., sd_dist=pm.HalfCauchy.dist(2.5))

    L = pm.expand_packed_triangular(2, packed_L)
    sig = pm.Deterministic('sig', L.dot(L.T))
    mu = pm.Normal('mu', 0., 10., shape=2,
                   testval=x.mean(axis=0))
    # obs = MvNormalwErr2('obs', mu=mu, cov=sig, covErr=covErr, observed=x)
    # one MvNormal per observation -- runs, but is very slow
    obs = []
    for i in range(x.shape[0]):
        obs.append(pm.MvNormal('obs' + str(i), mu=mu, cov=sig + covErr[i],
                               observed=x[i]))
    obs = tt.stack(obs)

IIUC, you have two loops: for each slice of x, you loop over covErr, which is some error matrix/scalar?
I think it is going to be slow in any case if there are a lot of elements in covErr. But looking at your code above, it seems covErr and x have the same length along the dimension you want to loop over? If that's the case, you can build it in one theano.scan loop, as sketched below.
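
Something like this minimal sketch (assuming PyMC3 on the Theano backend; the class name MvNormalwErr and the inner step function are just illustrative names). The key change from your version is that the scan is built symbolically inside logp and the summed log-probability is returned as a Theano expression, rather than calling a compiled theano.function:

import theano
import theano.tensor as tt
import pymc3 as pm

class MvNormalwErr(pm.distributions.Continuous):
    """MvNormal whose covariance gets a per-observation error term."""

    def __init__(self, mu, cov, covErr, *args, **kwargs):
        super(MvNormalwErr, self).__init__(*args, **kwargs)
        self.mu = tt.as_tensor_variable(mu)
        self.cov = tt.as_tensor_variable(cov)
        self.covErr = tt.as_tensor_variable(covErr)
        self.mean = self.median = self.mode = self.mu

    def logp(self, x):
        # one scan over both x and covErr at once; each step evaluates
        # the MvNormal logp with the per-observation covariance
        def step(x_i, covErr_i, mu, cov):
            return pm.MvNormal.dist(mu=mu, cov=cov + covErr_i).logp(x_i)

        logps, _ = theano.scan(fn=step,
                               sequences=[x, self.covErr],
                               non_sequences=[self.mu, self.cov])
        # logp stays symbolic, so PyMC3 can differentiate through it
        return tt.sum(logps)

That should let you replace the per-observation loop in your model with a single observed variable, much like the line you commented out:

obs = MvNormalwErr('obs', mu=mu, cov=sig, covErr=covErr, observed=x)

Note this is still a sequential scan, so it avoids the per-RV graph overhead of the Python loop but is unlikely to match the speed of a single vectorized MvNormal.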