Vectorized autoregressive model

Unfortunately, the current implementation of pm.AR does not allow a multidimensional rho:
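For example, something along these lines (a hypothetical sketch; the data and shapes are purely illustrative) fails, because pm.AR expects rho to be a one-dimensional vector of lag coefficients rather than a matrix with one column per series:

import numpy as np
import pymc3 as pm

y = np.random.randn(100, 5)  # placeholder data: 5 series of length 100

with pm.Model() as model:
    # one column of lag coefficients per series -> shape (2, 5)
    rho = pm.Normal('rho', 0., 1., shape=(2, 5))
    sd = pm.HalfNormal('sd', 1.)
    like = pm.AR('like', rho=rho, sd=sd, observed=y)  # fails: pm.AR does not accept a 2d rho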

With a bit of rewriting of pm.AR, the model seems to run:

import theano.tensor as tt
from theano import scan
from pymc3.distributions.distribution import Continuous
from pymc3.distributions.continuous import get_tau_sd, Normal, Flat


class AR2d(Continuous):
    R"""
    Autoregressive process with p lags, vectorized across several series.
    .. math::
       x_t = \rho_0 + \rho_1 x_{t-1} + \ldots + \rho_p x_{t-p} + \epsilon_t,
       \epsilon_t \sim N(0,\sigma^2)
    The innovation can be parameterized either in terms of precision
    or standard deviation. The link between the two parametrizations is
    given by
    .. math::
       \tau = \dfrac{1}{\sigma^2}
    Parameters
    ----------
    rho : tensor
        Matrix of autoregressive coefficients, with one row per lag
        (plus a leading row for the constant, if used) and one column
        per series.
    sd : float
        Standard deviation of innovation (sd > 0). Only required if
        tau is not specified.
    tau : float
        Precision of innovation (tau > 0). Only required if sd is not
        specified.
    constant : bool (optional, default=False)
        Whether to include a constant.
    init : distribution
        Distribution for initial values (defaults to Flat()).
    """

    def __init__(self, rho, sd=None, tau=None,
                 constant=False, init=Flat.dist(),
                 *args, **kwargs):

        super(AR2d, self).__init__(*args, **kwargs)
        tau, sd = get_tau_sd(tau=tau, sd=sd)
        self.sd = tt.as_tensor_variable(sd)
        self.tau = tt.as_tensor_variable(tau)

        self.mean = tt.as_tensor_variable(0.)

        # rho is now a matrix: one row per lag (plus a leading row for
        # the constant, if used) and one column per series.
        rho = tt.as_tensor_variable(rho, ndim=2)
        if constant:
            self.p = rho.shape[0] - 1
        else:
            self.p = rho.shape[0]

        self.constant = constant
        self.rho = rho
        self.init = init

    def logp(self, value):
        # Rows of rho that hold the lag coefficients (drop the
        # constant row, if there is one).
        rho = self.rho[1:] if self.constant else self.rho

        # For each lag l, weight the series shifted back by l steps;
        # summing over lags gives the AR prediction for value[p:].
        results, _ = scan(lambda l_, obs, p, rho_: rho_[l_ - 1] * obs[p - l_:-l_],
                          outputs_info=None, sequences=[tt.arange(1, self.p + 1)],
                          non_sequences=[value, self.p, rho])
        x = tt.sum(results, axis=0)

        eps = value[self.p:] - x
        if self.constant:
            eps = eps - self.rho[0]

        innov_like = Normal.dist(mu=0.0, tau=self.tau).logp(eps)
        init_like = self.init.logp(value[:self.p])

        return tt.sum(innov_like) + tt.sum(init_like)

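As a minimal usage sketch (the data, shapes, and priors here are hypothetical, chosen only to illustrate the interface):

import numpy as np
import pymc3 as pm

T, n, p = 100, 5, 2
y = np.random.randn(T, n)  # placeholder data: n series of length T

with pm.Model() as model:
    # one column of p lag coefficients per series
    rho = pm.Normal('rho', 0., 1., shape=(p, n))
    sd = pm.HalfNormal('sd', 1.)
    like = AR2d('like', rho=rho, sd=sd, init=pm.Normal.dist(0., 1.), observed=y)
    trace = pm.sample(1000, tune=1000)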
You can have a look at this small notebook as a demonstration:

However, as you can see from the notebook, the implementation doesn't seem to improve the speed much.