MA should need scan for the same reason you can represent an AR model using simple OLS, but not an MA model. You need access to the error estimates \hat{\epsilon}_t at every time step, which isn’t possible unless you iteratively compute estimates for \hat{y}_t (\mu_t in your nomenclature).
I assume higher order AR processes should be possible without scan by just using clever slicing, something like:
mu_t = pm.Normal('hidden state', mu=0, sigma=1, shape=T)
phi = pm.Normal('MA params', mu=0, sigma=1, shape=2)
mu_hat = (mu_t * phi[0] + at.concatenate([np.nan], mu_t[1:]) * phi[1]).cumsum()
Would need to work out the details with the padding. Also need to write a transformation to constrain samples of the phis to the space of stationary distributions, otherwise this will easily explode the sampler for long time series/higher order processes.