Apparently inconsistent posteriors in autoregressive model

Basically I’m doing something along the lines of (this is just a code sketch):

    mu = pm.Normal("mu", mu=0, sigma=0.5)
    sigma = pm.HalfNormal("sigma", sigma=0.2)
    traffic_0 = pm.LogNormal("traffic_0", mu=mu, sigma=sigma)

I expect the posterior samples of traffic_0 to look like what I get by taking the posterior samples of mu and sigma and drawing from a lognormal with those parameters.
However, I actually get quite different distributions.

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm

mus = az.extract(idata.posterior, var_names=["mu"])
sigmas = az.extract(idata.posterior, var_names=["sigma"])
samples = pm.draw(pm.LogNormal.dist(mu=mus.values, sigma=sigmas.values))
bins = np.linspace(0, 5, 50)
traffic_0 = az.extract(idata.posterior, var_names=["traffic_0"])

plt.hist(samples, density=True, alpha=0.5, bins=bins)
plt.hist(traffic_0.squeeze(), density=True, alpha=0.5, bins=bins)
plt.legend([r"Lognormal with $\mu$,$\sigma$ posteriors", "traffic_0 posterior"])

This feels like I must be missing something very basic, but I can’t see what. What I’m actually doing is slightly more complex, but not by much.

Autoregressive traffic model

I’m setting up an AR model where the initial value and the innovations come from the same lognormal distribution. In the true data generating process the initial values and the innovations are actually different, so this posterior “inconsistency” actually fits the data better, but I wanted to try a simpler version first and I just fail to see how the results I’m getting are possible with the model I’ve defined.

The following is a simplified version of the model I’m using. If the issue is not straightforward and requires debugging the model let me know and I’ll work on an MRE.

import pytensor
import pytensor.tensor as pt
from pymc.pytensorf import collect_default_updates

def ar_dist(ar_init, rho, mu, sigma, n_steps, size):
    def ar_step(traffic_t1, rho, mu, sigma):
        # draw a fresh lognormal "innovation" at each step and mix it with the previous value
        innov = pm.LogNormal.dist(mu=mu, sigma=sigma)
        traffic = traffic_t1 * rho[0] + innov * (1 - rho[0])
        # register the RNG updates of the RV created inside scan
        return traffic, collect_default_updates([innov])

    # lags comes from the enclosing scope
    ar_innov, _ = pytensor.scan(
        fn=ar_step,
        outputs_info=[{"initial": ar_init, "taps": range(-lags, 0)}],
        non_sequences=[rho, mu, sigma],
        n_steps=n_steps,
        strict=True,
    )
    return ar_innov


with pm.Model(coords=coords, check_bounds=False) as ar_model:
    # AR parameters
    rho = pm.Beta("rho", 5, 1, dims=("lags",))
    mu = pm.Normal("mu", mu=0, sigma=0.5)
    sigma = pm.HalfNormal("sigma", sigma=0.2)
    # AR process
    traffic_0 = pm.LogNormal("traffic_0", mu=mu, sigma=sigma, dims=("lags",))
    traffic_t = pm.CustomDist(
        f"traffic_t",
        traffic_0,
        rho,
        mu,
        sigma,
        stops - lags,
        dist=ar_dist,
        dims=("steps"),
    )
    traffic = pm.Deterministic(
        name="traffic",
        var=pt.concatenate([traffic_0, traffic_t], axis=-1),
        dims="stops",
    )
    pm.Exponential("delay", traffic, observed=obs, dims=("instances", "stops"))

I first realised the issue by plotting the first 20 steps (I’m using a shifted geometric rather than an exponential, hence the discrete HDI). In the plot below you can see that, for the posterior, the initial traffic and the innovations (the latter being the value towards which the traffic converges as steps increase) come from different distributions despite being modelled by the same parameters. At first I thought it could be an issue with scan and the updating of \mu and \sigma, but the innovations are actually consistent with those posteriors; it’s the initial value (i.e., outside scan) that is not.

Versions

  • pymc 5.17.0

In the simplified example, traffic_0 is still a parameter to be learned. It’s conditioned on mu and sigma, but not uniquely determined by them. Consider the “non-centered” formulation:

mu = pm.Normal("mu", mu=0, sigma=0.5)
sigma = pm.HalfNormal("sigma", sigma=0.2)
offset = pm.Normal("offset", mu=0, sigma=1, size=n)
traffic_0 = pm.Deterministic("traffic_0", pm.math.exp(mu + sigma * offset))

Written this (equivalent) way, it’s clear you need to estimate n “offset” parameters for traffic_0. This is what is going on. Your plot is comparing a posterior (in orange) with a mixed posterior + prior (in blue).


Thanks Jesse! The reference to the non-centered formulation, apart from being very illuminating, may be helpful down the road, as I’ve read it has sampling implications in hierarchical models that can be relevant if I encounter convergence issues following that path.

I now see that I’m setting a prior for traffic_0, as opposed to setting traffic_0 as a lognormally distributed variable with parameters mu and sigma (at least I was right in that it was something very basic that I was missing!). As I understand it, the blue does represent the posterior distribution of innov (strictly speaking they are not innovations but rather the distribution towards which the AR model tends over time) in the AR model, the relevant difference with respect to traffic_0 being the use of .dist, I presume.

That brings me to the following related question: is there a way to tie traffic_0 and innov so that they follow the same distribution? I’ve been looking around and found this Frozen or non-adaptive distribution for random variable - #10 by GBrunkhorst which led me to try:

import pytensor.tensor as tt
rand_stream = tt.random.utils.RandomStream()
...

    traffic_0 = pm.Deterministic("traffic_0", pm.math.exp(mu + sigma * rand_stream.normal(0, 1, size=(lags))), dims=("lags",))

but it fails. Then I read that this worked for v3, and the way to go about it in v5 seems to be this Incorrect step assignments with custom step function - #5 by ricardoV94, which looks quite involved (from my perspective of limited familiarity with the library). I generally take these complications as a sign that what I’m trying to do doesn’t make much sense to begin with; but I can’t help finding it reasonable to impose some structure on the model by forcing these variables to be identically distributed.

Can you write down the model you’re trying to estimate as a system of equations?

They follow the same prior, but why would they follow the same posterior?

You can still make predictions where you resample the initial point if that’s what you’re concerned about.

Otherwise there might be a subtle mismatch between the model you have in mind and the one you wrote

I’d say for the AR part (it’s actually an exponential smoothing) there’s a single equation. Denoting traffic by \tau, traffic memory by \rho, and “traffic reference” by r, I have the following.

\begin{align} \tau_t &= \rho \tau_{t-1} + (1 - \rho) r_t \\ r_t &\sim \text{LogNormal}(\mu, \sigma) \\ \tau_0 &\sim \text{LogNormal}(\mu, \sigma) \end{align}

By “traffic reference” I mean the value of traffic towards which the process converges. It’s meant to be dependent on circumstances (like weather or time) but for this basic example I ignored that part, so I’m assuming there’s a single traffic reference and I’d like the process to start at that value. In summary, \tau_0 and r_t (traffic_0 and innov) follow the same generating process.
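
To make the intended generating process concrete, here is a minimal numpy sketch of it (parameter values are purely illustrative, not the fitted ones):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, rho = 0.0, 0.2, 0.8   # illustrative values
n_steps = 100

# traffic reference drawn anew at each step from the same lognormal as the initial traffic
r = rng.lognormal(mean=mu, sigma=sigma, size=n_steps)
tau = np.empty(n_steps)
tau[0] = rng.lognormal(mean=mu, sigma=sigma)   # initial traffic, same distribution
for t in range(1, n_steps):
    tau[t] = rho * tau[t - 1] + (1 - rho) * r[t]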

You’re right, I think my confusion came from the fact that I had spent some time within ar_step (I had made plenty of mistakes there) and in that context

        innov = pm.LogNormal.dist(mu=mu, sigma=sigma)

means:

let innov be a lognormal RV with parameters mu and sigma.

Now since in my head innov and traffic_0 were meant to be the “same” thing, when I wrote

    traffic_0 = pm.LogNormal("traffic_0", mu=mu, sigma=sigma, dims=("lags",))

my mind jumped through the analogy and read:

let traffic_0 be a lognormal RV with parameters mu and sigma,

which of course makes no sense, because I’m using that very same syntax to define priors everywhere else (which is what Bayesian model definition is all about anyway). That is, the right reading is

let traffic_0’s prior be a lognormal distribution with parameters mu and sigma.

Then my question was: is there a way to do what I thought I was doing (i.e., to not set a prior but just specify a random variable)? I tried using .dist for traffic_0, since that seemed to be the key difference with innov, but I get some errors (besides, I’d have to replace the dims argument with size, and I really like dims as it’s very helpful for debugging and handling data afterwards).
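
For reference, this is the distinction I had in mind (a rough sketch with made-up parameter values): an RV built with .dist is only ever forward-sampled, while a named model variable gets a prior that the sampler updates into a posterior.

# unregistered RV: purely generative, nothing is learned about it
ln = pm.LogNormal.dist(mu=0.0, sigma=0.2)
forward_samples = pm.draw(ln, draws=1_000)

# named model variable: this defines a prior that sampling turns into a posterior
with pm.Model():
    traffic_0 = pm.LogNormal("traffic_0", mu=0.0, sigma=0.2)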

Would that be with the do operator?
Yeah my concern is more about the latter part, the fact that I’m not really doing what I thought I was doing.

I don’t think you want a fixed distribution since there’s no theoretical basis for those afaict.

The innovations in the autoregressive process are also priors with different posteriors.

I don’t know why you’re confused about the posterior for the initial points but not the innovations.

Just a guess, but what you may want is to resample it when you do posterior_predictive, which you can do by including it in var_names.
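
Something along these lines (a sketch, using the model and trace names from your post):

with ar_model:
    ppc = pm.sample_posterior_predictive(idata, var_names=["traffic_0"])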


Actually the prior for the initial points and the innovations is not even the same, as your scan is outputting a shifted and scaled LogNormal distribution that depends on previous values.

So using your terminology your timeseries translates to

let traffic at time t be a shifted and scaled lognormal prior with parameters mu, sigma, rho and the previous traffic value (at time t-1).

It may very well be that a shifted and scaled LogNormal is still another LogNormal with specifically parametrized mu and sigma, but it’s still rather different than the prior you defined for the initial points.
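
For what it’s worth, the scaling part on its own would stay within the lognormal family:

(1 - \rho)\, r_t \sim \text{LogNormal}\big(\mu + \log(1 - \rho), \sigma\big) \quad \text{when} \quad r_t \sim \text{LogNormal}(\mu, \sigma),

but the additive shift by \rho \tau_{t-1} generally takes the sum out of the lognormal family.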



This simply isn’t true. Initial conditions for time series models need to be provided as part of the problem definition. They cannot be generated by the system itself, otherwise they wouldn’t be the initial condition.

It’s important to be clear about the distinction between an ETS model and an AR model. As written, you just have an AR(1) with a funny parameterization – the (1 - \rho) factor on the innovation term can be pulled into the mean and variance terms of the r_t distribution, getting you back to the base case of \tau_t = \rho \tau_{t-1} + \eta_t.

An ETS model would instead be written:

\begin{align} \tau_t &= \ell_{t-1} + \epsilon_t \\ \ell_t &= \ell_{t-1} + \rho \epsilon_t \end{align}

Where \ell_t is a latent state describing the level of the system at time t, and \epsilon_t are innovations. In this case you also need to describe the initial level of the system \ell_0, which is a distinct object from the innovations.

You mentioned convergence, but ETS models don’t converge to a steady state in general. This simple model will produce constant forecasts of either the sample mean of the data (when \rho = 0) or the last observed data point (when \rho = 1). An AR model will converge to \tau_{ss} = \frac{\alpha}{1 - \rho}, where \alpha is a deterministic drift component (I guess zero in your case, but maybe not). One can absolutely use that formula to initialize stationary processes – it’s what I recommend for stationary state space models in the statespace module.
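
As a rough sketch, initializing a plain Gaussian AR(1) at its stationary distribution looks something like this (illustrative priors, not your traffic model):

with pm.Model() as gaussian_ar1:
    rho = pm.Uniform("rho", -1, 1)        # stationarity requires |rho| < 1
    alpha = pm.Normal("alpha", 0, 1)      # deterministic drift
    sigma = pm.HalfNormal("sigma", 1)
    # stationary mean alpha / (1 - rho) and sd sigma / sqrt(1 - rho**2)
    x0 = pm.Normal(
        "x0",
        mu=alpha / (1 - rho),
        sigma=sigma / pm.math.sqrt(1 - rho**2),
    )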

Here is an example notebook on SARIMA models, and another on ETS models, that might be helpful. I strongly recommend running the ETS notebook in a Colab to play around with the interactive plots.


Thank you very much for sticking through Ricardo and Jesse, I see I was mixing confusion at several levels and I know it’s often far from pleasant to disentangle such a mess.

This is basically on point (sorry I didn’t understand when you mentioned resampling earlier). When I plot

with ar_model:
    pred = pm.sample_posterior_predictive(idata.posterior, var_names=["traffic_0"])

traffic_0 = az.extract(pred.posterior_predictive, var_names=["traffic_0"])
plt.hist(samples, density=True, alpha=0.5, bins=bins)
plt.hist(traffic_0.squeeze(), density=True, alpha=0.5, bins=bins)
plt.legend(["Lognormal with $\mu$,$\sigma$ posteriors", "traffic_0 posterior"])
plt.show()

I get two matching distributions.

Good to know, certainly I’m not at the point of going beyond well established theory.

I wasn’t confused about the innovations because they were much more similar in scale to the lognormal distribution I had in mind, so they seemed just about right (although now that I’ve checked in more detail they clearly weren’t); your confusion about my lack of confusion was on point too.

I’m probably not using the term innovations correctly, innov in the code above is referring to the variable that affects changes in traffic, but it’s not the output of scan (i.e., innov is a variable inside ar_step). That’s why for me “innovations” do not depend on previous values, but of course I should use the right terminology.

I think this is again a case of misuse of terminology on my part: by generating process I wasn’t thinking about the AR but about anything that generates random samples (i.e., in my mind: “same generating process = same distribution”). I’ll pay more attention to this from now on.

This looks very interesting! I’m about to go on vacation so I don’t have time to look at it in more detail at the moment but I certainly will when I come back. Just recently I was wondering about the differences between ETS and AR(1) and I think your discussion of this example will be very helpful to put my mind in order. Also I love your work on state space models, it was definitely on my path but I figured I would start with scan first since it seemed simpler and potentially more flexible (?), but I’ve struggled quite a bit so I’m ready to give that alternative a go.

Thank you very much for your effort and your patience, it’s been really helpful!


Although not my main point, it’s typical in an autoregressive model like this to give the initial point a much broader prior scale than the innovations.

The linear AR(1) model is often written regression style as

\tau_t = \tau_{t-1} + \epsilon_t, \quad \epsilon_t \sim \text{normal}(0, \sigma).

But if you try to do this with a lognormal,

\tau_t = \tau_{t-1} + \epsilon_t, \quad \epsilon_t \sim \text{lognormal}(0, \sigma),

there’s a problem. Because \epsilon_t > 0, the sequence can only grow monotonically.

Instead, you need to think of the model purely generatively, where the linear case is

\tau_t \sim \text{normal}(\tau_{t-1}, \sigma),

and the lognormal case is

\tau_t \sim \text{lognormal}(\log \tau_{t-1}, \sigma).

I’ve never seen an approach like yours (though my experience is hardly comprehensive). Did you devise this yourself or is the approach of taking a fixed chunk out of \tau_t to allow up or down moves done elsewhere?

I’m back!

On ETS VS AR

Thanks for recommending the example notebooks, the interactive plots in the ETS one were particularly helpful. (I have some comments on the finance discussion in the intro, although I also understand the aim is more to motivate the demonstration than to discuss finance itself; if you are interested I could expand on this.)

Staying with the ETS example, identifying variables as follows: P_t \rightarrow r_t, \; l_t \rightarrow \tau_t, the problem looks pretty much the same (in this light, specifying the initial level l_0 is equivalent to specifying the initial traffic, as I had to do anyway). I think the point of contention would be the way innovations are modelled: as I see it, the “Nvidia model” opens the door to negative prices, with no practical implications given the distance of the level from 0, but it’s something I’d rather avoid in my case.

Looking at Rob Hyndman’s presentation of simple exponential smoothing, it also looks like what I had in mind, the only difference being that I use \rho = 1 - \alpha because I wanted the parameter to be interpreted as the degree of “memory” in traffic (as opposed to the degree of “forgetfulness”, which seemed less natural):

(See Rob Hyndman’s Section 8.1 on simple exponential smoothing.)

Following Bob’s comment I’m not sure how well my model would fit into an AR(1). According to Rob Hyndman’s Table 9.4 an analogy can be drawn between ETS(A,N,N) and ARIMA(0,1,1), but here they say an AR(1) could work too (something which I haven’t understood yet).

On terminology

After reviewing some definitions I think there’s been a recurrent theme of me using certain terms very broadly while you were using them in very specific senses. Had I been more aware of these specific meanings I would have conveyed what I meant more accurately; on the bright side, it’s another instance of learning (I think it’s also very hard to communicate effectively in written form without very tight context).
I leave below a summary of my understanding of our misunderstandings, in case anyone wants to point out corrections or it helps clarify things in any way:

  • exponential smoothing: in the context of time series analysis this is often linked to ETS. More generally, the formula \tau_t = \rho \tau_{t-1} + (1-\rho)r_t is an exponential smoothing, just because the output \tau_t is a smoothing of r_t which is done in an exponential fashion (older lags receive exponentially lower weights).
  • autoregressive: literally “self-regression”, in its broader meaning implies any process where future values depend on past ones (example). In its narrower meaning it implies AR(n) models where the regression form is actually specified.
  • convergence: in the context of time series modelling, convergence is often understood as the value that predictions tend towards. More generally, it can mean any kind of “tending to”. Traffic, being smooth, cannot change drastically from one instant to the next: if traffic is light under good weather conditions and the weather suddenly worsens dramatically, traffic will “tend towards” heavy, and this convergence reference is represented by r_t (of which the actual traffic \tau is nothing but an exponential smoothing).
  • generating process: in the context of AR discussion the AR itself is the generating process, however there’s also the true generating process that created the data that’s being modelled with an AR (e.g., I can use some kind of simulator to generate data and then model it with an AR process).

With respect to innovations, I was using the term for the source of newness in the model, hence identifying it with r_t, which I don’t think is right. According to Wikipedia an innovation actually is: the difference between the observed value of a variable at time t and the optimal forecast of that value based on information available prior to time t. I have to say that I find the naming choice there a little bit confusing, since according to the definition above innovations would correspond to the pm.Normal.dist(sigma=sigma) term as opposed to the non-initial steps of the AR process. Perhaps I went a little too off topic here, so feel free to ignore this if it makes sense to you.

On Bob’s comment

Thanks for your input Bob. If with innovations you refer to the Wikipedia definition above I would agree that it generally makes more sense to give them a narrower prior scale compared to the initial point. Since previously I was misusing the term innovations that may be the source of confusion.

Yes, having only positive innovations (\epsilon_t) is problematic, and it begs the question: “if you know the innovations are not centered around 0, why not include that in the model?” That’s why I shouldn’t have called r_t an innovation; instead, it’s the reference towards which the traffic tends at any given moment. That is, traffic now (\tau_t) is an average of previous traffic (\tau_{t-1}) and reference traffic (r_t). This is what I understand as exponential smoothing (but given Jesse’s point I may have to give this another thought) and it is, from this viewpoint, a standard model. I’m not sure how well it fits into the AR(1) framework (see my commentary on ETS vs AR and on the two definitions of autoregressive above), but I tend to think somewhat poorly, since at best the configuration looks quite odd.

Edit

I was going to edit my initial post to add a summary of the solution so that it’s more easily accessible but I see it’s not possible, perhaps it’s not necessary anyway.

This is why one always works with returns or log prices. To remedy the situation Bob identified in your case, you’d want to start with a multiplicative model, x_t = x_{t-1}^\rho e^{z_t} where z_t \sim N(0, \sigma), then log both sides to get back to the base case. The same logic works in the ETS case, where one can convert multiplicative errors back to additive by simply modeling logs, then exponentiating the outputs.
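
Taking logs of both sides makes the equivalence explicit:

\begin{align} x_t &= x_{t-1}^{\rho}\, e^{z_t}, \quad z_t \sim N(0, \sigma) \\ \log x_t &= \rho \log x_{t-1} + z_t \end{align}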

That’s what it says in the linked example. “Sample the innovations” refers to the sequence \{\epsilon_1, \epsilon_2, \dots, \epsilon_T\} that generates the time series dynamics. collect_default_updates is specifically relevant to these, because they are being created inside a scan Op, so the random number generators need to be updated inside that loop.

Re: convergence, this is still a somewhat ill-defined term (or at least overloaded, since its meaning varies from context to context). I would prefer to talk about stationarity. If your traffic model is stationary around some reference point with temporary autocorrelated deviations, you’re probably looking at an MA model more than an AR.

Anyway it sounds like you found a solution, which is great! Please do post it if you want to share it, it benefits the community to have lots of examples from different domains floating around.

Sorry, I didn’t really specify what I was referring to; I didn’t mean the text, which is fine, but the naming of the variable ar_innov in the code, which refers to the non-initial steps. I guess you can always make the case that ar_innov means the part of the AR that includes the innovations, so it’s not a big deal either.

    ar_innov, _ = pytensor.scan(
        fn=ar_step,
        outputs_info=[{"initial": ar_init, "taps": range(-lags, 0)}],
        non_sequences=[rho, sigma],
        n_steps=timeseries_length - lags,
        strict=True,
    )

Sure I’ll keep that in mind! Right now I’m still working on a diluted version of what I want to do but when I have the complete model (which is still for a proof of concept / toy problem) I’ll consider uploading.

Thanks again!

Yes I agree this variable is not correctly named. You should open a PR to change it :slight_smile:
