Infinite expansions à la Taylor

Hi again,

While I am here, I have another question (although the minimal example I used to illustrate my question kind of gave me the answer already).
In my applications, the theoretical models are almost always described as some kind of expansion: Taylor expansions, spectral decompositions, or whatever. In any case, we physicists love expansions, which we then truncate at whatever order we consider sufficient. In practice this choice of order is often data-driven, and it triggers endless discussions about the bias-variance tradeoff, which often degenerates into an ill-defined one: bias versus my-fit-crashed. The state of the art in my discipline then consists in fitting several models of different orders and performing a pseudo-Bayesian model average.
In theory, Bayesian inference provides a very elegant way to deal with that: I can build a (single) very high-order model, of much higher order than an MLE approach would allow, and what should happen is simply that, for the highest-order terms on which I don't have any signal, the posterior is essentially identical to the prior. In practice, however, when I add more 'useless' terms to my models, they tend to slow down and then destabilise the sampling quite quickly. This has been a bit of a surprise to me, because I'm very familiar with the HMC algorithm in another context, and I know it can do much better than my typical O(10)-parameter fits: I use it every day to sample Lagrangians with billions of variables and plenty of flat directions.

Do you know standard-ish methods and tricks to deal with that?

I am adding below a minimal example with mock data, which you can just copy-paste and run on your side.
Consider that the objective of this fit is to extrapolate the data to $x=0$, so $a_0$ is what we want (its true value is 0.42), while all the other Taylor coefficients (whose true values alternate between $+1$ and $-1$) are just nuisance parameters.
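(For reference, those true values follow from expanding the mock truth used below: $\frac{x}{1+x} = x - x^2 + x^3 - \dots = \sum_{i=1}^{\infty} (-1)^{i+1} x^i$ for $|x|<1$, so $a_0 = 0.42$ and $|a_i| = 1$ for $i \ge 1$.)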

import sys
print(sys.version)
import numpy as np
import arviz as az
import matplotlib.pyplot as plt
import pymc as pm
import pytensor.tensor as pt
from IPython.display import display  # used below to render the az.summary tables
print(f"Running on PyMC v{pm.__version__}")

3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0]
Running on PyMC v5.5.0


ncfg = 100  # number of mock measurements at each value of x
nX = 20
X = np.linspace(0.1, 2.0, nX)
true_intercept = 0.42
true_sigma = 0.4
true_Y = true_intercept + X/(1+X)  # Taylor coeffs will all be of magnitude 1
rng = np.random.RandomState(0)
data_Y = rng.normal(loc=true_Y, scale=true_sigma*np.ones(nX), size=(ncfg, nX))
print(data_Y.shape)

def build_taylor_model(order):
    with pm.Model() as taylor_model:
        # Deliberately weak prior on every coefficient, independent of its order
        a = pm.Normal('a', mu=0, sigma=10, shape=order+1)
        # monomials[i, j] = X[j]**i, so mu[j] = sum_i a_i X[j]**i
        monomials = X[None, :]**np.arange(0, order+1)[:, None]
        mu = pt.dot(a, monomials)
        sigma = pm.Gamma('sigma', alpha=1, beta=1)
        Yobs = pm.Normal('Yobs', mu=mu, sigma=sigma, observed=data_Y)
    return taylor_model

def build_sample_and_plot(order):
    taylor_model = build_taylor_model(order)
    with taylor_model:
        taylor_trace = pm.sample()
        display(az.summary(taylor_trace, round_to=6))
        # Reference lines at the true intercept and at magnitude 1 for the other coefficients
        lines = (('a', {}, true_intercept), ('a', {}, 1.0))
        az.plot_trace(taylor_trace, lines=lines)
        plt.subplots_adjust(hspace=0.8)
        plt.show()

build_sample_and_plot(3)

[trace plots for the order-3 fit]

build_sample_and_plot(6)

[trace plots for the order-6 fit]

build_sample_and_plot(10)

[trace plots for the order-10 fit]

In this example things actually behave very nicely compared to what I typically see in real-life applications, but it still shows some issues.
What happens here is that:

  1. At low order, our estimate of $a_0$ is biased, as expected; at higher order this bias disappears.
  2. The error on $a_0$ increases as the model gets more and more freedom (a quick way to quantify this is sketched just after this list). So infinite order (which in this simple case could probably be handled analytically) is not an option. Of course the error would grow more slowly if I had put a more informative prior.
  3. The high-order terms always keep a strong bias, in particular the highest one.
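
To make point 2 concrete, here is a quick loop you can append to the script above (the coordinate name a_dim_0 is just PyMC's default for the vector a) that tracks the posterior mean and error of $a_0$ as the order grows:

# Track how the uncertainty on a_0 grows with the order of the expansion.
# 'a_dim_0' is the default coordinate name PyMC assigns to the vector 'a'.
for order in (3, 6, 10):
    with build_taylor_model(order):
        idata = pm.sample(random_seed=0, progressbar=False)
    a0 = idata.posterior['a'].sel(a_dim_0=0)
    print(f"order {order:2d}: a0 = {float(a0.mean()):.4f} +/- {float(a0.std()):.4f}")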

Now, something that looks like an important remark: if I replace my X with

X = np.linspace(0.1,0.9,nX)

then the sampling works much better. Naively, however, you could expect it to be worse, because the shorter range gives less lever arm to extract the coefficients.
My understanding is that this is due to the fact that I built my priors such that they only enforce a radius of convergence of (at least) 1: if every $a_i$ took the extreme value 10, the function would be $10/(1-x)$. So if all the $X$ of my data lie inside $\mathcal{B}(0,1)$ I should observe some kind of convergence, but if my $X$ contains data outside this ball then my prior includes functions that diverge on the data domain. Does this make sense to you? Is it a well-known thing that priors for a Taylor fit have to be something like $a_i \sim \mathcal{N}(0, (\max X)^{-i})$?
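
In case it helps the discussion, here is a minimal sketch of that scaled-prior idea (the function name and the scale argument are just my placeholders): each $a_i$ gets a prior width $\propto (\max X)^{-i}$, so that $a_i X^i$ stays of order scale over the whole data range. If I'm not mistaken, this is equivalent to fitting in the rescaled variable $x/\max X$.

# Taylor model with order-dependent prior widths: sd(a_i) = scale * (max X)**(-i),
# so the contribution a_i * X**i is at most O(scale) everywhere in the data range.
def build_scaled_taylor_model(order, scale=10.0):
    xmax = X.max()
    powers = np.arange(order + 1)
    with pm.Model() as taylor_model:
        a = pm.Normal('a', mu=0, sigma=scale * xmax**(-powers), shape=order + 1)
        monomials = X[None, :] ** powers[:, None]
        mu = pt.dot(a, monomials)
        sigma = pm.Gamma('sigma', alpha=1, beta=1)
        pm.Normal('Yobs', mu=mu, sigma=sigma, observed=data_Y)
    return taylor_model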

Here are the results of the order-10 inference with data in [0.1, 0.9]. You can see that for $i>2$ the posterior on $a_i$ is fully dominated by the prior, as naively expected, the $\hat{R}$ values are good, and all parameters are compatible with their true values. Also, what I am not showing is that the error bar on $a_0$ at order 10 is not really worse than at order 4: once we reach a sufficiently high order it saturates.
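
One way to quantify that "posterior just returns the prior" statement would be something like the following sketch (the prior sd of 10 is hard-coded to match the flat priors above):

# Ratio of posterior to prior sd for each a_i: values near 1 mean the data carry
# essentially no information on that coefficient and the prior dominates.
with build_taylor_model(10):
    idata = pm.sample(random_seed=0)
post_sd = idata.posterior['a'].std(dim=('chain', 'draw')).values
print(post_sd / 10.0)  # prior sd is 10 for every coefficient in this model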

Finally, I tried to replace my truth with

true_Y = true_intercept + 2*X/(1+2*X) # A true function with a smaller radius of convergence

in order to check whether the radius of convergence of the true function matters, or only the $i$-dependence of the priors. Things still work nicely: the radius of convergence of the Taylor expansion of the true function doesn't seem to matter, which is good news since we do not know the true function in advance.