Inferring covariance of measurements from multiple observations

Summary of Problem:

I’m looking to infer the covariance budget of a spectrograph, which has been used to measure spectra of astronomical objects \{g_i\}. I have potentially up to 3 measurements of each object (several-thousand objects in total), and I denote each such observation as g_i^j. Each spectrum is taken on some wavelength grid \lambda_i^j (in principle, this can differ slightly between reobservations of the same galaxy, but in most cases, it’s the same), and the measured quantity is the flux-density f_i^j, a vector of length ~4500. Because of imperfect calibration, two observations f_i^j and f_i^{j'} of the same object may not be consistent up to the stated uncertainties v_i^j, v_i^{j'} in each measurement. Put another way, there’s a \lambda-dependent covariance in df^{j,j'} = f_i^j - f_i^{j'}.

Furthermore, we have reason to believe that df^{j,j'} varies smoothly with \lambda. There is also likely a zeropoint component (i.e., the absolute calibration of the instrument is off, so \int_\lambda df is not necessarily zero). I would like to infer the covariance K(\lambda,\lambda') that determines df.

First attempt:

I tried a quick mockup of the problem just using one object, with two total observations, and with only ~10 values of \lambda on the same grid. This mockup represents the difference between two vectors as a realization of a multivariate normal, with mean of zero, and covariance equal to the sum of the covariance of both measurements (which include both the stated measurement uncertainty and the zeropoint offset).

with pm.Model() as model:
    # correlation matrix, eta set to 1.1 (close to uniform) for simplicity
    C_triu = pm.LKJCorr('C_triu', eta=1.1, n=nl)

    C = pm.Deterministic(
        'C', T.fill_diagonal(
            C_triu[np.zeros((nl, nl), dtype=np.int64)], 1.))
    # sd of each
    sigma0 = pm.math.sqrt(v0)
    sigma1 = pm.math.sqrt(v1)
    zeropoint = pm.Exponential('z', lam=0.5)
    with warnings.catch_warnings():
        sigma_diag0 = T.diag(sigma0 + zeropoint)
        sigma_diag1 = T.diag(sigma1 + zeropoint)

    # full covariance of each
    cov0 = pm.Deterministic(
        'cov0', T.nlinalg.matrix_dot(sigma_diag0, C, sigma_diag0))
    cov1 = pm.Deterministic(
        'cov1', T.nlinalg.matrix_dot(sigma_diag1, C, sigma_diag1))

    # model mean of difference as zero    
    df = pm.MvNormal('df', mu=T.zeros(nl), cov=cov0 + cov1,
                     observed=f0 - f1)

n_samples = 2000

with model:
    trace = pm.sample(n_samples, tune=1000, chains=4, init='adapt_diag',
    posterior_predictive = pm.sample_posterior_predictive(trace)

Before I try to scale this up to thousands of dimensions and thousands of objects (good luck, future self), I am wondering if there is a more scalable way to do this. For instance, is it possible/preferable to make this into a GP?

It looks like the shape of your observations, F, is something like (5000, 4500, 3) for (object, wavelength, observation). I would seek to consider the impact of the three dimensions on the flux-density measurement. Assuming there are no interactions, the vectorized (stacked) observation vector \vec f (of length 5000 x 4500 x 3) could be modeled as:

\vec f \sim N(\mu_1(g)+\mu_2(\lambda)+\mu_3, K_1(g,g)\otimes K_2(\lambda,\lambda) \otimes \Sigma_3)

My reading of your model is that you set parameters \mu_1 = \mu_2 = \mu_3 = 0; K_1 = \Sigma_3 = \mathrm{identity}). Is this correct?

If so, then your model looks a lot like a Gaussian process with covariance function K_2

F_{i, \cdot, k} \sim \mathbf{GP}(\lambda)

I would like to infer the covariance K(\lambda, \lambda')

There are kind of two approaches to this. One way is to compute the model likelihoods for several choices of kernels, and choose the one which maximizes the BIC. Another way is to define a mixture kernel

K(\lambda, \lambda) = \sum_{i=1}^k w_i^2 K_i(\lambda, \lambda; \theta_i)

and place priors over the weights w_i^2 and kernel parameters \theta_i.

I would recommend a GP in this case as it keeps the number of parameters relatively low (compared to the full covariance); and because it appears that the wavelength grid can differ between observations, and the GP makes dealing with this trivial – the grid need not even be the same size across observations.