Pardon the long response time; I have been very over-subscribed. Thank you for your response, it means a lot to me that you'd take time out of your day to throw me a bone.
At the moment I am working entirely in simulation, as the physical device is under development. That’s why I was plugging the simulation into the observed quantity. I wasn’t clear on this critical point, mea culpa. In my defense it’s hard to be clear when you also have to be vague!
I got approval to post this though:
As you look at that picture, please realize I am not using that plate notation correctly, because I don’t actually know what I’m doing (I just need to do it anyways). Also, never buy one of those fancy glass blackboards. They look slick when they’re clean, but they suck to actually use.
The stuff on the left of the dashed line happens at “low rate”, which for me means about 10 Hz. The stuff on the right must happen at 200 Hz. Because of those constraints I suspect I am moving outside of what PyMC can help me with, but that’s fine: if I need to reimplement everything from scratch in C++ that is what’ll happen. That might be simpler than fighting with these pytensor errors…
Ultimately, I want to know everything I can about h, and that hinges on c. Well, that and whether g and h sufficiently reflect some poorly understood physics. And for once it's poorly understood by more people than just me.

a and b are both 10-dimensional normal distributions with known mus and sigmas; diagonal covariance. (I guess that for purposes of this plot I could have said it was a single 20-dimensional Gaussian. Oh well.)
f, g, and h are all deterministic nonlinear functions. f is fast to compute, but g and h are a little expensive, say 1,000 FLOPs per observation. g and h model some interesting physics which is very poorly understood. I have derivatives for everything. (Actually, h is more complicated, but more on that later.)
I make 100 observations of d and e, which are both 3-dimensional vectors, corrupted by measurement error I assume to be Gaussian with known covariance. However, they will both be corrupted by gross outliers at an unknown rate (should be very low for d, and below 10% for e, God willing).
It is very difficult to make statements about c, but I do have a good initialization, and it should be somewhat slowly varying in time.
Currently, operating only in simulation and under a dozen dubious assumptions, I am using Levenberg-Marquardt with a pseudo-Huber loss function to recover the MAP estimate of c, then making the Laplace approximation to get a Gaussian whose covariance is the inverse of the negative Hessian of the log probability; why wasn't I taught this in university? This is a lot better than nothing, but I suspect as the project progresses we will find c is much nastier than a Gaussian. I'm having nightmares of heteroskedastic bananas.
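Concretely, my current pipeline looks like this toy version. It's only a sketch: SciPy's 'trf' with the pseudo-Huber-like 'soft_l1' loss stands in for my own LM + pseudo-Huber (SciPy's method='lm' only supports a plain quadratic loss), and the model and data are invented.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy stand-in for the real problem, to show the shape of the procedure.
rng = np.random.default_rng(1)
c_true = np.array([1.0, -0.5, 0.3])
X = rng.normal(size=(100, 3))
y = np.tanh(X @ c_true) + 0.05 * rng.normal(size=100)
y[::17] += 2.0                          # a few gross outliers

def residuals(c):
    return (y - np.tanh(X @ c)) / 0.05  # whitened residuals

fit = least_squares(residuals, x0=np.zeros(3), loss="soft_l1", f_scale=3.0)

# Laplace approximation: SciPy returns a modified Jacobian such that J^T J
# is a Gauss-Newton approximation of the Hessian of the cost at the optimum,
# so the covariance is its inverse.
J = fit.jac
cov = np.linalg.inv(J.T @ J)
print(fit.x, np.sqrt(np.diag(cov)))
```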
(Should I still call it a MAP estimate even though I don't have a prior for c, just a and b? Or is this just a maximum likelihood estimate?)
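To make that question concrete, the objective I'm maximizing is

$$(\hat a, \hat b, \hat c) = \arg\max_{a,b,c}\;\log p(d, e \mid a, b, c) + \log p(a) + \log p(b),$$

i.e. there are prior terms for a and b but only a likelihood term for c (equivalently, a flat improper prior on c).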
Currently I’m only checking the MAP values of a
and b
to make sure they’re reasonable. Much more than 2 std deviations away probably means I need to report partial system failure and continue to use the previous estimate of c
. Other than that, they’re nuisance parameters, along with s
.
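In code, that sanity check is nothing more than this (hypothetical helper; the 2-sigma threshold is just my current guess):

```python
import numpy as np

def nuisance_ok(theta_map, mu, sigma, n_std=2.0):
    """True if every component of the MAP estimate lies within n_std
    known-prior standard deviations of its known prior mean."""
    return np.all(np.abs(theta_map - mu) / sigma <= n_std)

# if not (nuisance_ok(a_map, mu_a, sigma_a) and nuisance_ok(b_map, mu_b, sigma_b)):
#     report partial system failure and keep the previous estimate of c
```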
That was the left side of the graph; now for the high-rate stuff.
t is time (observed exactly). After I have c, I have to temporally project it to c_t. Exactly how I do this is up in the air; maybe I can just set c_t := c, maybe I just scale the variance with time, maybe I can even temporally predict it in a more interesting way (I have some evidence this is possible, but I am very data-starved right now).
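The "scale the variance with time" option would be something like a random-walk projection (a sketch; the diffusion rate Q is entirely made up and would need tuning):

```python
import numpy as np

def project_c(c_mean, c_cov, dt, Q):
    """Carry the mean of c forward unchanged and let its covariance grow
    linearly with the elapsed time, as if c followed a random walk."""
    return c_mean, c_cov + dt * Q

# c_t_mean, c_t_cov = project_c(c_mean, c_cov, t - t_of_fit, Q=1e-4 * np.eye(3))
```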
For the most part, only my observed variables (thick white border) are time-varying: the mus and sigmas of a, b, c, and s are all constant and known, as are the sigmas on d and e (of the inliers, at least).
x is a unit 3-dimensional vector that pops into existence at 200 Hz. I know it exactly and have no other information about it.
h is a little tricky. I drew it as a deterministic variable, but I fear that was misleading. I think this part in particular is not very Bayesian, but it's demanded of me. The evaluation of h begins with sampling a uniform random point on a unit disk (this is not in the diagram I provided). Even if I knew c_t, s, and x exactly, with infinite precision, I would still have to evaluate h many times, each time getting a different unit 3-vector (a direction). I expect these to be pretty tightly clustered in most scenarios, but sometimes smeared out like a banana or teardrop. I would like to know everything I can about what distribution gets generated when I do that.
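To show what I mean by "evaluating h many times", here's the skeleton. The body of h is a throwaway stand-in; only the disk sampling and the unit-vector output match my real setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_unit_disk(n, rng):
    # Uniform on the unit disk: sqrt on the radius avoids center clustering.
    r = np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

def h(c_t, s, x, disk_pt):
    """Stand-in for the real h: deflect x inside its tangent plane by the
    random disk point and renormalize, yielding a random unit direction."""
    axis = np.eye(3)[np.argmin(np.abs(x))]    # axis least aligned with x
    u = np.cross(x, axis); u /= np.linalg.norm(u)
    w = np.cross(x, u)                        # (u, w): tangent basis at x
    v = x + 0.2 * (disk_pt[0] * u + disk_pt[1] * w)
    return v / np.linalg.norm(v)

x = np.array([0.3, -0.2, 0.93]); x /= np.linalg.norm(x)
disk = sample_unit_disk(500, rng)
dirs = np.array([h(None, None, x, p) for p in disk])  # 500 unit 3-vectors
```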
Ultimately these samples don't even measure what I really want; they just approximate it. There is only one true direction, but there is no way of knowing which, even in principle, except that it is likely to be near the mode of the distribution I get from sampling h (assuming my various assumptions hold). It's not obvious to me how to incorporate this knowledge, so I might just neglect it. (Maybe I should seek an estimate that maximizes the probability mass within some fixed bound?)
I’m currently just computing the geometric median of those samples on-manifold, which seems
very reasonable. I’m currently providing something like a confidence bound by projecting those points to the tangent space around the geometric median (taking care to preserve geodesic distance from that tangent point) and fitting a Gaussian to those transformed points. I don’t think my real data will be a Gaussian, but I’m not sure if the downstream systems can handle more than a Gaussian “confidence interval”. Maybe one of the Fisher Bingham distributions would be better, but I haven’t had time to look at that in depth.
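For concreteness, the tangent-space projection I'm describing is essentially the Riemannian log map on the sphere at the geometric median (a sketch; the geometric median itself is computed elsewhere):

```python
import numpy as np

def log_map_sphere(m, dirs):
    """Project unit vectors `dirs` to 2-D tangent coordinates at base point
    `m`, preserving geodesic distance from m (the log map on S^2)."""
    axis = np.eye(3)[np.argmin(np.abs(m))]
    u = np.cross(m, axis); u /= np.linalg.norm(u)
    w = np.cross(m, u)                        # (u, w): tangent basis at m
    cos_t = np.clip(dirs @ m, -1.0, 1.0)
    theta = np.arccos(cos_t)                  # geodesic distance from m
    perp = dirs - cos_t[:, None] * m          # component orthogonal to m
    norm = np.linalg.norm(perp, axis=1)
    norm[norm == 0.0] = 1.0                   # samples exactly at m -> origin
    tang = (theta / norm)[:, None] * perp
    return np.column_stack([tang @ u, tang @ w])

# coords = log_map_sphere(median, dirs)
# cov2d = np.cov(coords.T)   # the Gaussian "confidence" handed downstream
```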
That was the high-rate stuff, now for my concrete questions:
- I probably can't think to ask every question I need answered, so please tell me what I need to know. Actually, where is a better place to ask for help?
- Is it possible that PyMC can be made to run fast enough for my purposes? I'm assuming I'd have to use ADVI, as I need to adhere to latency guarantees.
- I understand that the Laplace approximation (and variational inference in general) underestimates variance, so I'd like to understand how I can improve my estimate of the variance of c.
- In my research I came across black-box variational inference (BBVI), especially a 2023 paper by Giordano, Ingram, and Broderick. Am I correct in thinking I could use this to fit a Fisher–Bingham, Kent, etc. distribution to all my samples of h? Or to easily try out different distributions for c? It seems like this might actually be tractable inside my compute budget; am I missing something?
- Solely for my curiosity, why does that paper have only 9 citations? And it's not like the authors are nobodies: Broderick is a tenured professor at MIT. Even Ranganath, Gerrish, and Blei's 11-year-old paper on BBVI only has 106 citations, which is wild to me. Obviously I'm an outsider to statistics, but these techniques seem extremely practical… what gives?
- How should I handle gross outliers in d and e without harming my estimate of c? Currently I'm using the pseudo-Huber loss, but that means that gross outliers still change the position of the MAP. I'm thinking that I should use that to get close to the MAP, then mark some of the observations as outliers, then continue optimizing only on the inliers, essentially truncated least squares (see the sketch after this list). I've seen the mean-field approach, but it seems a little abstract and less efficient.
- Is there anything I can say in a principled way about what form c or h will take, say by looking at f and g, without just getting some real data and looking at it?
- Is the approach of Levenberg-Marquardt followed by the Laplace approximation sensible? Is there a better way of doing that, assuming c is sufficiently well approximated by a Gaussian?
- To rephrase question 3, if I wanted to accommodate a c that was, say, a multivariate skew Gaussian, or a leptokurtic, heteroskedastic banana, what do I do? Other than scream in fright.
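Here's the truncated-least-squares idea from the outlier question above, as a sketch; the residual function, the per-observation reshape, and the cutoff are all placeholders.

```python
import numpy as np
from scipy.optimize import least_squares

def trimmed_fit(residuals, c0, cutoff=3.0):
    """Robust loss to get near the optimum, then freeze an inlier set and
    refine with a plain quadratic loss on the inliers alone."""
    rough = least_squares(residuals, c0, loss="soft_l1", f_scale=cutoff)
    r = residuals(rough.x).reshape(-1, 3)           # per-observation residuals
    inliers = np.linalg.norm(r, axis=1) < cutoff    # whitened, so ~sigma units
    def inlier_residuals(c):
        return residuals(c).reshape(-1, 3)[inliers].ravel()
    return least_squares(inlier_residuals, rough.x)
```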
Again, thank you. Really. I wouldn't be making these long posts if it didn't actually matter that I get this right, and it matters to a lot more people than just me. I'm going to look around for better venues to ask for guidance. I really wish I had been taught this stuff in university! I took I-don't-know-how-many classes on numerical methods and still I'm a babe in the woods.