Engineer needs help modeling uncertainty of physical system measurement (TypeError: Variables that depend on other nodes cannot be used for observed data.)

Apologies for being vague, and for my total lack of stats knowledge. :slight_smile: I will try to avoid accidental misuse of stats terms by avoiding them as much as possible. :slight_smile: And if anything is unclear I will gladly clarify as much as I am able.

I have a somewhat complicated physical system I need to know the internal state of. I cannot make direct measurements of this state, but I can make indirect, approximate measurements of some function of that state. And of course the initial calibration of my system is approximate.

So far I have used non-probabilistic methods (Levenberg–Marquardt) to fit a model of the system to my measurements. This works well, but I need more than a best fit. I would like a distribution over my belief about the internal state, given my measurements.

(My actual problem is more complicated than what I’m describing here, but I think if I can solve this I will be able to figure the rest out on my own.)

I model my system as a nonlinear deterministic function of both its internal state and my initial calibration. For now I can assume my model is correct and exactly captures the physical system. The internal state has about 10 dimensions, and the initial calibration has about 20 dimensions. I can provide reasonable estimates of my calibration error, and I know a feasible region for the internal state. I think I should be able to get by with Gaussians or other well behaved distributions.

I can take upwards of a few hundred different measurements, but there is a catch: each measurement is of a slightly different function of my model/system. I cannot repeat a measurement (except in simulation). All measurements are corrupted by noise, and I can provide a prior on this noise.

Essentially, in pseudocode I have:

# internal_state is a 10 dimensional random variable
# calibration is a 20 dimensional random variable
# functions is a list of known nonlinear deterministic functions
# sigma is some known constant

measurements = []
for func in functions:
    measurements.append(func(internal_state, calibration) + N(0, sigma))

# given our measurements, what is the distribution of our belief of what internal_state is?

I have tried to look at an extremely simplified toy variant of my problem (and I’m probably going about it wrong):

import pymc as pm

mu = [2., 3.]
with pm.Model() as model:
    internal_state = pm.Normal("internal_state", mu=mu[0])
    calibration = pm.Normal("calibration", mu=mu[1])

    def toy_function(a, b):
        # In reality I would have many different but similar functions
        return (1.1 + a) * (0.9 + b) + a

    measurement_error = pm.Normal("measurement_error", sigma=0.01)
    measurement = pm.Normal("measurement",
                            mu=toy_function(mu[0], mu[1]),
                            observed=toy_function(internal_state, calibration) + measurement_error)
    idata = pm.sample()
    # Given measurement, what do we know about internal_state?

However this gives me an error (PyMC v5.21.1):

> TypeError: Variables that depend on other nodes cannot be used for observed data.

If I take this error message at face value it feels like a show-stopper, but the restriction itself seems odd to me. I looked online for help with this error, but the discussions I found were beyond my depth.

I feel like this should be a solvable problem, as I have a reasonable guess at what all the sources of error are, but I do not know how to proceed. The resources online solve a good many problems that sure seem much more complicated than this toy task, but they don't seem engineering/physics-focused, except for some regression examples that did not leave me with a good idea of how to tackle this.

I assume I am simply struggling with a knowledge gap, but if there is another tool or approach that is better suited for this please let me know.

Thank you!

Your thinking about the problem seems to be somewhat backwards. You never observe random variables; you observe draws from the distributions those RVs represent, and then try to learn something about the latent random variables you posit in your model based on those draws. It fundamentally doesn't make sense to write something like observed = f(RV), because it reverses what is "data" and what is "model".

A PyMC model describes a generative process. What you want to write down is the way in which your measurements come about. The measurements, on the other hand, are totally fixed, and everything else in the model is going to bend to better explain how they could have arisen. This is a useful organizing principle for me when tackling a new problem, because I sometimes have to rearrange how I think about a problem.

In your case, I would start by trying to recast the whole problem in terms like measurement = f(???) + error. If error ~ N(0, sigma), then this is equivalent to measurement ~ N(f(???), sigma), due to how normal distributions behave under affine transformations. Now you have an observation distribution for your model, measurement_hat = pm.Normal('measurement_hat', mu=mu, sigma=0.01, observed=measurements); you just have to figure out what goes in the mean function. I guess this is toy_function, but I don't really understand how it is being used.
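To make that rearrangement concrete, here is a minimal sketch, guessing that toy_function is the mean function and that your simulated measurement enters the model as plain numbers (everything here other than toy_function is made up for illustration):

import numpy as np
import pymc as pm

def toy_function(a, b):
    return (1.1 + a) * (0.9 + b) + a

# Simulate the data OUTSIDE the model: the observed values are
# plain numbers, not random variables.
rng = np.random.default_rng(42)
true_state, true_calibration = 2.0, 3.0
observed = toy_function(true_state, true_calibration) + rng.normal(0.0, 0.01)

with pm.Model() as model:
    # Priors over the unknowns we want to learn about
    internal_state = pm.Normal("internal_state", mu=2.0, sigma=1.0)
    calibration = pm.Normal("calibration", mu=3.0, sigma=1.0)

    # Likelihood: measurement ~ Normal(f(state, calibration), sigma)
    measurement_hat = pm.Normal(
        "measurement_hat",
        mu=toy_function(internal_state, calibration),
        sigma=0.01,
        observed=observed,
    )
    idata = pm.sample()

After sampling, the posterior over internal_state in idata is exactly the "distribution of belief given the measurements" you asked for.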

I also don’t really understand your setup with mu. Is that what is actually observed? Are internal_state and calibration unknown, or just themselves noisy? Where is the actual measured quantity in the model?


Pardon the long response time; I have been very over-subscribed. Thank you for your response, it means a lot to me that you'd take time out of your day to throw me a bone.

At the moment I am working entirely in simulation, as the physical device is under development. That's why I was plugging the simulation into the observed quantity. I wasn't clear on this critical point; mea culpa. In my defense, it's hard to be clear when you also have to be vague!
I got approval to post this though:

As you look at that picture, please realize I am not using that plate notation correctly, because I don’t actually know what I’m doing (I just need to do it anyways). Also, never buy one of those fancy glass blackboards. They look slick when they’re clean, but they suck to actually use.

The stuff on the left of the dashed line happens at “low rate”, which for me means about 10 Hz. The stuff on the right must happen at 200 Hz. Because of those constraints I suspect I am moving outside of what PyMC can help me with, but that’s fine: if I need to reimplement everything from scratch in C++ that is what’ll happen. That might be simpler than fighting with these pytensor errors…

Ultimately, I want to know everything I can about h, and that hinges on c. Well, that and whether g and h sufficiently reflect some poorly understood physics. And for once it's poorly understood by more people than just me.

a and b are both 10-dimensional normal distributions, with known mus and sigmas; diagonal covariance. (I guess that for purposes of this plot I could have said it was a single 20-dimensional Gaussian. Oh well.)

f, g, and h are all deterministic nonlinear functions. f is fast to compute, but g and h are a little expensive, say 1,000 FLOPs per observation. g and h model some interesting physics which is very poorly understood. I have derivatives for everything. (Actually, h is more complicated, but more on that later.)

I make 100 observations of d and e, which are both 3-dimensional vectors, corrupted by measurement error I assume to be Gaussian with known covariance. However, they will both be corrupted by gross outliers at an unknown rate (which should be very low for d, and below 10% for e, God willing).

It is very difficult to make statements about c, but I do have a good initialization, and it should be somewhat slowly varying in time.

Currently, operating only in simulation and under a dozen dubious assumptions, I am using Levenberg–Marquardt with a pseudo-Huber loss function to recover the MAP estimate of c, then making the Laplace approximation: a Gaussian whose covariance is the inverse of the negative Hessian of the log probability at the MAP. Why wasn't I taught this in university? This is a lot better than nothing, but I suspect that as the project progresses we will find c is much nastier than a Gaussian. I'm having nightmares of heteroskedastic bananas.

(Should I still call it a MAP estimate even though I don’t have a prior for c, just a and b? Or is this just a maximum likelihood estimate?)
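For concreteness, here is roughly the shape of that fit-then-Laplace step as a toy sketch; the model, noise level, and starting point are all made up, and SciPy's loss="soft_l1" (a pseudo-Huber-style loss) stands in for my actual Levenberg–Marquardt code:

import numpy as np
from scipy.optimize import least_squares

# Toy stand-in for the real measurement model (not my actual f/g)
t_grid = np.linspace(0.0, 1.0, 100)
def model_fn(c):
    return c[0] * np.exp(-c[1] * t_grid)

rng = np.random.default_rng(0)
c_true = np.array([1.5, 0.7])
observations = model_fn(c_true) + rng.normal(0.0, 0.05, size=t_grid.shape)

def residuals(c):
    # Whiten by the known measurement sigma
    return (observations - model_fn(c)) / 0.05

# Robust fit to get near the optimum
fit = least_squares(residuals, x0=np.array([1.0, 1.0]), loss="soft_l1")

# Laplace approximation: with whitened residuals, J^T J is the
# Gauss-Newton approximation of the negative Hessian of the log
# probability (ignoring priors, which this toy has none of), so
# its inverse approximates the covariance around the optimum.
J = fit.jac
cov = np.linalg.inv(J.T @ J)
print(fit.x, cov)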

Currently I'm only checking the MAP values of a and b to make sure they're reasonable. Anything much more than 2 standard deviations away probably means I need to report partial system failure and continue to use the previous estimate of c. Other than that, they're nuisance parameters, along with s.


That was the left side of the graph; now for the high-rate stuff.

t is time (observed exactly). After I have c, I have to temporally project it to c_t. Exactly how I do this is up in the air; maybe I can just set c_t := c, maybe I just scale the variance with time, maybe I can even temporally predict it in a more interesting way (I have some evidence this is possible, but I am very data-starved right now).
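The variance-scaling option is just a random-walk assumption. A minimal sketch, where process_noise is a made-up per-second covariance rate I would have to tune:

import numpy as np

def project_in_time(c_mean, c_cov, dt, process_noise):
    # Random-walk model: the mean carries over unchanged and the
    # uncertainty grows linearly with elapsed time.
    return c_mean, c_cov + process_noise * dt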

For the most part, only my observed variables (thick white border) are time-varying: the mus and sigmas of a, b, c, and s are all constant and known, as are the sigmas on d and e (of the inliers, at least).

x is a unit 3-dimensional vector that pops into existence at 200 Hz. I know it exactly and have no other information about it.

h is a little tricky. I drew it as a deterministic variable, but I fear that was misleading. I think this part in particular is not very Bayesian, but it's demanded of me. The evaluation of h begins with sampling a uniform random point on a unit disk (this is not in the diagram I provided). Even if I knew c_t, s, and x exactly, with infinite precision, I would still have to evaluate h many times, each time getting a different unit 3-vector (a direction). I expect these to be pretty tightly clustered in most scenarios, but sometimes smeared out like a banana or teardrop. I would like to know everything I can about the distribution that gets generated when I do that.
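For reference, the disk sampling itself is the standard area-uniform trick; a sketch of just that first step, with everything downstream inside h omitted:

import numpy as np

def sample_unit_disk(rng):
    # Taking sqrt of the radius makes the point uniform in area,
    # not merely uniform in radius.
    r = np.sqrt(rng.uniform())
    theta = rng.uniform(0.0, 2.0 * np.pi)
    return r * np.cos(theta), r * np.sin(theta)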

Ultimately these samples don't even measure what I really want; they just approximate it. There is only one true direction, but there is no way of knowing which, even in principle, except that it is likely to be near the mode of that distribution I get from sampling h (assuming my various assumptions hold). It's not obvious to me how to incorporate this knowledge, so I might just neglect it. (Maybe I should seek an estimate that maximizes the probability mass within some fixed bound?)

I'm currently just computing the geometric median of those samples on-manifold, which seems very reasonable. I'm providing something like a confidence bound by projecting those points to the tangent space around the geometric median (taking care to preserve geodesic distance from that tangent point) and fitting a Gaussian to those transformed points. I don't think my real data will be Gaussian, but I'm not sure the downstream systems can handle more than a Gaussian "confidence interval". Maybe one of the Fisher–Bingham distributions would be better, but I haven't had time to look at that in depth.
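In case it helps make that projection concrete: what I described is the log map on the sphere. A rough sketch, where directions is an (N, 3) array of unit vectors and median is the unit-vector geometric median:

import numpy as np

def log_map(median, directions):
    # Map unit vectors to the tangent plane at `median`, preserving
    # geodesic distance (the angle) from the tangent point.
    cos_angle = np.clip(directions @ median, -1.0, 1.0)
    angle = np.arccos(cos_angle)                     # geodesic distance
    perp = directions - cos_angle[:, None] * median  # orthogonal component
    norm = np.linalg.norm(perp, axis=1)              # equals sin(angle)
    scale = np.where(norm > 1e-12, angle / np.maximum(norm, 1e-12), 0.0)
    return perp * scale[:, None]

# Then fit the Gaussian in the tangent plane:
# tangent = log_map(median, directions)
# cov = np.cov(tangent, rowvar=False)  # rank-2: points lie in a plane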


That was the high-rate stuff; now for my concrete questions:

  1. I probably can’t think to ask every question I need answered, so please tell me what I need to know. :slight_smile: Actually, where is a better place to ask for help?

  2. Is it possible that PyMC can be made to run fast enough for my purposes? I’m assuming I’d have to use ADVI, as I need to adhere to latency guarantees.

  3. I understand that the Laplace approximation (and variational inference in general) underestimates variance, so I’d like to understand how I can improve my estimate of the variance of c.

  4. In my research I came across BBVI (black-box variational inference), especially a 2023 paper by Giordano, Ingram, and Broderick. Am I correct in thinking I could use this to fit a Fisher–Bingham, Kent, etc. distribution to all my samples of h? Or to easily try out different distributions for c? It seems like this might actually be tractable inside my compute budget; am I missing something?

  5. Solely for my curiosity, why does that paper have only 9 citations? And it’s not like the authors are nobodies: Broderick is a tenured professor at MIT. Even Ranganath, Gerrish, and Blei’s 11-year-old paper on BBVI only has 106 citations, which is wild to me. Obviously, I’m an outsider to statistics, but these techniques seem extremely practical… what gives?

  6. How should I handle gross outliers in d and e without harming my estimate of c? Currently I'm using the pseudo-Huber loss, but that means gross outliers still shift the position of the MAP. I'm thinking I should use that to get close to the MAP, then mark some of the observations as outliers, then continue optimizing only on the inliers; essentially truncated least squares (see the sketch after this list). I've seen the mean-field approach, but it seems a little abstract and less efficient.

  7. Is there anything I can say in a principled way about what form c or h will take, say by looking at f and g, without just getting some real data and looking at it?

  8. Is the approach of Levenberg–Marquardt followed by the Laplace approximation sensible? Is there a better way of doing that, assuming c is sufficiently well approximated by a Gaussian?

  9. To rephrase question 3, if I wanted to accommodate a c that was, say, a multivariate skew Gaussian, or a leptokurtic, heteroskedastic banana, what do I do? Other than scream in fright.
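For question 6, the trimming scheme I have in mind is roughly the sketch below; residuals (whitened), x0, threshold, and rounds are placeholders, and the robust first pass would be my existing pseudo-Huber fit:

import numpy as np
from scipy.optimize import least_squares

def trimmed_fit(residuals, x0, threshold=3.0, rounds=3):
    # Pass 1: a robust loss gets near the MAP despite gross outliers
    fit = least_squares(residuals, x0, loss="soft_l1")
    mask = np.abs(residuals(fit.x)) < threshold
    for _ in range(rounds):
        # Refit with a plain quadratic loss on the inliers only,
        # then re-mark outliers against the improved fit.
        fit = least_squares(lambda x, m=mask: residuals(x)[m], fit.x)
        mask = np.abs(residuals(fit.x)) < threshold
    return fit, mask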

Again, thank you. Really. I wouldn't be making these long posts if it didn't actually matter that I get this right, and to a lot more people than just me. I'm going to look around for better venues to ask for guidance. I really wish I had been taught this stuff in university! I took I don't know how many classes on numerical methods and still I'm a babe in the woods.