Beyond linear regression with PyMC

My understanding of what you are looking for suggests that your original formulation is correct (simplified a bit here):

p(C, F, dp | obs) = \frac{p(obs | C, F, dp) \, p(C, F, dp)}{p(obs)}

When people talk about building “a model”, they are typically referring to the likelihood:

p(obs | C, F, dp)

In your case, the observed data seems to be x1, x2, and x3. As suggested, it may be possible to formulate your model so that it produces a quantity such as p(x_3 | C, F, dp, x_1, x_2). This turns your model into a “regression-like” model. Regression models have a likelihood:

p(y, X | \theta)

where \theta captures all the model parameters (e.g., intercept and coefficients). The full inference is then:

p(\theta | y, X) = \frac{p(y, X | \theta) \, p(\theta)}{p(y, X)}

But this isn’t what we actually do (typically) when we put Bayesian regression models together. Instead, we assume the following:

p(y, X | \theta) = p(y | X, \theta) \, p(X)

Why? The predictors are typically assumed to be exogenous and thus independent of the model parameters, so that p(X | \theta) = p(X), and we further assume that the p(X) factor can be ignored. Why? We typically aren’t that interested in doing inference on the parameters governing the process by which X itself is generated (we take X as given), and those parameters are unrelated to the inference we are interested in (i.e., \theta). So in the end, suppressing the conditioning on X in the notation, we actually do something like the following:

p(\theta | y) = \frac{p(y | \theta) \, p(\theta)}{p(y)}

and we ignore the denominator as usual. What may be a bit confusing is that some of our observations (y) appear explicitly in this expression while others (X) do not. But all of these observations certainly appear in our model. Why? Because they are required to calculate p(y | \theta) (which is really p(y | X, \theta)). If you hand me a \theta (e.g., an intercept and some coefficients) and ask me for a y, I can’t do much until you also hand me an X.
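To make that concrete, here is a minimal sketch of the standard setup in PyMC (the data, priors, and dimensions are invented purely for illustration): X enters the model only as fixed data inside the likelihood, while y is the only quantity declared as observed.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# Fake data: 100 observations, 2 predictors
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

with pm.Model():
    # Priors p(theta): intercept, coefficients, noise scale
    intercept = pm.Normal("intercept", 0, 10)
    beta = pm.Normal("beta", 0, 10, shape=2)
    sigma = pm.HalfNormal("sigma", 1)

    # X only ever appears on the right-hand side of the likelihood;
    # we condition on it and never write down a prior for p(X)
    mu = intercept + pm.math.dot(X, beta)

    # y is the only quantity declared as observed
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample()
```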

Whether the assumptions conventional in the regression setting are appropriate in your setting is not clear. But you should definitely give it some thought and figure out what makes sense before blindly “doing regression” on your data. For example, if I provide values of C, F, and dp, can you provide me with values of x1, x2, and x3? From the constraint you provided, the answer seems to be no. But then you need to figure out what formulation makes sense.

This is unlikely to be what you want, if for no other reason than that it is likely to be extremely slow.

This is likely to be tough to do inference on (i.e., identifiability may be a challenge), though that’s just intuition from looking at this expression. For example, rearranging yields this:

dp - \frac{C}{F} = x_1 - x_2 - \frac{x_3}{F}

at which point nearly all the observations are on one side and nearly all the parameters are on the other. My suggestion would be to generate some synthetic data from a data-generating process consistent with your understanding, then build a model and see whether it yields inferences that are in any way related to the values of C, F, and dp underlying the synthetic data.
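Something like the following, where the data-generating process is pure invention on my part; substitute whatever process you actually believe connects C, F, and dp to x1, x2, and x3:

```python
import arviz as az
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)

# 1. Pick "true" values for the quantities you want to recover
C_true, F_true, dp_true = 2.0, 5.0, 0.3

# 2. Simulate observations from an assumed (here: made-up) process
n = 200
x1 = dp_true + rng.normal(scale=0.1, size=n)
x2 = C_true / F_true + rng.normal(scale=0.1, size=n)
x3 = F_true * (x1 - x2) + rng.normal(scale=0.1, size=n)

# 3. Fit a model with the same structure to the synthetic data
with pm.Model():
    C = pm.HalfNormal("C", 10)
    F = pm.HalfNormal("F", 10)
    dp = pm.Normal("dp", 0, 1)

    pm.Normal("x1_obs", mu=dp, sigma=0.1, observed=x1)
    pm.Normal("x2_obs", mu=C / F, sigma=0.1, observed=x2)
    # regression-like piece: x3 modeled conditional on the observed x1, x2
    pm.Normal("x3_obs", mu=F * (x1 - x2), sigma=0.1, observed=x3)

    idata = pm.sample()

# 4. Check whether the posteriors concentrate anywhere near the
#    true values (2.0, 5.0, 0.3); if not, you likely have an
#    identifiability or model-specification problem
print(az.summary(idata, var_names=["C", "F", "dp"]))
```

If the model can’t recover the parameters that generated its own synthetic data, it has little hope of recovering them from your real data.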
