Beyond linear regression with pymc

Hello pymc community,
I have learned how to use pymc mostly through linear regression examples. However, I am now dealing with a new problem, which is essentially to find the posterior distribution for 3 unknown parameters.
Here I know x1, x2, x0, and I am hoping to find the distribution of the C and F parameters for any value of dp (C, F, and dp are unknown).

The formula that relates all these parameters is:
dp = x1 - x2 - (x0/F) + (C / F)

The Bayes theorem would be, what is the probability of seeing any combination of C, F and dp, given we observed a combination of x1, x2, x0:

P(C, F, dp | x0, x1, x2) = [ P(x0, x1, x2 | C, F, dp) x P(C, F, dp) ] / P(x0, x1, x2)

This might be really silly (given that I learned pymc through linear regression), but I changed the equation relating the parameters to y = x1 - x2 - (x0/F) + (C / F) - dp, and then defined a vector of y values equal to 0 with a small standard deviation (e.g., 0.001). The other way around would be to fix dp (e.g., dp = np.zeros(len(P_i)) + 1) and replace it with the y value. But this is tedious, as I would have to write a for-loop to fix each dp value in turn, which would take days to run while only exploring a few dp values.

My questions to the community are:

  1. Is it actually a problem that can be solved using Bayes' theorem? My thought is: I hope so, because the uncertainties in x0, x1, x2 should be taken into account, and Bayes seems to be the way to go when dealing with uncertainties and credible intervals.

  2. Is there something wrong with my approach of formulating the problem as P(C, F, dp | x0, x1, x2) = [ P(x0, x1, x2 | C, F, dp) x P(C, F, dp) ] / P(x0, x1, x2)? I wrote the code below to try to find the most probable combination of C, F, dp for each observed x0, x1, x2.

  3. Is there any similar example (notebook) so that I could modify my current code to find a solution for the problem?

  4. Do you have any suggestions on how to modify the code so that it reaches the goal of finding the combination of C, F, dp for each observed combination of x0, x1, x2?

Many thanks in advance!

#### Toy data
import numpy as np
import pymc as pm

x0 = np.random.uniform(low=5, high=61, size=(30,))
x1 = np.random.uniform(low=25, high=145, size=(30,))
x2 = np.random.uniform(low=20, high=65, size=(30,))
y = np.zeros_like(x0)  # pseudo-observations: the equation residual should be 0
with pm.Model() as model_min_press_const_fric:
    ### likelihoods
    T_final = pm.MutableData("T_", x0)
    avg_x_noise_0 = pm.HalfNormal("avg_x_noise_0")
    x_noisy_0 = pm.Normal("x_noisy_0", mu=T_final, sigma=avg_x_noise_0, shape=T_final.shape)

    S_final = pm.MutableData("S", x1)
    avg_x_noise_1 = pm.HalfNormal("avg_x_noise_1")
    x_noisy_1 = pm.Normal("x_noisy_1", mu=S_final, sigma=avg_x_noise_1, shape=S_final.shape)
    
    P_i_final = pm.MutableData("P_i", x2)
    avg_x_noise_2 = pm.HalfNormal("avg_x_noise_2")
    x_noisy_2 = pm.Normal("x_noisy_2", mu=P_i_final, sigma=avg_x_noise_2, shape=P_i_final.shape)

    ##### priors on known
    C = pm.Uniform('C', 0, 70)
    F = 0.6 #pm.TruncatedNormal('F', mu = 0.6, sigma = 0.2, lower = 0)
    dp = pm.Uniform("dp", 0.001,35)

    #### likelihood on y 
    y_p = pm.MutableData("y_p", y)



    pred_ = pm.Deterministic("pred_", x_noisy_1 - x_noisy_2 - x_noisy_0 / F + C / F - dp)
    obs_  = pm.Normal('y', mu=pred_, sigma=0.001, observed=y_p)

    trace_min_press_increase_test_2 = pm.sample(2000, tune=1500, chains=8, cores=8, init='jitter+adapt_diag', random_seed=7, target_accept=0.99)
    

Could you state the formula in a way that it solves for any of the x values?
E.g. x_0 = F x_1 - F x_2 - F dp + C


@mattiasthalen
Thanks for the reply! Yes, as you posted, x_0 = F(x_1 - x_2 - dp) + C. So are you suggesting replacing y with x0 in the code?
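Just to double-check the algebra on my side, the two forms are consistent (a quick numerical check with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2, C, F, dp = rng.uniform(1, 10, size=5)

# rearranged form: x_0 = F(x_1 - x_2 - dp) + C
x0 = F * (x1 - x2 - dp) + C

# plugging x_0 back into the original relation recovers dp
dp_back = x1 - x2 - x0 / F + C / F
print(np.isclose(dp_back, dp))  # → True
```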

Please correct me if I am wrong, then wouldn’t replacing y with x0 result in P(C, F, dp , x_0| x_1, x_2) = \frac {P(x_1, x_2 | C, F, dp, x_0) \times P(C, F, dp, x_0)} { P(x_1, x_2)}?
If this is right, then it deviates from the actual form P(C, F, dp | x_0, x_1, x_2) = \frac {P(x_0, x_1, x_2 | C, F, dp) \times P(C, F, dp)} { P(x_0, x_1, x_2)}, right?

My question is about training a model that could find the most credible regions for the combination of C, F, dp given each combination of x0, x1, x2. My current code looks at all the data and then gives me one posterior distribution for C, F, dp after observing all of x0, x1, x2. I could loop through the data (x0, x1, x2) and, for each point, run the code to find the most probable C, F, dp; or I could find the posterior distribution for C, F, dp given all the data points and then predict, one by one, what C, F, dp would be for each. Do you think this is the right way to find the most probable combination for each example in the dataset?

My understanding of what you are looking for suggests that your original formulation is correct (simplified a bit here):

p(C, F, dp| obs) = \frac {p(obs | C, F, dp) p(C, F, dp)} { p(obs)}

When one builds “a model”, people are typically referring to the likelihood:

p(obs | C, F, dp)

In your case, the observed data seems to be x0, x1, and x2. As suggested, it may be possible to formulate your model so that it produces a quantity such as p(x_0 | C, F, dp, x_1, x_2). This turns your model into a “regression-like” model. Regression models have a likelihood:

p(y, X| \theta)

where \theta captures all the model parameters (e.g., intercept and coefficients). The full inference is then:

p(\theta|y, X) = \frac {p(y, X| \theta) p(\theta)} { p(y, X)}

But this isn’t what we actually do (typically) when we put Bayesian regression models together. Instead, we assume the following:

p(y, X| \theta) = p(y | \theta)p(X)

Why? Well, the predictors are typically assumed to be exogenous, and thus independent of the other model parameters, and we further assume that p(X) can be ignored. Why? We typically aren’t that interested in doing inference on the parameters governing the process by which X itself is generated (we take them as given), and those parameters are not related to the inference we are interested in (i.e., \theta). So in the end, we actually do something like the following:

p(\theta | y) = \frac {p(y | \theta) p(\theta)} { p(y) }

and we ignore the denominator as usual. What may be a bit confusing is that some of our observations (y) appear in this expression, but others (X) do not. But all of these observations certainly appear in our model. Why? Because they are required to calculate p(y | \theta). If you hand me a \theta (e.g., an intercept and some coefficients) and ask me for a y, I can’t do much until you also hand me an X.

Whether the assumptions conventional in the regression setting are appropriate in your setting is not clear. But you should definitely give it some thought and figure out what makes sense before blindly “doing regression” on your data. For example, if I provide values of C, F, and dp, can you provide me with values of x0, x1, and x2? From the constraint you provided, the answer seems to be no. But then you need to figure out what formulation makes sense.

This is unlikely to be what you want, if for no other reason than that it is likely to be extremely slow.

This is likely to be tough to do inference on (i.e., identifiability may be a challenge), though that’s just intuition based on looking at this expression. For example, rearranging yields this:

dp - (C/F) = x1 - x2 - (x0/F)

at which point nearly all the observations are on one side and nearly all the parameters are on the other. My suggestion would be to generate some synthetic data from a data generating process consistent with your understanding, then build a model and see if it generates inferences that are in any way related to the values of C, F, and dp underlying the synthetic data.


Thank you so much for taking the time to answer my post. Even relaxing the problem a little bit by assuming a fixed F value (e.g., F = 1) gives dp - C = x_1 - x_2 - x_0, so now we have all the knowns (observed data) on the right-hand side and the unknowns on the left. There is one equation and two unknowns, and I could define a prior on both unknowns, but I am still unclear how one would do inference for this equation using P(C, dp | obs) = \frac {P(obs | C, dp) P(C, dp)} {P(obs)} in pymc without converting it to a regression problem. Any code suggestions?


My strong suggestion is to generate (simulate) some data. If you can do that, it would be much easier to figure out how to put together a model. In your original post you generated data (x0, x1, and x2), but the values you generated were not related to C, dp, or F. Can you simulate values for x0, x1, and x2 in a way that utilizes C, dp, or F?


data_generated.csv (1.1 KB)

I generated the data assuming a fixed F = 0.6 and a fixed dp = 5. I could repeat this, changing dp and F, and generate new values based on the observations. Should I generate more data? Is this dataset useful?

Can we see the code that you used to generate this data? That would be the useful bit.

Absolutely!

import numpy as np
import pandas as pd

data_generated = pd.read_csv("data_generated.csv")
x_0 = data_generated["x0"].values
x_1 = data_generated["x1"].values
x_2 = data_generated["x2"].values
dp = np.linspace(0.1, 100, len(x_0))
F = 0.6
c = []
for i in dp:
    # solve dp = x_1 - x_2 - x_0/F + C/F for C at each candidate dp value
    c.append((x_1 - x_2 - x_0 / F - i) * (-F))

df_data = pd.DataFrame({"x_0": x_0, "x1": x_1, "x2": x_2, "C": c[1],
                        "F": np.zeros(len(x_0)) + 0.6,
                        "dp": np.zeros(len(x_0)) + dp[1]})
df_data.to_csv("data_generated.csv", index=False)

I took c[1], which corresponds to dp = 5 and F = 0.6; x_0, x_1, x_2 are the same as their corresponding columns in the attached csv file.

My updated priors since my original post are:

    ##### priors on unknown
    C = pm.HalfNormal('coh', 25)
    F = pm.TruncatedNormal('F', mu=0.6, sigma=0.4, lower=0.2)
    dp = pm.TruncatedNormal("dp", mu=5, sigma=6, lower=0.1)

This code won’t run because neither data_generated nor P_i is defined. But the bigger issue is that this code doesn’t seem to generate values of any of your observed variables (i.e., x0, x1, or x2). So this simulation code seems to describe the process by which the unobserved variables (i.e., dp, F, and c) are generated, rather than the process by which the observed variables are generated. Is that correct?


obs_data_generated.csv (660.4 KB)

My apologies! The code below should run with no problem, generating the appropriate data. I have also uploaded a sample of such a run.
I had to bring in one more condition, the ratio, to be able to generate x_0 and x_1. This physically makes sense in the scope of my research; I had never thought to put it into my pymc model. The ideal model should take into account the uncertainties in all the observed data. I tried this in my original model by letting the model learn the noise in my observation data. I’d really appreciate any further suggestions.

import numpy as np
import pandas as pd

# number of data points
n = 10000

x_0 = []
x_1 = []


### upper and lower limits for drawing a random value for x_2 from a normal distribution 
low_truncated_x_2 = 0
up_truncated_x_2 = 60
x_2 = []

### upper and lower limits for drawing random values for x_0 / x_1 ratio from a normal distribution
low_truncated_val_ratio = 0.2
up_truncated_val_ratio = 0.9
ratio_ = [] ### ratio = x_0 divided by x_1

### upper and lower limits for drawing a random value for F from a normal distribution 
low_truncated_F = 0.1
up_truncated_F = 1.5
F = []

### upper and lower limit for drawing a random value for C from a normal distribution 
low_truncated_C = 0
up_truncated_C = 20
C = []

### upper and lower limit for drawing a random value for dp from a normal distribution 
low_truncated_dp = 0.1
up_truncated_dp = 70
dp = []


### drawing random data to generate fake data

#### ratio_ 
while len(ratio_) < n:
    num_ = np.random.normal(0.4, 0.5, size=(1,))[0]
    if (num_>= low_truncated_val_ratio) & (num_ <= up_truncated_val_ratio):
        ratio_.append(num_)

#### x_2
while len(x_2) < n:
    num_ = np.random.normal(40, 20, size=(1,))[0]
    if (num_ >= low_truncated_x_2)&(num_<= up_truncated_x_2):
        x_2.append(num_)

#### F
while len(F) < n:
    num_ = np.random.normal(0.6, 0.4, size=(1,))[0]
    if (num_ >= low_truncated_F)&(num_<= up_truncated_F):
        F.append(num_)
#### C
while len(C) < n:
    num_ = np.random.normal(0, 25, size=(1,))[0]
    if (num_ >= low_truncated_C)&(num_<= up_truncated_C):
        C.append(num_)

#### dp
while len(dp) < n:
    num_ = np.random.normal(5, 1, size=(1,))[0]
    if (num_ >= low_truncated_dp)&(num_<= up_truncated_dp):
        dp.append(num_)

for j in range(n):

    x_1.append((dp[j] - C[j]/F[j] + x_2[j]) / (1-ratio_[j]/F[j])) 
    x_0.append(ratio_[j] * x_1[j])

df_data = pd.DataFrame({"x_0" : x_0, "x_1" : x_1, "x_2": x_2,"C": C, "F" : F, "dp": dp})
df_final = df_data[(df_data["x_0"] > 0)&(df_data["x_0"] < 90) & (df_data["x_1"] > 0) &(df_data["x_1"] < 210)]
df_final.to_csv("obs_data_generated.csv")

Just following up on this! I was wondering if you had a chance to take a look at the data? I have used the artificial data with the x_0 = F(x_1 - x_2 - dp) + C model, and the results are to some degree close to the underlying values of F, but overestimate dp and C (although the large posterior range does include the underlying distributions of dp and C). In addition, I still cannot make the argument that justifies calculating the probability P(C, F, dp, x_0 | x_1, x_2) instead of P(C, F, dp | x_0, x_1, x_2), which has an actual physical meaning.