How to properly use split data with PyMC?

Hi,

I am learning how to perform Bayesian Logistic Regression with PyMC.
I managed to do some regression, but only with the full data.

I’m struggling a bit to figure out how to train the model on X_train and y_train and then test it on the test data. Could someone advise how I should change my code below to use separate train and test data?

with pm.Model() as firstmodel:
    # priors on betas
    beta = pm.Normal('beta', 0, 1, shape=len(df.columns))
        
    # probability of belonging to class 1
    p = pm.Deterministic('p', pm.math.sigmoid(beta[0] + beta[1]*X.Temperature + 
                                              beta[2]*X.Humidity + beta[3]*X.Light +
                                              beta[4]*X.CO2 + beta[5]*X.HumidityRatio + 
                                              beta[6]*X.date))

with firstmodel:
    # fit the data
    observed = pm.Bernoulli("occupancy", p, observed=y)
    
    idata = pm.sample(3000)

You can take a look at the documentation for pm.set_data() for some examples. Does that get you what you need?
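
Roughly, the pattern looks like this (a minimal one-predictor sketch to show the mechanics, not your exact model):

import pymc as pm

with pm.Model() as model:
    # wrap the inputs in a mutable container so they can be swapped out later
    x = pm.MutableData('x', X_train)
    beta = pm.Normal('beta', 0, 1)
    p = pm.Deterministic('p', pm.math.sigmoid(beta * x))
    obs = pm.Bernoulli('obs', p, observed=y_train)
    idata = pm.sample()

with model:
    # swap in the test inputs, then draw posterior predictive samples
    pm.set_data({'x': X_test})
    ppc = pm.sample_posterior_predictive(idata)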

Thanks for your answer. I believe it does get me where I wanted. My code is now:

with pm.Model() as model:
    x = pm.MutableData('x', X_train)
    beta = pm.Normal('beta', mu = 0, sigma = 5, shape = len(df.columns))
    p = pm.Deterministic('p', pm.math.sigmoid(beta[0] + beta[1]*X_train.Temperature + beta[2]*X_train.Humidity
                                               + beta[3]*X_train.Light + beta[4]*X_train.CO2 + 
                                                beta[5]*X_train.HumidityRatio))
    obs = pm.Bernoulli('obs', p, observed=y_train)
    idata = pm.sample(3000)

with model:
    pm.set_data({'x':X_test})
    y_pred = pm.sample_posterior_predictive(idata)

Now I have one more question… for the problem that I’ll have to tackle in the future, I can’t spell out pm.math.sigmoid explicitly like this, with beta[1]*X_train.Something and so on, because it is a high-dimensional problem. I tried an approach where I build the formula as a string, as shown below:

func = 'beta[0]'
for i in range(len(X.columns)):
    func += f'+ beta[{i+1}]*X.iloc[:,{i}]'

and then do p = pm.Deterministic('p', pm.math.sigmoid(eval(func))). But that didn’t work. How should I approach this, so I don’t have to write out every multiplication? I also tried something like pm.math.sigmoid(beta[0] + beta[1]@X_train.values.T), but that didn’t work either…

Thanks once again!

beta[0] + x @ beta[1:] should work. What error do you get?

Also note that you should be using x (the pm.MutableData object) everywhere in your model instead of X_train, otherwise PyMC won’t know to update it.
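
Putting those two points together, the model would look something like this (a sketch using your variable names; note that p is built from x, not X_train):

import pymc as pm

with pm.Model() as model:
    x = pm.MutableData('x', X_train)
    beta = pm.Normal('beta', mu=0, sigma=5, shape=len(df.columns))
    # intercept is beta[0]; the dot product covers all predictor columns at once
    p = pm.Deterministic('p', pm.math.sigmoid(beta[0] + x @ beta[1:]))
    obs = pm.Bernoulli('obs', p, observed=y_train)
    idata = pm.sample(3000)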

Hi,

Indeed, changing everything to beta[0] + x @ beta[1:] worked. However, it is taking way longer to train the model now… any idea why that is?

Before, when I had it all explicitly written, it took around 200 seconds to finish. Now, written like this, it estimates +30 min to finish. Illustration below:

Explicitly written: [sampling progress screenshot]

Vector notation: [sampling progress screenshot]

Thanks!

PS
Unrelated question: is there, by any chance, a Discord channel for those who want to get more involved with the PyMC community?


When things are taking a long time it’s usually because there’s some unexpected broadcasting going on.

I suggest commenting out everything related to sampling and looking at p.eval() and obs.eval().

The .eval() method runs the computational graph associated with a variable and lets you see a sample output. My guess is that you will find that one of these is not (n,1), but (n,n) or something.
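
For example, here is the kind of silent broadcasting to watch out for (plain NumPy, but PyTensor/Aesara broadcasts the same way):

import numpy as np

a = np.zeros(5)       # shape (n,)
b = np.zeros((5, 1))  # shape (n, 1)

# adding an (n,) array to an (n, 1) array broadcasts to an (n, n) matrix
print((a + b).shape)  # (5, 5)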

If you are interested in getting involved, I would suggest checking out the PyMC Meetup group and the PyMC project calendar. We regularly host office hours (Q&A and discussion of whatever people want to discuss) and a quasi-monthly hackathon to get things done on the codebase. We have also hosted sprints and other events. Depending on what you are looking for, you can post questions here on Discourse or check out issues over on GitHub (e.g., here). PM me if I can help!

Thanks for the quick reply and for the pro tip on the .eval() method.

I checked p.eval() and obs.eval() for both the explicit and the vector methods, and both have a shape of (n,). Still couldn’t figure out what is happening… I tried playing around with x.T and so on, but nothing works; it always takes much longer to train the model than when I write it out explicitly…

It’s getting late here so I’ll let it run until the end and see if I get the same output. I come back here tomorrow, thanks for the help so far!

@cluhmann Thank you as well, I’ll check the calendar and see if I can become an active member of PyMC. I just found out about Bayesian statistics, but I’m pretty sure I’ll stay for the long run. :)

@jessegrabowski Ignore my example, I was wrong. They yield the same result, an (n,) vector.

The two examples are different because you didn’t add beta[0] (the constant) to your dot product.

I had a question about the shape of your beta. You use len(df.columns), and I don’t know what that works out to. A priori I would write your model like this:

with pm.Model() as model:
    X = pm.MutableData('X', X_train)
    y = pm.MutableData('y', y_train)

    beta = pm.Normal('beta', mu=0, sigma=5, shape=len(X_train.columns))
    constant = pm.Normal('constant', mu=0, sigma=5)

    p = pm.Deterministic('p', pm.math.sigmoid(constant + X @ beta))

    observed = pm.Bernoulli("obs", p, observed=y)

The shape of beta then matches the thing I’ll be dotting it with, and I handle the constant separately (or add a column of 1’s to X, as in the sketch below).
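
If you go the column-of-ones route, that would look something like this (a sketch, not tested against your data):

import numpy as np
import pymc as pm

# prepend an intercept column so the first coefficient acts as the constant
X_design = np.column_stack([np.ones(len(X_train)), X_train])

with pm.Model() as model:
    X = pm.MutableData('X', X_design)
    beta = pm.Normal('beta', mu=0, sigma=5, shape=X_design.shape[1])
    p = pm.Deterministic('p', pm.math.sigmoid(X @ beta))
    observed = pm.Bernoulli('obs', p, observed=y_train)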

So, my example was indeed wrong. And indeed they mean the same thing. Sorry for that, I must be tired.

Defining the model like you did works… (len(df.columns) is just len(x)+1, and then I use that beta[0].)

The interesting thing is that the vector form takes 8 times longer to train, yet they yield the same result. Please see the images below:

Explicitly writing the equations: [sampling progress screenshot]

Writing in vector form: [sampling progress screenshot]

In the “by hand” method, you’re not using the mutable data object in the computation of p. Are you positive those are out-of-sample predictions? My guess is that the pm.set_data line actually does nothing in this case.

I’d check if the “by hand” method is also slow if you write:

p = pm.Deterministic('p', pm.math.sigmoid(beta[0] + beta[1] * x[:, 0] + beta[2] * x[:, 1] + beta[3] * x[:, 2] + beta[4] * x[:, 3] + beta[5] * x[:, 4]))

I’d also run the aesara BLAS test to check that your linear algebra libraries are all set up correctly. This can be done by running:

python `python -c "import os, aesara; print(os.path.dirname(aesara.__file__))"`/misc/check_blas.py

(If you’re on pymc>=5.0.0, replace aesara with pytensor everywhere in that command)

Hi @jessegrabowski, thanks for your continuous help.

Indeed, you were right once again. I downloaded two files datatraining.txt and datatest.txt and I assumed they were different and was using one for X_train, y_train and the other for X_test, y_test. But now that I checked them closely they are the same…

I have deleted one of the files and just used the other one. I then did

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

And then ran the model again. Some points to note:

  1. Either running x[:, 0], x[:, 1], etc. or X_train.Humidity, etc. takes the same ~200 seconds for the NUTS: [beta] part.
  2. When doing the split approach, I get an error when sampling [obs]. The error reads: ValueError: size does not match the broadcast shape of the parameters. (6514,), (6514,), (1629,). It is worth mentioning that 6514 is the length of the training data and 1629 of the test data. I would imagine it should be OK to have a smaller test set than training set?

Now, regarding the BLAS library tests: I am using pymc >= 4.0.0, and these are the last lines of the output I get (on the second run, since the script says to run it again):

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 10.09s on CPU (with direct Aesara binding to blas).

Should you need the entire output of the check_blas.py test, please let me know!

Any ideas on what is happening?

PS.: Should I use pymc >= 5.0.0 or … ?

-----------------------EDIT------------------------
The vectorized function now takes the same time as the other one. Something happened and it’s fixed. My only remaining question is how to split the data into different sizes. I can only run the model if X_test and X_train have the same number of samples; as I showed above, if I split them into different sizes I get an error when sampling [obs].

Thanks a lot for all the help!

This is good in a way, it means X is updating now. You just need to also update y to be a vector of the same shape as X, or else PyMC gripes that the shapes don’t match up on the computational graph. y isn’t actually used for anything when you run pm.sample_posterior_predictive, so you can just fill it with zeros if you like.

Add a 2nd pm.MutableData object to your model for the observed data, then update this with pm.set_data as well and you should be good to go.
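
Something like this (a sketch, assuming your model names the containers 'X' and 'y' as in the model above):

import numpy as np

with model:
    # y is only a shape placeholder during posterior prediction, so zeros are fine
    pm.set_data({'X': X_test, 'y': np.zeros(len(X_test))})
    ppc = pm.sample_posterior_predictive(idata)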

No reason not to, they work hard on those updates!

Glad to hear it’s suddenly working. For the BLAS test, you want(ed) to check two things. First, you get a table of CPU benchmarks; look for one closest to yours and compare the timing. Second, there’s a line blas__ldflags; you want to make sure there’s a -lmkl_rt flag in there. That indicates you have MKL installed – not having this is a common source of slowdown (you might also get a warning about falling back to NumPy’s BLAS when you import pymc if you have this problem).
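
If you just want to see the flags without re-running the whole benchmark, something like this should work (pytensor for pymc>=5, aesara otherwise):

python -c "import pytensor; print(pytensor.config.blas__ldflags)"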


:sob: :sob: :sob: :sob: :sob: :sob: :sob: :sob:

First of all, thanks for the tip on how to have different sizes of train/test data — that works.

However, I am crying because the vector-form thing didn’t actually work… I believe it only seemed to work before because I had reduced the size of X_train so much that the by-hand version and X @ beta ran at about the same speed.

Now, again, it’s taking around 120 sec for the by-hand approach, while it takes ~18 min for X @ beta. Maybe it’s because of the BLAS thing; please see my remarks below.

According to the benchmark given by check_blas.py, openblas/8 on an i7 950 took 3.70s for “10 executions of gemm in float64 with matrices of shape 2000x2000”, while mine took around 10s for 10 executions of gemm with matrices of shape 5000x5000. So the timing seems OK(?), given that the i7 950 is a bit slower than my AMD Ryzen 7 5800H and the matrices I tested were 2.5x bigger.

I do not have this flag; the only flag I have is -lblas. Should I install MKL (even though my CPU is AMD)? Or should I install AMD’s own library instead?

-------------------------------EDIT-------------------------------
I have upgraded my pymc to pymc>=5.0.0 and installed MKL. It looks like the vector form is now working properly, but I’ll test it a few more times before I say so for sure. It is worth mentioning that the BLAS tests with both aesara and pytensor still don’t show the -lmkl_rt flag, and both take the same ~10 sec.

But now I get a lot of different warnings when importing and using pymc. Nonetheless, I believe the main question of this thread has been answered, so I’ll create a new thread with my other questions regarding MKL/BLAS and all these warnings.

Thanks once again for your very helpful assistance.

Definitely make a new thread. I’m out of my depth on AMD vs Intel stuff, but a quick search turned up a discussion claiming that Intel purposely cripples AMD chips when using MKL, so you might not want to use it at all. OTOH I have no idea. I hope someone more knowledgeable can chime in.