Hi @jessegrabowski, thanks for your continued help.
Indeed, you were right once again. I downloaded two files, `datatraining.txt` and `datatest.txt`, and I assumed they were different, so I was using one for `X_train`/`y_train` and the other for `X_test`/`y_test`. But now that I checked them closely, they are the same…
I deleted one of the files and used only the other one. I then did

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
And then ran the model again. Some points to note:
- Either running `x[,1]` etc. or `X_train.Hudmity` etc. takes the same ~200 seconds for the `NUTS: [beta]` part.
- When using the split approach, I get an error when sampling `[obs]`. The error reads: `ValueError: size does not match the broadcast shape of the parameters. (6514,), (6514,), (1629,)`. It is worth mentioning that 6514 is the length of the training data and 1629 is the length of the test data. I would imagine it should be fine for the test set to be smaller than the training set?
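For context, the three shapes in that error look like a plain broadcasting mismatch: the model's parameters are built from the 6514 training rows, while the observed data has only 1629. A minimal NumPy sketch of the same failure (the variable names here are just placeholders):

```python
import numpy as np

# Stand-ins for the shapes in the error message: parameters built from
# the 6514 training rows, observed values from the 1629 test rows.
mu = np.zeros(6514)
sigma = np.ones(6514)
y_obs = np.zeros(1629)

try:
    np.broadcast_shapes(mu.shape, sigma.shape, y_obs.shape)
except ValueError as err:
    print("broadcast fails:", err)
```

So the sampler is (as far as I can tell) trying to line up training-sized parameters with test-sized observations, which can never broadcast.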
Now, regarding the BLAS library tests: I am using pymc >= 4.0.0, and these are the last lines of the output I get (from the second run, since the script says to run it again):

```
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).
Total execution time: 10.09s on CPU (with direct Aesara binding to blas).
```

Should you need the entire output of the `check_blas.py` test, please let me know!
Any ideas on what is happening?
PS: Should I use pymc >= 5.0.0, or …?
-----------------------EDIT------------------------
The vectorized function now takes the same time as the other one. Something happened and it's fixed now. My only remaining question is how to split the data into different sizes. I can only run the model if `X_test` and `X_train` have the same number of samples; as I showed above, if I split them into different sizes I get an error when sampling `[obs]`.
Thanks a lot for all the help!