I want to be as direct as possible.
I’m trying to implement a logistic regression problem (following Nicole Carlson startguide to PyMC3 - PyData 2016 HLR case study she showed) with 6 categories (6 different types of movements in a bank) and 4 features (day of the month of the movement, times the movement is repeated in a month, individual cost per movement and total spend per month of each movement).
Although I know that I’m feeding the model with a large amount of data (60000 observations among all the categories), the time it takes me to make the inference is very high and I don’t know why. If someone could write me some advice, I’d appreciate it.
I’m leaving my program here:
Full code -> logistic_regression_bankmovements.py (4.3 KB)
Could the problem arise that, when specifying my
model_output, that is, the
in my likelihood, I’m considering all 1s for a specific category and 0s for the others?.
My final goal is to obtain (from the posterior) the probability (between [0,1]) of EACH category, based on certain observations in each feature.
Thanyou so much !
You don’t need to use the shared variables as the
model_input and your
model_output will not change. You can further vectorize the samling of
mu_slope by using
shape=2. Similarly, you can also vectorize the
intercept. I don’t know what the
cat is but if it is an index then you can go ahead and remove the
cat_theano variable also and use the
cat as index directly. This will speed up your computation a little but the most computation is happening at the step
pm.math.sigmoid(...). You can use multiple cores to get a high speed up.
>>> import numpy as np
>>> import theano
>>> import theano.tensor as tt
>>> import pymc3 as pm
>>> y_data = np.zeros(60000)
>>> y_data[:20000] = np.ones(20000)
>>> model_input = np.random.randn(60000, 4)
>>> model_output = y_data
>>> with pm.Model() as model:
... mu = pm.Normal("mu", mu=0, sd=10, shape=2)
... sigma = pm.Uniform("sigma", 0., 10., testval=2., shape=2)
... intercept = pm.Normal("intercept", mu=mu, tau=sigma, shape=60000)
... slope = pm.Normal("slope", mu=mu, tau=sigma, shape=(60000, 4))
... p = pm.math.sigmoid(intercept + pm.math.sum(slope * model_input, 1))
... like = pm.Bernoulli("like", p, observed=model_output)
>>> with model:
... trace = pm.sample(500, init='advi', n_init=5000, chains=2, tune=500, target_accept=0.95)
(I am using core i7 cpu on a linux machine)
In addition to Tirth’s advice, you can also try more regularizing priors – with hierarchical GLMs, this is especially important.
Here is a guide for choosing good priors, and a NB with regularizing priors for GLMs.
Hope this helps
Hi @tirthasheshpatel thankyou so much for your tips. Only one question about your comments, if I have 6 different categories, what is the reason do you use 60,000 (total number of observations in my data) instead of 6, inside
slope ?. Thanks!
@AlexAndorra @tirthasheshpatel The last doubt about my model because I’m going around it and I can’t get it. You would help me understand it much better.
My goal is get (from the posterior) the probability ([0,1]) of EACH category for specific features. (example : 4 features (
day of the month of the movement == 10,
times the movement is repeated in a month == 2,
individual cost per movement == 50, t
otal spend per month of each movement == 100).
What numpy.array for observed value is valid ? or am I saying something crazy?
(in the code I generate a value of observed with 20000 lines of 1s for the category_1 and 0s for the other categories (the other 40000 lines), because I want to evaluate what I have told you before but only for the category_1)
Thankyou from a beginner!
I’m not sure I understand what the problem is
Does the model sample? Do you get an error?
@AlexAndorra yes ! The model samples but I think there are some wrong things because it returns me a rhat greater than 1.05 for the slope parameters.
Did you some diagnotics and plotting checks to see if / where the problem could be?