Regression model using only categorical variables

Hi guys,

I am looking for a way to build a regression model using only categorical variables.
My data consists of a continuous dependent variable and 2 categorical independent variables (one has 20 categories and the other has 1000 categories).
I found several ways to deal with categorical variables, but they are all about binary variables.

My model would be a hierarchical model like
y = a[x2_idx]+ b[x2_idx] * x1

If any of you know a solution, please let me know.
It would be a big help and really appreciated! Thank you!

You need to recode your independent variables.

Your question seems to be less about Bayesian modeling or PyMC3 and more a general one about regression and data analysis. I would recommend looking up regression with categorical variables in a basic stats textbook (or even reading the Wikipedia article).

Let me rephrase the question. What is an efficient way of treating categorical variables in pymc3? I am a newbie here and I see a lot of pm.math.dot in linear model examples. Is there a way to do this dot product without first binarising (pd.get_dummies) your categorical variables? The use case would be models that make use of categorical features with lots of levels.
Thanks

You can use index variables for variables with many categories: instead of having one indicator variable per category, you just have one index variable in which each category is assigned its own integer level – see chapter 5, section 5.3 of Rethinking2 for instance (PyMC port, see code 5.50).
Pandas’ factorize method is very useful for this:

district_id, districts = data.arrondissement.factorize(sort=True)
Ndistricts = len(districts)
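
To make the link with the dummy-variable approach concrete, here is a minimal NumPy sketch (not PyMC3 code). It assumes made-up labels and coefficients, and uses np.unique as a stand-in for factorize(sort=True); the point is only that indexing a coefficient vector gives the same result as the dot product with a one-hot matrix.

```python
import numpy as np

# Hypothetical raw labels standing in for a column like data.arrondissement.
labels = np.array(["b", "a", "c", "a", "b", "c", "a"])

# Sorted unique levels plus one integer code per row, similar in spirit
# to what pandas' factorize(sort=True) returns.
levels, idx = np.unique(labels, return_inverse=True)
n_levels = len(levels)

# One coefficient per level (made-up values, for illustration only).
coef = np.array([0.5, -1.0, 2.0])

# Index-variable approach: look up each row's coefficient directly.
mu_index = coef[idx]

# Dummy-variable approach: build the one-hot matrix, then take a dot product.
one_hot = np.eye(n_levels)[idx]  # shape (n_rows, n_levels)
mu_dot = one_hot @ coef

print(idx)
print(np.allclose(mu_index, mu_dot))
```

With 1000 levels the one-hot matrix has 1000 columns, while the index version only ever stores one integer per row, which is why indexing tends to be the more economical choice.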

Hope this helps :vulcan_salute:

I don’t see the factorize method being used in that part (nor in the notebook), wrong link?

Nevertheless how would you perform the dot product in pymc3 / theano to obtain a linear projection for the location parameter of a pymc distribution?

In one of my attempts I used

X = pd.get_dummies(X, columns=['cate1'], sparse=True)
mu0 = pm.Normal('mu0', 0, 1)
mu1 = pm.Normal('mu1', 0, 10, shape=m)
alpha = pm.Deterministic('alpha', tt.exp(mu0 - pm.math.dot(X, mu1)))
y = pm.BetaBinomial('obs', n=clicks, alpha=alpha, beta=beta, observed=conversions)

No exception was raised, but the code hangs forever :confused:

Yeah that looks good – if you don’t encounter any shape error it’s already a very good sign :sweat_smile: You can also check the size of any tensor with print(pm.math.dot(X, mu1).tag.test_value.shape), this is usually helpful!

But if it’s your first PyMC model, you can also just write down the complete formula b0 + b1 * X1 + b2 * X2... instead of the dot product, to get an intuition of how things work.
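
As a rough illustration of that equivalence, here is a small NumPy sketch (plain arrays, not PyMC3 tensors) with a hypothetical design matrix and made-up coefficients. It writes the linear predictor out term by term and then as a dot product, and checks that both agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: 5 observations, 2 predictors.
X = rng.normal(size=(5, 2))
b0 = 0.3                      # intercept (made-up value)
b = np.array([1.5, -0.7])     # slopes (made-up values)

# Written out term by term: b0 + b1 * X1 + b2 * X2
mu_explicit = b0 + b[0] * X[:, 0] + b[1] * X[:, 1]

# The same linear predictor as a dot product.
mu_dot = b0 + X @ b

print(mu_dot.shape)
print(np.allclose(mu_explicit, mu_dot))
```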

I don’t think the notebook I linked to uses factorize, but it does show how to use categorical variables in regressions. My code snippet does use factorize though – data.arrondissement.factorize(sort=True) – and I linked to the pandas docs.

Hope this helps :vulcan_salute:

Thanks. I know how to encode categorical variables in Python (pandas, scikit-learn, etc.). Nevertheless, I don’t know the best way to do it in this context (pymc3, theano, and Bayesian models in general).

Is pymc3 integrated with pandas sparseArrays or scipy sparse matrices?
Should one assign a distribution to each categorical level in a linear (bayesian) regression?
How many levels are too many for pymc?

The use of predictors in general and categorical features in particular seems to be quite limited in the bayesian context compared to ML in general.

What did you find confusing in the resources I linked to?
As each category gets its own parameter, each needs its own prior, indeed (but this is the same for indicator variables). You can see that in the use of the shape parameter.
You will also find a lot of examples of models with categorical variables in chapter 13 :wink:
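
To illustrate the "one parameter per category" idea for the model from the first post, y = a[x2_idx] + b[x2_idx] * x1, here is a minimal NumPy simulation, not a PyMC3 model. Everything in it is an assumption made for illustration: the sizes, the Normal(0, 1) draws, the noise scale, and treating x1 as numeric. The size=n_groups argument plays the role that shape plays when declaring a prior.

```python
import numpy as np

rng = np.random.default_rng(42)

n_obs, n_groups = 200, 10  # hypothetical sizes, far fewer than 1000 levels
group_idx = rng.integers(0, n_groups, size=n_obs)  # index variable for x2
x1 = rng.normal(size=n_obs)  # x1 treated as numeric in this sketch

# One draw per group, mirroring a prior declared with shape=n_groups.
a = rng.normal(0.0, 1.0, size=n_groups)  # varying intercepts
b = rng.normal(0.0, 1.0, size=n_groups)  # varying slopes

# Indexing picks out each observation's own intercept and slope.
mu = a[group_idx] + b[group_idx] * x1
y = rng.normal(mu, 0.5)  # noise scale 0.5 is an arbitrary choice

print(a.shape, mu.shape, y.shape)
```

The vectors a and b have one entry per category, while mu and y have one entry per observation; no dummy matrix is built at any point.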