Intro
I’m trying to predict a categorical variable with another categorical variable. I’m providing a minimal example here in hopes that someone can help me understand how to proceed.
The setup is each person has a college major and eventually gets a job. I want to predict the job using the college major. Here is the data that I have.
major job
0 English Writer
1 Biology Scientist
2 Physics Engineer
3 Biology Doctor
4 Physics Scientist
5 Biology Doctor
6 Biology Doctor
7 Biology Doctor
8 Biology Doctor
9 Biology Doctor
10 Biology Doctor
11 Biology Doctor
12 Biology Doctor
13 Biology Doctor
14 Biology Doctor
Since the conjugate prior of the categorical distribution is the Dirichlet, I am using an uninformative Dirichlet prior for each of the four majors. Then I try to predict what will be the job on some unobserved data. My prediction code is based on the sample_posterior_predictive
example here.
Code
import arviz
import pandas as pd
import pymc3
import numpy as np
import theano
def build(major, job, n_majors):
with pymc3.Model() as mod:
p = pymc3.Dirichlet("p", a=np.array([1] * n_majors))
out = pymc3.Categorical("out", p=p[major], observed=job)
return mod
def main(data):
print(data)
n_majors = len(set(data["major"]))
majors = theano.shared(data["major"].cat.codes.values)
model = build(major=majors, job=data["job"].cat.codes.values, n_majors=n_majors)
pymc3.model_to_graphviz(model).render()
trace = pymc3.sample(model=model)
inference = arviz.from_pymc3(trace)
print(arviz.summary(inference))
# Create some out-of-sample people to forecast.
out_of_sample_majors = [1] * 4 # They all major in biology.
# Replace the modeled data with the new data.
majors.set_value(out_of_sample_majors)
# Make predictions about the new data.
pred = pymc3.sample_posterior_predictive(trace=trace, model=model)
print(pred)
if __name__ == "__main__":
main(
data=pd.DataFrame(
{
"major": pd.Categorical(["English", "Biology", "Physics", "Biology", "Physics"] + ['Biology']*10),
"job": pd.Categorical(["Writer", "Scientist", "Engineer", "Doctor", "Scientist"] + ['Doctor']*10),
}
)
)
Output
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p]
Sampling 4 chains, 0 divergences: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4000/4000 [00:01<00:00, 3349.01draws/s]
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
p[0] 0.319 0.152 0.044 0.588 0.004 0.003 1506.0 1478.0 1501.0 1471.0 1.01
p[1] 0.466 0.171 0.151 0.758 0.004 0.003 1737.0 1737.0 1710.0 1263.0 1.00
p[2] 0.215 0.134 0.013 0.464 0.003 0.002 1783.0 1448.0 1845.0 1026.0 1.00
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:01<00:00, 1274.40it/s]
>>> pred['out']
array([[1, 3, 2, 3, 2, 2, 3, 0, 3, 2, 3, 3, 3, 2, 2],
[1, 2, 2, 2, 1, 2, 2, 0, 0, 3, 0, 1, 2, 0, 2],
[2, 2, 2, 1, 3, 2, 0, 2, 3, 0, 0, 0, 3, 1, 2],
[1, 1, 1, 1, 1, 1, 2, 0, 3, 3, 2, 3, 2, 0, 2],
[1, 1, 2, 1, 1, 1, 3, 3, 2, 3, 1, 1, 3, 3, 0],
But something is wrong.
- The predictions have 15 columns, even though I only put in 4 out-of-sample majors.
- There are not enough predicted Doctors (category
3
). Since almost all of the modeled Biology majors are Doctors, the out-of-sample Biology majors should be predicted as Doctors too.
Question
How can I predict the jobs with a simple model, based only on the proportion of each major and job observed?