Predicting occupation from college major

Intro

I’m trying to predict a categorical variable from another categorical variable. I’m providing a minimal example here in the hope that someone can help me understand how to proceed.

The setup is that each person has a college major and eventually gets a job. I want to predict the job from the college major. Here is the data that I have.

      major        job
0   English     Writer
1   Biology  Scientist
2   Physics   Engineer
3   Biology     Doctor
4   Physics  Scientist
5   Biology     Doctor
6   Biology     Doctor
7   Biology     Doctor
8   Biology     Doctor
9   Biology     Doctor
10  Biology     Doctor
11  Biology     Doctor
12  Biology     Doctor
13  Biology     Doctor
14  Biology     Doctor

Since the Dirichlet is the conjugate prior of the categorical distribution, I am using an uninformative Dirichlet prior over the three majors. Then I try to predict the job for some unobserved people. My prediction code is based on the sample_posterior_predictive example here.

Code

import arviz
import pandas as pd
import pymc3
import numpy as np
import theano


def build(major, job, n_majors):
    with pymc3.Model() as mod:
        # Flat Dirichlet prior over a single probability vector of length n_majors.
        p = pymc3.Dirichlet("p", a=np.array([1] * n_majors))
        # Each person's job, with the probability vector indexed by their major code.
        out = pymc3.Categorical("out", p=p[major], observed=job)
    return mod


def main(data):
    print(data)
    n_majors = len(set(data["major"]))
    majors = theano.shared(data["major"].cat.codes.values)
    model = build(major=majors, job=data["job"].cat.codes.values, n_majors=n_majors)
    pymc3.model_to_graphviz(model).render()
    trace = pymc3.sample(model=model)
    inference = arviz.from_pymc3(trace)
    print(arviz.summary(inference))

    # Create some out-of-sample people to forecast.
    out_of_sample_majors = [1] * 4 # They all major in biology.
    # Replace the modeled data with the new data.
    majors.set_value(out_of_sample_majors)
    # Make predictions about the new data.
    pred = pymc3.sample_posterior_predictive(trace=trace, model=model)

    print(pred)


if __name__ == "__main__":
    main(
        data=pd.DataFrame(
            {
                "major": pd.Categorical(["English", "Biology", "Physics", "Biology", "Physics"] + ['Biology']*10),
                "job": pd.Categorical(["Writer", "Scientist", "Engineer", "Doctor", "Scientist"] + ['Doctor']*10),
            }
        )
    )

Output

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p]
Sampling 4 chains, 0 divergences: 100%|██████████| 4000/4000 [00:01<00:00, 3349.01draws/s]
       mean     sd  hpd_3%  hpd_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
p[0]  0.319  0.152   0.044    0.588      0.004    0.003    1506.0  1478.0    1501.0    1471.0   1.01
p[1]  0.466  0.171   0.151    0.758      0.004    0.003    1737.0  1737.0    1710.0    1263.0   1.00
p[2]  0.215  0.134   0.013    0.464      0.003    0.002    1783.0  1448.0    1845.0    1026.0   1.00
100%|██████████| 2000/2000 [00:01<00:00, 1274.40it/s]
>>> pred['out']
array([[1, 3, 2, 3, 2, 2, 3, 0, 3, 2, 3, 3, 3, 2, 2],
       [1, 2, 2, 2, 1, 2, 2, 0, 0, 3, 0, 1, 2, 0, 2],
       [2, 2, 2, 1, 3, 2, 0, 2, 3, 0, 0, 0, 3, 1, 2],
       [1, 1, 1, 1, 1, 1, 2, 0, 3, 3, 2, 3, 2, 0, 2],
       [1, 1, 2, 1, 1, 1, 3, 3, 2, 3, 1, 1, 3, 3, 0],

But something is wrong.

  1. The predictions have 15 columns, even though I only put in 4 out-of-sample majors.
  2. There are not enough predicted Doctors (category 0). Since almost all of the modeled Biology majors are Doctors, the out-of-sample Biology majors should be predicted as Doctors too.

Question

How can I predict the jobs with a simple model, based only on the observed proportions of jobs within each major?

Hi,

  • I wonder if your model is misparametrized: if I understood correctly, your goal is to infer the probability (and hence raw numbers) of jobs, but you’re actually inferring the probability of each major in the model above. I think you’re looking for a multinomial softmax regression (see Code 10.58 in this notebook): each job has a probability of being chosen, and this probability is influenced by the choice of major (your predictor variable) – and probably by other factors as well. I don’t think the current model is modelling this process. I’ve sketched the core of such a regression after these points.

  • Note that your OOS majors are not for biology: you take [1] * 4 but biology is index 0.
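
To give a feel for the structure, here is a rough, untested sketch of the core of such a model. It assumes the counts n_majors and n_jobs from your code, and I’m leaving out the observed line, which is where persons get wired up to their majors:

import pymc3
import theano.tensor as tt

with pymc3.Model() as sketch:
    # One linear predictor per job: a coefficient for each (major, job) pair,
    # with the first job's column pinned to zero for identifiability.
    b = pymc3.Normal("b", shape=(n_majors, n_jobs - 1))
    logits = tt.concatenate([tt.zeros((n_majors, 1)), b], axis=1)
    p = tt.nnet.softmax(logits)  # rows: majors; columns: job probabilities

Each row of p is then the job distribution implied by one major.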

Hope this helps :vulcan_salute:

Note that your OOS majors are not for biology: you take [1] * 4 but biology is index 0.

Right, good point.

Here I try reparametrizing the model as a linear model with a categorical outcome.

import arviz
import pandas as pd
import pymc3
import numpy as np
import theano
import theano.tensor
import patsy

def softmax(w):
    # NB: this normalizes over axis=0 (down the columns), not within each row.
    e = np.exp(w)
    return e / e.sum(axis=0)

def main(data):
    print(data)
    n_majors = len(set(data["major"]))
    n_jobs = len(set(data["job"]))
    formula = "C(job) ~ 0 + C(major)"
    y, x = patsy.dmatrices(formula, data=data)
    x = np.array(x)
    majors = data['major'].cat.codes.values

    tx = theano.shared(x)
    with pymc3.Model() as model:
        # tx shape: (N, M), the one-hot major design matrix
        b = pymc3.Normal("b", shape=(n_majors, n_jobs - 1)) # shape: (M, J - 1)
        xb = tx @ b # shape: (N, J - 1)
        p = softmax(xb)
        # y shape: (N, J), the one-hot job indicators from patsy
        pymc3.Categorical("y", p=p[np.array(majors)], observed=y.T)

    pymc3.model_to_graphviz(model).render()
    trace = pymc3.sample(model=model)
    inference = arviz.from_pymc3(trace)
    print(arviz.summary(inference))

    # Reuse the same design matrix as the "out-of-sample" data for prediction.
    tx.set_value(x.copy())
    predictions = pymc3.sample_posterior_predictive(model=model, trace=trace)
    return predictions['y']

if __name__ == "__main__":
    data = pd.DataFrame(
        {
            "major": pd.Categorical(["English", "Biology", "Physics", "Biology", "Biology"] + ['Biology']*10),
            "job": pd.Categorical(["Writer", "Scientist", "Writer", "Writer", "Nurse"] + ['Nurse']*10),
        }
    )
    pr = main(data)
    print(pr.mean(axis=0))

Data:

      major        job
0   English     Writer
1   Biology  Scientist
2   Physics     Writer
3   Biology     Writer
4   Biology      Nurse
5   Biology      Nurse
6   Biology      Nurse
7   Biology      Nurse
8   Biology      Nurse
9   Biology      Nurse
10  Biology      Nurse
11  Biology      Nurse
12  Biology      Nurse
13  Biology      Nurse
14  Biology      Nurse

Summary:

         mean     sd  hpd_3%  hpd_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
b[0,0] -0.275  0.852  -1.849    1.265      0.021    0.018    1670.0  1118.0    1674.0    1603.0    1.0
b[0,1]  0.212  0.839  -1.299    1.785      0.020    0.018    1726.0  1032.0    1733.0    1353.0    1.0
b[1,0]  0.161  0.854  -1.366    1.828      0.021    0.020    1675.0   950.0    1666.0    1018.0    1.0
b[1,1] -0.164  0.817  -1.696    1.339      0.021    0.018    1474.0   994.0    1479.0    1122.0    1.0
b[2,0]  0.096  0.900  -1.514    1.894      0.021    0.022    1755.0   866.0    1758.0    1233.0    1.0
b[2,1] -0.066  0.893  -1.751    1.501      0.022    0.019    1618.0  1071.0    1630.0    1355.0    1.0

Predictions:

In [6]: pr.shape                                                                                                                                
Out[6]: (2000, 3, 15)

So far so good. But I don’t understand the predictions. The outcome variable is categorical, so each person should have exactly one predicted job. But in the first draw (pr[0]), two people (columns) have two different predicted jobs, and several more have zero predicted jobs.

In [7]: pr[0]                                                                                                                                   
Out[7]: 
array([[1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

Am I misinterpreting the results? What is the correct way to write this model?

I think you’re still making the same mistake: you’re indexing the Categorical’s probability vector by the majors, while I think it should be indexed by the jobs: Categorical("y", p=p[jobs], observed=y.T), not Categorical("y", p=p[majors], observed=y.T).
Each job has a latent probability of being chosen, which determines the binary outcomes produced by the Categorical.

Upstream from that, each job’s probability is influenced by the choice of majors. So each job will have a different parameter attached to each major, conveying the influence of the major on the given job. This is why each job should have its own linear model, like in the example NB I cited above.
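
If it helps, here is a small NumPy-only sketch of the shape bookkeeping I have in mind (all numbers made up):

import numpy as np

n_majors, n_jobs = 3, 4
rng = np.random.default_rng(0)
# One coefficient per (major, job) pair: b[m, j] is major m's pull toward
# job j, i.e. each job gets its own column of parameters.
b = rng.normal(size=(n_majors, n_jobs))

# Row-wise softmax: each major gets its own distribution over the jobs.
p = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
print(p.shape)        # (3, 4)
print(p.sum(axis=1))  # each major's job probabilities sum to 1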

One last thing: I strongly encourage you to do prior and posterior predictive checks to better understand the model’s predictions, as I explained in this NB. Multinomial regression models live in a weird multidimensional space that’s difficult to interpret without plots.
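
For instance, a quick prior predictive check could look like this (a sketch only, assuming the model context from your code above):

with model:
    prior_pred = pymc3.sample_prior_predictive(samples=500)
# prior_pred["y"] contains jobs simulated from the priors alone; if those
# frequencies already look implausible, the priors rather than the data
# are driving the predictions.
print(prior_pred["y"].mean(axis=0))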

I’m sorry I don’t have time to dig deeper these days, but I hope this already helps you :vulcan_salute:

You’re right, I hadn’t fixed that.

After indexing on the outcome variable, the predictions are still very weird. In this example, I provide 94 observations with (major=Statistics, job=Statistician) and 5 other observations. Then I create an out-of-sample X with 33 observations of each major and predict the jobs for each out-of-sample observation.

Because the input data have so many Statistics->Statistician pairs, I would expect a high rate of predicted Statistician jobs. But the output actually predicts Statistician as the least common outcome, accounting for only 2 percent of the forecasted outcomes.

import warnings

import arviz
import pandas as pd
import pymc3
import numpy as np
import theano
import theano.printing
import theano.tensor
import patsy
import matplotlib
import matplotlib.pyplot as plt
import plotnine as p9


warnings.simplefilter(action="ignore", category=FutureWarning)


matplotlib.use("TkAgg")


def softmax(w):
    # NB: this normalizes over axis=0 (down the columns), not within each row.
    e = np.exp(w)
    return e / e.sum(axis=0)


n_statisticians = 94

data = pd.DataFrame(
    {
        "major": pd.Categorical(
            ["English", "Chemistry", "English", "Chemistry", "Chemistry"]
            + ["Statistics"] * n_statisticians
        ),
        "job": pd.Categorical(
            ["Writer", "Scientist", "Teacher", "Doctor", "Nurse"]
            + ["Statistician"] * n_statisticians
        ),
    }
)


n_majors = len(set(data["major"]))
n_jobs = len(set(data["job"]))
formula = "C(job) ~ 0 + C(major)"
y_obs, x = patsy.dmatrices(formula, data=data)
x = np.array(x)
majors = data["major"].cat.codes.values
jobs = data["job"].cat.codes.values


with pymc3.Model() as model:
    X = pymc3.Data("X", x)
    a = pymc3.Normal("a", shape=n_jobs - 1)
    a_full = theano.tensor.concatenate([[0], a])
    b = pymc3.Normal("b", shape=(n_majors, n_jobs - 1))
    z = np.zeros(n_majors)[:, np.newaxis]
    b_full = theano.tensor.concatenate([z, b], axis=1)
    xb = pymc3.math.dot(X, b)
    xb_full = pymc3.math.dot(X, b_full)
    mu = a_full + xb_full
    p = softmax(mu)
    y = pymc3.Multinomial("y", n=1, p=p[jobs], observed=y_obs)

    theano.printing.Print("X", attrs=["shape"])(X)
    theano.printing.Print("b", attrs=["shape"])(b)
    theano.printing.Print("xb", attrs=["shape"])(xb)
    theano.printing.Print("mu", attrs=["shape"])(mu)
    theano.printing.Print("p", attrs=["shape"])(p)
    theano.printing.Print("y", attrs=["shape"])(y)


pymc3.model_to_graphviz(model).render()
trace = pymc3.sample(model=model)
# Out-of-sample X: cycle through each major's one-hot row 33 times; shape (99, n_majors).
oos_x = np.tile(np.unique(x, axis=0), len(data) // len(set(data.major))).T
pymc3.set_data({"X": oos_x}, model=model)
predictions = pymc3.sample_posterior_predictive(model=model, trace=trace)
inference = arviz.from_pymc3(trace, posterior_predictive=predictions)
print("n_majors:", n_majors)
print("n_jobs:", n_jobs)
print("summary:")
print(arviz.summary(inference))
print('out-of-sample x rates:', oos_x.mean(0))
print('predictions: ', dict(zip(data.job.cat.categories, predictions["y"].mean(0).mean(0).round(2))))

Output

X shape = (99, 3)
b shape = (3, 5)
xb shape = (99, 5)
mu shape = (99, 6)
p shape = (99, 6)
y shape = (99, 6)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b, a]
Sampling 4 chains, 0 divergences: 100%|██████████| 4000/4000 [00:02<00:00, 1877.44draws/s]
100%|██████████| 2000/2000 [00:06<00:00, 287.41it/s]
n_majors: 3
n_jobs: 6
summary:
         mean     sd  hpd_3%  hpd_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
a[0]   -0.026  1.006  -1.904    1.792      0.019    0.023    2756.0   933.0    2743.0    1531.0   1.00
a[1]    0.002  0.987  -1.760    1.888      0.019    0.022    2602.0   995.0    2625.0    1530.0   1.00
a[2]   -0.011  0.981  -1.733    1.886      0.016    0.024    3928.0   808.0    3961.0    1100.0   1.00
a[3]   -0.008  1.014  -1.861    1.825      0.018    0.022    3278.0  1050.0    3293.0    1684.0   1.00
a[4]   -0.004  0.996  -1.859    1.875      0.016    0.023    3730.0   922.0    3675.0    1609.0   1.00
b[0,0] -0.504  0.801  -2.053    0.911      0.018    0.015    1943.0  1517.0    1933.0    1597.0   1.00
b[0,1] -1.066  0.814  -2.529    0.461      0.017    0.013    2277.0  1911.0    2286.0    1462.0   1.00
b[0,2]  2.550  0.745   1.202    3.964      0.017    0.012    1928.0  1834.0    1943.0    1655.0   1.00
b[0,3] -0.483  0.772  -1.902    0.993      0.016    0.014    2462.0  1543.0    2472.0    1881.0   1.00
b[0,4] -0.866  0.847  -2.458    0.666      0.018    0.014    2338.0  1793.0    2330.0    1571.0   1.00
b[1,0] -0.210  0.936  -1.861    1.603      0.015    0.022    3737.0   934.0    3752.0    1486.0   1.00
b[1,1]  0.561  0.937  -1.153    2.364      0.018    0.020    2571.0  1116.0    2495.0    1402.0   1.01
b[1,2] -0.621  0.847  -2.164    0.981      0.015    0.016    3120.0  1456.0    3172.0    1488.0   1.00
b[1,3] -0.225  0.920  -2.009    1.501      0.017    0.021    3098.0   990.0    3108.0    1435.0   1.00
b[1,4] -0.178  0.949  -1.984    1.503      0.017    0.021    3162.0  1022.0    3156.0    1533.0   1.00
b[2,0]  0.692  0.799  -0.707    2.228      0.018    0.014    2066.0  1530.0    2087.0    1840.0   1.00
b[2,1]  0.549  0.777  -0.891    2.065      0.016    0.014    2466.0  1632.0    2466.0    1788.0   1.00
b[2,2] -1.927  0.750  -3.312   -0.476      0.016    0.012    2200.0  2046.0    2224.0    1801.0   1.00
b[2,3]  0.699  0.775  -0.786    2.108      0.015    0.013    2605.0  1852.0    2610.0    1902.0   1.00
b[2,4]  1.010  0.836  -0.620    2.527      0.018    0.014    2237.0  1706.0    2250.0    1545.0   1.00
out-of-sample x rates: [0.33333333 0.33333333 0.33333333]
predictions:  {'Doctor': 0.14, 'Nurse': 0.21, 'Scientist': 0.17, 'Statistician': 0.02, 'Teacher': 0.21, 'Writer': 0.24}

Hi Mina

It seems there are still a few issues in the code. Mainly, on the observed line you should be indexing p using majors rather than jobs. You want to use the major to predict the job, so you need to look up the probability row for each person's major in order to predict y_obs.

Here is my modified code

with pymc3.Model() as model:
    data_majors = pymc3.intX(pymc3.Data("data_majors", majors))
    
    a = pymc3.Normal("a", shape=n_jobs - 1)
    a_full = pymc3.Deterministic("a_full", theano.tensor.concatenate([[0], a]))
    b = pymc3.Normal("b", shape=(n_majors, n_jobs - 1))
    z = np.zeros(n_majors)[:, np.newaxis]
    b_full = pymc3.Deterministic("b_full", theano.tensor.concatenate([z, b], axis=1))
    
    X = np.eye(n_majors)
    xb_full = pymc3.math.dot(X, b_full)
    unconfined_space = a_full + xb_full
    p = pymc3.Deterministic("p", theano.tensor.nnet.softmax(unconfined_space))
    y = pymc3.Multinomial("y", n=1, p=p[data_majors,:], observed=y_obs)
    
    trace = pymc3.sample()
    inference = arviz.from_pymc3(trace)


with model:
    pymc3.set_data({"data_majors": np.arange(n_majors)})
    predictions = pymc3.sample_posterior_predictive(trace=trace)

print(pd.DataFrame(predictions['y'].mean(0), index=data["major"].cat.categories, columns=data['job'].cat.categories).round(3))

A few points:

  • I’ve used pymc3.Deterministic, which allows you to view intermediate variables in the summary. You can check the shape and values of each to ensure they are reasonable.
  • I used pymc3.Data() to store the list of majors. This allows X, xb_full and p to all be expressed in terms of just the number of majors rather than the number of observations. On the observation line I index into p by data_majors to match the length of y_obs. This gives p a much cleaner interpretation: the 6 job probabilities for each major.
  • Similarly, in the predictions I haven’t repeated the majors; there is just one prediction for each of the three majors (drawn from the 4000 samples).
  • There is a built-in softmax function in Theano (theano.tensor.nnet.softmax); use it rather than writing your own in NumPy. There is a short demonstration of the difference after this list.
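
To see why the axis matters in the hand-rolled NumPy version, here is a small demonstration with made-up logits (rows = majors, columns = jobs):

import numpy as np
import theano
import theano.tensor

logits = np.array([[1.0, 2.0], [3.0, 1.0]])

# Normalizing over axis=0 (down the columns) makes each column sum to 1,
# so the rows are not valid probability vectors.
bad = np.exp(logits) / np.exp(logits).sum(axis=0)
print(bad.sum(axis=1))  # the rows do not sum to 1

# theano.tensor.nnet.softmax normalizes each row, which is what we want here.
f = theano.function([], theano.tensor.nnet.softmax(logits))
print(f().sum(axis=1))  # each row sums to 1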

[Screenshot: the printed prediction table, with one row per major and one column per job.]

That works and is much easier to understand. Thank you.