Am I using the right model?

My dataset has five columns (project_id, responsible_id, date_reference, acquisitiontypes, number_of_tasks_finished). What we call a project is a client that has a contract with us, and the responsible is the person from that project who finishes tasks on a given day for a specific acquisition type (something like the type of task). The idea is to model number_of_tasks_finished so we can estimate a realistic goal for an employee, taking into account the history of the specific project and the task type, since some projects are more dynamic and work-intensive than others, and some types of tasks are more common than others.

About the dataset: there are about 11k rows, covering 167 projects and 7 task types. Since the idea is to generalize to any person, we chose not to include responsible_id.

The model itself is hierarchical, and we want to estimate project random effects, a task-type effect (task_type_beta), and a project × task-type interaction. The problem is that this model samples really slowly, while a model with only the interaction parameter samples really fast. But we want to estimate the task-type effect, because the idea is that when a new project comes in we can make a prediction based on the task type alone (see the sketch after the model code).

Am I doing something wrong?

The model code:

import pymc as pm
import pandas as pd
import numpy as np
import arviz as az

project_categories = pd.Categorical(train_df['project'], train_df['project'].unique())
acquisitiontype_categories = pd.Categorical(train_df['acquisitiontype'], train_df['acquisitiontype'].unique())

coords = {
    "projects": project_categories.categories, 
    "acquisitiontypes": acquisitiontype_categories.categories
}
with pm.Model(coords=coords) as model:
    project_codes = pm.Data("project_codes", project_categories.codes)
    acqtype_codes = pm.Data("acqtype_codes", acquisitiontype_categories.codes)
    y_data = pm.Data("y_data", train_df['tasks_finished'])
    
    alpha = pm.Normal("interaction_mean", 0, 0.5)

    project_sd = pm.Exponential("project_sd", 3)
    project_effect = pm.Normal("project_effect", 
                                 0, project_sd, 
                                 dims=('projects'))


    channel_sd = pm.Exponential("channel_sd", 3)
    channel_effect = pm.Normal("channel_effect", 
                                 0, channel_sd, 
                                 dims=('acquisitiontypes'))

    interaction_deviation = pm.Exponential("interaction_deviation", 3)
    interaction_effect = pm.Normal("interaction_effect", 
                                 0, interaction_deviation, 
                                 dims=('projects', 'acquisitiontypes'))

    mu = pm.math.exp(
            alpha + project_effect[project_codes] + channel_effect[acqtype_codes] + interaction_effect[project_codes, acqtype_codes]
        )
    y = pm.Poisson("y", mu, observed=y_data, shape=project_codes.shape)

    trace = pm.sample(target_accept=0.9)
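
On the "new project" prediction idea, here is a minimal sketch of how that could work using only the population-level pieces, assuming the `trace` from the code above (it holds the unseen project's own random effect at its prior mean of 0; for a full predictive you could additionally draw it from Normal(0, project_sd)):

import numpy as np

# Expected tasks for a brand-new project, per acquisition type, using only
# the global intercept (stored as "interaction_mean") and the task-type
# effect; the new project's random effect is fixed at its prior mean of 0.
post = trace.posterior
mu_new = np.exp(post["interaction_mean"] + post["channel_effect"])
print(mu_new.mean(dim=("chain", "draw")))  # posterior-mean counts per type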

If you like your model but are concerned about it being slow, try the following:

trace = pm.sample(target_accept=0.9, nuts_sampler='nutpie', nuts_sampler_kwargs={'backend': 'jax'})

You’ll need to pip install jax and nutpie, but this should give you a very large speedup.


I think this is going to give you the centered parameterization of the hierarchical model. You might want to rewrite it to draw channel_effect_std from a standard normal and then set channel_effect = channel_sd * channel_effect_std. Whether the centered or non-centered form is more efficient will depend on how informative the data and prior are about the parameter values: the more data and the stronger the prior, the more likely the centered parameterization is to be better.
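
For example, a non-centered version of the channel effect might look like this (a minimal sketch; the same pattern applies to project_effect and interaction_effect, and coords is the dict defined in the original model):

import pymc as pm

with pm.Model(coords=coords) as model:
    channel_sd = pm.Exponential("channel_sd", 3)
    # Draw a standard normal and scale it deterministically: this moves the
    # sd out of the Normal's density, which often helps NUTS when the data
    # are only weakly informative about the group effects.
    channel_effect_std = pm.Normal("channel_effect_std", 0, 1,
                                   dims="acquisitiontypes")
    channel_effect = pm.Deterministic("channel_effect",
                                      channel_sd * channel_effect_std,
                                      dims="acquisitiontypes")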

The Exponential distribution in PyMC uses a rate (inverse scale) parameterization, so the prior on your standard deviations has a mean of 1/3 and a standard deviation of 1/3 (see the pymc.Exponential documentation).
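
You can sanity-check the rate convention directly, e.g.:

import pymc as pm

# lam=3 is a rate of 3, so the prior mean and sd are both 1/3 ≈ 0.333.
draws = pm.draw(pm.Exponential.dist(3), draws=100_000, random_seed=1)
print(draws.mean(), draws.std())  # both close to 0.333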