Dynamic RV shape or choosing from it

Hello, i need a help with modeling some simple stuff:
A have small and simple generative model, which create users with some dialogs (0-3), each dialog took some time from exponential dist.
A can’t figure out how create this dynamic shape (0-3), so for example i have 1000 users and each can create 0-3 dialogs… so i have dynamic dataset size.

import numpy as np
import numpy.random as rnd
import pandas as pd

rnd.seed(42)

users_count = 1_000
users_rct_prob = np.array([.1, .9])  # control and test group
dialog_prob = np.array([.4, .4, .15, .05])  # prob to write 0-3 dialogs, zero = not wrote
dialog_auto = np.array([.33, .33, .34])  # probs factor [automatize dialog, reduce duration, do nothing]
dialog_auto_mult = np.array([0, .33, 1])  # factor for dialog duration
dialog_duration_mean = 4  # param for duration


def generate_data():
    dialogs_users_rct = rnd.choice(np.arange(len(users_rct_prob)), p=users_rct_prob, size=users_count)
    dialogs_users_count = rnd.choice(np.arange(len(dialog_prob)), p=dialog_prob, size=users_count)

    dialogs_rct = np.repeat(dialogs_users_rct, dialogs_users_count)
    dialogs_count = np.repeat(dialogs_users_count, dialogs_users_count)
    dialogs_user_id = np.repeat(np.arange(users_count), dialogs_users_count)

    dialogs_auto = rnd.choice(dialog_auto_mult, p=dialog_auto, size=sum(dialogs_users_count))
    dialog_duration = rnd.exponential(scale=dialog_duration_mean, size=sum(dialogs_users_count))

    dialog_duration_observed = dialog_duration - dialog_duration * dialogs_auto * dialogs_rct

    df = pd.DataFrame({
        'user': dialogs_user_id,
        'duration_true': dialog_duration,
        'duration_obs': dialog_duration_observed,
        'auto': dialogs_auto,
        'group': dialogs_rct,
        'count': dialogs_count,
    })
    return df

Here i used choice, but in real example it could be poisson rv.
I finking about create expotential rv with shape (1000, 4) and choose from it only that i need, but i don’t know how to do it.

So can you help me to create pymc model for this generative model?

I have written this code:

def main():
    data = generate_data()

    import pymc as pm
    import pytensor.tensor as at
    import arviz as az

    indexes = np.indices((users_count, len(dialog_prob)))[1]

    with pm.Model(coords={'user': range(users_count), 'num': range(len(dialog_prob))}) as m:
        dialog_duration = pm.Exponential('', lam=2, dims=('user', 'num'))
        dialog_count = pm.Categorical('dialog_count', p=dialog_prob, shape=users_count, dims='user')

        dialog_duration = pm.Deterministic('all_duration',
                                           at.where(indexes < dialog_count[:, None], dialog_duration, 0),
                                           dims=('user', 'num'))
        user_duration = pm.Deterministic('user_duration', at.sum(dialog_duration, axis=1), dims='user')

        trace = pm.sample_prior_predictive(random_seed=42)

    print(az.summary(trace))
    breakpoint()
    print()

This code samples correct sum by users, but i don’t know how to reshape dialog_duration to matrix like this:

[
[user_0, dialog_0_len],
[user_0, dialog_1_len],
[user_1, dialog_0_len],
[user_4, dialog_0_len],
]

is it possible? Or maybe exists some another way to sample data like this?

pytensor.tensor can do most of the things numpy can do, so you can just recycle a lot of the code you already wrote:

n_users = 1000

coords = {
    'dialogue':[0, 1, 2, 3],
    'user_id':np.arange(n_users, dtype='int'),
    'action':['automatize', 'reduce', 'pass'],
}
with pm.Model(coords=coords) as m:
    p_control = pm.Beta('p_control', 1, 1)
    p_dialogue = pm.Dirichlet('p_dialogue', a=[1, 1, 1, 1], dims=['dialogue'])
    p_action = pm.Dirichlet('p_action', a=[1, 1, 1], dims=['action'])
    logit_action_factor = pm.Normal('logit_dialogue_factor', mu=0, sigma=3, transform=pm.distributions.transforms.ordered,
                                      dims=['action'],
                                      initval=[-1, 0, 1])
    dialogue_factor = pm.Deterministic('action_factor', pm.math.softmax(logit_action_factor), dims=['action'])
    
    is_control = pm.Bernoulli('is_control', p_control, dims=['user_id'])
    n_dialogues = pm.Categorical('n_dialogues', p_dialogue, dims=['user_id'])
    user_idxs  = pt.repeat(pt.arange(n_users), n_dialogues)

That gets you the first column of your output, the user ids. For the second column, I couldn’t think of anything super elegant, so I went with a loop that exploits the fact that zip(n_dialogues.cumsum()[:-1], n_dialogues.cumsum()[1:]) gives, at each iteration, the slice where the next “run” of dialogues occurs (that is, it automatically ignores users who didn’t write anything, and gives length equal to the number of dialogues written):

    n_obs = user_idxs.shape[0]
    _dialogue_idxs = pt.zeros(n_obs, dtype='int')
    dialogue_idxs, _ = pytensor.map(lambda start, stop, arr: pt.set_subtensor(arr[start:stop], pt.arange(stop-start)), 
                                    sequences=[n_dialogues.cumsum()[:-1], n_dialogues.cumsum()[1:]],
                                    non_sequences=[_dialogue_idxs])
    dialogue_idxs = pm.Deterministic('dialogue_idx', dialogue_idxs[-1])

From here you could use dialogue_idx and user_idx to fancy-index into dialogue effects and user effects to make a model.

1 Like