Understanding 'Coords' and 'dims' to analyze dataset

Hi,

I am new to PyMC and Bayesian statistics. I am trying to analyze a dataset that has two columns (for example, ‘index’ and ‘rate’). However, each ‘index’ has multiple rates. I want to analyze the rates associated with each index and estimate a parameter per index. Here is a sample of the code I tried, which raises a ValueError:

data = pd.DataFrame({
    'index': ['a_1', 'a_1', 'a_2', 'a_2', 'b_1', 'b_1', 'b_2', 'b_2'],
    'rate': [0.1, 0.3, 0.1, 0.3, 0.1, 0.5, 0.4, 0.5]
})

coords = {'index': data['index'].unique()}

with pm.Model(coords=coords) as model:
    alpha_prior = pm.Uniform('alpha_prior', 0, 100,dims='index')  # shape parameter 1
    beta_prior = pm.Uniform('beta_prior', 0, 100,dims='index')   # shape parameter 2

    y = pm.Beta('y', alpha=alpha_prior, beta=beta_prior, observed=data['rate'], dims='index')
    
    trace2 = pm.sample()

Later, I tried the same thing with the ‘shape’ argument, as in the code below (this code works):

data['id'] = pd.factorize(data['index'])[0]

with pm.Model() as beta_model:

    alpha = pm.Uniform('alpha', 0, 100, shape=4)
    beta = pm.Uniform('beta', 0, 100, shape=4)
    
    y = pm.Beta('y', alpha=alpha[data['id']], beta=beta[data['id']], observed=data['rate'])

    trace = pm.sample()

I am confused about whether PyMC knows which data points relate to which index and analyzes accordingly, or if it tries to match them to the number of dimensions specified in the code.

How can the previous code be altered to work in the same manner as the second one? Or would it be better to create separate models for each index?

Welcome! I think you are on the right track. But you have 4 unique indices and 8 observations, so setting dims="index" on y conflicts with the shape of data["rate"]. You need alpha_prior to hold 4 values, and then map those 4 values onto the 8 observations by indexing with an integer array:

import numpy as np
import pandas as pd
import pymc as pm

data = pd.DataFrame(
    {
        "index": ["a_1", "a_1", "a_2", "a_2", "b_1", "b_1", "b_2", "b_2"],
        "rate": [0.1, 0.3, 0.1, 0.3, 0.1, 0.5, 0.4, 0.5],
    }
)


idx, cat = pd.factorize(data["index"])
coords = {"index": cat, "obs": np.arange(len(idx))}

with pm.Model(coords=coords) as model:
    alpha_prior = pm.Uniform("alpha_prior", 0, 100, dims="index")  # shape parameter 1
    beta_prior = pm.Uniform("beta_prior", 0, 100, dims="index")  # shape parameter 2

    y = pm.Beta(
        "y", alpha=alpha_prior[idx], beta=beta_prior[idx], observed=data["rate"], dims="obs"
    )

    trace2 = pm.sample()
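
As a quick sanity check on the shapes involved, here is a small sketch that uses only pandas and NumPy (no sampling). It compares the lengths of the two coordinates against the data. The names n_index, n_obs and n_data are made up for this illustration and are not part of PyMC:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "index": ["a_1", "a_1", "a_2", "a_2", "b_1", "b_1", "b_2", "b_2"],
        "rate": [0.1, 0.3, 0.1, 0.3, 0.1, 0.5, 0.4, 0.5],
    }
)

# idx: one integer code per row; cat: the unique labels in order of appearance
idx, cat = pd.factorize(data["index"])
coords = {"index": cat, "obs": np.arange(len(idx))}

# Illustrative names only: lengths of each coordinate and of the observed data
n_index = len(coords["index"])  # number of unique groups
n_obs = len(coords["obs"])      # number of rows
n_data = len(data["rate"])      # number of observed rates

print(n_index, n_obs, n_data)
```

The "obs" coordinate has the same length as data["rate"], while "index" does not, which is why y is labelled with dims="obs" and the per-group parameters with dims="index".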

Hi,

Thank you for your reply. I was able to get the code working with your advice. However, I’m still a bit confused about how this is executed within PyMC. Does the program analyze all the data and try to fit it into the specified number of ‘index’ dimensions, or does it know which data points correspond to which index?

Thank you for your help!

PyMC indexes and broadcasts just like NumPy does, so it knows which data point belongs to which index only because idx tells it. So if we had a parameter with shape (4,):

dummy_param = np.array([42, 21, 37, 59])

and we index into dummy_param with our idx variable defined in the code snippet above (which is shape (8,)):

dummy_param[idx]

then we get a result which is shape (8,), where each value is the corresponding value from dummy_param indicated by idx:

array([42, 42, 21, 21, 37, 37, 59, 59])

The idea is that PyMC should require no additional knowledge about broadcasting if you already understand how numpy works.
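
To make the row-to-group mapping explicit, here is a minimal NumPy/pandas sketch of the same lookup, assuming the data frame from the snippet above. The names labels, per_row and lookup are hypothetical and used only for this example:

```python
import numpy as np
import pandas as pd

labels = pd.Series(["a_1", "a_1", "a_2", "a_2", "b_1", "b_1", "b_2", "b_2"])

# idx holds, for every row, the position of its label in cat
idx, cat = pd.factorize(labels)

dummy_param = np.array([42, 21, 37, 59])  # one value per unique label
per_row = dummy_param[idx]                # one value per observation

# Side-by-side view: each row's label, its integer code, and the value it picks up
lookup = pd.DataFrame({"label": cat[idx], "idx": idx, "param": per_row})
print(lookup)
```

Indexing cat with idx recovers the original labels row by row, which is the same lookup that alpha_prior[idx] performs on the parameter vector inside the model.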
