Understanding 'Coords' and 'dims' to analyze dataset

ykt · September 26, 2024, 10:04pm

Hi,

I am new to PyMC and Bayesian statistics. I am trying to analyze a dataset that has two columns (for example, ‘index’ and ‘rate’). However, each ‘index’ appears with multiple rates. I want to analyze the rates associated with each index and estimate parameters per index. Here is a sample of the code I tried, which raises a value error:

data = pd.DataFrame({
    'index': ['a_1', 'a_1', 'a_2', 'a_2', 'b_1', 'b_1', 'b_2', 'b_2'],
    'rate': [0.1, 0.3, 0.1, 0.3, 0.1, 0.5, 0.4, 0.5]
})

coords = {'index': data['index'].unique()}

with pm.Model(coords=coords) as model:
    alpha_prior = pm.Uniform('alpha_prior', 0, 100,dims='index')  # shape parameter 1
    beta_prior = pm.Uniform('beta_prior', 0, 100,dims='index')   # shape parameter 2

    y = pm.Beta('y', alpha=alpha_prior, beta=beta_prior, observed=data['rate'], dims='index')
    
    trace2 = pm.sample()

Later, I tried the same with the ‘shape’ argument, as in the code below (this version works):

data['id'] = pd.factorize(data['index'])[0]

with pm.Model() as beta_model:

    alpha = pm.Uniform('alpha', 0, 100, shape=4)
    beta = pm.Uniform('beta', 0, 100, shape=4)
    
    y = pm.Beta('y', alpha=alpha[data['id']], beta=beta[data['id']], observed=data['rate'])

    trace = pm.sample()

I am confused about whether PyMC knows which data points relate to which index and analyzes accordingly, or if it tries to match them to the number of dimensions specified in the code.

How can the previous code be altered to work in the same manner as the second one? Or would it be better to create separate models for each index?

cluhmann · September 27, 2024, 1:04am

Welcome! I think you are on the right track. But you have 4 unique indices and 8 observations, so setting dims="index" on y conflicts with the shape of data["rate"]. You need alpha_prior and beta_prior to hold 4 values each, and then index into them so that those 4 values are expanded to match the 8 observations.

import numpy as np
import pandas as pd
import pymc as pm

data = pd.DataFrame(
    {
        "index": ["a_1", "a_1", "a_2", "a_2", "b_1", "b_1", "b_2", "b_2"],
        "rate": [0.1, 0.3, 0.1, 0.3, 0.1, 0.5, 0.4, 0.5],
    }
)


idx, cat = pd.factorize(data["index"])
coords = {"index": cat, "obs": np.arange(len(idx))}

with pm.Model(coords=coords) as model:
    alpha_prior = pm.Uniform("alpha_prior", 0, 100, dims="index")  # shape parameter 1
    beta_prior = pm.Uniform("beta_prior", 0, 100, dims="index")  # shape parameter 2

    y = pm.Beta(
        "y", alpha=alpha_prior[idx], beta=beta_prior[idx], observed=data["rate"], dims="obs"
    )

    trace2 = pm.sample()
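
To see why dims="index" cannot be used on y, it may help to look at what pd.factorize returns. This is only a minimal sketch using pandas and NumPy on the same data; the name labels_per_obs is made up for illustration and is not part of the model above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "index": ["a_1", "a_1", "a_2", "a_2", "b_1", "b_1", "b_2", "b_2"],
        "rate": [0.1, 0.3, 0.1, 0.3, 0.1, 0.5, 0.4, 0.5],
    }
)

# idx: one integer code per row
# cat: the unique labels, in order of first appearance
idx, cat = pd.factorize(data["index"])

print(idx)        # integer codes, one per observation
print(list(cat))  # unique labels

# 4 unique labels vs. 8 observations: this mismatch is why
# y needs its own "obs" dimension rather than dims="index"
print(len(cat), len(idx))

# Indexing the labels with the codes recovers the original column
# (labels_per_obs is a hypothetical name used only in this sketch)
labels_per_obs = np.asarray(cat)[idx]
```

The "index" coordinate has length 4 and the "obs" coordinate has length 8, which matches the shapes of the priors and of the observed data respectively.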

ykt · September 27, 2024, 8:36pm

Hi,

Thank you for your reply. I was able to get the code working with your advice. However, I’m still a bit confused about how this is executed within PyMC. Does the program analyze all the data and try to fit it into the specified number of ‘index’ dimensions, or does it know which data points correspond to which index?

Thank you for your help!

cluhmann · September 28, 2024, 12:35am

PyMC indexes and broadcasts just like NumPy does, so it is idx that tells the model which data points belong to which index. So if we had a parameter with shape (4,):

dummy_param = np.array([42, 21, 37, 59])

and we index into dummy_param with our idx variable defined in the code snippet above (which is shape (8,)):

dummy_param[idx]

then we get a result which is shape (8,), where each value is the corresponding value from dummy_param indicated by idx:

array([42, 42, 21, 21, 37, 37, 59, 59])

The idea is that PyMC should require no additional knowledge about broadcasting if you already understand how numpy works.
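
Here is the same idea as a small self-contained script, in case it helps to run it. It is a sketch rather than anything PyMC-specific: dummy_param stands in for alpha_prior, and the lookup table and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "index": ["a_1", "a_1", "a_2", "a_2", "b_1", "b_1", "b_2", "b_2"],
        "rate": [0.1, 0.3, 0.1, 0.3, 0.1, 0.5, 0.4, 0.5],
    }
)

# Integer code per observation, plus the unique labels
idx, cat = pd.factorize(data["index"])

# One value per unique label, shape (4,); stands in for alpha_prior
dummy_param = np.array([42, 21, 37, 59])

# Integer-array indexing: one value per observation, shape (8,)
per_obs = dummy_param[idx]

# Hypothetical lookup table pairing each observation with the
# parameter value it receives
lookup = pd.DataFrame(
    {"index": data["index"], "rate": data["rate"], "param": per_obs}
)
print(lookup)
```

Every row with the same label ends up with the same parameter value, which is the sense in which the model "knows" which data points correspond to which index: it only knows because idx says so.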
