Hello! I would like to discretize a continuous variable in PyMC or create a discrete Categorical node that depends on a continuous parent node. My goal is to assign the parent node’s value to a specific interval and map it to a corresponding discrete category. How can I achieve this?
This notebook would probably be a good place to start.
But I want to create discrete sub-nodes, is it possible to achieve this through regression?
What does “discrete sub-nodes” mean?
I’m concerned this question is an XY problem. Could you elaborate on what you’re modeling and how this discretization fits into your larger plan?
For example, continuous nodes a and b have a discrete node c as their child, and the CPT (conditional probability table) of node c is manually specified.
Can you provide a specific numerical example? Using numpy with random draws is fine.
The CPT to be defined for the “combination” node:
import pandas as pd

file_path = 'Combination.xlsx'
df = pd.read_excel(file_path, sheet_name=0, header=None)
Combination_cpt = df.to_numpy().T
Combination_cpt.shape
node definition:
PoA = pm.Triangular('PoA', lower=0, c=0.3, upper=1)
CF = pm.Lognormal('CF', mu=4.5, sigma=0.2)
TEF = pm.Deterministic('TEF', CF * PoA)
vulnerability = pm.Normal('vulnerability', mu=1, sigma=0.5)
MPLEF = pm.Deterministic('MPLEF', TEF * vulnerability)
CoSL = pm.Normal('CoSL', mu=0.7, sigma=0.2)
PLF = pm.Poisson('PLF', mu=MPLEF)
SLF = pm.Binomial('SLF', n=PLF, p=CoSL)
# The "combination" node is a child node of both PLF and SLF, and the CPT data is imported from Excel.
# It is desired to divide both PLF and SLF into 12 intervals each to match the dimensions of CPT
Perhaps have a look into the OrderedLogistic / Probit distributions? Those allow you to map a continuous latent quantity into uneven categorical bins.
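If it helps, here is a minimal sketch of that idea (my own toy example, not your model): a continuous latent variable is mapped to four ordered categories through fixed cutpoints; all names and numbers are illustrative.

import numpy as np
import pymc as pm

with pm.Model() as sketch:
    latent = pm.Normal('latent', mu=0, sigma=1)       # continuous parent
    cutpoints = np.array([-1.0, 0.0, 1.0])            # 3 cutpoints define 4 ordered bins
    category = pm.OrderedLogistic('category', eta=latent, cutpoints=cutpoints)

The cutpoints can also be made free parameters if the bin edges should be learned rather than fixed.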
import itertools
import numpy as np
import pandas as pd
import pymc as pm

file_path = 'Combination.xlsx'
df = pd.read_excel(file_path, sheet_name=0, header=None)
Combination_cpt = df.to_numpy().T
Combination_cpt.shape

bins = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
num_states_plf = len(bins) - 1
num_states_slf = len(bins) - 1

lookup_table_COM = {}
for i, j in itertools.product(range(num_states_plf), range(num_states_slf)):
    idx = i * num_states_slf + j
    lookup_table_COM[(i, j)] = Combination_cpt[idx]
with pm.Model() as model:
    PoA = pm.Triangular('PoA', lower=0, c=0.3, upper=1)
    CF = pm.Lognormal('CF', mu=4.5, sigma=0.2)
    TEF = pm.Deterministic('TEF', CF * PoA)
    vulnerability = pm.Normal('vulnerability', mu=1, sigma=0.5)
    MPLEF = pm.Deterministic('MPLEF', TEF * vulnerability)
    CoSL = pm.Normal('CoSL', mu=0.7, sigma=0.2)
    PLF = pm.Poisson('PLF', mu=MPLEF)
    SLF = pm.Binomial('SLF', n=PLF, p=CoSL)

    plf_bin = np.digitize(PLF.eval(), bins) - 1
    slf_bin = np.digitize(SLF.eval(), bins) - 1
    current_state = (plf_bin, slf_bin)
    p_combination = lookup_table_COM.get(current_state)
    combination = pm.Categorical('combination', p=p_combination)

    trace = pm.sample(1000, tune=1000)
Combination.csv (37.2 KB)
I tried the OrderedLogistic / Probit distributions as you suggested, but it doesn’t seem to achieve the effect I want. I tried another method, but there are issues with sampling in the “combination” node. Could it be a problem with the node definition?
The computational backend for pymc is pytensor. You cannot use numpy functions on symbolic PyMC objects, because they aren’t arrays of numerical values; they are symbols representing computation. If you call .eval(), you will execute the computation the node represents and freeze the result at a constant value. Importantly, if the computation depends on random variables, it will not be updated when new values are drawn.
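As a small illustration of that freezing behaviour (my own example, using a standalone variable rather than anything from your model):

import pymc as pm

x = pm.Normal.dist(mu=0, sigma=1)
frozen = x.eval()   # runs the graph once and returns a plain numpy value
frozen              # the same constant every time it is inspected
pm.draw(x)          # by contrast, pm.draw gives a fresh value on each call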
What should I do to achieve the desired effect? I’d really appreciate any help.
If you are overwhelmed, you will probably have to start much simpler instead of jumping directly to your idea of the final model.
Have a look at some of the case studies in pymc-examples to get a feeling for how models of different complexity are built: PyMC Example Gallery
Okay, thank you for your help!
You need to replace np. operations with pt. operations, and remove .eval() calls. For example:

plf_bin = np.digitize(PLF.eval(), bins) - 1

is equivalent to:

import pytensor.tensor as pt
plf_bin = pt.searchsorted(bins, PLF) - 1

(This equivalence between digitize and searchsorted is described in the documentation for np.digitize.)
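A quick numerical check of that equivalence (my own toy values, chosen away from the bin edges):

import numpy as np

bins = np.array([0, 10, 20, 30])
x = np.array([3, 15, 27])
np.digitize(x, bins) - 1      # array([0, 1, 2])
np.searchsorted(bins, x) - 1  # array([0, 1, 2]), the same bin indices for these values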
You will also not be able to use a dictionary for lookup_table_COM, because current_state will not be a tuple of integers; it will be a tuple of symbolic computations. Instead, you can make it an array:
lookup_table_COM = np.zeros((num_states_plf, num_states_slf, Combination_cpt.shape[1]))
for i, j in itertools.product(range(num_states_plf), range(num_states_slf)):
    idx = i * num_states_slf + j
    lookup_table_COM[i, j] = Combination_cpt[idx]  # the full probability vector for this (PLF bin, SLF bin) pair
lookup_table_pt = pt.as_tensor_variable(lookup_table_COM)
which you can then symbolically index into just like a numpy array:
p_combination = lookup_table_pt[plf_bin, slf_bin]
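As a side note, here is a toy demonstration of that kind of symbolic indexing (names made up, unrelated to the CPT itself):

import numpy as np
import pytensor.tensor as pt

table = pt.as_tensor_variable(np.arange(12).reshape(3, 4))
i = pt.iscalar('i')
j = pt.iscalar('j')
table[i, j].eval({i: 1, j: 2})   # returns 6, i.e. row 1, column 2 of the table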
import pandas as pd
import pytensor.tensor as pt
import numpy as np
import pymc as pm

file_path = 'Combination.xlsx'
df = pd.read_excel(file_path, sheet_name=0, header=None)
Combination_cpt = df.to_numpy().T
Combination_cpt.shape
bins = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
# num_states_plf = len(bins) - 1
# num_states_slf = len(bins) - 1
Combination_cpt_tensor = pt.as_tensor_variable(Combination_cpt)
with pm.Model() as model:
    PoA = pm.Triangular('PoA', lower=0, c=0.3, upper=1)
    CF = pm.Lognormal('CF', mu=4.5, sigma=0.2)
    TEF = pm.Deterministic('TEF', CF * PoA)
    vulnerability = pm.Normal('vulnerability', mu=1, sigma=0.5)
    MPLEF = pm.Deterministic('MPLEF', TEF * vulnerability)
    CoSL = pm.Normal('CoSL', mu=0.7, sigma=0.2)
    PLF = pm.Poisson('PLF', mu=MPLEF)
    SLF = pm.Binomial('SLF', n=PLF, p=CoSL)

    plf_bin = pt.searchsorted(bins, PLF) - 1
    slf_bin = pt.searchsorted(bins, SLF) - 1

    def get_cpt_value_combination(plf_state, slf_state, cpt_tensor):
        idx = plf_state * 12 + slf_state
        return cpt_tensor[idx]

    p_Combination = get_cpt_value_combination(plf_bin, slf_bin, Combination_cpt_tensor)
    Combination = pm.Categorical('Combination', p=p_Combination)

    trace = pm.sample(10000, tune=1000)
Thank you for your suggestions!! I tried converting the CPTs into tensors in advance and also followed your advice. However, sampling from the Combination node is still not working. Are there any other issues that could be causing this?
I’m afraid you’re going to need to be a lot more specific than “it’s not working”
Can you open an issue in PyTensor to add the helper?
I realized that the issue might not be related to how PyTensor is used, but rather to the fact that the CPT of my combination node consists of 0s and 1s, with a large number of 0s. I found that in PyMC, categorical nodes defined with CPTs of 0s and 1s often face difficulties in sampling.
I also don’t know what this array means, but the summary you posted suggests your p in the categorical is 100% on the 8th category and 0% everywhere else. I suggest you inspect the values you are obtaining for p and make sure they make sense.
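One possible way to do that inspection (my suggestion, assuming the model and p_Combination from your earlier post are still in scope):

import pymc as pm

print(pm.draw(p_Combination, draws=5))  # a few draws of the probability vector that feeds the Categorical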
To be clear, for the p argument, Categorical expects a vector of length n_categories that sums to 1.
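For example, a minimal valid specification looks like this (my own illustration):

import numpy as np
import pymc as pm

p = np.array([0.2, 0.5, 0.3])       # one probability per category, summing to 1
with pm.Model():
    c = pm.Categorical('c', p=p)    # takes values in {0, 1, 2}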