Hello! I would like to discretize a continuous variable in PyMC or create a discrete Categorical node that depends on a continuous parent node. My goal is to assign the parent node’s value to a specific interval and map it to a corresponding discrete category. How can I achieve this?
This notebook would probably be a good place to start.
But I want to create discrete sub-nodes, is it possible to achieve this through regression?
What does “discrete sub-nodes” mean?
I’m concerned this question is an XY problem. Could you elaborate on what you’re modeling and how this discretization fits into your larger plan?
For example, continuous nodes a and b have a discrete node c as their child, and the CPT (conditional probability table) of node c is manually specified.
Can you provide a specific numerical example? Using numpy with random draws is fine.
The CPT to be defined for the “combination” node:
import pandas as pd

file_path = 'Combination.xlsx'
df = pd.read_excel(file_path, sheet_name=0, header=None)
Combination_cpt = df.to_numpy().T
Combination_cpt.shape
node definition:
PoA = pm.Triangular('PoA', lower=0, c=0.3, upper=1)
CF = pm.Lognormal('CF', mu=4.5, sigma=0.2)
TEF = pm.Deterministic('TEF', CF * PoA)
vulnerability = pm.Normal('vulnerability', mu=1, sigma=0.5)
MPLEF = pm.Deterministic('MPLEF', TEF * vulnerability)
CoSL = pm.Normal('CoSL', mu=0.7, sigma=0.2)
PLF = pm.Poisson('PLF', mu=MPLEF)
SLF = pm.Binomial('SLF', n=PLF, p=CoSL)
# The "combination" node is a child node of both PLF and SLF, and the CPT data is imported from Excel.
# It is desired to divide both PLF and SLF into 12 intervals each to match the dimensions of CPT
Perhaps have a look into the OrderedLogistic / Probit distributions? Those allow you to map a continuous latent quantity into uneven categorical bins.
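If it helps, here is a minimal sketch of that idea (my own toy example, not your model): a continuous latent variable is mapped to four ordered categories through fixed cutpoints; all names and numbers are illustrative.

import numpy as np
import pymc as pm

with pm.Model() as sketch:
    latent = pm.Normal('latent', mu=0, sigma=1)       # continuous parent
    cutpoints = np.array([-1.0, 0.0, 1.0])            # 3 cutpoints define 4 ordered bins
    category = pm.OrderedLogistic('category', eta=latent, cutpoints=cutpoints)

The cutpoints can also be made free parameters if the bin edges should be learned rather than fixed.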
import itertools
import numpy as np
import pandas as pd
import pymc as pm

file_path = 'Combination.xlsx'
df = pd.read_excel(file_path, sheet_name=0, header=None)
Combination_cpt = df.to_numpy().T
Combination_cpt.shape

bins = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
num_states_plf = len(bins) - 1
num_states_slf = len(bins) - 1

lookup_table_COM = {}
for i, j in itertools.product(range(num_states_plf), range(num_states_slf)):
    idx = i * num_states_slf + j
    lookup_table_COM[(i, j)] = Combination_cpt[idx]
with pm.Model() as model:
    PoA = pm.Triangular('PoA', lower=0, c=0.3, upper=1)
    CF = pm.Lognormal('CF', mu=4.5, sigma=0.2)
    TEF = pm.Deterministic('TEF', CF * PoA)
    vulnerability = pm.Normal('vulnerability', mu=1, sigma=0.5)
    MPLEF = pm.Deterministic('MPLEF', TEF * vulnerability)
    CoSL = pm.Normal('CoSL', mu=0.7, sigma=0.2)
    PLF = pm.Poisson('PLF', mu=MPLEF)
    SLF = pm.Binomial('SLF', n=PLF, p=CoSL)

    plf_bin = np.digitize(PLF.eval(), bins) - 1
    slf_bin = np.digitize(SLF.eval(), bins) - 1
    current_state = (plf_bin, slf_bin)
    p_combination = lookup_table_COM.get(current_state)
    combination = pm.Categorical('combination', p=p_combination)

    trace = pm.sample(1000, tune=1000)
Combination.csv (37.2 KB)
I tried the OrderedLogistic / Probit distributions as you suggested, but it doesn’t seem to achieve the effect I want. I tried another method, but there are issues with sampling in the “combination” node. Could it be a problem with the node definition?
The computational backend for pymc is pytensor. You cannot use numpy functions on symbolic PyMC objects, because they aren’t arrays of numerical values; they are symbols representing computation. If you call .eval(), you will execute the computation the node represents and freeze the result at a constant value. Importantly, if the computation depends on random variables, it will not be updated when new values are drawn.
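As a small illustration of that freezing behaviour (my own example, using a standalone variable rather than anything from your model):

import pymc as pm

x = pm.Normal.dist(mu=0, sigma=1)
frozen = x.eval()   # runs the graph once and returns a plain numpy value
frozen              # the same constant every time it is inspected
pm.draw(x)          # by contrast, pm.draw gives a fresh value on each call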
What should I do to achieve the desired effect? I’d really appreciate any help.
If you are overwhelmed, you will probably have to start much simpler instead of jumping directly to your idea of the final model.
Have a look at some of the case studies in pymc-examples to get a feeling for how models of different complexity are built: PyMC Example Gallery
Okay, thank you for your help!
You need to replace np. operations with pt. operations, and remove .eval() calls. For example:

plf_bin = np.digitize(PLF.eval(), bins) - 1

is equivalent to:

import pytensor.tensor as pt
plf_bin = pt.searchsorted(bins, PLF) - 1

(This equivalence between digitize and searchsorted is described in the documentation for np.digitize.)
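A quick numerical check of that equivalence (my own toy values, chosen away from the bin edges):

import numpy as np

bins = np.array([0, 10, 20, 30])
x = np.array([3, 15, 27])
np.digitize(x, bins) - 1      # array([0, 1, 2])
np.searchsorted(bins, x) - 1  # array([0, 1, 2]), the same bin indices for these values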
You will also not be able to use a dictionary for lookup_table_COM, because current_state will not be a tuple of integers; it will be a tuple of symbolic computations. Instead, you can make it an array:
lookup_table_COM = np.zeros((num_states_plf, num_states_slf, Combination_cpt.shape[1]))
for i, j in itertools.product(range(num_states_plf), range(num_states_slf)):
    idx = i * num_states_slf + j
    lookup_table_COM[i, j] = Combination_cpt[idx]  # the full probability vector for this (PLF bin, SLF bin) pair
lookup_table_pt = pt.as_tensor_variable(lookup_table_COM)
which you can then symbolically index into just like a numpy array:
p_combination = lookup_table_pt[plf_bin, slf_bin]
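As a side note, here is a toy demonstration of that kind of symbolic indexing (names made up, unrelated to the CPT itself):

import numpy as np
import pytensor.tensor as pt

table = pt.as_tensor_variable(np.arange(12).reshape(3, 4))
i = pt.iscalar('i')
j = pt.iscalar('j')
table[i, j].eval({i: 1, j: 2})   # returns 6, i.e. row 1, column 2 of the table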
import pandas as pd
import pytensor.tensor as pt
import numpy as np
import pymc as pm

file_path = 'Combination.xlsx'
df = pd.read_excel(file_path, sheet_name=0, header=None)
Combination_cpt = df.to_numpy().T
Combination_cpt.shape
bins = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
# num_states_plf = len(bins) - 1
# num_states_slf = len(bins) - 1
Combination_cpt_tensor = pt.as_tensor_variable(Combination_cpt)
with pm.Model() as model:
    PoA = pm.Triangular('PoA', lower=0, c=0.3, upper=1)
    CF = pm.Lognormal('CF', mu=4.5, sigma=0.2)
    TEF = pm.Deterministic('TEF', CF * PoA)
    vulnerability = pm.Normal('vulnerability', mu=1, sigma=0.5)
    MPLEF = pm.Deterministic('MPLEF', TEF * vulnerability)
    CoSL = pm.Normal('CoSL', mu=0.7, sigma=0.2)
    PLF = pm.Poisson('PLF', mu=MPLEF)
    SLF = pm.Binomial('SLF', n=PLF, p=CoSL)

    plf_bin = pt.searchsorted(bins, PLF) - 1
    slf_bin = pt.searchsorted(bins, SLF) - 1

    def get_cpt_value_combination(plf_state, slf_state, cpt_tensor):
        idx = plf_state * 12 + slf_state
        return cpt_tensor[idx]

    p_Combination = get_cpt_value_combination(plf_bin, slf_bin, Combination_cpt_tensor)
    Combination = pm.Categorical('Combination', p=p_Combination)

    trace = pm.sample(10000, tune=1000)
Thank you for your suggestions!! I tried converting the CPTs into tensors in advance and also followed your advice. However, sampling from the Combination node is still not working. Are there any other issues that could be causing this?
I’m afraid you’re going to need to be a lot more specific than “it’s not working”
Can you open an issue in PyTensor to add the helper?
I realized that the issue might not be related to how PyTensor is used, but rather to the fact that the CPT of my combination node consists of 0s and 1s, with a large number of 0s. I found that in PyMC, categorical nodes defined with CPTs of 0s and 1s often face difficulties in sampling.
I also don’t know what this array means, but the summary you posted suggests your p in the categorical is 100% on the 8th category and 0% everywhere else. I suggest you inspect the values you are obtaining for p and make sure they make sense.
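One possible way to do that inspection (my suggestion, assuming the model and p_Combination from your earlier post are still in scope):

import pymc as pm

print(pm.draw(p_Combination, draws=5))  # a few draws of the probability vector that feeds the Categorical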
To be clear, for the p argument, Categorical expects a vector of length n_categories that sums to 1.
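For example, a minimal valid specification looks like this (my own illustration):

import numpy as np
import pymc as pm

p = np.array([0.2, 0.5, 0.3])       # one probability per category, summing to 1
with pm.Model():
    c = pm.Categorical('c', p=p)    # takes values in {0, 1, 2}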