Random Variable with observed data not in trace

hayfreed · October 19, 2022, 4:29pm

I’m new to pymc3 and I am trying to generate samples from a Categorical distribution for a list of discrete variables rv_list related to patient data. I’m trying to wrap my head around why adding observed data to a Random Variable causes it to not appear in the trace generated by pm.sample().

My data are as follows (I used ordinal encoding to convert from string to float values):
Gender ∈ [0,1]
Race ∈ [0,1,2]
Site ∈ [0,1,2,3,4]
Disease ∈ [0,1,2,3,4]

Below is an example where I do not include the observed data in the Categorical distribution.

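import pymc3 as pm

# data (column name -> values) and rv_list (column names) are defined elsewhere
with pm.Model() as model:
    for rv in rv_list:
        num_categories = len(set(data[rv]))
        mu = pm.Dirichlet(f"prior_{rv}", a=[1] * num_categories, shape=(num_categories,))
        x = pm.Categorical(f"likelihood_{rv}", p=mu)
    trace = pm.sample(1000, cores=1)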

The NUTS sampler is assigned for the uniform Dirichlet prior, and Metropolis is assigned for the Categorical distribution. As seen below, both the Dirichlet prior and Categorical likelihood variables are present in the trace.

What I am unsure about is why, when I modify the Categorical variable to include observed data, the trace no longer seems to include that variable. Instead, the prior variable is modified to reflect the reality of the observed data. For example, I add an observed argument like this:

x = pm.Categorical(f"likelihood_{rv}", p=mu, observed=data[rv])

The trace object visualization now appears this way:

First, I am not sure why adding observed data to the declaration of the Categorical variable causes that variable to disappear from the trace output. Second, I am confused as to why the observed data would cause the prior random variable to change. (As one example, there are many more females than males in my data, which shows up as the blue distribution taking on a larger mean value than the orange distribution in the top left chart of the above figure.)

This is probably a pretty easy question for someone with experience to answer, but I was somewhat unclear about the expected behavior based on the documentation. For example, in the Getting Started guide, there are 2 case studies that include observed data; in the coal mining case, the variable with the observed data appears in the final trace, and in the stochastic volatility case, it does not.

Thanks in advance for any help!

cluhmann · October 19, 2022, 7:38pm

Welcome!

The values of observed variables are not sampled. The observed values are taken “as is” and the likelihood of those observed values is calculated (repeatedly) during sampling.
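This is also why the variable named prior_{rv} changes once data are attached: the draws in the trace come from the posterior of those probabilities, not from the prior. For a Dirichlet prior with a Categorical likelihood, the posterior happens to have a closed form (the prior concentration plus the per-category counts), so the effect can be sketched in plain Python without running a sampler. The gender column below is made-up illustration data, and the helper names are hypothetical, not part of PyMC.

```python
from collections import Counter


def dirichlet_posterior_alpha(prior_alpha, observations):
    """Conjugate update: add the count of each observed category to the prior concentration."""
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(prior_alpha)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with concentration vector alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical ordinal-encoded column: seven 0s and three 1s (not the poster's data)
gender = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

prior_alpha = [1.0, 1.0]  # uniform Dirichlet, matching a=[1] * num_categories
posterior_alpha = dirichlet_posterior_alpha(prior_alpha, gender)

prior_mean = dirichlet_mean(prior_alpha)          # equal weight on both categories
posterior_mean = dirichlet_mean(posterior_alpha)  # shifted toward the more frequent category
```

With unbalanced data, the posterior mean moves away from the uniform prior toward the observed frequencies, which is consistent with the shift described in the question.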

In the coal mining disaster example, I do not see the observed variable (disasters) in the posterior. The observed data is always present in the InferenceData object (i.e., idata.observed_data), but not in the posterior (i.e., idata.posterior).

hayfreed · October 20, 2022, 10:04pm

Thanks for the response! What I was trying to do was accomplished by pm.sample_posterior_predictive(trace)
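For anyone landing here later, this is roughly what posterior predictive sampling amounts to for this kind of model: for each draw of the category probabilities, simulate a new categorical value from them. The sketch below uses only the standard library and an assumed Dirichlet posterior with concentration [8.0, 4.0] (a made-up example); PyMC works from the draws stored in the trace instead, so treat this as an illustration of the idea and not of PyMC's internals. The function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def posterior_predictive(posterior_alpha, n_draws, seed=0):
    """Simulate n_draws new categorical observations, one per sampled probability vector."""
    rng = random.Random(seed)
    categories = list(range(len(posterior_alpha)))
    draws = []
    for _ in range(n_draws):
        p = sample_dirichlet(posterior_alpha, rng)                # one draw of the probabilities
        draws.append(rng.choices(categories, weights=p, k=1)[0])  # one simulated observation
    return draws


# Assumed posterior concentration for a two-category variable (illustration only)
posterior_alpha = [8.0, 4.0]
simulated = posterior_predictive(posterior_alpha, n_draws=2000)
share_of_zero = simulated.count(0) / len(simulated)  # fraction of simulated 0s
```

The simulated values are new draws of the Categorical variable itself, which is what I was originally expecting to find in the trace.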

cluhmann · October 20, 2022, 10:07pm

Glad you got it sorted out!
