Random Variable with observed data not in trace

I’m new to pymc3 and I am trying to generate samples from a Categorical distribution for a list of discrete variables rv_list related to patient data. I’m trying to wrap my head around why adding observed data to a Random Variable causes it to not appear in the trace generated by pm.sample().

My data are as follows (I used ordinal encoding to convert from string to float values)
Gender ∈ [0,1]
Race ∈ [0,1,2]
Site ∈ [0,1,2,3,4]
Disease ∈ [0,1,2,3,4]

Below is an example where I do not include the observed data in the Categorical distribution.

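import pymc3 as pm

# Assumes the variables are declared inside a model context, as PyMC3 requires.
with pm.Model() as model:
    for rv in rv_list:
        num_categories = len(set(data[rv]))  # number of distinct encoded values
        mu = pm.Dirichlet(f"prior_{rv}", a=[1]*num_categories, shape=(num_categories,))
        x = pm.Categorical(f"likelihood_{rv}", p=mu)
    trace = pm.sample(1000, cores=1)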

The NUTS sampler is assigned to the uniform Dirichlet prior, and Metropolis is assigned to the Categorical variable. As seen below, both the Dirichlet prior and Categorical likelihood variables are present in the trace.

What I am unsure about is why, when I modify the Categorical variable to include observed data, the trace no longer seems to include that variable. Instead, the prior variable shifts to reflect the observed data. For example, I add an observed argument like this:

x = pm.Categorical(f"likelihood_{rv}", p=mu, observed=data[rv])

The trace object visualization now appears this way:

First, I am not sure why adding observed data to the declaration of the Categorical variable causes that variable to disappear from the trace output. Second, I am confused as to why the observed data would cause the prior random variable to change. (As one example, in my data there are many more females than males, which shows up as the blue distribution having a larger mean than the orange distribution in the top left chart of the above figure.)

This is probably a pretty easy question for someone with experience to answer, but I was somewhat unclear about the expected behavior based on the documentation. For example, in the Getting Started guide, there are two case studies that include observed data; in the coal mining case, the variable with the observed data appears in the final trace, and in the stochastic volatility case, it does not.

Thanks in advance for any help!

Welcome!

The values of observed variables are not sampled. The observed values are taken “as is” and the likelihood of those observed values is calculated (repeatedly) during sampling.

In the coal mining disaster example, I do not see the observed variable (disasters) in the posterior. The observed data are always present in the InferenceData object (i.e., idata.observed_data), but not in the posterior (i.e., idata.posterior).
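
As for why the Dirichlet variable shifts once data are attached: what the trace stores under the "prior" name is really the posterior of that variable given the data. Here is a minimal standard-library sketch of the idea for a single variable (not PyMC code). The `observed` list is made-up illustrative data and `dirichlet_sample` is a hypothetical helper. It relies on the standard conjugate result that a Dirichlet(1, ..., 1) prior combined with category counts n_k gives a Dirichlet(1 + n_k) posterior.

```python
import random
from collections import Counter


def dirichlet_sample(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)

# Hypothetical observed data for one binary variable (illustrative only):
# 30 observations of category 0 and 70 of category 1.
observed = [0] * 30 + [1] * 70
num_categories = 2

# The observed values stay fixed; they enter the posterior only via the counts.
counts = Counter(observed)
prior_alpha = [1.0] * num_categories
posterior_alpha = [prior_alpha[k] + counts[k] for k in range(num_categories)]

# Exact posterior mean of each category probability.
posterior_mean = [a / sum(posterior_alpha) for a in posterior_alpha]

# Monte Carlo check: draws of the probabilities are the kind of values
# a sampler would store in the trace for the Dirichlet variable.
samples = [dirichlet_sample(posterior_alpha, rng) for _ in range(5000)]
mc_mean = [sum(s[k] for s in samples) / len(samples) for k in range(num_categories)]

print(posterior_alpha)
print(posterior_mean)
print(mc_mean)
```

The category with more observations ends up with the larger posterior mean, which matches the shift described in the question, while the observed values themselves are never resampled.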

Thanks for the response! What I was trying to do was accomplished by pm.sample_posterior_predictive(trace)
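
For anyone landing here with the same goal: conceptually, posterior predictive sampling simulates new category values from each posterior draw of the probabilities. Below is a rough standard-library sketch of that idea, not the PyMC implementation. The `posterior_alpha` values are assumed for illustration (a Dirichlet(1, 1) prior plus hypothetical counts of 30 and 70), and `dirichlet_sample` is a hypothetical helper.

```python
import random


def dirichlet_sample(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(1)

# Assumed posterior over the two category probabilities (illustrative only).
posterior_alpha = [31.0, 71.0]

predictive = []
for _ in range(5000):
    # One posterior draw of the category probabilities.
    p = dirichlet_sample(posterior_alpha, rng)
    # One simulated observation from a Categorical with those probabilities.
    y_new = rng.choices(range(len(p)), weights=p)[0]
    predictive.append(y_new)

# Share of simulated observations that fall in category 1.
freq_1 = predictive.count(1) / len(predictive)
print(freq_1)
```

In the model above, pm.sample_posterior_predictive(trace) plays this role for the observed Categorical variables, so the simulated values come back under the likelihood names rather than appearing in the trace.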

Glad you got it sorted out!