Incomplete Observations of binary labeled population

I want to model a mixed population of two groups. The total size of the population is observed, and some of each population is observed. In other words, the labeling of the population is incomplete. There is more variance and more observation in one class than the other. I’m modeling it in an attempt to recover information about the relative true size of each class.

For example, considering one class “orange” and the other “green”, data can be generated with code like this:

n_observations = 250

true_orange_mu, true_orange_sd = 300, 100
true_green_mu, true_green_sd = 600, 200

# Fraction of each class that will be observed.
obs_pct_orange_mu, obs_pct_orange_sd = 0.10, 0.05
obs_pct_green_mu, obs_pct_green_sd = 0.40, 0.10

df = pd.DataFrame(
        true_orange=np.random.normal(true_orange_mu, true_orange_sd, size=n_observations),
        true_green=np.random.normal(true_green_mu, true_green_sd, size=n_observations),
        pct_obs_orange=np.random.normal(obs_pct_orange_mu, obs_pct_orange_sd, size=n_observations),
        pct_obs_green=np.random.normal(obs_pct_green_mu, obs_pct_green_sd, size=n_observations),
    columns=['true_orange', 'true_green', 'pct_obs_orange', 'pct_obs_green']
df['total'] = df['true_orange'] + df['true_green']
df['obs_orange'] = df['true_orange'] * df['pct_obs_orange'].clip(lower=0)
df['obs_green'] = df['true_green'] * df['pct_obs_green'].clip(lower=0)

df.drop(['pct_obs_orange', 'pct_obs_green'], axis=1, inplace=True)
df = df.round()

The generative model above can’t be directly translated to PyMC, but I think I was able to get close with this code:

no_upper_bounds = np.matrix(np.ones((n_observations,1)) * np.inf)

with pm.Model() as m1:
    obs_data_m1 = pm.MutableData("obs_data_m1", df)
    mu_orange = pm.Uniform("mu_orange", 100, 500)
    sd_orange = pm.HalfNormal("sd_orange", 150)
    orange_daily = pm.Normal("orange_daily", mu_orange, sd_orange)
    # Use observed number of orange to put a lower bound on modeled # of orange.
    orange_daily_with_obs = pm.Deterministic(
        pm.math.clip(orange_daily, obs_data_m1['obs_orange'].to_numpy(), no_upper_bounds)
    mu_green = pm.Uniform("mu_green", 300, 1000)
    sd_green = pm.HalfNormal("sd_green", 300)
    green_daily = pm.Normal("green_daily", mu_green, sd_green)
    green_daily_with_obs = pm.Deterministic(
        pm.math.clip(green_daily, obs_data_m1['obs_green'].to_numpy(), no_upper_bounds)
    # Introducing a noise term here, since observed data can be directly applied
    # to a deterministic node, but otherwise I would use Deterministic here.
    daily_sd = pm.HalfNormal("daily_sd", 500)
    daily_total = pm.Normal(
        orange_daily + green_daily, 
    trace_m1 = pm.sample(10_000, chains=8, target_accept=0.9)

When using 250 observations, as in this example, this model runs in to a recursion error in _tensor_py_operators.__getitem__.<locals>.includes_bool.

The model works with only 3 observations, but produces 9 different values for orange_daily_with_obs and green_daily_with_obs. The model is running all the possible combinations of those observed data. I want the model to treat them as paired observations, where a particular observation of total, lines up with the specific observations of orange_daily_with_obs and green_daily_with_obs.

  • How should I structure the model to work in that paired way?
  • And secondly, is this use of clip a reasonable way to apply this partial data?

I would like to remove the discussion of the recursive error I had seen previously, but the edit window has passed.

I removed the MutableData use from the model, since I’m still learning about how to apply that construct. After removing MutableData, I was no longer seeing the error.

I have posted a notebook with the simplified model to GitHub.

The model shows some success in recovering the mean of the normal distributions initially used to generate the example data.

Can you say more about what question(s) you are attempting to answer using your data? In your original post, you said that you were “attempt[ing] to recover information about the relative true size”. But the model seems to omit the obs_pct_orange_mu/obs_pct_orange_sd and obs_pct_green_mu/obs_pct_green_sd variables used in when generating the data. If you know that the data of each color is going to be filtered in a way that is systematic across observations, it seems reasonable to include that in the model no? If you did, then you could model the observed number of orange and the observed number of green and the observed total size of the population. At least I think you could.

I am hoping to estimate the true number of orange and true number of green in each individual trial.

In each trial, the total number is observed, but the split between orange and green is not.

You’re making a good point about modeling the filtering process that dictates observations. I was hoping to build the model step by step, and thought that the filtering process would be a good step to add after being confident with how I was even handling the incompleteness of the observations.

I tend to model the process by which (I believe) data is generated and I tend to do so chronologically. So if the first step is to drawn some number of items from a larger set (or 2 larger subsets), I would start there. If the next step is to filter those draws so that only some of them are observed, I would add then next. But that’s just an idea.