I want to model a mixed population of two groups. The total size of the population is observed, and some of each population is observed. In other words, the labeling of the population is incomplete. There is more variance and more observation in one class than the other. I’m modeling it in an attempt to recover information about the relative true size of each class.
For example, considering one class “orange” and the other “green”, data can be generated with code like this:
n_observations = 250
true_orange_mu, true_orange_sd = 300, 100
true_green_mu, true_green_sd = 600, 200
# Fraction of each class that will be observed.
obs_pct_orange_mu, obs_pct_orange_sd = 0.10, 0.05
obs_pct_green_mu, obs_pct_green_sd = 0.40, 0.10
df = pd.DataFrame(
dict(
true_orange=np.random.normal(true_orange_mu, true_orange_sd, size=n_observations),
true_green=np.random.normal(true_green_mu, true_green_sd, size=n_observations),
pct_obs_orange=np.random.normal(obs_pct_orange_mu, obs_pct_orange_sd, size=n_observations),
pct_obs_green=np.random.normal(obs_pct_green_mu, obs_pct_green_sd, size=n_observations),
),
columns=['true_orange', 'true_green', 'pct_obs_orange', 'pct_obs_green']
)
df['total'] = df['true_orange'] + df['true_green']
df['obs_orange'] = df['true_orange'] * df['pct_obs_orange'].clip(lower=0)
df['obs_green'] = df['true_green'] * df['pct_obs_green'].clip(lower=0)
df.drop(['pct_obs_orange', 'pct_obs_green'], axis=1, inplace=True)
df = df.round()
df
The generative model above can’t be directly translated to PyMC, but I think I was able to get close with this code:
no_upper_bounds = np.matrix(np.ones((n_observations,1)) * np.inf)
with pm.Model() as m1:
obs_data_m1 = pm.MutableData("obs_data_m1", df)
mu_orange = pm.Uniform("mu_orange", 100, 500)
sd_orange = pm.HalfNormal("sd_orange", 150)
orange_daily = pm.Normal("orange_daily", mu_orange, sd_orange)
# Use observed number of orange to put a lower bound on modeled # of orange.
orange_daily_with_obs = pm.Deterministic(
"orange_daily_with_obs",
pm.math.clip(orange_daily, obs_data_m1['obs_orange'].to_numpy(), no_upper_bounds)
)
mu_green = pm.Uniform("mu_green", 300, 1000)
sd_green = pm.HalfNormal("sd_green", 300)
green_daily = pm.Normal("green_daily", mu_green, sd_green)
green_daily_with_obs = pm.Deterministic(
"green_daily_with_obs",
pm.math.clip(green_daily, obs_data_m1['obs_green'].to_numpy(), no_upper_bounds)
)
# Introducing a noise term here, since observed data can be directly applied
# to a deterministic node, but otherwise I would use Deterministic here.
daily_sd = pm.HalfNormal("daily_sd", 500)
daily_total = pm.Normal(
"daily_total",
orange_daily + green_daily,
daily_sd,
observed=obs_data_m1['total']
)
trace_m1 = pm.sample(10_000, chains=8, target_accept=0.9)
When using 250 observations, as in this example, this model runs in to a recursion error in _tensor_py_operators.__getitem__.<locals>.includes_bool
.
The model works with only 3 observations, but produces 9 different values for orange_daily_with_obs
and green_daily_with_obs
. The model is running all the possible combinations of those observed data. I want the model to treat them as paired observations, where a particular observation of total
, lines up with the specific observations of orange_daily_with_obs
and green_daily_with_obs
.
- How should I structure the model to work in that paired way?
- And secondly, is this use of
clip
a reasonable way to apply this partial data?