I am modelling sea-level rise, which is made up of various components. I have a model for the glacier contribution and a model for the thermal expansion of the water. I can tune these models separately and independently of each other, because there are observations of glacier retreat and observations of ocean warming. Using MCMC, I have posterior distributions of the model parameters and, say, of the 20th-century contribution for each of these sources.
with pm.Model() as glaciers_model:
    glacier_param1 = pm.Normal(...)
    glacier_param2 = pm.Normal(...)
    glaciers_slr = pymc3_wrapper_for_glaciers(glacier_param1, glacier_param2, ...)
    pm.Normal("glaciers_obs", mu=glaciers_slr, sigma=glaciers_std, observed=glaciers_obs)
    trace_glaciers = pm.sample(...)
with pm.Model() as thermal_model:
    thermal_param1 = pm.Normal(...)
    thermal_param2 = pm.Normal(...)
    thermal_slr = pymc3_wrapper_for_thermal(thermal_param1, thermal_param2, ...)
    pm.Normal("thermal_obs", mu=thermal_slr, sigma=thermal_std, observed=thermal_obs)
    trace_thermal = pm.sample(...)
Now I want to also use actual sea-level observations, which I model as the sum of my two components: glaciers + thermal expansion. How can I best do that, in an efficient manner (model integration time is not negligible)?
Of course, the simplest way would be to have one big model with glaciers, thermal expansion, and their sum, and use observations for all three. But the integration time appears to be quite long, and in one attempt I got a warning (effective sample size low for some parameters). That is because I actually have more components (four), with a total of 10 parameters for the full model, and 15 observations.
with pm.Model() as full_model:  # (waste?)full model
    glacier_param1 = pm.Normal(...)
    glacier_param2 = pm.Normal(...)
    thermal_param1 = pm.Normal(...)
    thermal_param2 = pm.Normal(...)
    glaciers_slr = pymc3_wrapper_for_glaciers(glacier_param1, glacier_param2, ...)
    pm.Normal("glaciers_obs", mu=glaciers_slr, sigma=glaciers_std, observed=glaciers_obs)
    thermal_slr = pymc3_wrapper_for_thermal(thermal_param1, thermal_param2, ...)
    pm.Normal("thermal_obs", mu=thermal_slr, sigma=thermal_std, observed=thermal_obs)
    total_slr = glaciers_slr + thermal_slr
    pm.Normal("slr_obs", mu=total_slr, sigma=slr_std, observed=slr_obs)
    trace_full = pm.sample(...)
For efficiency reasons, it would be better if I could re-use the posterior samples from the glaciers model and from the thermal expansion model, and use these samples to apply the sum constraint. A classic Bayesian update, so to speak, but across models.
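If I write it down, with $y_g$, $y_t$, $y_s$ the glacier, thermal and total sea-level observations, and assuming the parameter sets $\theta_g$ and $\theta_t$ are a priori independent (and each dataset conditionally independent given the parameters), the joint posterior factorizes so that the sub-model posteriors play the role of priors for the sum constraint:

$$
p(\theta_g, \theta_t \mid y_g, y_t, y_s) \;\propto\; p(y_s \mid \theta_g, \theta_t)\, p(\theta_g \mid y_g)\, p(\theta_t \mid y_t)
$$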
One approach I can think of is to first tune my sub-models separately, and in a second step have a "sum" model which simply draws categorical indices over the ensemble members. In the glaciers + thermal case, I would then have:
with pm.Model() as total_resampling:
    # uniform categorical draws over the posterior ensemble member indices
    glaciers_i = pm.Categorical("glaciers_index", p=np.ones(sample_size) / sample_size)
    thermal_i = pm.Categorical("thermal_index", p=np.ones(sample_size) / sample_size)
    slr = pymc3_wrapper_to_sample_posteriors_with_indices_and_sum_them(glaciers_i, thermal_i)
    pm.Deterministic("slr_total", slr)  # keep track for output
    pm.Normal("slr_obs", mu=slr, sigma=slr_obs_error, observed=slr_obs)
    trace_total = pm.sample(...)
At the moment my wrapper function defines a theano Op, to have free hands in indexing the posterior distributions of the glacier and thermal expansion parameters (1000 samples each). Possibly I could do without the wrapper, using theano.shared(np.asarray(trace_glaciers.posterior.glaciers_slr[0]))[glaciers_i].
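For reference, a minimal sketch of that wrapper-free variant. It assumes glaciers_slr and thermal_slr were tracked in the sub-model traces (e.g. via pm.Deterministic), so that chain 0 provides sample_size posterior draws of each component's contribution:

import numpy as np
import theano
import pymc3 as pm

# Posterior draws of each component's SLR contribution, shape (sample_size, n_obs);
# chain 0 is used here, as in the expression above.
glaciers_draws = theano.shared(np.asarray(trace_glaciers.posterior.glaciers_slr[0]))
thermal_draws = theano.shared(np.asarray(trace_thermal.posterior.thermal_slr[0]))

with pm.Model() as total_resampling:
    # uniform categorical draws over the ensemble member indices
    glaciers_i = pm.Categorical("glaciers_index", p=np.ones(sample_size) / sample_size)
    thermal_i = pm.Categorical("thermal_index", p=np.ones(sample_size) / sample_size)
    # advanced indexing into the shared arrays replaces the custom theano Op
    slr = pm.Deterministic("slr_total", glaciers_draws[glaciers_i] + thermal_draws[thermal_i])
    pm.Normal("slr_obs", mu=slr, sigma=slr_obs_error, observed=slr_obs)
    trace_total = pm.sample()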
I feel that this should work OK, but it is not very elegant (some ensemble members of the sub-terms will appear several times). I guess there are other approaches out there that I cannot think of.
Later on I will re-use these samples yet again and update them with additional data at various locations, where new processes come into play…
I wonder if that is what people call a “hierarchical model”.
If I were working in pure numpy/scipy, I would fit a KDE to my posterior samples and resample from it. That would also work. I am not sure whether that can be done in PyMC3.
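Maybe something like this is possible in PyMC3 itself: a sketch assuming the quantity of interest can be reduced to one dimension per component (e.g. the total 20th-century rise), using scipy's gaussian_kde plus pm.Interpolated to turn each sub-model posterior into a prior for the sum update. Note that this treats the components as independent and discards any correlations within each sub-model's joint posterior.

import numpy as np
import pymc3 as pm
from scipy import stats

def from_posterior(name, samples):
    # fit a 1-D KDE to the posterior draws and hand it to pm.Interpolated
    smin, smax = samples.min(), samples.max()
    width = smax - smin
    x = np.linspace(smin - 0.3 * width, smax + 0.3 * width, 200)
    y = stats.gaussian_kde(samples)(x)
    return pm.Interpolated(name, x, y)

# assumed: 1-D arrays of posterior draws of each component's total contribution
glaciers_samples = np.asarray(trace_glaciers.posterior.glaciers_slr).ravel()
thermal_samples = np.asarray(trace_thermal.posterior.thermal_slr).ravel()

with pm.Model() as kde_sum_model:
    glaciers_slr = from_posterior("glaciers_slr", glaciers_samples)
    thermal_slr = from_posterior("thermal_slr", thermal_samples)
    pm.Normal("slr_obs", mu=glaciers_slr + thermal_slr, sigma=slr_obs_error, observed=slr_obs)
    trace_kde = pm.sample()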
Thanks to anyone coming across and considering this question!