I’m trying to build a basic model to test my understanding of estimating population- and group-level effects. I’m leaning on existing examples, such as Rugby Analytics and Multilevel Modelling, but want to try an even simpler example which is a bit more relevant to my field of interest…
Say I have 3 thermometers sitting in a water bath, and I want to estimate the temperature of the water. I suspect that the thermometers might not all be entirely accurate, and may have different levels of precision, but I don’t know this a priori. By using 3 thermometers I assume that my estimate will be better than choosing one of the thermometers at random to take my measurements.
I assume that each thermometer’s recorded temperature (y) is the true water temperature plus some bias/error associated with that thermometer. I constructed some dummy data as follows:
import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
true_temp_mean = 37.0
n=30 # Number of observations
obs_temps0 = true_temp_mean + np.random.normal(loc=0, scale=0.5, size=n) # Good, accurate thermometer
obs_temps1 = true_temp_mean + np.random.normal(loc=0, scale=1, size=n) # Good but less accurate thermometer
obs_temps2 = true_temp_mean + np.random.normal(loc=-2, scale=0.5, size=n) # Bad thermometer with cold bias
all_obs = np.array([obs_temps0, obs_temps1, obs_temps2]).flatten()
# An array describing which of the obs belong to which thermometer
obs_idx = np.array([np.repeat(0, n), np.repeat(1, n), np.repeat(2, n)]).flatten()
obs_sample = np.tile(range(n), 3) # Optional: sample number
# Make dataframe
df = pd.DataFrame(data=list(zip(obs_sample, obs_idx, all_obs)), columns=["Sample", "Thermometer", "Temp"])
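As a quick sanity check that the simulated data has the bias and spread I intended (just a pandas summary, not part of the model):
# Per-thermometer mean and standard deviation of the simulated readings
print(df.groupby("Thermometer")["Temp"].agg(["mean", "std"]))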
I then construct a model that estimates a global temperature term (population-level effect) and deviations from it for each of the 3 thermometers (random effects), assuming the true temperature is drawn from a truncated Normal distribution, with random effects for the mean and standard deviation. (I could have used an ordinary Normal distribution, but I’m imposing some prior understanding that the water is neither frozen nor boiling.)
coords = {'Thermometer': ['T0', 'T1', 'T2'],
'obs_id': df.index}
with pm.Model(coords=coords) as tmodel:
    thermometer_idx = pm.MutableData("thermometer_idx", obs_idx, dims="obs_id")
    # Hyperpriors
    mu_a = pm.Normal("mu_a", mu=0.0, sigma=10.0)
    sigma_a = pm.Exponential("sigma_a", 2.0)
    # Varying intercepts
    a_thermometer = pm.Normal("a_thermometer", mu=mu_a, sigma=sigma_a, dims="Thermometer")
    global_theta = pm.TruncatedNormal("t_global", mu=30, sigma=10, lower=0, upper=100)
    # Expected value, global term plus random effect
    theta = global_theta + a_thermometer[thermometer_idx]
    # Uncertainty terms
    a_thermometer_sigma = pm.Exponential("sigma_thermometer", 1, dims="Thermometer")
    global_sigma = pm.Exponential("sigma_global", 1)
    sigma = global_sigma * a_thermometer_sigma[thermometer_idx]
    y = pm.TruncatedNormal("y", mu=theta, sigma=sigma, observed=all_obs, dims="obs_id")
pm.model_to_graphviz(tmodel)
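As an aside that is probably relevant to my question about wide priors below: my understanding is that a prior predictive check would show whether these priors are sensible before sampling. A minimal sketch of how I think that would look with this model (I haven’t folded this into my workflow yet):
with tmodel:
    prior_idata = pm.sample_prior_predictive()
# Compare prior predictive draws of y against the observed temperatures
az.plot_ppc(prior_idata, group="prior")
plt.show()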
Sampling as follows:
with tmodel:
    trace = pm.sample(tune=2000, chains=4, cores=4, target_accept=0.99, draws=4000, return_inferencedata=True)
This takes me around 14 minutes and raises warnings that the chains hit the maximum tree depth (“Increase max_treedepth, increase target_accept or reparameterize.”), but it seems to give reasonable answers, with the global mean serving as my proxy for the true water bath temperature.
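For reference, this is roughly how I’m inspecting the results (variable names as defined in the model above):
# Posterior summary for the global mean and the per-thermometer offsets
print(az.summary(trace, var_names=["t_global", "a_thermometer", "sigma_global"]))
# Posterior of the global mean, i.e. my proxy for the water bath temperature
az.plot_posterior(trace, var_names=["t_global"])
# Energy plot, which is where things don't look great
az.plot_energy(trace)
plt.show()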
What could I improve here? Is there a better way to formulate this model? The posterior estimate seems very wide, but maybe this is because I left my priors very wide? The energy trace doesn’t look great…
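In case it helps frame the “reparameterize” part of the warning, this is the non-centered version of the varying intercepts that I think it is pointing at; a sketch only, since I haven’t verified that it actually samples better here:
with pm.Model(coords=coords) as tmodel_nc:
    thermometer_idx = pm.MutableData("thermometer_idx", obs_idx, dims="obs_id")
    # Hyperpriors as before
    mu_a = pm.Normal("mu_a", mu=0.0, sigma=10.0)
    sigma_a = pm.Exponential("sigma_a", 2.0)
    # Non-centered varying intercepts: unit-normal offsets scaled by sigma_a
    z_thermometer = pm.Normal("z_thermometer", mu=0.0, sigma=1.0, dims="Thermometer")
    a_thermometer = pm.Deterministic("a_thermometer", mu_a + sigma_a * z_thermometer, dims="Thermometer")
    global_theta = pm.TruncatedNormal("t_global", mu=30, sigma=10, lower=0, upper=100)
    theta = global_theta + a_thermometer[thermometer_idx]
    # Uncertainty terms, unchanged from the model above
    a_thermometer_sigma = pm.Exponential("sigma_thermometer", 1, dims="Thermometer")
    global_sigma = pm.Exponential("sigma_global", 1)
    sigma = global_sigma * a_thermometer_sigma[thermometer_idx]
    y = pm.TruncatedNormal("y", mu=theta, sigma=sigma, observed=all_obs, dims="obs_id")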