Suppose I want to fit the following model to data:
Group level: Beta1 ~ Zeta1 X variable2
data level: y[i, t] ~ Intercept + Beta1[i] X [t] + Beta2 X variable1
Where [i] indexes person, [t] indexes time, and variable2 is a time invariant characteristic of a person that inform the growth rate of y[i, t] via Beta1.
Once I fit the model, I want to use it on new data. However, I will observe y[i,t], [t],[i] and variable1. What I really want to do is use this observed information to infer what variable2 is for a given person [i] at a given time [t].
I can set up the model and understand how to predict y[i,t]. How would I use the same model to predict variable2?
Place a prior distribution on variable2 within the model, pass it as a masked array so that PyMC3 knows that it is a missing observation and then run the sampler as before. Here’s an example of how it’s done for a multiple linear regression:
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as lt
%matplotlib inline
# Specify number of individuals and covariates
n = 200
p = 2
# Create a fake dataset
true_x = np.random.randn(n,p)
true_intercept = 3
true_coefficients = np.asarray([1.,-2.])
y = true_intercept + np.dot(true_x,true_coefficients) + np.random.randn(n) * 0.1
# Create a mask to apply to fake dataset
mask = np.zeros([n,p],dtype=bool)
mask[0:5,0] = True
print('Missing values of x are:',true_x[0:5,0])
observed_x = np.ma.masked_array(data = true_x, mask=mask)
# Model with prior placed on 'x'
with pm.Model() as model:
x = pm.Normal('x',observed = observed_x)
intercept = pm.Normal('intercept',sd=10)
coefficients = pm.Normal('coefficients',sd=10,shape = p)
mean = intercept + pm.math.dot(x,coefficients)
response = pm.Normal('response',mu=mean,observed=y)
trace = pm.sample()
# Visualize posterior estimates of missing data
pm.plot_posterior(trace,varnames=['x_missing'])```
Edit: I just realized that I misspecified the error variance in the PyMC3 model, but the code still works.
I think that you could find gaussian mixture models really helpful for your problem. I imagine that zeta is a categorical variable, so it will be hard to sample it directly. I recommend you read this for a brief explanation on how to write a different parametrization of your model, called the marginalized parametrization, which would help you to draw samples from Variable2's posterior.