We would like to perform linear regression on time series data, where each data point is conditioned on the previous one, i.e., we want to learn coefficients b_0 and b_1 such that
y_t \sim N(\mu = b_0 + b_1 * y_{t-1}, \sigma = 1)
for all t. We have a model that works when the data are complete, but we are stuck on how to handle missing data. With the model defined by the code below, PyMC3 can easily impute missing y values. However, in this problem the x values are the same as the y values (just offset by one index), so a missing y is also a missing x.
import numpy as np
import pymc3 as pm

data = np.arange(10)
x = data[:-1]   # each value...
y = data[1:]    # ...predicts the next one

with pm.Model() as model:
    # intercept b_0 and slope b_1
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    mu = beta[0] + beta[1] * x
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=1, observed=y)
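For reference, this is the kind of imputation that already works when only the y side has gaps: if we understand PyMC3's masked-array convention correctly, passing the observations as a NumPy masked array makes it add a Y_obs_missing variable for the masked entries automatically (the x array and the gap below are hypothetical, just for illustration). It doesn't help us here, because in our setup the missing value also appears on the x side of mu.

import numpy as np
import pymc3 as pm

# hypothetical stand-alone example: a gap on the y side only
x_demo = np.array([0., 1., 2., 3.])
y_demo = np.ma.masked_invalid(np.array([1., 2., np.nan, 4.]))

with pm.Model() as demo:
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    mu = beta[0] + beta[1] * x_demo
    # the masked entry becomes a Y_obs_missing variable that PyMC3 samples
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=1, observed=y_demo)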
We then tried defining a custom multivariate distribution over the entire dataset, so that it is treated as a single series of values rather than being split into x's and y's before it enters the model. This still can't handle missing data, though, because the likelihood cannot be evaluated when there are NaNs in the data.
def likelihood(beta):
    def logp_(data):
        # slice the series into inputs and targets inside the likelihood
        x = data[:-1]
        y = data[1:]
        mu = beta[0] + beta[1] * x
        lik = pm.Normal.dist(mu=mu, sigma=1)
        return lik.logp(y)
    return logp_

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    Y_obs = pm.DensityDist('Y_obs', likelihood(beta), observed=data)
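To make the failure concrete (with a small hypothetical series and fixed beta values, just for illustration): a single NaN shows up both as an x and as a y after slicing, so two terms of the Gaussian log-likelihood are undefined and the summed logp is NaN.

import numpy as np

# hypothetical series with one missing value
data_missing = np.array([0., 1., np.nan, 3., 4.])

# the same slicing our model uses
x = data_missing[:-1]        # [0, 1, nan, 3]
y = data_missing[1:]         # [1, nan, 3, 4]
mu = 0.0 + 1.0 * x           # any fixed beta works for the illustration

# two of the four log-likelihood terms are undefined, so the sum is NaN
logp_terms = -0.5 * (y - mu) ** 2 - 0.5 * np.log(2 * np.pi)
print(np.isnan(logp_terms))                    # [False  True  True False]
print(np.nansum(logp_terms), np.sum(logp_terms))  # finite vs. nan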
Any ideas on how to handle missing data for this problem?