Handling missing data in linear regression on timeseries data

We would like to perform linear regression on timeseries data, where each data point is conditioned on the previous one, i.e. we want to learn coefficients b_0 and b_1 such that

y_t \sim N(\mu = b_0 + b_1 * y_{t-1}, \sigma = 1)

for all t. We have been able to build a model that does this when the data are complete, but we are stuck on how to handle missing data. With the model defined by the code below, PyMC3 can easily impute any missing y values. However, for this particular problem the x values are the same as the y values (just offset by one index), so a missing y is also a missing x.

import numpy as np
import pymc3 as pm

data = np.arange(10)
x = data[:-1]   # predictors: the series shifted back by one step
y = data[1:]    # targets: the series itself

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    mu = beta[0] + beta[1] * x
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=1, observed=y)
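
To make the issue concrete, here is a toy version with a single gap placed by hand; the same NaN lands in both slices:

# One gap in the raw series shows up in both x and y:
data_with_gap = np.array([0., 1., 2., 3., np.nan, 5., 6., 7., 8., 9.])
x = data_with_gap[:-1]   # NaN at index 4 -> a missing predictor
y = data_with_gap[1:]    # NaN at index 3 -> a missing observation

Masking y alone is not enough, because the same NaN also enters mu through x.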

We then tried defining a custom distribution over the entire dataset, so that it is treated as one series of values rather than being split into x's and y's before entering the model. This still cannot handle missing data, because there is no way to evaluate the likelihood when there are NaNs in the data.

def likelihood(beta):

    def logp_(data):
        # data arrives as a Theano tensor holding the observed values;
        # .eval() pulls out the raw array so it can be sliced into
        # predictors and targets.
        data = data.eval()
        x = data[:-1]
        y = data[1:]
        mu = beta[0] + beta[1] * x
        lik = pm.Normal.dist(mu=mu, sigma=1)
        return lik.logp(y)

    return logp_

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    Y_obs = pm.DensityDist('Y_obs', likelihood(beta), observed=data)

Any ideas on how to handle missing data for this problem?

Hi,
Maybe casting your data to a numpy masked array could help? That way, PyMC should be able to handle and impute the missing data points. Here is an example of how to do that.
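Something along these lines (just a rough sketch on a toy series; the made-up gap here only affects y, so the lagged x stays fully observed):

import numpy as np
import pymc3 as pm

data = np.arange(10, dtype=float)
data[-1] = np.nan                     # toy gap: only the last observation is missing

x = data[:-1]                         # fully observed predictors
y = np.ma.masked_invalid(data[1:])    # masked array -> PyMC3 imputes the masked entry

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    mu = beta[0] + beta[1] * x
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=1, observed=y)
    trace = pm.sample()               # the masked entry is sampled as an extra free variable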
Hope this helps :vulcan_salute:
