Handling missing data in linear regression on timeseries data

We would like to perform linear regression on timeseries data, where each data point is conditioned on the previous one, i.e. we want to learn coefficients b_0 and b_1 such that

y_t \sim N(\mu = b_0 + b_1 * y_{t-1}, \sigma = 1)

for all t. We have been able to build a model that does this when the data are complete, but we are stuck on how to handle missing data. With the model defined by the code below, PyMC3 can easily impute any missing y values. However, for this particular problem the x values are the same as the y values (just offset by one index), so a missing y is also a missing x.

import numpy as np
import pymc3 as pm

data = np.arange(10)
x = data[:-1]   # predictors: the series shifted back by one step
y = data[1:]    # targets: the series itself

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    mu = beta[0] + beta[1] * x
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=1, observed=y)
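
To make the issue concrete, here is a toy version with a single gap placed by hand; the same NaN lands in both slices:

# One gap in the raw series shows up in both x and y:
data_with_gap = np.array([0., 1., 2., 3., np.nan, 5., 6., 7., 8., 9.])
x = data_with_gap[:-1]   # NaN at index 4 -> a missing predictor
y = data_with_gap[1:]    # NaN at index 3 -> a missing observation

Masking y alone is not enough, because the same NaN also enters mu through x.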

We then tried defining a custom distribution over the entire dataset, so that it is treated as one series of values rather than being split into x's and y's before entering the model. This still cannot handle missing data, because there is no way to evaluate the likelihood when there are NaNs in the data.

def likelihood(beta):

    def logp_(data):
        # data arrives as a Theano tensor holding the observed values;
        # .eval() pulls out the raw array so it can be sliced into
        # predictors and targets.
        data = data.eval()
        x = data[:-1]
        y = data[1:]
        mu = beta[0] + beta[1] * x
        lik = pm.Normal.dist(mu=mu, sigma=1)
        return lik.logp(y)

    return logp_

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    Y_obs = pm.DensityDist('Y_obs', likelihood(beta), observed=data)

Any ideas on how to handle missing data for this problem?

Hi,
Maybe casting your data to a numpy masked array could help? That way, PyMC should be able to handle and impute the missing data points. Here is an example of how to do that.
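Something along these lines (just a rough sketch on a toy series; the made-up gap here only affects y, so the lagged x stays fully observed):

import numpy as np
import pymc3 as pm

data = np.arange(10, dtype=float)
data[-1] = np.nan                     # toy gap: only the last observation is missing

x = data[:-1]                         # fully observed predictors
y = np.ma.masked_invalid(data[1:])    # masked array -> PyMC3 imputes the masked entry

with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    mu = beta[0] + beta[1] * x
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=1, observed=y)
    trace = pm.sample()               # the masked entry is sampled as an extra free variable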
Hope this helps :vulcan_salute:
