How to Predict On Test Data

Problem Description:

You have a hierarchical model where you indexed data using DistrictName_Month_Year. When splitting your dataset into train (80%) and test (20%), you randomly assigned months to either set. However, this causes a problem:

  • A district in the test set might have month-year combinations that do not exist in the train set.
  • Since the model has learned coefficients for specific district-month-year combinations, it struggles to predict for unseen months in the test set.
  • To handle this, you assign the coefficient from the nearest available month in the training set to the missing test months.

Example of Your Approach

Dataset Before Splitting

District  Month  Year  Feature 1  Feature 2  Disease_Count
A         3      2022  5.1        7.3        20
A         4      2022  5.5        7.1        18
A         5      2022  5.6        7.4        22
A         6      2022  5.8        7.2        24
A         7      2022  6.0        7.5        25

After Splitting (Random 80-20)

  • Train Set (80%) → Includes months (3, 5, 6) for District A
  • Test Set (20%) → Includes months (4, 7) for District A

Since months 4 and 7 are missing from the training data, you find the closest available month in the train set and assign its learned coefficient.

For instance:

  • Month 4 → Nearest available in training: Month 3 or 5
  • Month 7 → Nearest available in training: Month 6
  • Use these closest month coefficients to estimate missing values in the test set.
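The nearest-month lookup described above can be sketched with pandas. This is only an illustration: the coefficient values, the column names, and the `to_index` helper are hypothetical and do not come from a fitted model. Month 4 is equally close to months 3 and 5, so which one is picked depends on how the tie is broken.

```python
import pandas as pd

# Hypothetical learned coefficients for District A, one per training month.
train_coefs = pd.DataFrame({
    "district": ["A", "A", "A"],
    "year": [2022, 2022, 2022],
    "month": [3, 5, 6],
    "coef": [0.10, 0.30, 0.40],  # made-up values for illustration
})

# Test rows whose district-month-year combination was never seen in training.
test = pd.DataFrame({
    "district": ["A", "A"],
    "year": [2022, 2022],
    "month": [4, 7],
})


def to_index(df):
    """Convert (year, month) to a running month count so distances span years."""
    return df["year"] * 12 + (df["month"] - 1)


train_coefs["t"] = to_index(train_coefs)
test["t"] = to_index(test)

# Keep only the columns needed from the training side and rename the month
# column so it does not clash with the test month.
right = (
    train_coefs[["district", "t", "month", "coef"]]
    .rename(columns={"month": "train_month"})
    .sort_values("t")
)

# For each test row, take the training row of the same district whose time
# index is nearest. Both frames must be sorted by the "on" key.
matched = pd.merge_asof(
    test.sort_values("t"),
    right,
    on="t",
    by="district",
    direction="nearest",
)

print(matched[["district", "year", "month", "train_month", "coef"]])
```

Working on a running month index rather than the raw month number means December of one year and January of the next are treated as adjacent.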

Can I improve this approach, or is there another approach I can use?

I asked a similar question. Basically, I was wondering whether, when getting new observations, I should add them to the existing model or resample the model from scratch. In my case, since I can actually observe the data from the new groups, it is best for me to resample from scratch. However, there are cases where you may want to model groups that cannot be observed. In that case, you would need to perform out-of-model prediction rather than out-of-sample prediction.

This is a very helpful tutorial on the matter: Out of model predictions with PyMC - PyMC Labs
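As a rough sketch of the idea for a group the model never saw: rather than borrowing a coefficient from a neighbouring group, you can draw the new group's effect from the group-level distribution, once per posterior draw. The example below uses plain NumPy with simulated stand-ins for the posterior draws, and it assumes a Poisson likelihood with a log link. Both are assumptions for illustration, not the model from the question or the code from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 1000

# Stand-ins for posterior draws. In practice these would be extracted from
# the fitted model's trace; the numbers here are invented.
mu_draws = rng.normal(0.5, 0.05, size=n_draws)              # group-level mean
sigma_draws = np.abs(rng.normal(0.2, 0.02, size=n_draws))   # group-level sd
beta_draws = rng.normal(0.3, 0.03, size=n_draws)            # shared slope

# Unseen group: sample its effect from the group-level distribution,
# one value per posterior draw, so hyperparameter uncertainty carries through.
new_group_effect = rng.normal(mu_draws, sigma_draws)

# Hypothetical feature value for the new observation.
x_new = 5.5

# Assumed Poisson regression with a log link.
log_rate = new_group_effect + beta_draws * x_new
pred_counts = rng.poisson(np.exp(log_rate))

# Summarise the predictive distribution for the unseen group.
lower, upper = np.percentile(pred_counts, [3, 97])
print(pred_counts.mean(), lower, upper)
```

The predictive interval for an unseen group is wider than for an observed one, because the group's own effect is uncertain as well as the shared parameters.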

The other thing is that since you have temporal data, you may want to consider something that accounts for seasonality or other cycles within the data. This could improve your estimates if what you are modeling has any seasonality.
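One common way to account for a yearly cycle is to replace per-month coefficients with a few sine and cosine (Fourier) features of the month. The sketch below is a minimal NumPy version; the function name `seasonal_features` and its parameters are hypothetical, and the 12-month period is an assumption.

```python
import numpy as np


def seasonal_features(month, n_harmonics=1, period=12):
    """Return sin/cos features for a cycle of the given period.

    Columns are [sin(1), ..., sin(K), cos(1), ..., cos(K)] for K harmonics.
    """
    month = np.asarray(month, dtype=float)
    k = np.arange(1, n_harmonics + 1)
    # One row per observation, one column per harmonic.
    angle = 2.0 * np.pi * np.outer(month, k) / period
    return np.hstack([np.sin(angle), np.cos(angle)])


# Months 3 to 7 from the example table, with two harmonics.
X = seasonal_features([3, 4, 5, 6, 7], n_harmonics=2)
print(X.shape)
```

Because the features depend only on the position in the cycle, a month that is missing from the training set still gets well-defined feature values, so no nearest-month lookup is needed for the seasonal part of the model.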