How to Predict On Test Data

Nij_PADARIYA · February 17, 2025, 11:42am

Problem Description:

You have a hierarchical model where you indexed data using DistrictName_Month_Year. When splitting your dataset into train (80%) and test (20%), you randomly assigned months to either set. However, this causes a problem:

A district in the test set might have month-year combinations that do not exist in the train set.
Since the model has learned coefficients for specific district-month-year combinations, it struggles to predict for unseen months in the test set.
To handle this, you assign the coefficient from the nearest available month in the training set to the missing test months.

Example of Your Approach

Dataset Before Splitting

District	Month	Year	Feature 1	Feature 2	Disease_Count
A	3	2022	5.1	7.3	20
A	4	2022	5.5	7.1	18
A	5	2022	5.6	7.4	22
A	6	2022	5.8	7.2	24
A	7	2022	6.0	7.5	25

After Splitting (Random 80-20)

Train Set (80%) → Includes months (3, 5, 6) for District A
Test Set (20%) → Includes months (4, 7) for District A

Since months 4 and 7 are missing from the training data, you find the closest available month in the train set and assign its learned coefficient.

For instance:

Month 4 → Nearest available in training: Month 3 or 5
Month 7 → Nearest available in training: Month 6
Use these closest month coefficients to estimate missing values in the test set.
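
The nearest-month lookup described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the coefficient values are made-up placeholders rather than output from a fitted model, and averaging tied neighbours (month 4 is equally close to months 3 and 5) is one possible tie-breaking rule that the approach above does not specify.

```python
def month_index(month, year):
    """Map (month, year) to a running month count so distances work across years."""
    return year * 12 + (month - 1)


def nearest_train_months(test_key, train_keys):
    """Return every training (month, year) at the minimum distance from test_key."""
    target = month_index(*test_key)
    distances = {key: abs(month_index(*key) - target) for key in train_keys}
    smallest = min(distances.values())
    return sorted(key for key, dist in distances.items() if dist == smallest)


def borrowed_coefficient(test_key, train_coefs):
    """Average the coefficients of the nearest training months (assumed tie rule)."""
    nearest = nearest_train_months(test_key, train_coefs)
    return sum(train_coefs[key] for key in nearest) / len(nearest)


# Placeholder coefficients for District A, keyed by (month, year).
# These numbers are invented for illustration only.
train_coefs = {(3, 2022): 0.10, (5, 2022): 0.30, (6, 2022): 0.40}

for test_key in [(4, 2022), (7, 2022)]:
    print(
        test_key,
        nearest_train_months(test_key, train_coefs),
        borrowed_coefficient(test_key, train_coefs),
    )
```

Month 4 borrows from both months 3 and 5, while month 7 borrows from month 6 alone. Note that this treats the borrowed value as a point estimate and ignores the extra uncertainty of an unseen month.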

Can I improve this approach, or is there another approach I can use?

JAB · February 20, 2025, 8:34pm

I asked a similar question. Basically, I was wondering whether, when I get new observations, I should add them to the existing model or resample the model from scratch. In my case, since I can actually observe the data from the new groups, it is best for me to resample from scratch. However, there are cases where you may want to model groups that cannot be observed. In that case you would need to perform out-of-model prediction, not out-of-sample prediction.

This is a very helpful tutorial on the matter: Out of model predictions with PyMC - PyMC Labs
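
To make the idea of predicting for an unobserved group concrete, here is a rough sketch in plain Python. It is not the tutorial's code and does not use PyMC. It assumes a hierarchical model in which each group effect is drawn from Normal(mu, sigma), and that you already have posterior draws of mu and sigma; the draws below are invented placeholders and the function name is hypothetical. For each posterior draw, an effect for the unseen group is sampled from the population distribution instead of being copied from a neighbouring group.

```python
import random


def new_group_effect_draws(mu_draws, sigma_draws, seed=0):
    """For each posterior draw of (mu, sigma), sample an effect for an unseen group.

    Assumes the model defines group effects as Normal(mu, sigma).
    """
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for mu, sigma in zip(mu_draws, sigma_draws)]


# Invented hyperparameter draws standing in for a real posterior.
mu_draws = [0.25, 0.30, 0.28, 0.35]
sigma_draws = [0.10, 0.12, 0.09, 0.11]

effects = new_group_effect_draws(mu_draws, sigma_draws)
print(effects)
```

The spread of the resulting draws reflects both the uncertainty in the hyperparameters and the variation between groups, which is what a borrowed point estimate leaves out.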

The other thing is that since you have temporal data, you may want to consider something that accounts for seasonality or other cycles within the data. This could improve your estimates if what you are modeling has any seasonality.
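
One common way to account for seasonality, offered here only as an option and not as something the thread prescribes, is to replace the per-month index with smooth periodic features of the month. The sketch below builds sine and cosine (Fourier) features in plain Python; the function name and the default of one harmonic are assumptions. Because the features are defined for every month, a month that is missing from the training set still gets a prediction from the fitted seasonal curve.

```python
import math


def seasonal_features(month, n_harmonics=1, period=12):
    """Return [sin, cos] pairs for each harmonic of a yearly cycle."""
    features = []
    for k in range(1, n_harmonics + 1):
        angle = 2 * math.pi * k * month / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features


# Features exist for every month, including months 4 and 7 that were held out.
for month in [3, 4, 5, 6, 7]:
    print(month, [round(value, 3) for value in seasonal_features(month)])
```

These columns can then be used as ordinary regressors, optionally with district-level coefficients.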

