Consider a scenario with a pool of 100 datasets from different customers, each containing daily sales and budgeting data. The datasets vary in size, from small ones covering only 10 time steps to moderate ones spanning up to 600 time steps, but all share the same daily frequency. They may also differ in noise level, in how zero-inflated the sales are, in how over-dispersed the response is, and in the time-varying factors underlying the data-generating process. Despite these differences, all the datasets originate from the same environment, so certain dynamics should be apparent in all of them.
The primary objective is to build a regression model for a single target dataset, but it would be valuable to borrow information from the other datasets. For example, the target dataset might have a constant cost for one channel, making it difficult to determine that channel's impact on sales, or it might be relatively small and contain only low expenditures. To overcome these limitations, we want to incorporate data from other datasets, both to improve predictions at expenditure levels beyond those we have observed and to predict accurately for small datasets.
To accomplish this goal, we can build either a hierarchical model over several datasets or a sequential model in which priors for the target dataset are constructed beforehand from the other datasets. Either way, we need to determine which datasets are appropriate to include when constructing the priors. We can consider two cases: one with auxiliary information such as sector or country, and one without any auxiliary information, relying solely on a data-driven approach.
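To make the sequential option concrete, here is a minimal sketch of what we have in mind. It is deliberately simplified and rests on assumptions that do not hold for the real data: a single spend variable, a linear response through the origin, Gaussian noise with known variance, and no time-varying effects. All function names are hypothetical.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y on x through the origin (simplifying assumption)."""
    return float(x @ y / (x @ x))

def prior_from_pool(pool):
    """Build a normal prior (mean, variance) for the slope from per-dataset OLS slopes.

    `pool` is a list of (x, y) array pairs, one pair per auxiliary dataset.
    """
    slopes = np.array([ols_slope(x, y) for x, y in pool])
    return float(slopes.mean()), float(slopes.var(ddof=1))

def posterior_slope(x, y, prior_mean, prior_var, noise_var):
    """Conjugate normal update for beta in y = beta * x + noise.

    The noise variance is assumed known, which is an illustrative shortcut.
    Returns the posterior mean and variance of beta.
    """
    precision = 1.0 / prior_var + (x @ x) / noise_var
    mean = (prior_mean / prior_var + (x @ y) / noise_var) / precision
    return float(mean), float(1.0 / precision)

# Usage: three auxiliary datasets inform the prior for a very small target dataset.
rng = np.random.default_rng(0)
pool = []
for true_beta in (1.0, 2.0, 3.0):
    x = rng.uniform(1.0, 10.0, size=50)
    pool.append((x, true_beta * x + rng.normal(0.0, 1.0, size=50)))

prior_mean, prior_var = prior_from_pool(pool)
x_target = np.array([1.0, 1.5])
y_target = np.array([2.2, 2.9])
print(posterior_slope(x_target, y_target, prior_mean, prior_var, noise_var=1.0))
```

In this sketch every dataset in `pool` contributes equally to the prior, which is exactly why the choice of which datasets to include matters.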
What are the standard methods for determining which datasets are relevant and which are not? Are dynamic time warping or spike-and-slab priors go-to methods?
In other words, how do we choose which datasets to include when constructing the priors, taking the time-varying factors into account?
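For concreteness, the kind of purely data-driven screening we are asking about could look like the sketch below. It ranks candidate datasets by a textbook dynamic time warping distance to the target's sales series. The z-scoring step, the absolute-difference cost, and the function names are our own assumptions, and we do not know whether this is an appropriate relevance measure, which is part of the question.

```python
import numpy as np

def zscore(series):
    """Standardise a series so that scale differences between customers are ignored."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()
    std = series.std()
    return centered / std if std > 0 else centered

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance on z-scored series.

    Works for series of different lengths, e.g. 10 versus 600 time steps.
    """
    a, b = zscore(a), zscore(b)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def rank_by_similarity(target, candidates):
    """Return candidate names ordered from most to least similar to the target."""
    distances = {name: dtw_distance(target, series) for name, series in candidates.items()}
    return sorted(distances, key=distances.get)

# Usage with toy series; real inputs would be the daily sales of each customer.
target = [0, 1, 2, 3]
candidates = {"similar": [0, 0, 1, 2, 3], "reversed": [3, 2, 1, 0]}
print(rank_by_similarity(target, candidates))
```

A cutoff on the resulting ranking or distances would then decide which datasets feed into the priors. This compares only the raw series and not the underlying regression relationships.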
References to research on the topic would be highly appreciated.