On extrinsic forecasting metrics

Hello everyone.

I’m looking for a way to evaluate a forecasting model with an “extrinsic” metric.

As part of my PhD, I developed 3 state-space models for forecasting stink bug populations in soybean crops, and a company now wants to use them in production. They are interested in the models because, with the posterior predictive distribution, we can compute the probability of stink bugs surpassing a given “economic threshold”, which would require a pest control intervention to avoid economic loss.

So far, I’ve only evaluated my models using LOO-CV (although I should’ve used LFO-CV), which was enough to decide which model was best.
The thing is that this company wants to know “how useful is the model”, which makes a lot of sense, but I’m unable to respond properly: I lack the background, and my attempts to find an extrinsic metric for my Bayesian models (i.e., a metric that tells us how well the model will perform in a business context) have been unsuccessful.

So my questions are:

  1. Do you know of an extrinsic metric that I could use in this case?
  2. I wrote down an idea that I had for an extrinsic metric but I’m not sure if it’s a sound proposal. Could you please comment on it?

Extrinsic metric proposal

Given that this company wants a model that gives a “recommendation” on whether a pest control intervention should be carried out in a given week so that the economic threshold is not surpassed the next week, this proposed extrinsic metric aims to measure how often a model makes a “good” recommendation.

In the metric, so far named “Threshold Surpassing Accuracy”, a recommendation is considered ‘good’ when the model recommended carrying out a pest control intervention at time t and the stink bugs surpassed the economic threshold at t+1, or when the model did NOT recommend an intervention at time t and the stink bugs did NOT surpass the economic threshold at t+1.

A recommendation is considered ‘bad’ when the opposite happens, i.e., the model recommends an intervention and the threshold is not surpassed, or the model does not recommend an intervention and the threshold is surpassed.

The model will only recommend carrying out a pest control intervention when the probability of the population density surpassing the economic threshold at t+1 is higher than 50% (or another percentage based on expert knowledge).

A good recommendation gives a score of +1 and a bad recommendation gives a score of +0.

The “Threshold Surpassing Accuracy” (TSA) metric is calculated as the total score divided by the total number of recommendations.

The TSA metric is computed similarly to the Leave-Future-Out Cross-Validation algorithm (arXiv:1902.06281, “Approximate leave-future-out cross-validation for Bayesian time series models”): the model is trained on the first L sequential data points, a forecast for L+1 is made, and a score is given based on the recommendation (good/bad) and the observed outcome at L+1 (threshold surpassed or not). The process is repeated for every possible L and then the final score is computed.
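The LFO-style loop above could be sketched roughly as follows. This is only a sketch: `forecast_draws` is a hypothetical stand-in for refitting the state-space model and drawing from its one-step-ahead posterior predictive, and the threshold and minimum training length are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

ECONOMIC_THRESHOLD = 10.0   # placeholder threshold (density units)
P_CUTOFF = 0.5              # recommend intervention when P(exceed) > 50%

# Stand-in for a real observed time series of pest densities.
y = rng.gamma(shape=2.0, scale=4.0, size=30)

def forecast_draws(y_past, n_draws=1000):
    """Hypothetical placeholder: in practice, refit (or update) the
    state-space model on y_past and return draws from the one-step-ahead
    posterior predictive distribution. Here we fake it for illustration."""
    return rng.gamma(shape=2.0, scale=np.mean(y_past) / 2.0, size=n_draws)

L_min = 10                  # minimum training length, as in LFO-CV
score, n_reco = 0, 0
for L in range(L_min, len(y) - 1):
    draws = forecast_draws(y[:L])                   # train on first L points
    p_exceed = np.mean(draws > ECONOMIC_THRESHOLD)  # P(exceed at t+1)
    recommended = p_exceed > P_CUTOFF
    exceeded = y[L] > ECONOMIC_THRESHOLD            # observed outcome at t+1
    score += int(recommended == exceeded)           # +1 for a good recommendation
    n_reco += 1

tsa = score / n_reco
print(f"TSA = {tsa:.2f} over {n_reco} recommendations")
```

With a real model, the only piece that changes is `forecast_draws`; everything else is just counting how often the recommendation matched the observed outcome.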


I would recommend thinking in dollars, like the company does.


If

  • the cost of intervening pre-emptively is c_{pre}, and
  • intervening pre-emptively has a 100% success rate,

then the expected cost of intervening pre-emptively at period t is c_{pre}.


If

  • the cost of intervening after infestation is c_{post}, and
  • the probability of infestation in period t+1 is p,

then the expected cost of not intervening pre-emptively at t is p*c_{post} + (1-p)*0 = p*c_{post}.

So the company would intervene at t whenever c_{pre} < p*c_{post}, i.e., whenever p > c_{pre}/c_{post}.
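As a minimal sketch of this decision rule (the $20/$100 cost values are placeholders, as suggested below):

```python
C_PRE, C_POST = 20.0, 100.0   # placeholder costs; the real ratio should come from the company

def expected_cost_wait(p, c_post=C_POST):
    """Expected cost of NOT intervening at t: pay c_post only if infestation occurs at t+1."""
    return p * c_post + (1 - p) * 0.0

def should_intervene(p, c_pre=C_PRE, c_post=C_POST):
    """Intervene whenever the certain cost c_pre beats the expected cost of waiting."""
    return c_pre < expected_cost_wait(p, c_post)

# Break-even probability is p = c_pre / c_post = 0.2 with these placeholders.
print(should_intervene(0.1), should_intervene(0.5))  # prints: False True
```

Note that the 50% cutoff in the TSA proposal is a special case of this rule: it corresponds to assuming c_{pre}/c_{post} = 0.5.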

I would apply this logic in a cross-validation setting. Fit your model on 80% of the dataset. Use the fitted model to predict p for the holdout. Plug in placeholder values for c_{pre} and c_{post} (e.g. $20 and $100, or ask the company for guidance on this ratio), then simulate what the company would have done for the holdout crops. Did they save money by intervening pre-emptively on the right crops?
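That holdout simulation might look like this, comparing the model-driven policy against "never intervene" and "always intervene" baselines. Everything here is simulated stand-in data: `p_hat` plays the role of the model's predicted exceedance probabilities for the holdout crops, and the costs are the placeholder values from above.

```python
import numpy as np

rng = np.random.default_rng(0)
C_PRE, C_POST = 20.0, 100.0   # placeholder costs

# Stand-ins for the holdout set: predicted infestation probabilities
# and the actually observed outcomes (here we pretend the model is calibrated).
p_hat = rng.uniform(0.0, 1.0, size=50)
infested = rng.uniform(0.0, 1.0, size=50) < p_hat

# Policy 1: follow the model, intervening whenever c_pre < p * c_post.
intervene = C_PRE < p_hat * C_POST
cost_model = np.where(intervene, C_PRE, np.where(infested, C_POST, 0.0)).sum()

# Policy 2: never intervene (pay c_post for every infested crop).
cost_never = np.where(infested, C_POST, 0.0).sum()

# Policy 3: always intervene (pay c_pre for every crop).
cost_always = C_PRE * len(p_hat)

print(f"model: ${cost_model:.0f}, never: ${cost_never:.0f}, always: ${cost_always:.0f}")
```

The difference between the model policy's total cost and the cheapest baseline is a directly interpretable, dollar-valued answer to "how useful is the model".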