Model setup for multiple events

Hello. I’m pretty new to PyMC and I would like your help with setting up the model for the following problem.

I want to model the time Formula 1 drivers take to finish a race. I’m not interested in the modelling details for now (at the moment I’m only considering distance as a predictor); the aim of this topic is to get help with the model architecture, so that the priors for the drivers in race B are the posteriors from race A. Since I’m new to this it’s hard for me to set up the architecture properly, and I’m looking for a very basic implementation that I can start with and improve later once the structure is in place.

Here is my sample code.

import pymc as pm
import numpy as np

class F1Model:
    def __init__(self, times, distances, driver_ids, previous_posterior=None):
        """
        Parameters:
        times (array-like): Observed lap times.
        distances (array-like): Distances for the corresponding lap times (could be lap number or track segment).
        driver_ids (array-like): IDs representing different drivers.
        previous_posterior (dict, optional): A dictionary with previous posterior estimates for each driver.
        """
        self.times = times
        self.distances = distances
        self.driver_ids = driver_ids
        self.model = None
        self.trace = None

        self.unique_drivers = np.unique(driver_ids)
        self.num_drivers = len(self.unique_drivers)
        self.driver_id_map = {driver_id: idx for idx, driver_id in enumerate(self.unique_drivers)}
        self.mapped_driver_ids = np.array([self.driver_id_map[driver_id] for driver_id in driver_ids])

        self.previous_posterior = previous_posterior

    def build_model(self):
        """
        Builds the PyMC model for race pace estimation.
        """
        with pm.Model() as model:
            if self.previous_posterior is not None:
                beta_mu = np.zeros(self.num_drivers)
                beta_sigma = np.zeros(self.num_drivers)
                sigma_scale = np.zeros(self.num_drivers)
                for driver_id, idx in self.driver_id_map.items():
                    if driver_id in self.previous_posterior['driver_posteriors']:
                        driver_post = self.previous_posterior['driver_posteriors'][driver_id]
                        beta_mu[idx] = driver_post['beta_mean']
                        beta_sigma[idx] = driver_post['beta_std']
                        sigma_scale[idx] = driver_post['sigma_mean']
                    else:
                        # Default values if there is no previous data for this driver
                        beta_mu[idx] = 0.06
                        beta_sigma[idx] = 0.05
                        sigma_scale[idx] = 5.0
            else:
                beta_mu = np.full(self.num_drivers, 0.06)
                beta_sigma = np.full(self.num_drivers, 0.05)
                sigma_scale = np.full(self.num_drivers, 5.0)

            beta = pm.Normal('beta', mu=beta_mu, sigma=beta_sigma, shape=self.num_drivers)
            sigma = pm.HalfNormal('sigma', sigma=sigma_scale, shape=self.num_drivers)

            distances_data = pm.Data('distances', self.distances)
            driver_ids_data = pm.Data('driver_ids', self.mapped_driver_ids)

            # Expected time grows linearly with distance, with per-driver slope and noise
            mu = beta[driver_ids_data] * distances_data
            s = sigma[driver_ids_data]

            # shape=mu.shape ties the observed variable's shape to the data containers,
            # so predict() can swap in new distances/driver_ids of a different size
            time_obs = pm.Normal('time_obs', mu=mu, sigma=s, observed=self.times, shape=mu.shape)

            self.model = model

    def fit(self, **sample_kwargs):
        """
        Fits the PyMC model to the data.
        """
        if self.model is None:
            self.build_model()

        with self.model:
            self.trace = pm.sample(**sample_kwargs)

    def predict(self, new_distances, new_driver_ids):
        """
        Predicts lap times for new distances and drivers.

        Parameters:
        new_distances (array-like): Distances for which to predict times.
        new_driver_ids (array-like): IDs of the drivers for the new distances.
        
        Returns:
        Predicted lap times based on the posterior predictive distribution.
        """
        # Map the raw driver IDs to the integer indices used inside the model
        mapped_new_ids = np.array([self.driver_id_map[d] for d in new_driver_ids])

        with self.model:
            pm.set_data({'distances': new_distances, 'driver_ids': mapped_new_ids})

            posterior_predictive = pm.sample_posterior_predictive(self.trace)

            predicted_times = posterior_predictive.posterior_predictive['time_obs'].values

            return predicted_times

    def get_posterior_estimates(self):
        """
        Returns the posterior estimates for each driver.

        Returns:
        A dictionary containing posterior estimates for each driver, including
        mean and standard deviation for both beta and sigma parameters.
        """
        beta_mean = self.trace.posterior['beta'].mean(dim=['chain', 'draw']).values
        beta_std = self.trace.posterior['beta'].std(dim=['chain', 'draw']).values
        sigma_mean = self.trace.posterior['sigma'].mean(dim=['chain', 'draw']).values
        sigma_std = self.trace.posterior['sigma'].std(dim=['chain', 'draw']).values

        driver_posteriors = {
            driver_id: {
                'beta_mean': beta_mean[idx],
                'beta_std': beta_std[idx],
                'sigma_mean': sigma_mean[idx],  # Driver-specific sigma means
                'sigma_std': sigma_std[idx]     # Driver-specific sigma std deviations
            }
            for driver_id, idx in self.driver_id_map.items()
        }

        return {
            'driver_posteriors': driver_posteriors
        }
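
For reference, this is roughly how I intend to chain two races with this class (a minimal sketch with synthetic, made-up numbers just to show the flow):

import numpy as np

# Synthetic example data: 3 drivers, 20 observations each (purely illustrative)
rng = np.random.default_rng(0)
driver_ids = np.repeat(["HAM", "VER", "LEC"], 20)
distances = np.tile(np.linspace(1000.0, 5000.0, 20), 3)             # metres
times = 0.06 * distances + rng.normal(0, 5, size=distances.size)    # seconds

# Race A: fit with the default priors
race_a = F1Model(times, distances, driver_ids)
race_a.fit(draws=1000, tune=1000)

# Race B: feed race A's posterior summaries in as priors
race_b = F1Model(times, distances, driver_ids,
                 previous_posterior=race_a.get_posterior_estimates())
race_b.fit(draws=1000, tune=1000)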

Note that it’s not trivial to use the posterior as the prior when you do inference with MCMC (and there is no single way to do it). The reason is that the output is not a distribution from a known family, but draws from a distribution whose family is unknown.

With that said, you can have a look at the Updating Priors example in the PyMC example gallery, which makes use of pm.Interpolated. Pay attention to the comment in the “Words of Caution” section about correlations.
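
The core idea in that notebook is a small helper that turns posterior draws of a scalar parameter into an empirical prior via a KDE and pm.Interpolated. Roughly (a sketch along those lines, not tested here; previous_idata stands for the InferenceData from the earlier fit):

import numpy as np
import pymc as pm
from scipy import stats

def from_posterior(name, samples):
    smin, smax = samples.min(), samples.max()
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)
    # Extend the support so values never sampled still get a small, nonzero density
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return pm.Interpolated(name, x, y)

with pm.Model():
    # One univariate Interpolated prior per parameter; correlations between
    # parameters are lost, which is exactly the "Words of Caution" point
    beta_0 = from_posterior("beta_0", previous_idata.posterior["beta"].values[:, :, 0].ravel())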

Thank you for your answer @tcapretto. So if I want to build a rating system like Elo, how can that be done when I want to keep track of the priors/posteriors of the same players over time? As far as I know, this is something that is possible through Infer.NET.
The “Words of Caution” section suggests this is not something the PyMC framework really supports!

If you’re thinking about a workflow where you fit a model at some point in time, and then use the posterior as the prior at another point in time to fit the model again, I think the answer is no (unless you do something like in the example I shared, caveats included).

I’m not very familiar with that kind of model, but I think there’s something that can be done. Here are some links that may be helpful:

The current advice is to re-fit the model every time you want to update. This will give the same answer as doing something iterative, without risking information loss on the joint posterior.
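
In practice that just means pooling everything observed so far and fitting once, e.g. with the class above (the per-race arrays here are placeholders):

import numpy as np

# Placeholders for the data observed in races A and B
all_times = np.concatenate([times_race_a, times_race_b])
all_distances = np.concatenate([distances_race_a, distances_race_b])
all_driver_ids = np.concatenate([drivers_race_a, drivers_race_b])

# One fit on the pooled data replaces the posterior-as-prior chaining
pooled = F1Model(all_times, all_distances, all_driver_ids)
pooled.fit(draws=1000, tune=1000)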

It seems like what you really want though is a model structure that captures the time series component of your data.
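
For example, each driver's pace coefficient could evolve across races as a Gaussian random walk. Just a sketch; unique_drivers, n_races, driver_idx, race_idx, all_distances and all_times are placeholder arrays (the last four indexed per observation):

import numpy as np
import pymc as pm

coords = {"driver": unique_drivers, "race": np.arange(n_races)}

with pm.Model(coords=coords) as ts_model:
    drift = pm.HalfNormal("drift", 0.01)
    # One random walk per driver along the race dimension
    beta = pm.GaussianRandomWalk(
        "beta",
        sigma=drift,
        init_dist=pm.Normal.dist(0.06, 0.05),
        steps=n_races - 1,
        dims=("driver", "race"),
    )
    sigma = pm.HalfNormal("sigma", 5)
    mu = beta[driver_idx, race_idx] * all_distances
    pm.Normal("time_obs", mu=mu, sigma=sigma, observed=all_times)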

Thanks for your answer @jessegrabowski. Exactly, and that’s actually what I’m doing, even if it seems pretty slow :grimacing:

Try using an alternative sampler (nuts_sampler="nutpie" or nuts_sampler="numpyro") to get some free speedups.
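
With the class above these can be passed straight through sample_kwargs, assuming the corresponding backend (nutpie, or numpyro/JAX) is installed:

# Forwarded to pm.sample(**sample_kwargs) inside F1Model.fit
model.fit(draws=1000, tune=1000, nuts_sampler="nutpie")
# or
model.fit(draws=1000, tune=1000, nuts_sampler="numpyro")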