How to get "on-line" prediction estimates for data over time?

Hi, I’m new to pymc3, so forgive me if this is a stupid question:

I’m trying to model a production process, where I’m using three parameters A, B, and C to predict an output measurement Y. Each product has two relevant dates: the production date and the measurement date. At the time of production, I want to make my best guess at what the measured value will be, and then at the measurement date when I get the true measured value, I will update my model.

So what I’m trying to do is: for each product, on the production date P give the best estimate for Y based off observed data from every product whose measurement date is before P.

What is the best way to solve this kind of problem?

Since time does not seem to be a relevant variable in your model (A, B, C don’t change over time), you can ignore it. Just make your model predict observed Ys given their A, B, and C. Then use the posterior to predict new unobserved Ys given their A, B, and C. As you get more observed Ys you can keep updating your model. This gives you an idea of how you might do it in PyMC:

https://docs.pymc.io/notebooks/updating_priors.html
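
In code, the basic workflow might look roughly like this (a minimal sketch with simulated placeholder data; it assumes A, B, and C are numeric predictors, and all the names are made up):

import numpy as np
import pymc3 as pm

# simulated placeholder data standing in for the already-measured products
rng = np.random.RandomState(0)
A, B, C = rng.randn(3, 50)
Y = 1.0 * A - 2.0 * B + 0.5 * C + 0.1 * rng.randn(50)

with pm.Model() as model:
    a = pm.Normal('a', 0, 10)
    b = pm.Normal('b', 0, 10)
    c = pm.Normal('c', 0, 10)
    sigma = pm.HalfNormal('sigma', 1)

    mu = a * A + b * B + c * C
    pm.Normal('Y', mu=mu, sigma=sigma, observed=Y)

    trace = pm.sample(1000, tune=1000)

# best guess (with uncertainty) for a new, not-yet-measured product
new_A, new_B, new_C = 0.3, -1.2, 0.8
y_draws = (trace['a'] * new_A + trace['b'] * new_B + trace['c'] * new_C
           + trace['sigma'] * np.random.randn(len(trace['a'])))
print(y_draws.mean(), y_draws.std())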

Hi Ricardo, thanks for the response. I should have specified: the impact of certain parameters may indeed change over time (e.g. in a chemical process the precursor may age and its properties will change, so the batch of chemical C may be a time-dependent parameter), while other variables are time-independent (e.g. the solvent A used in the reaction).

Do you know how I would write the code to take into account a time-varying parameter?

I am afraid that is too general a question for me to offer specific, usable advice. You will need to decide how to model your time effects. Two examples: a linear function of time, or a random-walk prior…
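
As a rough illustration of the second option, a random-walk prior lets a coefficient drift from one time period to the next. This is only a sketch with simulated placeholder data; period_idx (which period each observation falls in) and the other names are made up:

import numpy as np
import pymc3 as pm

# simulated placeholder data: 200 observations spread over 10 time periods
rng = np.random.RandomState(1)
n_obs, n_periods = 200, 10
period_idx = rng.randint(0, n_periods, size=n_obs)           # period of each observation
x = rng.randn(n_obs)                                          # e.g. a property of chemical C
true_effect = 2.0 + np.cumsum(0.3 * rng.randn(n_periods))    # slowly drifting effect
y = true_effect[period_idx] * x + 0.1 * rng.randn(n_obs)

with pm.Model() as model:
    # the effect of C is allowed to change gradually between periods
    drift = pm.HalfNormal('drift', 1.0)
    c_effect = pm.GaussianRandomWalk('c_effect', sigma=drift, shape=n_periods)

    sigma = pm.HalfNormal('sigma', 1.0)
    mu = c_effect[period_idx] * x
    pm.Normal('y', mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(1000, tune=1000)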

Let me study your example. I wasn’t sure whether a for loop was the best way to do this, but from your example it seems like it is. I also didn’t know how to use the old posterior as the new prior, but by studying your example I should be able to figure that out (it looks like it will involve the Interpolated distribution?).

If I still can’t figure it out, I will come back with more context and paste in my code.

One more question: if I’m running pm.sample thousands of times in my for loop (all on similar types of data), do I have to tune each time? Or is there a way to bypass the tuning step by explicitly setting the sampler parameters it would otherwise tune?

My current difficulty: in the linked example they are estimating three variables: alpha, beta0, and beta1. In my case, different data points may use different recipes (e.g. depending on the solvent used for a given data point, I would want to use B[1] or B[2]), so each parameter will be a vector rather than a single value.
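
Roughly the kind of structure I have in mind (a toy sketch, where solvent_idx is a made-up indicator of which B applies to each row):

import numpy as np
import pymc3 as pm

solvent_idx = np.array([0, 1, 1, 0])           # which solvent each data point used
y_obs = np.array([-1.0, 1.0, 0.8, -1.2])

with pm.Model() as model:
    B = pm.Normal('B', 0, 2, shape=2)          # one coefficient per solvent
    mu = B[solvent_idx]                        # row i uses B[solvent_idx[i]]
    pm.Normal('y', mu=mu, sigma=1, observed=y_obs)
    trace = pm.sample(1000, tune=1000)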

Here’s an example of what my code looks like: on the second iteration (when I’m trying to retrieve the posteriors from the first run to use as priors) it gives me:

“MissingInputError: Input 0 of the graph (indices start from 0), used to compute Elemwise{exp,no_inplace}(eps_log__), was not provided and not given a value.”

(This is my first project with pymc3, so if I’m doing anything else stupid in the code please let me know.)

import numpy as np
import pymc3 as pm
import pandas as pd
from scipy import stats

def from_posterior(param, samples):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)

    # what was never sampled should have a small probability but not 0,
    # so we'll extend the domain and use linear approximation of density on it
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return pm.distributions.Interpolated(param, x, y)

if __name__ == '__main__':

    Data = pd.DataFrame([["A1", "B1", -1], ["A1", "B2", 1]], columns=["A", "B", "Y"])

    Distributions = {}

    InitialSigma = 2
    InitialNoise = 2
    
    for i in Data.index:

        with pm.Model() as my_model:
            # reuse the variable from an earlier iteration if one exists,
            # otherwise start from the initial prior
            try:
                A = Distributions[Data["A"].loc[i]]
            except KeyError:
                A = pm.Normal('A', 0, InitialSigma)

            try:
                B = Distributions[Data["B"].loc[i]]
            except KeyError:
                B = pm.Normal('B', 0, InitialSigma)

            try:
                eps = Distributions["Noise"]
            except KeyError:
                eps = pm.HalfCauchy('eps', InitialNoise)

            Y = A + B

            y = pm.Normal('y', Y, sigma=eps, observed=Data.loc[i, "Y"])

        with my_model:
            trace = pm.sample(draws=1000, tune=200)

            # keep the variables around so the next iteration can pick them up
            Distributions[Data["A"].loc[i]] = A
            Distributions[Data["B"].loc[i]] = B
            Distributions["Noise"] = eps

I modified the example you provided to handle a situation in which there are multiple independent parameter settings, with different values of alpha, beta0, and beta1. The initial sampling works well; however, the code errors out when it gets to the from_posterior() function, because my trace['alpha'] has shape (4000, 2) and the function expects trace['alpha'] to have shape (4000,).

Would you be able to help me modify the from_posterior() function to handle multidimensional sets of distributions? I feel like it would be simple enough to make vectors out of the variables smin, smax, width, x, and y, but I don’t know how to use the pymc3 Interpolated function to combine those into a set of distributions.

import matplotlib.pyplot as plt
import matplotlib as mpl
import pymc3 as pm
from pymc3 import Model, Normal, Slice
from pymc3 import sample
from pymc3 import traceplot
from pymc3.distributions import Interpolated
from theano import as_op
import theano.tensor as tt
import numpy as np
from scipy import stats
from random import choice
import pandas as pd

if __name__ == '__main__':
    plt.style.use('seaborn-darkgrid')
    print('Running on PyMC3 v{}'.format(pm.__version__))
    
    # Initialize random number generator
    np.random.seed(93457)
    
    # True parameter values
    alpha_true = [5, -5]
    beta0_true = [7, 17]
    beta1_true = [13, 23]
    
    # Size of dataset
    size = 100
    
    # Predictor variable
    X1 = np.random.randn(size)
    X2 = np.random.randn(size) * 0.2
    
    # for different data points, different known parameter settings are applied, which means there are two values for alpha, two values for beta0, and two values for beta1   
    Data = pd.DataFrame(columns=["Y", "alpha", "beta0", "beta1"], index=range(0,size))
    for point in range(0,size):
        # randomly choose which alphas and betas are applied as parameters for this data point
        alpha_choice = choice([0,1])
        beta0_choice = choice([0,1])
        beta1_choice = choice([0,1])
        
        this_alpha = alpha_true[alpha_choice]
        this_beta0 = beta0_true[beta0_choice]
        this_beta1 = beta1_true[beta1_choice]
        # Simulate outcome variable
        this_Y = this_alpha + this_beta0 * X1[point] + this_beta1 * X2[point] + np.random.randn()
        Data.loc[point] = [this_Y, alpha_choice, beta0_choice, beta1_choice]
        
    basic_model = Model()
    
    with basic_model:
    
        # Priors for unknown model parameters
        alpha = Normal('alpha', mu=0, sigma=1, shape=2)
        beta0 = Normal('beta0', mu=12, sigma=1, shape=2)
        beta1 = Normal('beta1', mu=18, sigma=1, shape=2)
    
        # Expected value of outcome
        mu = alpha[Data["alpha"].values.astype(int)] + beta0[Data["beta0"].values.astype(int)] * X1 + beta1[Data["beta1"].values.astype(int)] * X2
    
        # Likelihood (sampling distribution) of observations
        Y_obs = Normal('Y_obs', mu=mu, sigma=1, observed=Data["Y"])
    
        # draw 1000 posterior samples
        trace = sample(1000)
        
    traceplot(trace);
    
    def from_posterior(param, samples):
        smin, smax = np.min(samples), np.max(samples)
        width = smax - smin
        x = np.linspace(smin, smax, 100)
        y = stats.gaussian_kde(samples)(x)
    
        # what was never sampled should have a small probability but not 0,
        # so we'll extend the domain and use linear approximation of density on it
        x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
        y = np.concatenate([[0], y, [0]])
        return Interpolated(param, x, y)
    
    traces = [trace]
    
    for _ in range(10):
    
        # generate more data
        X1 = np.random.randn(size)
        X2 = np.random.randn(size) * 0.2
        for point in range(0,size):
            alpha_choice = choice([0,1])
            beta0_choice = choice([0,1])
            beta1_choice = choice([0,1])
            
            this_alpha = alpha_true[alpha_choice]
            this_beta0 = beta0_true[beta0_choice]
            this_beta1 = beta1_true[beta1_choice]
            this_Y = this_alpha + this_beta0 * X1[point] + this_beta1 * X2[point] + np.random.randn()
            Data.loc[point] = [this_Y, alpha_choice, beta0_choice, beta1_choice]
    
        model = Model()
        with model:
            # Priors are posteriors from previous iteration
            alpha = from_posterior('alpha', trace['alpha'])
            beta0 = from_posterior('beta0', trace['beta0'])
            beta1 = from_posterior('beta1', trace['beta1'])
    
            # Expected value of outcome
            mu = alpha[Data["alpha"].values.astype(int)] + beta0[Data["beta0"].values.astype(int)] * X1 + beta1[Data["beta1"].values.astype(int)] * X2
    
            # Likelihood (sampling distribution) of observations
            Y_obs = Normal('Y_obs', mu=mu, sigma=1, observed=Data["Y"])
    
            # draw 1000 posterior samples
            trace = sample(1000)
            traces.append(trace)
    
    print('Posterior distributions after ' + str(len(traces)) + ' iterations.')
    cmap = mpl.cm.autumn
    for param in ['alpha', 'beta0', 'beta1']:
        plt.figure(figsize=(8, 2))
        for update_i, trace in enumerate(traces):
            samples = trace[param]
            smin, smax = np.min(samples), np.max(samples)
            x = np.linspace(smin, smax, 100)
            y = stats.gaussian_kde(samples)(x)
            plt.plot(x, y, color=cmap(1 - update_i / len(traces)))
        plt.axvline({'alpha': alpha_true, 'beta0': beta0_true, 'beta1': beta1_true}[param], c='k')
        plt.ylabel('Frequency')
        plt.title(param)
    
    plt.tight_layout();
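
One untested sketch of how from_posterior() might be generalised: Interpolated is a univariate distribution, so instead of a single two-dimensional prior you could build one Interpolated prior per column of the trace and stack them back into an indexable vector. The helper name from_posterior_vector and the per-column naming scheme param_0, param_1 are made up here:

import numpy as np
import theano.tensor as tt
from scipy import stats
from pymc3.distributions import Interpolated

def from_posterior_vector(param, samples):
    # samples has shape (n_draws, k); build one univariate Interpolated prior
    # per column and stack them into a length-k vector the model can index
    priors = []
    for j in range(samples.shape[1]):
        col = samples[:, j]
        smin, smax = np.min(col), np.max(col)
        width = smax - smin
        x = np.linspace(smin, smax, 100)
        y = stats.gaussian_kde(col)(x)
        # extend the domain so values never sampled keep a small, nonzero density
        x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
        y = np.concatenate([[0], y, [0]])
        priors.append(Interpolated('{}_{}'.format(param, j), x, y))
    return tt.stack(priors)

# inside the updating loop, the indexing can then stay the same, e.g.
#     alpha = from_posterior_vector('alpha', trace['alpha'])
#     mu = alpha[Data["alpha"].values.astype(int)] + ...

The plotting section at the end would need a similar per-column loop (and one plt.axvline call per true value), since gaussian_kde also expects one-dimensional samples.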