Using PyMC3 for forecasting and numerical integration

Hello everyone, I have some questions about using PyMC3 for statistical forecasting and numerical integration. Here is the problem:

I want to build a hierarchical model with Chinese provinces as the different levels, and with a numerical integration step inside the model. I followed the original radon example at https://docs.pymc.io/notebooks/GLM-hierarchical.html?highlight=radon, but I got stuck on two problems: the first is forecasting and the second is integration.

For forecasting, following the example at https://docs.pymc.io/notebooks/multilevel_modeling.html?highlight=radon, I hard-coded the PyMC3 model with a for loop running over the different provinces of China. I was able to sample the model, but the whole process takes ages to complete, and I don't know how to get around that. I know it is bad practice to use a for loop inside a PyMC3 model, but I cannot find another way to deal with the problem.

Second, I also want to embed a numerical integration step in my model, and I don't know how that can be done in PyMC3. I have tried scipy.integrate on its own and the integration itself works fine, but I am stuck on connecting the two. I would really appreciate your insights on these problems.

Thank you very much for the help.

P.S. I have attached the pm.Model code below:

with pm.Model() as model_logistic:

    sigma = pm.Bound(pm.HalfNormal, lower=0.0)('sigma', sigma=5)  # pm.HalfNormal('sigma', sigma=5)
    # bounded priors: https://docs.pymc.io/api/bounds.html
    alpha = pm.Bound(pm.Normal, lower=0.0)('alpha', mu=1, sigma=1, shape=len(np.unique(indexer)))
    beta = pm.Bound(pm.Normal, lower=0.0)('beta', mu=1, sigma=1, shape=len(np.unique(indexer)))
    t0 = pm.Bound(pm.Normal, lower=0.0)('t0', mu=10, sigma=10, shape=len(np.unique(indexer)))

    mu = alpha[indexer] / (1 + pm.math.exp(-(beta[indexer] * (df_res[~df_res.index.isin(t_pred_idx)].t.values - t0[indexer]))))

    # observed stochastic
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=df_res[~df_res.index.isin(t_pred_idx)].y.values)

    # ------------- The code above works for me ------------------------------
    # prediction
    pred_set = df_res.loc[t_pred_idx]

But I also want to forecast based on the sampled parameters of the model. Since the model is hierarchical, I index the parameters with [indexer] to constrain each part of the model to the corresponding level of the data. Without a loop, the pm.Normal() for the predictions would not run; with a loop, the process is far too slow:

    for i in np.unique(indexer):  # this is slow --> https://discourse.pymc.io/t/very-parallel-mcmc-sampling/3747
        # use shape instead of a for loop: https://docs.pymc.io/notebooks/api_quickstart.html
        prov_mask = pred_set[prov_field + '_encoded'] == indexer[i]
        Y_pred_ls = pm.Normal('Y_pred_p' + str(i),
                              mu=alpha[i] / (1 + pm.math.exp(-(beta[i] * (pred_set.loc[prov_mask].t.values - t0[i])))),
                              sigma=sigma,
                              shape=len(pred_set.loc[prov_mask].t))
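For reference, the shape-based indexing trick that the comment links to boils down to plain NumPy-style gathering with an index array (a toy sketch with made-up numbers, not the real data):

```python
import numpy as np

# toy version of the gather that replaces the per-province loop:
# one parameter per province, one data row per observation
alpha = np.array([0.5, 1.0, 2.0])        # 3 provinces
indexer = np.array([0, 0, 1, 2, 2, 2])   # province of each observation
t = np.arange(6, dtype=float)            # one time value per observation

# alpha[indexer] lines one parameter up with each observation, so the
# whole mean can be built in a single vectorized expression, no loop
mu = alpha[indexer] / (1 + np.exp(-(t - 1.0)))
print(mu.shape)  # (6,)
```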

The best way, I would say, is to use pm.Data, which is designed to handle prediction better:

training_set = df_res[~df_res.index.isin(t_pred_idx)]
test_set = df_res.loc[t_pred_idx]

with pm.Model() as model_logistic:
    predictor = pm.Data('predictor', training_set.t.values)
    observed = pm.Data('observed', training_set.y.values)

    sigma = pm.HalfNormal('sigma', sigma=5)  # HalfNormals are already bounded
    # bounded priors: https://docs.pymc.io/api/bounds.html
    alpha = pm.Bound(pm.Normal, lower=0.0)('alpha', mu=1, sigma=1, shape=len(np.unique(indexer)))
    beta = pm.Bound(pm.Normal, lower=0.0)('beta', mu=1, sigma=1, shape=len(np.unique(indexer)))
    t0 = pm.Bound(pm.Normal, lower=0.0)('t0', mu=10, sigma=10, shape=len(np.unique(indexer)))

    mu = alpha / (1 + pm.math.exp(-(beta * (predictor - t0))))
    # do the indexing here instead
    Y_obs = pm.Normal('Y_obs', mu=mu[indexer], sigma=sigma, observed=observed)

The nice thing here is that, by pushing the indexing into the observed variable, for prediction you can just use mu:

with model_logistic:
    pm.set_data({'predictor': test_set.t.values})
    Y_predict = pm.Normal('Y_predict', mu=mu, sigma=sigma, shape=...)

As for numerical integration, you can take a look at Custom theano Op to do numerical integration


Thank you for your reply. Trying this, I got an error:
ValueError: Input dimension mis-match. (input[0].shape[0] = 3069, input[1].shape[0] = 33)

The training_set dataset has 3069 rows and 33 levels (the levels being the provinces of China, similar to the radon example, which has 85 counties). Also, if the indexing is applied after mu, can I still get different estimates of alpha, beta, etc. for the 33 levels of the data?

Thank you!

I see: you actually need to build a 2-d mu and index both the provinces and the time points. I am not sure whether that makes the problem more complicated. Basically, you need a single time index t as the predictor, and you have to make sure that predictor - t0 broadcasts correctly so that the result has shape [n_province, n_timepoint].
From your error message, it seems your data is pretty nice: 33 * 93 = 3069. If you have 93 time points for each province, you actually don't need any indexing in the observed variable; just reshaping your data into shape (33, 93) will do.
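In plain NumPy, the reshape-plus-broadcasting pattern looks like this (the shapes are the ones from your data; the values are placeholders):

```python
import numpy as np

n_province, n_timepoint = 33, 93

# flat data, as it comes out of the DataFrame: 3069 rows
t_flat = np.tile(np.arange(n_timepoint, dtype=float), n_province)

# reshape into one row per province
predictor = t_flat.reshape(n_province, n_timepoint)  # (33, 93)

# per-province parameters with a trailing length-1 axis
alpha = np.full((n_province, 1), 1.0)
beta = np.full((n_province, 1), 1.0)
t0 = np.full((n_province, 1), 10.0)

# (33, 1) broadcasts against (33, 93) -> (33, 93), no loop needed
mu = alpha / (1 + np.exp(-(beta * (predictor - t0))))
print(mu.shape)  # (33, 93)
```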

Thank you so much! But I still have a problem: following your advice and the previous adjustments, I rewrote the model, but I am still stuck on the mismatch error:
ValueError: Input dimension mis-match. (input[0].shape[1] = 93, input[1].shape[1] = 33)


with pm.Model() as model_logistic:

    predictor = pm.Data('predictor', training_set.t.values.reshape((len(np.unique(indexer)), len(np.unique(training_set.Date)))))
    observed = pm.Data('observed', training_set.y.values.reshape((len(np.unique(indexer)), len(np.unique(training_set.Date)))))

    sigma = pm.Bound(pm.HalfNormal, lower=0.0)('sigma', sigma=5)
    alpha = pm.Bound(pm.Normal, lower=0.0)('alpha', mu=1, sigma=1, shape=len(np.unique(indexer)))
    beta = pm.Bound(pm.Normal, lower=0.0)('beta', mu=1, sigma=1, shape=len(np.unique(indexer)))
    t0 = pm.Bound(pm.Normal, lower=0.0)('t0', mu=10, sigma=10, shape=len(np.unique(indexer)))

    mu = alpha / (1 + pm.math.exp(-(beta * (predictor - t0))))
    Y_obs = pm.Normal('Y_obs', mu=mu[(np.unique(indexer), np.arange(len(np.unique(training_set.t))))], sigma=sigma, observed=observed)

PS: the shape info would be:

    training_set.t.values.reshape((len(np.unique(indexer)), len(np.unique(training_set.Date)))).shape
    # (33, 93)

    (np.unique(indexer), np.arange(len(np.unique(training_set.t))))
    # (array([0, 1, 2, ..., 32]), array([0, 1, 2, ..., 92]))
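The same mismatch can be reproduced in plain NumPy: advanced indexing with a tuple of arrays pairs the arrays elementwise, so the two index arrays (lengths 33 and 93 here) must broadcast against each other, and they don't:

```python
import numpy as np

mu = np.zeros((33, 93))
rows = np.arange(33)  # province index
cols = np.arange(93)  # time index

# tuple indexing pairs the two arrays elementwise, so their shapes
# must broadcast; (33,) vs (93,) does not
try:
    mu[(rows, cols)]
except IndexError as err:
    print(type(err).__name__)  # IndexError
```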

Try this:

n_province, n_timepoint = 33, 93
# replace below with your real data
predictor_val = np.repeat(
    np.linspace(0., 1., n_timepoint)[np.newaxis, ...],
    n_province,
    axis=0)  # want shape = (33, 93)
observed_val = np.random.randn(n_province, n_timepoint)

with pm.Model() as model_logistic:
    predictor = pm.Data('predictor', predictor_val)
    observed = pm.Data('observed', observed_val) 

    sigma =  pm.HalfNormal('sigma', sigma=5)
    alpha = pm.Bound(pm.Normal, lower=0.0)('alpha', mu=1, sigma=1, shape=(n_province, 1))
    beta =  pm.Bound(pm.Normal, lower=0.0)('beta', mu=1, sigma=1, shape=(n_province, 1))
    t0 = pm.Bound(pm.Normal, lower=0.0)('t0', mu=10, sigma=10, shape=(n_province, 1))
    mu = alpha / (1 + pm.math.exp(-(beta*(predictor-t0))))
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=observed)
    trace = pm.sample(10)

Thank you again! Your code works correctly! However, when I also include the prediction step, I assembled the following code, which raises an error indicating that the shape of the predictor dataset is wrong:

ValueError: Input dimension mis-match. (input[0].shape[1] = 25, input[1].shape[1] = 93)

Do I still need to specify tuple indexing in mu for the prediction section of the model?


with pm.Model() as model_logistic:
    predictor = pm.Data('predictor', predictor_val)
    observed = pm.Data('observed', observed_val)

    sigma = pm.HalfNormal('sigma', sigma=5)
    alpha = pm.Bound(pm.Normal, lower=0.0)('alpha', mu=1, sigma=1, shape=(n_province, 1))
    beta = pm.Bound(pm.Normal, lower=0.0)('beta', mu=1, sigma=1, shape=(n_province, 1))
    t0 = pm.Bound(pm.Normal, lower=0.0)('t0', mu=10, sigma=10, shape=(n_province, 1))
    mu = alpha / (1 + pm.math.exp(-(beta * (predictor - t0))))
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=observed)

    # prediction: the prediction data set only contains 25 time points
    # per province, not 93
    pm.set_data({'predictor': test_set.t.values.reshape((33, 25))})
    pm.sample(10)

No need to do indexing:

n_timepoint_forecast = 20
predictor_forecast = np.repeat(
    np.linspace(1., 1. + 1./n_timepoint*n_timepoint_forecast, n_timepoint_forecast)[np.newaxis, ...],
    n_province,
    axis=0)

with model_logistic:
    pm.set_data({'predictor': predictor_forecast})
    pm.sample_posterior_predictive(trace, var_names=['Y_obs'])

Finally! It works! Amazing! Thank you!

But I am still wondering: if the above code works for a 2-d mu, would it also work for higher-dimensional data? For example, given data with shape (n_province, n_city, n_timepoint), would I just need to change the shape of alpha (and beta, etc.) to shape=(n_province, n_city, 1), and reshape the input predictor and observed data accordingly? Would all of this still be faster than nested for loops?

Thank you!

It depends. Since city is nested within province, you can probably model it with (city_number, time_points) only. However, if you have another predictor that is completely orthogonal to the other dimensions, then you can do what you are describing.
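As far as the shapes themselves go, plain NumPy broadcasting confirms that the (n_province, n_city, 1) pattern extends to three dimensions (a toy check with made-up sizes and random data):

```python
import numpy as np

n_province, n_city, n_timepoint = 3, 4, 5

# made-up data, shaped (province, city, time)
predictor = np.random.rand(n_province, n_city, n_timepoint)

# per-(province, city) parameters with a trailing length-1 axis
alpha = np.ones((n_province, n_city, 1))
beta = np.ones((n_province, n_city, 1))
t0 = np.ones((n_province, n_city, 1))

# (3, 4, 1) broadcasts against (3, 4, 5) -> (3, 4, 5), no nested loops
mu = alpha / (1 + np.exp(-(beta * (predictor - t0))))
print(mu.shape)  # (3, 4, 5)
```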


Thanks. I have recently been reading the integration code in Custom theano Op to do numerical integration.

I tried to embed it in the model above, where:

    func = pm.math.exp(-(beta * (t - t0))**2)
    integrate = Integrate(func, t, beta, t0)
    integral_results = integrate(start, stop, beta, t0)
    mu = alpha / 2 * (1 + 2 / pm.math.sqrt(math.pi) * integral_results)

But it does not seem to work. The integration part is essentially the following formula:

[image: y(t) = A/2 · (1 + (2/√π) · ∫ exp(−(α(s − β))²) ds), an error-function growth curve]

where the Greek letter alpha corresponds to "beta" in the code above, and the Greek letter beta corresponds to "t0". The t variable in the picture is just the "predictor", indicating the time points from 0 to 92 (93 time points in total).
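If this is indeed the error-function form of the curve (an assumption reconstructed from the code above, with hypothetical parameter values), the integral with the parameter in its upper limit has a closed form via scipy.special.erf, which a quick numerical check against scipy.integrate.quad confirms:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

# hypothetical values standing in for the sampled parameters
alpha, beta, t0 = 2.0, 0.5, 10.0
t = 15.0

# numerical integration of exp(-s**2) from 0 to beta*(t - t0)
integral, _ = quad(lambda s: np.exp(-s**2), 0.0, beta * (t - t0))
mu_quad = alpha / 2 * (1 + 2 / np.sqrt(np.pi) * integral)

# the same quantity via the closed-form error function
mu_erf = alpha / 2 * (1 + erf(beta * (t - t0)))

print(np.isclose(mu_quad, mu_erf))  # True
```

If the closed form applies, pm.math.erf could replace the custom integration Op entirely; otherwise the custom Op route from the linked thread is still needed.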

I know this problem is much more specific, but I think that with the integration formula spelled out, the code is much easier to understand.

Thank you in advance!

Found a great solution here: PyMC3: Using a parameter as a limit of integral
