How do I combine two models into one larger model?

Hello all,

I am new to PyMC and to Bayesian statistics in general. I am trying to figure out how to fit a model that I have previously been doing in two steps with a frequentist approach. I am working with media consumption data, and I am trying to project total consumption 8 months out using data from the first 35 days. The media consumption data seems to largely follow an exponential decay pattern. By log transforming and fitting a line I can model the decay and extrapolate into the future. Looking at historical data, this extrapolation is almost always an underestimate, but it correlates really well with the total consumption at 8 months. So I have fit a second linear model under a log-log transform.

Frequentist Method:
Model 1: log(daily consumption) = m1 * day + b1 #done individually for each video
Model 2: log(8 months total consumption) = m2 * log(extrapolated total consumption) + b2

One of the motivations for doing this is to answer the question, “How likely is video A to outperform video B at the end of 8 months, given the data?” That is, in my opinion, squarely in the domain of Bayesian statistics. Because of this I am brushing up on my Bayesian statistics and trying to convert this method into a Bayesian model.

Using bambi, I have been able to fit the equivalent (or better) of each of these models. The issue that I see is that I am losing a lot of information when I move from Model 1 to Model 2. I can see many different approaches to the transition, but they all involve collapsing the posterior distribution of Model 1 into a single value. What I would like to do instead is to either fit one combined model or provide the transformed posteriors of Model 1 to inform Model 2.

I think fitting a combined model would be superior even if it is more computationally expensive. The challenge that I see is that there is a lot of gymnastics in between Model 1 and Model 2. First, we are converting from a kind of time series problem to an area-under-a-curve problem. More importantly, we transform out of and back into log space. Can NUTS handle something like that? How would I write a combined model? Is this where you would use pm.Deterministic?

The other option is to transform the posteriors of Model 1 and provide them as a prior to Model 2. I have seen posts where Kernel Density Estimation is used to construct a prior distribution. I think what I am trying to do is different. I don’t have prior knowledge of most of the parameters in the second model because I am not fitting the same model again. I only have a probability distribution around the independent variable. There is also the idea that each video has its own uncertainty around it. The daily consumption of some videos is better modeled by exponential decay than that of other videos. The effect is that we have more confidence about the long-term projection for some videos than others. I am not sure how to provide that information to the second model.

I can work on creating a script to generate fake data similar to my data if that would be helpful to the discussion.

Thanks in advance for the help. This forum has already answered a few of my questions, and given me a lot of food for thought.

I think some simulated data (and the code to generate it) along with a sketch of the 2 models you are fitting (plus the “gymnastics”) would be helpful as a starting point for a combined model.

Thank you for the quick reply. Sorry about the slow reply on my end. Here is a script that synthesizes a similar looking dataset and then runs my current approach on it.

import pandas as pd
import numpy as np
import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import statsmodels.api as sm
import xarray as xr


NumberOfVideos = 250
DailyDays = 35
TotalDays = 224

#Create Names
GenerationParameters = pd.DataFrame()
GenerationParameters['number'] = range(1, NumberOfVideos + 1)
GenerationParameters['text'] = 'Video_'
GenerationParameters['metadata_title'] = GenerationParameters['text'].str.cat(GenerationParameters['number'].values.astype(str))
GenerationParameters.drop(['number', 'text'], axis = 1, inplace = True)

#Create Generation Parameters
RANDOM_SEED = 42
rng = np.random.default_rng(RANDOM_SEED)
GenerationParameters['Intercept'] = rng.uniform(8, 18, NumberOfVideos)
GenerationParameters['Slope'] = rng.normal(-0.04, 0.01, NumberOfVideos)

#Create Daily Consumption Data
DailyConsumption = pd.DataFrame(columns = GenerationParameters['metadata_title'], index = range(0, DailyDays))
for Video in GenerationParameters.index:
    DailyConsumption[GenerationParameters['metadata_title'][Video]] = np.exp((DailyConsumption.index * GenerationParameters['Slope'][Video]) + GenerationParameters['Intercept'][Video] + rng.normal(0, 0.02, DailyDays))

#Create Total Consumption Data
TotalConsumption = pd.DataFrame(index = GenerationParameters['metadata_title'])
TotalTime = np.arange(TotalDays)
TotalConsumption['Actual'] = 0.0
for Video in GenerationParameters.index:
    y = np.exp((TotalTime * GenerationParameters['Slope'][Video]) + GenerationParameters['Intercept'][Video])
    s = np.log(y.sum())
    TotalConsumption.loc[GenerationParameters['metadata_title'][Video], 'Actual'] = np.exp((1.03 * s) + rng.normal(0.6, 0.4))

#Current Approach
#Transform Data
DailyConsumption['time'] = DailyConsumption.index
Long = pd.melt(DailyConsumption, id_vars = 'time', var_name = 'metadata_title', value_name='streams')
Long['streams'] = np.log(1 + Long.streams)

#Fit Model 1
model1 = bmb.Model("streams ~ time + (time|metadata_title)", Long)
results1 = model1.fit()

#Gymnastics
time_array = xr.DataArray(range(0,TotalDays))
posterior = az.extract_dataset(results1)
intercept_common = posterior["Intercept"]
slope_common = posterior["time"]
Out = pd.DataFrame()
for Video in posterior.metadata_title__factor_dim:
	slope_specific_0 = posterior["time|metadata_title"].sel(metadata_title__factor_dim = Video)
	intercept_specific_0 = posterior["1|metadata_title"].sel(metadata_title__factor_dim = Video)
	a = (intercept_common + intercept_specific_0)
	b = (slope_common + slope_specific_0)
	#Use model to project consumption into the future
	Fits = np.exp(a + (b * time_array))
	VideoTitle = Video.metadata_title__factor_dim.item()
	#add up the projections
	Out[VideoTitle] = Fits.sum(dim = 'dim_0')

Log_Out = np.log(Out)
TotalConsumption['Log_Actual'] = np.log(TotalConsumption['Actual'])
TotalConsumption['Projection'] = np.nan
for Video in Log_Out.columns:
	TotalConsumption.at[Video, 'Projection'] = Log_Out[Video].mean()

#Fit Model 2
model2 = bmb.Model("Log_Actual ~ Projection", TotalConsumption)
results2 = model2.fit()

Edit: Fix formatting.

I guess one solution, at least theoretically (not sure it’s practical with your data) would be to build a hierarchical model and do everything in a single step?

You have:
Model 1: log(daily consumption) = m1 * day + b1 #done individually for each video
Model 2: log(8 months total consumption) = m2 * log(extrapolated total consumption) + b2

Let me turn that into:
m1 ~ SomePrior(shape=nvideos)
b1 ~ SomePrior(shape=nvideos)
extrapolation = SomeDeterministicFunction(m1,b1,day,target)
m2 ~ SomePrior(shape=1)
b2 ~ SomePrior(shape=1)
mean_logconsumption = m2*log(extrapolation)+b2
Observed_logconsumption ~ SomeLikelihood(mean_logconsumption)

This is the path I am trying to walk down. I am a little worried about how the “SomeDeterministicFunction” is handled.

If you just use basic operations, then Python will take care of combining RVs properly by itself; you write your function just as if it were acting on real numbers.
Example:

import pymc as pm

def f(x, y):
    return x * y

with pm.Model():
    a = pm.Normal('a', 0, 1)
    b = pm.Normal('b', 0, 1)
    c = f(a, b)

If you use basic operations plus some simple special functions, you have to use, for instance, pytensor.tensor.exp (pt.exp, with import pytensor.tensor as pt) instead of numpy.exp.
Example:

with pm.Model():
    a = pm.Normal('a', mu=0, sigma=1, shape=10)
    b = pt.exp(a)

But overall it should be trivial, in this case you’re unlikely to need very pytensor-specific things like scans.

PS: I am assuming that the extrapolation is something simple like
log_extrap = m1 * target_day + b1

It is a little more complicated than that. It is the sum over all days in the period of interest (8 months).
log_extrap = log( exp(m1 * 0 + b1) + exp(m1 * 1 + b1) + exp(m1 * 2 + b1) … + exp(m1 * 224 +b1))

I think it would be possible to write the function as a combination of simple operations and a few special functions, but it would be a lot of operations.

Then pt.sum is your friend. Most of these functions work (almost) like in numpy and with the same name.
So:
log_extrap = pt.log( pt.sum( pt.exp(m1[:,None]*np.arange(0,225)[None,:]+b1[:,None]), axis=1 ) )

PS: That being said, if this really is your formula then it's easy to just replace it with a closed-form expression and make the code faster. exp(b1) factorises out and the rest is a geometric series in exp(m1).
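For instance, a sketch of that closed form (assuming m1 is not 0, and summing days 0 through T-1 to match np.arange(0, 225) above):

# geometric series: sum_{t=0}^{T-1} exp(m1*t + b1)
#                 = exp(b1) * (1 - exp(m1)**T) / (1 - exp(m1))
T = 225
log_extrap = b1 + pt.log((1 - pt.exp(m1 * T)) / (1 - pt.exp(m1)))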

I am having a hard time figuring out how to get the data formatted correctly. I am starting out with two pandas data frames. The first one is 35 x 250: one column for each video, and each row is a time point. The second is 250 x 1: the total number of streams for each video. I have been using pd.melt to convert the first matrix to long form to input it into the model. This creates an “Input dimension mismatch”. Is there a better approach to setting up the data?

I think something that would help clarify your thinking would be to rewrite your data generation in terms of what you want your model to look like. Right now, it’s a lot of loops and one-by-one computation, which is not how PyMC models are generally written. You also have parameters scattered all throughout the code. Start by gathering those into one place – this highlights everything that will need to be estimated in your PyMC model.

# Indices and stuff
NumberOfVideos = 250
DailyDays = 35
TotalDays = 224

daily_dates = np.arange(DailyDays)
total_dates = np.arange(TotalDays)

video_index = np.arange(NumberOfVideos, dtype=int) + 1
titles = [f'video_{x}' for x in video_index]

# Parameters to estimate
true_intercepts = rng.uniform(8, 18, NumberOfVideos)
true_slopes = rng.normal(-0.04, 0.01, NumberOfVideos)
true_daily_sigma = 0.02 # from scale parameter on DailyConsumption noise

true_total_intercept = 0.6 # from loc parameter on TotalConsumption.Actual noise
true_total_slope = 1.03 # multiplied by `s`
true_total_sigma = 0.4 # from scale parameter of TotalConsumption.Actual noise

Then, if I understand correctly, you want to model the log of your data with a deterministic trend. Vectorized, that looks like this. I draw from a lognormal instead of exp(N(mu, sigma)), but it’s the same thing:

mu_log_consumption = true_intercepts[None] + true_slopes[None] * daily_dates[:, None]
daily_consumption = rng.lognormal(mu_log_consumption, true_daily_sigma)

The Nones add dimensions to the different variables, in this case making true_intercepts and true_slopes into row vectors, and daily_dates into a column vector. After broadcasting, the shape of mu_log_consumption will be (DailyDays, NumberOfVideos).
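As a quick sanity check of those shapes, using the names defined above:

assert mu_log_consumption.shape == (DailyDays, NumberOfVideos)
assert daily_consumption.shape == (DailyDays, NumberOfVideos)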

Then you can do something similar to compute the sum over the longer time period by just substituting total_dates for daily_dates:

log_consumption_extrapolated = true_intercepts[None] + true_slopes[None] * total_dates[:, None]
log_total = np.log(np.exp(log_consumption_extrapolated).sum(axis=0))
mu_log_total = true_total_intercept + true_total_slope * log_total
total_consumption = rng.lognormal(mu_log_total, true_total_sigma)

Written this way, I think it’s much clearer how to put it all into a PyMC model.

Thank you very much for your help. I am guessing that you did not see that I am trying to combine two models.

model1 = bmb.Model("streams ~ time + (time|metadata_title)", Long)

and

model2 = bmb.Model("Log_Actual ~ Projection", TotalConsumption)

All of the stuff in between the two models in the code above is a deterministic way to create a projection. The issue that I run into is that the data from model 1 and the additional data for model 2 have inherently different dimensions.

Model 1, as you noted, has data with dimensions (DailyDays, NumberOfVideos), but the additional data for Model 2 has dimensions (NumberOfVideos, 1).

I have previously been using pd.melt() to convert the (DailyDays, NumberOfVideos) matrix into a (DailyDays * NumberOfVideos, 3) long-form frame, which worked for Model 1 by itself.

It is unclear to me how to arrange the data to let PyMC use it properly.

Here is the model that I have been working on, thanks in great part to julien’s excellent advice and the radon gas tutorial.

import pymc as pm
import pytensor.tensor as pt

# Long comes from the data prep above; metadata_title (assumed to be integer
# video codes per observation), TotalStreams, and n_tracks are prepared similarly
with pm.Model() as two_part:
	#Data
	time = pm.MutableData('time', Long['time'])
	track = pm.MutableData('track', metadata_title)
	streams = pm.MutableData('streams', Long['streams'])
	total = pm.MutableData('total', TotalStreams)
	# Hyperpriors for Model 1
	mu_a = pm.Normal('mu_a', mu=8.8, sigma=10)
	sigma_a = pm.HalfCauchy('sigma_a', 5)
	mu_b = pm.Normal('mu_b', mu= -0.04, sigma=10)
	sigma_b = pm.HalfCauchy('sigma_b', 5)
	#offsets for Model 1
	a_offset = pm.Normal('a_offset', mu=0, sigma=1, shape=n_tracks)
	a = pm.Deterministic('a', mu_a + a_offset * sigma_a)
	b_offset = pm.Normal('b_offset', mu=0, sigma=1, shape=n_tracks)
	b = pm.Deterministic('b', mu_b + b_offset * sigma_b)
	# error for Model 1
	eps = pm.HalfCauchy('eps', 5)
	streams_est = a[track] + b[track] * time
	# Data likelihood for Model 1
	streams_like = pm.Normal('streams_like', mu=streams_est, sigma=eps, observed=streams)
	#Model 2
	projection = pm.Deterministic('projection', pt.log((pt.exp(a[track]) * (1-(pt.exp(b[track]) ** 210)))/(1 - pt.exp(b[track]))))
	c = pm.Normal('c', mu = 13.0881, sigma = 32.8962)
	d = pm.Normal('d', mu = 0, sigma=2.6768)
	# error for Model 2
	eps2 = pm.HalfCauchy('eps2', 5)
	total_est = c + d * projection
	# Data likelihood for Model 2
	total_like = pm.Normal('total_like', mu=total_est, sigma = eps2, observed=total)

I was showing the computations with wide-form data precisely because it is easier to reason about how to connect the two models in that form. You can map the numpy code I presented to PyMC 1-to-1, replacing true parameters with priors (hierarchical or otherwise), and the lognormal distributions with observed random variables.
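Roughly like the sketch below, for instance. This is only a rough outline: the priors are placeholders you would want to think about, and daily_consumption / total_consumption stand in for your observed data (here, the arrays simulated above).

import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model() as combined_model:
    # per-video parameters (these replace true_intercepts / true_slopes)
    intercepts = pm.Normal("intercepts", mu=13, sigma=5, shape=NumberOfVideos)
    slopes = pm.Normal("slopes", mu=-0.04, sigma=0.05, shape=NumberOfVideos)
    daily_sigma = pm.HalfNormal("daily_sigma", 1)

    # Model 1: same broadcasting as the numpy code, shape (DailyDays, NumberOfVideos)
    mu_log_daily = intercepts[None] + slopes[None] * daily_dates[:, None]
    pm.Normal("log_daily_obs", mu=mu_log_daily, sigma=daily_sigma,
              observed=np.log(daily_consumption))

    # deterministic extrapolation over the full period, one value per video
    log_curve = intercepts[None] + slopes[None] * total_dates[:, None]
    log_total_projection = pm.Deterministic(
        "log_total_projection", pt.log(pt.sum(pt.exp(log_curve), axis=0)))

    # Model 2: regression of the observed log totals on the projection
    total_intercept = pm.Normal("total_intercept", mu=0, sigma=5)
    total_slope = pm.Normal("total_slope", mu=1, sigma=1)
    total_sigma = pm.HalfNormal("total_sigma", 1)
    pm.Normal("log_total_obs", mu=total_intercept + total_slope * log_total_projection,
              sigma=total_sigma, observed=np.log(total_consumption))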

If you have your heart set on using long-form data for the first model, you can do that, then reshape projection back to wide form to take the sums across columns. In that case you need to be careful to sort your data by video first and time second, so the values are a vector x with indices: [(video 1, time 1), (video 1, time 2) ... (video 1, time T), (video 2, time 1) ...]. This way, doing x.reshape((-1, T)) will end up with shape (n_videos, T) – videos on the rows and times on the columns, setting up your row-wise totals.
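A minimal sketch of that idea, where x is a placeholder for whatever per-row quantity you build inside the model:

# sort the long-form data once, before building the model
Long = Long.sort_values(['metadata_title', 'time']).reset_index(drop=True)

# inside the model: x has one entry per long-form row, in the sorted order
x_wide = x.reshape((-1, DailyDays))   # shape (n_videos, DailyDays)
row_totals = x_wide.sum(axis=1)       # one total per video
# make sure the observed totals are in the same video order as the sorted rows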