Hello there,
Excuse me if this wasn’t posted under the right forum/tag.
I am currently using the PyMC built-in Metropolis-Hastings algorithm to estimate parameters of a given exponential model.
I would like to use data from multiple datasets to compute the likelihood of a given parameter set. In the tutorial this is shown for a single dataset as follows:
Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)
My question now is how to combine this for multiple, independent datasets, especially when they have different numbers of rows. Does anyone have a clue?
You'll need to provide some more details; your post is too vague about exactly what you need. Do you want the same parameters for both datasets, or different ones? Do the datasets have the same features? What have you tried?
Thank you for your reply! 
I am indeed trying to infer the same set of parameters for all the datasets, but want to include all of their likelihoods. The datasets have the same features.
I have read elsewhere that by adding multiple likelihood lines of code, the combined likelihood gets computed behind the scenes. So I have tried the following:
y_like2 = pm.Normal("y_like2", mu=mu, sigma=sigma, observed=data2)
y_like3 = pm.Normal("y_like3", mu=mu, sigma=sigma, observed=data3)
However, this raises an error saying that the input shapes (of data2 and data3) are uneven. This is true, because the observed datasets have different numbers of records.
My question now is how to make it work despite the uneven shapes. It isn't uncommon for datasets modeling the same thing to have different numbers of records, right?
I guess it's because mu has the same shape as the data (it comes from a regression or something?)
Is there any reason why you can’t just concatenate the datasets together? Otherwise you will need to compute a separate mu for each observed outcome distribution.
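For concreteness, here is a minimal sketch of the concatenation idea (the priors are placeholders, and I am assuming data2 and data3 are 1-d NumPy arrays):

import numpy as np
import pymc as pm

# Stack the two observed arrays into one
data_all = np.concatenate([data2, data3])

with pm.Model():
    # placeholder priors -- substitute your actual model here
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)
    # a single likelihood over the combined data
    y_like = pm.Normal("y_like", mu=mu, sigma=sigma, observed=data_all)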
The mu comes from the following:
# Priors for parameters
alpha = pm.HalfNormal("alpha", sigma=2) # initial capacity > 0
b = pm.HalfNormal("b", sigma=1) # decay rate > 0
sigma = pm.HalfNormal("sigma", sigma=0.1) # observation noise
# degradation model
mu = alpha * pm.math.exp(-b * t)
And the datasets are separate time series, which is why I wanted to include them separately… Do you think adding an extra column to uniquely identify the datasets might work if I concatenate them?
Where does t come from? I assume it's just a column of feature values that happens to represent time. In this case, you can indeed just concatenate everything, because all the rows of your data are conditionally independent once you know the value of t for that row. You don't have a recursive time series function like x_t = f(x_{t-1}), so there's no need to handle time in a special way.
You can shuffle the rows of a single dataset and check that you get the same answer to convince yourself this is the case.
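Concretely, something like this should work (a sketch; I am assuming t2/t3 are the time columns of your two datasets and data2/data3 the corresponding observations, all 1-d NumPy arrays):

import numpy as np
import pymc as pm

# Concatenate times and observations in the same order, so row i of
# t_all still lines up with row i of data_all
t_all = np.concatenate([t2, t3])
data_all = np.concatenate([data2, data3])

# Sanity check: shuffling the rows (keeping t and y paired) should
# give the same posterior, since rows are conditionally independent
# given t
perm = np.random.permutation(len(t_all))
t_all, data_all = t_all[perm], data_all[perm]

with pm.Model():
    alpha = pm.HalfNormal("alpha", sigma=2)    # initial capacity > 0
    b = pm.HalfNormal("b", sigma=1)            # decay rate > 0
    sigma = pm.HalfNormal("sigma", sigma=0.1)  # observation noise

    # mu now has one entry per row of the combined data
    mu = alpha * pm.math.exp(-b * t_all)
    y_like = pm.Normal("y_like", mu=mu, sigma=sigma, observed=data_all)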
In this case, does it not matter if, for example, t = 10 appears in multiple records? I am worried the data will look like it represents some form of periodicity.
Ah sorry, to clarify: I am not querying; I meant that at a given time step you will have values from both datasets, and I was wondering whether that is problematic.
You can add a column to identify which dataset the row is from.
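Something like this, for example (a sketch assuming pandas DataFrames df2 and df3 with columns t and y; the names are made up):

import pandas as pd

# Tag each row with the dataset it came from, then concatenate;
# the model only uses t and y, the dataset column is just bookkeeping
df2["dataset"] = 2
df3["dataset"] = 3
df_all = pd.concat([df2, df3], ignore_index=True)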
Dear Jesse,
Excuse me for my late reply; some urgent personal matters this week took up all my attention.
And thank you for your swift response! This perfectly resolved my issue, and I can now continue with my thesis.
Best regards!