I have 10 datasets (let's say each contains measurements of an X and a Y value, and each dataset has a different length).
I want to use PyMC to infer the parameters of a model. Let's assume I do a simple linear regression as in the GLM: linear regression notebook (glm-linear).
The envisioned result is one estimation of the slope and one estimation of the intercept based on the model and 10 datasets.
But fitting the model to each dataset separately gives me a list of 10 "idatas" with 10 estimates, and thus 10 slopes and 10 intercepts.
What methodology could I use to combine the results into a single best estimate of the parameters?
Side note: I was looking into the named dimensions with data containers section (named-dimensions-with-data-containers) to see if I could adapt that approach to fit my problem into this format, but my datasets have different sizes and no overlapping X values.
pm.set_data is not what you want (as you already found). If you just have tabular data, you should just stack it all up and run a single model. If you don’t, I’d need more information to recommend an approach.
Thanks for your response.
This is what I already thought the solution would be.
About "stacking it all up": is this the right way to go about it?
Assume the model for which I do a parameter fit is nonlinear and looks like this: Y = some_non_linear_model(X, param_a, param_b)
and the code to set up the PyMC model looks something like this:
model = pm.Model()
with model:
    param_a = pm.Normal("param_a", mu=1, sigma=2)
    param_b = pm.Normal("param_b", mu=1, sigma=2)
    # Expected value of outcome
    mu = some_non_linear_model(x_obs, param_a, param_b)
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=0.5, observed=y_obs)
I can get this to work with a single x_obs and a list of y's: y_obs = [y_1, ..., y_10].
But I am not sure how to handle x_obs (being a list of x_1 up to x_10, all of different lengths), and thus mu, to work with multiple observations.
You are going to have to be more concrete about the shapes involved in your problem. If you have two datasets (X_1, y_1) and (X_2, y_2) with shapes (n_1 \times k), (n_1,) and (n_2 \times k), (n_2,), then you simply form X = np.concatenate([X_1, X_2], axis=0), y = np.concatenate([y_1, y_2], axis=0), and then mu = f(X, a, b) has shape (n_1 + n_2,).
You can also just keep the logic separate and add 10 observed variables without trying to concatenate or stack them. If they use the same unobserved variables, the information will flow the same way and you get the posterior corresponding to the combined evidence.
That is actually the easiest way to make this work. And with only about 10 datasets it is not much effort either.
Thanks for your help
This is how I interpreted your last suggestion:
# Expected value of outcome
mu1 = some_non_linear_model(X_obs[0], a, b)
mu2 = some_non_linear_model(X_obs[1], a, b)
# Likelihood (sampling distribution) of observations
Y_obs = pm.Normal("Y_obs", mu=mu1, sigma=1, observed=observed_data[0])
Y_obs2 = pm.Normal("Y_obs2", mu=mu2, sigma=1, observed=observed_data[1])