Predictions without controlled variables

My greetings to the community,

I’m developing a softmax regression model using PyMC v5.9.0.
My training data are as follows:

xObservedScaled: 124 observations x 102 features
Nclasses: 3

Of the 102 features, 2 (namely age and bmi) were included only so they could be controlled for.
The model I used was the following:

import pymc as pm
import pytensor.tensor as pt

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sigma=1, shape=Nclasses)
    beta = pm.Normal('beta', mu=0, sigma=0.5, shape=(Nfeatures,Nclasses))
    X = pm.MutableData("X", xObservedScaled)
    mu = alpha +, beta)
    theta = pm.Deterministic('theta', pt.special.softmax(mu, axis=1))
    yhat = pm.Categorical('yhat', p=theta, observed=yObserved)
    idata = pm.sample(2000)

Based on the documentation, in order to make predictions on new data, I should use pm.set_data(), as in the following code:

with model:
    pm.set_data({"X": xNewScaled})
    predictions = pm.sample_posterior_predictive(idata, model=model, predictions=True, var_names=['theta'])

However, for the above code to work, xNewScaled must have the same number of features (102) as xObservedScaled. In my case, I want to make predictions without the controlled variables (age and bmi). How can I achieve this?

Thank you in advance for any help.

Searching in the documentation, I found that I need to use a coords object along with the dims keyword. So, below is my new version:

coords = {
        'features': xObservedScaled.columns.tolist(),
        'observations': xObservedScaled.index.tolist()
}

with pm.Model() as model:
    for k in coords.keys():
        model.add_coord(k, coords[k], mutable=True)
    alpha = pm.Normal('alpha', mu=0, sigma=1, shape=Nclasses)
    beta = pm.Normal('beta', mu=0, sigma=0.5, shape=(Nfeatures,Nclasses), dims=('features', 'classes'))
    X = pm.MutableData("X", xObservedScaled, dims=('observations', 'features'))
    mu = alpha +, beta)
    theta = pm.Deterministic('theta', pt.special.softmax(mu, axis=1))
    yhat = pm.Categorical('yhat', p=theta, observed=yObserved, dims=('observations'))
    idata = pm.sample(2000)

Then, with the new data:

new_coords = {
        'features': xNewScaled.columns.tolist(),
        'observations': xNewScaled.index.tolist()
}

with model:
    for k in new_coords.keys():
        model.set_dim(k, len(new_coords[k]), new_coords[k])
    pm.set_data({"X": xNewScaled.values}, coords=new_coords)
    predictions = pm.sample_posterior_predictive(idata, model=model, predictions=True)

But I get the error:

ValueError: Shape mismatch: x has 100 cols (and 63 rows) but y has 102 rows (and 3 cols)
Apply node that caused the error: Dot22(X, beta)
Toposort index: 0
Inputs types: [TensorType(float64, shape=(None, None)), TensorType(float64, shape=(102, 3))]
Inputs shapes: [(63, 100), (102, 3)]
Inputs strides: [(8, 504), (24, 8)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Add(ExpandDims{axis=0}.0, Dot22.0)]]

So, apparently the issue is with, beta). How can I make beta mutable?
Thank you in advance.

You don’t want to make beta mutable; otherwise you wouldn’t know what values it should have when doing predictions.

Maybe you can say more about what exactly you mean by this?

I’m trying to model biological data. I have a set of biological features (100 predictors) and 3 outcomes. These biological features are greatly influenced by confounding factors such as age and bmi. So, I have included these two variables in my model in order to control for them and shrink the coefficients of the predictors that I’m interested in. But after doing that, I want to predict the probability of an outcome using only the coefficients of the 100 predictors. One solution would be to regress out age and bmi from the 100 predictors before training, but I don’t much like that approach.

So if I handed you the true values of all 100 coefficients (or somehow built the true values into the model) and a new vector of 100 predictor values, what do you want/expect the model to produce by way of a prediction?

[edit: and by “true” values I mean the true values you are currently trying to estimate with your 102-predictor model.]

Given a vector of true values for each coefficient:

betasTrue = [b1, b2, b3, ..., b100]

and a new vector of values for the 100 predictors:

xNewScaled = [x1, x2, x3, ..., x100]

I’m expecting to be able to predict the probability of an outcome for each observation:

yhat_obs_i = [p(outcome1), p(outcome2), p(outcome3)]

The above expectation comes with the assumption that each yhat_obs_i is not affected by age and bmi, because they were initially included in the model as x101 and x102, were associated with the betas b101 and b102, and helped recover the betasTrue for the 100 predictors of interest.
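In code, that expectation would look something like the sketch below (plain NumPy, with randomly generated stand-ins for betasTrue, the intercepts, and xNewScaled; in practice these would come from the fitted model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the quantities described above.
n_obs, n_feat, n_classes = 5, 100, 3
alphaTrue = rng.normal(0, 1, size=n_classes)
betasTrue = rng.normal(0, 0.5, size=(n_feat, n_classes))
xNewScaled = rng.normal(0, 1, size=(n_obs, n_feat))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=-1, keepdims=True)

mu = alphaTrue + xNewScaled @ betasTrue  # (n_obs, n_classes) linear predictor
yhat = softmax(mu)                       # each row: [p(outcome1), p(outcome2), p(outcome3)]
```

Each row of yhat sums to 1, matching the yhat_obs_i described above.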

I think I found a possible solution. Instead of using only 100 predictors with xNewScaled, I could use all 102 features. However, in xNewScaled the features age and bmi will be fixed to a given value, for instance zero, for all observations. This way, age and bmi are treated as constants, leaving only the contribution of the 100 predictors to the prediction.
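A minimal sketch of that padding step (assuming xNewScaled is a pandas DataFrame holding only the 100 predictors, and that the model was trained with age and bmi as the last two columns; the feature names here are placeholders):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical new data containing only the 100 predictors of interest.
predictor_cols = [f"f{i}" for i in range(100)]  # placeholder feature names
xNewScaled = pd.DataFrame(rng.normal(size=(4, 100)), columns=predictor_cols)

# Pad with zero columns for the controlled variables so the matrix matches
# the 102-feature layout the model was trained on.  Since the training data
# were standardized, zero corresponds to the mean age and mean bmi.
xNewPadded = xNewScaled.copy()
xNewPadded["age"] = 0.0
xNewPadded["bmi"] = 0.0

# xNewPadded.values (shape (4, 102)) can then be handed to pm.set_data().
```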


That is true. The 100 other coefficients should be minimally “contaminated” by the associations between your target variable and age/BMI. But that does not imply that the remaining coefficients can be used, in the absence of the age and BMI coefficients (and predictors), to generate plausible estimates of y. You can work through a simple 2-predictor scenario to experiment. If I give you the height and width of a rectangle in cm and ask you to predict the area in square kilometers, you will be able to do so perfectly (i.e., you expect the estimates to be close to the “true” values). Now I give you a width and ask you to predict the area without the “control” variable of height. Do you just omit the height and its coefficient from the model and carry on?

The strategy you suggest in your next post would be to set the height to zero, which would imply that you now always predict an area of exactly zero, which doesn’t seem like what you want. Likewise, setting all ages and BMIs to zero will give you accurate predictions if all of the observations in your test set are individuals aged zero years with a BMI of zero.

You have some fair points, so let me explain.

In this example, no, I cannot omit height. However, in this example width and height have a deterministic relationship with the area: if I know width and height, then I know the rectangle’s area. In the case of my data, the relationship of age and bmi with the 100 predictors is not deterministic. I can perfectly well have two individuals with the same age and bmi and very similar values across the 100 features and still observe different outcomes. You may argue that in this case there are more confounding factors. I agree… if only they could be identified.

Keep in mind that my input data are scaled, so zero represents the mean. Additionally, my approach resembles the way we interpret coefficients: the effect of a predictor is evaluated while keeping the rest fixed. Is it ideal? Definitely not! To explain the rationale behind this approach, I’ll represent the model as follows:

let’s group the predictors in two components:

componentA = b1*x1 + b2*x2 + b3*x3 + ... + b100*x100
componentB = b101*Age + b102*BMI

so the prediction would be:

y = componentA + componentB

In biological data, in the absence of a pathological condition, the inter-individual variability of biomolecular traits (i.e., the 100 features I used) is very small. That makes componentA appear almost constant and lets the coefficients of componentB dominate. In this kind of situation the predictions will be driven mainly by componentB. In order to evaluate the contribution of componentA to the prediction, I’ll have to somehow remove the contribution of componentB.
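This dominance is easy to reproduce numerically. Below is a toy sketch of one class’s linear predictor, with made-up coefficient scales chosen so that the 100 predictor effects are tiny while the age/bmi effects are large:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs = 50

# Made-up coefficients: 100 tiny predictor effects vs. two large confounder effects.
beta_pred = rng.normal(0, 0.02, size=100)
beta_age, beta_bmi = 1.5, 1.0

x = rng.normal(size=(n_obs, 100))  # scaled predictors
age = rng.normal(size=n_obs)       # scaled age
bmi = rng.normal(size=n_obs)       # scaled bmi

componentA = x @ beta_pred
componentB = beta_age * age + beta_bmi * bmi
mu = componentA + componentB

# componentA barely varies across individuals, so mu is driven by componentB.
print(componentA.std(), componentB.std())
```

With these (assumed) scales, the spread of componentB across individuals is an order of magnitude larger than that of componentA, so the ranking of predictions is essentially determined by age and bmi.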

Then comes the problem of sample size. As I mentioned above, I can have different outcomes across individuals with similar values across all 102 features. This indicates the presence of confounding factors that shape the relationship of the prediction with componentA and componentB. Therefore, having a couple of individuals with exactly the same age and bmi is not enough to capture this relationship. In that case, I would need a given number (N) of individuals for each age and each bmi. So, if I set N=10 (which I consider very small), then for a range of ages between 30-40 and a range of bmi between 20-30 I would need 1000 individuals. And that is if I treat bmi as a discrete variable.

Bottom line: in this kind of scenario, age and bmi do more harm than good to the prediction.

That’s not required for the example to illustrate that you can’t simply omit parts of the model after estimating parameters and hope to get something sensible. Standardizing your data helps only because it makes it easier to “impute” the missing predictor variables and their role in the to-be-predicted values. But it seems risky to interpret this as making predictions “without the controlled variables”. You are definitely still incorporating the control variables; their inclusion is now just implicit.

This is true, but you are not trying to interpret your model parameters here. You are trying to predict using those model parameters. The rectangle example shows how the interpretation can be correct but the predictions can still be implausible and incorrect.

So, to answer your original question: if you wish to make predictions, you need to provide values for the control variables. If you assume that all to-be-predicted cases are of average age and BMI, then you need to (explicitly) set those predictor values. Note that doing so will make your predictions far more certain than they probably should be (e.g., you should probably be rather uncertain about the true age and BMI values of new observations).
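One rough way to avoid that over-certainty: instead of pinning the standardized age/bmi at exactly 0, draw them from a distribution (here N(0, 1), a guess at their standardized spread) and average the resulting class probabilities. The sketch below uses made-up stand-in parameters rather than the actual posterior:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=-1, keepdims=True)

n_feat, n_classes, n_draws = 100, 3, 2000

# Hypothetical stand-ins for posterior summaries of alpha and beta
# (the last two rows of beta play the role of the age and bmi coefficients).
alpha = rng.normal(0, 1, size=n_classes)
beta = rng.normal(0, 0.5, size=(n_feat + 2, n_classes))
x_new = rng.normal(size=n_feat)  # one new observation, 100 predictors only

# Treat the unknown standardized age/bmi as uncertain rather than exactly 0.
age_bmi = rng.normal(0, 1, size=(n_draws, 2))
x_full = np.concatenate([np.tile(x_new, (n_draws, 1)), age_bmi], axis=1)

probs = softmax(alpha + x_full @ beta)  # (n_draws, n_classes)
p_mean = probs.mean(axis=0)             # class probabilities, marginalized over age/bmi
```

The spread of probs across draws then reflects how much the unknown age/BMI could move the prediction, rather than silently assuming both are exactly average.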

Thank you for your answers and your time; I really enjoyed this conversation. As a final note, I would like to clear up a miscommunication on my side that I think created confusion. The approach I’m trying to implement does not aim to produce a model to be applied in clinical practice. Instead, it’s an approach for isolating components and studying their contribution, while trying to control for confounding factors with potentially dominating behavior. It’s definitely not a perfect approach, but it can yield some useful insights.
