How to add weights to data in bayesian linear regression

I have implemented a bayesian model like this:

 #Bayesian Linear Regression
    with pm.Model() as predictive_model:
        x = pm.Data("x",xtrain.values)
        y = pm.Data("y",ytrain.values)
        a = pm.Normal('slope', mu=0, sd = 16)
        b = pm.Normal('intercept',mu= 0, sd = 16)
        s = pm.HalfNormal('error',sd =1)
        y_pred = a*x+b

        obs = pm.Normal('observation', mu = y_pred, sd=s, observed=y)

        start = pm.find_MAP()
        step = pm.NUTS(scaling=start)
        trace = pm.sample(1000, step, start=start, progressbar=True)

Now, I need to add weights associated with each data point, weights are based on recency of data(recent data is given more weights and this weight is already available to me as an array). how can I assign them?

Welcome!

There are various ways to incorporate weights into your model. Many will write custom logp functions (e.g., here and here). But I personally prefer to adjust the scale parameter of my observed variable, which allows for more heavily weighted observations to contribute more model’s logp. It’s simple and relatively transparent. Here’s a simple example:

s = pm.HalfNormal('error', sd = 1)
scaled_s = pm.Deterministic('scaled_s', s / y_weights)
obs = pm.Normal('observation',
                mu = y_pred,
                sd=scaled_s,
                observed=y)

In addition, I would strongly suggest using the defaults when calling pm.sample() until you are confident that they don’t work for you. I would particularly suggest avoiding the use of pm.find_MAP()

# nice and simple
trace = pm.sample(1000)
5 Likes

@cluhmann Thanks. I added the weights but I am getting an error while running the sample_posterior_predicive on test set

with predictive_model:
     32         pm.set_data(dict(x=xtest.values,y=ytest.values))
---> 33         posterior = pm.sample_posterior_predictive(trace)
.
.
ValueError: Cannot broadcast provided shapes (7,), (73,), () given size: ()

This issue doesn’t occur when I don’t use weights.
my train data set has 73 data points with 1 feature and test set has 7 data points.

Yes, adding the weights integrates the data more deeply into the model and makes it harder to just swap out data sets. In these cases, I tend to use the samples in the trace to generate whatever I need “by hand”. Something along these lines:

import matplotlib.pyplot as plt
n_posterior_samples = 100
for _ in range(n_posterior_samples):
    sample = np.random.choice(trace)
    mean_hat = sample['intercept'] + (sample['slope'] * new_x)
    obs_hat = np.random.normal(loc=mean_hat,
                               scale=sample['error'] / new_weights)
    plt.plot(new_x, obs_hat)
plt.show()
3 Likes

Hi, I was actually going through a similar problem, and in my case the accepted solution will be very inefficient since I need to also update the model in an online manner. Instead, I found a better way to do the same by specifying the weights in a data container, and during inference with new data, simply set the data container value with test dataset :

s = pm.HalfNormal('error', sd = 1)
weights = pm.Data("weights", y_weights) 
scaled_s = s/weights
obs = pm.Normal('observation',
                mu = y_pred,
                sd=scaled_s,
                observed=y)

After training through this, during inference, when you set your test data, simple update the data container by

pm.set_data("weights", y_weights_test)

You should be able to run it now.
The problem with pm.Deterministic() call is that it makes each weight a parameter and therefore its size become fixed.

Let me know if you have any further questions, or any feedback for this trick.

5 Likes

Question with how much impact scaling a parameter has on the model.

If s is the learnable parameter and we set sd=scaled_s, then wouldn’t s be skewed towards a smaller value anyway? If yes, then wouldn’t that reduce the impact scaling has?

It’s not clear what you mean by this. Skewed in what way?

Sorry if I was vague about this. Let me give an example.

Let’s say you have 100 response data \boldsymbol{y} that are from two different source. One from simulation y_{sim}, and another using real life experiments y_{exp}. The size of the response data from each source is \text{dim}(y_{sim}) = 95, and \text{dim}(y_{exp}) = 5. And you also want to lower weight (reduce the importance) of the simulation data, and increase the weight for the real experiments.

The standard deviation on the simulation data is very small, s_{sim} = 0.01, compared to the real experiments, s_{exp} = 0.30.

If we have two models. The first model doesn’t weight the data, and the second model does. And let’s say the weights for simulation data is 0.05, and real experiment data is 1. Below are code chunks below for the two models.

# model 1
s1 = pm.HalfNormal('error', sd = 1)
obs = pm.Normal('observation',
                mu = y_pred,
                sd=s1,
                observed=y)
# model 2
s2 = pm.HalfNormal('error', sd = 1)
weights = pm.Data("weights", y_weights) # w_sim = 0.05, w_exp = 1
scaled_s = s2/weights
obs = pm.Normal('observation',
                mu = y_pred,
                sd=scaled_s,
                observed=y)

Wouldn’t the s2 posterior be much much smaller than s1 because majority of the data in \boldsymbol{y} has a small standard deviation? If s2 posterior will be much smaller than s1, then scaled_s will be similar to s1. Therefore, scaling the data isn’t that effective.

Any thoughts on this?

Yes, that all seems correct to me. The value of scaled_s is likely to be similar to s1 on average. But the weighting scheme isn’t intended to do anything on average, it’s intended to weigh some observations more than others. So it’s effect is relative, not absolute. Maybe it helps to reformulate your model 1 like this:

# model 1
s1 = pm.HalfNormal('error', sd = 1)
weights = pm.Data("weights", np.ones(size=N))
scaled_s = s1/weights
obs = pm.Normal('observation',
                mu = y_pred,
                sd=scaled_s,
                observed=y)

This is identical to your model 2, but with equal weights applied to all observations which is the weighting implied by your original model 1. The estimates of s1 should be identical to your original model 1.

Now we can do the following:

# model 1
s1 = pm.HalfNormal('error', sd = 1)
weights = pm.Data("weights", 100*np.ones(size=N))
scaled_s = s1/weights
obs = pm.Normal('observation',
                mu = y_pred,
                sd=scaled_s,
                observed=y)

Here, we have the same, uniform weighting scheme, but we associate every observation with a weight of 100, instead of the previous weight of 1. The posterior of s1 will likely suggest larger values in order to compensate for the extremely large weights we have applied. But things should mostly balance out.

But this is very different than the following, which is basically the idea suggested in the original answer:

# model 1
s1 = pm.HalfNormal('error', sd = 1)
weights = pm.Data("weights", [100,100,1,1,100,1,100,1,1,1,100])
scaled_s = s1/weights
obs = pm.Normal('observation',
                mu = y_pred,
                sd=scaled_s,
                observed=y)

Here, the weights vary across observations. The sampled values of s1 will try to balance the variance in the observed data (y), but must accommodate the fact that some weights are large and some are small.

Does that help?

2 Likes

Thanks @cluhmann. If I understood your example above correctly, we would use weight of 100 for
the more interesting/important observed data like y_{exp} and the smaller weight of 1 for y_{sim}? The larger weight decreases scaled_s for the interesting/important observed data like y_{exp}, allowing these observed data to have more influence on the posterior. And the smaller weight increases scaled_s for y_{sim}. The s1 parameter may be somewhere between very large and small value to balance the overall variance of \boldsymbol{y}. Is my understanding here correct?

1 Like

Exactly!

1 Like

Thanks for this super helpful thread @mcao and @cluhmann! I used this weighting “trick” in a hierarchical model and I noticed after adding in the weighting to the sigma, the varying intercept effects significantly changed in scale.

Is there a change in interpretation when adding weights for varying intercepts?

1 Like

Can you say more about what you mean by this? Do you simply have a model with varying intercepts and that you have used the weighting scheme described above with this model? Or do you mean that you added “for varying intercepts” in some more specific way?

Sorry for the unclear description! The former! The model has varying intercepts (or hierarchical effects or whatever nomenclature that you’d use). Here’s a simplified, high level view of the model:

with pm.Model(coords=COORDS) as model:
    #global intercept
    intercept = pm.Normal('intercept', mu = 0, sigma = .15)
    
    # hyper-prior varying intercept
    indv_mean = pm.Normal("indv_mean", mu=0, sigma= .05)
    indv_sigma = pm.HalfNormal("indv_sigma", sigma = .15)

    #indv idx
    indv_idx = pm.ConstantData("indv_idx", indv_codes, dims="obs")
    
    #non-centered param
    indv_offset = pm.Normal('indv_offset', mu=0, sigma=1, dims = "obs") 
    indv= pm.Deterministic("imdv", indv_mean + (indv_offset * indv_sigma ))

    indv_eff = indv[indv_idx]

    #Covs
    covariates = pm.ConstantData('covariates', df.covariates, dims = "obs")
    b = pm.Normal("b", mu=0, sigma=.05)
    
    weights = pm.ConstantData('weights', df.weight.values, dims = "obs")

    # Model error:
    sigma = pm.HalfNormal("sigma", 20)
    
    scaled_sigma = pm.Deterministic('scaled_sigma', sigma / weights)
    
    mu = pm.Deterministic("mu", 
    intercept + (b * covariates) + indv_effect
    )
    y = pm.Normal("y", mu, sigma=scaled_sigma, observed= df.observed )

I don’t see any reason why the scaling “trick” would work in a fundementally different way once you incorporate varying intercepts. Of course, changing one part of the model will always affect the estimates of other parameters, but knowing exactly what consequences to expect is very much dependent on the specifics of your model and your data. I might suggest taking a subset of your data (e.g., a subset small enough to visualize individual data points) and running “weighted” and unweighted models so that you can better investigate how the intercepts and the weights might be related.

Yep, after digging around I think it all makes sense. Thanks @cluhmann!

1 Like

@cluhmann I love the trick you presented here and always adopt this approach for my problems. Do you have any references for this strategy of weighting?

Not entirely sure where I picked it up. It’s basically a Bayesian version of weighted regression. You can see some discussion on the weighted least squares Wikipedia page or Gelman & Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (e.g., chapter 18). But I’m sure it’s covered elsewhere!

3 Likes