How to set different priors for different features?

I have the following model for a logistic regression, where the coefficients of all predictors get a ~N(0,1) prior:

with pm.Model(coords={"predictors":X_binary.columns.values}) as model_1:
    X = pm.MutableData('X', X_train)
    y = pm.MutableData('y', y_train)

    constant = pm.Normal('constant', mu=-0.5, sigma=0.1)

    beta = pm.Normal('beta', mu=0, sigma=1, dims="predictors")

    score = pm.Deterministic('score', X@beta)

Now, for some given predictors, I would like to give their coefficients a ~N(1,0.5) prior instead. I tried the following:

list_of_bad = ['remove-update', 'invoke-ssidexfil','-payload']
list_of_normal = [pred for pred in X_binary.columns.values 
                  if pred not in list_of_bad]

with pm.Model(coords={"normal_pred":list_of_normal, "bad_pred":list_of_bad}) as model_2:
    X = pm.MutableData('X', X_train)
    y = pm.MutableData('y', y_train)

    constant = pm.Normal('constant', mu=-0.5, sigma=0.1)
    
    beta_normal = pm.Normal('beta_normal', mu=0, sigma=1, dims='normal_pred')
    
    beta_bad = pm.Normal('beta_bad', mu=1, sigma=0.5, dims='bad_pred')

    beta = pm.math.concatenate([beta_normal,beta_bad])

    score = pm.Deterministic('score', X@beta)

But that didn’t seem to work; I get a wacky result. Could it be because concatenating changes the order of my betas so that they no longer coincide with the order of the columns in X? I’m not sure…

Thanks in advance!

You can give a vector of parameters to independent multidimensional distributions to set different values for each component:

beta = pm.Normal('beta', mu=[0, 1], sigma=[0.1, 1], dims="predictors")

You can double check things like this using pm.draw to get a bunch of samples from a declared random variable, then check the summary statistics match what you’ve set:

pm.draw(beta, 1000).mean(axis=0)
>>> array([-1.38747093e-04,  1.00139075e+00])

pm.draw(beta, 1000).std(axis=0)
>>> array([0.10131437, 0.99300722])

Your concatenation method can also be checked this way, if you’re worried about the order of the betas. I think that approach should work fine; what is wacky about your result?
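For example, something like this (untested, reusing beta and list_of_bad from your model_2) should show whether the last 3 entries of the concatenated beta are the N(1, 0.5) ones:

prior_draws = pm.draw(beta, draws=1000)
prior_draws.mean(axis=0)[-len(list_of_bad):]  # should be close to 1
prior_draws.std(axis=0)[-len(list_of_bad):]   # should be close to 0.5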


Thank you for your answer. But this solution doesn’t seem to work… my shapes stop matching.

Maybe I wasn’t clear on my question. Let me try and clarify it:

Consider my X to have these columns (features/predictors):

beta = ['-windowstyle', 'hidden', '-executionpolicy', 'bypass', 'powershellscript', '-nologo', 'remove-update', 'invoke-ssidexfil', '-payload']

I have given them all a ~N(0,1) prior, then computed score = X@beta and passed that through a sigmoid to get the probability that a sample belongs to class 1.

However, consider that I have gotten domain information that ['remove-update', 'invoke-ssidexfil', '-payload'] are features that increase the probability of a sample belonging to class 1; then I would like to give their coefficients a ~N(1,0.5) prior, to increase how much they push the final probability towards 1.

Might be relevant to know that the beta.shape = (721,) and I have around 400 samples.
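For completeness, the part of model_1 after score looks roughly like this (simplified; in my real model I also add some noise to the score):

    p = pm.Deterministic('p', pm.math.sigmoid(constant + score))
    obs = pm.Bernoulli('obs', p=p, observed=y)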

EDIT:

I checked, and indeed the concatenation just appends these 3 betas at the end, which causes the order of the columns of X to no longer match the order of beta.

I then fixed this by forcing the last 3 columns of X to be ['remove-update', 'invoke-ssidexfil', '-payload'] so that beta and X line up again. I tested it with all betas ~N(0,1) and that seems to have fixed the issue…

…however, the outcome is a little different from model_1, whereas I would expect them to be the same. Any ideas why?

  1. Does the order of my features in X or beta influence the result? I would imagine that it doesn’t…
  2. Every time I sample my model, even if I do not change anything, might it yield slightly different results because of how MCMC works?
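For reference, the column reordering I mentioned above looks roughly like this:

ordered_cols = list_of_normal + list_of_bad  # 'normal' features first, the 3 'bad' ones last
X_train = X_train[ordered_cols]              # columns now match concatenate([beta_normal, beta_bad])
X_test = X_test[ordered_cols]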

In my first suggestion I tacitly assumed you had 2 features. Here are two suggestions for how to proceed (all code is untested; a quick prior sanity check is sketched after both options):

  1. Give arrays of length k (the number of features) to the beta constructor
vars_to_set = [...]  # list of column names whose priors are not N(0, 0.1)
set_mask = np.array([x in vars_to_set for x in data_df.columns])
k = data_df.shape[1]

beta_mu = np.zeros(k)
beta_sigma = np.ones_like(beta_mu) * 0.1
beta_mu[set_mask] = 1
beta_sigma[set_mask] = 1

with pm.Model(coords={'predictors': data_df.columns}):
    beta = pm.Normal('beta', mu=beta_mu, sigma=beta_sigma, dims=['predictors'])
    mu = pm.Deterministic('mu', data_df.values @ beta)
  2. Split data_df into X1 and X2, and make betas for them separately
X1 = data_df.drop(columns=vars_to_set)
X2 = data_df[vars_to_set]

coords = {'features_1': X1.columns, 'features_2': X2.columns}

with pm.Model(coords=coords):
    beta_1 = pm.Normal('beta_1', mu=0, sigma=0.1, dims=['features_1'])
    beta_2 = pm.Normal('beta_2', mu=1, sigma=1.0, dims=['features_2'])

    mu = pm.Deterministic('mu', X1.values @ beta_1 + X2.values @ beta_2)
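Whichever you go with, you can again use pm.draw to confirm the priors ended up where you intended, e.g. for option 1 (also untested, reusing beta and set_mask from above):

prior_draws = pm.draw(beta, draws=2000)
prior_draws.mean(axis=0)[set_mask]   # should be close to 1
prior_draws.mean(axis=0)[~set_mask]  # should be close to 0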

I’d go with option 2 personally, because it’s a bit more organized, but I wanted to show what I was originally thinking of with option 1. Option 2 is also easier to extend with hierarchical elements.

As to your questions:

  1. The order of the betas doesn’t matter as long as the betas all have the same prior. Each beta will be updated to match the data column it’s paired with, so the pairing matters in that sense. If all the betas have the same prior, though, it doesn’t matter which beta is initially paired with which column. Once you make a couple of them different, however, you have to start tracking what goes where.

  2. MCMC is a random algorithm, yeah. You should expect to see small variations in the output. You can increase the number of draws to reduce this. How big are we talking here? 3rd decimal place? 2nd?


In order to test this I ran both models, one with all betas ~N(0,1) and one with the approach you just proposed, implemented as below:

vars_to_set = ["http-backdoor", "invoke-ssidexfil", "add-regbackdoor", 
                  "powershellscript"]

X1_train = X_train.drop(columns=vars_to_set)
X2_train = X_train[vars_to_set]

X1_test = X_test.drop(columns=vars_to_set) # all - vars_to_set
X2_test = X_test[vars_to_set] # vars_to_set

coords = {"normal_pred":X1.columns, "bad_pred":X2.columns}

with pm.Model(coords=coords) as model_2:
    X1 = pm.MutableData('X1', X1_train)
    X2 = pm.MutableData('X2', X2_train)
    y = pm.MutableData('y', y_train)

    constant = pm.Normal('constant', mu=-0.5, sigma=0.1)    
    beta_normal = pm.Normal('beta_normal', mu=0, sigma=1, dims='normal_pred')    
    beta_bad = pm.Normal('beta_bad', mu=0, sigma=1, dims='bad_pred')  # same prior as beta_normal for this comparison
    score = pm.Deterministic('score', X1 @ beta_normal + X2 @ beta_bad)
    noisy_score = pm.Normal('noisy_score', mu=score, sigma=5)
    p = pm.Deterministic('p', pm.math.sigmoid(constant + noisy_score))

    # define likelihood
    observed = pm.Bernoulli('obs', p, observed=y)

    step = pm.NUTS()
    idata = pm.sample(step=step)
    idata_prior = pm.sample_prior_predictive(samples=50)

    with model_2:
        pm.set_data({'X1':X1_test,'X2':X2_test, 'y':np.zeros_like(y_test)})
        y_pred = pm.sample_posterior_predictive(idata)

idata2 = idata.copy()
idata_prior2 = idata_prior.copy()

The differences I got in the probabilities of my 79 test samples range from -0.03 to 0.02, which I guess is what you expected.
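(For reference, I computed the differences roughly like this, where y_pred_1 is the analogous posterior predictive from the all-N(0,1) model:)

p_model_2 = y_pred.posterior_predictive['obs'].mean(dim=('chain', 'draw'))
p_model_1 = y_pred_1.posterior_predictive['obs'].mean(dim=('chain', 'draw'))
diff = p_model_2 - p_model_1  # per test sample, roughly between -0.03 and 0.02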

Thanks for your help! 🙂
