Many Divergences for a simple model

I am trying to cross-validate a model to test its fit and make inferences about the posterior.

The model is based on the idea of how people may think when allocating money between two people with uneven payoffs. They may give the same amount to each, or simply maximize how much is given.
The payoff proportions are my x variables, and the y variable is the percentage of money given to person one (as opposed to person two).

I read all the tutorials and have searched all over for why this model has so many divergences.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

train_x = np.array([ 0.7837751 ,  0.19241506, -0.2124012 , -0.82463659,  0.78311322, 0.39562878, -1.13989023, -0.86822927, -0.60025648,  0.3742629 ])
train_y = np.array([5.63498758e-01, 2.88627774e-01, 2.92535202e-01, 3.35866828e-01, 6.67700455e-01, 9.99999856e-01, 1.45195098e-07, 1.34094597e-07, 1.02634229e-07, 9.99999885e-01])
hold_out_x = np.array([-1.47333049,  0.19393238, -0.21990887,  0.74694174, -0.34769498, 0.26646929,  0.72834725, -0.47401384, -1.11955062,  0.46028331])
hold_out_y =  np.array([0.21395213, 0.41452217, 0.47663821, 0.79562313, 0.08839646, 0.62057976, 0.765773  , 0.28458744, 0.1346239 , 0.69640536])
shape = len(train_x)
with pm.Model() as model_weighted:
    beta = pm.Normal('sigmoid_beta_param', mu=-.5, sigma=1, shape=shape)
    slopes = pm.Data("slopes", train_x)
    alpha = pm.Uniform('fair_maximizer_param', lower=0, upper=1, shape=shape)
    # sigmoid so the value is between zero and one. alpha acts as tendency to be fair or maximizer. 
    mean_param = pm.Deterministic("sigmoid", alpha*1/(1+tt.exp(beta*slopes)) + (1-alpha)*.5)
    # since output value is between zero and 1, use truncated normal
    x = pm.TruncatedNormal('obs', mu=mean_param, sigma=.1, lower=0.0, upper=1, observed=train_y)
    trace_new = pm.sample(5000, tune=5000)

result: [40000/40000 06:33<00:00 Sampling 4 chains, 19,988 divergences]

Changing the number of chains did not help, nor did changing the values passed to sample(). This also happens when sampling from the posterior predictive.

Hello Bayes,

It’s not entirely clear to me what you are attempting to do. It looks a bit like logistic regression, but not exactly. You can check this example to see if it’s of any use.

To your main question: you are attempting to estimate 20 parameters (10 betas and 10 alphas) from 10 data points. Except in specialized circumstances, this is unlikely to work out well and, I assume, is not what you intend. If I remove the , shape=shape from the declarations of both beta and alpha (and also swap out the truncated normal for a beta likelihood), sampling is faster and has fewer (but not zero) divergences. To comment further, I would need to know more about the intent of the model.
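
Here is a minimal sketch of that pooled variant, assuming the data from your post; the Beta parameterization through a concentration parameter (kappa, a name I am introducing here) is just one common choice:

with pm.Model() as model_pooled:
    beta = pm.Normal('sigmoid_beta_param', mu=-0.5, sigma=1)      # no shape=shape
    alpha = pm.Uniform('fair_maximizer_param', lower=0, upper=1)  # no shape=shape
    mean_param = pm.Deterministic(
        'sigmoid', alpha / (1 + tt.exp(beta * train_x)) + (1 - alpha) * 0.5
    )
    # a Beta likelihood keeps the observations strictly inside (0, 1);
    # kappa (assumed here) controls how tightly observations cluster around the mean
    kappa = pm.HalfNormal('kappa', sigma=10)
    obs = pm.Beta('obs',
                  alpha=mean_param * kappa,
                  beta=(1 - mean_param) * kappa,
                  observed=train_y)
    trace_pooled = pm.sample(2000, tune=2000)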

If I am reading your first post correctly, the observations are proportions (i.e., the proportion of money given to person 1). In that case, I would probably use a Binomial distribution, which takes a probability (such as the sigmoid quantity returned from the logistic function) and a count (i.e., n), which would be the total number of units (dollars, quarters, dimes, pennies) that a user could give to person 1. Note that treating your data in this way requires that you “unconvert” everything from the proportions you currently have to just the numerators of those proportions (and then n is the denominator). So instead of observing 30% of $10 given to person 1, your observation would be that $3 out of a possible $10 were given to person 1 (or, equivalently, 300 cents out of a possible 1,000 cents).

Regarding the use of tt.repeat(), it may work, but it is a rather brittle way of handling subject-specific parameters. For example, if the number of trials per subject were to change, or subjects’ data were intermixed in train_x/train_y (e.g., the first, third, and fourth observations were from one subject and the second, fifth, and sixth were from another), it wouldn’t be obvious that the call to tt.repeat() would need to be altered. I have mocked up a more conventional way of constructing and then referencing parameter arrays. Note that the subject_id list is zero-indexed (the first subject is subject 0) and subjects are numbered consecutively. You can do this manually, but I usually convert subject identifiers into categorical variables via pandas and then use the category codes as my indices into arrays of subject-specific parameters (see the short sketch after the model code below).

import numpy as np
import pymc3 as pm
import theano.tensor as tt

train_x = np.array([ 0.7837751 ,  0.19241506, -0.2124012 , -0.82463659,  0.78311322, 0.39562878, -1.13989023, -0.86822927, -0.60025648,  0.3742629 ])
train_y = np.array([5.63498758e-01, 2.88627774e-01, 2.92535202e-01, 3.35866828e-01, 6.67700455e-01, 9.99999856e-01, 1.45195098e-07, 1.34094597e-07, 1.02634229e-07, 9.99999885e-01])
hold_out_x = np.array([-1.47333049,  0.19393238, -0.21990887,  0.74694174, -0.34769498, 0.26646929,  0.72834725, -0.47401384, -1.11955062,  0.46028331])
hold_out_y =  np.array([0.21395213, 0.41452217, 0.47663821, 0.79562313, 0.08839646, 0.62057976, 0.765773  , 0.28458744, 0.1346239 , 0.69640536])

###########
# new stuff

# assume the first 5 observations are from subject 0
# and the next 5 are from subject 1
subject_id = [0,0,0,0,0,1,1,1,1,1]

# assume that the amount of money potentially given to person 1 is 1000
n_pennies = 1000
# you could instead allow different observations to have different amounts
# in that case you could do something like this:
# n_pennies = [1000, 500, 100, 2000, 300, 100, 250, 700, 350, 1200]

# now discretize the observations so that they are integers (e.g., how many pennies were given to person 1) 
train_y_discrete = (n_pennies * train_y).astype(int)

# end (most of the) new stuff
###########


shape = len(train_x)
# number of distinct subjects (here, 10 observations / 5 per subject = 2)
person_shape = int(shape / 5)

with pm.Model() as model_others:
    # this isn't necessary
    #slopes = pm.Data("slopes", train_x)
    mu = pm.Normal('mu', mu=0, sigma=1)
    sigma = pm.HalfNormal('sigma', sigma=2.0)
    beta = pm.Normal('beta', mu=mu, sigma=sigma, shape=person_shape)

    # here, we use subject_id to select the appropriate beta
    p = pm.Deterministic('p', 1/(1+tt.exp(beta[subject_id]*train_x)))
    # you can also just use pm.math.sigmoid() or pm.math.invlogit()
    # which is simpler for a variety of reasons
    #p = pm.math.sigmoid(beta[subject_id] * train_x)
    #p = pm.math.invlogit(beta[subject_id] * train_x)

    # here we feed p to pm.Binomial() to explain train_y_discrete and n_pennies
    obs = pm.Binomial('obs', p=p, n=n_pennies, observed=train_y_discrete)
    trace_new = pm.sample(2000, tune=1000, target_accept=.85)
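
As an aside, here is a minimal sketch of the pandas trick mentioned above for turning arbitrary subject identifiers into zero-based, consecutive codes; raw_ids is a made-up example, not from your data:

import pandas as pd

# hypothetical raw subject identifiers
raw_ids = pd.Series(["s3", "s3", "s7", "s3", "s7", "s7"])
subject_id = raw_ids.astype("category").cat.codes.values  # array([0, 0, 1, 0, 1, 1])
n_subjects = raw_ids.nunique()                            # 2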

Thank you very much! I will try this out! The thing is that the total amount of money changes. So I guess I should make n_pennies the total number of dollars (this can be a decimal) and observe the amount of money spent on one good?

I am actually not sure this would work. The amount given to a particular good is like a sloped line, so a one-dollar change in good one results in a slope-sized dollar change in the other good. Since this slope changes randomly, I am not sure how n_pennies can be known, unless I make n_pennies something constant. I guess I will set it as the amount given to one specific good.

Once I have this, how can I use the MAP to predict on test data?

I changed the model to use pm.Data so that I can set the data like this:

with model_others:
    pm.set_data({"slopes": test_x_values, "subject_id":subject_id_test, "n_tokens":test_n_tokens})
    post_pred = pm.sample_posterior_predictive(trace_new, samples=10000)
    
import pandas as pd

predicted = pd.DataFrame(post_pred['obs']).mode(axis=0).mean().values / test_n_tokens
actual = test_y_discrete / test_n_tokens

With Bayesian methods, how do people generally assess predictive ability? I know AIC and BIC are used, but do people tend to use R^2 over MSE?

This method seems to be much slower than previous methods and I actually seem to obtain a worse R^2.

Thank you so much!

How you code your outcome variable is up to you, but quite a lot is tied up with how you decide to do so (so it’s not a choice to be taken lightly). When you have an outcome that is a proportion, the binomial is pretty standard (e.g., it’s the basis of logistic and probit regressions), but some people use standard linear regression to tackle proportions (e.g., using a normal distribution for the likelihood). The latter may not have all the desirable properties one might wish (e.g., predictions may fall outside the valid [0,1] range of proportions), but it may reflect the most natural interpretation of a particular application. I have seen linear regression suggested in cases where the proportions cannot be thought of as a set of discrete dichotomous outcomes (e.g., the proportion of some total amount of liquid).

Regardless, if you don’t have a whole lot of confidence about this central aspect of your model, I would suggest backing up and thinking through these issues. They are pretty fundamental to the entire process.

Once I have this, how can I use the MAP to predict on test data?

This one is easy: don’t. Using MAP estimates and modes ignores all that sweet, sweet uncertainty you worked so hard to get a handle on in the first place and is thus pretty unnatural within a Bayesian approach. MAPs, like MLEs, are convenient because dealing with scalars is always easier than dealing with full distributions. But distributions are just what happens when you acknowledge uncertainty. When asking questions of models, expect to get answers in the form of distributions. What is the MSE of my model? The answer will be a distribution. What is the R^2 of my model? The answer will be a distribution. Taking the mean/median/mode of a distribution and pretending it represents an exhaustive summary of the full distribution will likely get you into trouble.
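
To make that concrete, here is a minimal sketch, assuming post_pred['obs'] from the sample_posterior_predictive call above has shape (draws, observations) and that test_n_tokens and hold_out_y from the earlier posts are available:

import numpy as np

# convert predicted counts back to proportions
pred_proportions = post_pred['obs'] / test_n_tokens
# one MSE per posterior predictive draw -> a distribution of MSEs
mse_draws = ((pred_proportions - hold_out_y) ** 2).mean(axis=1)

# summarize the distribution instead of collapsing it to a single number
print(np.percentile(mse_draws, [5, 50, 95]))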

With Bayesian methods, how do people generally assess predictive ability?

This will have many answers. AIC is not generally used, as it quantifies model complexity in ways that aren’t a natural fit for Bayesian models, but there are information criteria that are used in Bayesian contexts (see the citation below). Some people calculate quantities such as MSE. Some use validation (e.g., leave-one-out cross-validation). Here again, I would strongly suggest that you think through what you want before figuring out how to try to get it.
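
For example, a minimal sketch using the information criteria built into PyMC3, assuming the model_others/trace_new pair from the example above:

with model_others:
    print(pm.waic(trace_new))  # widely applicable information criterion (WAIC)
    print(pm.loo(trace_new))   # Pareto-smoothed approximate leave-one-out CV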

In the meantime, you can check out a couple of the presentations from PyMCon, which took place a couple of weeks ago:

Posterior Predictive Sampling in PyMC3 by Luciano Paz

A Tour of Model Checking techniques by Rob Zinkov

For more information about WAIC/AIC/DIC/deviance, I would suggest this paper:
Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.

Thank you for all this information. Another thing I am trying to do is incorporate many features. Does it make sense to use a multivariate Gaussian for this, with priors from another multivariate Gaussian and an LKJ Cholesky covariance?

I implemented this and it takes about 2 hours to run with just 4 individuals with 4 features, while the above example would take approximately 30 minutes. Is it likely I am messing up somewhere, or do these tend to take a long time?

I’ll defer on the multivariate normal question, but in terms of computation time, the bit of script I wrote above took me 5 seconds to generate 6,000 samples (1,000 tuning and 2,000 draws in each of 2 chains). If your sampling seems substantially slower (for that same code), I might suggest running some established, working code (e.g., one of the many examples) and asking a (separate) question about any suspiciously slow performance you are seeing (and include any output and platform/install/version information).

Yes, the example above runs quickly; I am not sure it took 5 seconds, but maybe a couple of minutes.
I am working on a laptop, so perhaps that is why it is slow. I would not think that an MVN would take so long, as I would guess it would not take much longer than 16 normal distributions.

Thanks for all the help!

I am going to unmark this solution, because I am not sure this can be considered a binomial distribution.

A person has a portion of money and allocates the money according to some sort of payoff between two goods.
How can giving 50 of 100 dollars to the y-good be seen as 50 successes out of 100?
So you’re saying giving y_i of y_max tokens is seen as y_i successes with probability p?
Remember, an observation is generated by a single person.

So, I am not sure if that follows the following requirement for a Binomial:

– Each observation falls into one of just two categories: success vs. failure, male vs. female, child vs. adult
– The n observations are all independent
– The probability of success, call it p, is the same for each observation

I appreciate the help a lot, but I think I need some sort of justification for using a binomial.
I don’t think this is a sequence of successful events.

…I am not sure this can be considered a binomial distribution

It is not nearly this simple. There are no right or wrong answers here. There are choices about how to model certain data to answer certain questions and reasonable people can disagree about how those choices should be made.

I appreciate the help a lot, but I think I need some sort of justification for using a binomial.

Indeed. All modeling assumptions (not just the use of the binomial likelihood) need “some sort of justification”.

Whether or not the use of a binomial in this case is appropriate or not is not clear to me. However, as I said before, figuring out how to model your data so that it reflects the way you think about your data is very important.

How can giving 50 of 100 dollars to the y-good be seen as 50 successes out of 100?

Use of a binomial likelihood to model a choice of how to split $100 two ways is conceptualizing that choice as if it were instead 100 choices about whom to give each $1. This is because binomial-distributed variables can be thought of as reflecting a sequence of Bernoulli trials. On trial (flip) #1, I decide to give it to person 1 (heads). On trial (flip) #2, I decide to give it to person 2 (tails). And so on. If each of these choices is guided by a common parameter, p (e.g., the probability of giving the $1 to person 1), then one would expect 100p dollars to be given to person 1.

Now, this is obviously not how the decision is (was) actually made, but that does not mean it’s inappropriate to model the data this way.
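
A quick simulation (assuming nothing beyond numpy itself) shows the equivalence between those 100 single-dollar Bernoulli decisions and one binomial draw:

import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 100  # probability of giving each dollar to person 1; dollars to split

# 100 independent $1 decisions, summed...
person_1_total = rng.binomial(1, p, size=n).sum()

# ...follow the same distribution as a single Binomial(n, p) draw;
# both have expectation n * p = 30 dollars
also_person_1_total = rng.binomial(n, p)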

I don’t think this is a sequence of successful events.

It’s not clear to me what exactly the objection is. Furthermore, it is difficult to evaluate the use of a binomial likelihood without a specific alternative. Asking whether the binomial is “correct” doesn’t mean much if it’s the only viable likelihood function. So if not the binomial, then what?