I am looking at some RNA count data where the counts are modelled with a negative binomial distribution. The problem, however, is that the logp of the model is -inf. I read an earlier thread where another user had the same issue; there the problem was NaN values in the input, but in my case that is not the problem. Essentially, this line is enough to reproduce the issue:
Y = pm.NegativeBinomial("Y", mu=mu, alpha=alpha, observed=Y_OBS)
Regardless of mu and alpha (which can be replaced by constants), Y will have a logp of -inf. If I replace Y_OBS with Y_OBS / 100, however, it works. So I guess the count data is simply too large? The maximum value in Y_OBS, however, is only 80,000, which doesn’t seem too big to me.
Anyway, is there a method for “normalizing” the count data in my case, or for applying some other kind of transformation? I guess simply dividing the count data by a constant won’t work, since Y_OBS is supposed to be discrete.
Here are a couple of questions: would it be possible for you to provide a full example with a few data points (or a randomly created dataset) that blow up the log density? Does a similar model lead to problems if you use a Poisson likelihood instead? Alternatively, if you are willing to rescale or normalize your data, you could treat this as a gamma regression problem.
I tried using a Poisson likelihood instead of the negative binomial one but with the same result.
I am not sure rescaling the data is a valid option. I am essentially trying to implement the DESeq2 model in pymc3, though I must admit my theoretical background is a bit shaky. As is common in RNA-Seq analysis, we have many low counts, and the idea is to fit a robust regression on the counts that avoids the infinite difference (on the log scale) between data points with zero and non-zero counts. But maybe it is still possible with a gamma distribution (which I have no experience with).
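For context, the DESeq2 approach keeps the counts discrete and instead normalizes through the mean: each sample gets a size factor (the median of its counts’ ratios to the per-gene geometric mean), and that factor scales the negative binomial mean rather than the data. A minimal numpy sketch of the size-factor computation, using a made-up toy count matrix:

```python
import numpy as np

# Toy count matrix: rows are genes, columns are samples.
# (Made-up numbers; sample 2 is sequenced at twice the depth of sample 1.
# In practice, genes with any zero counts are excluded first.)
counts = np.array([[10, 20],
                   [100, 200],
                   [1000, 2000]], dtype=float)

# DESeq2 median-of-ratios size factors: for each sample, take the
# median ratio of its counts to the per-gene geometric mean.
log_geo_mean = np.mean(np.log(counts), axis=1)  # per gene
size_factors = np.exp(
    np.median(np.log(counts) - log_geo_mean[:, None], axis=0)
)
print(size_factors)  # roughly [0.707, 1.414]
```

The raw counts then stay integers; the size factor only enters the likelihood by scaling mu inside the NegativeBinomial.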
I will try to generate a small set of data that replicates my problem, as I am unfortunately not allowed to share the real data.
I was able to reduce the data to a single point that gives a “SamplingError: Bad initial energy” error. Strangely, it was not the largest data point in the data set.
Y_OBS = 100             # No error
Y_OBS = 561270          # Error
Y_OBS = 561270 - 1000   # Error
Y_OBS = 561270 - 10000  # No error
Y_OBS = 798568          # No error

mu = pm.Normal("mu", mu=0, sd=10)
Y = pm.NegativeBinomial("Y", mu=2**mu, alpha=10, observed=Y_OBS)
Any idea of why this happens?
Thanks for the investigative work tracking down the troublesome data point - I have a couple of things here:
- Can you confirm that the model component which is blowing up the logp is the likelihood? See here for how to do that.
- Can you try changing alpha on its own, and then the hyperparameters for mu on their own, and see whether the error still occurs?
- Finally, can you make a plot of the initial logp for values approaching the bad Y_OBS? If you post a complete minimal viable reproduction I can help with this.
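That last check can be sketched without pymc3 at all, by evaluating the negative binomial log density directly in float64 with scipy (the mu and alpha values below are assumptions for illustration, matching the snippet earlier in the thread):

```python
import numpy as np
from scipy import stats

# Evaluate the negative binomial log density in plain float64 for
# observed values approaching the bad Y_OBS. pymc3's (mu, alpha)
# parametrization corresponds to scipy's nbinom(n, p) with
# n = alpha and p = alpha / (mu + alpha).
mu, alpha = 2.0 ** -3.2, 10.0  # illustrative values
n, p = alpha, alpha / (mu + alpha)
for y in range(561260, 561281):
    print(y, stats.nbinom.logpmf(y, n, p))
```

In float64 every one of these logp values is finite (just very negative), which would point at the integer representation of the observed data rather than at the density itself.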
Thanks for the help. I have taken a closer look at the logp values returned, using the following code:

import numpy as np
import pymc3 as pm

test_values = np.arange(32752, 32772, 1)
for v in test_values:
    with pm.Model() as model:
        mu = pm.Normal("mu", mu=0, sd=10)
        Y = pm.NegativeBinomial("Y", mu=2**mu, alpha=10, observed=v)
    for RV in model.basic_RVs:
        if RV.name == "Y":
            print(v, RV.logp(model.test_point))
- Comparing RVs “Y” and “mu”, it is always “Y” that has problematic values. The logp of “mu” is -3.2 for all of the test values in this example.
- Changing the hyperparameters of “mu” does change the initial logp of “Y”, but only for observed values that already return a valid logp for “Y” (-inf stays -inf). Changing alpha does not solve the problem either.
- I checked the logp over a large range of numbers, and which numbers return -inf seems quite random. Below are the logp values as I increase the observed value by one each iteration:
- For observed values 32752 to 32757: logp about -78456
- For observed values 32758 to 32766: logp positive infinity
- For observed value 32767: logp NaN
- For observed values 32768 and larger: logp negative infinity
For values much larger than 32768 the logp becomes valid again, and for even larger values it becomes -inf again. This pattern of valid and invalid logp values repeats over quite a large range of numbers.
Maybe a rounding error could be the source of this problem as it seems quite random?
This is very helpful information, and I appreciate you trying these values out. 32768 is a special value: it is 2^15, one more than 32767, the maximum value a signed 16-bit integer can hold. Since the code runs successfully with no issues on my computer, I’d like to identify any points at which a different integer representation might be used on yours. Would you mind running the script located here and posting the output?
I solved the problem. The issue was that theano.config.floatX was float32 rather than float64, which I think makes pymc3 use int16 instead of int32 for discrete data. Thus Y in the code you linked would be int16 (while the test values were of type int64). Simply changing theano.config.floatX to float64 solved the problem.
Thanks! I would never have figured out this was the issue without your help.
I’m glad to hear you solved it. I often use Theano with the 32 bit floating point representation so I will look out for this issue in the future.
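For anyone who lands on this thread later, the wrap-around can be demonstrated with plain numpy (no pymc3 needed): casting the observed values from the single-point experiment above to int16 makes exactly the “Error” cases go negative, and a negative observed count gives the negative binomial likelihood a -inf logp.

```python
import numpy as np

# The observed values tested earlier, cast to int16 the way a float32
# theano configuration would store them. astype wraps modulo 2**16.
obs = np.array([100, 561270, 561270 - 1000, 561270 - 10000, 798568],
               dtype=np.int64)
as_int16 = obs.astype(np.int16)
print(as_int16)  # 100, -28554, -29554, 26982, 12136
```

The two values that wrap to negative numbers are exactly the two that raised “Bad initial energy”.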