Probability of a certain value in a discrete distribution

Hi, sorry if this sounds very basic for you, but I’m a PyMC3 beginner, and I have a problem continuing my learning.

I have this code and I want to know the probability of nombres_clientes taking the value 34. By probability I mean a value between 0 and 1, not the logp that I have seen in a lot of posts here.

import pymc3 as pm

with pm.Model() as clientes:
    nombres_clientes = pm.Bound(pm.Geometric, upper=100)('nombres_clientes', p=0.02685)
    trace = pm.sample(10000, cores=1)

nombres_tr = trace['nombres_clientes']

I don’t know why I can’t find it; maybe it’s something obvious that I’m overlooking. Thank you so much.

You can use .dist to create a standalone distribution object instead of a random variable inside a model. Its logp method returns the logarithm of the probability, and applying np.exp to the evaluated result gives the probability of that event, which lies between 0 and 1. Here’s an example:

>>> import pymc3 as pm
>>> import numpy as np
>>> dist = pm.Geometric.dist(p=0.02685)
>>> np.exp(dist.logp(34).eval())
0.010936472378101412
>>> dist_bounded = pm.Bound(pm.Geometric, upper=100).dist(p=0.02685)
>>> np.exp(dist_bounded.logp(34).eval())
0.010936472378101412

If you like, you can refer to this notebook.
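The two outputs above are identical. As far as I understand, that is because pm.Bound only restricts the support and does not renormalise the density. Here is a minimal sketch in plain Python (no PyMC3) that reproduces the number from the closed-form geometric pmf, p(1-p)^(k-1), and also shows what a renormalised (truncated) probability would look like. The helper names geometric_pmf and truncated_geometric_pmf are my own, not part of any library.

```python
def geometric_pmf(k, p):
    """P(X = k) for a geometric variable on k = 1, 2, ... (trials until first success)."""
    if k < 1:
        return 0.0
    return p * (1.0 - p) ** (k - 1)


def truncated_geometric_pmf(k, p, upper):
    """P(X = k | X <= upper): the geometric pmf renormalised on 1..upper."""
    if k < 1 or k > upper:
        return 0.0
    mass_up_to_upper = 1.0 - (1.0 - p) ** upper  # P(X <= upper)
    return geometric_pmf(k, p) / mass_up_to_upper


p = 0.02685
plain = geometric_pmf(34, p)
truncated = truncated_geometric_pmf(34, p, upper=100)

print(plain)      # close to the 0.0109 obtained from np.exp(logp) above
print(truncated)  # slightly larger, because the mass above 100 is redistributed
```

If you need probabilities that sum to 1 over 1..100, the truncated version is the one to use; otherwise the plain value matches what logp reports.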


Thank you so much, I think I got the idea. I have one last question: how can I assign a ‘tag’ to that dist_bounded so I can identify it for plotting? Because if I write

dist_bounded = pm.Bound(pm.Geometric, upper=100).dist('names_customers', p=0.02685)

the program gives me an error. Thank you again.

You will have to declare a model and put the rv in context like this:

>>> with pm.Model() as model:
...  x = pm.Bound(pm.Geometric, upper=100)('x', 0.02685)
...

Then you can get logp at a sample point using:

>>> lp = model.logp({'x': 34})
>>> np.exp(lp)
0.010936472378101412

This is one of the FAQs; here’s the thread.


Thank you again! I understand the importance of the model now. One last question: I want to add another distribution to my model, this time a continuous one such as the uniform, like this:

with pm.Model() as model:
    name_customers = pm.Bound(pm.Geometric, upper=100)('x', p=0.02685)
    id_customers = pm.Uniform('id', lower=12000000, upper=99999999)
    trace = pm.sample(10000, cores=1)

and I’m trying to get the logp of name_customers only, at the value 36, as you showed me:

lp = model.logp({'x': 36})
x = np.exp(lp)

I have this Error:

TypeError: Missing required input: id_interval__

Can you tell me what I’m doing wrong?
Sorry for my basic knowledge; I’m trying to learn from the documentation and you are helping me a lot.

I know that pm.Uniform is a continuous distribution, so mathematically P{id = x} = 0 for every x. For continuous distributions I should be asking about probabilities of ranges, not points. Could that be the cause of the error?

Hi Eduardo

The error occurs because you are asking for the logp of the whole model but supplying only one input. The model has two distributions, ‘x’ and ‘id’, so your logp request needs two inputs. A further complication is that PyMC3 has transformed your ‘id’ behind the scenes within the model, which is why the error message refers to id_interval__ rather than id. I’m not sure that accessing the model’s logp is the best approach for you. It might help if you explain further what you want to achieve.

In any case from the code you have presented there is no link between ‘x’ and ‘id’, so they are independent. You can get the logp of ‘x’ separately from ‘id’. You could construct two different models.

I usually extract stats from either the posterior or prior depending on what I want. In your case you could do

np.mean(trace['x']==36) #equals 0.0084

and for continuous distributions you can look at ranges

np.mean( np.logical_and(trace['id']>(13e6-500e3), trace['id']<(13e6+500e3)) ) #equals 0.0123
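To make the idea concrete without running the sampler, here is a rough sketch in which NumPy draws stand in for trace['x'] and trace['id']. This is an assumption for illustration only: the draws come straight from NumPy's generators, not from pm.sample, and the geometric draws are truncated at 100 by simply discarding larger values. The estimates will therefore not necessarily match the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
p, upper = 0.02685, 100

# Stand-in for trace['x']: geometric draws (support 1, 2, ...), keeping values <= upper.
draws = rng.geometric(p, size=400_000)
x_samples = draws[draws <= upper]

# Stand-in for trace['id']: uniform draws on the range used in the model.
id_samples = rng.uniform(12_000_000, 99_999_999, size=x_samples.size)

# Discrete variable: fraction of draws equal to the value of interest.
p_x_36 = np.mean(x_samples == 36)

# Continuous variable: fraction of draws falling inside a range.
lo, hi = 13e6 - 500e3, 13e6 + 500e3
p_id_range = np.mean((id_samples > lo) & (id_samples < hi))

print(p_x_36, p_id_range)
```

The same two np.mean expressions work unchanged on a real trace; the more draws you have, the closer the fractions get to the underlying probabilities.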

A further point is that PyMC3 works best with smaller absolute numbers. ID values should start at zero and increment by 1 as integers. A common approach is to standardise both x and y, that is, subtract the means and divide by the standard deviations.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
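As a small sketch of that standardisation step, done directly in NumPy rather than with StandardScaler: the id values below are made up purely for illustration, and np.std uses the population standard deviation by default, which I believe is also what StandardScaler does.

```python
import numpy as np

# Hypothetical 8-digit ids, purely for illustration.
ids = np.array([12345678, 45678901, 78901234, 99999999, 13467780], dtype=float)

mean, std = ids.mean(), ids.std()
ids_scaled = (ids - mean) / std  # zero mean, unit standard deviation

# Map scaled values back to the original scale when reporting results.
ids_restored = ids_scaled * std + mean

print(ids_scaled)
```

Keeping mean and std around lets you convert any value of interest (for example a particular id) to the scaled units before querying the model.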


Hi Nicholas, thank you so much.
Okay, I understand.

My goal is to be able to ask the program:
(1) What is the probability of a customer being named ‘JOHN’? (2) What is the probability of finding a customer whose name is ‘JOHN’ (name_customers distribution) and whose id (id_customers distribution) is 13467780?
(I used name and id in this example, but more commonly I look up name and age, which is also a continuous distribution like id.)

The program should return probabilities between 0 and 1. (Note that ‘JOHN’ is looked up in a dictionary, and its value in the name_customers distribution is, for example, 30.)

I have just learned how to do the first part (with an independent model, as you suggested).

For the second part I need to combine both distributions in one model, but since one is continuous (id_customers) and the other discrete (name_customers), I don’t know how to get the joint probability I am after.

As for your last point, I use large numbers so that the ids look as much as possible like a real national identifier, hence the 8-digit values.

Thank you so much for your attention, I really appreciate it! (from a PyMC beginner)

Okay, I realized that for the second question above it is enough to multiply the probability of a value of name_customers (in this case ‘JOHN’) by the probability of a value of id_customers (in my example, 13467780), since the two are independent.

Multiplying these two values gives me the probability I am looking for in the second question.
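Here is a minimal sketch of that multiplication in plain Python. Because id_customers is continuous, the probability of one exact value is zero, so the sketch uses the probability of a small window around the id instead. The window of width 1 (treating the id as an integer) is my own assumption, the helper names are hypothetical, and the upper bound of 100 on the name distribution is ignored for simplicity.

```python
def geometric_pmf(k, p):
    """P(X = k) for a geometric variable on k = 1, 2, ..."""
    return p * (1.0 - p) ** (k - 1) if k >= 1 else 0.0


def uniform_interval_prob(a, b, lower, upper):
    """P(a <= U <= b) for U ~ Uniform(lower, upper)."""
    a, b = max(a, lower), min(b, upper)  # clip the window to the support
    return max(b - a, 0.0) / (upper - lower)


# 'JOHN' is encoded as 30 in the example from this thread.
p_name = geometric_pmf(30, 0.02685)

# Assumed window of width 1 centred on the id of interest.
target_id = 13467780
p_id = uniform_interval_prob(target_id - 0.5, target_id + 0.5, 12000000, 99999999)

# Independence lets the joint probability factorise into a product.
p_joint = p_name * p_id
print(p_name, p_id, p_joint)
```

The product only holds while the two variables stay independent in the model; if name and id (or age) become linked, the joint probability would have to come from the model itself.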

Anyway, apart from the official documentation, does anyone know of other resources that could help me write models and fit my distributions as well as possible? Thank you.

Hi Edoardo,
If you are looking for general educational resources about Bayesian modeling, I summed up several of them in a previous thread.
Hope you’ll find that useful!