Get probability of parameter given new data

Hello,

I have been working on parameter identification for a bimodal distribution, as follows:

Identification of multimodal distributions

The aim of this notebook is to review how PyMC3 handles the identification of a mixture of Gaussians generating the data.

import pymc3 as pm
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import theano
import theano.tensor as tt

First we define a generic function able to generate linear data $$y = a \cdot x + b + \epsilon$$
with $\epsilon$ following a Gaussian law with zero mean and standard deviation $\sigma$.

def generate_data(size = 100, true_intercept = 1, true_slope = 2, scale = .1):
    #
    # generating data
    #
    x = np.random.uniform(0., 1., size=size)
    # y = true_intercept + true_slope * x
    true_regression_line = true_intercept + true_slope * x
    # add noise
    #np.random.seed(453)
    y = true_regression_line + np.random.normal(scale=scale, size=size)

    return x, y, true_regression_line

1- Constant values

In this part, the relation between y and x is constant (i.e. the slope is 0).
We consider two components with respective means $\mu_1 = 0$ and $\mu_2 = 1$.

def generate_mixed(size=100, data_plot=True):
    # data generation
    x_data_1, y_data_1, true_regression_line = generate_data(size=size, true_slope=0., true_intercept=0)
    x_data_2, y_data_2, true_regression_line_2 = generate_data(size=size, true_slope=0., true_intercept=1)

    if(data_plot):
        # display data & true regression line
        fig = plt.figure(figsize=(7, 7))
        ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
        ax.plot(x_data_1, y_data_1, 'x', label='sampled data 1')
        ax.plot(x_data_2, y_data_2, 'o', label='sampled data 2')
        ax.plot(x_data_1, true_regression_line, label='true regression line', lw=2.)
        ax.plot(x_data_2, true_regression_line_2, label='true regression line 2', lw=3.)
        plt.legend(loc=0);
        plt.show()

    x_data = np.concatenate((x_data_1, x_data_2))
    y_data = np.concatenate((y_data_1, y_data_2))
    
    print('    Overall y mean={0:.2f}'.format(y_data.mean()))
    print('Component y_1 mean={0:.2f}'.format(y_data_1.mean()))
    print('Component y_2 mean={0:.2f}'.format(y_data_2.mean()))

    data = {'x': x_data, 'y': y_data}
    return data
data = generate_mixed(size=1000)
x_data = data['x']
y_data = data['y']

[Figure: generated data and the underlying model]

    Overall y mean=0.50
Component y_1 mean=0.00
Component y_2 mean=0.99
sns.distplot(y_data)

[Figure: distribution of y_data]

The model assumes a mixture of two Gaussians:

# https://docs.pymc.io/notebooks/marginalized_gaussian_mixture_model.html
W = np.array([0., 0])
with pm.Model() as model_mixed:
    w = pm.Dirichlet('w', np.ones_like(W))
    mu = pm.Normal('mu', 0., 10., shape=W.size)
    sigma = pm.Uniform('sigma', 0., 5., shape=W.size)

    # Likelihood
    y_obs = pm.NormalMixture('y_obs', w, mu, sd=sigma, observed=y_data)
    
    trace_mixed = pm.sample(3000, cores=4, step=pm.SMC())

pm.traceplot(trace_mixed)
pm.summary(trace_mixed)
Sample initial stage: ...
Stage: 0 Beta: 0.000325 Steps: 25
Stage: 1 Beta: 0.000917 Steps: 17
Stage: 2 Beta: 0.001724 Steps: 4
Stage: 3 Beta: 0.003111 Steps: 25
Stage: 4 Beta: 0.005534 Steps: 5
Stage: 5 Beta: 0.007663 Steps: 25
Stage: 6 Beta: 0.009495 Steps: 4
Stage: 7 Beta: 0.012544 Steps: 25
Stage: 8 Beta: 0.019295 Steps: 3
Stage: 9 Beta: 0.023961 Steps: 21
Stage: 10 Beta: 0.027704 Steps: 4
Stage: 11 Beta: 0.032374 Steps: 25
Stage: 12 Beta: 0.039610 Steps: 4
Stage: 13 Beta: 0.050035 Steps: 25
Stage: 14 Beta: 0.064108 Steps: 3
Stage: 15 Beta: 0.083553 Steps: 15
Stage: 16 Beta: 0.116714 Steps: 3
Stage: 17 Beta: 0.183498 Steps: 8
Stage: 18 Beta: 0.327297 Steps: 4
Stage: 19 Beta: 0.613976 Steps: 7
Stage: 20 Beta: 1.000000 Steps: 4
              mean        sd  mc_error   hpd_2.5  hpd_97.5
mu__0     0.994766  0.003140  0.000060  0.988806  1.000994
mu__1     0.004642  0.003169  0.000055 -0.001040  0.011343
w__0      0.500689  0.011411  0.000209  0.480149  0.523877
w__1      0.499311  0.011411  0.000209  0.476123  0.519851
sigma__0  0.097263  0.002185  0.000035  0.093006  0.101483
sigma__1  0.100837  0.002264  0.000043  0.096461  0.105173

ppc = pm.sample_posterior_predictive(trace_mixed, samples=500, model=model_mixed)
samples_from_posterior = np.asarray(ppc['y_obs'])
print(samples_from_posterior.shape)
_, ax = plt.subplots(figsize=(12, 6))
for y in ppc['y_obs']:
    sns.distplot(y , hist=False, ax=ax, kde_kws={"lw": 1, 'alpha':0.3})
sns.distplot(y_data, ax=ax, hist=False,
            kde_kws={"color": "k", "lw": 3, "label": "Dataset"},
             hist_kws={"histtype": "step", "linewidth": 3,
                       "alpha": 1, "color": "g"})
(500, 2000)

[Figure: posterior predictive densities overlaid on the density of the dataset]

As you can see, things are going pretty well (in my opinion).
I would now like to calculate the probability of w given a new data point y. How can I do that?

Thank you in advance.


Not sure I understand. The probability of w is already fixed given the model and the observed data (represented as samples in the trace). So are you looking to either:

  1. update the posterior of w given a new data point y, or
  2. given a new data point y, compute the probability that it belongs to component 1 or 2 (see the formula below)?
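
If it is option 2, the quantity in question is, as I understand it, the posterior membership probability, where $z$ denotes the latent component indicator of the new point. Conditional on one posterior draw of the parameters it reads

$$P(z = k \mid y_{new}, \mu, \sigma, w) = \frac{w_k \, \mathcal{N}(y_{new} \mid \mu_k, \sigma_k)}{\sum_{j} w_j \, \mathcal{N}(y_{new} \mid \mu_j, \sigma_j)}$$

and the final answer is this quantity averaged over the posterior samples of $\mu$, $\sigma$ and $w$.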

Thanks for your answer.
Indeed, my question was not crystal clear:

  • I am trying to figure out option 2: given a new data point y, how to compute the probability that it belongs to component 1 or 2.

In that case, you can evaluate the likelihood of y belonging to each component, and then normalize the likelihoods, something like:

ynew = np.asarray([.5])

# per-component log-likelihood of ynew under the mixture defined by y_obs
complogp = y_obs.distribution._comp_logp(theano.shared(ynew))
f_complogp = model_mixed.model.fastfn(complogp)

weight_ynew = []
for ichain in range(trace_mixed.nchains):
    for point_idx in range(len(trace_mixed)):
        point = trace_mixed._straces[ichain].point(point_idx)
        # keep only the free (transformed) parameters expected by the compiled function
        point.pop('sigma')
        point.pop('w')
        prob = np.exp(f_complogp(point))
        prob /= prob.sum()
        weight_ynew.append(prob)

weight_ynew = np.asarray(weight_ynew).squeeze()

sns.kdeplot(weight_ynew[:, 0]);
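
Side note: since the fitted weights are essentially 0.5/0.5 here, normalizing the raw component likelihoods as above gives practically the same result as the weighted membership formula. A minimal numpy/scipy sketch of the same idea, assuming the trace exposes mu, sigma and w as arrays (illustrative only, not part of the original answer):

from scipy import stats

ynew = 0.5
mus = trace_mixed['mu']        # posterior draws, shape (n_samples, 2)
sigmas = trace_mixed['sigma']  # shape (n_samples, 2)
ws = trace_mixed['w']          # shape (n_samples, 2)

# per-draw component likelihoods of ynew, weighted by the mixture weights
lik = ws * stats.norm.pdf(ynew, loc=mus, scale=sigmas)
# normalize each draw to get membership probabilities
resp = lik / lik.sum(axis=1, keepdims=True)

print(resp.mean(axis=0))  # averaged membership probabilities of components 1 and 2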

This is working perfectly. :+1:
Thank you very much!! :grinning:

To complete the topic, here is the evolution of P(w_1) and P(w_2) with respect to y.

ynew = np.asarray([0.51])

def prob_weights(model_mixed, trace_mixed, ynew):
    complogp = y_obs.distribution._comp_logp(theano.shared(ynew))
    f_complogp = model_mixed.model.fastfn(complogp)
    weight_ynew = []
    for ichain in range(trace_mixed.nchains):
        for point_idx in range(len(trace_mixed)):
            point  = trace_mixed._straces[ichain].point(point_idx)
            point.pop('sigma')
            point.pop('w')
            prob = np.exp(f_complogp(point))
            prob /= prob.sum()
            weight_ynew.append(prob)

    weight_ynew = np.asarray(weight_ynew).squeeze()

    return weight_ynew.mean(axis=0)

n=1000
min_x = -0.5
max_x = 1.5
delta_x = (max_x - min_x )/n
probability_weights = np.array([[],[],[]])

for i in range(n):
    xx = min_x + i * delta_x
    res = prob_weights(model_mixed, trace_mixed, xx)
    res = np.append(xx, res).reshape(3,1)
    probability_weights = np.concatenate([probability_weights, res], axis=1)
print(probability_weights.shape)
probability_weights = pd.DataFrame(probability_weights.T, columns=['x', 'P(w_1)', 'P(w_2)'])
probability_weights

(3, 1000)
          x        P(w_1)        P(w_2)
0    -0.500  1.000000e+00  4.424003e-42
1    -0.498  1.000000e+00  5.257151e-42
2    -0.496  1.000000e+00  6.247663e-42
3    -0.494  1.000000e+00  7.425351e-42
4    -0.492  1.000000e+00  8.825689e-42
..      ...           ...           ...
995   1.490  5.198155e-40  1.000000e+00
996   1.492  4.452858e-40  1.000000e+00
997   1.494  3.814773e-40  1.000000e+00
998   1.496  3.268425e-40  1.000000e+00
999   1.498  2.800583e-40  1.000000e+00

1000 rows × 3 columns

probability_weights.set_index('x', inplace=True)
sns.scatterplot(data = probability_weights)

[Figure: P(w_1) and P(w_2) as a function of x]


I have been digging into the code you gave, and there are some points I have not fully understood.

Basically, the idea is to evaluate, for each component, the expectation of P(y_new | w=w_i).
For that we use the _comp_logp function, which in our case returns the two components of the mixture log-likelihood defined by y_obs.
To evaluate the expectation, we use the posterior distribution of the parameters from the trace through two nested for loops.
What I do not understand is why we drop the sigma and w variables from the trace point when evaluating f_complogp. What looks like magic to me at this point is that the definition of this function seems to require such a drop “by default”…

Thank you in advance.

The above code is not evaluating the expectation of P(y_new | w=w_i); instead, it is evaluating the expectation of P(y_new was generated by component k). Also, we drop sigma from the trace not because we don't need that information: the posterior of the transformed parameter (sigma_interval__, which is the actual free parameter being sampled) is used instead. However, we don't use w in f_complogp.

To get P(y_new | w=w_i), you need to work out the conditional probability, which is P(y_new, w=w_i) / P(w=w_i).
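
Regarding the popping of variables, one way to see it is to compare the model's free parameters with the contents of a trace point (just an illustrative check):

# The free parameters are the transformed variables (e.g. sigma_interval__, coming from
# the automatic transform of the Uniform prior on sigma); the untransformed sigma and w
# stored in the trace are not inputs of the compiled function, hence the pops above.
print(model_mixed.free_RVs)
print(trace_mixed._straces[0].point(0).keys())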

That's pretty clear to me now. Thank you again! :grinning:
