Get probability of parameter given new data

jean-phi66 · January 12, 2019, 5:21am

Hello,

I have been working in the parameters identification for a bimodal distribution as follow:

Identification of multimodal distributions

The aim of this notebook is to review how pymc3 can handle identification of mixed gaussians generating data.

import pymc3 as pm
from pymc3 import *
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import theano
import theano.tensor as tt

First we define a generic function able to generate linear $$y = a.x + b + \epsilon$$
wit \epsilon following a gaussian law with null mean and variance \sigma.

def generate_data(size = 100, true_intercept = 1, true_slope = 2, scale = .1):
    #
    # generating data
    #
    x = np.random.uniform(0., 1., size=size)
    # y = a + b*x
    true_regression_line = true_intercept + true_slope * x
    # add noise
    #np.random.seed(453)
    y = true_regression_line + np.random.normal(scale=scale, size=size)

    return x, y, true_regression_line

1- Constant values

In this part, the relation between y and x is constant (e.g. a=0).
We consider two components with respective means \mu_1=0 and \mu_2=1.

def generate_mixed(size=100, data_plot=True):
    # data generation
    x_data_1, y_data_1, true_regression_line = generate_data(size=size, true_slope=0., true_intercept=0)
    x_data_2, y_data_2, true_regression_line_2 = generate_data(size=size, true_slope=0., true_intercept=1)

    if(data_plot):
        # display data & true regression line
        fig = plt.figure(figsize=(7, 7))
        ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
        ax.plot(x_data_1, y_data_1, 'x', label='sampled data 1')
        ax.plot(x_data_2, y_data_2, 'o', label='sampled data 2')
        ax.plot(x_data_1, true_regression_line, label='true regression line', lw=2.)
        ax.plot(x_data_2, true_regression_line_2, label='true regression line 2', lw=3.)
        plt.legend(loc=0);
        plt.show()

    x_data = np.concatenate((x_data_1, x_data_2))
    y_data = np.concatenate((y_data_1, y_data_2))
    
    print('    Overall y mean={0:.2f}'.format(y_data.mean()))
    print('Component y_1 mean={0:.2f}'.format(y_data_1.mean()))
    print('Component y_2 mean={0:.2f}'.format(y_data_2.mean()))

    data = {'x': x_data, 'y': y_data}
    return data

data = generate_mixed(size=1000)
x_data = data['x']
y_data = data['y']

output_6_0

    Overall y mean=0.50
Component y_1 mean=0.00
Component y_2 mean=0.99

sns.distplot(y_data)

<matplotlib.axes._subplots.AxesSubplot at 0x1c396a2c18>

output_7_1

The model is assuming a mixture of two Gaussians

# https://docs.pymc.io/notebooks/marginalized_gaussian_mixture_model.html
W = np.array([0., 0])
with pm.Model() as model_mixed:
    w = pm.Dirichlet('w', np.ones_like(W))
    mu = pm.Normal('mu', 0., 10., shape=W.size)
    sigma = pm.Uniform('sigma', 0., 5., shape=W.size)

    # Likelihood
    y_obs = pm.NormalMixture('y_obs', w, mu, sd=sigma, observed=y_data)
    
    trace_mixed = pm.sample(3000, cores=4, step=pm.SMC())

pm.traceplot(trace_mixed)
pm.summary(trace_mixed)

Sample initial stage: ...
Stage: 0 Beta: 0.000325 Steps: 25
100%|██████████| 3000/3000 [00:27<00:00, 110.90it/s]
Stage: 1 Beta: 0.000917 Steps: 17
100%|██████████| 3000/3000 [00:20<00:00, 150.97it/s]
Stage: 2 Beta: 0.001724 Steps: 4
100%|██████████| 3000/3000 [00:04<00:00, 645.63it/s]
Stage: 3 Beta: 0.003111 Steps: 25
100%|██████████| 3000/3000 [00:28<00:00, 106.95it/s]
Stage: 4 Beta: 0.005534 Steps: 5
100%|██████████| 3000/3000 [00:05<00:00, 507.53it/s]
Stage: 5 Beta: 0.007663 Steps: 25
100%|██████████| 3000/3000 [00:28<00:00, 104.99it/s]
Stage: 6 Beta: 0.009495 Steps: 4
100%|██████████| 3000/3000 [00:04<00:00, 628.53it/s]
Stage: 7 Beta: 0.012544 Steps: 25
100%|██████████| 3000/3000 [00:30<00:00, 99.27it/s] 
Stage: 8 Beta: 0.019295 Steps: 3
100%|██████████| 3000/3000 [00:03<00:00, 812.85it/s]
Stage: 9 Beta: 0.023961 Steps: 21
100%|██████████| 3000/3000 [00:24<00:00, 121.05it/s]
Stage: 10 Beta: 0.027704 Steps: 4
100%|██████████| 3000/3000 [00:04<00:00, 642.48it/s]
Stage: 11 Beta: 0.032374 Steps: 25
100%|██████████| 3000/3000 [00:29<00:00, 102.82it/s]
Stage: 12 Beta: 0.039610 Steps: 4
100%|██████████| 3000/3000 [00:04<00:00, 621.53it/s]
Stage: 13 Beta: 0.050035 Steps: 25
100%|██████████| 3000/3000 [00:30<00:00, 99.43it/s] 
Stage: 14 Beta: 0.064108 Steps: 3
100%|██████████| 3000/3000 [00:03<00:00, 777.79it/s]
Stage: 15 Beta: 0.083553 Steps: 15
100%|██████████| 3000/3000 [00:18<00:00, 160.28it/s]
Stage: 16 Beta: 0.116714 Steps: 3
100%|██████████| 3000/3000 [00:03<00:00, 795.55it/s]
Stage: 17 Beta: 0.183498 Steps: 8
100%|██████████| 3000/3000 [00:09<00:00, 309.22it/s]
Stage: 18 Beta: 0.327297 Steps: 4
100%|██████████| 3000/3000 [00:05<00:00, 586.19it/s]
Stage: 19 Beta: 0.613976 Steps: 7
100%|██████████| 3000/3000 [00:08<00:00, 341.50it/s]
Stage: 20 Beta: 1.000000 Steps: 4
100%|██████████| 3000/3000 [00:04<00:00, 602.59it/s]

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	mean	sd	mc_error	hpd_2.5	hpd_97.5
mu__0	0.994766	0.003140	0.000060	0.988806	1.000994
mu__1	0.004642	0.003169	0.000055	-0.001040	0.011343
w__0	0.500689	0.011411	0.000209	0.480149	0.523877
w__1	0.499311	0.011411	0.000209	0.476123	0.519851
sigma__0	0.097263	0.002185	0.000035	0.093006	0.101483
sigma__1	0.100837	0.002264	0.000043	0.096461	0.105173

ppc = pm.sample_posterior_predictive(trace_mixed, samples=500, model=model_mixed)
samples_from_posterior = np.asarray(ppc['y_obs'])
print(samples_from_posterior.shape)
_, ax = plt.subplots(figsize=(12, 6))
for y in ppc['y_obs']:
    sns.distplot(y , hist=False, ax=ax, kde_kws={"lw": 1, 'alpha':0.3})
sns.distplot(y_data, ax=ax, hist=False,
            kde_kws={"color": "k", "lw": 3, "label": "Dataset"},
             hist_kws={"histtype": "step", "linewidth": 3,
                       "alpha": 1, "color": "g"})

100%|██████████| 500/500 [00:00<00:00, 570.65it/s]


(500, 2000)





<matplotlib.axes._subplots.AxesSubplot at 0x1c376346d8>

As you can see things are going pretty well (in my opinion).
I am now willing to calculate the probability of w given a new data point y. How can I do that?

Thank you in advance.

junpenglao · January 12, 2019, 8:33am

Not sure I understand - The probability of w is already fixed given the model and the observed (represented as samples in trace). So you either looking to:

update the posterior of w given a new data point y.
or
given a new data point y, compute the probability it’s belonging to component 1 or 2

jean-phi66 · January 12, 2019, 8:45am

Thanks for your answer.
Indeed, my question was not crystal clear:

I am trying to figure out option 2 : given a new data point y, how to compute the probability that it belongs to component 1 or 2.

junpenglao · January 12, 2019, 9:15am

In that case, you can evaluate the likelihood of y belonging to each component, and then normalized the likelihood, something like:

ynew = np.asarray([.5])

complogp = y_obs.distribution._comp_logp(theano.shared(ynew))
f_complogp = model_mixed.model.fastfn(complogp)

weight_ynew = []
for ichain in range(trace_mixed.nchains):
    for point_idx in range(len(trace_mixed)):
        point  = trace_mixed._straces[ichain].point(point_idx)
        point.pop('sigma')
        point.pop('w')
        prob = np.exp(f_complogp(point))
        prob /= prob.sum()
        weight_ynew.append(prob)

weight_ynew = np.asarray(weight_ynew).squeeze()

sns.kdeplot(weight_ynew[:, 0]);

jean-phi66 · January 12, 2019, 6:27pm

This is working perfectly.
Thank you very much!!

jean-phi66 · January 12, 2019, 7:55pm

To complete the topic, here is the evolution of P(w_1) and P(w_2) wrt y.

ynew = np.asarray([0.51])

def prob_weights(model_mixed, trace_mixed, ynew):
    complogp = y_obs.distribution._comp_logp(theano.shared(ynew))
    f_complogp = model_mixed.model.fastfn(complogp)
    weight_ynew = []
    for ichain in range(trace_mixed.nchains):
        for point_idx in range(len(trace_mixed)):
            point  = trace_mixed._straces[ichain].point(point_idx)
            point.pop('sigma')
            point.pop('w')
            prob = np.exp(f_complogp(point))
            prob /= prob.sum()
            weight_ynew.append(prob)

    weight_ynew = np.asarray(weight_ynew).squeeze()

    return(weight_ynew.mean(axis=0))

n=1000
min_x = -0.5
max_x = 1.5
delta_x = (max_x - min_x )/n
probability_weights = np.array([[],[],[]])

for i in range(n):
    xx = min_x + i * delta_x
    res = prob_weights(model_mixed, trace_mixed, xx)
    res = np.append(xx, res).reshape(3,1)
    probability_weights = np.concatenate([probability_weights, res], axis=1)
print(probability_weights.shape)
probability_weights = pd.DataFrame(probability_weights.T, columns=['x', 'P(w_1)', 'P(w_2)'])
probability_weights

(3, 1000)

.dataframe tbody tr th:only-of-type { vertical-align: middle; }

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	x	P(w_1)	P(w_2)
0	-0.500	1.000000e+00	4.424003e-42
1	-0.498	1.000000e+00	5.257151e-42
2	-0.496	1.000000e+00	6.247663e-42
3	-0.494	1.000000e+00	7.425351e-42
4	-0.492	1.000000e+00	8.825689e-42
5	-0.490	1.000000e+00	1.049089e-41
6	-0.488	1.000000e+00	1.247121e-41
7	-0.486	1.000000e+00	1.482645e-41
8	-0.484	1.000000e+00	1.762779e-41
9	-0.482	1.000000e+00	2.095998e-41
10	-0.480	1.000000e+00	2.492392e-41
11	-0.478	1.000000e+00	2.963974e-41
12	-0.476	1.000000e+00	3.525045e-41
13	-0.474	1.000000e+00	4.192638e-41
14	-0.472	1.000000e+00	4.987038e-41
15	-0.470	1.000000e+00	5.932400e-41
16	-0.468	1.000000e+00	7.057496e-41
17	-0.466	1.000000e+00	8.396600e-41
18	-0.464	1.000000e+00	9.990538e-41
19	-0.462	1.000000e+00	1.188795e-40
20	-0.460	1.000000e+00	1.414678e-40
21	-0.458	1.000000e+00	1.683607e-40
22	-0.456	1.000000e+00	2.003811e-40
23	-0.454	1.000000e+00	2.385093e-40
24	-0.452	1.000000e+00	2.839139e-40
25	-0.450	1.000000e+00	3.379875e-40
26	-0.448	1.000000e+00	4.023903e-40
27	-0.446	1.000000e+00	4.791011e-40
28	-0.444	1.000000e+00	5.704790e-40
29	-0.442	1.000000e+00	6.793364e-40
...	...	...	...
970	1.440	2.567641e-38	1.000000e+00
971	1.442	2.194209e-38	1.000000e+00
972	1.444	1.875275e-38	1.000000e+00
973	1.446	1.602860e-38	1.000000e+00
974	1.448	1.370153e-38	1.000000e+00
975	1.450	1.171348e-38	1.000000e+00
976	1.452	1.001487e-38	1.000000e+00
977	1.454	8.563431e-39	1.000000e+00
978	1.456	7.323064e-39	1.000000e+00
979	1.458	6.262970e-39	1.000000e+00
980	1.460	5.356858e-39	1.000000e+00
981	1.462	4.582286e-39	1.000000e+00
982	1.464	3.920093e-39	1.000000e+00
983	1.466	3.353918e-39	1.000000e+00
984	1.468	2.869791e-39	1.000000e+00
985	1.470	2.455782e-39	1.000000e+00
986	1.472	2.101701e-39	1.000000e+00
987	1.474	1.798843e-39	1.000000e+00
988	1.476	1.539775e-39	1.000000e+00
989	1.478	1.318142e-39	1.000000e+00
990	1.480	1.128517e-39	1.000000e+00
991	1.482	9.662622e-40	1.000000e+00
992	1.484	8.274136e-40	1.000000e+00
993	1.486	7.085835e-40	1.000000e+00
994	1.488	6.068759e-40	1.000000e+00
995	1.490	5.198155e-40	1.000000e+00
996	1.492	4.452858e-40	1.000000e+00
997	1.494	3.814773e-40	1.000000e+00
998	1.496	3.268425e-40	1.000000e+00
999	1.498	2.800583e-40	1.000000e+00

1000 rows × 3 columns

probability_weights.set_index('x', inplace=True)
sns.scatterplot(data = probability_weights)

output_13_1

jean-phi66 · January 15, 2019, 5:08pm

I have been digging in the code you have given and there is some points I have not fully understood.

Basically, the idea is to evaluate for each component the Expectation of P(y_{new} | w=w_i).
For that we are using the _comp_logp function that is returning in our case the two components of the mixture loglikelihood defined by y_obs.
To evaluate the expectation we are using the posterior distribution of the parameters from the trace through two nested for loops.
What I am not understanding is why do we drop sigma and w variables from the trace point when evaluating f_complogp. What seems like some magic to me at this point is that the definition of this function seems to request such a drop “by default”…

Thank you in advance.

junpenglao · January 15, 2019, 8:14pm

The above code is not evaluating the Expectation of P(y_new | w=w_i), instead, it is evaluating the expectation of P(y_new is generated from component k). Also, we drop sigma from the trace is not because we dont need the information, the posterior of the transformed parameter (sigma_log__ which is the actually free parameter being sampled) is used. However, we dont use the w in f_complogp.

To get P(y_new | w=w_i), you need to work out the conditional probability which is P(y_new, w=w_i) / P(w=w_i).

jean-phi66 · January 18, 2019, 5:53am

That’s pretty clear for me now. thank you again!

Topic		Replies	Views
Gaussian mixture - Infer probability of latent variable on new data Questions	7	643	February 7, 2021
Beyond linear regression with pymc v5 modeling	11	661	October 29, 2022
Likelihood of data belonging to mixture component v5 modeling	3	237	November 7, 2023
Fitting mixture of binomials v5 modeling	30	98	February 4, 2025
Compute probability of parameters given data in a bayesian network Questions	4	991	June 8, 2020

Get probability of parameter given new data

Identification of multimodal distributions

1- Constant values

Related topics