Data aggregation in a crowdsensing application


#1

Hi everyone, I'm new to the pymc3 world, but in my opinion this tool is fantastic! I'm working on a simple crowdsensing application where a group of users sends me estimates of a target (continuous measurements), for example a GPS location. In my model I want to find the mean value of these measurements and a reliability value for each user. How can I set up my model? I thought of modelling the uncertainty of the measurements with a normal distribution, but how can I model each user's reliability? Do you have any suggestions?


#2

I would first try a hierarchical normal model with the target location modelled as a latent mean \mu, and each observation (from one user) modelled as y_{\text{user}} \sim \text{Normal}(\mu, \sigma). If you take multiple measurements from each user you can model an individual \sigma for each user (e.g., \sigma_{\text{user}} \sim \text{HalfNormal}(\cdot)); otherwise either a fixed value or a scalar parameter for \sigma will do.
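A minimal PyMC3 sketch of that setup, assuming several measurements per user (the shapes, the simulated data, and names such as sigma_user are placeholders for illustration):

import numpy as np
import pymc3 as pm

# Hypothetical data: n_users users each send n_measurements measurements of the same target
n_users, n_measurements = 5, 20
data = 38.0 + np.random.randn(n_users, n_measurements) * 0.01

with pm.Model() as sketch:
    # Latent target location
    mu = pm.Normal('mu', 0., 100.)
    # One noise parameter per user (its size acts as an inverse reliability)
    sigma_user = pm.HalfNormal('sigma_user', 1., shape=(n_users, 1))
    # Row i of `data` holds user i's measurements
    obs = pm.Normal('obs', mu, sigma_user, observed=data)
    trace = pm.sample(1000)

With one row per user, the (n_users, 1) shape of sigma_user broadcasts against that user's measurements.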


#3

Thank you for the quick response. I started writing the code, but I have another doubt: how can I aggregate the means of the users' measurements into one general mean? In general, how can I aggregate different normal distributions?


#4

If you do

import pymc3 as pm

with pm.Model() as m:
    overall_mean = pm.Normal('mu', 0., 100.)
    sigma = 10.  # <- replace 10. with a distribution or other prior information
    # `data` is an array holding all the users' measurements
    obs = pm.Normal('observation', overall_mean, sigma, observed=data)

The information from your observed data will automatically be aggregated toward overall_mean by Bayes' theorem (or, as some people call it, backward inference).


#5

I get it! We are almost there. The last question is: how can I extract a reliability value for each user from the overall sigma?


#6

It depends. Either you have some information about how precise each measurement is (from each device's properties), or you collect multiple data points from each user and estimate it from the data.
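For the first case, a minimal sketch could look like this (known_sd and the data are made-up placeholders; each user's noise is fixed to their device's stated precision):

import numpy as np
import pymc3 as pm

# Hypothetical known per-device standard deviations (e.g., from spec sheets) and fake data
known_sd = np.array([0.001, 0.005, 0.003, 0.012])
data = 38.0 + np.random.randn(4, 10) * known_sd[:, None]

with pm.Model() as known_precision_model:
    mu = pm.Normal('mu', 0., 100.)
    # Each user's measurement noise is fixed to the known precision of their device
    obs = pm.Normal('obs', mu, known_sd[:, None], observed=data)
    trace = pm.sample(1000)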


#7

Is it impossible to estimate reliability from the measurements that I use in the model?


#8

In general you can interpret \sigma as the reliability. For example, if you take multiple measurements in a short period for each user (i.e., multiple data points per user), then you can interpret the individual \sigma above as the reliability of each user.
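Continuing the sketch from post #2 above (sigma_user and trace come from that sketch; the inverse-noise score is just one ad hoc choice):

# Posterior mean of each user's noise level; a smaller sigma means a more reliable user
sigma_hat = trace['sigma_user'].mean(axis=0).ravel()

# One possible reliability score: the inverse of the estimated noise
reliability = 1.0 / sigma_hat
print(reliability)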


#9

I understand, but one doubt remains: why not calculate the mean and variance directly from the measurements? Why is inference better? Excuse me if my questions seem silly, but I'm new to the probabilistic programming world :slight_smile:


#10

In the Gaussian case with a uniform prior the result would be identical. The strength of the Bayesian approach comes from the prior and from the exchange of information between the different components of the model. Once you have your model set up you can do all kinds of crazy things, like taking into account temporal information (e.g., trust measurements that are closer in time more), spatial information (e.g., trust users that report closer measurements more), or a weighted average (e.g., trust users with newer/better equipment more). All of these can be done by introducing additional structure and correlation into your model.
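As one example of what such additional structure could look like, the per-user noise could depend on a hypothetical equipment-quality score. This is only a sketch: the quality values, the data, and the exponential link are all made up for illustration.

import numpy as np
import pymc3 as pm

# Hypothetical quality score per user (higher = better equipment) and fake data
quality = np.array([0.2, 0.5, 0.9, 0.7])
data = 38.0 + np.random.randn(4, 10) * 0.01

with pm.Model() as weighted_model:
    mu = pm.Normal('mu', 0., 10.)
    a = pm.HalfNormal('a', 1.)
    b = pm.HalfNormal('b', 1.)
    # Noise shrinks as quality grows, so better-equipped users get more weight
    sigma_user = pm.Deterministic('sigma_user', a * pm.math.exp(-b * quality))
    obs = pm.Normal('obs', mu, sigma_user[:, None], observed=data)
    trace = pm.sample(1000)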


#11

In practice I have to model the prior in the right way. I want to include what you suggested, for example the spatial information and the weighted average! How can I insert them into my model? This is the crucial point, thanks for the help :slight_smile:


#12

Glad to help, but it might be easier if you have some code and data set up already. The recommended workflow is to generate some simulated data and gradually build a more complex model.


#13

OK, I'm trying to model our intuitions. I set up a hierarchical model in which each user sends me a measurement whose mean comes from a general distribution:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

# Initialize random number generator
np.random.seed(123)

# True parameter values
true_mu = 38.095
true_sigma = 0.005

# Size of dataset
size = 100

# Simulate some data, we suppose each user makes a measurement
y = true_mu + np.random.randn(size)*true_sigma

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    sigma = pm.HalfNormal('sigma', 1.)

    # Each user has a mean and a standard deviation, which I call theta and rho
    theta = pm.Normal('theta', mu, sigma, shape=size)
    rho = pm.HalfNormal('rho', 1., shape=size)

    like = pm.Normal('like', theta, rho, observed=y)

    trace = pm.sample(500)
    print(trace['mu'].mean(axis=0), trace['sigma'].mean(axis=0))

    pm.traceplot(trace)
    plt.show()

Unfortunately, the execution time is very long and I cannot get the desired results (the posterior parameters are wrong). What am I doing wrong? How can I model your suggestions?


#14

Your model is unidentifiable like this, as it has a different generative process than your data. Since there is no user information in the data, the latent mean and the latent sigma are being explained away by theta and rho (they are overparameterized, actually).


#15

If I remove theta and rho the program works perfectly, but then how can I take into account the spatial information or the weighted average that you suggested? Here is the current code:

# True parameter values
true_mu = 38.095
true_sigma = 0.005

# Size of dataset
size = 100

# Simulate some data, we suppose each user makes a measurement
y = true_mu + np.random.randn(size)*true_sigma

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    sigma = pm.HalfNormal('sigma', 1.)

    like = pm.Normal('like', mu, sigma, observed=y)

    trace = pm.sample(500)
    print(trace['mu'].mean(axis=0), trace['sigma'].mean(axis=0))

    pm.traceplot(trace)
    plt.show()

#16

It only makes sense when you have multiple measurements for each user; then it follows a formulation similar to a mixed-effects model.


#17

I made progress with the code. Now I suppose that each user sends me multiple measurements, and then I estimate the latent mean and a standard deviation for each user:

# True parameter values
true_mu = 38.095
true_sigma = 0.005

# Size of dataset
size = 10

# Simulate some data, each user makes 10 measures
y1 = true_mu + np.random.randn(size)* true_sigma
y2 = true_mu + np.random.randn(size)*(true_sigma-0.001)
y3 = true_mu + np.random.randn(size)*(true_sigma-0.002) # Best values
y4 = true_mu + np.random.randn(size)*(true_sigma+0.005) # Worst values
x = np.array([y1, y2, y3, y4])

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    rho = pm.HalfNormal('rho', .1)

    sigma = pm.HalfNormal('sigma', rho, shape=(4, 1))

    like = pm.Normal('like', mu, sigma, observed=x)

    trace = pm.sample(500)
    print('mu: ',trace['mu'].mean(axis=0), 'rho: ', trace['rho'].mean(axis=0))
    print(trace['sigma'].mean(axis=0))

    pm.traceplot(trace)
    plt.show()

Lastly, I would like to give more weight to users who send me similar measurements. Is there a simple way to express this?


#18

@junpenglao I'm almost finished with my little project. In the latest version I'm trying to use the previous posterior as the new prior inside a loop; in particular, I want to update the prior for the standard deviation. This is the code:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
from pymc3.distributions import Interpolated
from scipy import stats

def from_posterior(param, samples):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)

    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return Interpolated(param, x, y)

# Initialize random number generator
np.random.seed(123)

# Target values
t = [38.095, 13.346, 30.106, 13.353]

# Users' precision in term of sigma
s = [0.001, 0.005, 0.003, 0.012]

# Number of measures for each target
n_measures = 10

# Simulate some data
measures = []
for i in range(0,4):
    y1 = t[i] + np.random.randn(n_measures) * s[0]
    y2 = t[i] + np.random.randn(n_measures) * s[1]
    y3 = t[i] + np.random.randn(n_measures) * s[2] 
    y4 = t[i] + np.random.randn(n_measures) * s[3]
    x = np.array([y1, y2, y3, y4])
    measures.append(x) 

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    rho = pm.HalfNormal('rho', 10.)

    sigma1 = pm.HalfNormal('sigma1', rho)
    sigma2 = pm.HalfNormal('sigma2', rho)
    sigma3 = pm.HalfNormal('sigma3', rho)
    sigma4 = pm.HalfNormal('sigma4', rho)

    like1 = pm.Normal('like1', mu, sigma1, observed=measures[0][0])
    like2 = pm.Normal('like2', mu, sigma2, observed=measures[0][1])
    like3 = pm.Normal('like3', mu, sigma3, observed=measures[0][2])
    like4 = pm.Normal('like4', mu, sigma4, observed=measures[0][3])

    trace = pm.sample(1000)

for j in range(1,4):

    with pm.Model() as model:
        # Latent mean and sigma, I want to estimate these parameters
        mu = pm.Normal('mu', 0., 1.)
        rho = pm.HalfNormal('rho')

        # Priors are posteriors from previous iteration
        sigma1 = from_posterior('sigma1', trace['sigma1'])
        sigma2 = from_posterior('sigma2', trace['sigma2'])
        sigma3 = from_posterior('sigma3', trace['sigma3'])
        sigma4 = from_posterior('sigma4', trace['sigma4'])
    
        like1 = pm.Normal('like1', mu, sigma1, observed=measures[j][0])
        like2 = pm.Normal('like2', mu, sigma2, observed=measures[j][1])
        like3 = pm.Normal('like3', mu, sigma3, observed=measures[j][2])
        like4 = pm.Normal('like4', mu, sigma4, observed=measures[j][3])
        
        trace = pm.sample(1000)

Unfortunately the code inside the loop doesn’t work. The error is:

ValueError: Bad initial energy: inf. The model might be misspecified.

I tried with:

for RV in model.basic_RVs:
    print(RV.name, RV.logp(model.test_point))

But I have not found the variable that gives -inf. Do you have any idea? Thank you.
I hope I don't seem too insistent; I'm trying to learn as much as I can.


#19

I see what you are doing now - to me it makes sense.

As for the Bad initial energy problem, I see two potential problems:

  1. It is likely related to the jitter initialization, which sometimes makes some values go outside of the support. Try trace = pm.sample(1000, init='adapt_diag').
  2. In your data simulation process, the measurement error (sigma) is too small and it is causing numerical errors. I tried scaling s up 10 times and it seems to work fine: s = [0.01, 0.05, 0.03, 0.12]. Both changes are shown together in the snippet below.
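
Putting the two suggestions together, only these lines of the script above need to change (everything else stays the same):

# 1. Larger simulated measurement noise to avoid numerical issues
s = [0.01, 0.05, 0.03, 0.12]

# 2. Skip the jittered initialization in both sampling calls
trace = pm.sample(1000, init='adapt_diag')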

#20

Great! Now it works perfectly. Thank you so much, you have been very helpful. :slight_smile: