Data aggregation in a crowdsensing application


#1

Hi everyone, I'm new to the pymc3 world, but in my opinion this tool is fantastic! I'm working on a simple crowdsensing application where a group of users sends me estimates of a target (continuous measurements), for example a GPS location. In my model I want to find the mean value of these measurements and a reliability value for each user. How can I set up my model? I thought of modelling the uncertainty of the measurements with a normal distribution, but how can I model each user's reliability? Do you have any suggestions?


#2

I would first try a hierarchical normal model with the target location modelled as a latent mean \mu, and each observation (from one user) modelled as y_{\text{user}} \sim \text{Normal}(\mu, \sigma). If you take multiple measurements from each user you can model an individual \sigma for each user (e.g., \sigma_{\text{user}} \sim \text{HalfNormal}(\cdot)); otherwise either a fixed value or a scalar parameter for \sigma will do.
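A minimal PyMC3 sketch of that setup, assuming several measurements per user (the shapes, the simulated data, and names such as sigma_user are placeholders for illustration):

import numpy as np
import pymc3 as pm

# Hypothetical data: n_users users each send n_measurements measurements of the same target
n_users, n_measurements = 5, 20
data = 38.0 + np.random.randn(n_users, n_measurements) * 0.01

with pm.Model() as sketch:
    # Latent target location
    mu = pm.Normal('mu', 0., 100.)
    # One noise parameter per user (its size acts as an inverse reliability)
    sigma_user = pm.HalfNormal('sigma_user', 1., shape=(n_users, 1))
    # Row i of `data` holds user i's measurements
    obs = pm.Normal('obs', mu, sigma_user, observed=data)
    trace = pm.sample(1000)

With one row per user, the (n_users, 1) shape of sigma_user broadcasts against that user's measurements.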


#3

Thank you for the quick response. I started writing the code, but I have another doubt: how can I aggregate the means of the users' measurements into one general mean? In general, how can I aggregate different normal distributions?


#4

If you do

import pymc3 as pm

with pm.Model() as m:
    overall_mean = pm.Normal('mu', 0., 100.)
    sigma = 10.  # <- replace 10. with a distribution or other prior information
    # `data` is an array holding all the users' measurements
    obs = pm.Normal('observation', overall_mean, sigma, observed=data)

The information from your observed data will automatically be aggregated toward overall_mean by Bayes' theorem (or, as some people call it, backward inference).


#5

I get it! We are almost there. The last question is: how can I extract a reliability value for each user from the overall sigma?


#6

It depends. Either you have some information about how precise each measurement is (from each device's properties), or you collect multiple data points from each user and estimate it from the data.
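For the first case, a minimal sketch could look like this (known_sd and the data are made-up placeholders; each user's noise is fixed to their device's stated precision):

import numpy as np
import pymc3 as pm

# Hypothetical known per-device standard deviations (e.g., from spec sheets) and fake data
known_sd = np.array([0.001, 0.005, 0.003, 0.012])
data = 38.0 + np.random.randn(4, 10) * known_sd[:, None]

with pm.Model() as known_precision_model:
    mu = pm.Normal('mu', 0., 100.)
    # Each user's measurement noise is fixed to the known precision of their device
    obs = pm.Normal('obs', mu, known_sd[:, None], observed=data)
    trace = pm.sample(1000)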


#7

Is it impossible to estimate reliability from the measurements that I use in the model?


#8

In general you can interpret \sigma as the reliability. For example, if you take multiple measurements in a short period for each user (i.e., multiple data points per user), then you can interpret the individual \sigma above as the reliability of each user.
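Continuing the sketch from post #2 above (sigma_user and trace come from that sketch; the inverse-noise score is just one ad hoc choice):

# Posterior mean of each user's noise level; a smaller sigma means a more reliable user
sigma_hat = trace['sigma_user'].mean(axis=0).ravel()

# One possible reliability score: the inverse of the estimated noise
reliability = 1.0 / sigma_hat
print(reliability)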


#9

I understand, but one doubt remains: why not calculate the mean and variance directly from the measurements? Why is inference better? Excuse me if my questions seem silly, but I'm new to the probabilistic programming world :slight_smile:


#10

In the Gaussian case with a uniform prior the result would be identical. The strength of the Bayesian approach comes from the prior and from the exchange of information between the different components of the model. Once you have your model set up you can do all kinds of crazy things, like taking into account temporal information (e.g., trust measurements that are closer in time more), spatial information (e.g., trust users that report closer measurements more), or a weighted average (e.g., trust users with newer/better equipment more). All of these can be done by introducing additional structure and correlation into your model.
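As one example of what such additional structure could look like, the per-user noise could depend on a hypothetical equipment-quality score. This is only a sketch: the quality values, the data, and the exponential link are all made up for illustration.

import numpy as np
import pymc3 as pm

# Hypothetical quality score per user (higher = better equipment) and fake data
quality = np.array([0.2, 0.5, 0.9, 0.7])
data = 38.0 + np.random.randn(4, 10) * 0.01

with pm.Model() as weighted_model:
    mu = pm.Normal('mu', 0., 10.)
    a = pm.HalfNormal('a', 1.)
    b = pm.HalfNormal('b', 1.)
    # Noise shrinks as quality grows, so better-equipped users get more weight
    sigma_user = pm.Deterministic('sigma_user', a * pm.math.exp(-b * quality))
    obs = pm.Normal('obs', mu, sigma_user[:, None], observed=data)
    trace = pm.sample(1000)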


#11

In practice I have to model the prior in the right way. I want to include what you suggested, for example the spatial information and the weighted average! How can I insert them into my model? This is the crucial point, thanks for the help :slight_smile:


#12

Glad to help, but it might be easier if you have some code and data set up already. The recommended workflow is to generate some simulated data and gradually build a more complex model.


#13

OK, I'm trying to model our intuitions. I set up a hierarchical model in which each user sends me a measurement whose mean comes from a general distribution:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

# Initialize random number generator
np.random.seed(123)

# True parameter values
true_mu = 38.095
true_sigma = 0.005

# Size of dataset
size = 100

# Simulate some data, we suppose each user makes a measurement
y = true_mu + np.random.randn(size)*true_sigma

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    sigma = pm.HalfNormal('sigma', 1.)

    # Each user has a mean and a standard deviation, which I call theta and rho
    theta = pm.Normal('theta', mu, sigma, shape=size)
    rho = pm.HalfNormal('rho', 1., shape=size)

    like = pm.Normal('like', theta, rho, observed=y)

    trace = pm.sample(500)
    print(trace['mu'].mean(axis=0), trace['sigma'].mean(axis=0))

    pm.traceplot(trace)
    plt.show()

Unfortunately, the execution time is very long and I cannot get the desired results (the posterior parameters are wrong). What am I doing wrong? How can I model your suggestions?


#14

Your model is unidentifiable like this, as it has a different generative process than your data. Since there is no user information in the data, the latent mean and the latent sigma are being explained away by theta and rho (they are overparameterized, actually).


#15

If I remove theta and rho the program works perfectly, but then how can I take into account the spatial information or the weighted average that you suggested? Here is the current code:

# True parameter values
true_mu = 38.095
true_sigma = 0.005

# Size of dataset
size = 100

# Simulate some data, we suppose each user makes a measurement
y = true_mu + np.random.randn(size)*true_sigma

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    sigma = pm.HalfNormal('sigma', 1.)

    like = pm.Normal('like', mu, sigma, observed=y)

    trace = pm.sample(500)
    print(trace['mu'].mean(axis=0), trace['sigma'].mean(axis=0))

    pm.traceplot(trace)
    plt.show()

#16

It only makes sense when you have multiple measurements for each user; then it follows a formulation similar to a mixed-effects model.


#17

I made progress with the code. Now I suppose that each user sends me multiple measurements, and then I estimate the latent mean and a standard deviation for each user:

# True parameter values
true_mu = 38.095
true_sigma = 0.005

# Size of dataset
size = 10

# Simulate some data, each user makes 10 measures
y1 = true_mu + np.random.randn(size)* true_sigma
y2 = true_mu + np.random.randn(size)*(true_sigma-0.001)
y3 = true_mu + np.random.randn(size)*(true_sigma-0.002) # Best values
y4 = true_mu + np.random.randn(size)*(true_sigma+0.005) # Worst values
x = np.array([y1, y2, y3, y4])

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    rho = pm.HalfNormal('rho', .1)

    sigma = pm.HalfNormal('sigma', rho, shape=(4, 1))

    like = pm.Normal('like', mu, sigma, observed=x)

    trace = pm.sample(500)
    print('mu: ',trace['mu'].mean(axis=0), 'rho: ', trace['rho'].mean(axis=0))
    print(trace['sigma'].mean(axis=0))

    pm.traceplot(trace)
    plt.show()

Lastly, I would like to give more weight to users who send me similar measurements. Is there a simple way to express this?


#18

@junpenglao I'm almost finished with my little project. In the latest version I'm trying to use the previous posterior as the new prior inside a loop; in particular, I want to update the prior for the standard deviation. This is the code:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
from pymc3.distributions import Interpolated
from scipy import stats

def from_posterior(param, samples):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)

    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return Interpolated(param, x, y)

# Initialize random number generator
np.random.seed(123)

# Target values
t = [38.095, 13.346, 30.106, 13.353]

# Users' precision in term of sigma
s = [0.001, 0.005, 0.003, 0.012]

# Number of measures for each target
n_measures = 10

# Simulate some data
measures = []
for i in range(0,4):
    y1 = t[i] + np.random.randn(n_measures) * s[0]
    y2 = t[i] + np.random.randn(n_measures) * s[1]
    y3 = t[i] + np.random.randn(n_measures) * s[2] 
    y4 = t[i] + np.random.randn(n_measures) * s[3]
    x = np.array([y1, y2, y3, y4])
    measures.append(x) 

with pm.Model() as hierachical_model:
    # Latent mean and sigma, I want to estimate these parameters
    mu = pm.Normal('mu', 0., 10.)
    rho = pm.HalfNormal('rho', 10.)

    sigma1 = pm.HalfNormal('sigma1', rho)
    sigma2 = pm.HalfNormal('sigma2', rho)
    sigma3 = pm.HalfNormal('sigma3', rho)
    sigma4 = pm.HalfNormal('sigma4', rho)

    like1 = pm.Normal('like1', mu, sigma1, observed=measures[0][0])
    like2 = pm.Normal('like2', mu, sigma2, observed=measures[0][1])
    like3 = pm.Normal('like3', mu, sigma3, observed=measures[0][2])
    like4 = pm.Normal('like4', mu, sigma4, observed=measures[0][3])

    trace = pm.sample(1000)

for j in range(1,4):

    with pm.Model() as model:
        # Latent mean and sigma, I want to estimate these parameters
        mu = pm.Normal('mu', 0., 1.)
        rho = pm.HalfNormal('rho')

        # Priors are posteriors from previous iteration
        sigma1 = from_posterior('sigma1', trace['sigma1'])
        sigma2 = from_posterior('sigma2', trace['sigma2'])
        sigma3 = from_posterior('sigma3', trace['sigma3'])
        sigma4 = from_posterior('sigma4', trace['sigma4'])
    
        like1 = pm.Normal('like1', mu, sigma1, observed=measures[j][0])
        like2 = pm.Normal('like2', mu, sigma2, observed=measures[j][1])
        like3 = pm.Normal('like3', mu, sigma3, observed=measures[j][2])
        like4 = pm.Normal('like4', mu, sigma4, observed=measures[j][3])
        
        trace = pm.sample(1000)

Unfortunately the code inside the loop doesn’t work. The error is:

ValueError: Bad initial energy: inf. The model might be misspecified.

I tried with:

for RV in model.basic_RVs:
    print(RV.name, RV.logp(model.test_point))

But I have not found the variable that gives -inf. Do you have any idea? Thank you.
I hope I don't seem too insistent; I'm trying to learn as much as I can.


#19

I see what you are doing now - to me it makes sense.

As for the Bad initial energy problem, I see two potential problems:

  1. It is likely related to the jitter initialization, which sometimes makes some values go outside of the support. Try trace = pm.sample(1000, init='adapt_diag').
  2. In your data simulation process, the measurement error (sigma) is too small and it is causing numerical errors. I tried scaling s up 10 times and it seems to work fine: s = [0.01, 0.05, 0.03, 0.12]. Both changes are shown together in the snippet below.
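
Putting the two suggestions together, only these lines of the script above need to change (everything else stays the same):

# 1. Larger simulated measurement noise to avoid numerical issues
s = [0.01, 0.05, 0.03, 0.12]

# 2. Skip the jittered initialization in both sampling calls
trace = pm.sample(1000, init='adapt_diag')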

#20

Great! Now it works perfectly. Thank you so much, you have been very helpful. :slight_smile: