Expanding the common globe tossing example to multiple globes

Hi PyMC discourse,

This is a common example; I found it in McElreath's Statistical Rethinking: sampling points on a globe to estimate the proportion of the surface that is water (see code). I'd like to know how to expand this into a situation with multiple globes, which may have slightly different proportions of water. I want to share information between the globes, though, so I can estimate the expected overall proportion of water on any globe.

Single globe example:

import pymc3 as pm

# A value of 0 signifies a land observation, a value of 1 signifies a water observation
observations = [0, 0, 1, 0, 1]
water_observations = sum(observations)
total_observations = len(observations)

with pm.Model() as planet_model:
    # Prior
    p_water = pm.Uniform("p_water", 0, 1)

    # Likelihood
    w = pm.Binomial("w", p=p_water, n=total_observations, observed=water_observations)

    # Inference run / Markov chain Monte Carlo
    trace_5_obs = pm.sample(5000, chains=2)
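As a sanity check on the sampler: this single-globe model is conjugate, since a Uniform(0, 1) prior is Beta(1, 1), so the exact posterior is Beta(1 + w, 1 + n - w). A quick check independent of PyMC:

```python
# Exact conjugate posterior for the single-globe model:
# Uniform(0, 1) prior == Beta(1, 1), Binomial likelihood,
# so the posterior is Beta(1 + w, 1 + n - w).
observations = [0, 0, 1, 0, 1]
w = sum(observations)             # 2 water observations
n = len(observations)             # 5 total tosses
alpha_post = 1 + w                # 3
beta_post = 1 + n - w             # 4
posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)             # 3/7 ≈ 0.4286
```

The MCMC estimate `trace_5_obs["p_water"].mean()` should land close to this value.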

My first thought is to just change the shape parameter, but there is a large number of individual measurements. Is this the best approach? As an example, in the multiple-globes case one might have:
observation_globe1 = [0, 0, 1, 0, 1]
observation_globe2 = [0, 0, 1, 0, 1, 0, 0, 1]
observation_globe3 = [0, 1]
# ...
observation_globe650000 = [1, 0, 0, 0, 1]
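One way to make the shape approach workable is to reduce each globe's list to its sufficient statistics first, since a Binomial likelihood only needs the number of tosses and the number of waters per globe. A sketch using the three small globes above (`globes`, `n_per_globe`, and `w_per_globe` are names of my own invention):

```python
observation_globe1 = [0, 0, 1, 0, 1]
observation_globe2 = [0, 0, 1, 0, 1, 0, 0, 1]
observation_globe3 = [0, 1]

globes = [observation_globe1, observation_globe2, observation_globe3]

# Per-globe sufficient statistics for a vectorized Binomial likelihood:
n_per_globe = [len(g) for g in globes]   # tosses per globe
w_per_globe = [sum(g) for g in globes]   # water observations per globe
print(n_per_globe, w_per_globe)          # [5, 8, 2] [2, 3, 1]
```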

After this I may stratify into groups of globes and do a hierarchical run, but this is just the first step.
Thank you!

Hi Lewis,
Yes, I think you should go down the hierarchical road if you want to share information between globes.
I'd start with something like:

with pm.Model() as planet_model:
    a = pm.Exponential("a", 1)
    b = pm.Exponential("b", 1)

    p_globes = pm.Beta("p_globes", a, b, shape=n_globes)

    # here n and observed should be per-globe arrays (tosses and waters per globe)
    w = pm.Binomial("w", p=p_globes, n=total_observations, observed=water_observations)

This is just a raw idea of course. There is a good notebook about hierarchical models on the PyMC website (note that the tutorial has been updated but the new version is not yet deployed).
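To get a feel for what those Exponential(1) hyperpriors imply about `p_globes` before seeing any data, you can simulate the hierarchy with plain numpy (a rough prior-predictive sketch; `n_globes` and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_globes = 5

# Draw hyperparameters, then per-globe water proportions from Beta(a, b):
a = rng.exponential(1.0)
b = rng.exponential(1.0)
p_globes = rng.beta(a, b, size=n_globes)
print(p_globes)  # five proportions in (0, 1), all drawn from one shared Beta
```

Repeating this many times shows the full spread of globe proportions the hyperpriors allow.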
Hope this helps :vulcan_salute:


Hi, thanks @AlexAndorra

I’ve had a crack at this:

import numpy as np
import pymc3 as pm

# Create a synthetic dataset: 100 globes, each 10% water
data = list()
probs = [0.9, 0.1]  # P(land), P(water)
for j in range(100):
    observations = list()
    # every globe gets at least one measurement:
    observations.append(np.random.choice([0, 1], p=probs))
    # then keep taking measurements with probability 0.7:
    while np.random.uniform(0, 1) > 0.3:
        observations.append(np.random.choice([0, 1], p=probs))
    data.append(observations)

# put into long format: one row per measurement
ids = list()
observations = list()

for count, obs in enumerate(data):
    for d in obs:
        ids.append(count)
        observations.append(d)

total_observations = len(observations)
water_observations = observations  # each entry is a single 0/1 outcome

with pm.Model() as planet_model:
    a = pm.Exponential("a", 1)
    b = pm.Exponential("b", 1)

    p_globes = pm.Beta("p_globes", a, b, shape=100)

    # each long-format row is a single toss, so n=1 (i.e. Bernoulli):
    w = pm.Binomial("w", p=p_globes[ids], n=1, observed=water_observations)

with planet_model:
    trace = pm.sample(target_accept=0.99)

This seems to work for a first pass. One gotcha I hit: each long-format row is a single toss, so the Binomial needs n=1; my first attempt used n=total_observations and didn't capture the 10% proportion I set up in the synthetic data. I gather it will also get quite slow with a large number of measurements (650,000). Is there a smarter way to include this much data or should I just wait it out?
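One option I'm considering for the scaling problem: since the likelihood only needs per-globe counts, the long format can be collapsed to sufficient statistics before sampling, e.g. with `np.bincount`, so the model sees one row per globe instead of one per measurement (a sketch with toy arrays):

```python
import numpy as np

# Toy long-format data: globe id per measurement, 0/1 outcome per measurement
ids = np.array([0, 0, 0, 1, 1, 2])
observations = np.array([0, 1, 0, 1, 1, 0])

# Collapse to one (n, w) pair per globe -- the Binomial only needs these:
n_per_globe = np.bincount(ids)                        # measurements per globe
w_per_globe = np.bincount(ids, weights=observations)  # waters per globe
print(n_per_globe, w_per_globe)  # [3 2 1] [1. 2. 0.]
```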
cheers


Turns out there’s a nice example at https://docs.pymc.io/notebooks/hierarchical_partial_pooling.html, which works well on my synthetic data!

For anyone else trying this, here’s a self-contained example:

%matplotlib inline
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import theano.tensor as tt

# Create a synthetic dataset: 100 globes, each 10% water
data = list()
probs = [0.9, 0.1]  # P(land), P(water)
for j in range(100):
    observations = list()
    # every globe gets at least one measurement:
    observations.append(np.random.choice([0, 1], p=probs))
    # then keep taking measurements with probability 0.9:
    while np.random.uniform(0, 1) > 0.1:
        observations.append(np.random.choice([0, 1], p=probs))
    data.append(observations)

# aggregate to per-globe counts for pymc3:
num_measurements = [len(i) for i in data]
num_waters = [sum(i) for i in data]
N = len(data)

# pymc3 model (adapted from the baseball partial-pooling notebook):
with pm.Model() as planet_model:
    # group-level parameters: population mean and concentration
    phi = pm.Uniform('phi', lower=0.0, upper=1.0)

    kappa_log = pm.Exponential('kappa_log', lam=1.5)
    kappa = pm.Deterministic('kappa', tt.exp(kappa_log))

    # per-globe water proportions:
    thetas = pm.Beta('thetas', alpha=phi*kappa, beta=(1.0-phi)*kappa, shape=N)

    # likelihood:
    y = pm.Binomial('y', n=num_measurements, p=thetas, observed=num_waters)

with planet_model:
    trace = pm.sample(draws=500, tune=500, chains=2,
                      target_accept=0.9)

pm.traceplot(trace, var_names=['phi', 'kappa']);
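The `alpha=phi*kappa, beta=(1-phi)*kappa` parameterization is convenient because `phi` is directly the population mean of the `thetas`: a Beta(phi·kappa, (1−phi)·kappa) distribution has mean phi regardless of kappa, which only controls the spread. A quick arithmetic check (values arbitrary):

```python
phi, kappa = 0.1, 10.0
alpha = phi * kappa           # 1.0
beta = (1.0 - phi) * kappa    # 9.0
mean = alpha / (alpha + beta) # Beta mean formula
print(mean)                   # 0.1 == phi
```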

Now that I’ve got the hang of it, I think this variant, parameterized through probability and odds, might be more intuitive:

# pymc3 model, reparameterized via probability and odds:
with pm.Model() as planet_model:
    # group-level parameter:
    phi = pm.Normal('phi', mu=0, sigma=1.0)
    # probability implied by phi (use tt.exp, not np.exp, on tensors):
    p = pm.Deterministic('p', 1 / (1 + tt.exp(phi)))
    odds = pm.Deterministic('odds', (1 - p) / p)

    # per-globe water proportions; Beta(1, odds) has mean 1/(1 + odds) = p:
    thetas = pm.Beta('thetas', alpha=1, beta=odds, shape=N)

    # likelihood:
    y = pm.Binomial('y', n=num_measurements, p=thetas, observed=num_waters)

with planet_model:
    trace = pm.sample(draws=2000, tune=2000, chains=2,
                      target_accept=0.99)

pm.traceplot(trace, var_names=['phi', 'p', 'odds']);
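One way to see this model's wiring: with probability p = 1/(1 + exp(phi)), the derived odds (1 − p)/p collapse algebraically to exp(phi), and Beta(1, odds) then has mean 1/(1 + odds) = p. A numeric check of that chain (the phi value is arbitrary):

```python
import math

phi = 0.5
p = 1.0 / (1.0 + math.exp(phi))   # probability derived from phi
odds = (1.0 - p) / p              # algebraically equal to exp(phi)
beta_mean = 1.0 / (1.0 + odds)    # mean of Beta(alpha=1, beta=odds)

print(abs(odds - math.exp(phi)) < 1e-12, abs(beta_mean - p) < 1e-12)  # True True
```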