Expanding the common globe tossing example to multiple globes

Hi PyMC discourse,

This is a common example; I found it in McElreath's Statistical Rethinking: sampling points on a globe to estimate the proportion of the surface that is water (see code). I'd like to know how to expand this into a situation with multiple globes, which may have slightly different proportions of water. I want to share information between the globes, though, so I can estimate the expected overall proportion of water on any globe.

Single globe example:

import pymc3 as pm

# A value of 0 signifies a land observation, a value of 1 signifies a water observation
observations = [0, 0, 1, 0, 1]
water_observations = sum(observations)
total_observations = len(observations)

with pm.Model() as planet_model:
    # Prior
    p_water = pm.Uniform("p_water", 0, 1)

    # Likelihood
    w = pm.Binomial("w", p=p_water, n=total_observations, observed=water_observations)

    # Inference run / Markov chain Monte Carlo
    trace_5_obs = pm.sample(5000, chains=2)
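As a sanity check on the sampler: this single-globe model is conjugate, since a Uniform(0, 1) prior is Beta(1, 1), so the exact posterior is Beta(1 + w, 1 + n - w). A quick check independent of PyMC:

```python
# Exact conjugate posterior for the single-globe model:
# Uniform(0, 1) prior == Beta(1, 1), Binomial likelihood,
# so the posterior is Beta(1 + w, 1 + n - w).
observations = [0, 0, 1, 0, 1]
w = sum(observations)             # 2 water observations
n = len(observations)             # 5 total tosses
alpha_post = 1 + w                # 3
beta_post = 1 + n - w             # 4
posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)             # 3/7 ≈ 0.4286
```

The MCMC estimate `trace_5_obs["p_water"].mean()` should land close to this value.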

My first thought is to just change the shape parameter, but there is a large number of individual measurements. Is this the best approach? As an example, in the multiple-globes case one might have:
observation_globe1 = [0, 0, 1, 0, 1]
observation_globe2 = [0, 0, 1, 0, 1, 0, 0, 1]
observation_globe3 = [0, 1]
# ...
observation_globe650000 = [1, 0, 0, 0, 1]
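One way to make the shape approach workable is to reduce each globe's list to its sufficient statistics first, since a Binomial likelihood only needs the number of tosses and the number of waters per globe. A sketch using the three small globes above (`globes`, `n_per_globe`, and `w_per_globe` are names of my own invention):

```python
observation_globe1 = [0, 0, 1, 0, 1]
observation_globe2 = [0, 0, 1, 0, 1, 0, 0, 1]
observation_globe3 = [0, 1]

globes = [observation_globe1, observation_globe2, observation_globe3]

# Per-globe sufficient statistics for a vectorized Binomial likelihood:
n_per_globe = [len(g) for g in globes]   # tosses per globe
w_per_globe = [sum(g) for g in globes]   # water observations per globe
print(n_per_globe, w_per_globe)          # [5, 8, 2] [2, 3, 1]
```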

After this I may stratify into groups of globes and do a hierarchical run, but this is just the first step.
Thank you!

Hi Lewis,
Yes, I think you should go down the hierarchical road if you want to share information between globes.
I'd start with something like:

with pm.Model() as planet_model:
    a = pm.Exponential("a", 1)
    b = pm.Exponential("b", 1)

    p_globes = pm.Beta("p_globes", a, b, shape=n_globes)

    # here n and observed should be per-globe arrays (tosses and waters per globe)
    w = pm.Binomial("w", p=p_globes, n=total_observations, observed=water_observations)

This is just a raw idea of course. There is a good notebook about hierarchical models on the PyMC website (note that the tutorial has been updated but the new version is not yet deployed).
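To get a feel for what those Exponential(1) hyperpriors imply about `p_globes` before seeing any data, you can simulate the hierarchy with plain numpy (a rough prior-predictive sketch; `n_globes` and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_globes = 5

# Draw hyperparameters, then per-globe water proportions from Beta(a, b):
a = rng.exponential(1.0)
b = rng.exponential(1.0)
p_globes = rng.beta(a, b, size=n_globes)
print(p_globes)  # five proportions in (0, 1), all drawn from one shared Beta
```

Repeating this many times shows the full spread of globe proportions the hyperpriors allow.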
Hope this helps :vulcan_salute:


Hi, thanks @AlexAndorra

I’ve had a crack at this:

import numpy as np
import pymc3 as pm

# Create a synthetic dataset: 100 globes, each 10% water
data = list()
probs = [0.9, 0.1]  # P(land), P(water)
for j in range(100):
    observations = list()
    # every globe gets at least one measurement:
    observations.append(np.random.choice([0, 1], p=probs))
    # then keep taking measurements with probability 0.7:
    while np.random.uniform(0, 1) > 0.3:
        observations.append(np.random.choice([0, 1], p=probs))
    data.append(observations)

# put into long format: one row per measurement
ids = list()
observations = list()

for count, obs in enumerate(data):
    for d in obs:
        ids.append(count)
        observations.append(d)

total_observations = len(observations)
water_observations = observations  # each entry is a single 0/1 outcome

with pm.Model() as planet_model:
    a = pm.Exponential("a", 1)
    b = pm.Exponential("b", 1)

    p_globes = pm.Beta("p_globes", a, b, shape=100)

    # each long-format row is a single toss, so n=1 (i.e. Bernoulli):
    w = pm.Binomial("w", p=p_globes[ids], n=1, observed=water_observations)

with planet_model:
    trace = pm.sample(target_accept=0.99)

This seems to work for a first pass. One gotcha I hit: each long-format row is a single toss, so the Binomial needs n=1; my first attempt used n=total_observations and didn't capture the 10% proportion I set up in the synthetic data. I gather it will also get quite slow with a large number of measurements (650,000). Is there a smarter way to include this much data or should I just wait it out?
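One option I'm considering for the scaling problem: since the likelihood only needs per-globe counts, the long format can be collapsed to sufficient statistics before sampling, e.g. with `np.bincount`, so the model sees one row per globe instead of one per measurement (a sketch with toy arrays):

```python
import numpy as np

# Toy long-format data: globe id per measurement, 0/1 outcome per measurement
ids = np.array([0, 0, 0, 1, 1, 2])
observations = np.array([0, 1, 0, 1, 1, 0])

# Collapse to one (n, w) pair per globe -- the Binomial only needs these:
n_per_globe = np.bincount(ids)                        # measurements per globe
w_per_globe = np.bincount(ids, weights=observations)  # waters per globe
print(n_per_globe, w_per_globe)  # [3 2 1] [1. 2. 0.]
```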
cheers


Turns out there’s a nice example at https://docs.pymc.io/notebooks/hierarchical_partial_pooling.html, which works well on my synthetic data!

For anyone else trying this, here’s a self-contained example:

%matplotlib inline
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import theano.tensor as tt

# Create a synthetic dataset: 100 globes, each 10% water
data = list()
probs = [0.9, 0.1]  # P(land), P(water)
for j in range(100):
    observations = list()
    # every globe gets at least one measurement:
    observations.append(np.random.choice([0, 1], p=probs))
    # then keep taking measurements with probability 0.9:
    while np.random.uniform(0, 1) > 0.1:
        observations.append(np.random.choice([0, 1], p=probs))
    data.append(observations)

# aggregate to per-globe counts for pymc3:
num_measurements = [len(i) for i in data]
num_waters = [sum(i) for i in data]
N = len(data)

# pymc3 model (adapted from the baseball partial-pooling notebook):
with pm.Model() as planet_model:
    # group-level parameters: population mean and concentration
    phi = pm.Uniform('phi', lower=0.0, upper=1.0)

    kappa_log = pm.Exponential('kappa_log', lam=1.5)
    kappa = pm.Deterministic('kappa', tt.exp(kappa_log))

    # per-globe water proportions:
    thetas = pm.Beta('thetas', alpha=phi*kappa, beta=(1.0-phi)*kappa, shape=N)

    # likelihood:
    y = pm.Binomial('y', n=num_measurements, p=thetas, observed=num_waters)

with planet_model:
    trace = pm.sample(draws=500, tune=500, chains=2,
                      target_accept=0.9)

pm.traceplot(trace, var_names=['phi', 'kappa']);
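The `alpha=phi*kappa, beta=(1-phi)*kappa` parameterization is convenient because `phi` is directly the population mean of the `thetas`: a Beta(phi·kappa, (1−phi)·kappa) distribution has mean phi regardless of kappa, which only controls the spread. A quick arithmetic check (values arbitrary):

```python
phi, kappa = 0.1, 10.0
alpha = phi * kappa           # 1.0
beta = (1.0 - phi) * kappa    # 9.0
mean = alpha / (alpha + beta) # Beta mean formula
print(mean)                   # 0.1 == phi
```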

Now that I’ve got the hang of it, I think this variant, parameterized through probability and odds, might be more intuitive:

# pymc3 model, reparameterized via probability and odds:
with pm.Model() as planet_model:
    # group-level parameter:
    phi = pm.Normal('phi', mu=0, sigma=1.0)
    # probability implied by phi (use tt.exp, not np.exp, on tensors):
    p = pm.Deterministic('p', 1 / (1 + tt.exp(phi)))
    odds = pm.Deterministic('odds', (1 - p) / p)

    # per-globe water proportions; Beta(1, odds) has mean 1/(1 + odds) = p:
    thetas = pm.Beta('thetas', alpha=1, beta=odds, shape=N)

    # likelihood:
    y = pm.Binomial('y', n=num_measurements, p=thetas, observed=num_waters)

with planet_model:
    trace = pm.sample(draws=2000, tune=2000, chains=2,
                      target_accept=0.99)

pm.traceplot(trace, var_names=['phi', 'p', 'odds']);
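One way to see this model's wiring: with probability p = 1/(1 + exp(phi)), the derived odds (1 − p)/p collapse algebraically to exp(phi), and Beta(1, odds) then has mean 1/(1 + odds) = p. A numeric check of that chain (the phi value is arbitrary):

```python
import math

phi = 0.5
p = 1.0 / (1.0 + math.exp(phi))   # probability derived from phi
odds = (1.0 - p) / p              # algebraically equal to exp(phi)
beta_mean = 1.0 / (1.0 + odds)    # mean of Beta(alpha=1, beta=odds)

print(abs(odds - math.exp(phi)) < 1e-12, abs(beta_mean - p) < 1e-12)  # True True
```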