Site occupancy model with site- and visit-level covariates

Hello PyMC community!

I’m trying to build a site-occupancy model, a common class of model in ecology, in PyMC. The goal of this model is to estimate the true occurrence probability of a species (i.e., prevalence), admitting the fact that we don’t always detect the species during surveys. The typical way of representing this model is through the true latent occurrence state z_j for site j=1,2,\dots,J, with z_j \sim \text{Bernoulli}(\psi_j). Then, the detection of the species y_{j,k} during survey k=1,2,\dots,K depends on the latent state z_j, with y_{j,k} \sim \text{Bernoulli}(p_{j,k} z_j), where p_{j,k} is the probability of detecting the species given that it’s present. Formulating the model in this way permits the use covariates that affect the detection and occurrence probabilities.

I would like to write this model in PyMC without the use of the latent occurrence state (see bottom of this post for a working example that uses z_j). My first thought was to leverage ZeroInflatedBinomial, but setting \psi=1 when we’ve previously detected the species at the site, e.g.,

psi_eff = psi[site_idx] ** (1 - detected[site_idx])
obs = pm.ZeroInflatedBinomial('y', p=p, psi=psi_eff, n=1, observed=y_flat)

This, however, did not sample well. And sadly that’s as far as I’ve gotten. Do you, gracious PyMC user, have any ideas? If it helps, here is one link and another link to folks doing this in JAGS and Stan.

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
import pytensor.tensor as pt

# ecologists like to scale covariates this way
def scale(x):
    return (x - x.mean()) / x.std()

def invlogit(x):
    return 1 / (1 + np.exp(-x))

## simulation

SEED = None
rng = np.random.default_rng(seed=SEED)

# sampling characteristics
site_count = 100
survey_count = 5

# index for the site 
site = np.arange(site_count)
site_idx = np.repeat(site, survey_count)

## ecological model

# true parameter values
beta0_true = 0.5
beta1_true = 1.25

# covariates 
x = scale(rng.uniform(size=site_count))

# linear model
mu_true = beta0_true + beta1_true * x
psi_true = invlogit(mu_true)

# simulate occurrence state
z_true = rng.binomial(1, psi_true)

## detection model

# true parameter values
alpha0_true = 1.5

# covariates
pass 

# linear model
nu_true = alpha0_true
p_true = invlogit(nu_true)

# simulate detection
y_flat = rng.binomial(1, p_true, survey_count * site_count) * z_true[site_idx]

# reformat data
y = y_flat.reshape(site_count, survey_count)
y_sum = y.sum(axis=1)
detected = y_sum > 0

with pm.Model() as occupancy:

    # occurrence process 
    # priors 
    beta0 = pm.Normal("beta0", mu=0, sigma=2)
    beta1 = pm.Normal("beta1", mu=0, sigma=2)
    
    # linear model
    mu = beta0 + beta1 * x[site_idx]
    psi = pm.Deterministic("psi", pm.math.invlogit(mu))

    # latent occurrence state
    z = pm.Bernoulli('z', p=psi)

    # detection process
    # priors
    alpha0 = pm.Normal('alpha0', mu=0, sigma=2)

    # linear model
    nu = alpha0
    p = pm.Deterministic('p', pm.math.invlogit(nu))
            
    # # likelihood
    p_eff = z[site_idx] * p
    obs = pm.Bernoulli('y', p=p_eff, observed=y_flat)

pm.model_to_graphviz(occupancy)

This line is problematic on its own. When z is zero, p will also be zero, which is incompatible with any observation that is not zero.

If there are no false-positives, you could override z for the site_idxs where you had an observation, as that must be True!

p_eff = pm.math.switch(pm.math.eq(y_flat, 1), 1, z[site_idx]) * p

Also I assume z shouldn’t be a scalar variable but a variable as large as site?

Well, it’s obviously not problematic. It’s a simple way of codifying our assumption that P(y=1|z=0)=0, that is, that we do not see or hear the species when it is not there. See below from a snippet of Royle and Dorazio, Chapter 3

Thank you for your suggestion. Clearly, it still includes the latent occurrence state z_j, which I was trying to avoid.

For posterity, see below for my solution. It relies on a CustomDist specified with the logp.

def logp(x, p, psi):
    
    bern = (p ** x) * ((1 - p) ** (1 - x))
    bern_prod = pm.math.prod(bern, axis=1)
    
    res = pm.math.switch(
        x.sum(axis=1) > 0,
        bern_prod * psi,
        bern_prod * psi + (1 - psi)
    )
    
    return pm.math.log(res)

I haven’t rigorously tested it but it appears to work okay for this example. See below for the full example.

import numpy as np
import pymc as pm

def scale(x):
    return (x - x.mean()) / x.std()

def invlogit(x):
    return 1 / (1 + np.exp(-x))

## simulation

SEED = 17
rng = np.random.default_rng(seed=SEED)

# sampling characteristics
site_count = 250
survey_count = 20

# index for the site 
site = np.arange(site_count)
site_idx = np.repeat(site, survey_count)
ones = np.ones(site_count).astype(int)

## ecological model

# true parameter values
beta0_true = 0.5
beta1_true = 1.25

# covariates 
x = scale(rng.uniform(size=site_count))

# linear model
mu_true = beta0_true + beta1_true * x
psi_true = invlogit(mu_true)

# simulate occurrence state
z_true = rng.binomial(1, psi_true)

## detection model

# true parameter values
alpha0_true = 1.5
alpha1_true = -1

# covariates
w = rng.uniform(size=site_count * survey_count).reshape(site_count, survey_count)
w = scale(w)

# linear model
nu_true = alpha0_true + alpha1_true * w
p_true = invlogit(nu_true)

# simulate detection
flips = rng.binomial(1, p_true)
y = (flips.T * z_true).T

# save output 
np.savetxt('y.csv',y, delimiter=',')
np.savetxt('x.csv',x, delimiter=',')
np.savetxt('w.csv',w, delimiter=',')

# likelihood for detection/non-detection data
def logp(x, p, psi):
    
    bern = (p ** x) * ((1 - p) ** (1 - x))
    bern_prod = pm.math.prod(bern, axis=1)
    
    res = pm.math.switch(
        x.sum(axis=1) > 0,
        bern_prod * psi,
        bern_prod * psi + (1 - psi)
    )
    
    return pm.math.log(res)

with pm.Model() as marginal:

    # occurrence process 
    # priors 
    beta0 = pm.Normal("beta0", mu=0, sigma=2)
    beta1 = pm.Normal("beta1", mu=0, sigma=2)
    
    # linear model
    mu = beta0 + beta1 * x
    psi = pm.Deterministic("psi", pm.math.invlogit(mu))

    # detection process
    # priors
    alpha0 = pm.Normal('alpha0', mu=0, sigma=2)
    alpha1 = pm.Normal('alpha1', mu=0, sigma=2)

    # linear model
    nu = alpha0 + alpha1 * w
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # likelihood
    pm.CustomDist(
        'y',
        p,
        psi,
        logp=logp,
        observed=y,
    )

pm.model_to_graphviz(marginal)

with marginal:
    marginal_idata = pm.sample()

az.plot_trace(
    marginal_idata,
    var_names=['beta0', 'beta1', 'alpha0', 'alpha1'],
    lines=[("beta0", {}, [beta0_true]), ("beta1", {}, [beta1_true]), 
           ('alpha0', {}, [alpha0_true]), ('alpha1', {}, [alpha1_true])]
)

You were sampling a variable that is completely determined by the data. If not problematic for the sampler it can still be wasteful.

But glad it was not the problem