Dependent density regression (Dirichlet process) with outliers / unreliable count data

yemsb · October 1, 2024, 4:05pm

Dear Bayesians,

I’m trying to model the number of trees in parks of different sizes in a particular city. I adapted the example for dependent density regression somewhat successfully, but two things are causing me difficulty:

First, while the independent variable (park size) is reliable, the dependent variable (number of trees) is not, especially for larger parks. As a result, many parks, both large and small, have been assigned zero trees even though they do have trees in reality (as seen on satellite images). This skews the posterior significantly, as seen here (the x-axis is the standardised park area on a log scale, the y-axis is the standardised number of trees on a log scale):

[Figure: posterior of the model, standardised log park area (x) against standardised log number of trees (y)]

In the above data, the zero-counts were already removed. I would expect a mostly linearly increasing posterior. This issue becomes even more apparent when converting back to linear scale, after de-standardising (park areas in ha):

[Figure: the same posterior on a linear scale after de-standardising, with park area in ha]
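The conversion between the two scales can be sketched as follows. This is a minimal sketch, assuming the standardisation is a plain z-score of the natural log; the array `park_area_ha` and both helper names are hypothetical and the values are made up.

```python
import numpy as np


def standardise_log(values):
    """Z-score the natural log of strictly positive values.

    Returns the standardised values plus the mean and std that were used.
    """
    logged = np.log(values)
    mean, std = logged.mean(), logged.std()
    return (logged - mean) / std, mean, std


def destandardise_log(z, mean, std):
    """Invert standardise_log: undo the z-score, then exponentiate."""
    return np.exp(z * std + mean)


park_area_ha = np.array([0.5, 1.2, 3.0, 8.5, 40.0])  # made-up areas in ha
z, log_mean, log_std = standardise_log(park_area_ha)
recovered = destandardise_log(z, log_mean, log_std)
```

One caveat that follows from the algebra: exponentiating a mean taken in log space returns the geometric mean of the counts (the median, if the counts are log-normal), which sits below the arithmetic mean on the linear scale.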

Now, the main reason I am modelling the tree density in parks is that I want to use the model to estimate the number of trees for parks that have zero trees assigned to them. In other words, I want to generate plausible y-values for out-of-sample x-values. Even though the data are not 100% reliable, there is some structure (or blob) suggesting that these points can be trusted to a certain degree (and for some of these parks I checked and concluded that almost every tree was indeed included in the count).
I’ve also been looking into a simple robust linear model (following this example) with a sigma that depends on x, but I have no justification for modelling sigma in a specific way yet, so I am trying to stay as model-independent as possible for now.
Second, my y-data are obviously discrete and positive. Additionally, the x-variable seems to be roughly log-normally distributed (which is why I ran the model on the log of the park size and of the tree count). How would I go about modelling y with a discrete distribution instead of a normal distribution in the context of the Dirichlet process model?
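To make the question concrete, here is a rough NumPy sketch of what a discrete counterpart of the mixture likelihood could look like: each component is a Poisson whose rate depends on x through a log link, and the component weights come from a truncated stick-breaking process. Everything here is an assumption for illustration (the Poisson choice, the log link, the renormalisation of the weights, and all names and numbers). In PyMC, a generic `pm.Mixture` with Poisson or negative-binomial components would presumably take the place of `pm.NormalMixture`, applied to the raw counts rather than the standardised log counts, but that is untested here.

```python
import math

import numpy as np


def poisson_logpmf(y, rate):
    """log P(Y = y) for a Poisson with the given rate; y and rate broadcast."""
    y = np.asarray(y, dtype=float)
    log_factorial = np.vectorize(math.lgamma)(y + 1.0)
    return y * np.log(rate) - rate - log_factorial


def poisson_mixture_logpmf(y, weights, rates):
    """log of sum_k weights[k] * Poisson(y | rates[k]), via log-sum-exp."""
    weights = weights / weights.sum()  # renormalise truncated stick-breaking weights
    terms = np.log(weights) + poisson_logpmf(np.asarray(y)[..., None], rates)
    top = terms.max(axis=-1, keepdims=True)
    return (top + np.log(np.exp(terms - top).sum(axis=-1, keepdims=True)))[..., 0]


log_area = 1.0                             # made-up standardised log park area
gamma_demo = np.array([1.0, 3.0, 4.5])     # hypothetical component intercepts
delta_demo = np.array([0.2, 0.5, 0.8])     # hypothetical component slopes
rates_demo = np.exp(gamma_demo + delta_demo * log_area)  # log link keeps rates positive
weights_demo = np.array([0.5, 0.3, 0.15])  # sums to < 1, as with a truncated process

counts = np.arange(0, 2000)
pmf = np.exp(poisson_mixture_logpmf(counts, weights_demo, rates_demo))
```

A negative-binomial component would add a dispersion parameter per component, which may matter if the counts are more spread out than a Poisson allows.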

My model so far (mostly copied from the first example):

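import numpy as np
import pymc as pm
import pytensor.tensor as tt  # assumed backend (PyMC v5); on PyMC v4 this would be aesara.tensor


def norm_cdf(z):
    # Standard normal CDF, used as the probit link for the stick-breaking fractions
    return 0.5 * (1 + tt.erf(z / np.sqrt(2)))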


def stick_breaking(v):
    return v * tt.concatenate(
        [tt.ones_like(v[:, :1]), tt.extra_ops.cumprod(1 - v, axis=1)[:, :-1]], axis=1
    )

N = len(std_park_size)
K = 20

with pm.Model(coords={"N": np.arange(N), "K": np.arange(K) + 1, "one": [1]}) as model:
    alpha = pm.Normal("alpha", 0.0, 5.0, dims="K")
    beta = pm.Normal("beta", 0.0, 5.0, dims=("one", "K"))
    x = pm.Data("x", std_park_size, dims="obs_id")
    v = norm_cdf(alpha + pm.math.dot(x[:, np.newaxis], beta))
    w = pm.Deterministic("w", stick_breaking(v), dims=["N", "K"])

with model:
    gamma = pm.Normal("gamma", 0.0, 10.0, dims="K")
    delta = pm.Normal("delta", 0.0, 10.0, dims=("one", "K"))
    mu = pm.Deterministic("mu", gamma + pm.math.dot(x[:, np.newaxis], delta))

with model:
    tau = pm.Gamma("tau", 1.0, 1.0, dims="K")
    y = pm.Data("y", std_trees)
    obs = pm.NormalMixture("obs", w, mu, tau=tau, observed=y, dims="obs_id")

SAMPLES = 2000
BURN = 1000

with model:
    step = pm.Metropolis()
    trace = pm.sample(SAMPLES, tune=BURN, step=step, random_seed=42, return_inferencedata=True)
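For readers who want to inspect the weights outside the model, the two helper functions can be mirrored in plain NumPy. This is a minimal sketch with made-up parameter values and only K = 4 components; the `_np` and `_demo` names are hypothetical and not part of the model above.

```python
import math

import numpy as np


def norm_cdf_np(z):
    """Standard normal CDF, elementwise (NumPy counterpart of norm_cdf)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / np.sqrt(2.0)))


def stick_breaking_np(v):
    """w[:, k] = v[:, k] * prod_{j<k} (1 - v[:, j]) for each row."""
    remaining = np.concatenate(
        [np.ones_like(v[:, :1]), np.cumprod(1.0 - v, axis=1)[:, :-1]], axis=1
    )
    return v * remaining


rng = np.random.default_rng(0)
x_demo = np.array([-2.0, 0.0, 2.0])            # made-up standardised x values
alpha_demo = rng.normal(0.0, 5.0, size=4)      # one intercept per component
beta_demo = rng.normal(0.0, 5.0, size=(1, 4))  # one slope per component

v_demo = norm_cdf_np(alpha_demo + x_demo[:, None] @ beta_demo)
w_demo = stick_breaking_np(v_demo)
leftover = 1.0 - w_demo.sum(axis=1)  # mass not covered by the truncated components
```

Each row of `w_demo` sums to at most 1, and `leftover` equals the product of `1 - v` over all components, so a large leftover at some x would suggest that K is too small there.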

And to draw out-of-sample values:

new_x = np.array([-2., 2.]) # for example (standardised log-space park area)

with model:
    model.set_data('x', new_x, coords={'obs_id': range(len(new_x))})

    y_pred = pm.sample_posterior_predictive(trace, var_names=['obs'], return_inferencedata=True, predictions=True)
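Conceptually, for a single posterior draw, generating a y-value at a new x amounts to picking a mixture component according to the weights at that x and then sampling from that component's normal. The NumPy sketch below illustrates this idea only; it is not PyMC's implementation, and the function name, the renormalisation of the weights, and all numbers are assumptions.

```python
import numpy as np


def draw_from_normal_mixture(weights, means, precisions, rng):
    """Draw one value per row: pick a component by weight, then sample its normal.

    weights, means: arrays of shape (n_rows, n_components)
    precisions: array of shape (n_components,), shared across rows (like tau)
    """
    weights = weights / weights.sum(axis=1, keepdims=True)  # renormalise truncated weights
    n_rows, n_comp = weights.shape
    comp = np.array([rng.choice(n_comp, p=weights[i]) for i in range(n_rows)])
    rows = np.arange(n_rows)
    sd = 1.0 / np.sqrt(precisions[comp])  # precision tau -> standard deviation
    return rng.normal(means[rows, comp], sd)


rng = np.random.default_rng(42)
weights_demo = np.array([[0.7, 0.2, 0.05], [0.1, 0.3, 0.55]])  # made-up w at two new x
means_demo = np.array([[-1.5, -0.5, 0.5], [0.5, 1.5, 2.5]])    # made-up mu at two new x
precisions_demo = np.array([4.0, 1.0, 0.25])                   # made-up tau per component

draws = draw_from_normal_mixture(weights_demo, means_demo, precisions_demo, rng)
```

Repeating this over all posterior draws gives the spread of plausible y-values at each new x.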

After generating a realisation for all parks with zero trees:
[Figure: original data with the generated realisations shown in red]
This doesn’t look terrible, and I couldn’t tell that these points were not in the original distribution. However, the bias towards lower counts for larger park areas is clear.

Additionally, I tried to adapt another example, which models heteroscedastic noise, with some success. I binned the data and calculated the mean and standard deviation per bin (this binning is not ideal; I should probably choose bins with approximately equal counts):
[Figure: binned means and standard deviations]
I then fitted a homoscedastic model (I didn’t proceed further at this point), which looks somewhat promising (at least it’s monotone). But before proceeding, I wanted to ask whether this model is even appropriate for the data, because 1) it doesn’t account for asymmetric errors and 2) the distribution of y is not Gaussian.
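Bins with approximately equal counts can be obtained by placing the bin edges at quantiles of x. The sketch below is one way to do that in NumPy; the function name and the synthetic data are made up for illustration and are not the park data.

```python
import numpy as np


def equal_count_bins(x, y, n_bins):
    """Group (x, y) into n_bins bins of roughly equal size using x-quantiles.

    Returns per-bin mean of x, mean of y, standard deviation of y, and counts.
    """
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Use interior edges only, so every point gets a bin index in 0 .. n_bins - 1.
    idx = np.searchsorted(edges[1:-1], x, side="right")
    x_mean = np.array([x[idx == b].mean() for b in range(n_bins)])
    y_mean = np.array([y[idx == b].mean() for b in range(n_bins)])
    y_std = np.array([y[idx == b].std(ddof=1) for b in range(n_bins)])
    counts = np.bincount(idx, minlength=n_bins)
    return x_mean, y_mean, y_std, counts


rng = np.random.default_rng(0)
x_demo = rng.normal(size=1000)                        # synthetic standardised x
noise_sd = 0.2 + 0.3 * (x_demo - x_demo.min())        # noise that grows with x
y_demo = 2.0 * x_demo + rng.normal(scale=noise_sd)    # synthetic heteroscedastic y

x_mean, y_mean, y_std, counts = equal_count_bins(x_demo, y_demo, n_bins=5)
```

With heavily tied x-values the counts can become uneven, since identical values always land in the same bin.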

In case you are interested, you can find the data here
std_park_size_and_std_trees.txt (29.4 KB)
