Dependent density regression (Dirichlet process) with outliers / unreliable count data

Dear Bayesians,

I’m trying to model the number of trees in parks of different sizes in a particular city. I adapted the example for dependent density regression somewhat successfully. But two things pose difficulty for me:

  1. While the independent variable (park size) is reliable, the dependent variable (number of trees) is not, especially for larger parks. This also leads to many (large or small) parks having assigned zero trees to them (when in reality they have trees, as seen on satellite images). This skews the posterior significantly, as seen here (x-axis is the standardised park area in log-scale, y-axis is the standardised number of trees in log-scale):

    In the above data, the zero-counts were already removed. I would expect a mostly linearly increasing posterior. This issue becomes even more apparent when converting back to linear scale, after de-standardising (park areas in ha):

    Now, the major reason why I try to model the tree density in parks is because I want to use the model to estimate the number of trees for parks with zero trees assigned to them. In other words, I try to generate plausible y-values for out-of-sample x-values. Even though the data are not 100% reliable, there is some structure (or blob) suggesting that these points can be trusted to a certain degree (and for some of these parks I checked and concluded that almost every tree was indeed included in the count).
    I’ve also been looking into a simple robust linear model (following this example) with variable sigma dependent on x, but I have no justification for modelling sigma in a specific way yet, so I try to be as model-independent as possible right now.
  2. Obviously, my y-data is discrete and positive. Additionally, the x-variable seems to be somewhat lognormally distributed (which is why I ran the model on the log of the park size and of the tree count). How would I go about modelling my y as a discrete distribution instead of a normal distribution in the context of the Dirichlet process model?

My model so far (mostly copied from the first example:

def norm_cdf(z):
    return 0.5 * (1 + tt.erf(z / np.sqrt(2)))


def stick_breaking(v):
    return v * tt.concatenate(
        [tt.ones_like(v[:, :1]), tt.extra_ops.cumprod(1 - v, axis=1)[:, :-1]], axis=1
    )
N = len(std_park_size)
K = 20

with pm.Model(coords={"N": np.arange(N), "K": np.arange(K) + 1, "one": [1]}) as model:
    alpha = pm.Normal("alpha", 0.0, 5.0, dims="K")
    beta = pm.Normal("beta", 0.0, 5.0, dims=("one", "K"))
    x = pm.Data("x", std_park_size, dims="obs_id")
    v = norm_cdf(alpha + pm.math.dot(x[:, np.newaxis], beta))
    w = pm.Deterministic("w", stick_breaking(v), dims=["N", "K"])

with model:
    gamma = pm.Normal("gamma", 0.0, 10.0, dims="K")
    delta = pm.Normal("delta", 0.0, 10.0, dims=("one", "K"))
    mu = pm.Deterministic("mu", gamma + pm.math.dot(x[:, np.newaxis], delta))

with model:
    tau = pm.Gamma("tau", 1.0, 1.0, dims="K")
    y = pm.Data("y", std_trees)
    obs = pm.NormalMixture("obs", w, mu, tau=tau, observed=y, dims="obs_id")

SAMPLES = 2000
BURN = 1000

with model:
    step = pm.Metropolis()
    trace = pm.sample(SAMPLES, tune=BURN, step=step, random_seed=42, return_inferencedata=True)

And to draw out-of-sample values:

new_x = np.array([-2., 2.]) # for example (standardised log-space park area)

with model:
    model.set_data('x', new_x, coords={'obs_id': range(len(new_x))})

    y_pred = pm.sample_posterior_predictive(trace, var_names=['obs'], return_inferencedata=True, predictions=True)

After generating a realisation for all parks with zero trees (in red):
grafik
Which doesn’t look terrible and I couldn’t tell that these were not in the original distribution. However, the bias towards lower counts for higher park areas is clear.


Additionally, I tried to adapt another example which models heteroscedastic noise with some success. After binning the data and calculating the means and standard deviations (binning is not ideal like this, I should probably choose binnings with approximately equal counts),
grafik
and fitting a homoscedastic model (didn’t proceed further at this point)


which looks somewhat promising (at least it’s monotone). But before proceeding I wanted to ask if this model is even appropriate for the data because 1) it doesn’t account for asymmetric errors and 2) the distribution of y is not Gaussian.


In case you are interested, you can find the data here
std_park_size_and_std_trees.txt (29.4 KB)