Missing data: np.nan in a multidimensional Poisson distribution

HongyuXie · September 13, 2022, 7:47am

In the following model, the observed variable AD is 2-dimensional and Poisson distributed. Each row of AD corresponds to 1 “position” and is associated with 1 AF. The number of positions (nPos) is fixed. Each position is associated with multiple “samples” which correspond to the columns (nSample) and the number of samples observed varies among positions. The ‘missing’ samples are np.nan in the input array. So in a way, the samples are pooled at the ‘position’ level and the positions are pooled at the AF level so that AF for positions are sampled from a single Gamma distribution. TD is an observed covariate that has the same dimensioan as AD and np.nan in exactly the same coordinates in the matrix.
Now, with missing data filled as np.nan, inference gave an error ‘SamplingError: Initial evaluation of model at starting point failed!’ with -inf likelihood for AD, which I assume to be caused by the np.nan. My question is, without imputation, how do I rewrite the model to make it work?

with pm.Model() as m:
        TD_observed = pm.MutableData('TD_observed', np.array(dataDict['TD_observed']))  # (nPos, nSample)
        AD_observed = pm.MutableData('AD_observed', np.array(dataDict['AD_observed']))  # (nPos, nSample)

        gammaMu = pm.LogNormal("gammaMu", mu=1, sigma=1)  # (1,)
        gammaStd = pm.LogNormal("gammaStd", mu=1, sigma=1)  # (1,)
        gammaShape = pm.Deterministic('gammaShape', gammaMu ** 2 / gammaStd ** 2)  # (1,)
        gammaBeta = pm.Deterministic('gammaBeta', (gammaShape / gammaMu))  # (1,)
        AF = pm.Gamma('AF', alpha=gammaShape, beta=gammaBeta, shape=(dataDict['nPos'], 1))  # (nPos, 1)

        AF_coeff = .0001
        lambda_p = pm.Deterministic('lambda_p', TD_observed * AF * AF_coeff)  # (nPos, nSample) * (nPos, 1) > (nPos, nSample)
        AD = pm.Poisson('AD', mu=lambda_p, observed=AD_observed)  # (nPos, nSample)

Any help is appreciated.

ricardoV94 · September 13, 2022, 10:50am

The problem is that you are using MutableData for the observed. Imputation only works with constant data, because we need to know which RVs to create when building the model.

Automatic imputation is a shortcut for creating unobserved variables with the same distribution as the observed ones (plus a Deterministic that joins the two together for prior/posterior predictive):

The following are equivalent (but not exactly what we do with automatic imputation):

pm.Normal("x", 0, 1, observed=[0, 1, 2, np.nan, np.nan])

x_obs = pm.Normal("x_obs", 0, 1, observed=[0, 1, 2])
x_missing = pm.Normal("x_missing", 0, 1, shape=(2,))
x = pm.Deterministic("x", at.concatenate([x_obs, x_missing])

We might try to implement Mutable imputation in the future, but that is not in the roadmap at the moment.

HongyuXie · September 13, 2022, 12:42pm

Thank you for the answer!
Not sure if I understood it correcly, but after changing the model to the following

    with pm.Model() as m:

        gammaMu = pm.LogNormal("gammaMu", mu=1, sigma=1)  # (1,)
        gammaStd = pm.LogNormal("gammaStd", mu=1, sigma=1)  # (1,)
        gammaShape = pm.Deterministic('gammaShape', gammaMu ** 2 / gammaStd ** 2)  # (1,)
        gammaBeta = pm.Deterministic('gammaBeta', (gammaShape / gammaMu))  # (1,)
        AF = pm.Gamma('AF', alpha=gammaShape, beta=gammaBeta, shape=(dataDict['nPos'], 1))  # (nPos, 1)

        AF_coeff = .0001
        lambda_p = pm.Deterministic('lambda_p', dataDict['TD_observed'] * AF * AF_coeff)  # (nPos, nSample) * (nPos, 1) > (nPos, nSample)
        AD = pm.Poisson('AD', mu=lambda_p, observed=dataDict['AD_observed'])  # (nPos, nSample)

imputation seemed to work (an imputation warning raised). However, a similar error was raised still:

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'gammaMu_log__': array(1.5), 'gammaStd_log__': array(1.5), 'AF_log__': array([[1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5],
       [1.5]]), 'AD_missing': array([-9223372036854775808, -9223372036854775808, -9223372036854775808,
       -9223372036854775808])}

Initial evaluation results:
{'gammaMu': -1.04, 'gammaStd': -1.04, 'AF': -20.0, 'AD_missing': -inf, 'AD_observed': -969.88}

Regarding this special case where a covariate TD also has missing data, since missing ADs are to be imputed from the Poisson, which relies partially on TD, is this causing the imputation failure? Or the more direct question, how is the TD imputated internally?

ricardoV94 · September 13, 2022, 3:00pm

You are getting some insane values for AD_missing, suggesting your prior on lambda_p may be too large (and overflow?)

HongyuXie · September 21, 2022, 4:03am

Some updates and a follow-up question. I have changed the model to use a mask and it worked. Specifically, I filled missing observed AD and covariates TD with redundant ones and applied a 0/1 array mask at the last step. My understanding is that, since the covariates TD have no prior distribution, there is no way to imputate it and using mask is the only way to this type of problem. In addition, now I need to predict on hold-out data, meaning I have to change the input to MutableData. Once again, it seems using mask is the only way out. Is that right? @ricardoV94

ricardoV94 · September 21, 2022, 5:32am

You can also try to imputate covariates, but it’s a model decision. There’s a good talk on this: Partial Missing Multivariate Observation and What to Do With Them by Junpeng Lao

HongyuXie · September 21, 2022, 6:27am

Again, thank you very much.

Topic		Replies	Views
PyMC v5.14.0 introduces error in MultivariateNormal with NaN v5 development , bug , modeling	3	130	May 14, 2024
Automatic imputation - array dimension problem Questions	2	666	February 10, 2022
Reconciliation via Poisson Sampling v5 modeling , sampling	10	189	December 21, 2023
Understanding error while sampling a model with conditional probabilities Questions	7	1997	May 28, 2018
Missing Data Imputation - Obscurities Questions	0	555	January 18, 2022

Missing data: np.nan in a multidimensional Poisson distribution

Related topics