Missing values in a model?

I’m having trouble with a model, and looking into it, I find that it has FreeRVs that have no distribution and that have the substring `_missing` in their names. AFAICT, these were created by code in `as_tensor` in `pymc3.model`.

The question I have is why is this done instead of raising an error when PyMC3 can’t look up a variable successfully? Is there some case where these missing variables are legitimate? I’m just wondering why this case doesn’t raise an exception to help the programmer debug, instead of quietly inserting what look like garbage entities to me.


Hmmm, I am not completely sure what you mean. Could you please share your model?

@junpenglao I’m afraid I can’t easily share the model, but I found the cause of the behavior:

I was applying an existing model to a new dataset that was stored in a pandas data frame. This data frame (unlike the data set I had used before) had NaN values in it. The problem went away when we filtered them out.

This still leaves me with my original question: why would PyMC3 quietly create these new `_missing` random variables and link them into my model? It seems to me that this should be an error condition, but maybe I’m missing something. E.g., is there some case in which PyMC3 would backpatch values or something?

The block of code that injects these new FreeRV objects is in `as_tensor()` in `model.py`, but there are no comments, so I don’t know why the code does that, or what might break if it were replaced with raising an exception.

My guess is that the developers had something very specific in mind when handling pandas dataframes with missing values, but I don’t know what it is. git blame shows a lot of people working on this bit of the code (@fonnesbeck, @twiecki, John Salvatier and others).

@rpgoldman Yes, the idea is that nans represent missing data that we can impute in PyMC3. It’s actually quite powerful but I can see how it might cause confusion if it’s not intentional. Maybe we should add a log message that we are imputing in that case.


If the input data is a numpy array, it will not automatically impute the missing values (unless it is specifically a masked array), whereas DataFrame inputs with missing values will be imputed. I’m happy to add a warning message when this occurs.
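To illustrate the distinction (a sketch using only numpy and pandas, with made-up variable names):

```python
import numpy as np
import pandas as pd

raw = np.array([1.0, 2.0, np.nan, 4.0])

# A plain ndarray with NaNs carries no mask, so nothing signals
# that the NaNs should be imputed.
print(hasattr(raw, "mask"))          # False: no mask attribute

# A masked array explicitly marks the missing entries, which is
# the cue for imputation.
masked = np.ma.masked_invalid(raw)
print(masked.mask)                   # [False False  True False]

# A pandas Series with NaNs is recognized as having missing values
# and gets the masked-array treatment internally.
series = pd.Series(raw)
print(pd.isnull(series).any())       # True
```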

I typically don’t use Pandas data structures as inputs myself, because I sometimes get unexpected behavior (usually related to indexes); I almost always strip out the values. We should probably improve the reliability of Pandas inputs, I agree.


Would it make sense to have a model construction option that indicates whether or not missing data should be treated as erroneous?

For at least some users, missing data will indicate failed filtering, so it would be nice to get a report.

TBH, my pandas fu isn’t sufficient to provide a check. Is it the case that `hasattr(data, 'mask')` will be false if there are no missing values? If so, I could try to add a PR that optionally traps this.
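For the record, here is a sketch of the kind of check I have in mind (`has_missing` is a hypothetical helper; note that `hasattr(data, 'mask')` is true for *any* masked array, even one with nothing masked, so `np.ma.is_masked` may be the safer test):

```python
import numpy as np
import pandas as pd

def has_missing(data):
    """Hypothetical helper: does `data` look like it contains
    missing values that would trigger imputation?"""
    if isinstance(data, np.ma.MaskedArray):
        # hasattr(data, 'mask') is True for every masked array,
        # even if nothing is actually masked, so test the mask itself.
        return bool(np.ma.is_masked(data))
    if isinstance(data, (pd.Series, pd.DataFrame)):
        # pandas inputs with NaNs get the masked-array treatment.
        return bool(pd.isnull(np.asarray(data)).any())
    # A plain ndarray with NaNs is not automatically imputed.
    return False
```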

A PR that adds a warning to the log would be much appreciated!

@rpgoldman would something like this work for you?

Yes, but I was actually thinking of adding a boolean property, `impute_values`, to the `Model` object on creation, and producing an error rather than a warning if there are missing data and `impute_values` is `False`.

One thing I don’t understand is that it didn’t seem like my data got imputed, but I may be wrong. In the presence of missing data, my colleague ran the model and sent me this error message:

```
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
process_plan: r1bbktv6x4xke
Processing: plan_attributes.json
Found 93 Records in r1bbktv6x4xke
Processing r1bbktv6x4xke 7.5e-05 UWBF_NOR
Trace directory is: gander-data/UWBF_NOR/trace
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [output (1, 1)/5_missing, output (1, 1)/4_missing, output (1, 1)/3_missing, output (1, 1)/2_missing, output (1, 1)/1_missing, output (1, 1)/0_missing, output (0, 0)/5_missing, output (0, 0)/4_missing, output (0, 0)/3_missing, output (0, 0)/2_missing, output (0, 0)/1_missing, output (0, 0)/0_missing, output (1, 0)/5_missing, output (1, 0)/4_missing, output (1, 0)/3_missing, output (1, 0)/2_missing, output (1, 0)/1_missing, output (1, 0)/0_missing, output (0, 1)/5_missing, output (0, 1)/4_missing, output (0, 1)/3_missing, output (0, 1)/2_missing, output (0, 1)/1_missing, output (0, 1)/0_missing, sigma_(1, 1), sigma_(0, 0), sigma_(1, 0), sigma_(0, 1), mu_(1, 1) offset, mu_(0, 0) offset, mu_(1, 0) offset, mu_(0, 1) offset, hyper_sigma_sigma_(1, 1), hyper_sigma_sigma_(0, 0), hyper_sigma_sigma_(1, 0), hyper_sigma_sigma_(0, 1), hyper_sigma_mu_(1, 1), hyper_sigma_mu_(0, 0), hyper_sigma_mu_(1, 0), hyper_sigma_mu_(0, 1), hyper_mu_sigma_(1, 1), hyper_mu_sigma_(0, 0), hyper_mu_sigma_(1, 0), hyper_mu_sigma_(0, 1), hyper_mu_mu_(1, 1), hyper_mu_mu_(0, 0), hyper_mu_mu_(1, 0), hyper_mu_mu_(0, 1)]
Sampling 4 chains:   0%|                                                        | 0/4000 [00:00<?, ?draws/s]
Traceback (most recent call last):
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
    point, stats = self._compute_point()
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 117, in astep
    'might be misspecified.' % start.energy)
ValueError: Bad initial energy: inf. The model might be misspecified.

The above exception was the direct cause of the following exception:

ValueError: Bad initial energy: inf. The model might be misspecified.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home1/05426/plotnick/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home1/05426/plotnick/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home1/05426/plotnick/xplan/xplan-to-autoprotocol-reactor/helpers/gander.py", line 16, in <module>
    tp_main("UWBF_NOR", max_plans=1, od=7.5e-5, train=train)
  File "/home1/05426/plotnick/xplan/xplan-to-autoprotocol-reactor/helpers/train_prior.py", line 337, in main
    all_models.append(train(_gate, ygdata))
  File "/home1/05426/plotnick/xplan/xplan-to-autoprotocol-reactor/helpers/gander.py", line 12, in train
    model = make_model(gate, data)
  File "/home1/05426/plotnick/xplan-experiment-analysis/ygmodel.py", line 580, in make_model
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/sampling.py", line 449, in sample
    trace = _mp_sample(**sample_args)
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/sampling.py", line 999, in _mp_sample
    for draw in sampler:
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 305, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/home1/05426/plotnick/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 223, in recv_draw
    six.raise_from(RuntimeError('Chain %s failed.' % proc.chain), old)
  File "<string>", line 3, in raise_from
RuntimeError: Chain 1 failed.
```

And when I used graphviz to plot the model, my `foo_missing` observation nodes were displayed as parents of the observation node, rather than as children, and were shown as Missing Distribution. So maybe there’s something additional one must do to make the imputation actually happen?

Here’s a snippet from the graphviz figure:

The way it works under the hood is that the model looks for masked arrays in the `observed` argument’s value as a cue to impute. When a Pandas series is passed, its values are cast to a masked array if any missing values are present. This is all done at the time of model specification, not when `sample` is called, because it affects the structure of the model. So I’m not sure how your proposal would work, since it would have to go back and alter the structure of the model itself. I suppose you could just have it look for `_missing` variables in the model and refuse to sample if the flag is set.
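The conversion step is roughly equivalent to this (a sketch of the idea in plain numpy/pandas, not the actual `model.py` code; `as_masked` is a made-up name):

```python
import numpy as np
import pandas as pd

def as_masked(values):
    """Sketch: cast input to a masked array when it contains NaNs,
    mirroring the cue used to decide whether to impute."""
    arr = np.asarray(values, dtype=float)
    if np.isnan(arr).any():
        # Masked entries are what become *_missing free variables
        # at model-specification time.
        return np.ma.masked_invalid(arr)
    return arr

obs = as_masked(pd.Series([0.2, np.nan, 0.9]))
print(obs.mask)   # the NaN entry is now explicitly masked
```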

The best way to avoid imputation is to remove the missing values. At least a warning would let you know what’s happening and let you do something about it before trying again. I will go ahead and add the message in a pull request for now, and if you want to have a go at a sampling flag, you can do that separately.


PS – not sure exactly what’s going on in your model. Does it happen with `chains=1`?

I’m not sure, either. I’ll have to go back and “unfix” it so that I can test.

I guess my general feeling is that data imputation should be the “marked” (non-default) mode of operation, instead of the unmarked (default).

If you expect there to be no missing data, then your model fails in very hard-to-track ways when imputation happens (or doesn’t happen, which seems to be my case), instead of giving you a warning about the missing data.

I don’t think that I’m wrong in this – I suspect most users don’t expect imputation, but I could be wrong.

I disagree that not imputing should be the default behavior. The onus should be on the user not to pass missing values in their response variables, and a warning allows users to remove missing values if they want to. The only cost of imputing is the additional computation from drawing from the posterior predictive distribution during sampling, so there is no risk of getting the “wrong answer” by imputing data.

It’s not clear to me that the breakage you are demonstrating is due to missing data being imputed, but I’m happy for you to convince me otherwise.