Linear regression model: Three chains work, one chain never moves

Update: When I first wrote this I didn’t understand how the plotting was being done. Now that I do, I can see that this shows one of the chains (probably the fourth) never moving at all, rather than all of the chains getting stuck; I hadn’t realized that the samples from the chains were plotted one after another. But this remains a failure mode that is completely new to me.

I have never seen this before, and was looking for some suggestions. Hoping that someone will look at the behavior I have and say “I’ve seen that before!”

When I look at the trace for my model the sampler seems to have gotten stuck and not been able to get out of a local minimum:

Another oddity, by the way, when I try to do pm.stats.summary(trace) PyMC3 hangs.
There’s a term for the influence of one independent variable (the growth medium), and terms for the “ideal output” (IO) and “deviation output” (DO). There are also four input conditions, plotted as the colors in the XOR IO, XOR DO, and Cluster plots.

Some of my data is bimodal, so I added the possibility that some of the observations come from a “deviation group” and get their output from DO instead of IO. Whether they do or not is switched based on the Bernoulli variable Cluster, which has a beta prior, one for each input condition.

As you will see, at some point the “medium influences” get pinned to a high value, and stick. But, since it’s the same high value for each of the three, this seems to be a reflection of the fact that (as far as I can tell by eyeballing the data) the choice of growth medium has a minimal effect on the observed output.

Interestingly, the Cluster variable seems to be making roughly the right inference – the input condition that is colored orange is the one with the most pronounced bimodality.

By the way, I’m building on (and very grateful for) the advice from @chartl here: Sampling semantics of multiple observed variables

The one odd thing about my model is that I have separate 1/0 cov matrices for the growth medium and the input condition, so I have a cascaded linear sum instead of a single one. Diagram follows:

Using Arviz I now can see that it’s one of the chains that seems to have been stuck from beginning to end:

There are four chains, but one of them is constant at 1.5ish for all three growth media.


Thank you very much for the advice! I am looking into those things now.

I think I may have found a PyMC3 issue, as well, though. In my notebook, when I extract the medium influences from the trace, and plot it chainwise, I see this:


Done by:

dfs = [pd.DataFrame(x, columns=sm.medium_varnames)
       for x in trace.get_values(varname='medium influences',
                                 chains=[0, 1, 2, 3],
                                 squeeze=False, combine=False)]
for i, df in enumerate(dfs):
    df.plot(subplots=True, title='Chain %d' % i)

But when I use traceplot, I see this:

Done by:

with sm.model:
    pm.traceplot(trace, varnames=['medium influences'], combined=False)

These two don’t look like the same thing at all.

I suspect that my Jupyter session may be so messed up by now that I should restart it…

At first blush this looks extremely complicated for a latent variable model. I was under the impression that many of these variables (IO/DO/medium) were directly observed, in which case they should not have distributions associated with them (unless “XOR IO” indicates the distribution of \beta_{io} in y=\alpha + \beta_{io}x_{io} + \beta_{do}x_{do} + \dots)?

My best guess is that there may be a number of extra parameters around that don’t strictly need to be there, which make the model not uniquely identifiable. You might consider linking a notebook with (toy) data and your model.

Actually, the only observed variables are the growth medium (which isn’t shown as an observation here, because it’s reflected in the unseen cov array input into medsum) and the output (obs). So the IO, DO, medium influences and Cluster (categorical variable used to gate either IO or DO) variables are all latent.


One thing that’s a little odd to me, coming from a Bayes net background in computer science (for discrete RVs), instead of statistics, is that I would expect the picture of the model to look more like this:

This gets at my confusion about the semantics of observations, etc.: sometimes observations come in as evidence, and sometimes as conditioners, and they are modeled differently.

I would treat the left side of the tree as a fixed effect, using an encoding of the medium like (for 3 media)

baseline  is_m1  is_m2
       1      1      0
       1      0      1
       1      0      0

and use the “standard”

b = pm.Normal('beta_fixed', 0., 1., shape=3)
yfixed =, b)  # X is the design matrix encoded above
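As a sanity check of that encoding, here is a plain-numpy stand-in for the theano dot (the coefficient values are made up for illustration):

```python
import numpy as np

# Design matrix from the table above: columns [baseline, is_m1, is_m2]
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 0, 0]], dtype=float)

# Hypothetical values standing in for a draw of beta_fixed
b = np.array([0.5, 1.0, -1.0])

# Per-row fixed effect: the baseline plus the offset for that row's medium
yfixed = X @ b   # rows give baseline+m1, baseline+m2, baseline
```

Each row of `yfixed` is the baseline shifted by the selected medium's coefficient, which is exactly what `, b)` computes inside the model.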

And the right side (correct or deviant) looks like a gaussian mixture model:

X_{do} \sim N(\mu_{do}, \sigma_{do}^2)
X_{io} \sim N(\mu_{io}, \sigma_{io}^2)
B_{id} \sim \mathrm{Bernoulli}(p)
X_{obs} = B_{id}X_{io} + (1-B_{id})X_{do}

is that right? Are these only distinguished by the prior distributions? Are they identifiable by the model or could one swap the id and do labels and achieve the same likelihood?
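For concreteness, the generative story in those four lines can be simulated directly (numpy sketch; all parameter values are made up). The comment at the end is the label-swapping point raised in the question:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
p = 0.7                       # made-up mixing probability
mu_io, sd_io = 0.0, 1.0       # made-up "ideal output" parameters
mu_do, sd_do = 5.0, 1.0       # made-up "deviation output" parameters

b_id = rng.binomial(1, p, size=n)         # B_id ~ Bernoulli(p)
x_io = rng.normal(mu_io, sd_io, size=n)   # X_io ~ N(mu_io, sd_io^2)
x_do = rng.normal(mu_do, sd_do, size=n)   # X_do ~ N(mu_do, sd_do^2)
x_obs = b_id * x_io + (1 - b_id) * x_do   # gated observation

# Swapping the io/do parameters while replacing p with 1-p yields the
# same distribution for x_obs: the usual label-switching
# non-identifiability of mixture models.
```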

As it is written, I would treat this as a gaussian mixture model ( random effect. It does not need to be multiplied by a coefficient (this just shifts/scales the centers and variances, so leads to identifiability issues). Also if the standard errors are taken to be the same, there is non-identifiability between the random effect and the measurement error – so drop the error term. (In fact, this should always look like an additive factor to all of the cluster SDs, so it should always be dropped I think).

Incorporating the fixed effects into the mixture model is straightforward: just add it to the cluster means

points = pm.Normal('obs',
                       mu=means[category] + y_fixed,

Thank you for the recommendations. I think I have almost managed to translate my model into the marginalized form. One question, which is sort of off-topic: how do you debug these complex theano structures? I have some shape error in the middle of the normal mixture expression, and while I know that I have a mismatch, I don’t know what is mismatching or what the shapes of the arguments to this function are.
Theano shape variables are opaque, although by the time I get to building the normal mixture distribution, all the shapes should be known constants. But I don’t know how to break open a shape variable to debug…

A lot of trial and error… the traceback in the Jupyter notebook helps as well

I found the best thing was to extensively comment my model with comments specifying the shape of every entity. That helped, but I still got it wrong in a few places.
If I can find the time, I will try to provide a Theano patch that makes the notorious dimension-mismatch error more informative.
It would also be super helpful if someone who understands Theano would provide a debugging function that reports shapes. Stack Overflow has several posts from people with the dimension-mismatch problem, but no good answers.
Thank you again for the helpful advice and pointer to that example notebook.

One more follow-up: what do you mean by “normalization operation” above? I think you are using that term in a way that I don’t understand.
Ordinarily, I understand normalization as rescaling to [0, 1], but that doesn’t seem to be what you mean here.

Things like softmax, or rv / rv.sum()
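Both normalizations mentioned in that reply, written out in plain numpy for illustration:

```python
import numpy as np

rv = np.array([2.0, 1.0, 0.5])

# Sum-normalization: divide by the total
p_sum = rv / rv.sum()

# Softmax: exponentiate, then sum-normalize
p_softmax = np.exp(rv) / np.exp(rv).sum()

# Either way the result lies on the simplex: non-negative, sums to one.
```

These are the kinds of deterministic transforms of a random variable that the reply calls normalization operations.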

Sorry to keep bothering you, but could you say something about the relative advantages of the indexing approach that you use in the schizophrenia case, and the dot-product approach to indexing?

What I mean is, I have seen a few examples where someone builds an array of 0s and 1s and takes its dot product with an array of other variables to select a single value per row.

On the other hand, you use an array of indices to achieve the same purpose in the example you pointed me to.

I would guess that the indexing approach is better, and PyMC3 + theano would optimize it better, but I don’t know for sure, so I thought I would ask.

In theory the two are identical, but your intuition is correct: theano optimizes the two differently, and depending on your data size and the hardware you are using, there is no guarantee that one is faster than the other, in my experience.
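The equivalence being discussed can be seen in plain numpy (fancy indexing and the one-hot dot product behave the same way in theano):

```python
import numpy as np

alphas = np.array([0.1, 0.2, 0.3])
idx = np.array([0, 2, 1, 2])        # one category label per row

# Indexing approach: fancy-index the coefficient vector
by_index = alphas[idx]

# Dot-product approach: one-hot rows dotted with the coefficients
onehot = np.eye(len(alphas))[idx]   # (n_rows, n_categories) of 0/1
by_dot = onehot @ alphas

# The two select exactly the same value per row.
```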

I’m sorry to keep bothering you with questions, but in the Schizophrenic reaction time notebook, I see these two lines:

w = tt.stack([1. - Z_latent.flatten(),
              Z_latent.flatten()]).T  # <- careful of the shape here
mu_patient = tt.as_tensor_variable([alphas[sbj_patient],
                                    alphas[sbj_patient] + tau]).T

Two questions:

  1. For the first line (the assignment to w), is tt.stack(...).T the same as tt.stack(..., axis=1)? (I have no idea whether one would be more efficient than the other.)
  2. Why does the second line use as_tensor_variable instead of stack? Is there a difference?

I forgot why I am doing that, but if I remember correctly, after flatten there is no axis=1, since the shape is (n,).

It is probably better to do stack here actually.
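Question 1 can be checked directly: for 1-D inputs the two spellings agree. A numpy mirror of the theano ops, with a made-up stand-in for Z_latent:

```python
import numpy as np

z = np.array([0.2, 0.7, 0.9])   # stand-in for Z_latent.flatten()

w_t = np.stack([1.0 - z, z]).T          # stack along axis 0, then transpose
w_ax = np.stack([1.0 - z, z], axis=1)   # stack along axis 1 directly

# Both give the same (n, 2) array of mixture weights.
```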

Thanks! I really appreciate all of your help! :pray:
