Non-deterministic results between different machines

Hi all,

I've been using v4.0.0b6 to fit a simple multivariate regression model to some data. To make the output deterministic, there is a call to np.random.seed(1234). I've posted the relevant parts of the code below.

With this approach, everything initially seemed fine: running pm.sample() repeatedly on the same machine appeared to converge to the same solution (checked by inspecting trace.model_logp and looking at the model parameters. Aside: is this a valid way to check?).
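
For reference, the comparison between runs is roughly along these lines (trace_a and trace_b are placeholder names for the results of two separate pm.sample() calls):

import numpy as np

# Compare the stored log-probabilities and sampled parameters of two runs;
# if sampling is deterministic these should agree to within floating-point tolerance
print(np.allclose(trace_a.model_logp, trace_b.model_logp))
print(np.allclose(trace_a.beta, trace_b.beta))
print(np.allclose(trace_a.intercept, trace_b.intercept))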

This was checked on the same machine, running (a) in a poetry environment and (b) inside a docker container; both environments were set up with the same poetry commands, and in both cases the fitted model appeared identical under the inspection approach described above.

The same docker container was then used in CI to run the same model-fitting process, and there it did not converge to the same solution. However, running it multiple times on the same CI machine gives the same converged solution each time, just a different one from the original machine.

A colleague then ran it on a third machine and saw the same behaviour, i.e. results different from machines 1 and 2, but consistent convergence on that machine.

Does anyone have any idea what might be causing this or any thoughts on where to look?

One potentially relevant warning we are seeing is "aesaraf.py:1005: UserWarning: The parameter ‘updates’ of aesara.function() expects an OrderedDict, got <class ‘dict’>. Using a standard dictionary here results in non-deterministic behavior. You should use an OrderedDict if you are using Python 2.7 (collections.OrderedDict for older python), or use a list of (shared, update) pairs. Do not just convert your dictionary to this type before the call as the conversion will still be non-deterministic."

However, a bit of searching suggests that we can ignore it in this case (?)

Many thanks for any suggestions.

Code snippets below. If more is needed please say.

# Model (x1, x2, x3, noise are random arrays, c0, c1... consts)
y = c0 + (c1 * x1) + (c2 * x2) + (c3 * x3) + (c4 * noise)

# dataframe constructed from the following dict:
data = {'x1': x1, 'x2': x2, 'x3': x3, 'y': y}
df = pd.DataFrame(data)

# Fit model
np.random.seed(self.seed)
...
with pm.Model() as model:
    # Priors
    sigma = pm.HalfNormal('sigma', sigma=2)
    intercept = pm.Normal('intercept', mu=20, sigma=200)

    beta = pm.Normal('beta', mu=15, sigma=20, shape=3)

    mu = intercept + pm.math.dot(df[['x1', 'x2', 'x3']], beta)
    pm.Normal('likelihood', mu=mu, sigma=sigma, observed=df['y'])

    # draw posterior samples
    trace = pm.sample(draws=1000, tune=1000, chains=4, random_seed=self.seed,
                      return_inferencedata=False,
                      compute_convergence_checks=True)

# inspection, e.g.:
trace.model_logp[0:10]
trace.beta[0:5]

I know the seeding behavior was somewhat unpredictable at some point. @ricardoV94 what’s the current status on that?

It's a bit of a wild guess, but a quick Google search suggests that some MKL floating-point operations are not necessarily deterministic, and I see you have a dot operation there: numpy - Same Python code, same data, different results on different machines - Stack Overflow
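
As a quick illustration (not PyMC-specific), you can check which BLAS build NumPy is linked against and see that the summation order affects the result:

import math
import numpy as np

# Show which BLAS/LAPACK build NumPy is linked against (MKL, OpenBLAS, ...)
np.show_config()

# Floating-point addition is not associative, so different reduction orders
# (different BLAS builds, thread counts, or instruction sets) can give
# slightly different dot products for identical inputs
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
print(np.dot(x, y))      # BLAS kernel, implementation-dependent ordering
print(np.sum(x * y))     # NumPy's pairwise summation
print(math.fsum(x * y))  # exactly rounded reference sum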

Can you try a model with just a simple unobserved normal?

with pm.Model() as m:
  x = pm.Normal("x")
  pm.sample(..., random_seed=self.seed)

Also, do the final results look very different across the different machines?

Thanks @ricardoV94 - those were useful tips.

I tried naively applying various MKL flags and didn't have much luck with that, but your tip about the dot operator paid off: replacing dot with a solution that manually sums the products has, for the moment, resolved the issue. It has introduced a new issue, however, which I will describe in a subsequent reply.
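
Roughly, the replacement looks like this (a sketch; the exact code isn't shown here):

# Workaround sketch: replace pm.math.dot with an explicit, hard-coded sum of
# products so that no BLAS matrix-vector kernel is involved
with pm.Model() as model:
    sigma = pm.HalfNormal('sigma', sigma=2)
    intercept = pm.Normal('intercept', mu=20, sigma=200)
    beta = pm.Normal('beta', mu=15, sigma=20, shape=3)

    mu = (intercept
          + beta[0] * df['x1'].values
          + beta[1] * df['x2'].values
          + beta[2] * df['x3'].values)
    pm.Normal('likelihood', mu=mu, sigma=sigma, observed=df['y'])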

One other interesting observation I made with the original model (with the dot operator) was that using just a subset of the observation data could result in agreement between the two machines. I'm struggling to explain this one, but I guess it could fit under the "vague indeterminate floating-point operations" umbrella (?)

The differences were significant enough for the model to converge successfully on one machine and not on the other. We noticed it because we run a convergence check in CI, and it was that model which was failing to converge; otherwise we might not have spotted the issue.
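
For context, the CI check is roughly along these lines (a sketch only, reusing the model object from the snippet in the first post; the real check isn't shown here):

import arviz as az

# Sketch of a convergence gate for CI: sample with return_inferencedata=True
# and fail the build if any r_hat exceeds a threshold
with model:
    idata = pm.sample(draws=1000, tune=1000, chains=4, random_seed=1234,
                      return_inferencedata=True)

rhat = az.rhat(idata)
assert all(float(rhat[name].max()) <= 1.01 for name in rhat.data_vars), \
    "model failed to converge"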

Thanks for your help on this.

It seems your model might be particularly unstable, which may be a case where those "non-deterministic" operations play a bigger role (still just a hypothesis).

The issue mentioned above relates to why I was previously using the dot operator: to allow the dimensions of the model to scale automatically with the number of columns of observations supplied.

Previously I had something like:

inputs  # dataframe with N rows and M input columns (x1, x2, x3 in the original example)

coords = {"observations": inputs.index, "input_variables": inputs.columns}

with pm.Model(coords=coords) as model:
  ...
  # beta size set to M
  beta = pm.Normal('beta', mu=15, sigma=20, dims="input_variables")

  # observed size set to N
  data_observed = pm.MutableData("data_observed", output, dims="observations")
  # input data size set to N x M
  data_input = pm.MutableData("data_input", inputs, dims=("observations", "input_variables"))

  # Intention is for mu to be size N, each entry equal to c0 + (c1 * x1) + (c2 * x2) + (c3 * x3) 
  mu = intercept + pm.math.dot(data_input, beta)
  pm.Normal('observed', mu=mu, sigma=sigma, observed=data_observed)
  ...

If a column is added to inputs, then the size of beta etc. updates automatically.

Is there a clean way to achieve the same goal without using the dot operator?
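
For example, would something like an element-wise product summed over the columns be a reasonable substitute? (Sketch below, reusing the names from the snippet above.)

# Possible substitute for pm.math.dot(data_input, beta): broadcast the
# element-wise product and sum over the input_variables axis, so the model
# still scales with inputs.columns automatically
mu = intercept + (data_input * beta).sum(axis=-1)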

I don't think you want to get rid of the dot operator (it should be much faster), but instead investigate the underlying model instability that surfaced because of it.

That sounds sensible. Is there any general guidance anywhere on approaches to take when investigating instabilities? I'm fairly new to Bayesian approaches. The real input data includes some categoricals, one-hot encoded, which I can imagine might be more problematic than numeric inputs.

I suggest you open a new topic for that specific question; you'll probably have to share some details about the data as well.
