Memory Error when sampling my Hierarchical logistic model

Here are my input variables:

from sklearn.model_selection import train_test_split
import numpy as np
import theano

X_train_hier, X_test_hier, y_train_hier, y_test_hier = train_test_split(X_hier, y, test_size=.15, random_state=seed)
X_idx = theano.shared(np.asarray(X_train_hier.uid.values))  # per-row user index
X_user = len(np.unique(np.asarray(X.uid)))                  # number of unique users in the full dataset
X_train_hier = X_train_hier.drop("uid", axis=1)             # keep only the predictor columns
X_len = len(X_train_hier.keys())                            # number of predictors
X_data = theano.shared(X_train_hier.values)

Here is my model:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as hier:
    # hyperpriors
    mu_a = pm.Normal("mu_a", 0.0, 1e4)
    sigma_a = pm.Exponential("sigma_a", 1e4)
    mu_b = pm.Normal("mu_b", 0.0, 1e4)
    sigma_b = pm.Exponential("sigma_b", 1e4)

    # intercept prior for each user
    a = pm.Normal("a", mu_a, sigma_a, shape=X_user)

    # coefficient prior for each user
    b = pm.Normal("b", mu_b, sigma_b, shape=(X_len, X_user))

    # likelihood
    theta = pm.Deterministic("theta", pm.invlogit(tt.sum(a[X_idx] + tt.dot(X_data, b[:, X_idx]))))

    # observation from the sigmoid
    obs = pm.Bernoulli(name="AT", p=theta, observed=y_train_hier)

So the model runs, and the output model is shown in an image in the original post (not reproduced here).
The real problem is when I start to sample the model:

with hier:
    start = pm.find_MAP()
    trace_hier = pm.sample(
        4000,
        tune=4000,
        init="advi+adapt_diag_grad",
        start=start,
        chains=4,
        random_seed=seed,
        return_inferencedata=True,
    )

It gives me an error like this:

MemoryError                               Traceback (most recent call last)
~\AppData\Roaming\Python\Python38\site-packages\theano\compile\function\types.py in __call__(self, *args, **kwargs)
    973             outputs = (
--> 974                 self.fn()
    975                 if output_subset is None

MemoryError: failed to alloc dot22 output

During handling of the above exception, another exception occurred:

MemoryError                               Traceback (most recent call last)
<ipython-input-19-62b506239b95> in <module>
      1 with hier:
----> 2     start = pm.find_MAP()
      3     trace_hier = pm.sample(4000, 
      4                              tune=4000,
      5                              init="advi+adapt_diag_grad",

~\AppData\Roaming\Python\Python38\site-packages\pymc3\tuning\starting.py in find_MAP(start, vars, method, return_raw, include_transformed, progressbar, maxeval, model, *args, **kwargs)
    102     else:
    103         update_start_vals(start, model.test_point, model)
--> 104     check_start_vals(start, model)
    105 
    106     start = Point(start, model=model)

~\AppData\Roaming\Python\Python38\site-packages\pymc3\util.py in check_start_vals(start, model)
    230             )
    231 
--> 232         initial_eval = model.check_test_point(test_point=elem)
    233 
    234         if not np.all(np.isfinite(initial_eval)):

~\AppData\Roaming\Python\Python38\site-packages\pymc3\model.py in check_test_point(self, test_point, round_vals)
   1381 
   1382         return Series(
-> 1383             {RV.name: np.round(RV.logp(test_point), round_vals) for RV in self.basic_RVs},
   1384             name="Log-probability of test_point",
   1385         )

~\AppData\Roaming\Python\Python38\site-packages\pymc3\model.py in <dictcomp>(.0)
   1381 
   1382         return Series(
-> 1383             {RV.name: np.round(RV.logp(test_point), round_vals) for RV in self.basic_RVs},
   1384             name="Log-probability of test_point",
   1385         )

~\AppData\Roaming\Python\Python38\site-packages\pymc3\model.py in __call__(self, *args, **kwargs)
   1558     def __call__(self, *args, **kwargs):
   1559         point = Point(model=self.model, *args, **kwargs)
-> 1560         return self.f(**point)
   1561 
   1562 

~\AppData\Roaming\Python\Python38\site-packages\theano\compile\function\types.py in __call__(self, *args, **kwargs)
    985                 if hasattr(self.fn, "thunks"):
    986                     thunk = self.fn.thunks[self.fn.position_of_error]
--> 987                 raise_with_op(
    988                     self.maker.fgraph,
    989                     node=self.fn.nodes[self.fn.position_of_error],

~\AppData\Roaming\Python\Python38\site-packages\theano\link\utils.py in raise_with_op(fgraph, node, thunk, exc_info, storage_map)
    506         # Some exception need extra parameter in inputs. So forget the
    507         # extra long error message in that case.
--> 508     raise exc_value.with_traceback(exc_trace)
    509 
    510 

~\AppData\Roaming\Python\Python38\site-packages\theano\compile\function\types.py in __call__(self, *args, **kwargs)
    972         try:
    973             outputs = (
--> 974                 self.fn()
    975                 if output_subset is None
    976                 else self.fn(output_subset=output_subset)

MemoryError: failed to alloc dot22 output
Apply node that caused the error: Dot22(<TensorType(float64, matrix)>, InplaceDimShuffle{1,0}.0)
Toposort index: 5
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(70345, 53), (53, 70345)]
Inputs strides: [(8, 562760), (8, 424)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Elemwise{Add}[(0, 1)](InplaceDimShuffle{x,0}.0, Dot22.0)]]

I can't seem to find anything on the Internet about this “failed to alloc dot22 output” error. Can someone help me understand what went wrong in my model? Any help is appreciated!

It looks like you are running out of RAM during the dot product while trying to calculate theta.

Given that theta is deterministic and can be calculated from the other parameters of the trace, I would not wrap it in a Deterministic statement. The code will work the same; the only difference is that theta won't be stored in the trace. If the reason for the memory error is that the whole trace is too big, this will reduce the memory requirements and probably fix the issue (there is always the possibility that this does not reduce memory consumption enough, which would require extra changes in addition to this one). If you really need the values of theta, you can always calculate them afterwards using libraries like dask that support out-of-memory computation.
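A minimal sketch of that change (the rest of the model stays exactly as above):

# inside the model block, use a plain Theano expression instead of
# pm.Deterministic; theta still feeds the likelihood but is no longer
# stored in the trace
theta = pm.invlogit(tt.sum(a[X_idx] + tt.dot(X_data, b[:, X_idx])))
obs = pm.Bernoulli(name="AT", p=theta, observed=y_train_hier)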

If the suggestion above is still not enough, I'd recommend running the chains one by one and then combining them. If you use return_inferencedata=True, you can use az.concat to combine multiple traces into a single object that can then be passed to az.summary, az.plot_trace, etc. Note that you should use idata_kwargs={"log_likelihood": False} to keep memory consumption to a minimum if you don't need to perform model comparison, as sketched below.
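A rough sketch of that one-chain-at-a-time approach (assuming ArviZ is imported as az):

import arviz as az

traces = []
with hier:
    for i in range(4):
        # one chain at a time; skipping the pointwise log-likelihood keeps
        # memory consumption to a minimum (only do this if you don't need
        # model comparison)
        traces.append(
            pm.sample(
                4000,
                tune=4000,
                chains=1,
                random_seed=seed + i,  # hypothetical: vary the seed per chain
                return_inferencedata=True,
                idata_kwargs={"log_likelihood": False},
            )
        )

trace_hier = az.concat(*traces, dim="chain")  # combine into one InferenceData
az.summary(trace_hier)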

Hi, Oriol

Thank you very much for the reply. However, even when I successfully load the model (see below), I still cannot run the sampling, even with one chain. The problem seems to be excessive use of RAM to store the model: it takes almost 27 GB of my RAM (I have 32 GB installed on the desktop), though the dataset is only 17 MB.

Would you suggest a better model specification that allows me to build this hierarchical logistic model?

Note that since each one of my 7420 layers represents a person, and around 22 variables are associated with each person, the data may be over 70 thousand rows but those variables only take at most 7420 unique combinations. So, in this case, do you think it might be a good idea to find some way of specifying those person-related variables over 7420 indexes first, and then add them to the rest of the data? (A rough sketch of what I mean is below.)
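For illustration, one rough way to build such an index (a hypothetical sketch; it assumes uid is still a column at this point and that person_cols lists the person-level columns):

people = X_train_hier.groupby("uid").first()             # one row per unique uid
uid_to_pos = {u: i for i, u in enumerate(people.index)}  # uid -> layer position
person_idx = X_train_hier.uid.map(uid_to_pos).values     # per-row layer index
X_people = theano.shared(people[person_cols].values)     # (n_people, n_person_vars)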

Thank you again for taking the time to read this!

On a closer look, it seems that the dot product is trying to build a 70k x 70k matrix, which is probably what requires so much memory (if your original dataset is roughly 70k rows by 53 columns, this matrix has over a thousand times as many entries, putting it in the tens of GB). Is this really what you want?
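A quick back-of-the-envelope check (float64 takes 8 bytes per entry, and the Dot22 output in the traceback is 70345 x 70345):

n = 70345                # rows of the training data, per the traceback
print(n * n * 8 / 1e9)   # ~39.6 GB for the dot output alone, more than the 32 GB of RAM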

Hi, Oriol

I actually do not need a 70k x 70k matrix. What I mostly want is 7042 layers of alpha, beta 1 (matching the 30-ish person-level variables), and beta 2 (matching the 20-ish event-level variables). Is this possible to represent in a hierarchical model?

Thank you so much for your patience!

Something seems wrong here, you have

\theta = \Lambda(A + XB)

where A is 1 \times 7420 and B is 53 \times 7420. The only way this works out is when X is 1 \times 53, in which case \theta is 1 \times 7420; but in the image it is 1 \times 70345? Is your X_idx doing what you think it is? Is there an extra axis being summed over?
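One quick way to sanity-check what the dot is actually being fed is to print the shapes of the shared variables (shapes below taken from the traceback):

print(X_data.get_value().shape)  # (70345, 53)
print(X_idx.get_value().shape)   # (70345,)
# so b[:, X_idx] has shape (53, 70345), and
# tt.dot(X_data, b[:, X_idx]) allocates a (70345, 70345) output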

Hi, chartl

According to Prediction using sample_ppc in Hierarchical model - Questions - PyMC Discourse, I tried the example from that question. Here is what I understand: my X_user is the number of layers of data I want (7042) (the unique values of X.uid), and X_idx is the indexing variable (uid) from my training dataset (X_train_hier.uid.values). So I think the correct model should have around 6000 layers, since X_user counts the unique “uid” values in the whole dataset (X), whereas X_idx should automatically group rows with the same “uid” from the training dataset (X_train_hier) into one layer, resulting in 6000-ish layers.

I know this is very confusing; I am still working my way around how to specify hierarchical models in PyMC3. Actually, I would prefer to specify my model as:

theta = a[X_idx]                      # X_idx: layer of the unique uid for each row of the (substitutable) input data
      + b1[X_idx, :] * X_data_people  # 21 uid-related variables, the same for each unique uid value
      + b2[X_idx, :] * X_data_event   # 32 event-related variables, different for every row of input data

In general, I think I need to specify a model with a substitutable X_user (the number of unique uids, 7042), X_idx (the unique uids present in the input data, 6000+), X_data_people (uid-related variables, 7042 rows), and X_data_event (event-related variables, 70345 rows).

Could you help me sketch out a possible way of constructing such a model?
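For illustration, a rough sketch of such a specification (hypothetical names: X_people would be a shared (7042, 21) array of the person-level variables, X_event a shared (70345, 32) array of the event-level variables; priors are placeholders):

with pm.Model() as hier2:
    a = pm.Normal("a", 0.0, 1.0, shape=X_user)           # one intercept per person
    b1 = pm.Normal("b1", 0.0, 1.0, shape=(X_user, 21))   # person-level coefficients
    b2 = pm.Normal("b2", 0.0, 1.0, shape=(X_user, 32))   # event-level coefficients

    theta = pm.invlogit(
        a[X_idx]
        + tt.sum(X_people[X_idx] * b1[X_idx], axis=1)    # row-wise products, no (n, n) matrix
        + tt.sum(X_event * b2[X_idx], axis=1)
    )
    obs = pm.Bernoulli("AT", p=theta, observed=y_train_hier)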

I think you want to multiply X_data elementwise with b[:, X_idx] (one of them needs to be transposed so the shapes line up) and then sum over the variable axis, something like:

pm.invlogit(a[X_idx] + tt.sum(X_data * b[:, X_idx].T, axis=1))

The A Primer on Bayesian Methods for Multilevel Modeling — PyMC3 3.10.0 documentation notebook should shed some extra light on that; it has several hierarchical models implemented.

IIUC, you want to “loop” over individuals and, for each individual, multiply the corresponding betas by the corresponding x values and add all of that up to get the contribution for that individual. The dot is doing this calculation for all combinations of individuals, i.e. in the 70k x 70k matrix, the (2, 78) position corresponds to the x values of individual #2 combined with the betas of individual #78; I think you only want the diagonal of that.
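A tiny NumPy illustration of that point (small shapes, so the full matrix is cheap to build for comparison):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                 # 6 observations, 3 variables
B = rng.normal(size=(3, 6))                 # one column of betas per observation

full = X @ B                                # the (6, 6) all-combinations matrix
rowwise = (X * B.T).sum(axis=1)             # per-observation terms only, no (n, n) allocation
print(np.allclose(np.diag(full), rowwise))  # True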


Hi, Oriol

Thank you very much for the help. Indeed, I should not have used tt.dot here; you were right about the dimensions! I appreciate your patience and expertise!