Lowest level in 4-level hierarchical Bayes model has zero-derivative errors

Hello there,

I am working on a hierarchical Bayes model with four levels, which I want to estimate using pymc3.
I am testing the model on simulated data, so I can compare the parameter estimates to the true values. My problem is that the model works fine with three levels (accurate estimates on all three), but raises a “Mass matrix contains zeros on the diagonal” error when I include the fourth level, reporting that the derivatives of the fourth-level variables are zero. I have included my code below, including the data generation.

I call the four levels channel - product - cluster - campaign, from highest to lowest (the problem is media-related). Note that I assume each campaign is present in each cluster, each cluster in each product, and each product in each channel, so the levels are not hierarchically nested. The ‘cluster’ level is therefore really the level of (cluster, product, channel) combinations, the ‘product’ level that of (product, channel) combinations, and so on.

I don’t see what the problem with this fourth level is, and I can’t find anyone modeling a 4-level hierarchical model online. The only way I have found to prevent the error is to make the priors more informative (i.e., give them a lower variance), but that makes the estimates too restrictive around their higher-level estimate, while in the data-generating process I still draw all parameters from their higher level with quite a large variance. I have tried init='advi+adapt_diag', but it gives the same error. As you can see from the code, I have also used a non-centered parameterization, but that doesn’t help either.

Thanks in advance for any of your help.


import itertools
import numpy as np
import pymc3 as pm

# generate the parameters in a hierarchically related fashion
b1 =  [1,5,10]                               #channel parameters
b2 =  np.random.normal(b1,2, (3,3))          #product parameters
b3 =  np.random.normal(b2,2, (3,3,3))        #cluster parameters
b  =  np.random.normal(b3,2, (3,3,3,3))      #campaign parameters
X_media = np.random.normal(0,1,(400, 3*3*3*3))  #generate data with 400 observations for each campaign/cluster/product/channel combination
Y = np.dot(X_media,b.reshape(3*3*3*3))   #linear model
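One thing worth double-checking in the generation step is the axis order: np.random.normal broadcasts its mean over the last axis, so the channel index ends up last in every parameter array (which is what the index arrays below rely on). A quick sanity check, using a tiny standard deviation purely to make the broadcasting visible:

```python
import numpy as np

np.random.seed(0)
b1 = np.array([1, 5, 10])
# the mean b1 is broadcast over the LAST axis, so every row of b2 is a
# fresh draw centred on [1, 5, 10]: axis 1 = channel, axis 0 = product
b2 = np.random.normal(b1, 0.01, (3, 3))
print(np.round(b2))  # every row rounds to [1, 5, 10]
```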
n_channels=3              #number of channels
n_products=3*3            #number of channel-product combinations
n_clusters=3*3*3          #number of channel-product-cluster combinations
n_campaigns=3*3*3*3       #number of channel-product-cluster-campaign combinations
#the b parameter has been flattened, and so the columns of X coincide with the order in which they've been flattened. 
#I use np.unravel_index to find out what the coordinates of the reshaped b would have been 
#in its original high-dimensional form, and these coordinates represent (campaign,cluster,product,channel).

#The camp_to_clust dictionary contains the corresponding (cluster,product,channel) combination.
camp_to_clust = {}
for i in range(n_campaigns):
    camp_to_clust[i] = tuple(np.concatenate(np.unravel_index([i], (3,3,3,3))[1:4]))
#All possible (cluster,product,channel) combinations.
clust_dict = {}
clust_list = [p for p in itertools.product([0,1,2], repeat=3)]
for i in range(len(clust_list)):
    clust_dict[i] = clust_list[i]

#Corresponding (product,channel) combinations of each (cluster,product,channel).    
clust_to_prod = {}
for i in range(n_clusters):
    clust_to_prod[i] = clust_dict[i][1:3]
#All possible (product,channel) combinations.
prod_dict = {}
prod_list = [p for p in itertools.product([0,1,2], repeat=2)]
for i in range(len(prod_list)):
    prod_dict[i] = prod_list[i]
#Corresponding (channel) of each (product,channel).    
prod_to_chan = {}
for i in range(n_products):
    prod_to_chan[i] = prod_dict[i][1]
#all channels
chan_dict = {}
chan_list = [p for p in itertools.product([0,1,2], repeat=1)]
for i in range(len(chan_list)):
    chan_dict[i] = chan_list[i][0]
#invert each dictionary's keys and values
clust_dict_inv = dict(map(reversed, clust_dict.items()))
prod_dict_inv = dict(map(reversed, prod_dict.items()))
chan_dict_inv = dict(map(reversed, chan_dict.items()))
#create index arrays to move from each lower level combination to its higher level form.
campaign_to_cluster_idx = np.zeros(3*3*3*3)
for i in range(3*3*3*3):
    campaign_to_cluster_idx[i] = clust_dict_inv[camp_to_clust[i]]
cluster_to_product_idx = np.zeros(3*3*3)
for i in range(3*3*3):
    cluster_to_product_idx[i] = prod_dict_inv[clust_to_prod[i]]
product_to_channel_idx = np.zeros(3*3)
for i in range(3*3):
    product_to_channel_idx[i] = chan_dict_inv[prod_to_chan[i]]

product_to_channel_idx = product_to_channel_idx.astype(int)
cluster_to_product_idx = cluster_to_product_idx.astype(int)
campaign_to_cluster_idx = campaign_to_cluster_idx.astype(int)
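As a side note, because every level is flattened in C order with the new (child) axis first, the three index arrays reduce to modular arithmetic. This is just a sketch of an equivalent, shorter construction of the same mapping (it does not change the model):

```python
import numpy as np

# flat campaign index i = campaign*27 + cluster*9 + product*3 + channel,
# so dropping the leading coordinate is just "mod 27" (and likewise below)
campaign_to_cluster_idx = np.arange(3 ** 4) % 3 ** 3  # campaign -> (cluster, product, channel)
cluster_to_product_idx = np.arange(3 ** 3) % 3 ** 2   # cluster  -> (product, channel)
product_to_channel_idx = np.arange(3 ** 2) % 3        # product  -> channel
```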
#Bayesian model
with pm.Model() as hierarchical_model:
    # Intercept prior
    a = pm.Normal('a', mu=0, sigma=1)
    # variances
    channel_level_sigma = pm.HalfCauchy('channel_level_sigma', 1)
    product_level_sigma = pm.HalfCauchy('product_level_sigma', 1)    
    cluster_level_sigma = pm.HalfCauchy('cluster_level_sigma', 1) 
    campaign_level_sigma = pm.HalfCauchy('campaign_level_sigma', 1) 

    # Priors for the four levels
    mu_channel = pm.Normal('mu_channel', mu=0, sigma=100)
    channel_level_offset = pm.Normal('channel_level_offset', mu=0, sigma=1, shape=n_channels)
    channel_level = pm.Deterministic('channel_level', mu_channel + channel_level_offset*channel_level_sigma )
    product_level_offset = pm.Normal('product_level_offset', mu=0, sigma=1, shape=n_products)
    product_level = pm.Deterministic('product_level', channel_level[product_to_channel_idx] + product_level_offset*product_level_sigma )
    cluster_level_offset = pm.Normal('cluster_level_offset', mu=0, sigma=1, shape=n_clusters)
    cluster_level = pm.Deterministic('cluster_level', product_level[cluster_to_product_idx] + cluster_level_offset*cluster_level_sigma )
    campaign_level_offset = pm.Normal('campaign_level_offset', mu=0, sigma=1, shape=n_campaigns)
    campaign_level = pm.Deterministic('campaign_level', cluster_level[campaign_to_cluster_idx] + campaign_level_offset*campaign_level_sigma )
    # Model error
    eps = pm.HalfCauchy('eps', 10.)
    # Main
    mu_y = a +  pm.math.dot(X_media,campaign_level) 
    # Data likelihood
    y = pm.Normal('kpi', mu=mu_y, sigma=eps, observed=Y)
    trace = pm.sample(draws = 500, tune = 2000, chains=4, progressbar=True)

Hi Mike,
Yeah, I think the problem is partly coming from some of your priors: maybe try Exponential instead of HalfCauchy for your sigmas (still fat-tailed, but less so than the Cauchy). The sigma of mu_channel seems really high too, especially for a slope. In general, you need much more regularizing priors in hierarchical models, especially with so many levels. I think prior predictive checks could help you here.
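To make the tail difference concrete, here is a rough Monte Carlo comparison (plain numpy, not pymc3) of a HalfCauchy(beta=1), as in your model, against the Exponential(1) I'm suggesting:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
half_cauchy = np.abs(rng.standard_cauchy(n))  # ~ HalfCauchy(beta=1)
exponential = rng.exponential(1.0, n)         # ~ Exponential(lam=1)

# the exact 99th percentiles are tan(0.99 * pi / 2) ~= 63.7 for the
# HalfCauchy and -log(0.01) ~= 4.6 for the Exponential: both are
# weakly informative, but the Cauchy regularly puts prior mass on
# huge group-level sigmas, which NUTS can struggle with
print(np.percentile(half_cauchy, 99))
print(np.percentile(exponential, 99))
```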

I wonder, though, why you’re modeling four levels if your real data are not actually hierarchically nested, as you said. From what I understood, your simulated data do seem to be hierarchical (b depends on b3, which depends on b2, etc.).

Maybe what you’re looking for is modeling 4 different clusters in the same hierarchical layer? If that sounds interesting, here is a thread discussing it, and a notebook implementing it (code 13.22).

Hope this helps :vulcan_salute: