Clustered Error Modeling

Hi, I am trying to implement a particular hierarchical model, which flattens to a linear regression with a well-defined error structure. Namely:

Y(ij) = b0 + b1*var_one(j) + b2*var_two(j) + b3*var_three(ij) + e(j) + e(ij)
where i is the index within a group and j is the group of the observation

However, I cannot figure out how to get the e(j) to not be redrawn for every observation. It seems like pymc3 is forwarding the normal error through and redrawing it for each observation. I’ve included a minimal example of the network and flattened behavior below. Is there a straightforward way for me to implement this behavior?

Thank you for your time!

```python
import numpy as np
import numpy.random as rnd

from pymc3 import Model, Normal, HalfCauchy, sample, traceplot, model_to_graphviz


class Template:
    def __init__(self, var_one=None, var_two=None):
        self.var_one = rnd.uniform(0, 10) if var_one is None else var_one
        self.var_two = rnd.uniform(0, 10) if var_two is None else var_two


class Instance:
    def __init__(self, var_three=None, template=None, score=None):
        self.var_three = rnd.uniform(0, 10) if var_three is None else var_three
        self.template  = Template()         if template  is None else template
        self.score = (1/(1 + self.var_three + self.template.var_two)
                      if score is None else score)


templates = [Template() for _ in range(10)]
instances = [Instance(template=rnd.choice(templates)) for _ in range(100)]

var_one   = np.array([template.var_one for template in templates])
var_two   = np.array([template.var_two for template in templates])
var_three = np.array([instance.var_three for instance in instances])
scores    = np.array([instance.score for instance in instances])
which     = np.array([templates.index(instance.template) for instance in instances])


with Model() as net_model:
    b = Normal('b', 0, sd=1, shape=3)
    template_eps = Normal('template_eps', 0, sd=1)

    template = b[0]*var_one + b[1]*var_two + template_eps

    est = template[which] + b[2]*var_three
    instance_eps = HalfCauchy('instance_eps', 1)

    intercept = Normal('intercept', 0, sd=1)
    y = Normal('y', intercept + est, sd=instance_eps, observed=scores)

    net_trace = sample(1000, tune=1000)

model_to_graphviz(net_model)
traceplot(net_trace)


with Model() as flat_model:
    b = Normal('b', 0, sd=1, shape=3)

    est = b[0]*var_one[which] + b[1]*var_two[which] + b[2]*var_three
    instance_eps = HalfCauchy('instance_eps', 1)

    intercept = Normal('intercept', 0, sd=1)
    y = Normal('y', intercept + est, sd=instance_eps, observed=scores)

    flat_trace = sample(1000, tune=1000)

model_to_graphviz(flat_model)
traceplot(flat_trace)
```

I think what you are looking for is a nested error term that is the same for every observation within group j. I would do something like:

```python
with Model() as net_model:
    ...
    template_eps = Normal('template_eps', 0, sd=1, shape=10)
    template = b[0]*var_one + b[1]*var_two + template_eps
    est = template[which] + b[2]*var_three
    ...
```

and

```python
with Model() as flat_model:
    ...
    template_eps = Normal('template_eps', 0, sd=1, shape=10)
    est = b[0]*var_one[which] + b[1]*var_two[which] + b[2]*var_three + template_eps[which]
    ...
```

That is close, but I’d like all the error terms to come from the same learned distribution. I think that using your model and my example of 10 templates and 100 instances, I would end up with 100 draws from 10 different learned distributions for my e(j), rather than 10 draws from one learned distribution that are reused each time the template is referenced.

Does that make sense? Is this possible?

I don't think that is a correct understanding.

The best way to see it is to count the number of free parameters. Here e(j) has 10 elements, which means there are 10 "draws" or "samples" from one prior distribution; each of them is then broadcast to the 100 observations.
To see this, it also helps to rewrite the linear formula:
Y(ij) ~ Normal(b0 + b1*var_one(j) + b2*var_two(j) + b3*var_three(ij) + e(j), sigma)
where sigma is the standard deviation of the observation noise e(ij) (instance_eps in the code). Written this way, e(j) is no different from the other linear predictors var_one or var_two: if you treat it as known, you can index it in the same way. It also avoids the confusion of treating e(ij) as if it were the same kind of indexable term as e(j).
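To make the parameter counting concrete, here is a small numpy-only sketch (separate from the pymc3 model; the variable names are mine) showing that the 10 values of e(j) are reused, not redrawn, when they are indexed by `which`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_templates, n_instances = 10, 100

# 10 free parameters: one value of e(j) per template, all drawn
# from the same Normal(0, 1) distribution.
e_j = rng.normal(0.0, 1.0, size=n_templates)

# Each instance points at one of the 10 templates.
which = rng.integers(0, n_templates, size=n_instances)

# Indexing broadcasts the 10 group errors to 100 observations;
# no new randomness is introduced here.
e_per_instance = e_j[which]

print(e_per_instance.shape)             # (100,)
print(np.unique(e_per_instance).size)   # at most 10 distinct values
```

The 10 free values here play the same role as `template_eps` with `shape=10` in the suggested model: sampling learns 10 numbers, and indexing merely repeats them across observations.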

That would mean one prior distribution for the e(j)s, but 10 posteriors learned from the data. I would like to learn one posterior for the 10 e(j)s like I am learning one posterior for the 100 e(ij)s.

Here you do have one prior distribution for e(j): it's a Normal(mu, sd) with mu=0 and sd=1. If you want one shared, learned posterior distribution for the 10 e(j)s, you will need to define a hyperprior (e.g., on the mu of that Normal).

I started out by putting a hyperprior on the mean of the error term and giving it a shape of 1, but that seemed to result in the random variable being sampled for each instance rather than for each template. I am guessing at this behavior, since the graphviz of my net_model shows the hyperprior connected to the e(j) term, which is connected directly to the y term rather than to the template term.