Two-stage Bayesian regression enforcing a fixed distribution (not Just Hierarchical regression)

I want to implement two-stage Bayesian regression. Where one models posterior feeds into the second model (i.e separate regression). It appears that PyMC only allows free random variables that are to be inferred or observed data and cannot take “fixed distributions” to be sampled as input.

What I want to achieve in the real-world problem:

My attempt at a solution (with a toy example):
In a previous related post @junpenglao suggested using a random stream to sample a “fixed/specified normal” distribution: [Prevent prior from updating? - #5 by mathiasp]

Following a similar approach, I created a custom aesara Op to sample from a fixed Kernal Density Estimate (simplified to t-distribution for a toy example) of the posterior from the first regression step. Here is a toy example that works to enforce a t-dist with loc=5,scale=1,df=10 :

import numpy as np
import pymc as pm
import scipy.stats as st
import aesara.tensor as at
class MyRV(at.Op):##custom op
    itypes = [at.dscalar]
    otypes = [at.dscalar]
    def perform(self,node, inputs_storage, output_storage): 
        (dummy_tensor_var,) = inputs_storage#dont do anything with input
        #or could also be a KDE from posterior from regression step 1 e.g.
MyRV=MyRV()##create op
srng= at.random.utils.RandomStream(1234)
dummy_tensor_var=srng.normal(10, 1)
bayesModel =pm.Model()

with bayesModel as m:
    ##custom Op t-dist(or KDE) randomstream
    post_from_regresion_step1 = pm.Deterministic('post_from_regresion_step1',MyRV(dummy_tensor_var))  
    ##infered free variables   
    sigma_likelihood = pm.HalfNormal('sigma_likelihood',1) 
    #some hierarchical function using our enforced custom distribution
    Hierachial_calc_result = pm.Deterministic('Hierachial_calc_result',mu-post_from_regresion_step1)
    #liklihood function with data centred at zero
    ## mu must therefore become equal to "post_from_regresion_step1" to minimise error           
    y = pm.Normal('y', Hierachial_calc_result, sigma_likelihood, observed=[-0.2,-0.1,0.0,0.1,0.2] )
    trace = pm.sample(draws=1*10**4,chains=1)

The graph of the toy example:

The trace of the toy example gives the expected result with our custom “enforced” t-distribution “post_from_regression_step_1” having a mean of 5 and std of 1. Further calculations in the hierarchical model based on these samples also work.

However, in the real-world Hierarchal model, the NUTS sampler fails with: “zero gradient error”.
Are we allowed to inject custom random samples in PyMC graph without NUTS knowing the logp and gradient info?

How can I input a fixed distribution from one regression stage into the next?

I think you are on the right track, but probably it is possible to use the RandomVariable from aesara directly. Maybe @ricardoV94 could advice a bit more?

I don’t think you can have a stochastic input to NUTS, how would it work?

Even ignoring the fact you can’t differentiate wrt to it, NUTS evaluates the logp graph multiple times per iteration. In every evaluation the value of the random node would change, and my guess is that would really screwup with NUTS?

CC @aseyboldt

I agree tend to agree. But I don’t know the internal workings of PyMC/graphs/derivatives ect.

My “hacky method” works but only for toy examples, any nonlinear function in the hierarchical model breaks the MCMC sampler.

How would you @ricardoV94 @junpenglao suggest inputting a fixed distribution in PyMC? Surely two-stage/sequential regression should be possible?

You would need to assign a custom sampler to the stochastic variable so that the nuts update happens separately from the random draw. Similar to how discrete variables are sampled separately from continuous ones in PyMC

This will become a stochastic likelihood and HMC does not differentiate over these stochastic node. You don’t need a custom sampler for this.

I am surprised that would work. I would imagine you need to hold the stochastic variable constant during the NUTS proposal. How will it use gradients if they change (discretely) at every evaluation?

The custom sampler would be just to hold the stochastic variable constant while NUTS updates.

@ricardoV94 That seems easy to implement; ill try it out. Do maybe have other suggestions @junpenglao?

step1=pm.Slice([post_from_regresion_step1])##Custom stochastic vairable
step2=pm.NUTS([sigma_likelihood,mu])##free vairables
trace = pm.sample(draws=1*10**4,step=[step1,step2])

From the PyMC docs for Compound Steps: "The concern with mixing discrete and continuous sampling is that the change in discrete parameters will affect the continuous distribution’s geometry so that the adaptation (i.e., the tuned mass matrix and step size) may be inappropriate for the Hamiltonian Monte Carlo sampling. HMC/NUTS is hypersensitive to its tuning parameters (mass matrix and step size). Another issue is that we also don’t know how many iterations we have to run to get a decent sample when the discrete parameters change. Though it hasn’t been fully evaluated, it seems that if the discrete parameter is in low dimensions (e.g., 2-class mixture models, outlier detection with explicit discrete labeling), the mixing of discrete sampling with HMC/NUTS works OK. However, it is much less efficient than marginalizing out the discrete parameters. "

It works because the stochastic node becomes a static value during the evaluation of logprob, but you are right it is not currently working in v4 because we are checking them here: pymc/ at c8db06b125b1bde286cc00298239980a807d26fa · pymc-devs/pymc · GitHub. Maybe it is easier to do with aeppl.

The custom step method will work for sure by returning a random sample from a fix distribution.


There is no way to make it static unless you pass it as givens or input (hence the need for the custom/ separate sampler). Otherwise a RV in a Aesara graph would evaluate to a new value everytime it’s called.

Having said that, there should be no problem if the variable is sampled separately. Simulator provides such a type of stochastic logp and Metropolis seems happy with it.

On another note, we definitely want to add some simple step method that simply takes random draws from a graph, for example to perform conjugate sampling (or the case raised in this issue). I just haven’t had the time to write something up :slight_smile:

I guess what I dont understand is that why it cannot be a Stochastic node that evaluate to new value without a Step method. In another word it behaves stochastically in forward graph evaluation but it is not a RV in the logp graph

@junpenglao my understanding is that the idea here was to mix sampling of free variables from the two methods: some with NUTS (logp graph) and others via forward draws (RV graph).

The step method is needed only to separare the two types of draws, so that the different types of variables can be held constant across the two graphs: RVs+forward draws held constant while doing logp+NUTS draws and vice-versa (assuming a bidirectional interdependence, although in the original example here the forward RV graph does not depend on any of the free variables being sampled with NUTS)

Yeah with a Step method it works as you can use a step method to “fix” these value by drawing from some static distribution.

My question above is more specific to have something like Prevent prior from updating? - #5 by mathiasp work again.

I think that’s the same principle. Assign the variable you don’t want to update to a separate step sampler that always draws from a given forward function.

I am not sure how to interpret it though. Such variable is not representing uncertainty but an irreducible stochastic process that influences the rest of your observations?

In any case, @Ray_Kruger this is a proof of concept you can take a look:
Say you have model at first level as:

mu0, sigma0 = 10., 1.5
y0 = np.random.normal(mu0, sigma0, size=1000)

with pm.Model() as m:
    mu = pm.Normal('mu', 0, 100)
    sigma = pm.HalfNormal('sigma', 10)
    y = pm.Normal('y', mu, sigma, observed=y0)
    trace = pm.sample(1000)

and another model like:

beta0 = 2.
sigma_new = 4.
X = np.linspace(-10, 10, 100)
y1 = np.random.normal(X * beta0, sigma_new)

with pm.Model() as m2:
    beta = pm.Normal('beta', 0, 100)
    sigma = pm.HalfNormal('sigma', 10)
    y = pm.Normal('y', X * beta, sigma, observed=y1)

To have sigma fixed in m2 using the posterior from m1, you can do:

from pymc.step_methods.arraystep import ArrayStepShared
from pymc.blocking import RaveledVars

class NormalProposal:
    def __init__(self, loc, scale):
        self.loc = loc
        self.scale = scale

    def __call__(self, rng=None, size=()):
        if rng is None:
            rng = np.random
        return rng.normal(self.loc, scale=self.scale, size=size)

class FixedDistSample(ArrayStepShared):
    """Return sample from a fixed proposal distribution.

    name = "fixed_dist_sample"

    generates_stats = False

    def __init__(self, vars, proposal_kwarg_dict, model=None):
        model = pm.modelcontext(model)
        initial_values = model.initial_point()

        vars = [model.rvs_to_values.get(var, var) for var in vars]
        vars = pm.inputvars(vars)
        initial_values_shape = [initial_values[].shape for v in vars]
        self.ndim = int(sum( for ivs in initial_values_shape))
        self.proposal_dist = NormalProposal(**proposal_kwarg_dict)

        shared = pm.make_shared_replacements(initial_values, vars, model)
        super().__init__(vars, shared)

    def astep(self, q0: RaveledVars) -> RaveledVars:
        point_map_info = q0.point_map_info
        q0 =
        q = self.proposal_dist(size=self.ndim)

        return RaveledVars(q, point_map_info)

from scipy.stats import t, norm
sigma_log__ = m.rvs_to_values[m.free_RVs[1]].tag.transform.forward(np.ravel(trace.posterior['sigma'].values))
sigma_log__ = aesara.function([], sigma_log__)()
# df, loc, scale =
loc, scale =

with m2:
    fixed_step = FixedDistSample([sigma], {'loc': loc, 'scale': scale})
    trace3 = pm.sample(1000, step=fixed_step)


You will see that sigma in trace3 follow the posterior from m instead of m2.

You can find the complete notebook here: link


Maybe this is a stupid question, but doesn’t differentiating though a stochastic process require you to pay attention to Ito terms? Would the symbolic derivatives computed by Aesara be correct in this case?

I understand it as a marginalizing process over some stochastic node, so I will be extremely hand-wavy: “should be asymptotically exact” :grimacing:

@junpenglao Thank you for your solution. It works as intended. If I understand correctly, you created a custom sampler (e.g like Metropolis-Hastings) from the step method class
but instead of taking in logp and returning samples from the logp density like normal samplers do it just samples from our specified distribution and assigns it to the specified viarable (e.g. “sigma”).“sigma” can then be used in for further hierarchical calculations (I’ve tested it on toy calculation). I think the remaining free variables are automatically assigned to the NUTS sampler, and it passes the samples sequentially (if there are two-step methods) between samplers at each step?

Yes, exactly! I think that this is what is happening the math side:

Yep that’s right.