Prior from a 2D histogram

Hello,
I am using PyMC3 to solve a problem, and I need help.
In short I have three variables x, y, and z, and I am just fitting some lines in the form y=mu+alphax+betaz, which in PyMC3 I write as:

    alpha = pm.Normal('alpha', mu=0, sd=4,shape=1)
    beta = pm.Normal('beta', mu=0, sd=4,shape=1)
    mu = pm.Normal('mu', mu=0, sd=1,shape=1)
    sigma = pm.HalfNormal('sigma', sd=1,shape=1)

    **xtzt = pm.MvNormal('xtzt',mu=mu_Lz, cov=cov_Lz, shape=(N,2))**
    xt=xtzt[:,0]
    zt=xtzt[:,1]
    yt = pm.Normal('yt',mu=mu+alpha*xt+beta*zt,sd=sigma,shape=N)


    like_x = pm.Normal('like_x',mu=xt,sigma=sigma_x_std,shape=N,observed=x_std)
    like_y = pm.Normal('like_y',mu=yt,sigma=sigma_y_std,shape=N, observed=y_std)
    like_z = pm.Normal('like_z',mu=zt,sigma=sigma_z_std,shape=N, observed=z_std)

As you can see, x and z are drawn here from a Multivariate Normal distribution, and y are drawn from Normal distribution around the mentioned line, with some scatter given by sigma. This works and gives some reasonable result, however this is not exactly what I want.

In fact, the prior on x-z distribution should not be a Multivariate Normal, but a distribution of which I do not know the formula, but I know the 2D distribution in the x-z plane, i.e. at given (x,z) on a grid I have the 2D-PDF (the grid is quite fine with 1000x1000 bins).

Now I know the two requirement by PyMC3 to use your own distribution as a new class is to have two functions that give logp and random, e.g.:

def logp(x,z):
ix=np.round((np.log(x)-np.log(x_range[0]))/x_step).astype(np.int32)
iz=np.round((z-z_range[0])/z_step).astype(np.int32)
return np.log(prob[ix,iz])

def random(size=1):
cdf = np.cumsum(prob.ravel())
cdf = cdf / cdf[-1]
values = np.random.rand(size)
value_bins = np.searchsorted(cdf, values)
x_idx, z_idx = np.unravel_index(value_bins,(len(x),len(z)))
random_from_cdf = np.column_stack(( x[x_idx], z[z_idx]))
return random_from_cdf

As you can see my grid is x-z, with x_step and z_step the increment on x and z. I know that I am not returning something very accurate, but given the fact that the grid is fine, is a good enough approximation for logp.

For the random instead I flatten all the distribution and get the cumulative and then use an approach similar to what is usually done in 1D, i.e. invert the CDF.

However I am not able to write this in a format compatible with a PyMC3 class. Therefore I was wondering if someone ever encountered a similar case, and if you have already have some suggestion on this.

TLDR; Does someone know a way to give to PyMC3 not a known MvPDF, but a 2D PDF computed on a grid. (x and z are continuous even if the PDF is on a grid)

Just in case it saves you some trouble, you only need the random method if you want to do prior or posterior predictive sampling. If all you need is the posterior trace (what you get with pm.sample()), it’s fine to only define the logp method.

Thanks, I also think I only need the logp method. In fact I was playing around with DensityDist and Potential, however the function as they are keep giving me errors when loading it as a model.

The real issue is, as far as I’ve tried with DensityDist for instance, that the logp function as I’ve wrote assumes input numpy array, while this is not what PyMC3 wants, and in fact gives error…

1 Like

I guess this is related to the fact that we cannot compute gradient for arbitrary PDFs?

In case someone find this helpful, for the 1D case (in v5), you can use pymc.Interpolated, but I haven’t been able to use the 2D PDFs.

Maybe there’s a way to implement the equivalent multivariate version from scipy.interpolate and implement the required logp, rand etc.

This is another option: prior_from_idata — pymc_experimental 0.0.5 documentation

1 Like