Censored pymc for power-law distribution of column density

Hi,
I am new to PyMC; I have been learning from the existing example code for the past three days. I am trying to fit a power law to x vs y. Values where y_err is -1 denote an upper limit in y, and y_err = -999 denotes a lower limit in y. I followed exactly one example where they say the censored-modelling issue is solved. Below is my code. My doubt is that I am getting very narrow distributions for the fitted parameters, say A and B (if I use y = A * np.power(x, B)). I do not have true values of A and B to check against, so I want to make sure my code is correct (it seems to me that it is not!)

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
from pymc.distributions.dist_math import normal_lccdf, normal_lcdf
import arviz as az
from copy import copy
import xarray as xr
import aesara.tensor as at



# Define your data
x     = np.array([6.7, 14.9, 7.8, 11.3, 12.1, 20.3, 6.7, 23.4, 16.9, 28.8, 30.0, 29.8, 28.6, 8.0, 5.5, 12.6, 15.1, 29.0, 26.6, 15.2, 13.3, 22.6, 3.2, 9.7, 30.9, 33.9, 30.9, 32.5])
y     = np.array([14.392, 13.393, 14.493, 14.898, 13.187, 13.512, 14.396, 13.489, 12.652, 13.838, 13.506, 12.796, 12.794, 14.586, 14.204, 13.107, 13.702, 12.483, 13.125, 13.392, 14.314, 14.458, 14.944, 14.777, 11.76328143, 12.30545265, 12.02984059, 11.66659327])
y_err = np.array([0.175, 0.059, 0.011, 0.417, 0.01, 0.024, 0.021, 0.013, 0.044, 0.029, 0.083, 0.049, 0.115, 0.011, 0.023, 0.078, 0.028, 0.098, 0.026, 0.034, 0.038, -1, -1, -1, -999, -999, -999, -999])


### y_err = -1:  lower_limits on y
### y_err = -999 : upper limits on y
######### =============  Model 3 ============================
def censored_regression3(x, y, y_err):

    
    # data pre-processing
    right_censored = (y_err < -10)
    left_censored = (y_err > -10) & (y_err < -0) 

        
    x_lc = x[left_censored]
    y_lc = y[left_censored]

    x_rc = x[right_censored]
    y_rc = y[right_censored]
    
    uncensored = (y_err > 0) 
    x_un = x[uncensored]
    y_un = y[uncensored]
    y_err_un = y_err[uncensored]

    
    with pm.Model() as model3:

        A = pm.Normal('A', mu=0, sigma=10)
        B = pm.Normal('B', mu=0, sigma=10)
        σ = pm.HalfNormal("σ", sigma=1)

        
        y_likelihood = pm.Normal("y_likelihood", mu= A * at.power(x_un, B), sigma=np.abs(y_err_un), observed=y_un)
        
        left_censored = pm.Potential("left_censored", normal_lcdf( A * at.power(x_lc, B), σ, y_lc))
        right_censored = pm.Potential("right_censored", normal_lccdf( A * at.power(x_rc, B), σ, y_rc))

    with model3:
        trace3 = pm.sample()

    return model3, trace3
#############=========================================

censored_model3, censored_fit3 = censored_regression3(x, y, y_err)
#az.plot_posterior(censored_fit3) #, var_names=['A'], ref_val=A)

def pp_plot(x, y, fit, ax, clr):

    # plot data
    ax.plot(x, y, "k.")
    
    # plot posterior predicted... samples from posterior
    xi = xr.DataArray(np.array([np.min(x), np.max(x)]), dims=["obs_id"])
    post = az.extract(fit)
    
    y_ppc = post["A"] * np.power(xi, post["B"])  # np.power here: these are posterior draws, not symbolic tensors
    y_ppc = y_ppc.transpose()
    
    ax.plot(xi, y_ppc, c=clr, alpha=0.8, rasterized=True)

    ax.legend()
    ax.set(xlabel="x", ylabel="y")

    
fig, ax = plt.subplots(1, 1, figsize=(10, 5), sharex=True, sharey=True)
pp_plot(x, y, censored_fit3, ax, 'steelblue')
plt.show()

There is functionality in PyMC that lets you censor univariate distributions automatically:

https://www.pymc.io/projects/docs/en/latest/api/distributions/censored.html

and a nice page documenting examples

It will likely amount to what you are doing, but without the added worry of “Am I using the lcdfs and lccdfs correctly?”. Also, if you have a model, you can always use that model (say with numpy.random or PyMC itself) to generate simulated data and check that your predictions are correct.
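For example, a minimal simulation check along those lines (the “true” values, sample size, and flagging thresholds below are all made up, just to see whether censored_regression3 recovers its inputs):

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(42)

# hypothetical true parameters, chosen only for this check
A_true, B_true, scatter = 20.0, -0.15, 0.3

x_sim = rng.uniform(3, 35, size=60)
y_sim = A_true * np.power(x_sim, B_true) + rng.normal(0, scatter, size=60)

# report the true scatter as the measurement error, then flag the extreme points
# as limits the same way the real data is flagged (-1 for high-y, -999 for low-y);
# here the recorded y itself plays the role of the limit
y_err_sim = np.full_like(y_sim, scatter)
y_err_sim[y_sim > np.quantile(y_sim, 0.85)] = -1
y_err_sim[y_sim < np.quantile(y_sim, 0.15)] = -999

model_sim, trace_sim = censored_regression3(x_sim, y_sim, y_err_sim)
az.summary(trace_sim, var_names=["A", "B"])   # do the posteriors bracket A_true and B_true?
```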

I tried testing this using simulation. But this method fails for highly scattered datasets. It also gives a very bad fit when the number of input data points is small. Any comment?

I don’t quite understand why you applied censoring as above. Conventionally you have two limits, say lower = 0 and upper = 10; any data below lower is recorded as 0 (left censoring) and any data above 10 is recorded as 10 (right censoring). In your case, y_err and y (your data to fit) are given as

yerr = [0.175, 0.059, 0.011, 0.417, 0.01, 0.024, 0.021, 0.013, 0.044, 0.029, 0.083, 0.049, 0.115, 0.011, 0.023, 0.078, 0.028, 0.098, 0.026, 0.034, 0.038, -1, -1, -1, -999, -999, -999, -999]

y = [14.392, 13.393, 14.493, 14.898, 13.187, 13.512, 14.396, 13.489, 12.652, 13.838, 13.506, 12.796, 12.794, 14.586, 14.204, 13.107, 13.702, 12.483, 13.125, 13.392, 14.314, 14.458, 14.944, 14.777, 11.76328143, 12.30545265, 12.02984059, 11.66659327]

and you are censoring via y_err with the following:

right_censored = (y_err < -10)
left_censored = (y_err > -10) & (y_err < -0) 

Are you really trying to censor, or do you just want to model the data with unexpected errors separately?
If the latter, I don’t think you should be using censoring-type approaches with CDFs, but possibly three different likelihoods, all with a power law and maybe different parameters.
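For concreteness, that alternative might look something like the sketch below (group masks built from the y_err flags; the names and the shared-exponent choice are just for illustration, not a recommendation):

```python
import pymc as pm
import aesara.tensor as at

# three groups based on the y_err flags defined above
g_un = y_err > 0                      # ordinary measurements
g_m1 = (y_err > -10) & (y_err < 0)    # points flagged with -1
g_m999 = y_err < -10                  # points flagged with -999

with pm.Model() as model_groups:
    # one amplitude and scatter per group, a shared exponent
    A = pm.Normal("A", mu=0, sigma=10, shape=3)
    B = pm.Normal("B", mu=0, sigma=10)
    σ = pm.HalfNormal("σ", sigma=1, shape=3)

    pm.Normal("y_un", mu=A[0] * at.power(x[g_un], B), sigma=σ[0], observed=y[g_un])
    pm.Normal("y_m1", mu=A[1] * at.power(x[g_m1], B), sigma=σ[1], observed=y[g_m1])
    pm.Normal("y_m999", mu=A[2] * at.power(x[g_m999], B), sigma=σ[2], observed=y[g_m999])

    trace_groups = pm.sample()
```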

Or is there a reason you went with a censoring-type approach here that I am not understanding? In any case, if you are confident that censoring is what you need, one thing I would not do is use separate sigmas as in

y_likelihood = pm.Normal("y_likelihood", mu= A * at.power(x_un, B), sigma=np.abs(y_err_un), observed=y_un)
left_censored = pm.Potential("left_censored", normal_lcdf( A * at.power(x_lc, B), σ, y_lc))
right_censored = pm.Potential("right_censored", normal_lccdf( A * at.power(x_rc, B), σ, y_rc))

Because the whole point of censoring is to assume that censored and uncensored data come from the same distribution; you just cannot pinpoint the exact location of the censored data, so you use CDFs instead of PDFs. So if you want to model sigma, I would go with

y_likelihood = pm.Normal("y_likelihood", mu= A * at.power(x_un, B), sigma=σ, observed=y_un)
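Concretely, the full likelihood block with a single shared σ would look something like this (a sketch only, reusing the priors and the x_un / x_lc / x_rc splits from the code above):

```python
with pm.Model() as model3:

    A = pm.Normal("A", mu=0, sigma=10)
    B = pm.Normal("B", mu=0, sigma=10)
    σ = pm.HalfNormal("σ", sigma=1)

    # uncensored points: ordinary normal likelihood, now with the shared σ
    pm.Normal("y_likelihood", mu=A * at.power(x_un, B), sigma=σ, observed=y_un)

    # censored points: CDF-based potentials using the same σ
    pm.Potential("left_censored", normal_lcdf(A * at.power(x_lc, B), σ, y_lc))
    pm.Potential("right_censored", normal_lccdf(A * at.power(x_rc, B), σ, y_rc))
```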

From this change I get something that looks like

[posterior predictive plot omitted]

whereas your code gives me

[posterior predictive plot omitted]

which I assume comes from the fact that your censored and uncensored data come from different distributions.

Again, this depends on your expectations, but normally uncensored and censored data are assumed to come from the same distribution, and usually the left-censored and right-censored data are recorded at constant low and high values. Changing these assumptions means moving away from the territory of censored-data modelling, so you would really need to make sure your model is reasonable. I suggest you read the link I posted above, which clearly shows the assumptions of censored-data modelling.

ps:
I would also urge you to update PyMC (you are using aesara, which suggests your PyMC version might be old) and to use pm.Censored to minimize the risk of making a mistake when setting this up. I have not checked whether you have used the CDFs correctly, though it seems like you have (when I want to do it manually I use CDF(x) and 1-CDF(x), which is less confusing).


Thank you so much for your detailed reply. I do want to use censoring. I modified y_err myself for the upper- and lower-limit cases so that I could flag the censored and uncensored data. For high y I only know the lower limit on y, and for low y I only know the upper limit, so I modified their errors to flag them. Still, with your modification, and as can also be seen from your plot, the fit does not take the last four lower limits on y into account very well, no? Do you think it should really be like this? If the data points beyond some x value have lower y values, then I would expect the fit to be steeper, no? And this is indeed what I get when the data points in my simulation are less scattered.

Need to look at the fit in more detail but I am not near my computer now. Until then, I again suggest trying pm.Censored. In your case, since the censoring values change from point to point, you can supply lower and upper as lists with length equal to the number of points, where lower is -inf if a point is not left-censored and equal to y if it is left-censored, and upper is the symmetric counterpart (inf for uncensored and left-censored points, and the value itself for right-censored points).

With this it is easy to do tests such as removing the censoring by setting lower and upper to the constant values -inf and inf, to see whether the censoring is creating the effect it should. You also remove the risk of setting up the censoring model wrongly.

https://www.pymc.io/projects/docs/en/latest/api/distributions/censored.html
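A minimal sketch of that construction, using np.where, the same y_err flags as in the first code block above, and the power-law mean (the variable names here are just for illustration; whether -1 should map to left- or right-censored is a separate question, discussed further down):

```python
import numpy as np
import pymc as pm
import aesara.tensor as at

# same masks as in the original code; swap them if the limits turn out to be the other way around
left_censored = (y_err > -10) & (y_err < 0)
right_censored = (y_err < -10)

# per-point bounds: the recorded y acts as the bound wherever a point is censored
lower = np.where(left_censored, y, -np.inf)    # left-censored: true value <= recorded y
upper = np.where(right_censored, y, np.inf)    # right-censored: true value >= recorded y

with pm.Model() as model_c:
    A = pm.Normal("A", mu=0, sigma=10)
    B = pm.Normal("B", mu=0, sigma=10)
    σ = pm.HalfNormal("σ", sigma=1)

    latent = pm.Normal.dist(mu=A * at.power(x, B), sigma=σ)
    pm.Censored("y_obs", latent, lower=lower, upper=upper, observed=y)

    trace_c = pm.sample()
```

Setting lower and upper to constant -np.inf and np.inf everywhere turns the censoring off entirely, which is the sanity check mentioned above.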

I also suggest doing a version of the plot above with a single curve and error bars, using the mean and HDIs from the summary, to double check whether the mean parameters also produce biased fits.
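For example, a sketch of such a plot for the power-law model above (here the band is approximated with per-x posterior quantiles of the curve rather than HDIs read off az.summary; variable names follow the earlier code):

```python
import numpy as np
import matplotlib.pyplot as plt
import arviz as az

post = az.extract(censored_fit3)     # posterior samples from the model above
xi = np.linspace(x.min(), x.max(), 200)

# one curve per posterior draw, then a mean curve and an approximate 94% band per x
curves = post["A"].values[:, None] * np.power(xi[None, :], post["B"].values[:, None])
mean_curve = curves.mean(axis=0)
lo, hi = np.percentile(curves, [3, 97], axis=0)

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, color="b", marker="*", s=100)
ax.plot(xi, mean_curve, color="k", label="posterior mean curve")
ax.fill_between(xi, lo, hi, color="gray", alpha=0.3, label="~94% interval")
ax.legend()
ax.set(xlabel="x", ylabel="y")
plt.show()
```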


I am not sure I understood your suggestion correctly, but here is how I modified my code. The fit still does not look good. Also, what if I want to include the errors on my y values as well; how would I do that? I understand that the errors on the lower- and upper-censored points would be well defined in that case, and I would not assign the -10 or -999 values to them.

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
from pymc.distributions.dist_math import normal_lccdf, normal_lcdf
import arviz as az
from copy import copy
import xarray as xr
from scipy.optimize import curve_fit
from scipy.optimize import minimize
import scipy.stats as stats

# Define your data

x = np.array([6.7, 14.9, 7.8, 11.3, 12.1, 20.3, 6.7, 23.4, 16.9, 28.8, 30.0, 29.8, 28.6, 8.0, 5.5, 12.6, 15.1, 29.0, 26.6, 15.2, 13.3, 22.6, 3.2, 9.7, 30.9, 33.9, 30.9, 32.5])
y = np.array([14.392, 13.393, 14.493, 14.898, 13.187, 13.512, 14.396, 13.489, 12.652, 13.838, 13.506, 12.796, 12.794, 14.586, 14.204, 13.107, 13.702, 12.483, 13.125, 13.392, 14.314, 14.458, 14.944, 14.777, 11.76328143, 12.30545265, 12.02984059, 11.66659327])
y_err = np.array([0.175, 0.059, 0.011, 0.417, 0.01, 0.024, 0.021, 0.013, 0.044, 0.029, 0.083, 0.049, 0.115, 0.011, 0.023, 0.078, 0.028, 0.098, 0.026, 0.034, 0.038, -1, -1, -1, -999, -999, -999, -999])
#######

id_low = (y_err < -10)
id_up = (y_err > -10) & (y_err < -0)
id_uncensored = (y_err > 0)

# defining lower and upper arrays =========

lower = np.zeros(len(x)) - np.inf
lower[id_low] = y[id_low]

upper = np.zeros(len(x)) + np.inf
upper[id_up] = y[id_up]

def censored_regression_new(x, y, lower, upper):

    with pm.Model() as model1:

        A = pm.Normal("A", mu=0, sigma=1)
        B = pm.Normal("B", mu=0, sigma=1)
        #σ = pm.HalfNormal("σ", sigma=1)

        normal_dist = pm.Normal.dist(mu=A + x * B)   ###, sigma=σ)
        censored_normal = pm.Censored("censored_normal", normal_dist, lower=lower, upper=upper, observed=y)

    with model1:
        trace1 = pm.sample()

    return trace1

####==============
censored_fit_new = censored_regression_new(x, y, lower, upper)
az.plot_posterior(censored_fit_new)
####============

def pp_plot(x, y, fit, ax, clr):

    ##### plot data
    ax.scatter(x, y, color="b", marker='*', s=100)

    xi = np.arange(0.0, 40, 0.1)

    post = az.extract(fit)

    A_alls = post["A"].values
    B_alls = post["B"].values

    for ii in range(len(A_alls)):
        y_ppc = A_alls[ii] + xi * B_alls[ii]
        ax.plot(xi, y_ppc, c=clr, alpha=0.05, rasterized=True)

    ax.legend()
    ax.set(xlabel="x", ylabel="y")


fig, ax = plt.subplots(1, 1, figsize=(10, 5), sharex=True, sharey=True)

pp_plot(x, y, censored_fit_new, ax, 'gray')
plt.show()

@Sapna_Mishra you can use triple backticks ``` to format your code correctly

Is it possible you are confusing left-censored values with right-censored ones? Based on your y_err decision rule, here are the x and y values for the left- and right-censored points:

y_rc= array([11.76328143, 12.30545265, 12.02984059, 11.66659327])

x_rc=array([30.9, 33.9, 30.9, 32.5])

x_lc=array([22.6,  3.2,  9.7])

y_lc=array([14.458, 14.944, 14.777])

Your left-censored values are larger than your right-censored values. It would be very unconventional to have such a case, and it is the most likely reason why your lines are not as steep as you want. Indeed, if you think about it for a minute, you will see that left censoring roughly tells your model that the fitted values can be equal to or lower than the observed value, and right censoring the opposite. That gives your lines the flexibility to be less steep, since your left-censored values occur at smaller values of x and your right-censored values at larger values of x. In your original code, when I change

left_censored = (y_err < -10)
right_censored = (y_err > -10) & (y_err < -0)

and sigma as I mentioned above, I get

[posterior predictive plot omitted]

as opposed to the previous:

[posterior predictive plot omitted]

which looks steeper than before (also evident from the A and B posteriors), although it also looks like it could improve. You will have noted that I have colored the censored values, which makes identifying problems easier. One of your right-censored values looks like an outlier and could be the source of the bias in your fits. It could also be that the censoring is not set up correctly (as it would be in pm.Censored, see below).

Also, what if I want to include the errors also in my y values then how to do that?

Depends on what these errors are really measuring. I noticed that your errors are always positive (apart from the censored ones you changed manually), which suggests they are not measuring the error around your fitted curve. For uncensored observations your model assumes the errors are normally distributed around the fitted curve, which means they should generally take both positive and negative values (or you would have to be EXTREMELY unlucky to get all positive errors with this many data points, something like one-in-a-million-experiments unlucky).

Regarding pm.Censored, try redoing it with fixed left and right censoring values; if it does not give something identical to your own model, then there is something to debug.


Thanks a lot… I was trying to test these using various methods. My original code with your modifications gives roughly correct values, which I checked against a simple tobit-regression function and a log-likelihood minimization in scipy; their fitted values for A and B match within the error bars. However, both of these methods fail if the number of lower-limit data points exceeds the number of uncensored data points. Any suggestion for this situation?

Coming to pm.Censored (the code I attached in my previous reply), it is not working even with fixed lower and upper censoring values. I do not understand what is wrong.

The code for pm.Censored is as follows:

import aesara.tensor as at  # needed for at.power below

def censored_regression(x, y, y_err):

    # data pre-processing
    left_censored = (y_err < -10)
    right_censored = (y_err > -10) & (y_err < -0)

    lower = [y[ind] if b else -np.inf for ind,b in enumerate(left_censored)]
    upper = [y[ind] if b else np.inf for ind,b in enumerate(right_censored)]

    with pm.Model() as model:

        A = pm.Normal('A', mu=0, sigma=10)
        B = pm.Normal('B', mu=0, sigma=10)
        σ = pm.HalfNormal("σ", sigma=1)

        normal_dist = pm.Normal.dist(mu=A * at.power(x, B), sigma=σ)
        pm.Censored("likelihood", normal_dist, lower=lower,
                    upper=upper, observed=y)

    with model:
        trace = pm.sample()

    return model, trace

The results are almost identical to the code above.
