I have been defining a model that needs to fit a circle to a point cloud within the xy-plane. The parameters required are:
- Center point coordinate x
- Center point coordinate y
- Circle radius
The data itself is rather large from approx 125k to 1M points. The data contains outliers similar to this synthetic dataset below. Also some domain knowledge is related to outliers, as most of the outliers appear on the outer side of the nominal circle.
The uncertainity quantification of the parameters is of the essence in this case so I need a robust way of tackling and identifying outliers. Using OLS will lead to shift in center point in the direction of the outliers and larger estimated radius. Using Robust regression with StudentT will fix most issues with radius, bus the centerpoint is still slighly shifted towards the overbreak. I have also tested RANSAC algorithm for outlier rejection and fitting and that works really well, but does not provide uncertainity estimates for the centerpoint, which are important in this use case.
I tried adapting the code from GLM: Robust Regression using Custom Likelihood for Outlier Classification — PyMC3 3.11.4 documentation like this:
y = np.zeros(df.shape[0])
shp = df.shape[0]
mean_x = df["0"].mean()
mean_y = df["1"].mean()
with pm.Model() as outlier_model:
# Point cloud data x,y:
observed_point = pm.Data("observed_point", df[["0","1"]].values)
# Center point random variable:
center_point = pm.Normal("center_point", mu =[mean_x.round(2),mean_y.round(2)], sigma=0.1, shape=2)
#Radius (r) random variable:
r = pm.TruncatedNormal("r", mu=1.75/2, sigma=0.05, lower=1E-6)
# Likelihood variable data points:
y_obs = pm.Data("y_obs", y)
# # Circle function
circle_function_in = pm.Deterministic("circle_function_in",
pm.math.sqr(observed_point[:,0]-center_point[0])
+pm.math.sqr(observed_point[:,1]-center_point[1])-r**2)
circle_function_out = pm.Normal("y_est_out", mu=0, sigma=5, testval=pm.floatX(0.0)) # (1, )
sigma_y_out = pm.HalfNormal("sigma_y_out", sigma=1, testval=pm.floatX(1.0)) # (1, )
inlier_logp = pm.Normal.dist(mu=circle_function_in, sigma=1).logp(y_obs)
outlier_logp = pm.Normal.dist(mu=circle_function_out , sigma=1 + sigma_y_out).logp(y_obs)
# frac_outliers only needs to span [0, .5]
# testval for is_outlier initialised in order to create class asymmetry
frac_outliers = pm.Uniform("frac_outliers", lower=0.0, upper=0.5)
is_outlier = pm.Bernoulli(
"is_outlier",
p=frac_outliers,
shape=shp,
testval=(np.random.rand(shp) < 0.4) * 1,
) # (n, )
# non-sampled Potential evaluates the Normal.dist.logp's
potential = pm.Potential(
"obs",
((1 - is_outlier) * inlier_logp).sum() + (is_outlier * outlier_logp).sum(),
)
However, the MCMC sampling with pymc3 gives either a memory error or is extremely slow with such a large dataset. Any suggestions on how to improve the model or inference method?