# Robust circle fitting to to point cloud with outliers with large dataset

I have been defining a model that needs to fit a circle to a point cloud within the xy-plane. The parameters required are:

• Center point coordinate x
• Center point coordinate y

The data itself is rather large from approx 125k to 1M points. The data contains outliers similar to this synthetic dataset below. Also some domain knowledge is related to outliers, as most of the outliers appear on the outer side of the nominal circle.

The uncertainity quantification of the parameters is of the essence in this case so I need a robust way of tackling and identifying outliers. Using OLS will lead to shift in center point in the direction of the outliers and larger estimated radius. Using Robust regression with StudentT will fix most issues with radius, bus the centerpoint is still slighly shifted towards the overbreak. I have also tested RANSAC algorithm for outlier rejection and fitting and that works really well, but does not provide uncertainity estimates for the centerpoint, which are important in this use case.

I tried adapting the code from GLM: Robust Regression using Custom Likelihood for Outlier Classification — PyMC3 3.11.4 documentation like this:

``````y = np.zeros(df.shape)
shp = df.shape
mean_x = df["0"].mean()
mean_y = df["1"].mean()
with pm.Model() as outlier_model:
# Point cloud data x,y:
observed_point = pm.Data("observed_point", df[["0","1"]].values)

# Center point random variable:
center_point = pm.Normal("center_point", mu =[mean_x.round(2),mean_y.round(2)], sigma=0.1, shape=2)

r = pm.TruncatedNormal("r", mu=1.75/2, sigma=0.05, lower=1E-6)

# Likelihood variable data points:
y_obs = pm.Data("y_obs", y)

# # Circle function
circle_function_in = pm.Deterministic("circle_function_in",
pm.math.sqr(observed_point[:,0]-center_point)
+pm.math.sqr(observed_point[:,1]-center_point)-r**2)

circle_function_out = pm.Normal("y_est_out", mu=0, sigma=5, testval=pm.floatX(0.0))  # (1, )

sigma_y_out = pm.HalfNormal("sigma_y_out", sigma=1, testval=pm.floatX(1.0))  # (1, )

inlier_logp = pm.Normal.dist(mu=circle_function_in, sigma=1).logp(y_obs)

outlier_logp = pm.Normal.dist(mu=circle_function_out , sigma=1 + sigma_y_out).logp(y_obs)

# frac_outliers only needs to span [0, .5]
# testval for is_outlier initialised in order to create class asymmetry
frac_outliers = pm.Uniform("frac_outliers", lower=0.0, upper=0.5)
is_outlier = pm.Bernoulli(
"is_outlier",
p=frac_outliers,
shape=shp,
testval=(np.random.rand(shp) < 0.4) * 1,
)  # (n, )

# non-sampled Potential evaluates the Normal.dist.logp's
potential = pm.Potential(
"obs",
((1 - is_outlier) * inlier_logp).sum() + (is_outlier * outlier_logp).sum(),
)
``````

However, the MCMC sampling with pymc3 gives either a memory error or is extremely slow with such a large dataset. Any suggestions on how to improve the model or inference method?

I attempted to solve your problem using a mixture model. The basic idea is that we assume that we have two circles with the same center but different radii. Our model assumes that each point belongs to one of the two circles, and estimates both the center and the radii.

The main problem is that for some chains, the model discovers a spurious solution because without angular coordinates included, putting a circle at the overall center-of-mass of the whole dataset gives a good fit too.

Try this and see if it works on your data. You may need to subset your original datasets if they’re too large.

``````import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
import theano.tensor as tt
import seaborn as sns

# Create simulated data for points on circles of two different radii

n_good, n_outlier = 80, 20
n_points = n_outlier + n_good

true_x, true_y = 0.2, 0.4

angles_good    = np.random.uniform(0.2*np.pi, 2*np.pi, size=n_good)
angles_outlier = np.random.uniform(0, 0.2*np.pi, size=n_outlier)
angles         = np.concatenate([angles_good, angles_outlier])

points = np.stack([np.cos(angles), np.sin(angles)], axis=1)
points[:,0] += true_x
points[:,1] += true_y

# Add small amount of Gaussian noise to the observations
points += np.random.randn(*points.shape) * 0.01

# Our model is ignorant of the angular coordinates
# and only cares about how far a point is from the estimated
# center
with pm.Model() as model:

# Place a prior that says the center should be relatively close to (0,0)
center    = pm.Normal('center', sigma=1, shape=(1,2))
distances = pm.Deterministic('distances',
pm.math.sum((points - center)**2, axis=1)**0.5)

# Place a prior on the relative proportion of outlier vs non-outlier
p               = pm.Dirichlet("p", [90,10])
sigma           = pm.HalfNormal('sigma', sd=0.05, shape=2)

# Place prior over the distance of outlier and non-outlier points from center

observed=distances)

trace = pm.sample(tune=3000, target_accept=0.95)

plt.figure()
plt.scatter(trace['center'][:,0,0],
trace['center'][:,0,1],
label='Posterior samples of center', alpha=0.5)
plt.scatter(true_x, true_y, label='True centerpoint', marker='x', s=500)

plt.scatter(points[:,0], points[:,1],label='Data points')
plt.legend(loc='lower right')
``````

Here’s an example of a bad solution - many of the samples are in the right spot, but several are close to the center-of-mass. 1 Like