Help with understanding how sum_stat="sort" works and how it is applied in "smc.py" source code

Hi,
I’ve been fiddling with the “smc.py” source code recently and found something that puzzles me. When the summary statistic is set to sum_stat="sort", np.sort is applied to observations and sim_data separately, as follows:

self.observations = self.sum_stat(observations)
(...)
elemwise = self.distance(self.epsilon, self.observations, self.sum_stat(sim_data))

Shouldn’t the sorting reorder both arrays according to one of them (the observations)? In many cases this makes no difference, but if, for example, observations = [3.04, 2.99, 1.5] and the corresponding sim_data = [3.02, 3.05, 1.56], then the distance function will not compute the distance between the originally corresponding pairs of values.
I would greatly appreciate it if someone could explain whether this sorting procedure is intended to work the way it does or whether it is an error in the source code.
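
To make the two pairings concrete, here is a small NumPy sketch using the numbers from the example. It is only an illustration and not code from smc.py; the names `order`, `obs_paired` and `sim_paired` are introduced here for the comparison.

```python
import numpy as np

observations = np.array([3.04, 2.99, 1.5])
sim_data = np.array([3.02, 3.05, 1.56])

# Behaviour described in the question: each array is sorted on its own.
obs_sorted = np.sort(observations)
sim_sorted = np.sort(sim_data)

# Alternative raised in the question: reorder both arrays by the
# sort order of the observations, so original pairs stay together.
order = np.argsort(observations)
obs_paired = observations[order]
sim_paired = sim_data[order]

print("independent sort:", list(zip(obs_sorted.tolist(), sim_sorted.tolist())))
print("sorted by observations:", list(zip(obs_paired.tolist(), sim_paired.tolist())))
```

With independent sorting, 2.99 is compared with 3.02 and 3.04 with 3.05, whereas ordering by the observations keeps the original pairs (2.99, 3.05) and (3.04, 3.02).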

@ricardoV94 @aloctavodia ?

Hi @mjed, this was intentional, and I think it is the correct way to approximate the Wasserstein distance between empirical distributions. But I can check again or find a good reference if you are interested in the details.
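
As a rough illustration of why independent sorting matches the Wasserstein idea: for two one-dimensional samples of equal size, the 1-Wasserstein distance between their empirical distributions is the smallest mean absolute difference over all ways of pairing the points, and pairing the sorted values achieves that minimum. The sketch below checks this by brute force on the numbers from the question. The helper `mean_abs_diff` is a name made up for this example, and the sketch does not reproduce the actual distance function in smc.py (which also involves epsilon).

```python
from itertools import permutations

import numpy as np

observations = np.array([3.04, 2.99, 1.5])
sim_data = np.array([3.02, 3.05, 1.56])


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally long arrays."""
    return float(np.mean(np.abs(a - b)))


# Cost when both arrays are sorted independently.
sorted_cost = mean_abs_diff(np.sort(observations), np.sort(sim_data))

# Cost when the arrays are compared in their original order.
index_cost = mean_abs_diff(observations, sim_data)

# Brute force: the cheapest cost over every possible pairing.
best_cost = min(
    mean_abs_diff(observations, sim_data[list(p)])
    for p in permutations(range(len(sim_data)))
)

print(f"sorted: {sorted_cost:.4f}, original order: {index_cost:.4f}, best: {best_cost:.4f}")
```

The independently sorted pairing gives the same cost as the brute-force minimum, which is what a distance between distributions should measure: it ignores the order in which the simulated points happened to be drawn.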