# Reducing dimensionality of data that covary in piecewise way

I have data where the dependence of y on x is similar to a stick breaking process (as x is increased, y is lost. This occurs discretely, but the number of actual ‘breaks’ is very large (>1e6).)
The relationship might look something like this:

However, the data measured are not like the y in the image. They come as vector data, [y_0,y_1,y_2]. The reason for this is that the different components of this vector are highly correlated and have a piecewise linear covariance. This is hard to explain, so see the image below for illustration (Each of the ‘pieces’ of the segments is in a different color).

What is desired here is the vector difference sum of these datapoints (and the x values of the changepoints), but noise in the data will cause this to have error propagation.

I think the correct way to do this would be something like a multivariate spline regression [y_1,y_2,y_3]=f(x)+[\sigma_1,\sigma_2,\sigma_3] with the covariance built in. I have read some tutorials on spline regression with PyMC3, but I’m not sure how to build this kind of piecewise covariance into a model. Additionally, it would be nice to extend it to the case that the number of segments is unconstrained. Any suggestions/implementations of something similar in PyMC?

It’s hard to answer this question because the dimensionality and geometry of your setting is unclear. Does each \vec y lie on a simplex? Is the dimension of that simplex fixed across all observed points or can it vary? Are the differences ordered (y_i-y_{i-1} \geq y_{i+1}-y_i)?

The measured data are vector data in 3D space.
At some maximum value of x=x_{max}\ , our data have 0 magnitude: |\vec{y}|=0

We can define our data using spherical coordinates so that the components of our vector covary linearly up until the changepoint i.e. if x_{c}<x<x_{max} then:
y_0 = f(x)sin(\theta_1)cos(\phi_1)
y_1 = f(x)sin(\theta_1)sin(\phi_1)
y_2 = f(x)cos(\theta_1)
Where for this segment, f(x)=|\vec{y}| and \theta_1, \phi_1 represent the direction of our vector.

At our changepoint x=x_c, we define \vec{y}= \vec{y_c}. Here our direction changes so that it is described by \theta_2, \phi_2.

Then when x_d<x<x_c
y_0 = (f(x)-f(x_c))sin(\theta_2)cos(\phi_2)+ y_{c_0}
etc.

The idea is to find f, or at least the discrete values of f at the data points. f could be approximated using the general form 1- BetaCDF(\alpha, \beta) if it needs to be parameterized. In the example I gave, this would be quite simple to do by drawing a straight line through each of the segments and finding the distance along the set of line segments of each data point.

I don’t see the pattern for the (etc). If x=x_c then from (1) we have

y_0 = f(x_c)\sin(\theta_1)\cos(\phi_1)

but from (2) we have

y_0 = f(x_c)[1 + \sin(\theta_1)\cos(\phi_1) - \sin(\theta_2)\cos(\phi_2)]

If the formula carries for the other components it looks like the magnitude is discontinuous?

I think your intuition is right to want to use a spline model (parameterized on the logit/probit scale if 0 \leq f \leq 1). And when you say “covariance”, do you mean that the error terms covary

y | x = f(x, \theta, \phi) + \epsilon

with \mathrm{cov}[\epsilon_i, \epsilon_j] \neq 0

Or simply that even though the errors are independent \mathrm{cov}[\epsilon_i, \epsilon_j] =0 the marginal covariance is not \mathrm{cov}[y_i, y_j] \neq 0?

My mistake, I meant to write
y_0 = (f(x)-f(x_c))sin(\theta_2)cos(\phi_2)+ y_{c_0}
At x=x_c this gives y_0=y_{c_0} as desired.

The errors should be independent in this case. I’m measuring a vector field and estimating a spherical uncertainty from measurements in multiple orientations.