I have data where the dependence of y on x resembles a stick-breaking process: as x increases, y is lost. This occurs discretely, but the number of actual ‘breaks’ is very large (>1e6).
The relationship might look something like this:
However, the measured data are not like the y in the image. They come as vector data, [y_0, y_1, y_2]. The reason for this is that the different components of this vector are highly correlated and have a piecewise-linear covariance. This is hard to explain in words, so see the image below for an illustration (each ‘piece’ of the segments is shown in a different color).
What is desired here is the cumulative vector-difference sum over these data points (and the x values of the changepoints), but noise in the data will propagate error into that sum.
I think the correct way to do this would be something like a multivariate spline regression, [y_0, y_1, y_2] = f(x) + [\epsilon_0, \epsilon_1, \epsilon_2], with the covariance built in. I have read some tutorials on spline regression with PyMC3, but I’m not sure how to build this kind of piecewise covariance into a model. Additionally, it would be nice to extend this to the case where the number of segments is unconstrained. Any suggestions or implementations of something similar in PyMC?
It’s hard to answer this question because the dimensionality and geometry of your setting is unclear. Does each \vec y lie on a simplex? Is the dimension of that simplex fixed across all observed points or can it vary? Are the differences ordered (y_i-y_{i-1} \geq y_{i+1}-y_i)?
The measured data are vector data in 3D space.
At some maximum value of x = x_{max}, our data have zero magnitude: |\vec{y}| = 0.
We can define our data using spherical coordinates so that the components of our vector covary linearly up until the changepoint, i.e. if x_{c} < x < x_{max} then:
y_0 = f(x)\sin(\theta_1)\cos(\phi_1)
y_1 = f(x)\sin(\theta_1)\sin(\phi_1)
y_2 = f(x)\cos(\theta_1)
Where for this segment, f(x)=|\vec{y}| and \theta_1, \phi_1 represent the direction of our vector.
At our changepoint x=x_c, we define \vec{y}= \vec{y_c}. Here our direction changes so that it is described by \theta_2, \phi_2.
Then when x_d < x < x_c:
y_0 = f(x)\sin(\theta_2)\cos(\phi_2) + y_{c_0}
etc.
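As a sanity check of the first-segment parameterization above, a small numpy sketch (the function names here are mine, not from any library):

```python
import numpy as np

def direction(theta, phi):
    """Unit vector for spherical angles (theta from +z, phi in the xy-plane)."""
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

def segment_y(f_x, theta, phi):
    """First-segment data: y = f(x) * direction, so that |y| = f(x)."""
    return f_x * direction(theta, phi)

# the magnitude of y recovers f(x) for any direction
y = segment_y(0.7, theta=1.1, phi=0.3)
print(np.linalg.norm(y))  # ≈ 0.7
```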
The idea is to find f, or at least the discrete values of f at the data points. If f needs to be parameterized, it could be approximated using the general form 1 - \mathrm{BetaCDF}(\alpha, \beta). In the example I gave, this would be quite simple: draw a straight line through each of the segments and find the distance along the set of line segments to each data point.
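If f is parameterized as 1 - BetaCDF, scipy gives that directly via the survival function. A sketch, where \alpha = 2, \beta = 5 are arbitrary example shape parameters:

```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 5.0  # arbitrary example shape parameters
x = np.linspace(0.0, 1.0, 6)

# f(x) = 1 - BetaCDF(x; a, b), i.e. the beta survival function
f = beta.sf(x, a, b)

print(f[0], f[-1])  # f(0) = 1 (all of y present), f(1) = 0 (all of y lost)
```

This form is monotonically decreasing from 1 to 0 on [0, 1], matching the stick-breaking shape.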
If that formula carries over to the other components, it looks like the magnitude is discontinuous at x_c?
I think your intuition to use a spline model is right (parameterized on the logit/probit scale if 0 \leq f \leq 1). And when you say “covariance”, do you mean that the error terms covary,
y | x = f(x, \theta, \phi) + \epsilon
with \mathrm{cov}[\epsilon_i, \epsilon_j] \neq 0
Or simply that, even though the errors are independent, \mathrm{cov}[\epsilon_i, \epsilon_j] = 0, the marginal covariance is nonzero, \mathrm{cov}[y_i, y_j] \neq 0?
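To illustrate the distinction: with fully independent noise, the components of y can still covary strongly in the marginal sense because they share the common factor f(x). A numpy sketch with a made-up direction and a toy f:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 2000)
f = 1.0 - x                      # toy magnitude, decreasing in x
d = np.array([0.6, 0.48, 0.64])  # fixed unit direction vector

eps = rng.normal(scale=0.05, size=(x.size, 3))  # independent errors
y = f[:, None] * d + eps

# the noise components are uncorrelated...
print(np.corrcoef(eps.T).round(2))
# ...but the y components are strongly correlated through f(x)
print(np.corrcoef(y.T).round(2))
```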
My mistake, I meant to write y_0 = (f(x) - f(x_c))\sin(\theta_2)\cos(\phi_2) + y_{c_0}.
At x=x_c this gives y_0=y_{c_0} as desired.
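A quick numeric check of the corrected second-segment formula for the first component (f, the angles, and y_{c_0} here are arbitrary toy values):

```python
import numpy as np

def f(x):
    return 1.0 - x  # toy magnitude function

x_c = 0.6
y_c0 = 0.35                  # first component of y at the changepoint
theta2, phi2 = 1.2, 0.4      # second-segment direction (arbitrary)

def y0(x):
    # corrected second-segment formula for the first component
    return (f(x) - f(x_c)) * np.sin(theta2) * np.cos(phi2) + y_c0

print(y0(x_c))  # = y_c0 = 0.35, continuous at the changepoint
```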
The errors should be independent in this case. I’m measuring a vector field and estimating a spherical uncertainty from measurements in multiple orientations.