Hi everyone! I hope this question is not out of place.
I have a problem where I have one variable (Y), which consists of N biological measurements per year over four years, and three other variables (X1, X2, X3) consisting of M satellite-derived environmental measurements (M >> N) over the same four years. I’d like to see whether the yearly changes in X1, X2, and X3 could explain the corresponding changes in Y. I’ve been playing with Gaussian processes, but I can’t seem to wrap my head around how to set this up due to the different sample sizes between Y, X1, X2, and X3.
I’ve successfully applied a model similar to the “islands model” presented in Bayesian Analysis with Python by @aloctavodia, where the distance matrix of the annual means in X1 models the annual means in Y. Still, I feel like I’m leaving a lot of information on the table by just using mean values.
Any suggestions or pointers are welcome.
Am I correct that this is a missing data problem? That is, X1, X2, and X3 are collected at a higher frequency than Y (and perhaps Y is not collected at regular intervals at all)?
If so, you should be able to reduce everything to the lowest time frequency (paying attention to adjust for stock and flow variables), then use your model to interpolate the missing data.
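For concreteness, a rough pandas sketch of the resampling step might look like this (column names, frequencies, and the flow/stock split are all made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: align a daily satellite series with a coarser
# biological series by resampling down to the coarser (quarterly) frequency.
dates = pd.date_range("2018-01-01", periods=365, freq="D")
x = pd.DataFrame({"rainfall": np.random.rand(365),   # a flow-type variable
                  "sst": np.random.rand(365)},       # a stock-type variable
                 index=dates)

# Flows get summed over each period, stocks get averaged (or take the
# period-end value), before merging with Y on the new coarse index.
x_coarse = x.resample("QS").agg({"rainfall": "sum", "sst": "mean"})
```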
Hi @jessegrabowski! Thank you very much for your response.
I left out quite a bit of context, and, unfortunately (for me), it is not a missing data problem. Y is stable isotope data from biological tissue. In short, stable isotopes give information about what an animal ate (and where) during the three months (in my case) before sampling. For example, let’s say that I sampled 20 animals in one year during May. I’d get information about those animals’ behavior from March through May.
Moreover, my animals are pretty mobile, so getting environmental information from where they were sampled would not give enough information in space or time. For this reason, Xs are data from regional oceanographic variables extracted from three-month composite satellite images. That is to say, the only dimension that X and Y share is time.
Interesting! So is it a mix of (time, animal) longitudinal observations for the isotope data (the same animals over time), and time-only environmental factors shared by all the animals (the Xs)?
Hi again, @jessegrabowski! I appreciate your interest, and sorry for the late response.
Yes, it is a mix of (time, delta1, delta2) longitudinal observations, where delta1 and delta2 are isotope values. The animals sampled are random individuals from two co-occurring species. Also, yes, the environmental factors come from the corresponding time windows and cover the area where the animals were sampled.
If I were to put it in some sort of mathematical notation, it would be something like this:
[delta1, delta2]_(sp, y) ~ X1_y + X2_y + X3_y
Where sp is the species, and y is the year.
If I understand the problem well, you have a nice linear model and will just need to juggle some indices. I made a small gist with fake data here, giving an example of how you might do this. Basically you want varying effects in the response to the environmental changes, even though those changes are shared by all the individuals in the sample. The key line is here:
```python
# Compact parameter arrays, one entry per level of each named dimension
species_effects = pm.Normal('species_effect', dims=['isotope', 'species'])
time_effects = pm.Normal('time_effect', dims=['isotope', 'time'])
env_beta = pm.Normal('env_betas', dims=['isotope', 'species', 'env_factors'])

# Stretch everything out to the "long" observation axis with the index arrays
mu_long = (species_effects[i_idx, j_idx] +
           time_effects[i_idx, t_idx] +
           (env_beta[i_idx, j_idx, :] * X_data[t_idx, :]).sum(axis=-1))  # einsum by hand
```
- Give all your parameters the dimensions you wrote down in your mathematical notation (I deviated somewhat from yours, allowing the response to vary by both isotope and species)
- Make “long” indices using e.g. `pd.factorize` to map the compact (I, J) and (I, T) parameter arrays to “long” (I * J * T, ) arrays, which forces everything to line up (see the sketch after this list)
  - Note the use of “fancy indexing” by passing two arrays – PyMC indexing matches NumPy
- For matrix multiplication, stretch out all indices except one, then multiply and sum away that last index.
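Here is a rough standalone sketch of that index juggling (made-up column names, with NumPy arrays standing in for the PyMC variables):

```python
import numpy as np
import pandas as pd

# pd.factorize turns each label column of the long dataframe into integer
# codes (the index maps) plus the unique labels, which can double as coords.
df = pd.DataFrame({
    "isotope": ["d13C", "d15N", "d13C", "d15N", "d13C", "d15N"],
    "species": ["A",    "A",    "B",    "B",    "A",    "B"],
    "year":    [2018,   2018,   2019,   2019,   2020,   2020],
})

i_idx, isotopes = pd.factorize(df["isotope"])
j_idx, species = pd.factorize(df["species"])
t_idx, years = pd.factorize(df["year"])

# Fancy indexing with two integer arrays works the same as in NumPy, so a
# compact (isotope, species) array stretches out to one value per observation:
species_effects = np.zeros((len(isotopes), len(species)))
print(species_effects[i_idx, j_idx].shape)                              # (6,)

# "Einsum by hand": stretch all indices except env_factors, then sum it away
env_beta = np.zeros((len(isotopes), len(species), 3))
X_data = np.zeros((len(years), 3))
mu_env = (env_beta[i_idx, j_idx, :] * X_data[t_idx, :]).sum(axis=-1)    # (6,)
```

The `isotopes`, `species`, and `years` arrays that come back from factorize are exactly the labels you would pass as coords to the model.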
In my example, I allowed the response to environmental change to vary by isotope-species, but you could make that anything you like (e.g., isotope-species-individual) by adding indexes.
To be honest I’m a bit confused on how to implement GPs when there are multiple dimensions of variation (hierarchical GP?), so you would need to tag in an expert on those if that’s the way you want to go. For example, I’ve read this very old thread that uses block covariance matrices to stitch together many individuals in a dataset with (individual, time), but it seems like there has to be a better way. @bwengals ?
Hi again, @jessegrabowski!
Thank you VERY much for your suggestions and gist; they go nicely with my goal. I originally proposed a GP to include the nonlinear relationship between variables and time, but your proposed model makes a lot of sense.
I made a mistake in explaining the problem, though, saying it is a longitudinal design when it is not (my bad). To clarify, for every species-isotope-year, there are n randomly sampled individuals (n varies per species/year, and the individuals most likely are not the same in any case). On the other hand, the environmental data are thousands of satellite measurements (basically maps of each variable).
Anyway, I could extract the annual mean values for every variable and then work with your gist as is; however, I think I’m missing your point?
P.S. I’m marking your answer as a solution.
Well in that case you still have three dimensions – isotope-species-year – you just have to think through how to connect the (noisy, heterogeneous) samples of individuals to the species-level features you are trying to infer from the measurements.
At the very least, I would contemplate estimating the annual mean (plus std or whatever else, depending on how the variable is distributed) for each species-year inside your model, then use those estimated means in the downstream isotope model. That would let you propagate uncertainty about the true values, and also do some partial pooling between species if appropriate. For example, if you’re measuring the size of closely related species, you could assume that they should all be clustered around some grand mean with some deviation, allowing you to infer something about the species with fewer samples. You could also include some time dynamics here if you think it’s appropriate, by putting a Gaussian random walk prior over the means.
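If it helps, here is a minimal, self-contained sketch of that idea (all names, priors, and the fake data are just placeholders, not a recommendation for your actual variables):

```python
import numpy as np
import pymc as pm

# Fake long-format individual measurements, with unequal n per species-year
rng = np.random.default_rng(0)
species = ["sp1", "sp2"]
years = [2018, 2019, 2020, 2021]
sp_idx = rng.integers(0, len(species), size=150)
yr_idx = rng.integers(0, len(years), size=150)
y_obs = rng.normal(size=150)

with pm.Model(coords={"species": species, "year": years}) as model:
    grand_mu = pm.Normal("grand_mu", 0.0, 5.0)            # shared grand mean
    sigma_sp = pm.HalfNormal("sigma_species_year", 2.0)   # pooling strength

    # Partially pooled species-year means, estimated inside the model
    mu_sp_year = pm.Normal("mu_sp_year", mu=grand_mu, sigma=sigma_sp,
                           dims=("species", "year"))

    sigma_obs = pm.HalfNormal("sigma_obs", 2.0)
    pm.Normal("obs", mu=mu_sp_year[sp_idx, yr_idx], sigma=sigma_obs,
              observed=y_obs)
    # mu_sp_year could also get time dynamics, e.g. a GaussianRandomWalk
    # prior over the year dimension, if that seems appropriate.
```

The posterior of `mu_sp_year` then carries the uncertainty from the unequal per-year sample sizes into whatever downstream model uses it.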
All this is just spitballing, because I don’t have any domain knowledge about your problem. Sorry!
For the environmental images, I know next to nothing about this. I guess there should be some feature extraction techniques used in your domain that could allow you to reduce them to a more practical form? You can’t really use thousands of pixel arrays in a linear model.
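Just as a generic illustration (not a domain recommendation), something like PCA on the flattened composites could get each year’s map down to a handful of scores:

```python
import numpy as np
from sklearn.decomposition import PCA

# Purely illustrative: reduce one composite map per year to a few PCA scores.
# All shapes are made up; real pipelines in your field may look very different.
images = np.random.rand(4, 100, 120)          # 4 years of (100 x 120) pixel maps
flat = images.reshape(images.shape[0], -1)    # (4, 12000)

pca = PCA(n_components=3)
scores = pca.fit_transform(flat)              # (4, 3): one row of features per year
```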
As for using the gist, I would make sure to incorporate some hierarchy into the parameters, and think about whether modeling the isotopes independently (like in the gist) makes sense or not. Also, use `pd.factorize` to painlessly get the index maps (the things I called `i_idx`, `j_idx`, `t_idx`) off your data instead of what I did. Factorize also gives back labels you can use in your `coords` that are more beautiful than just integers.
Hi!
You helped me plenty already, so don’t worry about the other details; I was making sure we were on the same page and giving a bit of extra context for whoever happens to read this thread :).
As for my particular model, we were on the same page indeed; right now, I’m trying to estimate the annual means for each species-year and use those in the linear model. I’m also placing hyperpriors on the species effects because they share some resources. For the environmental images, I’m working with just a subsample of 100 points per year, but I’ll have to find an optimal way to maximize the information.
Again, thank you very much for your suggestions! I appreciate all of the work you did.