How best to build a model on 200k normally distributed observations without a simple vector relation (rather, a piecewise relation, where subsets of the data depend on combinations of parameters)?

I’m not sure how to word this question in order to find examples that would help me (I suspect this has been done and looked at before).

I have the following problem: I have a time series of, let's say, 200k points over N days, which I will index by j. I want to split each day into M bins, which I will index by i. The data point in bin i on day j is then x_{ij}.

I want to model the data as being drawn from a normal distribution with zero mean and variance u_i v_j^2, i.e.

x_{ij} \sim N(0, u_i v_j^2)

where each day j has an independent scale parameter v_j and each time bin i has a parameter u_i (the u_i will be Dirichlet distributed, since \sum_i u_i = 1, so that v_j^2 is the total variance for day j).

The issue is this: I suspect that if I create a separate distribution object for each of the 200,000 data points, I will overload my computer and inference will be inefficient. I'm trying to figure out how to define these relations as simply, i.e. as efficiently, as possible.

A simpler version of the problem assumes exactly one data point per bin per day, and I would already be happy with a solution to that. In reality, though, I have a varying number of data points per bin per day (often 0). Does anyone know of examples that deal with this type of problem efficiently?
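One layout that copes naturally with a varying number of points per bin is "long format": one row per observation, with integer columns recording which day and which bin it fell into; bins with no observations simply contribute no rows. A minimal sketch of the binning, assuming the raw data is a pandas Series raw with a DatetimeIndex (the names raw and df are mine; M is the number of bins per day, as above):

import pandas as pd

# raw: pandas Series of ~200k values with a DatetimeIndex (assumed)
df = raw.rename('x').rename_axis('t').reset_index()
df['j'] = pd.factorize(df['t'].dt.date)[0]                    # day index j = 0..N-1
secs = (df['t'] - df['t'].dt.normalize()).dt.total_seconds()  # seconds since midnight
df['i'] = (secs * M // 86400).astype(int)                     # bin index i = 0..M-1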

What I'm currently considering is to build two parallel 200k-element arrays containing, for each data point, the corresponding u_i and v_j; multiply them elementwise; and use a single 200k-dimensional Normal with its variance set to this array. Is this the most efficient solution? At what data size would this become computationally intractable?


A related question, which may lead to a more efficient way to define my model: am I allowed to pass observed data that isn't static? I.e., if my real (static) observations are in the vector x, but I actually pass the argument x / y, where y is a stochastic variable in my Bayesian model, is that valid?
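For what it's worth, in PyMC3 the observed argument has to be fixed data, not an expression involving model variables, so passing x / y directly won't work. When y is a positive scale, though, it can be folded into the likelihood instead: if x/y \sim N(0, s^2) with y > 0, then by change of variables x \sim N(0, (s y)^2). A minimal sketch (the priors and names here are illustrative, not part of the original model):

import numpy as np
import pymc3 as pm

x_data = np.random.randn(100)  # stand-in for the real, fixed observations
with pm.Model():
    s = pm.HalfNormal('s', sd=1.)  # illustrative base scale
    y = pm.HalfNormal('y', sd=1.)  # the stochastic variable you wanted to divide by
    # fold y into the sd rather than modifying the observed data:
    pm.Normal('obs', mu=0., sd=s * y, observed=x_data)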

I would try to use indexing. Assuming your data is in a data frame df with one row per observation, create integer index columns i and j; then:

import numpy as np
import pymc3 as pm

with pm.Model():
    u = pm.Dirichlet('u', a=np.ones(n_time_bins), shape=n_time_bins)  # n_time_bins = M; or whatever prior you wish
    v = pm.Dirichlet('v', a=np.ones(n_days), shape=n_days)  # n_days = N; or whatever prior you wish
    sd = pm.Deterministic('sd', v[df.j.values] * pm.math.sqrt(u[df.i.values]))  # variance u_i * v_j^2, as in the model
    observed = pm.Normal('observed', mu=0, sd=sd, observed=df.x.values)
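This builds a single vectorized Normal over all 200k rows rather than 200k separate distribution objects: the fancy indexing u[df.i.values] and v[df.j.values] just gathers one 200k-long tensor of standard deviations, so the number of free parameters stays at M + N and each log-density evaluation is one vectorized pass over the data. 200k observations should be comfortably manageable this way; you can then call pm.sample() inside the model context as usual.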