How best to build a model on 200k normally distributed observations without a simple vector relation (rather, a piecewise relation, where subsets of the data depend on combinations of parameters)?

I’m not sure how to word this question in order to find examples that would help me (I suspect this has been done and looked at before).

I have the following problem: I have a time series of, let's say, 200k points over N days, which I will index by j. I want to split each day into M bins, which I will index by i. The data point in bin i on day j is then x_{ij}.

I want to model the data as being drawn from a normal distribution with zero mean and variance u_i v_j^2, i.e.

x_{ij} \sim N(0, u_i v_j^2)

where each day j has an independent scale parameter v_j and each time bin i has a parameter u_i (the u_i will be Dirichlet distributed, since \sum_i u_i = 1, so that v_j^2 is the total variance for day j).

The issue is this: I suspect that if I create a separate distribution object for each of the 200,000 data points, I will overload my computer and inference will be inefficient. I'm trying to figure out how to define these relations as simply, i.e. as efficiently, as possible.

A simpler version of the problem assumes exactly one data point per bin per day, and I would already be happy with a solution to that. In reality, though, I have a varying number of data points per bin per day (often 0). Does anyone know of examples that deal with this type of problem efficiently?
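One layout that copes naturally with a varying number of points per bin is "long format": one row per observation, with integer columns recording which day and which bin it fell into; bins with no observations simply contribute no rows. A minimal sketch of the binning, assuming the raw data is a pandas Series raw with a DatetimeIndex (the names raw and df are mine; M is the number of bins per day, as above):

import pandas as pd

# raw: pandas Series of ~200k values with a DatetimeIndex (assumed)
df = raw.rename('x').rename_axis('t').reset_index()
df['j'] = pd.factorize(df['t'].dt.date)[0]                    # day index j = 0..N-1
secs = (df['t'] - df['t'].dt.normalize()).dt.total_seconds()  # seconds since midnight
df['i'] = (secs * M // 86400).astype(int)                     # bin index i = 0..M-1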

What I'm currently considering is to build two parallel 200k-element arrays containing, for each data point, the corresponding u_i and v_j; multiply them elementwise; and use a single 200k-dimensional Normal with its variance set to this array. Is this the most efficient solution? At what data size would this become computationally intractable?


A related question, which may lead to a more efficient way to define my model: am I allowed to pass observed data that isn't static? I.e., if my real (static) observations are in the vector x, but I actually pass the argument x / y, where y is a stochastic variable in my Bayesian model, is that valid?
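For what it's worth, in PyMC3 the observed argument has to be fixed data, not an expression involving model variables, so passing x / y directly won't work. When y is a positive scale, though, it can be folded into the likelihood instead: if x/y \sim N(0, s^2) with y > 0, then by change of variables x \sim N(0, (s y)^2). A minimal sketch (the priors and names here are illustrative, not part of the original model):

import numpy as np
import pymc3 as pm

x_data = np.random.randn(100)  # stand-in for the real, fixed observations
with pm.Model():
    s = pm.HalfNormal('s', sd=1.)  # illustrative base scale
    y = pm.HalfNormal('y', sd=1.)  # the stochastic variable you wanted to divide by
    # fold y into the sd rather than modifying the observed data:
    pm.Normal('obs', mu=0., sd=s * y, observed=x_data)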

I would try to use indexing. Assuming your data is in a data frame df with one row per observation, create integer index columns i and j; then:

import numpy as np
import pymc3 as pm

with pm.Model():
    u = pm.Dirichlet('u', a=np.ones(n_time_bins), shape=n_time_bins)  # n_time_bins = M; or whatever prior you wish
    v = pm.Dirichlet('v', a=np.ones(n_days), shape=n_days)  # n_days = N; or whatever prior you wish
    sd = pm.Deterministic('sd', v[df.j.values] * pm.math.sqrt(u[df.i.values]))  # variance u_i * v_j^2, as in the model
    observed = pm.Normal('observed', mu=0, sd=sd, observed=df.x.values)
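This builds a single vectorized Normal over all 200k rows rather than 200k separate distribution objects: the fancy indexing u[df.i.values] and v[df.j.values] just gathers one 200k-long tensor of standard deviations, so the number of free parameters stays at M + N and each log-density evaluation is one vectorized pass over the data. 200k observations should be comfortably manageable this way; you can then call pm.sample() inside the model context as usual.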