Several seconds per iteration with GPU


I am developing a bivariate Gaussian mixture model with time-invariant component distributions and time-varying mixing coefficients, to model a spatiotemporal distribution. The model is defined as follows:

P(\mathcal{D}|\theta; t) = \displaystyle \sum_{i=1}^K a_i(t)\phi(\mu_i, \Sigma_i)

with the priors:

a(t) = (a_1(t), \dots, a_K(t)) \sim \operatorname{Dir}(1)
\mu_i \sim \mathcal{N}(0, 10)
\Sigma_i \mid \sigma_i \sim \operatorname{LKJ}(2, \sigma_i)
\sigma_i \sim \operatorname{Half-Cauchy}(1)

I have modeled the data for two time points, t=1 and t=2, with 6K and 13K data points respectively. The code for the model is given below:

import numpy as np
import pymc3 as pm

with pm.Model() as model:
    n_comps = 5
    n_hours = 2

    packed_L = []
    L = []
    sigma = []
    mu = []
    obs = []
    w = []

    for i in range(n_comps):
        # LKJ(eta=2) prior on the 2x2 correlation, Half-Cauchy(1) on the scales
        packed_L.append(pm.LKJCholeskyCov('packed_L' + str(i), n=2, eta=2.,
                                          sd_dist=pm.HalfCauchy.dist(1.)))
        L.append(pm.expand_packed_triangular(2, packed_L[-1]))
        # reconstruct the covariance matrix from its Cholesky factor
        sigma.append(pm.Deterministic('sigma' + str(i), L[-1].dot(L[-1].T)))
        mu.append(pm.Normal('mu' + str(i), 0., 2., shape=2))
        obs.append(pm.MvNormal.dist(mu=mu[-1], chol=L[-1]))

    mixt = []

    for i in range(n_hours):
        w.append(pm.Dirichlet('w' + str(i), a=np.ones(n_comps)))

        x, y = df[df.pickup_hour == i][['pickup_latitude', 'pickup_longitude']].values.T
        x = (x - mean_x) / std_x
        y = (y - mean_y) / std_y

        mixt.append(pm.Mixture('mixt' + str(i), w=w[i], comp_dists=obs,
                               observed=np.stack([x, y], axis=1)))

and the following sampler:

with model:
    trace = pm.sample(njobs=1)

The best I can get from this model is about 1 iteration per second, with GPU acceleration (tested on an AWS p2.xlarge). Could anybody point out what I have done wrong, or suggest how to model this better?


1 iteration per second for a large model like this is actually not too bad. Did you try using CPU only?


It seems like it takes 2.6 s/it with the GPU, but without the GPU it wasn't able to sample anything at all. When I limit the data to 500 points per hour, it runs at 3.0 s/it on the CPU.


You can also try profiling to see which part of the model is slow (e.g. `model.profile(model.logpt).summary()`). However, this seems difficult to optimize unless you are willing to write down the multivariate mixture explicitly.


Would you mind explaining what you mean by writing down the multivariate mixture explicitly?


Something similar to this doc: (see cell 2)
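To make the idea concrete: "writing the mixture explicitly" means computing the marginalized mixture log-likelihood yourself as a log-sum-exp over the K component log-densities, instead of handing `pm.Mixture` a list of `MvNormal.dist` objects. Here is a minimal NumPy/SciPy sketch of that quantity (my own illustration of the math, not the code from the linked notebook):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal


def mixture_logp(data, weights, mus, covs):
    """Log-likelihood of a K-component bivariate Gaussian mixture.

    data    : (N, 2) observations
    weights : (K,) mixing coefficients, summing to 1
    mus     : (K, 2) component means
    covs    : (K, 2, 2) component covariances
    """
    # (N, K) matrix of per-point, per-component log-densities
    comp_logps = np.stack(
        [multivariate_normal(mus[k], covs[k]).logpdf(data)
         for k in range(len(weights))], axis=1)
    # log sum_k w_k * phi_k(x_n), computed stably in log space, summed over n
    return logsumexp(np.log(weights) + comp_logps, axis=1).sum()
```

In the PyMC3 model the same expression would be built from `theano.tensor` operations (e.g. `pm.math.logsumexp`) and added to the model via `pm.Potential` or `pm.DensityDist`, so the whole likelihood becomes one vectorized graph rather than a Python-level loop over component distributions.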