Sampling running very slowly for all models?

Hi - new to pymc3, and currently all of the tutorial examples that I run are going very slowly. As an example, I took this code, reduced the number of samples from 10,000 to 1,000, but it is still taking around 10 minutes to run (~10 draws/s.) There’s another script that my coworker can run in 15 minutes with a similar machine, but when I try to run it it was estimated to take 25 hours.

Here are the software versions I’m running:
pymc3 3.8
theano 1.0.4
Python 3.7.4

My processor is an Intel Core 15-8350U 1.7GHz 8 cores (no discrete GPU)

Here is the code I was running (takes ~10 minutes to complete.) (Note, I also wrapped it in an “if __name__ == ‘__main__’:” statement to avoid getting “RuntimeError: The communication pipe between the main process and its spawned children is broken.”)

Any help would be appreciated! There are a few different projects I’d like to use pymc3 for, but right now it’s hard to even do tutorials because everything takes to long to sample.

import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns

if __name__ == '__main__':

sns.set_context('paper')
sns.set_style('darkgrid')

import pymc3 as pm, theano.tensor as tt

# simulate data from a known mixture distribution
np.random.seed(12345) # set random seed for reproducibility

k = 3
ndata = 500
spread = 5
centers = np.array([-spread, 0, spread])

# simulate data from mixture distribution
v = np.random.randint(0, k, ndata)
data = centers[v] + np.random.randn(ndata)

plt.hist(data);

# setup model
model = pm.Model()
with model:
    # cluster sizes
    p = pm.Dirichlet('p', a=np.array([1., 1., 1.]), shape=k)
    # ensure all clusters have some points
    p_min_potential = pm.Potential('p_min_potential', tt.switch(tt.min(p) < .1, -np.inf, 0))


    # cluster centers
    means = pm.Normal('means', mu=[0, 0, 0], sd=15, shape=k)
    # break symmetry
    order_means_potential = pm.Potential('order_means_potential',
                                         tt.switch(means[1]-means[0] < 0, -np.inf, 0)
                                         + tt.switch(means[2]-means[1] < 0, -np.inf, 0))

    # measurement error
    sd = pm.Uniform('sd', lower=0, upper=20)

    # latent cluster of each observation
    category = pm.Categorical('category',
                              p=p,
                              shape=ndata)

    # likelihood for each observed value
    points = pm.Normal('obs',
                       mu=means[category],
                       sd=sd,
                       observed=data)


# fit model
with model:
    step1 = pm.Metropolis(vars=[p, sd, means])
    step2 = pm.ElemwiseCategorical(vars=[category], values=[0, 1, 2])
    tr = pm.sample(1000, step=[step1, step2])

Discovered the issue - I needed to install the m2w64-toolchain. After running “conda install m2w64-toolchain” everything is running much much faster.