Hi,
just to provide some background: I am a scientist working with EEG / MEG data. I recently came across PyMC3 and was wondering whether we can apply this to our data.
Probably the most common use case is this: I collected data from N participants in a paradigm that has 2 conditions. The data is sampled from a number of sensors (102 magnetometers in my case). Because the data is very noisy, I average over something like 100 trials within each condition for each participant separately.
So, what I end up with is an array of shape (n_participants, n_channels, n_samples) for each condition. Because I am interested in the difference between the conditions, I use the difference of the two arrays for further analysis.
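In code, the preprocessing looks roughly like this (`epochs_a`/`epochs_b` are just illustrative names for the single-trial data):

```python
import numpy as np

# epochs_a, epochs_b: single-trial data for the two conditions, each of
# shape (n_participants, n_trials, n_channels, n_samples) -- names are mine
evoked_a = epochs_a.mean(axis=1)  # average over ~100 trials per participant
evoked_b = epochs_b.mean(axis=1)
data_for_model = evoked_a - evoked_b  # (n_participants, n_channels, n_samples)
```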
The hypothesis I want to test is that, at some channels and some samples, the mean of the differences across all participants is not 0.
I can quite safely assume that the differences follow a Student-t distribution. So, I tried to come up with a model, using the models defined in the BEST toolbox as a template.
This is what I came up with:
```python
import numpy as np
import pymc3 as pm

# Normalisation seems to be needed because the values are in the range of
# 1e-15, and that seems to be a problem for the sampling algorithm.
norm_factor = np.std(data_for_model)
data_for_model = data_for_model / norm_factor

nu_min = 2.5   # shifted-exponential prior on the degrees of freedom
nu_mean = 30

with pm.Model() as model:
    # we want to estimate the mean for each channel and sample
    mu = pm.Normal('mu', mu=0, sigma=1, shape=(n_channels, n_samples))
    # but we can assume the standard deviation is the same within each channel
    std = pm.Uniform('std', lower=1e-3, upper=1e3, shape=(n_channels, 1))
    nu = pm.Exponential('nu', 1 / (nu_mean - nu_min)) + nu_min
    modeled_data = pm.StudentT('modeled', mu=mu, nu=nu, sigma=std,
                               observed=data_for_model,
                               total_size=data_for_model.shape)
    trace = pm.sample(1000)
```
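For completeness, this is roughly how I was planning to read the hypothesis off the posterior afterwards, continuing from the model above (just a sketch; the 95% interval is an arbitrary choice of mine):

```python
# all posterior draws of mu, shape (n_draws_total, n_channels, n_samples)
mu_post = trace['mu'] * norm_factor  # undo the normalisation
lo, hi = np.percentile(mu_post, [2.5, 97.5], axis=0)
# channels/samples where the 95% posterior interval excludes 0
nonzero = (lo > 0) | (hi < 0)
```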
Here are my questions:
- Does this make sense to you? Or did I miss something important? This is my first attempt…
- It seems to be important to scale the really tiny values, because otherwise I get lots of errors from the sampler. Does that make sense?
- Any way to speed this up? It takes about 15 hours on my laptop. I could send it to a quite powerful cluster with ~200 cores, but it seems that the `cores` argument to `pm.sample` then only increases the number of chains, so it would not be done any quicker. I already looked into providing a `multiprocessing` context, but I would need to limit the number of cores available to it and could not find out how to do that.
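For reference, this is roughly the call I benchmarked (the chain/core counts are just illustrative):

```python
# more cores only parallelises chains; each chain still takes the full time
trace = pm.sample(1000, chains=4, cores=4)
```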
Any help is highly appreciated! Thanks in advance!
Thomas