How to slice an array of numbers during fitting

Hi All,

Just joined and posting for the first time. I am trying to fit a model to figure out the concentration of drugs needed to kill some bacteria. The drug will only contribute to death of bacteria if its concentration is above a certain amount. I could not figure out how to pass this condition in the model.

In the model df[‘Conc’] is a distribution [array] of drug so each single value of kill_pred depends on the calculation of this array.

with pm.Model() as model_kill:
# Priors

active = pm.Normal('m', mu = 2, sd=2)
ϵ = pm.HalfCauchy('ϵ', 5)

μ = pm.Deterministic('μ', np.sum(df['Conc'][:active])) #get error here
# TypeError: cannot do slice indexing on <class 
# 'pandas.core.indexes.range.RangeIndex'> with these indexers [n] 
# of <class 'pymc3.model.FreeRV'>
# Likelihood
kill_pred = pm.Normal('kill_pred', mu=μ, sd=ϵ, observed=y )

Thanks in advance!

Would you be able to elaborate a little more on what the concentration dataframe looks like? Is there a reason why you are summing over some of its cells? Also, it seems like the quantity that you want to model is the concentration of bacteria. Is that correct?

Hi ckrapu, thanks for asking.

df[‘Conc’] ~ a normal distribution of droplets that contain a drug (mu can be 1-5um and sd is 0.1~2um, depending on the drug).

We suspect that only the bigger droplets (say mu + n * sd) are killing the bacteria and the smaller droplets are doing nothing. That’s why we are summing over the bigger active droplets. We have the distributions of the drugs: drug 1 (mu=1um, sd=.2um), drug 2 (mu=3um, sd =1um) and number of bacteria killed kill1=10,000, kill2=20,000. And we are hoping to fir for the parameter n.

So it appears there is a need to splice. However I am guessing there is a cleverer way to do this.

Thanks in advance!
Cheers,
newerjazz

Managed to get it going by passing data in pm.Data but still stumped on how to get more than one set of data into model. Right now the inputs are one column of df_np (which contains the droplet size distribution) and one number kill (which is number of bacteria killed). I would like to pass i columns of df_np for drug(i) and i numbers of kills too. How do I do that? Thanks!

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from theano import shared
import scipy.stats as stats
from scipy.stats import gamma, norm
import pymc3 as pm
import theano.tensor as tt
import arviz as az

Make up fake data

mu = 3
sd = .5
D = norm(dia_mu, dia_sd)
d = D.rvs(100)

df_np = pd.DataFrame({‘Dia’:d})
n = 2.7 # what I am trying to solve for
active = df_np[‘Dia’] > mu + n * sd
kill = np.sum(df_np[‘Dia’][active])

with pm.Model() as model_b:
#Priors
tau = pm.Normal(‘tau’, mu = 2, sd=2)
ϵ = pm.HalfCauchy(‘ϵ’, 5)

#Observed
df_tt = pm.Data('df_tt', df_np['Dia'])
act = df_tt > mu + tau * sd
# Likelihood
μ = pm.Deterministic('μ', pm.math.sum(df_tt[act]))
kill_pred = pm.Normal('kill_pred', mu=μ, sd=ϵ, observed=kill )

trace_b = pm.sample()

pm.traceplot(trace_b);

I have now included working code for this problem. The problem is to splice some data within pymc3 or at least condition some data.

At the last line of code I get the error
Input dimension mis-match. (input[0].shape[1] = 10, input[1].shape[1] = 1)
How do I get conditioning within the model?

Thanks a lot in advance!

import numpy as np
from theano import shared
import pymc3 as pm
import theano.tensor as tt

a = np.arange(20.).reshape(2,10)
a_mu = a.mean(axis=1)
a_mu = a_mu.reshape(2,1)
a_sd = a.std(axis=1)
a_sd = a_sd.reshape(2,1)
act = a_ > a_mu + 3 * a_sd

a_ = shared(a)
a_mu_ = shared(a_mu)
a_sd_ = shared(a_sd)

with pm.Model() as model_b:
tau = pm.Normal(‘tau’, mu = 2, sd=1)
act_ = a_ > a_mu_ + tau * a_sd_

Here’s my attempt to fix the issue in your second-to-last post. I wasn’t sure if you had resolved the indexing issue. I replaced the indexing with a boolean mask to zero out contributions of droplets which were too small. Here’s the gist. With regard to the conditioning, it appears that you were already conditioning on data correctly before (though the sampler wasn’t working too smoothly). How are you intending to condition differently?

Hi ckrapu,

Thank you for the solution. It solves for the unknown tau easily.

I have updated your code to get 2 sets of droplet_sizes and two sets of kill results into the model here

For some reason the model now gets tau wrong even though the second set of data is simply a duplicate. Not sure what I am doing wrong.

Thanks a lot!

after updating the model with
μ = pm.Deterministic(‘μ’, (droplet_sizes*active_).sum(axis=1))

everything seems to work.

Thanks, ckrapu!

Okay - glad that it’s working. I am still a little confused as to how you’re aggregating. I think that I misunderstood whether you want to take the sum across the first axis (summing over droplets) or over the second axis (summing over observations) and I assumed you wanted the former.