Increasing sampling speed by changing the distribution of the difference between measured and calculated values to an interval

Hello,

I am fairly new to pymc3 and I have been using it to model geochemical data.
Overall, my current model looks like:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model2:

    data_arr = geochem_df.iloc[row].to_numpy()
    b_eq = data_arr * 100

    def create_priors(tup):
        name, mu, sigma = tup[0], tup[1], tup[2]
        return pm.TruncatedNormal(name, mu, sigma, lower=0, upper=100)

    # one truncated-normal prior per component
    x_v = [create_priors(tup_main[z]) for z in range(len(tup_main))]
    x = tt.as_tensor_variable(x_v)

    # soft constraint tying the computed CP value to eet
    CP = pm.Deterministic('CP_val', tt.sum(x * -CP_arr[:-15]))
    pm.Cauchy('CP', alpha=eet, beta=1, observed=CP)

    # soft constraint keeping the total of the components near 100
    total = pm.Deterministic('total', tt.sum(x))
    total_cap = pm.Cauchy('total_cap', alpha=100, beta=.1, observed=total)

    eq = pm.Deterministic('eq_val', tt.dot(a_eq[:, :-15], x))

    diff = pm.Deterministic('diff', eq - b_eq)  ## difference
    pm.Cauchy('eq', alpha=0, beta=1, observed=diff, shape=data_arr.shape[0])  ## diff distribution

    trace2 = pm.sample(1500, tune=500)

The commented lines are meant to assert that the difference between calculated and observed values follows a Cauchy distribution centred at zero (in order to minimize the difference). This model takes about 10 minutes to run per row of my data, but in some cases we have about 10,000 rows, and the sampler has to be run separately for each row because each row has different priors. I wonder whether constraining the difference to an interval, say [0, 5], would speed up the sampler? Does anyone know how to do this properly? I've run some experiments, but I don't seem to be coding it correctly, as the results are not sensible.

Additionally, is it possible to “force” the sampler to run on all available cores? Say I have an AWS instance with 30-something cores; I can’t figure out how to make the sampler use all of them.

Many thanks for any help!

Vectorization may give you a little bit of a speedup:

mus, sigmas = [t[1] for t in tup_main], [t[2] for t in tup_main]
x = pm.TruncatedNormal('priors', mu=mus, sigma=sigmas, lower=0, upper=100, shape=len(tup_main))

It looks like you have a “soft” constraint on the total, and are trying to get the sampler to enforce it. This is probably causing a significant slow-down. It’s probably faster to rescale, along the lines of:

total_cap = pm.Cauchy('total_cap', alpha=100, beta=.1)
x_scaled = (x / tt.sum(x)) * total_cap

This will have a weird impact on your posteriors, since the x values are happy being quite large and then getting rescaled; if you don’t want that, you may need to do something a little cleverer to constrain the total. But I suspect this soft constraint is the source of the slow-down. You could, for instance, use a Dirichlet to represent proportions of total_cap, and use pm.Deterministic('x_cap', total_cap * dirich_proportions) to pull out the values for analysis.
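For what it’s worth, a minimal sketch of that Dirichlet idea might look like this (assuming a_eq, b_eq and tup_main from your model; the flat Dirichlet prior and the names here are placeholders, and I’ve dropped the CP term for brevity):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

n = len(tup_main)  # number of components, as in the original model

with pm.Model() as model_dirichlet:
    # proportions of the total; a flat Dirichlet prior is just a placeholder choice
    proportions = pm.Dirichlet('proportions', a=np.ones(n))

    # soft prior on the total, mirroring the original total_cap
    total_cap = pm.Cauchy('total_cap', alpha=100, beta=0.1)

    # component values, which sum to total_cap by construction
    x_cap = pm.Deterministic('x_cap', total_cap * proportions)

    # Cauchy misfit likelihood centred on the model prediction
    pm.Cauchy('eq', alpha=tt.dot(a_eq[:, :-15], x_cap), beta=1, observed=b_eq)

    trace = pm.sample(1500, tune=500)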

It would also probably help simplify the graph to do:

pm.Cauchy('eq', alpha=b_eq, beta=1, observed=tt.dot(a_eq[:,:-15], x))
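
On the multi-core question: pm.sample parallelises across chains, so to occupy a 30-odd-core machine you need at least that many chains, e.g. something like the following (the chain count here is just for illustration):

trace2 = pm.sample(1500, tune=500, chains=32, cores=32)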