Error reimplementing a GLM submodule model in full PyMC3 syntax

Hello,
I wanted to rewrite this small GLM sub-module model in full PyMC3 syntax so that I could better understand how it works.

with pm.Model() as model_glmsubmodule:
    pm.GLM.from_formula('alquiler ~ normalize_dia + log_cantidad', data,
                        family=pm.glm.families.Binomial())

    trace_glmsubmodule = pm.sample(draws=500, chains=2, tune=500, cores=1, target_accept=0.9)

This is what I wrote as the equivalent full-syntax model:

with pm.Model() as model_fullsyntax:

    # define priors, using the defaults from the GLM submodule
    intercept = pm.Flat('intercept')
    b1 = pm.Normal('normalize_dia', mu=0, tau=1.0E-6)
    b2 = pm.Normal('log_cantidad', mu=0, tau=1.0E-6)

    # define the linear model
    yest = (intercept +
            b1 * mx_ex1['normalize_dia'] +
            b2 * mx_ex1['log_cantidad'])

    # define the Binomial likelihood
    likelihood = pm.Binomial('likelihood', n=1, p=yest, observed=mx_en1['alquiler'])

    trace_fullsyntax = pm.sample(500, chains=2, tune=500, init='adapt_diag', cores=1)

but I get this error with the second one:

SamplingError                             Traceback (most recent call last)
<ipython-input> in <module>
      1 with model_fullsyntax:
----> 2     trace_fullsyntax = pm.sample(500, chains=2, tune=500, init='adapt_diag', cores=1)

~\anaconda3\lib\site-packages\pymc3\sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, **kwargs)
    487             _log.info("Sequential sampling ({} chains in 1 job)".format(chains))
    488             _print_step_hierarchy(step)
--> 489             trace = _sample_many(**sample_args)
    490
    491     discard = tune if discard_tuned_samples else 0

~\anaconda3\lib\site-packages\pymc3\sampling.py in _sample_many(draws, chain, chains, start, random_seed, step, **kwargs)
    537                 step=step,
    538                 random_seed=random_seed[i],
--> 539                 **kwargs
    540             )
    541         if trace is None:

~\anaconda3\lib\site-packages\pymc3\sampling.py in _sample(chain, progressbar, random_seed, start, draws, step, trace, tune, model, **kwargs)
    603     try:
    604         strace = None
--> 605         for it, (strace, diverging) in enumerate(sampling):
    606             if it >= skip_first:
    607                 trace = MultiTrace([strace])

~\anaconda3\lib\site-packages\tqdm\std.py in __iter__(self)
   1105                     fp_write=getattr(self.fp, 'write', sys.stderr.write))
   1106
-> 1107             for obj in iterable:
   1108                 yield obj
   1109                 # Update and possibly print the progressbar.

~\anaconda3\lib\site-packages\pymc3\sampling.py in _iter_sample(draws, step, start, trace, chain, tune, model, random_seed)
    698             step = stop_tuning(step)
    699         if step.generates_stats:
--> 700             point, stats = step.step(point)
    701             if strace.supports_sampler_stats:
    702                 strace.record(point, stats)

~\anaconda3\lib\site-packages\pymc3\step_methods\arraystep.py in step(self, point)
    245
    246         if self.generates_stats:
--> 247             apoint, stats = self.astep(array)
    248             point = self._logp_dlogp_func.array_to_full_dict(apoint)
    249             return point, stats

~\anaconda3\lib\site-packages\pymc3\step_methods\hmc\base_hmc.py in astep(self, q0)
    142             )
    143             self._warnings.append(warning)
--> 144             raise SamplingError("Bad initial energy")
    145
    146         adapt_step = self.tune and self.adapt_step_size

SamplingError: Bad initial energy

Can anyone help me understand this error? Thank you so much!

EDIT: I reviewed it and found that the likelihood logp of the pm.Binomial is -inf; that's the problem, but I can't find a way to solve it.

EDIT2: I get mx_en1 and mx_ex1 from the patsy function:

(mx_en1, mx_ex1) = pt.dmatrices('alquiler ~ normalize_dia + log_cantidad', data, return_type='dataframe', NA_action='raise')

At first glance, it appears that you are supplying unnormalized values to the p argument of the binomial. These can fall outside the range [0,1] and then since those are invalid parameters for the binomial distribution, it fires off an error. Try applying pm.math.sigmoid to squash those values into the range of valid probabilities and see if it solves your problem. The sigmoid is an example of an inverse link function which is a crucial part of every GLM. You can read more about the standard choices for link functions here.
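To see the squashing effect described above outside of the model (a plain-Python sketch of the logistic sigmoid, not the PyMC3 code itself):

```python
import math

def sigmoid(x):
    # Logistic inverse link: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# Raw linear-predictor values can fall well outside [0, 1] ...
raw = [-3.2, 0.0, 7.5]
probs = [sigmoid(v) for v in raw]

# ... but after the inverse link, every value is a valid probability
assert all(0.0 < p < 1.0 for p in probs)
```

Inside the model, `pm.math.sigmoid` plays exactly this role, wrapping the linear predictor before it is passed as `p`.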

Okay @ckrapu, I got it! Now it samples, but when I run the model

I notice that the linear formula has very strange weights:

Variable: Intercept       Mean weight in model: 115.0491
Variable: DIA             Mean weight in model: 339.0715
Variable: CANTIDAD        Mean weight in model: -1446.7739
'prob_total_tipo_movimiento =  115.05 * Intercept + 339.07 * DIA + -1446.77 * CANTIDAD'

(basically I want to calculate, later, the probability between 0 and 1 that a bank movement is a supermarket movement taking into account the day and the amount, which I have specified in my dataset)

So, this is my last question about this:
when I try to sample, I get a lot of divergences.

I tried raising target_accept and increasing tune, but that didn't succeed; I still get divergences.

I was thinking that maybe it's the way I'm defining the linear model in model_fullsyntax.

Could you tell me what I might be doing wrong in the model that makes it diverge?

EDIT (if it helps):
my dataset is -> cuenta_normal_añoentero.csv (2.5 KB)

and I make these transformations on the ‘features’ that I consider important to the model:

Thank you so much!
Sorry if this is a rookie question; I'm trying to learn everything about PyMC3 during this quarantine.

In general, it would be helpful if you could provide full code snippets like your first post that allow me to load your data and reproduce your issue. If you could provide a script that loads the data and generates the sampling issues then I would be happy to run it and see what comes out.

That said, I think you want to make sure you use the log of the CANTIDAD variable like before, since it spans several orders of magnitude. I also think interpreting the day of the month as a continuous variable isn't very sensible. For example, treating the 1-31 values as ordinary numerical covariates implies that March 31st is farther from April 1st than it is from March 10th. Does that make sense for the process you are modeling? Could you perhaps create a new categorical variable that indicates whether a day is in the first week, second week, and so on?
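One possible way to build such a categorical variable (an illustrative plain-Python helper, not code from the thread; the name `week_of_month` is made up here):

```python
def week_of_month(day):
    # Map a day-of-month (1-31) to a categorical week index (0-4),
    # so days 1-7 -> week 0, days 8-14 -> week 1, and so on.
    if not 1 <= day <= 31:
        raise ValueError("day must be in 1..31")
    return (day - 1) // 7

assert week_of_month(1) == 0
assert week_of_month(8) == 1
assert week_of_month(31) == 4
```

The resulting index could then be used as a categorical covariate (e.g. one-hot encoded via patsy's `C(...)`) instead of the raw day number.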

Thank you so much @ckrapu for your time!
Yeah, of course, here is my problematic script -> divergences_binomial.py (2.0 KB)
and my little dataset -> cuenta_normal_añoentero1.csv (4.2 KB)

Hmm, I think the idea of grouping the day column into weeks of the year is interesting. However, the weight that will really tell me whether a movement is more (or less) likely to be a supermarket movement is CANTIDAD, because every day in the range [1, 30] is equally likely to have a supermarket movement (which suggests that in the linear formula CANTIDAD will have a strong weight compared to DIA).

I hope we can find a way to resolve these divergences.

Again, thank you very much for your help :slight_smile:

I looked at your model / data and found that it's possible to predict the type of movement with 100% accuracy by just looking at log_cantidad: whenever this variable is greater than ~5, the outcome is always 1.

Try running the following so you can see for yourself:

pd.plotting.scatter_matrix(data[['tipo_movimiento','log_cantidad','dia_normalized']], figsize=(10,10));

When there is virtually zero stochastic noise, the sampler is going to favor making the distinction between movement / no movement as sharp as possible which means very large values for the relevant coefficients.

Also, as a minor piece of advice, you can use pm.Bernoulli instead of pm.Binomial with n=1 for binary data.
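For binary data the two are exactly equivalent, and you can verify it outside of PyMC3 with a plain-Python sketch of the two pmfs:

```python
from math import comb

def binomial_pmf(k, n, p):
    # Standard binomial probability mass function
    return comb(n, k) * p**k * (1 - p)**(n - k)

def bernoulli_pmf(k, p):
    # Bernoulli pmf: p for k=1, (1-p) for k=0
    return p if k == 1 else 1 - p

# With n=1, the binomial pmf reduces exactly to the Bernoulli pmf
for p in (0.1, 0.5, 0.9):
    for k in (0, 1):
        assert abs(binomial_pmf(k, 1, p) - bernoulli_pmf(k, p)) < 1e-12
```

Using pm.Bernoulli mainly makes the intent of the model clearer to readers.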

Thank you again @ckrapu. So, correct me if I'm wrong, but the divergences in the model I showed you are a direct consequence of the dataset values I used rather than of the syntax I employed, right?

Yes, that appears to be the case.

Maybe an option is to define my linear model as

theta = (Intercept + b1 * P1(mx_ex['dias_normalized']) + b2 * P2(mx_ex['log_cantidad']))

with P1 and P2 being probability distributions with shape = 83 (the total number of observed values)?

OK, I got it!
Applying these transformations to the original CANTIDAD and DIA columns:

y_simple = mx_en['tipo_movimiento']
x_n1 = 'CANTIDAD'
x_0 = mx_ex[x_n1].values
x_c = (x_0 - x_0.mean()) / (2 * x_0.std())

x_n2 = 'DIA'
x_1 = mx_ex[x_n2].values
x_d = (x_1 - x_1.mean()) / (2 * x_1.std())

I have no divergences with this model:

with pm.Model() as model_simple:

    Intercept = pm.Flat('Intercept')
    b1 = pm.Normal('β1', mu=0, sd=10)
    b2 = pm.Normal('β2', mu=0, sd=10)

    theta = pm.math.sigmoid(Intercept +
                            b1 * x_d +
                            b2 * x_c)

    y_1 = pm.Bernoulli('y_1', p=theta, observed=y_simple)

    trace_simple = pm.sample(500, chains=2, tune=500, init='adapt_diag', cores=1, target_accept=0.9)

The only thing that keeps me from being 100% happy is that I still get some high weights:

I apologize if the following is repetitive or if it is already obvious to you but I would like to state precisely the reasons for the results that you are getting.

The reason you're not getting divergences anymore is that you're using a stronger prior with sd=10 rather than your first choice of tau=1e-6, which is equivalent to the very weak prior sd=1000. I should clarify that the reason this model is a little pathological is that your data allows you to perfectly predict the type of movement with zero error. It has nothing to do with transformations of your variables; it is because every single outcome with value 1 is associated with log_cantidad > 5. When this happens, the coefficients can zoom off to infinity without any ill effect on the model fit, and numerical instability becomes much more likely. In this scenario, only the strength of the prior prevents you from having divergences.
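For reference, the tau/sd equivalence follows from sd = 1/sqrt(tau); a quick plain-Python check:

```python
import math

# pm.Normal can be parameterized by precision tau or standard deviation sd,
# related by sd = 1 / sqrt(tau)
tau_weak = 1e-6
sd_weak = 1.0 / math.sqrt(tau_weak)
assert abs(sd_weak - 1000.0) < 1e-6  # tau=1e-6 is the same prior as sd=1000

# sd=10 corresponds to tau = 1 / 10**2 = 0.01, a far tighter prior
sd_strong = 10.0
tau_strong = 1.0 / sd_strong**2
assert tau_strong == 0.01
```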

TL;DR: you have high weights because your data is perfectly separable; it has nothing to do with transformations.


Thank you @ckrapu… you helped me a lot.