I’m trying to set up the world’s simplest example of a Potential, and I’m currently a bit perplexed by the result. I have two lists of random bits, X and Y, which I’ve generated as independent random values. I want to learn the parameter of a Boltzmann-machine-like potential connecting the two, which I’ve called w below. The code executes fine, but in the trace the posterior distribution on w is extremely far from zero (zero is what I expected for independent flips). Can someone explain where my confusion is coming from?
import numpy as np
import pymc3 as pm

# two independent streams of random bits
X = np.random.randint(0, 2, 10000)
Y = np.random.randint(0, 2, 10000)

# set up the model
model = pm.Model()
with model:
    w = pm.Normal('weight', mu=0, sd=1)
    pm.Potential('coupling', w * (2*X - 1) * (2*Y - 1))
    trace = pm.sample()
pm.traceplot(trace)
I deleted this since I thought I understood my mistake, but now I’m confused again. It behaves as expected if you change the code to:
import numpy as np
import pymc3 as pm

X = np.random.randint(0, 2, 10000)
Y = np.random.randint(0, 2, 10000)

# set up the model
model = pm.Model()
with model:
    w = pm.Normal('weight', mu=0, sd=1)
    # same coupling, but divided by the number of data points
    pm.Potential('coupling', w * (2*X - 1) * (2*Y - 1) / 10000)
    trace = pm.sample()
pm.traceplot(trace)
The only change is that I divided the coupling by the number of data points. I am confused as to why this is correct. Can someone help?
The gradients increase with the norm of the potential. Generally you need to use a learning rate which is inversely proportional to the largest gradient you’re going to see, or you’ll get a divergent optimization.
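As a quick back-of-the-envelope check (my own sketch, reusing the X and Y from your snippet), you can see how the gradient of the summed potential scales with the number of terms, while the averaged version stays small:

import numpy as np

np.random.seed(0)
X = np.random.randint(0, 2, 10000)
Y = np.random.randint(0, 2, 10000)
s = (2*X - 1) * (2*Y - 1)          # each term is +/-1

# d/dw [w * sum(s)] = sum(s), which grows like sqrt(N) for independent flips,
# while d/dw [w * mean(s)] = mean(s), which shrinks like 1/sqrt(N)
print(s.sum())    # typically of order +/-100 for N = 10000
print(s.mean())   # typically of order +/-0.01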
NUTS under the hood is doing gradient descent with momentum, and if you have a huge potential it will diverge just like any other solver. Try passing in a smaller step_scale.
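Something like this is what I have in mind (the 0.01 is just a placeholder value, and note that pymc3's tuning phase will normally adapt the step size anyway):

# reusing the model from the first snippet
with model:
    step = pm.NUTS(step_scale=0.01)   # smaller initial leapfrog step size than the default
    trace = pm.sample(step=step)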
Though to be sure, integrating a Hamiltonian system with too large a step size relative to the gradient has the same numerical problems as optimizing a function with too large a step size relative to the gradient norm. I guess the tuning step works really, really well at picking the step size…
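For what it's worth, here is a tiny standalone sketch (plain numpy, not pymc3) of what I mean: a leapfrog integrator goes unstable once the step size is too large relative to the gradient of the potential.

import numpy as np

def leapfrog(grad, q, p, eps, n_steps):
    # standard leapfrog integrator for H(q, p) = U(q) + p**2 / 2
    p = p - 0.5 * eps * grad(q)
    for _ in range(n_steps - 1):
        q = q + eps * p
        p = p - eps * grad(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad(q)
    return q, p

for k in [1.0, 1e4]:                  # U(q) = k * q**2 / 2, so grad U = k * q
    q, p = leapfrog(lambda q: k * q, q=1.0, p=0.0, eps=0.1, n_steps=50)
    print(k, q)                       # for k = 1e4 the same eps = 0.1 is unstable and q explodes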
I don’t think this is the case, and scaling the step_scale likely would not work. I am not completely sure you have the correct potential for a Boltzmann machine written down: as I understand it, the energy function should be an outer product between X and Y here, which means you would have a 10000 x 10000 matrix and then take the sum. But if you mean to express that X and Y are random samples from 2 nodes, then you should take the expectation, which is why dividing by 10000 gives you the expected result (pymc3 takes the sum of the Potential term here).
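In code, the "expectation" reading just averages the coupling over the observed samples instead of summing it, which is the same thing as the divide-by-10000 version above (a sketch, reusing X and Y from the question):

with pm.Model() as model:
    w = pm.Normal('weight', mu=0, sd=1)
    s = (2*X - 1) * (2*Y - 1)
    pm.Potential('coupling', w * s.mean())   # equivalent to w * np.sum(s) / len(s)
    trace = pm.sample()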
Indeed, the second explanation is what I was intending: X and Y are samples from two nodes. Isn’t it the case, though, that if we wanted to do maximum likelihood estimation of the parameter w, we would take the product of the probabilities of each observation, each of which would have the multiplicative factor e^(w*(2*x_i - 1)*(2*y_i - 1))? Following through the logarithms, the sum without normalization should then give the log-likelihood of the series of observations, and if you want to incorporate the prior, that would be an additional term?
I’m sure I’m missing an obvious statement here right now.
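To make that reasoning concrete, here is a small numeric sketch of what it implies (just my own illustration, with s_i = (2*x_i - 1)*(2*y_i - 1)): the unnormalized factors give log p(w | data) = w * sum(s_i) - w**2 / 2 + const, whose mode sits at w = sum(s_i), i.e. a value of order sqrt(N) rather than near zero.

import numpy as np

X = np.random.randint(0, 2, 10000)
Y = np.random.randint(0, 2, 10000)
S = ((2*X - 1) * (2*Y - 1)).sum()

# log posterior implied by the reasoning above: w * S (sum of the couplings)
# plus the log of the Normal(0, 1) prior, -w**2 / 2, up to constants
w_grid = np.linspace(-3*abs(S), 3*abs(S), 1001)
log_post = w_grid * S - w_grid**2 / 2
print(S, w_grid[np.argmax(log_post)])   # the mode lands near w = S, far from zero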