Hi Jiadaren!
Many people point to Michael Betancourt’s blog post on exactly this question. You should read through the whole thing, he gives a very good overview of the problem, but the essence of it is he suggests putting 99% of the (Inverse Gamma) prior density on the lengthscale between the shortest and longest pairwise distance in your dataset. My code for doing this looks something like:
from scipy.spatial.distance import pdist
distances = pdist(X)
ℓ_l = distances[distances!=0].min()
ℓ_u = distances[distances!=0].max()
ℓ_σ = (ℓ_u-ℓ_l)/6
ℓ_μ = ℓ_l + 3*ℓ_σ
with pm.Model() as model:
ℓ = pm.InverseGamma('ℓ', mu=ℓ_μ, sigma=ℓ_σ)
η = pm.Gamma('η', alpha=2, beta=1)
...
Note that this doesn’t chop the prior off at the lower bound, but it does make it much more likely that your MAP will be above it.