I’m trying to use PyMC3 to sample from a maximum entropy distribution over binary patterns of length N, constrained by means and pairwise correlations, which is given by:

$$
P(x_1, \ldots, x_N) = \frac{1}{Z(\alpha, \beta)} \exp\left\{ \sum_i \alpha_i x_i + \sum_{i<j} \beta_{ij} x_i x_j \right\}
$$
Such that the alphas and betas are known, and X is a binary multivariate variable. I’ve tried to do this using a DiscreteUniform “baseline” and adding a pm.Potential, but I got weird results (e.g., setting all alphas and betas to zero did not reproduce the DiscreteUniform behavior).
I’d appreciate any ideas on how (and whether) this can be done.
Very sorry for the tardy response! Still stuck on this, though, so help is still welcome.
I’ve started with the simplest example, in which the beta coefficients are zero and the alphas are not (simplest in the sense that the joint distribution factorizes into independent univariate distributions). I began by synthesizing data:
import numpy as np
import pymc3 as pm

data = pm.Model()
alphas = np.array([-1, -2, 0.5, 3.5])
with data:
    # each X[i] is uniform over {0, 1} before the potential is added
    X = pm.DiscreteUniform('X', lower=0, upper=1, shape=4)
    # unnormalized log-probability contribution: alpha . x
    constraints = pm.math.dot(alphas, X)
    potential = pm.Potential("potential", constraints.sum())
    datatrace = pm.sample(4000, step=pm.BinaryMetropolis(vars=[X]))
Then, I tried to learn these parameters from the data:
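Roughly along these lines (a minimal sketch rather than the exact code; it uses the Normal prior mentioned further down, and X_obs stands for the patterns drawn above):

X_obs = datatrace['X']  # binary patterns sampled in the generation step

with pm.Model() as model:
    coeffs = pm.Normal('coeffs', mu=0, sd=1, shape=4)
    # each observed pattern contributes coeffs . x to the log-probability;
    # note that the partition function Z(alpha) is not accounted for here
    pm.Potential('loglike', pm.math.dot(X_obs, coeffs).sum())
    trace = pm.sample(2000)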
Starting to show some code definitely helps, but only saying that the results are weird doesn’t help much in getting to know your problem and what the cause of your error might be. Please describe what you expect and show a resulting trace of your model where it goes wrong.
Cheers!
You need to put constraints on your alpha and beta, or rewrite the energy function. Otherwise, it will just blow up towards negative infinity (you can sample using Metropolis and see that the trace just keeps going negative). As it is right now, the potential function gets smaller and smaller as alpha goes negative.
There is a similar discussion you can have a look at here: PyMC3 for physics
Regarding the constraining - I thought that putting the prior coeffs = pm.Normal('coeffs', mu=0, sd=1, shape=4) would be enough (but it obviously isn’t). I also tried using a uniform prior that includes the relevant alpha values (as in the PyMC3 for physics example you provided), but this failed as well - all values converged to the boundaries of the prior…
Maybe parameterize alpha as a unit vector (i.e., a vector with length 1)? I am not sure what the usual treatment is, but I think I have seen people doing that.
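For instance, something like this (just a sketch; raw is a made-up name):

with pm.Model() as model:
    raw = pm.Normal('raw', mu=0, sd=1, shape=4)
    # project onto the unit sphere so that the overall scale is fixed
    alpha = pm.Deterministic('alpha', raw / pm.math.sqrt((raw ** 2).sum()))

Note that this fixes the length of alpha to 1, so only the direction of the coefficient vector is identified, not its scale.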
I tried this, and indeed I get sensible alpha traces now (thanks!), but without recovering the original values.
Still, this seems like a relatively simple learning task (just 4 parameters, and as much data to learn from as I want…), so I guess there’s something I’m missing here… perhaps the potential should be written using theano variables and not pm.math functions?
That makes sense; I’ve tried, but did not succeed. If anyone has any suggestion regarding how to rewrite the potential function in a way that describes the function P(X) above, it’d be much appreciated!
I think sampling a data vector from the network above directly is quite difficult because of the partition function. One approach to circumvent that could be sampling data sequentially from the conditional distributions of the nodes given the states of the rest of the network.
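A minimal numpy sketch of that sequential scheme (a Gibbs-style sweep over the full conditionals; I am using the exp(+...) convention from the code above, and the names are made up):

import numpy as np

def gibbs_sample(alphas, betas, n_samples, n_burn=1000):
    n = len(alphas)
    x = np.random.randint(0, 2, size=n)
    samples = []
    for sweep in range(n_burn + n_samples):
        for i in range(n):
            # P(x_i = 1 | rest) = sigmoid(alpha_i + sum_{j != i} beta_ij * x_j)
            field = alphas[i] + betas[i] @ x - betas[i, i] * x[i]
            x[i] = np.random.rand() < 1.0 / (1.0 + np.exp(-field))
        if sweep >= n_burn:
            samples.append(x.copy())
    return np.array(samples)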
The idea is rooted in an article by Strauss (1992), who described the pseudolikelihood method to estimate the parameters of lattice models (see section 3.2 in that article). It basically boils down to a form of multivariate logistic regression that could possibly be implemented quite easily using pymc3. The main trick for estimation is setting up the proper design matrix in the logistic regression model.
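As a sketch of that procedure (shown here with scikit-learn rather than pymc3 for brevity; the function and variable names are my own):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pseudolikelihood(X):
    """One logistic regression per node: regress x_i on all other nodes."""
    n_samples, n = X.shape
    alphas = np.zeros(n)
    betas = np.zeros((n, n))
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # the design matrix is simply the states of the remaining nodes
        lr = LogisticRegression(C=1e6)  # large C, i.e. essentially unpenalized
        lr.fit(X[:, others], X[:, i])
        alphas[i] = lr.intercept_[0]
        betas[i, others] = lr.coef_[0]
    # each beta_ij is estimated twice (once from node i, once from node j)
    return alphas, (betas + betas.T) / 2.0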
However, I am also interested in whether it would be possible to use pymc3 to estimate the parameters of the model above directly, as I heard that the normalization (Z) would not be so much of a concern when using MCMC.
Literature:
Strauss, D. (1992). The many faces of logistic regression. The American Statistician, 46(4), 321–327.
This is exactly what I thought. Evaluating the partition function is a very hard problem (from what I understand, usually different approximations are computed instead, using the Wang-Landau method, for example); however, since it’s just a constant, I expected it not to have any effect on the sampling… so I still don’t understand why the inference doesn’t work. :-\
I know next to nothing about the context here, so maybe this is completely off, but could it be that the problem isn’t actually the second inference part in NUTS, but the data generation using BinaryMetropolis? If that hasn’t converged, then the posterior given the data from the trace might be nowhere near the right values.
There are only 16 possible values in the data generation part, so maybe it would be better to compute all those probabilities, normalize and draw random samples directly.
Edit: Nope, that’s not it.
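For reference, that direct draw for the generation step could look like this (a sketch using the alphas from above):

import itertools
import numpy as np

alphas = np.array([-1, -2, 0.5, 3.5])

# enumerate all 2**4 = 16 binary patterns
states = np.array(list(itertools.product([0, 1], repeat=4)))
logp = states.dot(alphas)          # unnormalized log-probabilities
p = np.exp(logp - logp.max())
p /= p.sum()                       # exact normalization over the 16 states

# draw i.i.d. samples instead of relying on BinaryMetropolis to converge
idx = np.random.choice(len(states), size=4000, p=p)
X_obs = states[idx]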
The problem seems to be that the normalization factor Z depends on the values of alpha, so you can’t actually ignore it during sampling.
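For a toy problem this small, Z can be written out explicitly inside the model. A sketch (assuming X_obs is the data matrix from the generation step; pm.math.logsumexp does the log-sum over all 16 states):

import itertools
import numpy as np
import pymc3 as pm

states = np.array(list(itertools.product([0, 1], repeat=4)))  # all 16 patterns

with pm.Model() as model:
    coeffs = pm.Normal('coeffs', mu=0, sd=1, shape=4)
    # exact log partition function: log sum_x exp(coeffs . x) over all states
    logZ = pm.math.logsumexp(pm.math.dot(states, coeffs))
    # full log-likelihood: each observed pattern contributes coeffs . x - log Z
    pm.Potential('loglike',
                 pm.math.dot(X_obs, coeffs).sum() - X_obs.shape[0] * logZ)
    trace = pm.sample(2000)

With the log Z term included, pushing the coefficients off towards infinity no longer increases the potential, so the trace should stay bounded.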
Explicitly representing the partition function $Z(\alpha, \beta)$ in the model could become complicated because the sum runs over all possible network states. For instance, with just 100 nodes we would already have $2^{100}$ possible network configurations. Not to speak of the 4950 possible connections between the nodes if we add the weights $\beta_{ij}$.
However, as the evaluated function $Z$ is basically scalar-valued, one could possibly write the independence model you suggested above as:
$$
p(x_1, \ldots, x_n) \propto \exp\left\{ -\sum_{i=1}^{n} \alpha_i x_i \right\}.
$$
An alternative formulation could be:
$$
p(x_1, \ldots, x_n) \propto \prod_{i=1}^{n} \exp\left\{ -\alpha_i x_i \right\}
$$
But I am not sure if this helps with implementing the model in pymc3 using the Potential class, as I am just starting out learning the package. What practically works for me with reasonable speed – for not too large networks – is the pseudolikelihood procedure using the conditional distributions.
Can you provide a code snippet/ref? Maybe I can try this instead; I’m looking for a solution that’ll work, not specifically for a potential-based one.
In the R world, there is at least one package on CRAN that can do this: IsingFit by van Borkulo et al. (2014). Reading the respective paper is recommended.
I also have an experimental R package (ludwig) on github using a related but slightly different approach.
The source for both packages is open, so just have a look.
Reference:
Borkulo, C. D. van, Borsboom, D., Epskamp, S., Blanken, T. F., Boschloo, L., Schoevers, R. A., & Waldorp, L. J. (2014). A new method for constructing networks from binary data. Scientific Reports, 4(5918), 1–10. doi:10.1038/srep05918