Nuts-rs: meaning of the parameters in the CpuLogpFunc

I’m not sure this is the right place to ask this, but I can’t find a forum that’s specifically dedicated to nuts_rs. Keep in mind that I’m a Rust beginner.

What’s the meaning of the parameters of the logp function below? From the “official example”, position seems to be the N-dimensional vector of parameters of the function we want to sample from, while grad is the gradient at that N-dimensional point, right?

impl CpuLogpFunc for PosteriorDensity {
    type LogpError = PosteriorLogpError;

    fn logp(&mut self, position: &[f64], grad: &mut [f64]) -> Result<f64, Self::LogpError> {
        // ...
        Ok(...)
    }
}

I’m asking this because I’m getting some bizarre traces for things I’ve been playing with, so maybe my understanding of this is wrong…

1 Like

Hello :slight_smile:

I’m not sure this is the right place to ask this, but I can’t find a forum that’s specifically dedicated to nuts_rs

This is the right place.

What you are saying sounds correct to me. Given some unnormalized density function, the logp function should return the log-density of that distribution evaluated at position, and should write the gradient of that log-density (with respect to position) into grad.

If you can share an example that fails for you, we can try to figure out why. An easy explanation, for instance, would be an incorrect gradient.
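For reference, here is a minimal sketch of what an implementation could look like, using an N-dimensional standard normal so the gradient is easy to check by hand (StandardNormal and its ndim field are placeholders, and the error type just follows your snippet):

struct StandardNormal {
    ndim: usize,
}

impl CpuLogpFunc for StandardNormal {
    type LogpError = PosteriorLogpError;

    fn dim(&self) -> usize {
        self.ndim
    }

    fn logp(&mut self, position: &[f64], grad: &mut [f64]) -> Result<f64, Self::LogpError> {
        // log p(x) = -0.5 * sum_i x_i^2, up to an additive constant
        let mut logp = 0.0;
        for (x, g) in position.iter().zip(grad.iter_mut()) {
            logp -= 0.5 * x * x;
            // d(log p)/dx_i = -x_i
            *g = -x;
        }
        Ok(logp)
    }
}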

Thanks, it was a very stupid error in which I switched X and Y in the data.

The whole thing is part of an attempt to fit Bayesian models from Elixir. I can already define models in Elixir and compile them to Stan, but to compile Stan models one needs to install Stan and the whole C++ build chain, which I don’t like much… The Rust toolchain seems much nicer, and the code much more comprehensible (even though I don’t know much Rust, I can see myself learning Rust, but I definitely can’t see myself learning C++ and the C++ template language).

I did have a very complex logp density because I was stress-testing my Elixir-to-Rust compiler, but the error was actually just switching X and Y.

That sounds cool!
Feel free to drop a link once you have something to share if you like :slight_smile:

I don’t have an end-to-end pipeline yet, but the idea is to be able to do something like this (this is actual working code, it’s just missing some parts to be end-to-end):

  alias Ulam.Nutex.Math
  alias Ulam.Nutex.Math.Compiler

  test "compile unnormalized normal logp" do
    use Ulam.Nutex.Math.Operators

    i = Math.v("i")
    n = Math.data("n")

    y = Math.data("y")
    x = Math.data("x")

    alpha = Math.param("alpha")
    beta = Math.param("beta")
    epsilon = Math.param("epsilon")

    result = - (y[i] - (beta * x[i] + alpha)) * (y[i] - (beta * x[i] + alpha)) / (2.0 * epsilon * epsilon) - Math.ln(epsilon)

    {_cache, expression} = Math.optimize_summation(i, n, result)

    Compiler.to_rust(expression)
  end

The code above defines the unnormalized log-likelihood for a linear regression model with Gaussian error. The coefficients of the model are alpha and beta, and the error is epsilon. The line use Ulam.Nutex.Math.Operators overrides the math operators so that they generate unevaluated math expressions, which compile into rational functions of variables, constants and elementary functions (such as the Math.ln() function).

The expression above should be summed over all i (the index for the x[i] and y[i] vector access terms) to get the full likelihood for the data. The code then attempts to separate the parts that depend on i and turn the expression into a linear combination of terms that depend on i and on the x and y data parameters.

Because the normal distribution (and many others) belongs to the exponential family, we know that separating these terms will be worth it: after separating the x[i] and y[i] terms (and possibly their product), we can use the distributive property to move the parameters out of the sums, cache the sums before even running the model (i.e. (0..n).into_iter().map(|i| x[i]).sum() or (0..n).into_iter().map(|i| x[i] * y[i]).sum()), and convert the expression into something that only uses the cached sums. This dramatically decreases the number of nodes in the autodiff tape and the number of operations required to evaluate the gradient.
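Spelled out for this model, expanding the square and pulling the parameters out of the sums gives

\sum_i [ -(y[i] - (beta x[i] + alpha))^2 / (2 epsilon^2) - \ln(epsilon) ]
  = ( -2 n epsilon^2 \ln(epsilon) - n alpha^2 - 2 alpha beta \sum_i x[i] + 2 alpha \sum_i y[i] - beta^2 \sum_i (x[i])^2 + 2 beta \sum_i x[i] y[i] - \sum_i (y[i])^2 ) / (2 epsilon^2)

where every \sum_i factor only involves data and can be cached up front.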

The Elixir code above then generates the following Rust code:

let target = target + (
            - 2.0 * (self.data.n as f64) * epsilon.ln() * epsilon * epsilon
            - (self.data.n as f64) * alpha * alpha
            - 2.0 * alpha * beta * self.cache.sum_of_x_i__0
            + 2.0 * alpha * self.cache.sum_of_y_i__3
            - beta * beta * self.cache.sum_of_x_i_x_i__1
            + 2.0 * beta * self.cache.sum_of_x_i_y_i__2
            - self.cache.sum_of_y_i_y_i__4
          ) / ((2.0 * epsilon * epsilon));

This can be used as part of the following Logp definition (which I still can’t build automatically, but that’s just a matter of bookkeeping):

#[derive(Debug, Clone)]
pub struct CacheContainer {
    sum_of_x_i__0: f64,
    sum_of_y_i__3: f64,
    sum_of_x_i_x_i__1: f64,
    sum_of_x_i_y_i__2: f64,
    sum_of_y_i_y_i__4: f64
}

#[derive(Debug, Clone)]
pub struct DataContainer {
    y: Vec<f64>,
    x: Vec<f64>,
    n: u32
}

#[derive(Debug)]
pub struct PosteriorDensityWithCache {
    cache: CacheContainer,
    data: DataContainer
}

impl CpuLogpFunc for PosteriorDensityWithCache {
    type LogpError = PosteriorLogpError;

    fn dim(&self) -> usize { 3 }

    fn logp(&mut self, position: &[f64], grad: &mut [f64]) -> Result<f64, Self::LogpError> {
        let tape = autograd::Tape::new();

        let target = tape.add_var(0.0);

        let alpha = tape.add_var(position[0]);
        let beta = tape.add_var(position[1]);
        let epsilon = tape.add_var(position[2]);

        let target = target + (
            - 2.0 * (self.data.n as f64) * epsilon.ln() * epsilon * epsilon
            - (self.data.n as f64) * alpha * alpha
            - 2.0 * alpha * beta * self.cache.sum_of_x_i__0
            + 2.0 * alpha * self.cache.sum_of_y_i__3
            - beta * beta * self.cache.sum_of_x_i_x_i__1
            + 2.0 * beta * self.cache.sum_of_x_i_y_i__2
            - self.cache.sum_of_y_i_y_i__4
          ) / ((2.0 * epsilon * epsilon));


        let target_grad = target.grad().wrt(&[alpha, beta, epsilon]);

        // This is stupid, it can be done better but my rust doesn't allow
        // for anything more advanced yet...
        grad[0] = target_grad[0];
        grad[1] = target_grad[1];
        grad[2] = target_grad[2];

        return Ok(target.val)
    }
}

The DataContainer and the CacheContainer can be initialized before even running the logp and don’t need to be rebuilt every time. The autograd module is a slight modification of the reverse crate, which implements reverse-mode autodiff for Rust.
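Filling the cache is just a handful of sums over the data, done once before sampling; something like this (a hand-written helper for now, like the rest of the code):

impl CacheContainer {
    // Compute the cached sums once from the data; no gradients are involved here.
    fn from_data(data: &DataContainer) -> CacheContainer {
        CacheContainer {
            sum_of_x_i__0: data.x.iter().sum(),
            sum_of_y_i__3: data.y.iter().sum(),
            sum_of_x_i_x_i__1: data.x.iter().map(|x| x * x).sum(),
            sum_of_x_i_y_i__2: data.x.iter().zip(data.y.iter()).map(|(x, y)| x * y).sum(),
            sum_of_y_i_y_i__4: data.y.iter().map(|y| y * y).sum(),
        }
    }
}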

One could implement part of the above purely on the Rust side by exploiting the properties of the exponential family of distributions, but by expanding all the multiplications on the Elixir side, I can simplify things even further compared to what one could get from the exponential-family properties alone.

So, in summary, most of the above Rust code was written by hand, but I have a roadmap for generating it from Elixir. Ideally I would do most of the code generation on the Rust side because it would be more portable, but implementing the equivalent of a subset of a computer algebra system in Rust to reduce the complex expressions seems like hell (and I can’t write much Rust, to be honest…)

1 Like

I believe the above is quite close to what you’re doing by compiling PyMC models into Rust so that you can use nuts_rs to sample them. The main difference is that PyMC has committed to only generating static compute graphs and requires recompiling the model when the data changes, while I’d like my system to be a little more general, in the spirit of Stan, which can use dynamic computation graphs.

I don’t think it would be too hard to generalize the above to sums over multiple indices for multidimensional arrays or matrices. There is probably even a way of optimizing matrix multiplications in the same way (for example for linear models with multiple coefficients where the number of coefficients is dynamic), but I still haven’t gotten around to doing that.

PS: with this caching scheme, if we’re feeding polynomials to a distribution in the exponential family and the model has a fixed number of parameters, then the runtime of the logp function is independent of the size of the data (because we’re only working with cached sums).

That’s not quite true. You can define models/logp functions for a dynamic number of datapoints, just not a dynamic number of dimensions. PyMC itself doesn’t make use of this at the moment, but other libraries/users (like nutpie) can exploit it just fine.

Regardless, your work sounds pretty cool!

Sure, I was talking specifically about PyMC. I know that nutpie (and by extension nuts_rs) can do it just fine, and as a matter of fact I have done it myself with the above example, although with all the algebraic simplifications the computation graph has actually become static!

Your work using Python (in PyMC) is pretty cool too! And I believe you could emulate much of what I’m doing in an even easier way… If I were using Python, I’d use SymPy to perform the algebraic simplifications I’m doing in Elixir now, which I think would be possible. With a real computer algebra system you could probably leverage more advanced rewriting rules to get an even more optimized expression.

For example, look at my target definition:

let target = target + (
            - 2.0 * (self.data.n as f64) * epsilon.ln() * epsilon * epsilon
            - (self.data.n as f64) * alpha * alpha
            - 2.0 * alpha * beta * self.cache.sum_of_x_i__0
            + 2.0 * alpha * self.cache.sum_of_y_i__3
            - beta * beta * self.cache.sum_of_x_i_x_i__1
            + 2.0 * beta * self.cache.sum_of_x_i_y_i__2
            - self.cache.sum_of_y_i_y_i__4
          ) / ((2.0 * epsilon * epsilon));

Note that the first term of the numerator of the rational function (- 2.0 * (self.data.n as f64) * epsilon.ln() * epsilon * epsilon) is a multiple of epsilon * epsilon, which is also a factor of the denominator. It would be more numerically precise and more efficient to move that term out of the fraction, but in the general case that would require performing polynomial division, which I haven’t implemented in Elixir.
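Done by hand for this particular case, the rewrite I have in mind would look something like this (the epsilon * epsilon factor cancels against the denominator, leaving a plain n * ln(epsilon) term outside the fraction):

let target = target
    - (self.data.n as f64) * epsilon.ln()
    + (
        - (self.data.n as f64) * alpha * alpha
        - 2.0 * alpha * beta * self.cache.sum_of_x_i__0
        + 2.0 * alpha * self.cache.sum_of_y_i__3
        - beta * beta * self.cache.sum_of_x_i_x_i__1
        + 2.0 * beta * self.cache.sum_of_x_i_y_i__2
        - self.cache.sum_of_y_i_y_i__4
      ) / (2.0 * epsilon * epsilon);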

In the general case, the best way of rewriting an algebraic expression such as this is probably to first reduce it to a rational function whose numerator is rewritten to take advantage of the cached sums, and then apply something like simulated annealing with an appropriate cost function. I don’t know whether that would take too long, but given that the problem space is small, you could probably get away with a fast cooling schedule. I also don’t know if anyone is using simulated annealing to rewrite expressions in order to make them more efficient.

2 Likes

Have you ever had the chance to look at PyTensor and its mathematical rewrite machinery?

https://pytensor.readthedocs.io/

Obviously there are many systems out there for operating symbolically on mathematical expressions, but since you’re here and we use it, I couldn’t help but ask.

We do very analogous stuff (for different goals), like lifting indexing operations or partial constant folding of graphs. On the more Bayesian side, we make use of it for deriving probability densities/mass functions or marginalizing variables.

I looked into it a long time ago. I actually don’t like Python very much… I have a really hard time reasoning about efficiency, and sometimes even correctness, with all the weird things that Python can do under the hood at runtime. Elixir can do some pretty freaky things at compile time, but fortunately not at runtime. And I also don’t love the API of most Python packages…

Obviously I have to use Python for serious stuff due to the lack of libraries everywhere else, but I’d rather use Elixir if it had access to the libraries I need.

1 Like

Fair enough :slight_smile:

Ok, it turns out I was excessively optimistic regarding how much I could leverage the exponential family’s structure to implement this… These algebraic tricks only work if the sufficient statistic is a polynomial in the independent variable, which is the case for the normal distribution but not for a typical distribution. So I’m abandoning this whole approach and focusing on how to minimize the number of nodes in my gradient computation.

@tcapretto has done some interesting work on exploiting summary statistics in the likelihood, maybe he can share something

It was also my impression that this trick only works if there are multiple observations for the same point. But I think in that case we can always get the same or a very similar effect by replacing those multiple values with their mean and adjusting sigma?
I’d be curious to hear if that is actually the case, or if I’m just imagining things :slight_smile:

I think we could even do that automatically in the logp function of, e.g., the normal distribution without much trouble.
If there are multiple observations for the same mu and sigma values (and in most applications I think we know whether that’s the case statically in the logp function, simply from the shapes of mu and sigma), we can define the logp in terms of the sufficient statistics of the observed values.

1 Like

Yes, that should be easy. We could adapt something like Implement specialized MvNormal density based on precision matrix by ricardoV94 · Pull Request #7345 · pymc-devs/pymc · GitHub for specialized logp implementations (so we don’t have to fill the logp function with if/else statements), but we’d need to know it’s not a case where we want the elemwise logp (or is it still Elemwise?).

1 Like

You don’t need something as strong as a single mean for all the datapoints for my caching trick. Let’s say you want to fit something like this (written in weird Stan-like pseudocode):

params : all parameters of the model

data x: array[n, m]
data y: array[n]

for i in 0..n {
  y[i] ~ normal(f1(params)*g1(x[i, ...]) + f2(params)*g2(x[i, ...]) + ... + fk(params)*gk(x[i, ...]), expression_that_does_not_depend_on_i)
}

where gj(x) does not depend on the params, fj(params) does not depend on x (for j in [1, ..., k]), and expression_that_does_not_depend_on_i does not depend on i. The idea is that you should be able to treat the x[i, j] as variables and all the parameters as constants, so that the means of the normal can be seen as multivariate polynomials in the elements of x. Then, you can expand the log-density of the normal and turn it into polynomials of higher degree, in which the x[i, j] are separated from the parameters. And finally, you can sum each product over i and use the distributive property to turn that sum into a linear combination of terms of the form fj(params) * \sum_i gj(x[i, ...]), which allows you to treat \sum_i gj(x[i, ...]) as a constant that can be cached, dramatically decreasing the number of gradients you’d have to keep (and also the total number of operations).
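Written out for the squared residual (the 1 / (2 * variance) factor does not depend on i, so it can be pulled out of the sum), expanding the square gives

\sum_i (y[i] - \sum_j fj gj(x[i, ...]))^2
  = \sum_i (y[i])^2 - 2 \sum_j fj \sum_i y[i] gj(x[i, ...]) + \sum_j \sum_l fj fl \sum_i gj(x[i, ...]) gl(x[i, ...])

with fj short for fj(params). Every \sum_i factor only involves data, so the k sums \sum_i y[i] gj(x[i, ...]) and the k^2 sums \sum_i gj(x[i, ...]) gl(x[i, ...]) can all be cached.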

This is only possible because of the shape of the normal distribution, which allows for that separation. Other distributions such as the lognormal do not allow this in general.

For anyone interested in this, since PyMC also compiles to nuts-rs: after some testing, I think that for distributions in the exponential family whose parameters are scalars, in a model defined as something like the following (in Stan code):

data {
  int<lower=0> N;
  array[N] real x;
}

parameters {
  real mu;
  real<lower=0> sigma;
}

model {
  x ~ normal(mu, sigma);
}

it makes sense to cache the sufficient statistics of the distribution, which for the normal distribution are \sum_i x[i] and \sum_i (x[i])^2. These cached values can be evaluated once (with no gradients, because we are just summing floats) and reused across the NUTS sampling iterations. This decreases the number of nodes involved in gradient calculations from N to a small constant.
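Spelling it out, expanding the square in the unnormalized log-density gives

\sum_i [ -(x[i] - mu)^2 / (2 sigma^2) ] - N \ln(sigma)
  = ( -\sum_i (x[i])^2 + 2 mu \sum_i x[i] - N mu^2 ) / (2 sigma^2) - N \ln(sigma)

so only mu, sigma and the two cached sums ever enter the gradient tape.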

For the Rust gradient library I’m using, it makes sense to use a cache for this purpose, but not for more complex situations, unless I really implement a computer algebra system that runs at compile time with very clever heuristics.

As a striking visual example, I get the following speedups for the equivalent of the Stan code above (the logp function is still written manually in Rust, I don’t have a functional compiler yet):

We don’t need to do anything special in PyMC for caching: if a graph has a constant component, it will be constant-folded and inlined in the final compiled function.

The thing we need is to write the logp in a way that uses the summary statistics instead of the general expression we use by default (when that’s appropriate).

Yes, because the graphs in PyMC are inherently static, constant folding is much easier. And just a small correction: if I understand you correctly, it’s not “summary statistics” (those are usually the median, mode, standard deviation, etc.), it’s “sufficient statistics”, which may not have anything to do with any summary statistic.