Run model with float32s on CPU

I have a model that I’m trying to run, and performance seems to be about the same between float32 and float64 (running on CPU, Ubuntu 18.04, OpenBLAS).

Am I doing something wrong?

import numpy
import os
import time
os.environ['THEANO_FLAGS']='device=cpu,floatX=float32,warn_float64=warn,blas.ldflags=-lblas -lgfortran'

import pymc3

N = 581012
D = 55

x = numpy.random.rand(N, D).astype(numpy.float32)
beta = numpy.random.rand(D).astype(numpy.float32)
y = numpy.random.randint(0, 2, N).astype(numpy.int32)  # random 0/1 labels (randint's upper bound is exclusive)

basic_model = pymc3.Model()

with basic_model:
    pymc_beta = pymc3.Normal('beta', mu = numpy.float32(0), sd = numpy.float32(1), shape = D)
    pymc_y = pymc3.Bernoulli('y', p = pymc3.math.sigmoid(pymc3.math.dot(x, pymc_beta)), observed = y)

    step = pymc3.NUTS(max_treedepth = 10, 
                      adapt_step_size = True) 
    
    t0 = time.time()
    pymc_fit = pymc3.sample(40, tune = 40, step = step, chains=1)
    
pymc_time = time.time() - t0
pymc3_total_leapfrog_steps = numpy.sum(pymc_fit.get_sampler_stats("tree_size"))

print("Total time (s): ", pymc_time)
print("Total leapfrog steps: ", pymc3_total_leapfrog_steps)
print("Time per leapfrog step (s): ", pymc_time / pymc3_total_leapfrog_steps)

I’m running single threaded with OMP_NUM_THREADS=1. Thanks!

I’m not sure I understand: are you expecting better performance with float32?
In my experience, if your problem is well conditioned (all the parameters in more or less the same range, and the parameters in the latent space roughly bell-shaped), running in float32 vs float64 should not make much of a difference.

Yeah, I figured that since floats take half the memory of doubles, things might be twice as fast?

I don’t know how theano works internally, but a matrix-vector multiply like this should be memory bound, so that was the thinking at least.
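To put rough numbers on that (just a back-of-the-envelope sketch, assuming the whole design matrix is streamed from main memory once per gradient evaluation):

# Rough memory traffic per pass over the N x D design matrix;
# the beta and y vectors are negligible in comparison.
N, D = 581012, 55
for name, itemsize in [("float32", 4), ("float64", 8)]:
    print(name, "~%.0f MB per pass" % (N * D * itemsize / 1e6))
# float32 ~128 MB, float64 ~256 MB: if the matvec really is memory bound,
# halving the element size should roughly halve the time per evaluation.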

I see. I guess it depends on the BLAS library you are running? My impression is that Intel does something quite smart in its optimizations, so you might not see a difference in this case, but I don’t have much experience with this. Maybe @aseyboldt has some additional insight.
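It might also be worth double checking which BLAS numpy and theano actually pick up; a quick sketch (the exact output and config attributes can vary between versions):

import numpy
import theano

# BLAS/LAPACK libraries numpy was built against
numpy.__config__.show()

# Linker flags theano uses when it compiles ops that call GEMV/GEMM
print(theano.config.blas.ldflags)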

You can see a bit better what’s going on behind the scenes if you look at the logp function directly:

# Compile the joint logp/dlogp function with theano profiling enabled
func = basic_model.logp_dlogp_function(profile=True)
func.set_extra_values({})
# Choose a N(0, 1) point in the transformed space
x0 = numpy.random.randn(func.size)
%timeit func(x0)

func.profile.summary()

For me float32 is faster than float64, something like 50 ms vs 90 ms. I’m using BLIS on an AMD processor though, so that might be different.
What does your profile output look like?


Oh, and a model like this sounds like a perfect candidate for a GPU. So if you feel like experimenting (running pymc3 on the GPU isn’t supported or tested):

I get 1.5 ms if I set device=gpu.
We can also explicitly transfer the matrix to the GPU first:

import numpy
import os
import time
os.environ['THEANO_FLAGS']='contexts=dev0->cuda0,floatX=float32,warn_float64=warn,blas.ldflags=-lblas -lgfortran'

import pymc3
import theano
import theano.tensor as tt

dtype = numpy.float32

N = 581012
D = 55

x = numpy.random.rand(N, D).astype(dtype)
x = theano.shared(x, target='dev0')

beta = numpy.random.rand(D).astype(dtype)
y = numpy.random.randint(0, 2, N).astype(numpy.int32)
y = theano.shared(y, target='dev0')

basic_model = pymc3.Model()

with basic_model:
    pymc_beta = pymc3.Normal('beta', mu = dtype(0), sd = dtype(1), shape = D)
    pymc_beta = pymc_beta.transfer('dev0')
    pymc_y = pymc3.Bernoulli('y', p = pymc3.math.sigmoid(pymc3.math.dot(x, pymc_beta)), observed = y)

I get some warnings though; it seems like there is no default context.
It’s a shame they stopped developing theano, that multi-GPU support is pretty neat.


Oh, I didn’t know about the profile option, I’ll try that. It looks like a better way to get the per-gradient timings; I was getting quite a bit of variation in the timings the way I was measuring them.

I am using OpenBLAS. I checked while the program was running that it’s linked against the right library.
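(For reference, one way to do that check from inside the process is to look at the shared objects mapped into it; a minimal Linux-only sketch:)

# List BLAS-looking shared libraries mapped into the current process (Linux only)
with open('/proc/self/maps') as f:
    blas_libs = {line.split()[-1] for line in f if 'blas' in line.lower()}
for lib in sorted(blas_libs):
    print(lib)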

I also often run the gradient evaluation in a loop and then look at the output of perf record -p <> and perf report (on Linux; I don’t think that exists on Mac). There you should be able to see which symbols the time is spent in, and whether it is in fact calling dgemv or sgemv. If OpenBLAS really isn’t faster for float32, that would be quite confusing…
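Something like this works as a target for perf (a minimal sketch, assuming basic_model is the model defined above; attach perf to the printed PID while the loop is running):

import os
import numpy
import theano

# Compile the joint logp/dlogp function once, then evaluate it in a tight loop
func = basic_model.logp_dlogp_function()
func.set_extra_values({})
x0 = numpy.random.randn(func.size).astype(theano.config.floatX)

print("attach perf to pid", os.getpid())
for _ in range(1000):
    func(x0)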