Try a different optimizer, and/or adjust the learning rate. For optimizers with momentum parameters, you can play around with those as well.
As I mentioned, we don’t have fancier tools like gradient clipping, learning rate scheduling, or retry-on-NaN (yet!), so help is wanted.
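In the meantime, here is a rough sketch of how those three tools could be hand-rolled around a training loop. It is plain Python and not tied to any particular library's API. The names `clip_by_norm`, `step_decay`, and `train` are made up for illustration, and the default values are arbitrary. "Retry on NaN" is interpreted here as "restart from the initial parameters with half the learning rate", which is only one possible reading.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return list(grad)


def step_decay(base_lr, step, decay=0.5, every=50):
    """Simple schedule: multiply the learning rate by `decay` every `every` steps."""
    return base_lr * decay ** (step // every)


def train(grad_fn, params, lr=0.1, momentum=0.9, steps=200,
          max_norm=1.0, max_retries=5):
    """SGD with momentum, gradient clipping, step decay, and retry on NaN/inf.

    Assumption: a "retry" restarts from the initial parameters with the
    base learning rate halved. Returns (final_params, base_lr_that_worked).
    """
    start = list(params)
    for _ in range(max_retries + 1):
        params = list(start)
        velocity = [0.0] * len(start)
        diverged = False
        for step in range(steps):
            grad = clip_by_norm(grad_fn(params), max_norm)
            cur_lr = step_decay(lr, step)
            velocity = [momentum * v - cur_lr * g
                        for v, g in zip(velocity, grad)]
            params = [p + v for p, v in zip(params, velocity)]
            # A NaN or inf gradient propagates into params, so check there.
            if not all(math.isfinite(p) for p in params):
                diverged = True
                break
        if not diverged:
            return params, lr
        lr *= 0.5  # back off and try again from the start
    raise RuntimeError("training diverged on every retry")


# Toy usage: minimize f(x, y) = x^2 + 10 * y^2, whose gradient is (2x, 20y).
final_params, lr_used = train(lambda p: [2.0 * p[0], 20.0 * p[1]], [3.0, -2.0])
```

The learning rate and momentum are ordinary arguments here, so the "play around with them" advice above amounts to calling `train` with different `lr` and `momentum` values and comparing the results.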