ImageNet in 4 Minutes? What the paper really shows us

ImageNet has been a deep learning benchmark data set since it was created.  Its associated competition was where deep networks first showed they could outperform more traditional computer-vision techniques, and it has been used by academics as a standard for testing new image classification systems ever since.  A few days ago an exciting paper was published on arXiv describing training on ImageNet in four minutes.  Not weeks, days or hours, but minutes.  On the surface this is a great leap forward, but it's important to dig a little deeper.  The Register's sub-headline says all you need to know.

So if you don’t have a relaxed view on accuracy or thousands of GPUs lying around, what’s the point? Is there anything that can be taken from this paper?

Firstly, there are some great points on how batch size affects accuracy in clustered training.  As the paper explains, increasing the batch size allows the model to take larger steps and progress the optimisation faster, at the expense of test accuracy.  Part of the art in getting such systems working well is finding an appropriate trade-off between speed and accuracy, particularly on systems that require regular retraining.  Jia and Song et al. have tested several strategies for preventing loss of accuracy with a 64k mini-batch size when training AlexNet and ResNet-50 on the ImageNet data set.  Their strategies can be applied more generally, and I'd love to see more industrial take-up of this.
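
To make that trade-off concrete, here's a minimal sketch (my own illustration, not code from the paper) of the standard linear learning-rate scaling rule with gradual warm-up, which is the usual starting point when you grow the mini-batch size:

```python
# Minimal sketch (my own illustration): linear learning-rate scaling with
# a warm-up period, the usual first step when increasing mini-batch size.

def scaled_lr(base_lr, base_batch, batch, epoch, warmup_epochs=5):
    """Scale the learning rate linearly with batch size, ramping up
    gradually over the first few epochs to avoid early divergence."""
    target_lr = base_lr * (batch / base_batch)   # linear scaling rule
    if epoch < warmup_epochs:
        # gradual warm-up: interpolate from base_lr up to target_lr
        return base_lr + (target_lr - base_lr) * (epoch / warmup_epochs)
    return target_lr

# e.g. going from a 256 mini-batch at lr=0.1 to a 64k mini-batch
print(scaled_lr(0.1, 256, 65536, epoch=10))  # -> 25.6 once warm-up is done
```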

It's been a year since You et al. released the paper introducing the Layer-wise Adaptive Rate Scaling (LARS) algorithm to scale AlexNet up to 8k batches and ResNet-50 to 32k without loss of accuracy, building on the earlier work of Goyal et al., who did not use LARS to get their large mini-batch results1.  This was immediately a challenge to the community to do better, and I'm surprised it's taken this long to get to 64k.  At its heart, mixing single precision (FP32) and half precision (FP16) with LARS is the intuitive leap that allowed Jia and Song et al. to increase the mini-batch size to 64k while improving scalability and maintaining accuracy.
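
For those who haven't met LARS, the rough shape of the two ideas together is sketched below (the simplifications and numbers are mine, not the authors' code): LARS gives each layer its own learning rate proportional to the ratio of its weight norm to its gradient norm, while mixed precision keeps an FP32 master copy of the weights and uses FP16 for the heavy compute.

```python
# Rough sketch (assumptions mine, not the authors' implementation) of a
# single LARS update on one layer, with FP32 master weights and an FP16
# copy handed back for the forward/backward pass.
import numpy as np

def lars_update(weights_fp32, grad_fp16, global_lr,
                weight_decay=5e-4, trust_coef=0.001):
    grad = grad_fp16.astype(np.float32)      # accumulate in full precision
    grad += weight_decay * weights_fp32      # include the weight-decay term
    w_norm = np.linalg.norm(weights_fp32)
    g_norm = np.linalg.norm(grad)
    # layer-wise trust ratio: large weights / small gradients => larger local lr
    local_lr = trust_coef * w_norm / (g_norm + 1e-9) if w_norm > 0 else 1.0
    weights_fp32 -= global_lr * local_lr * grad
    return weights_fp32, weights_fp32.astype(np.float16)  # FP16 copy for compute
```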

Secondly, there are ways to overcome the problems of large-scale distributed training.  As anyone who has tried to scale deep learning horizontally knows, bandwidth limits and the coordination of information cause bottlenecks and latency that limit the efficacy of simply throwing more GPUs at something.  More GPU power doesn't always scale, but improving this efficiency can give a far better trade-off between the cost of extra GPUs and the return in speed.  Much as we'd all love to have our own Nvidia DGX-1 to work on, most companies just can't justify the expense.  Better techniques using larger numbers of older cards could give smaller companies a step up in speed to progress their solutions.  The input pipeline technique they mention, using the CPUs more efficiently, is something I've been using for years, and if I'd known it wasn't a "thing" that everyone did, I'd have published a paper on it myself2.
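
The idea is simply to keep the CPUs decoding and augmenting the next batch while the GPU is still busy with the current one.  Here's a minimal sketch, with a hypothetical load_batch function standing in for your real CPU-side pipeline:

```python
# Minimal sketch of CPU-side prefetching: load_batch is a hypothetical
# function that does the CPU work (decode, augment) for batch i.
import queue
import threading

def prefetcher(load_batch, num_batches, depth=4):
    """Run data loading in a background thread so the GPU never waits."""
    q = queue.Queue(maxsize=depth)

    def worker():
        for i in range(num_batches):
            q.put(load_batch(i))   # CPU-side decode/augment happens here
        q.put(None)                # sentinel: no more data

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch

# usage (train_step is also hypothetical):
#   for batch in prefetcher(load_batch, num_batches=1000):
#       train_step(batch)
```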

However, to deal with the communication-efficiency issue of ring all-reduce in the GPU cluster, they propose tensor fusion to reduce the latency caused by lots of smaller messages.  As tensors become ready for sending, they are buffered until there is a large enough block of data to make best use of the bandwidth, and then the fused block is sent to all-reduce.  If you've had any experience with message queue systems, you'll know that message volume can be far more of a problem than message size (up to a point), so the tensor-fusion technique should not come as a surprise.  A lot of techniques from general large-scale AI productionisation are now bleeding into academia.  To all you companies out there doing great things: get those papers published now.  These techniques to improve efficiency will come out soon, so you might as well get the credit for them.
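
A toy sketch of the fusion idea (the names, threshold and allreduce_fn interface are my own assumptions, not the paper's implementation): buffer gradient tensors until roughly a fixed number of bytes is ready, then send one fused message and split the result back out.

```python
# Toy tensor-fusion sketch: allreduce_fn is assumed to take a 1-D array and
# return the element-wise reduced array of the same shape.
import numpy as np

def fused_allreduce(tensors, allreduce_fn, fusion_bytes=64 * 1024 * 1024):
    buffer, buffered_bytes, results = [], 0, []

    def flush():
        nonlocal buffer, buffered_bytes
        if not buffer:
            return
        flat = np.concatenate([t.ravel() for t in buffer])  # one big message
        reduced = allreduce_fn(flat)
        # split the reduced result back into the original tensor shapes
        offset = 0
        for t in buffer:
            results.append(reduced[offset:offset + t.size].reshape(t.shape))
            offset += t.size
        buffer, buffered_bytes = [], 0

    for t in tensors:
        buffer.append(t)
        buffered_bytes += t.nbytes
        if buffered_bytes >= fusion_bytes:
            flush()   # enough data to make the bandwidth worthwhile
    flush()           # send whatever is left over
    return results
```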

One of the other points they note in the paper is that they adjust the AlexNet architecture as well: changing how the parameters are regularised affected overall accuracy for the same number of epochs.  This level of tinkering is something most AI teams skip over when using existing networks, because in industry it's all about the data and there isn't the time to fine-tune pre-existing networks.  Thank you to all in academia who are doing these experiments3!
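
I won't try to reproduce their exact tweak here, but one common regularisation change of this flavour is excluding biases and normalisation parameters from weight decay.  A rough PyTorch sketch (my own illustration, using torchvision's stock AlexNet) of what that kind of tinkering looks like:

```python
# Illustration only: apply weight decay to weight tensors but not to biases
# or other 1-D parameters, using optimiser parameter groups.
import torch
import torchvision

model = torchvision.models.alexnet()
decay, no_decay = [], []
for name, p in model.named_parameters():
    # 1-D parameters (biases, normalisation scales/shifts) skip weight decay
    (no_decay if p.ndim == 1 else decay).append(p)

optimiser = torch.optim.SGD(
    [{"params": decay, "weight_decay": 5e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.01, momentum=0.9)
```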

In an ideal world, none of your AI assets should be sitting idle: your CPUs, GPUs and humans should all be working, not waiting.  Companies doing AI without large financial investment can take the ideas in this paper (and its references) and sweat their assets to get the best results as fast as they can4.

Overall, much as it pains me to say it, the stark claim of training ImageNet in 4 minutes is the title this paper needed to get noticed.  Don't dismiss it because the accuracy suffers or because it requires thousands of GPUs: there are some interesting approaches in here that are applicable to AI work more generally.

  1.   Both of these papers are well worth a read if you want to know more about scaling hardware and overcoming accuracy issues.
  2.   Maximising CPU usage is one of the many things I’ll be covering in my Minds Mastering Machines talk later this year.
  3.   Not that we should be complacent: we should be applying this level of rigour to our own networks too, but when you're stretching limited resources you go with your gut instinct, and if it's good enough you move on to the next thing.
  4.   Or come to my M cubed talk where I'll tell you how to do it 😉
