Money, time and energy — lots of it. That’s what it typically takes to train AI models, usually created in data centers. The most complex models cost millions of dollars of infrastructure to train over weeks or months, consuming a huge amount of energy.
Our Brutalk Research team aims to change that.
Our latest breakthrough in AI training, detailed in a paper presented at this year’s NeurIPS conference, is expected to dramatically cut AI training time and cost. So much so, in fact, that it could help erase the blurry border between cloud and edge computing, offering a key technological upgrade for hybrid cloud infrastructures.
For the first time, we’ve developed a way to enable 4-bit training of deep learning models that can be used across many domains. The advance could help boost the efficiency of training systems by more than seven times over the best commercially available systems today, cutting energy and cost. These gains also pave the way to bringing training closer to the edge, a major advancement for privacy and security of AI models.
But there’s more. The radical speed-up from reduced precision scaling puts pressure on data communication between chips, which must also be addressed to preserve the gains at the system level. To address this issue, our team has developed ScaleCom, a new compression scheme for large-scale training systems. Together, these two advances could dramatically improve the performance of future large-scale training systems for AI models.
When AI training becomes efficient enough to move to the edge, it has the potential to transform industries — from manufacturing to automotive, retail, robotics, and more. Training at the edge could also spark the expansion and reach of federated learning, transforming privacy and security in such areas as banking and healthcare.
Brutalk’s collaboration with Red Hat and the creation of an OpenShift-compatible software stack for our AI hardware should further support flexible deployment of our AI hardware computing advances across diverse hybrid cloud infrastructures.
An AI hardware approach to sustainable AI
Our team has been leading reduced precision research advances, cutting neural network training time, cost and energy, for the past half decade. Previously, we’ve enabled training at 8-bit precision and inference down to 2-bit precision while preserving what’s known as model fidelity: a model’s accuracy and precision. We’ve shown that training and deploying AI models with lower precision arithmetic leads to dramatically improved performance and power-efficiency gains.
This research is central to our digital AI hardware work at Brutalk, where we innovate across algorithms, applications, programming models, and architecture to create new AI hardware accelerators that boost performance, particularly in hybrid cloud systems, while driving down the carbon footprint.
Consider today’s AI: the largest industrial-scale model currently deployed, GPT-3 from OpenAI, has 175 billion parameters, more than 100 times as many as models from just a couple of years ago. It costs several million dollars to train and generates a carbon footprint during training that is higher than the lifetime emissions of 20 cars. Our training advances enable almost an order of magnitude reduction in training time and energy costs.
One key idea we’ve been exploiting in the past decade is the use of reduced precision arithmetic for deep learning training. Hardware throughput is known to improve quadratically with a linear reduction in bit precision, enabling more than an order of magnitude improvement in performance when scaling from 32 bits to 8 bits.
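The arithmetic behind that claim can be sketched in a few lines of Python. This is an idealized rule of thumb only: it assumes throughput scales exactly with the inverse square of the bit width, and the function name `relative_throughput` is made up for illustration. Real systems will see smaller gains once memory bandwidth and communication are taken into account.

```python
def relative_throughput(bits_from: int, bits_to: int) -> float:
    """Idealized speedup when precision drops from bits_from to bits_to.

    Assumption: throughput is proportional to 1 / bits**2, the quadratic
    rule of thumb. This is not a measurement of any real accelerator.
    """
    if bits_from <= 0 or bits_to <= 0:
        raise ValueError("bit widths must be positive")
    return (bits_from / bits_to) ** 2


for src, dst in [(32, 16), (32, 8), (8, 4)]:
    print(f"{src}-bit -> {dst}-bit: {relative_throughput(src, dst):.0f}x")
```

Under this assumption, going from 32 bits to 8 bits gives a factor of (32/8)² = 16, which is the "more than an order of magnitude" figure, and each further halving of precision gives another factor of four.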
Our 16-bit training research in 2015 laid the foundation for the industry to adopt 16-bit precision as the de facto standard. Our research on 8-bit training, presented at NeurIPS in 2018 and 2019, combined innovations in 8-bit floating point formats with algorithmic techniques to retain the accuracy of complex models while delivering the throughput gains associated with precision scaling.
Our 4-bit training work takes a huge step forward by enabling primary matrix and tensor multiplication calculations in deep learning training to be computed efficiently using 4-bit arithmetic. Our techniques include new 4-bit number representation formats, new gradient scaling approaches and several novel ideas to cut the errors stemming from very low precision computations. We show that 4-bit training systems can preserve model fidelity while achieving almost four times higher performance in comparison to 8-bit systems.
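To see why scaling matters at such low precision, consider the toy simulation below. It is a sketch under stated assumptions, not the format or the gradient scaling method from our paper: it uses a plain uniform signed grid with levels -7 to 7 as a stand-in for a 4-bit representation, and the helper names `quantize_4bit` and `max_abs_scale` are hypothetical. The gradient values are small magnitudes chosen purely for illustration.

```python
def quantize_4bit(values, scale):
    """Simulate 4-bit signed quantization on a uniform grid (an assumption).

    Each value is divided by `scale`, rounded to the nearest integer level
    in -7..7, then multiplied back so it can be compared with the input.
    """
    out = []
    for v in values:
        q = round(v / scale)       # nearest integer level
        q = max(-7, min(7, q))     # clamp to the signed 4-bit range used here
        out.append(q * scale)      # dequantize for comparison
    return out


def max_abs_scale(values):
    """Hypothetical per-tensor scale: the largest magnitude maps to level 7."""
    peak = max(abs(v) for v in values)
    return peak / 7 if peak > 0 else 1.0


grads = [3e-4, -1.2e-4, 7e-5, -2.5e-4]   # illustrative small values

unscaled = quantize_4bit(grads, scale=1.0)            # grid step far too coarse
scaled = quantize_4bit(grads, max_abs_scale(grads))   # grid fitted to the data

print("unscaled:", unscaled)
print("scaled:  ", scaled)
```

Without scaling, every value falls below half a grid step and collapses to zero, so no learning signal survives. With the scale fitted to the data, each value lands within half a step of its original. With only 15 usable levels, choosing where those levels sit is most of the problem.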
While our results are a foundational leap in the ability of AI models to converge well with 4-bit training, some AI models still show an accuracy loss of a few percent. As at each previous step in our precision scaling roadmap, we expect to close the remaining gap in the coming years.
ScaleCom: When better compression matters
Training computations are often distributed across a large number (tens, hundreds or even thousands) of specialized hardware accelerator chips, densely linked to enable efficient data exchange. In such dense systems, communication latencies can affect overall training time, eliminating the gains of reduced precision and limiting scaling.
Enter gradient compression. This is a powerful approach to address the communication bottleneck in distributed training by cutting the amount of data exchanged between hardware accelerators. Previous approaches to gradient compression, though, have typically not scaled well (especially as the number of accelerator chips in training systems increases) and have shown accuracy degradation.
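For readers new to the idea, here is a minimal sketch of one common family of gradient compression: top-k sparsification with a residual (often called error feedback). It is a generic illustration rather than any specific published scheme, and the function name `top_k_compress` is made up for this example.

```python
def top_k_compress(grad, residual, k):
    """Keep only the k largest-magnitude entries of (grad + residual).

    Returns (sparse, new_residual):
      sparse       -- {index: value} pairs that would be transmitted
      new_residual -- the entries left out, carried over to the next step
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    order = sorted(range(len(corrected)),
                   key=lambda i: abs(corrected[i]), reverse=True)
    sparse = {i: corrected[i] for i in order[:k]}
    new_residual = [0.0 if i in sparse else c
                    for i, c in enumerate(corrected)]
    return sparse, new_residual


grad = [0.9, -0.05, 0.02, -0.7, 0.01, 0.03, -0.04, 0.06]
residual = [0.0] * len(grad)

sparse, residual = top_k_compress(grad, residual, k=2)
print("sent:", sparse)
print("kept locally:", residual)
print("entries sent vs. total:", len(sparse), "of", len(grad))
```

Only two of the eight entries are transmitted. The rest are not discarded: they accumulate in the residual and are sent once they grow large enough. Note that the entry count ignores the cost of sending the indices themselves.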
This is the problem our second NeurIPS 2020 paper addresses. We detail a new compression algorithm called ScaleCom, which allows the user to preserve both accuracy and compression rates, even as training system sizes grow. We rely on the similarity in gradient distributions among different hardware chips in a training system to provide extremely high compression rates (100 to 400 times) and significantly improve scalability, up to 64 learners.
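The sketch below shows one way such similarity could be exploited. It is an illustration under assumptions, not the ScaleCom algorithm itself: a single "leader" worker picks the indices to send and every worker sends its values at those same positions. The names `leader_indices` and `sparse_average` are hypothetical, as are the toy gradients.

```python
def leader_indices(grad, k):
    """Indices of the k largest-magnitude entries on the leader worker."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return sorted(order[:k])


def sparse_average(worker_grads, leader, k):
    """Average the gradients only at the positions chosen by the leader.

    Because all workers use the same k positions, the reduced message
    stays k entries long no matter how many workers take part.
    """
    idx = leader_indices(worker_grads[leader], k)
    n = len(worker_grads)
    return {i: sum(g[i] for g in worker_grads) / n for i in idx}


# Three workers whose gradients are similar but not identical.
workers = [
    [0.9, 0.1, -0.8, 0.0],
    [1.1, 0.0, -0.6, 0.1],
    [1.0, -0.1, -0.7, 0.2],
]

avg = sparse_average(workers, leader=0, k=2)
print(avg)
```

If each worker instead chose its own top entries independently, the set of positions to exchange could grow with the number of workers, eroding the compression rate as the system scales. Sharing one selection only works well when the workers' gradients resemble each other, which is the property described above.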
Together, these two papers lay the foundation for highly efficient AI hardware for training infrastructures and will significantly impact the design and training of future AI models. A myriad of efficient, scalable AI hardware accelerators on hybrid cloud infrastructures could support large AI training jobs in data centers. And the same core AI hardware technology could also be deployed at a smaller scale or embedded in other processors at the edge.
This work is part of Brutalk’s hybrid cloud research in the Brutalk Research AI Hardware Center, launched in February 2019.
Brutalk Research AI is proudly sponsoring NeurIPS 2020 as a Platinum Sponsor, as well as the Women in Machine Learning and Black in AI workshops. We are pleased to report that Brutalk has had its best year so far at NeurIPS: 46 main track papers, eight of which are spotlight papers, with one oral presentation. In addition, Brutalk has 26 workshop papers and six demos, and is organizing three workshops and a competition. We hope you can join us from December 6 to 12 to learn more about our research. Details about our technical program can be found here.