BRN Discussion Ongoing

Diogenese

Top 20
July 16, 2024
Emily Cerf, UC Santa Cruz
By eliminating the most computationally expensive element of a large language model, engineers drastically improve energy efficiency while maintaining performance.
Large language models such as ChatGPT have proven able to produce remarkably intelligent results, but the energy and monetary costs of running these massive algorithms are sky high. Running ChatGPT 3.5 costs an estimated $700,000 per day in energy, according to recent estimates, and leaves behind a massive carbon footprint in the process.
In a new preprint paper, researchers from UC Santa Cruz show that it is possible to eliminate the most computationally expensive element of running large language models, called matrix multiplication, while maintaining performance. In getting rid of matrix multiplication and running their algorithm on custom hardware, the researchers found that they could power a billion-parameter-scale language model on just 13 watts, about equal to the energy of powering a lightbulb and more than 50 times more efficient than typical hardware.
Even with a slimmed-down algorithm and much less energy consumption, the new, open source model achieves the same performance as state-of-the-art models like Meta’s Llama.
“We got the same performance at way less cost — all we had to do was fundamentally change how neural networks work,” said Jason Eshraghian, an assistant professor of electrical and computer engineering at the Baskin School of Engineering and the paper’s lead author. “Then we took it a step further and built custom hardware.”

Understanding the cost​

Until now, all modern neural networks, the algorithms that power large language models, have used a technique called matrix multiplication. In large language models, words are represented as numbers that are then organized into matrices. Matrices are multiplied together to produce language, performing operations that weigh the importance of particular words or highlight relationships between words in a sentence or sentences in a paragraph. Large-scale language models contain trillions of these numbers.
“Neural networks, in a way, are glorified matrix multiplication machines,” Eshraghian said. “The larger your matrix, the more things your neural network can learn.”
For the algorithms to be able to multiply matrices together, the matrices need to be stored somewhere, then fetched when it comes time to compute. This is solved by storing the matrices on hundreds of physically separated graphics processing units (GPUs), specialized circuits made by the likes of hardware giant Nvidia and designed to quickly carry out computations on very large datasets. To multiply numbers from matrices held on different GPUs, data must be moved around, a process that accounts for most of the neural network's costs in terms of time and energy.
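As a rough sketch of the kind of dense matrix multiplication the article describes (this is illustrative, not the paper's code), consider scoring how strongly each word in a sentence relates to every other word. The embeddings below are made-up numbers:

```python
import numpy as np

# Each row is a made-up embedding vector for one token.
words = np.array([
    [0.2, 0.7, 0.1],   # "the"
    [0.9, 0.1, 0.4],   # "cat"
    [0.8, 0.2, 0.5],   # "sat"
], dtype=np.float32)

# Multiplying the matrix by its own transpose yields a score for how
# strongly each word relates to every other word -- a dense matrix
# multiplication of exactly the kind GPUs are optimized for.
scores = words @ words.T
print(scores.shape)  # (3, 3): one relationship score per word pair
```

In a real model the matrices have billions of entries and are sharded across many GPUs, which is why moving them around dominates the cost.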
Eliminating matrix multiplication
The researchers came up with a strategy to avoid using matrix multiplication using two main techniques. The first is a method to force all the numbers within the matrices to be ternary, meaning they can take one of three values: negative one, zero, or positive one. This allows the computation to be reduced to summing numbers rather than multiplying.
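A minimal sketch of that ternary idea (illustrative only, with made-up weights): when every weight is -1, 0, or +1, a matrix-vector product needs no multiplications at all, just selective additions and subtractions.

```python
import numpy as np

W = np.array([[1, 0, -1],
              [0, 1, 1]])          # ternary weight matrix
x = np.array([2.0, 3.0, 5.0])      # input activations

# Standard matrix-vector product, for comparison.
y_matmul = W @ x

# Multiply-free equivalent: add where the weight is +1,
# subtract where it is -1, skip where it is 0.
y_addonly = np.array([
    x[row == 1].sum() - x[row == -1].sum()
    for row in W
])

print(y_matmul)   # [-3.  8.]
print(y_addonly)  # [-3.  8.]
```

Both paths give the same answer, but the second uses only additions, which is the hardware-side saving the article goes on to describe.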
From a computer science perspective, the two algorithms can be coded exactly the same way, but the way Eshraghian's team's method works eliminates a ton of cost on the hardware side.
“From a circuit designer standpoint, you don't need the overhead of multiplication, which carries a whole heap of cost,” Eshraghian said.
This strategy was inspired by a Microsoft paper showing it was possible to use ternary numbers in neural networks, but that work stopped short of eliminating matrix multiplication or open-sourcing its model to the public. To go further, the researchers adjusted how the matrices communicate with each other.
Instead of multiplying every single number in one matrix with every single number in the other matrix, as is typical, the researchers devised a strategy to produce the same mathematical results. In this approach, the matrices are overlaid and only the most important operations are performed.
“It’s quite light compared to matrix multiplication,” said Rui-Jie Zhu, the paper’s first author and a graduate student in Eshraghian’s group. “We replaced the expensive operation with cheaper operations.”
Although they reduced the number of operations, the researchers were able to maintain the performance of the neural network by introducing time-based computation in the training of the model. This enables the network to have a “memory” of the important information it processes, enhancing performance. This technique paid off — the researchers compared their model to Meta’s state-of-the-art algorithm called Llama, and were able to achieve the same performance, even at a scale of billions of model parameters.
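The "time-based computation" described above can be sketched as a token-by-token, element-wise recurrence that carries a running hidden state. This is a loose illustration in the spirit of the gated recurrent unit, not the paper's exact architecture, and the gate value is made up:

```python
import numpy as np

def step(h_prev, x, forget_gate):
    # Element-wise mix of old memory and new input -- note there is
    # no matrix multiplication between time steps.
    return forget_gate * h_prev + (1.0 - forget_gate) * x

# Process a sequence of three (made-up) inputs, one per "time step".
h = np.zeros(4)
for value in (1.0, 2.0, 3.0):
    h = step(h, np.full(4, value), forget_gate=0.5)

print(h)  # the state blends all three inputs, weighted toward recent ones
```

The hidden state `h` is the "memory": each new token updates it cheaply, and earlier tokens persist in decayed form rather than being recomputed.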

Custom chips​

The researchers designed their neural network to operate on GPUs, as they have become ubiquitous in the AI industry, allowing the team’s software to be readily accessible and useful to anyone who might want to use it.
On standard GPUs, the researchers saw that their neural network used about one-tenth the memory and ran about 25 percent faster than other models. Reducing the amount of memory needed to run a powerful large language model could provide a path to running these algorithms at full capacity on devices with less memory, like smartphones.
Nvidia, the dominant producer of GPUs worldwide, designs its hardware to be highly optimized for matrix multiplication, which has helped it dominate the industry and become one of the most profitable companies in the world. However, this hardware is not fully optimized for ternary operations.
To push the energy savings even further, the team collaborated with Assistant Professor Dustin Richmond and Lecturer Ethan Sifferman in the Baskin Engineering Computer Science and Engineering department to create custom hardware. Over three weeks, the team created a prototype of their hardware on a highly-customizable circuit called a field-programmable gate array (FPGA). This hardware enables them to take full advantage of all the energy-saving features they programmed into the neural network.
With this custom hardware, the model surpasses human-readable throughput, meaning it produces words faster than the rate a human reads, on just 13 watts of power. Using GPUs would require about 700 watts of power, meaning that the custom hardware achieved more than 50 times the efficiency of GPUs.
With further development, the researchers believe they can further optimize the technology for even more energy efficiency.
“These numbers are already really solid, but it is very easy to make them much better,” Eshraghian said. “If we’re able to do this within 13 watts, just imagine what we could do with a whole data center worth of compute power. We’ve got all these resources, but let’s use them effectively.”


Close, but
 
  • Haha
  • Sad
Reactions: 2 users

Bravo

If ARM was an arm, BRN would be its biceps💪!

Maybe it deserves half a cigar 🚬 because at the 17:20 mark of the podcast with Sean Hehir, Dr Jason Eshraghian states:

"One thing that will probably come out at the same time as this podcast getting released is a Matrix Multiply Free Language Model so, yeah, I'm very excited to see how these things can intercept with what Brainchip has got going on."

 
  • Like
  • Love
Reactions: 5 users

Diogenese

Top 20
Maybe it deserves half a cigar 🚬 because at the 17:20 mark of the podcast with Sean Hehir, Dr Jason Eshraghian states:

"One thing that will probably come out at the same time as this podcast getting released is a Matrix Multiply Free Language Model so, yeah, I'm very excited to see how these things can intercept with what Brainchip has got going on."

As long as you smoke outside with one of Roger Miller's old Stogies.

The Eshraghian paper uses GPUs, but does some mathematical fiddling with sparsity.

https://arxiv.org/pdf/2406.02528
ABSTRACT
...
We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model’s memory consumption can be reduced by more than 10× compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of.

As you say, he has no doubt had his eyes opened by learning about Akida. In the paper, they talk about "binarized activations", so the equivalent of the first engineering samples before Akida 1 moved to 4 bits.

Section 2 Related Works
...
MatMul-free Transformers: The use of MatMul-free Transformers has been largely concentrated in the domain of SNNs. Spikformer led the first integration of the Transformer architecture with SNNs [18, 19], with later work developing alternative Spike-driven Transformers [20, 21]. These techniques demonstrated success in vision tasks. In the language understanding domain, SpikingBERT [22] and SpikeBERT [23] applied SNNs to BERT utilizing knowledge distillation techniques to perform sentiment analysis. In language generation, SpikeGPT trained a 216M-parameter generative model using a spiking RWKV architecture. However, these models remain constrained in size, with SpikeGPT being the largest, reflecting the challenges of scaling with binarized activations. In addition to SNNs, BNNs have also made significant progress in this area. BinaryViT [24] and BiViT [25] successfully applied Binary Vision Transformers to visual tasks. Beyond these approaches, Kosson et al. [26] achieve multiplication-free training by replacing multiplications, divisions, and non-linearities with piecewise affine approximations while maintaining performance.

I wonder whether they can adapt their system to installed cloud-based GPU LLM processors as their system does provide significant power savings. However, they did build a special purpose FPGA.
 
  • Like
Reactions: 1 users