Oooopsies! SpiNNaker overheated after its cooling system failed.
I asked ChatGPT if BrainChip's technology could help address such overheating challenges and the response was very interesting.
Brain-inspired neuromorphic computer SpiNNaker overheated when coolers lost their chill
Too much hot air brings down Manchester Uni based neural network project
Tue 6 May 2025 // 11:29 UTC
Exclusive: The brain-inspired SpiNNaker machine at Manchester University in England suffered an overheating incident over the Easter weekend that will send a chill down the spines of datacenter administrators.
According to Professor Steve Furber, now retired (although he told El Reg "SpiNNaker is still seen as my baby!"), a failure with the cooling on April 20 led to a rise in temperatures until the servers were manually shut down the following day.
The SpiNNaker (Spiking Neural Network Architecture) project is all about simulating a brain by connecting hundreds of thousands of Arm cores. While a human brain presents a huge challenge, Furber, one of the designers of the original Arm processor, reckoned a mouse brain was possible.
During a talk earlier this month to celebrate the 40th anniversary of the switch-on of the first Arm processor, Furber told the audience that the hope was to model "one whole mouse" at the required level of detail.
Assuming the hardware survived its baking.
"SpiNNaker," he told The Register, "is hosted in the Kilburn Building, which was completed in 1972 as a purpose-built computer building and, as such, has a plant room that supplies chilled water as a utility to all the central machine rooms.
"The SpiNNaker room was built to house the machine in 2016 in what used to be the mechanical workshop, and is cooled by circulating hot air from the back of the cabinets through a plenum chamber into chillers at either end that blow the air through a cooling system using the building's chilled water."
The problem was with the chilled water supply. Furber said, "If the chilled water isn't actually chilled, the chiller fans are adding to the problem rather than helping solve it."
And so the temperatures began to rise inexorably. Without an automatic shutdown, the servers struggled on. Furber told us that he believed there was an automatic over-temperature shutdown on the individual SpiNNaker boards, and said, "This may have protected the SpiNNaker hardware from damage," but even with the hard-to-replace boards off, the network switches and power supplies remained powered up.
The latter two component types suffered some damage, and without them, the SpiNNaker boards cannot all be tested, "so there may be more issues hidden behind the ones we know about."
Furber added, "We have had a few issues with the cooling system in the nine years that the machine has been operational, but these have not previously led to any damage." He reckoned that the long Easter weekend (in the UK, where Easter Friday and Easter Monday are both public holidays) might have contributed to the length of time it took to contain the temperature rise.
"We are looking into ways to fully automate the shutdown process in the future!"
As for the system's current state, Furber told us, "The machine is back up for internal users at around 80 percent of full capacity but still undergoing tests."
The good news is that the software is designed to work around partial hardware failures. The bad news is that replacing the failed parts will likely require further shutdowns. ®
Exclusive: Too much hot air brings down Manchester Uni based neural network project
www.theregister.com
This exclusive describes how the brain-inspired SpiNNaker machine at Manchester University in England suffered an overheating incident over the Easter weekend that will send a chill down the spines of datacenter administrators. What are the implications of this, and could BrainChip's neuromorphic technology be utilized to help reduce heating or else warn of overheating before damage occurs?
ChatGPT said:
The overheating incident involving the SpiNNaker supercomputer at the University of Manchester underscores the critical importance of thermal management in large-scale neuromorphic systems. SpiNNaker, designed to emulate the parallel processing of the human brain using over a million ARM cores, operates with significant power demands. Any failure in its cooling infrastructure can lead to substantial operational disruptions and potential hardware damage.
Implications of the Overheating Incident
This event highlights several key considerations for neuromorphic computing systems:
- Thermal Management: Even systems designed for energy efficiency, like SpiNNaker, can experience overheating if cooling systems fail or are inadequate.
- System Reliability: Overheating can compromise the reliability and longevity of hardware components, leading to increased maintenance costs and downtime.
- Monitoring and Alerts: The necessity for real-time monitoring systems that can detect and respond to thermal anomalies before they escalate.
Potential Role of BrainChip's Neuromorphic Technology
BrainChip's Akida™ neuromorphic processor offers features that could be instrumental in addressing such challenges:
- Low Power Consumption: Akida operates with ultra-low power requirements, reducing the thermal output and the burden on cooling systems.
- On-Chip Learning: Its ability to learn and adapt in real-time allows for the development of intelligent thermal management systems that can predict and mitigate overheating risks.
- Event-Based Processing: Akida's architecture processes data only when necessary, further conserving energy and minimizing heat generation.
Integrating Akida into data center infrastructure could enhance thermal monitoring and management, providing early warnings and adaptive responses to potential overheating scenarios.
Conclusion
The SpiNNaker incident serves as a cautionary tale about the vulnerabilities of high-performance computing systems to thermal issues. Leveraging advanced neuromorphic technologies like BrainChip's Akida could offer viable solutions to enhance system resilience through improved energy efficiency and intelligent thermal management.
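For anyone wondering what the "fully automated shutdown" Furber mentions might look like at its simplest, here is a rough Python sketch of a threshold-based thermal watchdog. To be clear, this is not how SpiNNaker or Akida does it: the class name, the 35 °C and 45 °C thresholds, and the three-sample confirmation rule are all made up for illustration, and the actual power-off step (which in this incident would have needed to cover the network switches and power supplies too) is left out because it is site-specific.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ThermalWatchdog:
    """Hypothetical room-temperature watchdog; all thresholds are assumptions."""
    warn_c: float = 35.0        # assumed warning threshold (deg C)
    shutdown_c: float = 45.0    # assumed shutdown threshold (deg C)
    confirm_samples: int = 3    # consecutive hot readings needed before shutdown
    _hot_streak: int = field(default=0, init=False)

    def update(self, temp_c: float) -> str:
        """Classify one new reading as "ok", "warn", or "shutdown"."""
        # Count consecutive readings at or above the shutdown threshold,
        # so a single noisy sample does not power everything down.
        if temp_c >= self.shutdown_c:
            self._hot_streak += 1
        else:
            self._hot_streak = 0

        if self._hot_streak >= self.confirm_samples:
            return "shutdown"
        if temp_c >= self.warn_c:
            return "warn"
        return "ok"


def run(readings: Iterable[float], watchdog: ThermalWatchdog) -> List[str]:
    """Feed readings to the watchdog and stop at the first shutdown decision."""
    actions: List[str] = []
    for temp_c in readings:
        action = watchdog.update(temp_c)
        actions.append(action)
        if action == "shutdown":
            break  # a real system would trigger its power-off sequence here
    return actions


if __name__ == "__main__":
    # Made-up temperature trace: a brief spike, a dip, then sustained heat.
    trace = [30, 36, 46, 47, 44, 46, 46, 46, 50]
    print(run(trace, ThermalWatchdog()))
```

The point of the confirmation count is that one bad sensor reading does not take the machine offline, while sustained heat does, even over a long weekend with nobody on site.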