Last Tuesday at 12:52, @Fullmoonfever brought up something by Maxim. The following covers the AIoT market and does not mention BrainChip by name, but there are two very interesting paragraphs, which I have emboldened and partitioned to make them easy to locate:
What’s a Neural microcontroller?
MAY 30, 2022 BY JEFF SHEPARD
The ability to run neural networks (NNs) on MCUs is growing in importance to support artificial intelligence (AI) and machine learning (ML) in Internet of Things (IoT) nodes and other embedded edge applications. Unfortunately, running NNs on MCUs is challenging due to the relatively small memory capacities of most MCUs. This FAQ details the memory challenges of running NNs on MCUs and looks at possible system-level solutions. It then presents recently announced MCUs with embedded NN accelerators. It closes by looking at how the Glow machine learning compiler for NNs can help reduce memory requirements.
Running NNs on MCUs (sometimes called tinyML) offers advantages over sending raw data to the cloud for analysis and action. Those advantages include the ability to tolerate poor or even no network connectivity and to safeguard data privacy and security. MCU memory capacities are often limited to a few hundred KB of SRAM for main memory, often less, plus byte-addressable flash of no more than a few MB for read-only data.
To achieve high accuracy, most NNs require larger memory capacities. The memory needed by an NN includes read-only parameters and so-called feature maps that contain intermediate and final results. It can be tempting to process an NN layer on an MCU in the embedded memory before loading the next layer, but this is often impractical. A single NN layer's parameters and feature maps can require up to 100 MB of storage, exceeding the MCU memory size by as much as two orders of magnitude. Recently developed NNs with higher accuracies require even more memory, resulting in a widening gap between the available memory on most MCUs and the memory requirements of NNs (Figure 1).
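To put numbers on that gap, here is a minimal sketch (the parameter counts are well-known round figures, and the 320 KB SRAM budget is an assumption for a generously equipped MCU) comparing weight storage at several bit widths against on-chip memory:

```python
# Rough weight-storage footprint vs. a typical MCU memory budget.
MODELS = {
    "MobileNetV2": 3_400_000,   # ~3.4M parameters
    "ResNet-50": 25_600_000,    # ~25.6M parameters
}
MCU_SRAM_BYTES = 320 * 1024     # assumed 320 KB of on-chip SRAM

for name, params in MODELS.items():
    for bits in (32, 8, 4, 1):
        size = params * bits // 8                  # bytes for weights alone
        ratio = size / MCU_SRAM_BYTES
        print(f"{name:12s} {bits:2d}-bit weights: "
              f"{size // 1024:6d} KB ({ratio:7.1f}x the SRAM budget)")
```

Even at 1 bit per weight, ResNet-50 needs roughly 3 MB for weights alone, an order of magnitude more than the assumed SRAM; at 32 bits it is the ~100 MB figure the article cites.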
Figure 1: The available memory on most MCUs is much too small to support the needs of the majority of NNs. (Image: arXiv)
One solution to address MCU memory limitations is to dynamically swap NN data blocks between the MCU SRAM and a larger external (out-of-core) cache memory. Out-of-core NN implementations can suffer from several limitations, including execution slowdown, storage wear-out, higher energy consumption, and data security risks. If these concerns can be adequately addressed in a specific application, an MCU can be used to run large NNs with full accuracy and generality.
One approach to out-of-core NN implementation is to split one NN layer into a series of tiles small enough to fit into the MCU memory. This approach has been successfully applied to NN systems on servers, where the NN tiles are swapped between the CPU/GPU memory and the server's main memory. Most embedded systems don't have access to the large memory spaces available on servers. Memory-swapping approaches on MCUs, which must rely on a relatively small external SRAM or an SD card, can run into problems such as limited SD card durability and reliability, slower execution due to I/O operations, higher energy consumption, and the safety and security of out-of-core NN data storage.
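As an illustrative sketch of the tiling idea (a toy fully connected layer in NumPy, not any vendor's implementation), the weight matrix can be processed one block of rows at a time, so only a single tile plus the activations ever sits in working memory:

```python
import numpy as np

def tiled_linear(x, load_tile, n_out, tile_rows=64):
    """Compute y = W @ x one tile of W's rows at a time.

    load_tile(start, end) stands in for swapping rows [start:end) of W
    in from out-of-core storage (external SRAM or SD card); only one
    tile is resident in working memory at any moment.
    """
    y = np.empty(n_out, dtype=x.dtype)
    for start in range(0, n_out, tile_rows):
        end = min(start + tile_rows, n_out)
        tile = load_tile(start, end)   # swap in one block of weights
        y[start:end] = tile @ x        # partial results for these rows
    return y

# Demo: a 1024x512 layer processed in 64-row tiles.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512)).astype(np.float32)  # "external" weights
x = rng.standard_normal(512).astype(np.float32)
y = tiled_linear(x, lambda s, e: W[s:e], n_out=1024)
assert np.allclose(y, W @ x, atol=1e-4)                  # matches untiled result
```

The I/O cost the article warns about lives in load_tile: every inference re-reads the whole weight matrix from slow external storage.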
Another approach to overcoming MCU memory limitations is optimizing the NN more completely, using techniques such as model compression, parameter quantization, and designing tiny NNs from scratch. These approaches involve tradeoffs in model accuracy, generality, or both (a small quantization sketch follows the list below). In most cases, the techniques used to fit an NN into the memory space of an MCU result in the NN becoming too inaccurate (< 60% accuracy) or too specialized and not generalized enough (the NN can only detect a few object classes). These challenges can disqualify the use of MCUs where NNs with high accuracy and generality are needed, even if inference delays can be tolerated, such as:
- NN inference on slowly changing signals, such as monitoring crop health by analyzing hourly photos or traffic patterns by analyzing video frames taken every 20-30 minutes
- Profiling NNs on the device by occasionally running a full-blown NN to estimate the accuracy of long-running smaller NNs
- Transfer learning, i.e., retraining NNs on the MCU with data collected from deployment every hour or day
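As a minimal illustration of the accuracy side of that tradeoff (uniform affine quantization of a random stand-in weight tensor, not any particular product's scheme), reconstruction error grows quickly as the bit width shrinks:

```python
import numpy as np

def quantize(w, bits):
    """Uniform affine quantization of a tensor onto 2**bits grid levels."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((w - lo) / scale)     # integer code in [0, 2**bits - 1]
    return q * scale + lo              # dequantized approximation

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)  # stand-in weight tensor

for bits in (8, 4, 2, 1):
    err = np.abs(w - quantize(w, bits)).mean()
    print(f"{bits}-bit weights: mean reconstruction error {err:.3f}")
```

Real schemes (per-channel scales, quantization-aware training) recover much of this loss, which is why the accuracy/generality hit varies so widely in practice.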
NN accelerators embedded in MCUs
Many of the challenges of implementing NNs on MCUs are being addressed by MCUs with embedded NN accelerators. These advanced MCUs are an emerging device category that promises to provide designers with new opportunities to develop IoT node and edge ML solutions. For example, an MCU with a hardware-based embedded convolutional neural network (CNN) accelerator enables battery-powered applications to execute AI inferences while spending only microjoules of energy (Figure 2).
Figure 2: Neural MCU block diagram showing the basic MCU blocks (upper left) and the CNN accelerator section (right). (Image: Maxim)
*******************************************************************************************************************************************************
The MCU with an embedded CNN accelerator is a system on chip combining an Arm Cortex-M4 with a RISC-V core that can execute application and control code as well as drive the CNN accelerator. The CNN engine has a weight storage memory of 442KB and can support 1-, 2-, 4-, and 8-bit weights (supporting networks of up to 3.5 million weights). On-the-fly AI network updates are supported by the SRAM-based CNN weight memory structure. The architecture is flexible and allows CNNs to be trained using conventional toolsets such as PyTorch and TensorFlow.
*********************************************************************************************************************************************************
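The 442KB and 3.5-million-weight figures in that paragraph are two views of the same memory: 442KB of weight storage holds roughly 3.5M weights exactly when each weight occupies a single bit. A quick check:

```python
# Weight capacity of 442 KB of storage at each supported bit width.
WEIGHT_MEMORY_BITS = 442 * 1024 * 8   # 442 KB of SRAM weight storage

for bits in (8, 4, 2, 1):
    print(f"{bits}-bit weights: up to {WEIGHT_MEMORY_BITS // bits:,} weights")
# 1-bit -> 3,620,864 weights, i.e. the quoted "up to 3.5 million weights".
```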
Another MCU supplier has pre-announced a neural processing unit integrated with an Arm Cortex core. The new neural MCU is scheduled to ship later this year and will provide the same level of AI performance as a quad-core processor with an AI accelerator, but at one-tenth the cost and one-twelfth the power consumption.
*********************************************************************************************************************************************************
Additional neural MCUs are expected to emerge in the near future.
Glow for smaller NN memories
Glow (graph lowering) is a machine learning compiler for neural network graphs. It's available on GitHub and is designed to optimize neural network graphs and generate code for various hardware devices. Two versions of Glow are available, one for Ahead-of-Time (AOT) and one for Just-in-Time (JIT) compilation. As the names suggest, AOT compilation is performed offline (ahead of time) and generates an object file (bundle) which is later linked with the application code, while JIT compilation is performed at runtime, just before the model is executed.
MCUs are available that support AOT compilation using Glow. The compiler converts the neural networks into object files, which the user converts into a binary image for increased performance and a smaller memory footprint than a JIT (runtime) inference engine. In this case, Glow is used as a software back-end for the PyTorch machine learning framework and the ONNX model format (Figure 3).
Figure 3: Example of an AOT compilation flow diagram using Glow. (Image: NXP)
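A minimal sketch of that AOT flow under two assumptions: that the model starts life in PyTorch, and that Glow's model-compiler driver has been built and is on the PATH (flag names can vary between Glow versions). The file names here are placeholders:

```python
import subprocess
import torch
import torch.nn as nn

# A toy model standing in for a trained network.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(8 * 30 * 30, 10))
model.eval()

# Step 1: export to ONNX, the model format Glow consumes in this flow.
dummy = torch.randn(1, 3, 32, 32)
torch.onnx.export(model, dummy, "model.onnx")

# Step 2: AOT-compile into an object-file bundle that is later linked
# with the application code, as described above.
subprocess.run(["model-compiler", "-backend=CPU",
                "-model=model.onnx", "-emit-bundle=bundle"], check=True)
```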
The Glow NN compiler lowers an NN into a two-phase, strongly-typed intermediate representation. Domain-specific optimizations are performed in the first phase, while the second phase performs optimizations focused on specialized back-end hardware features. Glow-enabled MCUs are available that combine Arm Cortex-M cores with Cadence Tensilica HiFi 4 DSPs, accelerating performance by utilizing the Arm CMSIS-NN and HiFi NN libraries, respectively. Features include:
- Lower latency and smaller solution size for edge inference NNs
- Accelerated NN applications via the CMSIS-NN and Cadence HiFi NN libraries
- Faster time to market using the available software development kit
- Flexible implementation, since Glow is open source under the Apache License 2.0

Summary
Running NNs on MCUs is important for IoT nodes and other embedded edge applications, but it can be challenging due to MCU memory limitations. Several approaches have been developed to address memory limitations, including out-of-core designs that swap blocks of NN data between the MCU memory and an external memory, and various NN software 'optimization' techniques. Unfortunately, these approaches involve tradeoffs between model accuracy and generality, which can result in the NN becoming too inaccurate and/or too specialized to be of use in practical applications. The emergence of MCUs with integrated NN accelerators is beginning to address those concerns and enables the development of practical NN implementations for IoT and edge applications. Finally, the availability of the Glow NN compiler gives designers an additional tool for optimizing NNs for smaller applications.
Hi D,
Appreciate your thoughts on the following recent release. A couple of things caught my eye, particularly the weights, but I'm not tech enough to be comfortable saying whether this is similar to Akida?
Edit: Not saying Akida is involved, just curious about any similarities.
https://au.mouser.com/ProductDetail/Maxim-Integrated/MAX78000EXG+?qs=yqaQSyyJnNigS5t/Kz0nhQ==
MAX78000 Artificial Intelligence Microcontroller with Ultra-Low-Power Convolutional Neural Network Accelerator
Artificial intelligence (AI) requires extreme computational horsepower, but Maxim is cutting the power cord from AI insights. The MAX78000 is a new breed of AI microcontroller built to enable neural networks to execute at ultra-low power and live at the edge.
www.maximintegrated.com
In my reply, I noted that Maxim referred to 1-, 2-, 4-, and 8-bit weights (they don't mention how many bits the activations have).
#19,425
While Maxim don't refer to spikes, they do refer to 1-bit weights. They also refer to 8-bit weights in the same breath, so they have gone for additional accuracy as an option, cf. Akida's optional 4-bit weights/activations.
Maxim have several NN patents, mainly directed to CNNs using the now-fashionable in-memory compute, e.g.:
US2020110604A1 ENERGY-EFFICIENT MEMORY SYSTEMS AND METHODS
Priority: 2018-10-03.
...
Now I'm not saying it's impossible for the Akida IP to stretch to 8-bits, but we have not been told that it does. Similar to the 4-bit Akida, an 8-bit Akida would have even greater accuracy than the initial 1-bit Akida at the expense of speed and power.
Maxim also dabbled in analog and Frankenstein (analog/digital) NNs.
An analog NN on its own would struggle with accuracy to compete with a multibit digital NN.
Interestingly, Maxim is now part of Analog Devices:
https://www.maximintegrated.com/en.html
Maxim have had an AI chip since 2020. It uses an Arm Cortex-M4.
https://www.maximintegrated.com/en/products/microcontrollers/artificial-intelligence.html
Artificial intelligence (AI) is opening up a whole new universe of possibilities, from virtual assistants to self-driving cars, automated factory equipment, and voice recognition in consumer devices. But the computational horsepower to enable these possibilities is extreme, requiring expensive, power-hungry, large processors. In embedded devices, this functionality isn't really available: embedded microcontrollers are too slow to effectively process images and make decisions in real time.
Enter Maxim's new line of Artificial Intelligence microcontrollers. They run AI inferences hundreds of times faster, and at lower energy, than other embedded solutions. Our built-in neural network hardware accelerator practically eliminates the energy spent on audio and image AI inferences. Now small machines like thermostats, smart watches, and cameras can deliver the promise of AI: embedded devices can see and hear like never before.
Get started today with the MAX78000FTHR for only $25.
https://datasheets.maximintegrated.com/en/ds/MAX78000.pdf
Artificial intelligence (AI) requires extreme computational horsepower, but Maxim is cutting the power cord from AI insights. The MAX78000 is a new breed of AI microcontroller built to enable neural networks to execute at ultra-low power and live at the edge of the IoT. This product combines the most energy-efficient AI processing with Maxim's proven ultra-low power microcontrollers. Our hardware-based convolutional neural network (CNN) accelerator enables battery-powered applications to execute AI inferences while spending only microjoules of energy. The MAX78000 is an advanced system-on-chip featuring an Arm® Cortex®-M4 with FPU CPU for efficient system control with an ultra-low-power deep neural network accelerator. The CNN engine has a weight storage memory of 442KB, and can support 1-, 2-, 4-, and 8-bit weights (supporting networks of up to 3.5 million weights). The CNN weight memory is SRAM-based, so AI network updates can be made on the fly. The CNN engine also has 512KB of data memory. The CNN architecture is highly flexible, allowing networks to be trained in conventional toolsets like PyTorch® and TensorFlow®, then converted for execution on the MAX78000 using tools provided by Maxim. In addition to the memory in the CNN engine, the MAX78000 has large on-chip system memory for the microcontroller core, with 512KB flash and up to 128KB SRAM. Multiple high-speed and low-power communications interfaces are supported, including I2S and a parallel camera interface (PCIF).
Neural Network Accelerator
• Highly Optimized for Deep Convolutional Neural Networks • 442k 8-Bit Weight Capacity with 1,2,4,8-Bit Weights
• Programmable Input Image Size up to 1024 x 1024 pixels
• Programmable Network Depth up to 64 Layers
• Programmable per Layer Network Channel Widths up to 1024 Channels
• 1 and 2 Dimensional Convolution Processing
• Streaming Mode
• Flexibility to Support Other Network Types, Including MLP and Recurrent Neural Networks
This is a one-to-one fit for the first highlighted paragraph.
The MCU with an embedded CNN accelerator is a system on chip combining an Arm Cortex-M4 with a RISC-V core that can execute application and control code as well as drive the CNN accelerator. The CNN engine has a weight storage memory of 442KB and can support 1-, 2-, 4-, and 8-bit weights (supporting networks of up to 3.5 million weights). On-the-fly AI network updates are supported by the SRAM-based CNN weight memory structure. The architecture is flexible and allows CNNs to be trained using conventional toolsets such as PyTorch and TensorFlow.
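To make those limits concrete, here is a small sketch (an arbitrary toy network, not a Maxim reference design) that counts a PyTorch model's parameters and checks them against the MAX78000 budget quoted above; biases are lumped in with weights, which is close enough for a ballpark check:

```python
import torch.nn as nn

# Published MAX78000 limits from the datasheet excerpt above.
WEIGHT_MEM_BITS = 442 * 1024 * 8   # 442 KB of weight storage
MAX_LAYERS = 64                    # programmable network depth limit

# An arbitrary toy CNN standing in for a user-trained network.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 32 * 32, 10),   # assumes 32x32 input images
)

n_params = sum(p.numel() for p in model.parameters())
n_layers = sum(isinstance(m, (nn.Conv2d, nn.Linear)) for m in model)
print(f"{n_layers} weight layers (limit {MAX_LAYERS}), {n_params:,} parameters")

for bits in (8, 4, 2, 1):
    fits = n_params * bits <= WEIGHT_MEM_BITS
    print(f"  at {bits}-bit: {'fits' if fits else 'does not fit'} in 442 KB")
```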