TinyML computer vision is turning into reality with microNPUs (µNPUs)
June 14, 2023
Elad Baram
Ubiquitous ML-based vision processing at the edge is advancing as hardware costs decrease, computation capability increases significantly, and new methodologies make it easier to train and deploy models. This leads to fewer barriers to adoption and increased use of computer vision AI at the edge.
Computer vision (CV) technology today is at an inflection point, with major trends converging to let what has been a cloud technology become ubiquitous in tiny edge AI devices. Technology advancements are extending this cloud-centric AI to the edge, and new developments will make AI vision at the edge pervasive.
There are three major technological trends enabling this evolution. New, lean neural network algorithms fit the memory space and compute power of tiny devices. New silicon architectures are offering orders of magnitude more efficiency for neural network processing than conventional microcontrollers (MCUs). And AI frameworks for smaller microprocessors are maturing, reducing barriers to developing tiny machine learning (ML) implementations at the edge (tinyML).
As all these elements come together, tiny processors at milliwatt scale can have powerful neural processing units that execute extremely efficient convolutional neural networks (CNNs)—the ML architecture most common for vision processing—leveraging a mature and easy-to-use development tool chain. This will enable exciting new use cases across just about every aspect of our lives.
The promise of CV at the edge
Digital image processing, as it used to be called, is used for applications ranging from semiconductor manufacturing and inspection to advanced driver assistance systems (ADAS) features such as lane-departure warning and blind-spot detection, to image beautification and manipulation on mobile devices. And looking ahead, CV technology at the edge is enabling the next level of human-machine interfaces (HMIs).
HMIs have evolved significantly in the last decade. On top of traditional interfaces like the keyboard and mouse, we now have touch displays, fingerprint readers, facial recognition systems, and voice command capabilities. While these methods clearly improve the user experience, they all share one attribute: they react to user actions. The next level of HMI will be devices that understand users and their environment via contextual awareness.
Context-aware devices sense not only their users, but also the environment in which they are operating, all in order to make better decisions toward more useful automated interactions. For example, a laptop visually senses when a user is attentive and can adapt its behavior and power policy accordingly. This is already being enabled by Synaptics’ Emza Visual Sense technology, which OEMs can use to optimize power by adaptively dimming the display when a user is not watching it, reducing display energy consumption (figure 1). By detecting when someone else’s gaze is on the screen (onlooker detection), the technology can also enhance security by alerting the user and hiding the screen content until the coast is clear.
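To make the idea concrete, here is a minimal sketch of such a power policy, assuming a hypothetical presence detector running on the vision sensor; the function names and thresholds are illustrative, not Synaptics’ actual implementation:

import time

DIM_AFTER_S = 10  # assumed grace period before dimming

def run_power_policy(user_attentive, set_brightness):
    """Dim the display when no attentive user has been seen for a while.

    user_attentive: callable returning True when the sensor sees a user
        looking at the screen (hypothetical uNPU inference result).
    set_brightness: callable taking a brightness level in [0.0, 1.0].
    """
    last_seen = time.monotonic()
    while True:
        if user_attentive():
            last_seen = time.monotonic()
            set_brightness(1.0)   # user is watching: full brightness
        elif time.monotonic() - last_seen > DIM_AFTER_S:
            set_brightness(0.2)   # nobody watching: dim to save power
        time.sleep(0.5)           # low-rate polling fits a milliwatt budget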
There are also endless use cases for visual sensing in industrial settings, ranging from object detection for safety compliance (e.g., restricted zones, safe passages, protective-gear enforcement) to anomaly detection for manufacturing process control. In agritech, crop inspection and status and quality monitoring enabled by CV technologies are critical.
Whether it’s in laptops, consumer electronics, smart building sensors or industrial environments, this ambient computing capability is enabled when tiny and affordable microprocessors, tiny neural networks, and optimized AI frameworks make devices more intelligent and power efficient.
Neural-network vision processing evolves
2012 marked the turning point when CV started to shift from heuristic methods to deep convolutional neural networks (DCNNs), with the publication of AlexNet by Alex Krizhevsky and his colleagues. There was no turning back after the network won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) that year.
Since then, teams across the globe have continued to seek higher detection performance, but without much concern about the efficiency of the underlying hardware. So CNNs continued to be data- and compute-hungry. This focus on performance was fine for applications running in the cloud infrastructure.
In 2015, ResNet152 was introduced. It had 60 million parameters, required more than 11 gigaflops for a single inference, and demonstrated 94% top-5 accuracy on the ImageNet data set. This continued to push the performance and accuracy of CNNs. But it wasn’t until 2017, with the publication of MobileNets by a group of researchers from Google, that we saw a push toward efficiency.
MobileNets—aimed at smartphones—was significantly lighter than existing neural network (NN) architectures at the time. MobileNetV2, as an example, had 3.5 million parameters and required 336 megaflops. This drastic reduction was achieved initially through hard labor: manually identifying layers in the deep-learning network that did not add much to accuracy. Later, automated architecture-search tools allowed further improvement in the number and organization of layers. Roughly 20x “lighter” than ResNet152, both in memory and computational load, MobileNetV2 demonstrated top-5 accuracy of 90%. A new set of mobile-friendly applications could now use AI.
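As a quick sanity check on that claim, the ratios implied by the figures quoted above can be computed directly; this is just arithmetic on the numbers already given:

resnet_params, resnet_flops = 60e6, 11e9          # ResNet152, as quoted above
mobilenet_params, mobilenet_flops = 3.5e6, 336e6  # MobileNetV2, as quoted above

print(f"parameter ratio: {resnet_params / mobilenet_params:.0f}x")  # ~17x
print(f"compute ratio: {resnet_flops / mobilenet_flops:.0f}x")      # ~33x

The parameter ratio is about 17x and the compute ratio about 33x, so “roughly 20x lighter” is a fair summary across both axes.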
And hardware evolves
With smaller NNs and with a clear understanding of the workloads involved, developers could now design optimized silicon for tiny AI. This led to the micro neural processing unit (micro NPU). By tightly managing memory organization and data flow, while exploiting massive parallelism, these small, dedicated cores can execute NN inference 10x or 100x faster than the unaided CPU in a typical MCU. An example is the Arm Ethos U55 micro NPU.
Let’s look at a specific example of the impact of microNPUs (µNPUs). One of the fundamental tasks in CV is object detection. Object detection in essence requires two tasks: localization, which determines where an object is located within the image, and classification, which identifies the detected object (figure 2).
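As an illustration of the two sub-tasks, here is a minimal sketch of decoding the raw outputs of an SSD-style detection head; the array shapes and threshold are assumptions, not a particular model’s interface:

import numpy as np

def decode_detections(boxes, scores, score_thresh=0.5):
    """boxes: (N, 4) candidate boxes as [x0, y0, x1, y1];
    scores: (N, num_classes) per-box class scores."""
    detections = []
    for box, class_scores in zip(boxes, scores):
        cls = int(np.argmax(class_scores))       # classification: what is it?
        conf = float(class_scores[cls])
        if conf >= score_thresh:                 # keep confident detections only
            detections.append((box, cls, conf))  # localization: where is it?
    return detections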
Emza implemented a face detection model on the Ethos U55 µNPU, training an object detection and classification model that is a lightweight version of the single-shot detector (SSD), optimized by Synaptics to detect just the face class. The results astonished us: model execution times of less than 5 milliseconds, comparable to the execution speed on a powerful smartphone application processor like the Snapdragon 845. When executing this same model on a Raspberry Pi 3B using four Cortex-A53 cores, the execution time is six times longer.
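For readers who want to reproduce the Raspberry Pi side of that comparison, a timing harness along these lines is one plausible approach; the model file name is a placeholder, and the exact numbers will depend on the model and build:

import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

# Load a quantized detection model (placeholder file name) on all four cores.
interpreter = Interpreter(model_path="face_detect_int8.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Use a zero frame as a stand-in for camera input, then time repeated runs.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()  # warm-up run

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    interpreter.invoke()
print(f"mean inference time: {(time.perf_counter() - t0) / runs * 1e3:.1f} ms")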
AI frameworks & democratization
Widespread adoption of any technology as complex as ML requires good development tools. TensorFlow Lite for Microcontrollers (TFLM) by Google is a framework designed for easier training and deployment of AI for tinyML. For a subset of the operators covered by the full TensorFlow, TFLM packages a trained model, together with a small interpreter, as C code that can run on an MCU or µNPU. The PyTorch Mobile framework and Glow compiler from Meta are also targeting this area. In addition, there are today quite a few AI automation platforms (known as AutoML) that can automate some aspects of AI deployment for tiny targets; examples are Edge Impulse, Deeplite, Qeexo, and SensiML.
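To give a sense of what this tooling looks like in practice, here is a minimal sketch of the standard TFLite full-integer quantization step that typically precedes µNPU deployment; trained_model and rep_images are assumed to exist, and this is not any particular vendor’s flow:

import tensorflow as tf

def representative_data():
    # A few hundred samples are usually enough to calibrate quantization.
    for image in rep_images[:100]:
        yield [image[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(trained_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # µNPUs generally want integer I/O
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())

The resulting flatbuffer is then typically embedded as a C array for the on-device interpreter to execute.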
But to enable execution on specific hardware and µNPUs, compilers and tool chains must be adapted. Arm has developed the Vela compiler, which optimizes CNN model execution for the U55 µNPU. Vela removes the complexities of a system that contains both a CPU and a µNPU by automatically splitting the model execution task between them.
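As a sketch of what that looks like from a build script, the quantized model from the earlier example could be passed through Vela like this; the accelerator flag below assumes a 128-MAC Ethos-U55 configuration:

import subprocess

# Vela rewrites the model so supported operators run on the µNPU,
# leaving the rest to fall back to the CPU.
subprocess.run(
    ["vela", "model_int8.tflite", "--accelerator-config", "ethos-u55-128"],
    check=True,
)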
More broadly, Apache TVM is an open-source, end-to-end ML compiler framework for CPUs, GPUs, NPUs, and accelerators, and microTVM extends it to microcontrollers with the vision of running any AI model on any hardware. This evolution of AI frameworks, AutoML platforms, and compilers makes it easier for developers to leverage the new µNPUs for their specific needs.
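As one more illustration, TVM’s Python API can import the same TFLite flatbuffer and compile it; API details vary across TVM releases, and the input name and shape below are assumptions:

import tvm
from tvm import relay
import tflite  # flatbuffer bindings for .tflite files

with open("model_int8.tflite", "rb") as f:
    tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)

# Import into Relay, TVM's high-level IR, then build for a CPU target;
# micro targets substitute a C backend and the microTVM runtime.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 96, 96, 1)},  # assumed input name and shape
    dtype_dict={"input": "int8"},
)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)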
Ubiquitous AI at the edge
The trend toward ubiquitous ML-based vision processing at the edge is clear. Hardware costs are decreasing, computation capability is increasing significantly, and new methodologies make it easier to train and deploy models. All of this is leading to fewer barriers to adoption, and to increased use of CV AI at the edge.
But even as tiny edge AI becomes increasingly ubiquitous, there is still work to do. To make ambient computing a reality, we need to serve the long tail of use cases across many segments, which creates a scalability challenge. In consumer products, factories, agriculture, retail, and other segments, each new task requires different algorithms and unique data sets for training. The R&D investment and skill set needed to solve each use case remain a major barrier today.
This gap can best be filled by AI companies up-leveling the software around their NPU offerings by developing rich sets of model examples (“model zoos”) and application reference code. In doing so, they can enable a wider range of applications for the long tail while ensuring design success, with the right algorithms optimized for the target hardware to solve specific business needs within the defined cost, size, and power constraints.