JDelekto
As I understand it, the NPU (easily confused with "neuromorphic" because of the 'N') is another dedicated processor that parallelizes computational operations alongside the CPU and GPU at lower power requirements. Here is some more on Qualcomm's Hexagon AI approach, which is fairly promiscuous, sharing the AI workload selectively between the GPU, CPU, and NPU:
Heterogeneous computing: Leveraging all the processors for generative AI
Generative AI models suitable for on-device execution are becoming more complex and trending toward larger sizes, from one billion to 10 billion to 70 billion parameters. They are increasingly multi-modal, meaning that they can take in multiple inputs — such as text, speech, or images — and produce several outputs.
Further, many use cases concurrently run multiple models. For example, a personal assistant application uses voice for input and output. This requires running an automatic speech recognition (ASR) model for voice to text, an LLM for text to text, and a text-to-speech (TTS) model for voice output. The complexity, concurrency, and diversity of generative AI workloads require harnessing the capabilities of all the processors in an SoC. An optimal solution entails:
1. Scaling generative AI processing across cores of a processor and across processors
2. Mapping generative AI models and use cases to one or more cores and processors
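To make the concurrency point above concrete, here is a minimal sketch (not Qualcomm's actual code) of a voice-assistant pipeline where ASR, LLM, and TTS stages run as concurrent threads feeding each other through queues; in a real heterogeneous deployment each stage function would dispatch to a different processor (e.g., ASR on the sensor processor, LLM on the NPU, TTS on the GPU). The stage bodies are placeholder stubs:

```python
import queue
import threading

# Placeholder stubs standing in for the real models on their target processors.
def asr(audio_chunk):
    return f"text({audio_chunk})"      # speech -> text

def llm(prompt):
    return f"reply({prompt})"          # text -> text

def tts(text):
    return f"speech({text})"           # text -> speech

def stage(fn, inbox, outbox):
    # Pull items until a None sentinel arrives, then pass the sentinel on.
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(fn(item))

def run_pipeline(audio_chunks):
    q1, q2, q3, q4 = (queue.Queue() for _ in range(4))
    threads = [threading.Thread(target=stage, args=a)
               for a in ((asr, q1, q2), (llm, q2, q3), (tts, q3, q4))]
    for t in threads:
        t.start()
    for chunk in audio_chunks:
        q1.put(chunk)
    q1.put(None)                       # signal end of input
    results = []
    while (out := q4.get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline(["hello"]))  # -> ['speech(reply(text(hello)))']
```

The point of the structure is that while the LLM works on one utterance, the ASR stage can already be transcribing the next, which is exactly the kind of concurrency that benefits from spreading models across processors.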
Choosing the right processor depends on many factors, including use case, device type, device tier, development time, key performance indicators (KPIs), and developer expertise. Many tradeoffs drive decisions, and the target KPI could be power, performance, latency, or accessibility for different use cases. For example, an original equipment manufacturer (OEM) making an app for multiple devices across categories and tiers will need to choose the best processor to run an AI model based on SoC specs, end-product capabilities, ease of development, cost, and graceful degradation of the app across device tiers.
As previously mentioned, most generative AI use cases can be categorized into on-demand, sustained, or pervasive. For on-demand applications, latency is the KPI since users do not want to wait. When these applications use small models, the CPU is usually the right choice. When models get bigger (e.g., billions of parameters), the GPU and NPU tend to be more appropriate. For sustained and pervasive use cases, in which battery life is vital and power efficiency is the critical factor, the NPU is the best option.
Another key distinction is identifying whether the AI model is memory bound — performance is limited by memory bandwidth — or compute bound — performance is limited by the speed of the processor. Today's LLMs are memory bound during text generation, so focusing on memory efficiency on the CPU, GPU, or NPU is appropriate. For LVMs (large vision models), which can be compute or memory bound, the GPU or NPU could be used, but the NPU provides the best performance per watt.
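The "memory bound" claim is easy to see with back-of-the-envelope arithmetic: at batch size 1, generating each token requires reading essentially every weight once, so the memory bus sets a hard ceiling on tokens per second regardless of how fast the math units are. A small illustrative calculation (hypothetical figures, not measured hardware specs):

```python
# Memory-bandwidth ceiling for autoregressive text generation, batch size 1:
# every token forces one full pass over the weights, so
#   max tokens/s ~= memory bandwidth / bytes of weights.

def tokens_per_second(params_billion, bytes_per_param, mem_bw_gb_s):
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return mem_bw_gb_s * 1e9 / weight_bytes

# Hypothetical example: a 7B-parameter model quantized to 4-bit weights
# (0.5 bytes/param) on a 50 GB/s mobile memory bus.
print(round(tokens_per_second(7, 0.5, 50), 1))  # -> 14.3 tokens/s ceiling
```

Since this bound is independent of FLOPS, a faster NPU cannot raise it; only more bandwidth, smaller weights (quantization), or better cache reuse can, which is why the whitepaper emphasizes memory efficiency over raw compute for LLM decoding.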
A personal assistant that offers a natural voice user interface (UI) to improve productivity and enhance user experiences is expected to be a popular generative AI application. The speech recognition, LLM, and speech models must all run with some concurrency, so it is desirable to split the models between the NPU, GPU, CPU, and the sensor processor. For PCs, agents are expected to run pervasively (always-on), so as much of it as possible should run on the NPU for performance and power efficiency.
As we know, Akida can multi-task, running different models on a single SoC.
Qualcomm's Hexagon NPU is designed to offload those computations from the other two processors and is optimized for vector, matrix, and tensor operations (essentially a lot of matrix math). I ran across an interesting thread on Hacker News where someone benchmarked the NPU and found it was not quite as fast as the CPU itself; again, the NPU aims to achieve its performance through parallelization at lower power. They have the code for their benchmarks here on GitHub.
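For anyone curious what such a benchmark measures, here is a minimal CPU-side sketch (not the code from that GitHub repo) that times repeated matrix multiplications and reports achieved throughput; an NPU or GPU run of the same workload would be compared against this baseline, ideally alongside power draw:

```python
import time
import numpy as np

def bench_matmul(n=512, iters=20):
    # Time repeated n x n float32 matrix multiplies on the CPU.
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b                                  # warm-up pass
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters               # 2 FLOPs per multiply-add
    return flops / elapsed / 1e9           # GFLOP/s

print(f"{bench_matmul():.1f} GFLOP/s on CPU")
```

Raw GFLOP/s alone can make an NPU look worse than the CPU, which matches the Hacker News finding; the NPU's case rests on GFLOP/s per watt, so a fair comparison needs energy measurements too.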
I believe Akida could still be a strong competitor to Qualcomm's AI offerings, or even potentially a replacement for Hexagon for better real-time processing on power-constrained devices.