When you find yourself in a hole ... keep digging: Some more on Qualcomm's Hexagon AI, which is pretty promiscuous, sharing the AI workload selectively between GPU, CPU, and NPU:
6 Heterogeneous computing: Leveraging all the processors for generative AI
Generative AI models suitable for on-device execution are becoming more complex and trending toward larger sizes, from one billion to 10 billion to 70 billion parameters. They are increasingly multi-modal, meaning that they can take in multiple inputs — such as text, speech, or images — and produce several outputs.
Further, many use cases concurrently run multiple models. For example, a personal assistant application uses voice for input and output. This requires running an automatic speech recognition (ASR) model for voice to text, an LLM for text to text, and a text-to-speech (TTS) model for voice output. The complexity, concurrency, and diversity of generative AI workloads require harnessing the capabilities of all the processors in an SoC. An optimal solution entails:
1. Scaling generative AI processing across cores of a processor and across processors
2. Mapping generative AI models and use cases to one or more cores and processors
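To make the ASR to LLM to TTS split described above concrete, here is a minimal Python sketch of one possible placement of models onto processors. The processor names, the placement table, and the run_model() helper are hypothetical placeholders for illustration, not any vendor's runtime API; a real stack would hand each model to a per-processor backend and would overlap turns or stream audio to get true concurrency.

```python
# Illustrative only: a toy placement of an assistant pipeline across processors.
# Processor names and run_model() are hypothetical, not a real vendor API.
PLACEMENT = {
    "asr": "sensor_processor_or_npu",  # always-listening speech-to-text
    "llm": "npu",                      # large, memory-bound text generation
    "tts": "cpu",                      # small, latency-sensitive voice output
}

def run_model(name: str, payload: str) -> str:
    processor = PLACEMENT[name]
    # A real implementation would dispatch to a per-processor backend here;
    # this stub just records where the work would be scheduled.
    return f"[{name} on {processor}] {payload}"

def assistant_turn(audio: str) -> str:
    text = run_model("asr", audio)      # voice -> text
    reply = run_model("llm", text)      # text -> text
    return run_model("tts", reply)      # text -> voice

print(assistant_turn("hello.wav"))
```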
Choosing the right processor depends on many factors, including use case, device type, device tier, development time, key performance indicators (KPIs), and developer expertise. Many tradeoffs drive decisions, and the target KPI could be power, performance, latency, or accessibility for different use cases. For example, an original equipment manufacturer (OEM) making an app for multiple devices across categories and tiers will need to choose the best processor to run an AI model based on SoC specs, end-product capabilities, ease of development, cost, and graceful degradation of the app across device tiers.
As previously mentioned, most generative AI use cases can be categorized into on-demand, sustained, or pervasive. For on-demand applications, latency is the KPI since users do not want to wait. When these applications use small models, the CPU is usually the right choice. When models get bigger (e.g., billions of parameters), the GPU and NPU tend to be more appropriate. For sustained and pervasive use cases, in which battery life is vital and power efficiency is the critical factor, the NPU is the best option.
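The selection logic in the last two paragraphs can be summarised as a simple rule of thumb. The sketch below encodes it in Python; the one-billion-parameter threshold and the category names are illustrative assumptions, since real decisions also weigh device tier, SoC specs, cost, and developer tooling.

```python
# Rough encoding of the processor-selection heuristic described above.
# Thresholds are illustrative, not taken from the whitepaper.
def pick_processor(use_case: str, params_billions: float) -> str:
    if use_case == "on-demand":
        # Latency is the KPI: small models suit the CPU, larger ones the GPU/NPU.
        return "cpu" if params_billions < 1 else "gpu_or_npu"
    if use_case in ("sustained", "pervasive"):
        # Battery life is vital, so power efficiency wins: prefer the NPU.
        return "npu"
    raise ValueError(f"unknown use case: {use_case}")

print(pick_processor("on-demand", 0.3))   # -> cpu
print(pick_processor("pervasive", 7.0))   # -> npu
```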
Another key distinction is identifying whether the AI model is memory bound — performance is limited by memory bandwidth — or compute bound — performance is limited by the speed of the processor. Today’s LLMs are memory bound during text generation, so focusing on memory efficiency on the CPU, GPU, or NPU is appropriate. For LVMs, which could be compute or memory bound, the GPU or NPU could be used, but the NPU provides the best performance per watt.
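A quick back-of-envelope calculation shows why LLM text generation is memory bound: each generated token must read every weight once, but only performs about two floating-point operations per weight, so arithmetic intensity is tiny compared with the compute-to-bandwidth ratio of any modern accelerator. The hardware numbers below are illustrative assumptions, not the specs of any particular SoC.

```python
# Why LLM decode is memory bound: FLOPs per byte moved is very low.
params = 7e9                    # 7B-parameter model (assumption)
bytes_per_param = 2             # FP16/INT16 weights (assumption)

flops_per_token = 2 * params                 # ~2 FLOPs per weight per token
bytes_per_token = params * bytes_per_param   # every weight read once per token
intensity = flops_per_token / bytes_per_token
print(f"arithmetic intensity ~ {intensity:.1f} FLOP/byte")   # ~1.0

# If the accelerator sustains, say, 20 TFLOP/s and 60 GB/s of DRAM bandwidth,
# its balance point is 20e12 / 60e9 ~ 333 FLOP/byte -- far above 1.0 --
# so token generation is limited by memory bandwidth, not compute.
```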
A personal assistant that offers a natural voice user interface (UI) to improve productivity and enhance user experiences is expected to be a popular generative AI application. The speech recognition, LLM, and speech models must all run with some concurrency, so it is desirable to split the models between the NPU, GPU, CPU, and the sensor processor. For PCs, agents are expected to run pervasively (always-on), so as much of the workload as possible should run on the NPU for performance and power efficiency.
As we know, Akida can multi-task, running different models on the one SoC.
Running AI on CPU or GPU necessarily entails the use of software.
So Qualcomm's Hexagon NPU evolved from a DSP:
Building our NPU from a DSP architecture was the right choice for improved programmability and the ability to tightly control scalar, vector, and tensor operations that are inherent to AI processing. Our design approach of optimized scalar, vector, and tensor acceleration combined with large local shared memory, dedicated power delivery systems, and other hardware acceleration differentiates our solution. Our NPU mimics the neural network layers and operations of the most popular models, such as convolutions, fully-connected layers, transformers, and popular activation functions, to deliver sustained high performance at low power.
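For a sense of what those scalar, vector, and tensor operations look like in practice, here is a minimal NumPy sketch of a fully-connected layer followed by a popular transformer activation. On an NPU these map onto tensor and vector units respectively; this is purely illustrative and not Hexagon-specific code.

```python
# Illustrative NumPy sketch of the layer types named above.
import numpy as np

def fully_connected(x, W, b):
    return x @ W + b            # tensor op: dominated by multiply-accumulates

def gelu(x):
    # Common transformer activation (tanh approximation): a vector op.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.random.randn(1, 512).astype(np.float32)    # one token's hidden state
W = np.random.randn(512, 2048).astype(np.float32) # layer weights
b = np.zeros(2048, dtype=np.float32)

y = gelu(fully_connected(x, W, b))
print(y.shape)   # (1, 2048)
```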
So naturally there's a side tunnel to DSPs:

Digital signal processor - Wikipedia
Such performance improvements have led to the introduction of digital signal processing in commercial communications satellites where hundreds or even thousands of analog filters, switches, frequency converters and so on are required to receive and process the uplinked signals and ready them for downlinking, and can be replaced with specialised DSPs with significant benefits to the satellites' weight, power consumption, complexity/cost of construction, reliability and flexibility of operation. For example, the SES-12 and SES-14 satellites from operator SES launched in 2018, were both built by Airbus Defence and Space with 25% of capacity using DSP.[6]
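For readers unfamiliar with what a DSP actually computes when it replaces an analog filter, the sketch below shows a simple FIR low-pass filter: each output sample is a multiply-accumulate over recent input samples, which is the core operation DSP architectures are built around. The signal, sample rate, and tap values are arbitrary examples, not anything from a satellite payload.

```python
# Illustrative FIR low-pass filter: the kind of work a DSP does in place of
# an analog filter. All numbers here are arbitrary examples.
import numpy as np

def fir_filter(signal, taps):
    # Each output sample is a multiply-accumulate over the last len(taps) inputs.
    return np.convolve(signal, taps, mode="same")

fs = 1000                                  # sample rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)  # 5 Hz + 200 Hz
taps = np.ones(21) / 21                    # 21-tap moving average (low-pass)
y = fir_filter(x, taps)
print(x.std(), y.std())                    # high-frequency component attenuated
```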
I wonder if Airbus intends to swap its DSPs for SNNs?