"We’re proving that on-chip AI, close to the sensor, has a sensational future, for our customers’ products, as well as the planet."
This Qualcomm patent application relates to splitting a large NN across 2 or more SoCs because the weights are too large for the on-SoC memory of a single NN SoC.
US2020250545A1 SPLIT NETWORK ACCELERATION ARCHITECTURE
Priority: 20190206
[0022] As noted, an artificial intelligence accelerator may be used to train a neural network. Training of a neural network generally involves determining one or more weights associated with the neural network. For example, the weights associated with a neural network are determined by hardware acceleration using a deep learning accelerator. Once the weights associated with a neural network are determined, an inference may be performed using the trained neural network, which computes results (e.g., activations) by processing input data based on the weights associated with the trained neural network.
[0023] In practice, however, a deep learning accelerator has a fixed amount of memory (e.g., static random access memory (SRAM) with a capacity of 128 megabytes (MB)). As a result, the capacity of a deep learning accelerator is sometimes not large enough to accommodate and store a single network. For example, some networks have weights of a larger size than the fixed amount of memory available from the deep learning accelerator. One solution to accommodate large networks is to split the weights into a separate storage device (e.g., a dynamic random access memory (DRAM)). These weights are then read from the DRAM during each inference. This implementation, however, uses more power and can result in a memory bottleneck.
[0024] Another solution to accommodate large networks is splitting the network into multiple pieces and passing intermediate results from one accelerator to another through a host. Unfortunately, passing intermediate inference request results through the host consumes host bandwidth. For example, using a host interface (e.g., a peripheral component interconnect express (PCIe) interface) to pass intermediate inference request results consumes the host memory bandwidth. In addition, passing intermediate inference request results through the host (e.g., a host processor) consumes central processing unit cycles of the host processor and adds latency to an overall inference calculation.
[0025] One aspect of the present disclosure splits a large neural network into multiple, separate artificial intelligence (AI) inference accelerators (AIIAs). Each of the separate AI inference accelerators may be implemented in a separate system-on-chip (SoC). For example, each AI inference accelerator is allocated and stores a fraction of the weights or other parameters of the neural network. Intermediate inference request results are passed from one AI inference accelerator to another AI inference accelerator independent of a host processor. Thus, the host processor is not involved with the transfer of the intermediate inference request results.
The system passes intermediate results from one partial-NN SoC directly to the next NN SoC, without going through the host.
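To make the idea concrete, here is a minimal sketch (in Python, with NumPy standing in for accelerator hardware) of the scheme described in [0025]: a network whose weights don't fit in one accelerator's SRAM is partitioned so each accelerator holds only its fraction of the weights, and intermediate activations flow accelerator-to-accelerator rather than back through the host. All class and variable names here are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

class AIInferenceAccelerator:
    """Stand-in for one AI inference accelerator SoC (hypothetical names)."""
    def __init__(self, weights, peer=None):
        self.weights = weights   # this SoC's fraction of the network's weights
        self.peer = peer         # next accelerator in the chain (direct link)

    def infer(self, activations):
        # Compute this slice of the network (one dense layer + ReLU here).
        out = np.maximum(activations @ self.weights, 0.0)
        # Hand the intermediate result straight to the peer, bypassing the host.
        return self.peer.infer(out) if self.peer else out

# Split a 2-layer network across two "accelerators".
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 4))
soc2 = AIInferenceAccelerator(w2)
soc1 = AIInferenceAccelerator(w1, peer=soc2)

x = rng.standard_normal((1, 8))
result = soc1.infer(x)   # host submits the input and sees only the final result
print(result.shape)      # (1, 4)
```

The point of the chaining is that the host processor never touches the intermediate activations, so no host memory bandwidth or CPU cycles are spent relaying them, which is the claimed advantage over the host-mediated approach in [0024].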
Now, I don't know how this differs from having 2 or more Akida 1000s connected up.
But, if Qualcomm think they've invented it, that suggests that 2 years ago, they were not planning to use Akida.
Our patent has a priority date of 20181101, which pre-dates Qualcomm's priority by 3 months.
Hi Sirod69, I have to ask something about Qualcomm again. What exactly do we know about this "AI accelerator chip"? Can someone say something about that?
The Snapdragon Ride Flex was first mentioned during Qualcomm’s Automotive investor day in September 2022, but more details are available now. The original Ride platform was based around a two-chip solution with an ADAS SoC and an AI accelerator chip.
Qualcomm Announces Next-Generation Snapdragon Ride Flex For Automotive Central Compute At CES
This week at CES, Qualcomm is officially launching the next generation of its automotive system-on-a-chip (SoC), the Snapdragon Ride Flex. (www.forbes.com)