Nvidia doesn't get digital SNNs.
Accelerator(s) 1314 in this specification would be a candidate for replacement by Akida:
US2020364508A1 USING DECAY PARAMETERS FOR INFERENCING WITH NEURAL NETWORKS
[0146] In at least one embodiment, GPU(s) 1308 may include an integrated GPU (alternatively referred to herein as an “iGPU”). In at least one embodiment, GPU(s) 1308 may be programmable and may be efficient for parallel workloads. In at least one embodiment, GPU(s) 1308 may use an enhanced tensor instruction set. In one embodiment, GPU(s) 1308 may include one or more streaming microprocessors, where each streaming microprocessor may include a level one (“L1”) cache (e.g., an L1 cache with at least 96 KB storage capacity), and two or more of the streaming microprocessors may share an L2 cache (e.g., an L2 cache with a 512 KB storage capacity). In at least one embodiment, GPU(s) 1308 may include at least eight streaming microprocessors. In at least one embodiment, GPU(s) 1308 may use compute application programming interface(s) (API(s)). In at least one embodiment, GPU(s) 1308 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's CUDA).
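For anyone wanting to see what [0146] looks like from software: the SM count and L2 cache size it describes are exactly what the CUDA runtime API reports. A minimal sketch (my example, not from the patent), using the standard cudaGetDeviceProperties call:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Query the properties [0146] talks about -- streaming multiprocessor
// count and L2 cache size -- through the CUDA runtime API.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("%s: %d SMs, %d KB L2, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount,
           prop.l2CacheSize / 1024, prop.major, prop.minor);
    return 0;
}
```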
[0147] In at least one embodiment, one or more of GPU(s) 1308 may be power-optimized for best performance in automotive and embedded use cases. For example, in one embodiment, GPU(s) 1308 could be fabricated using Fin field-effect transistor (“FinFET”) technology. In at least one embodiment, each streaming microprocessor may incorporate a number of mixed-precision processing cores partitioned into multiple blocks. For example, and without limitation, 64 FP32 cores and 32 FP64 cores could be partitioned into four processing blocks. In at least one embodiment, each processing block could be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA TENSOR COREs for deep learning matrix arithmetic, a level zero (“L0”) instruction cache, a warp scheduler, a dispatch unit, and/or a 64 KB register file. In at least one embodiment, streaming microprocessors may include independent parallel integer and floating-point data paths to provide for efficient execution of workloads with a mix of computation and addressing calculations. In at least one embodiment, streaming microprocessors may include independent thread scheduling capability to enable finer-grain synchronization and cooperation between parallel threads. In at least one embodiment, streaming microprocessors may include a combined L1 data cache and shared memory unit in order to improve performance while simplifying programming.
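The "mixed-precision TENSOR COREs for deep learning matrix arithmetic" in [0147] are programmable through CUDA's public WMMA API. A minimal sketch of that pattern (my example, not from the patent): one warp computing a single 16x16 tile with FP16 inputs and FP32 accumulation.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a single 16x16x16 tile: FP16 inputs,
// FP32 accumulation -- the mixed-precision pattern the Tensor Cores above
// are built for. Requires compute capability 7.0+ (compile with -arch=sm_70).
__global__ void tile_mma(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // C = 0
    wmma::load_matrix_sync(fa, a, 16);       // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(acc, fa, fb, acc);        // issued to the Tensor Cores
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
// Launch with exactly one warp: tile_mma<<<1, 32>>>(dA, dB, dD);
```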
…
[0152] In at least one embodiment, one or more of SoC(s) 1304 may include one or more accelerator(s) 1314 (e.g., hardware accelerators, software accelerators, or a combination thereof). In at least one embodiment, SoC(s) 1304 may include a hardware acceleration cluster that may include optimized hardware accelerators and/or large on-chip memory. In at least one embodiment, large on-chip memory (e.g., 4 MB of SRAM) may enable the hardware acceleration cluster to accelerate neural networks and other calculations. In at least one embodiment, the hardware acceleration cluster may be used to complement GPU(s) 1308 and to off-load some of the tasks of GPU(s) 1308 (e.g., to free up more cycles of GPU(s) 1308 for performing other tasks). In at least one embodiment, accelerator(s) 1314 could be used for targeted workloads (e.g., perception, convolutional neural networks (“CNNs”), recurrent neural networks (“RNNs”), etc.) that are stable enough to be amenable to acceleration. In at least one embodiment, a CNN may include region-based or regional convolutional neural networks (“RCNNs”) and Fast RCNNs (e.g., as used for object detection) or another type of CNN.
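The patent never says how software reaches accelerator(s) 1314, but on NVIDIA's shipping SoCs (Xavier, Orin) the usual route to the DLA offload cores is TensorRT. A minimal sketch, assuming the public TensorRT C++ API (which the patent does not name), that asks how many DLA cores the platform exposes:

```cpp
#include <NvInfer.h>
#include <iostream>

// TensorRT requires a logger; print warnings and worse.
struct Logger : nvinfer1::ILogger {
    void log(Severity sev, const char* msg) noexcept override {
        if (sev <= Severity::kWARNING) std::cout << msg << "\n";
    }
};

int main() {
    Logger logger;
    auto* builder = nvinfer1::createInferBuilder(logger);
    // On DLA-equipped SoCs this reports how many fixed-function
    // inference cores are available to off-load work from the GPU.
    std::cout << "DLA cores available: " << builder->getNbDLACores() << "\n";
    delete builder;
    return 0;
}
```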
[0154] In at least one embodiment, DLA(s) may perform any function of GPU(s) 1308, and by using an inference accelerator, for example, a designer may target either DLA(s) or GPU(s) 1308 for any function. For example, in at least one embodiment, a designer may focus processing of CNNs and floating point operations on DLA(s) and leave other functions to GPU(s) 1308 and/or other accelerator(s) 1314.
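Here is roughly what "a designer may target either DLA(s) or GPU(s)" looks like in practice. A hedged sketch assuming TensorRT 8+ (my example, not from the patent; network construction and the logger are omitted): steer a network onto the DLA and let unsupported layers fall back to the GPU.

```cpp
#include <NvInfer.h>

// Sketch: build an engine that runs on the DLA, with GPU fallback for
// layers the DLA cannot execute. `builder` and `network` are assumed to
// be an IBuilder and a populated INetworkDefinition created elsewhere.
nvinfer1::IHostMemory* buildForDLA(nvinfer1::IBuilder& builder,
                                   nvinfer1::INetworkDefinition& network) {
    nvinfer1::IBuilderConfig* config = builder.createBuilderConfig();
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);                                  // first DLA core
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);  // GPU takes the rest
    config->setFlag(nvinfer1::BuilderFlag::kFP16);          // DLA wants FP16/INT8
    nvinfer1::IHostMemory* plan =
        builder.buildSerializedNetwork(network, *config);
    delete config;
    return plan;  // serialized engine, ready to deserialize and run
}
```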
[0155] In at least one embodiment, accelerator(s) 1314 (e.g., hardware acceleration cluster) may include programmable vision accelerator(s) (“PVA”), which may alternatively be referred to herein as a computer vision accelerator. In at least one embodiment, PVA(s) may be designed and configured to accelerate computer vision algorithms for advanced driver assistance system (“ADAS”) 1338, autonomous driving, augmented reality (“AR”) applications, and/or virtual reality (“VR”) applications. PVA(s) may provide a balance between performance and flexibility. In at least one embodiment, each PVA may include, for example and without limitation, any number of reduced instruction set computer (“RISC”) cores, direct memory access (“DMA”), and/or any number of vector processors.
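The patent gives no software interface for the PVA either; on Jetson-class parts the public route is NVIDIA's VPI library (my assumption, not named in the patent). A hedged sketch that submits a Gaussian blur to the PVA backend; swapping in VPI_BACKEND_CUDA or VPI_BACKEND_CPU retargets the same call, which is the performance/flexibility trade-off [0155] alludes to. Image fill and readback are omitted.

```cpp
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/GaussianFilter.h>

// Run a 5x5 Gaussian blur on the PVA backend via VPI.
int main() {
    VPIStream stream = nullptr;
    VPIImage in = nullptr, out = nullptr;
    vpiStreamCreate(0, &stream);
    vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &in);
    vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &out);

    vpiSubmitGaussianFilter(stream, VPI_BACKEND_PVA, in, out,
                            5, 5, 1.0f, 1.0f, VPI_BORDER_ZERO);
    vpiStreamSync(stream);  // wait for the PVA to finish

    vpiImageDestroy(out);
    vpiImageDestroy(in);
    vpiStreamDestroy(stream);
    return 0;
}
```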