Fullmoonfever
Top 20
Funny bring up Apple. Following on from this, is it safe to assume that Apple are using us in their Macs and perhaps tablets?
An article I was reading makes me believe the possibility.
ARM vs Intel Processors: What’s the Difference?
Originally published Oct. 16, 2021, by Darien Graham-Smith. Updated Jan. 16, 2022, by Steve Larner. When choosing a smartphone or tablet, you'll notice that
www.alphr.com
The last paragraph from that article:
View attachment 23089
From the article that @Pmel just published:
View attachment 23090
Is my dot joining close to the money? Please let it be so.
Found this yesterday but hadn't posted yet.
Erik Engheim
Jan 15, 2021
The Secret Apple M1 Coprocessor
Developer Dougall Johnson has, through reverse engineering, uncovered a secret, powerful coprocessor inside the M1 chip, dubbed AMX: Apple Matrix coprocessor.
Stories about the Apple Matrix coprocessor (AMX) are already out there, but they are not exactly told in a beginner-friendly manner. That is what I try to do here: bring you the story buried under thick layers of technical jargon without treating you like an idiot.
To tell this story we need to clarify the basics such as what is a coprocessor? What is a matrix? And why should you even care about any of this?
More importantly, why do none of the Apple slides talk about this coprocessor? Why is it seemingly a secret? If you have read about the Neural Engine inside the M1 System-on-a-Chip (SoC), you may be confused about what makes Apple’s Matrix coprocessor (AMX) different.
Before we get to the big question, let me start with the basic concepts, such as what a matrix and a coprocessor are.
What is a Matrix Anyway?
A matrix is basically just a table of numbers. If you have worked with spreadsheets such as Microsoft Excel, you have basically worked with something very similar to matrices. The key difference is that in math such tables of numbers have a laundry list of operations they support and specific behavior.
A matrix can come in different flavors as you see here. A matrix with just one row is usually called a row vector. If it has just one column, we call it a column vector.
We can add, subtract, scale and multiply matrices. Addition is pretty easy. You just add every element separately. Multiplication is a bit more involved. I am just showing the simple case here.
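To make the operations above concrete, here is a minimal sketch in plain Python. The helper names (mat_add, mat_scale, mat_mul) and the sample values are made up for illustration and do not come from the article. Matrices are stored as lists of rows.

```python
# Matrices as lists of rows. Helper names and sample values are hypothetical.

def mat_add(a, b):
    """Add two matrices of the same shape, element by element."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def mat_scale(k, a):
    """Scale a matrix: multiply every element by the number k."""
    return [[k * x for x in row] for row in a]

def mat_mul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    columns_of_b = list(zip(*b))
    # Each result element is a row of a combined with a column of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
            for row in a]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
row_vector = [[1, 2, 3]]           # one row
column_vector = [[1], [2], [3]]    # one column

print(mat_add(A, B))                       # [[6, 8], [10, 12]]
print(mat_scale(2, A))                     # [[2, 4], [6, 8]]
print(mat_mul(A, B))                       # [[19, 22], [43, 50]]
print(mat_mul(row_vector, column_vector))  # [[14]]
```

Note how multiplication does far more arithmetic per output element than addition does, which is part of why dedicated hardware for it is attractive.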
More in depth: Why Does Matrix Multiplication Work the Way it Does?
Using matrices to rotate and scale: Explaining Affine Rotation (this is pretty math geeky).
Why Do We Care About Matrices?
The reason matrices are important is because they are heavily used in:
- Image processing
- Machine learning
- Speech and handwriting recognition
- Face recognition
- Compression
- Multimedia: audio and video
You could spend your silicon real-estate (transistors) on more CPU cores or on specialized hardware.
On any given chip, Apple has a max number of transistors to spend building different kinds of hardware. They could add more CPU cores, but that really just speeds up regular tasks, which already run fast enough. Thus they have chosen to spend transistors on specialized hardware to tackle image processing, video decoding and machine learning. This specialized hardware is what we call coprocessors and accelerators.
More talk about coprocessors and accelerators: Apple M1 foreshadows Rise of RISC-V.
How is Apple’s Matrix Coprocessor Different From the Neural Engine?
If you have read about the Neural Engine, you will know that it also does matrix operations to help with machine learning tasks. So why do we need the Matrix coprocessor? Or are they actually just the same thing? Am I just confused? No, let me clarify how Apple’s Matrix Coprocessor differs from the Neural Engine and why we need both.
The main processor (CPU), coprocessors and accelerators can usually exchange data over a shared data bus. The CPU usually controls memory access while an accelerator such as a GPU often has its own dedicated memory.
I admit that in past stories I have often used the terms coprocessor and accelerator interchangeably, but they are not the same. A GPU as found in your Nvidia graphics card and the Neural Engine are both a type of accelerator.
In both cases you have special areas of memory which the CPU has to fill up with data it wants processed, as well as another part of memory which it fills up with a list of instructions that the accelerator should perform. It is time consuming for a CPU to set up this kind of processing. There is a lot of coordination, filling in data, and then waiting to get results back.
Thus this only pays off for larger tasks. For smaller tasks the overhead will be too high.
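The trade-off can be put into rough numbers. The sketch below is a simplified cost model, not a measurement: it assumes a fixed setup cost for handing work to an accelerator plus a constant cost per item, and every number in it is hypothetical.

```python
# Simplified cost model: offloading costs a fixed setup plus a per-item cost.
# All function names and numbers are hypothetical, chosen only for illustration.

def cpu_time(items, cpu_cost_per_item):
    return items * cpu_cost_per_item

def accelerator_time(items, accel_cost_per_item, setup_cost):
    return setup_cost + items * accel_cost_per_item

def break_even_items(cpu_cost_per_item, accel_cost_per_item, setup_cost):
    """Smallest number of items at which the accelerator is strictly faster."""
    saving_per_item = cpu_cost_per_item - accel_cost_per_item
    if saving_per_item <= 0:
        return None  # the accelerator never wins
    return int(setup_cost // saving_per_item) + 1

# Hypothetical costs in arbitrary time units.
CPU_COST, ACCEL_COST, SETUP = 10, 1, 9000

print(break_even_items(CPU_COST, ACCEL_COST, SETUP))            # 1001
print(cpu_time(100, CPU_COST))                                  # 1000
print(accelerator_time(100, ACCEL_COST, SETUP))                 # 9100
print(cpu_time(100000, CPU_COST))                               # 1000000
print(accelerator_time(100000, ACCEL_COST, SETUP))              # 109000
```

With these made-up numbers the accelerator loses badly on 100 items and wins easily on 100,000, which is the pattern described above.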
Coprocessors, unlike accelerators, spy on the stream of instructions read from memory into the main processor. Accelerators in contrast don’t observe the instructions the CPU is pulling from memory.
This is where coprocessors are a benefit over accelerators.
Edit November 2nd 2021: More recent info on the AMX suggests the description I am giving below is not correct. There is no AMX per core, so it cannot spy on the instruction stream to a core. However, the advantage relative to the Neural Engine is similar to what I describe. AMX accesses memory more like a CPU than like a GPU or the Neural Engine, which are optimized for processing large but slow batches of data.
Coprocessors sit and spy on the stream of machine code instructions being fed from memory (or cache more specifically) into the CPU. Coprocessors are made to react to the particular instructions they were made to process. The CPU meanwhile has been made to mostly ignore these instructions or help facilitate the handling of them by a coprocessor.
What we gain from this is that instructions carried out by the coprocessor can be placed inside your regular code. This is different from, say, a GPU. If you have done GPU programming you know that shader programs are placed into separate buffers of memory, and you have to explicitly transport these shader programs to the GPU. You cannot place GPU-specific instructions inside your regular code. Thus for smaller workloads involving matrix processing AMX will be better than the Neural Engine.
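The general idea of coprocessor instructions living inline with regular code can be sketched as a toy model. This illustrates the classic coprocessor concept only; given the edit note above, it should not be read as how AMX actually works. Every opcode and operand name below is invented.

```python
# Toy model of a coprocessor watching a shared instruction stream.
# All opcodes and the program are hypothetical, invented for illustration;
# they are not real ARM or AMX instructions.

COPROCESSOR_OPCODES = {"MATLOAD", "MATMUL"}

def dispatch(program):
    """Split one inline instruction stream into what each unit executes."""
    cpu_executed = []
    coprocessor_executed = []
    for opcode, operands in program:
        if opcode in COPROCESSOR_OPCODES:
            # The CPU skips this instruction; the coprocessor picks it up.
            coprocessor_executed.append((opcode, operands))
        else:
            cpu_executed.append((opcode, operands))
    return cpu_executed, coprocessor_executed

# One ordinary program with matrix instructions mixed into regular code.
program = [
    ("LOAD", ("r0", "address_a")),
    ("MATLOAD", ("m0", "address_a")),
    ("MATMUL", ("m0", "m1")),
    ("ADD", ("r0", "r1")),
]

cpu_executed, coprocessor_executed = dispatch(program)
print([op for op, _ in cpu_executed])          # ['LOAD', 'ADD']
print([op for op, _ in coprocessor_executed])  # ['MATLOAD', 'MATMUL']
```

There is no separate buffer and no explicit hand-off step in this model: both units read the same stream, and each simply acts on the instructions meant for it.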
What is the catch? You need to actually define the instructions in the instruction-set architecture (ISA) of your microprocessor. Thus you need much tighter integration with the CPU when using a coprocessor than when using an accelerator.
ARM Ltd., creator of the ARM instruction-set architecture (ISA), has long resisted adding custom instructions to its ISA. This is one of the advantages of RISC-V: What Is Innovative About RISC-V?
However due to pressure from customers ARM relented and announced in 2019 that they would allow extensions. EE Times reports:
“The new instructions are interleaved with standard Arm instructions. To avoid software fragmentation and maintain a coherent software development environment, Arm expects customers to use the custom instructions mostly in called library functions.”
This may help explain why AMX instructions are not described in official documentation. ARM Ltd. expects Apple to keep these kinds of instructions inside libraries provided by the customer (Apple in this case).
How is a Matrix Coprocessor Different From a SIMD Vector Engine?
It is easy to confuse something like a matrix coprocessor with a SIMD vector engine, which you find inside most modern processors today including ARM processors. SIMD stands for Single Instruction Multiple Data.
Single Instruction Single Data (SISD) vs Single Instruction Multiple Data (SIMD)
SIMD is a way of getting higher performance when you need to perform the same operation on multiple elements. This is closely related to matrix operations. In fact SIMD instructions such as ARM’s Neon instructions or Intel x86 SSE or AVX are often used to speed up matrix multiplications.
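The difference between the two styles can be modeled by counting instructions. The sketch below is a toy model under an assumed vector width of 4 lanes. Python still adds one number at a time under the hood, so the only thing being modeled is how many instructions would be issued.

```python
# Toy model of SISD vs SIMD: count how many "instructions" are issued.
# LANES is an assumed vector width; the function names are hypothetical.

LANES = 4

def sisd_add(a, b):
    """One instruction per pair of elements."""
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions

def simd_add(a, b, lanes=LANES):
    """One instruction per group of `lanes` element pairs."""
    result, instructions = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10] * 8

print(sisd_add(a, b))  # ([10, 11, 12, 13, 14, 15, 16, 17], 8)
print(simd_add(a, b))  # ([10, 11, 12, 13, 14, 15, 16, 17], 2)
```

Both versions produce the same result, but the SIMD model needs 2 instructions instead of 8 for these eight elements.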
Read more: RISC-V Vector Instructions vs ARM and x86 SIMD.
However, a SIMD vector engine is part of a microprocessor core, just like the ALU (Arithmetic Logic Unit) and FPU (Floating Point Unit) are part of the CPU. Inside the microprocessor there is an instruction decoder which will pick apart an instruction and decide what functional unit to activate (gray boxes).
Inside a CPU you have the ALU, FPU as well as SIMD vector engines (not shown) as separate parts activated by the instruction decoder. A coprocessor is external.
A coprocessor in contrast is external to a microprocessor core. In fact one of the early ones, Intel’s 8087, was a physically separate chip designed to speed up floating point calculations.
Intel 8087. One of the early coprocessors used for performing floating point calculations.
Now you may wonder why anyone would want to complicate CPU design by having a separate chip like this which has to sniff on the data flowing from memory to the CPU, to see if anything is a floating point instruction.
The reason was simple: the original 8086 CPU in the first PCs contained 29,000 transistors. The 8087 in contrast was far more complex at 45,000 transistors. It was really hard to make anything with that many transistors. Combining these two chips into one would have been really hard and expensive.
But as manufacturing technology improved, it was not a problem to put floating point units (FPUs) inside the CPU. Thus FPUs replaced the floating point coprocessors.
Why the AMX is not simply a part of the Firestorm cores on the M1 is not clear to me. They are all on the same silicon die anyway. I can only offer some speculations. By being a coprocessor, it may be easier for the CPU to continue running in parallel. Apple may also have liked to keep non-standard ARM stuff outside of their ARM CPU cores.
Why Is the AMX a Secret?
If AMX is not described in official documentation, how do we even know about it? Thanks to developer Dougall Johnson, who has done an amazing job reverse engineering the M1 to discover this coprocessor. His efforts are described here. For matrix related math operations Apple has special libraries or frameworks such as Accelerate, which is made up of:
- vImage: higher level image processing, such as converting between formats and image manipulation.
- BLAS: a sort of industry standard for linear algebra (what we call the math dealing with matrices and vectors).
- BNNS: used for running and training neural networks.
- vDSP: digital signal processing, such as Fourier transformations and convolution. These are mathematical operations important in image processing, or really in any signal processing, including audio.
- LAPACK: higher level linear algebra functions, e.g. for solving linear equations.
But why doesn’t Apple document this and let us use these instructions directly? As mentioned earlier, this is something ARM Ltd. would like to avoid. If custom instructions are widely used it could fragment the ARM ecosystem.
However more importantly, this is an advantage to Apple. By only letting their libraries use these special instructions Apple retains the freedom to radically change how this hardware works later. They could remove or add AMX instructions. Or they could let the Neural Engine do the job. Either way they make the job easier for developers. Developers only need to use the Accelerate framework and can ignore how Apple specifically speeds up matrix calculations.
This is one of the big advantages Apple has by being vertically integrated. By controlling both the hardware and the software, they can pull these kinds of tricks. So the next question is how big a deal is this? What does this buy Apple in terms of performance and capabilities?
What Are the Advantages of Apple’s Matrix Coprocessor?
Nod Labs is a company that does machine interaction, intelligence and perception. Fast matrix operations are naturally in their interest. They have written a highly technical blog post about performance tests of AMX: Comparing Apple’s M1 matmul performance — AMX2 vs NEON.
What they are doing is comparing the performance of similar code written using AMX with code written using the Neon instructions, which are officially supported by ARM. Neon is a set of SIMD instructions.
What Nod Labs found was that by using AMX they were able to get twice the performance of Neon instructions for matrix operations. It doesn’t mean AMX is better for everything, but at least for machine learning and high performance computing (HPC) type of work, we can expect that AMX gives an edge over the competition.