RISC Is Fundamentally Unscalable
Today, there was an announcement about a new RISC-V chip, which has got a lot of people excited. I wish I could also be excited, but to me, this is just a reminder that RISC architectures are fundamentally unscalable, and inevitably stop being RISC as soon as they need to be fast. People still call ARM a “RISC” architecture despite ARMv8.3-A adding a
The reason this keeps happening is that the laws of physics ensure that no RISC architecture can scale under load. The problem is that a modern CPU is so fast that just accessing the L1 cache takes anywhere from 3 to 5 cycles. This is part of the reason modern CPUs rely so much on register renaming, allowing them to have hundreds of internal registers that are used to make things go fast, as opposed to the paltry 90 registers actually exposed, 40 of which are just floating-point registers for vector operations. The fundamental issue that CPU architects run into is that the speed of light isn’t getting any faster. Even getting an electrical signal from one end of a CPU to the other now takes more than one cycle, which means the physical layout of your CPU now has a significant impact on how long operations take. Worse, the faster the CPU gets, the more this lag becomes a problem, so unless you shrink the entire CPU or redesign it so your L1 and L2 caches are physically closer to the transistors that need them, the latency from accessing those caches can only go up, not down. The CPU might be getting faster, but the speed of light isn’t.
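To put rough numbers on that limit, here is a minimal sketch that computes how far a signal could travel in one clock cycle if it moved at the vacuum speed of light. The function name, the `fraction_of_c` parameter, and the clock speeds in the loop are illustrative assumptions, not figures from any particular chip.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def distance_per_cycle_mm(clock_hz: float, fraction_of_c: float = 1.0) -> float:
    """Upper bound on the distance a signal covers in one clock cycle, in mm.

    fraction_of_c is a hypothetical knob for modelling slower propagation;
    the default of 1.0 gives the absolute physical ceiling.
    """
    seconds_per_cycle = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * fraction_of_c * seconds_per_cycle * 1000.0


if __name__ == "__main__":
    # Assumed clock speeds, chosen only to show the trend.
    for ghz in (1, 2, 4, 5):
        mm = distance_per_cycle_mm(ghz * 1e9)
        print(f"{ghz} GHz: at most {mm:.1f} mm per cycle")
```

At 4 GHz the ceiling works out to about 75 mm per cycle, and it is only a ceiling: signals in on-chip wires propagate well below the vacuum speed of light, so the usable distance per cycle is much smaller, and it shrinks further every time the clock goes up.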
Now, obviously RISC CPUs are very complicated architectures that do all sorts of insane pipelining to try to execute as many instructions at the same time as possible. This is necessary because, unless your data is already loaded into registers, you might spend more cycles loading data from the L1 cache than doing the actual operation! If you hit the L2 cache, that will cost you 13-20 cycles by itself, and L3 cache hits are 60-100 cycles. This is made worse by the fact that complex floating-point operations can almost always be performed faster by encoding the operation in hardware, often in just one or two cycles, when manually implementing the same operation would’ve taken 8 or more cycles. The FJCVTZS instruction mentioned above even sets a specific flag based on certain edge cases to allow an immediate jump instruction to be done afterwards, again to minimize hitting the cache.
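The cache numbers above can be turned into a back-of-the-envelope average cost per load. This is only a sketch: the cycle ranges are the ones quoted in this post, but the hit fractions in `ASSUMED_HIT_FRACTION` are made up for illustration, and the model ignores loads that miss every cache and go to main memory.

```python
# Cycle ranges quoted in the text above, as (low, high).
LATENCY_CYCLES = {"L1": (3, 5), "L2": (13, 20), "L3": (60, 100)}

# Hypothetical mix, for illustration only: fraction of loads served by each level.
ASSUMED_HIT_FRACTION = {"L1": 0.90, "L2": 0.07, "L3": 0.03}


def expected_load_cycles(hit_fraction, latency=LATENCY_CYCLES):
    """Return (best, worst) expected cycles per load.

    Each cache level's latency is weighted by the fraction of loads it serves.
    """
    if abs(sum(hit_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    best = sum(hit_fraction[lvl] * latency[lvl][0] for lvl in hit_fraction)
    worst = sum(hit_fraction[lvl] * latency[lvl][1] for lvl in hit_fraction)
    return best, worst


if __name__ == "__main__":
    best, worst = expected_load_cycles(ASSUMED_HIT_FRACTION)
    print(f"expected cycles per load: {best:.2f} to {worst:.2f}")
```

Even with 90% of loads hitting L1 in this assumed mix, the average load lands between 5.41 and 8.9 cycles, which is more than the one or two cycles quoted for a hardware-encoded operation.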
All of this leads us to the single instruction multiple data (SIMD) vector instructions common to almost all modern CPUs. Instead of doing a complex operation on a single float, they do a simple operation on many floats at once. The CPU can perform operations on 4, 8, or even 16 floating-point numbers at the same time, in just 3 or 4 cycles, even though doing the same operation on each float individually would have cost 2 or 3 cycles apiece. Even loading an array of floats into a large register will be faster than loading each float individually. There is no escaping the fact that attempting to run instructions one by one, even with fancy pipelining, will usually result in a CPU that’s simply not doing anything most of the time. In order to make things go fast, you have to do things in bulk. This means having instructions that do as many things as possible, which is the exact opposite of how RISC works.
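The arithmetic behind that claim is easy to sketch. The model below is deliberately naive: it assumes operations run strictly one after another with no pipelining or overlap, and the function names are hypothetical. The cycle counts plugged in are the ones quoted above.

```python
import math


def scalar_cycles(n_floats: int, cycles_per_op: int) -> int:
    """Total cycles when each float is processed by its own instruction."""
    return n_floats * cycles_per_op


def simd_cycles(n_floats: int, lanes: int, cycles_per_vector_op: int) -> int:
    """Total cycles when floats are processed `lanes` at a time.

    A partially filled final vector still costs a full vector operation.
    """
    return math.ceil(n_floats / lanes) * cycles_per_vector_op


if __name__ == "__main__":
    n = 16
    print("scalar:", scalar_cycles(n, 2), "to", scalar_cycles(n, 3), "cycles")
    for lanes in (4, 8, 16):
        low = simd_cycles(n, lanes, 3)
        high = simd_cycles(n, lanes, 4)
        print(f"{lanes} lanes: {low} to {high} cycles")
```

For 16 floats, this model gives 32 to 48 cycles for the one-at-a-time version against 3 to 4 cycles for a single 16-lane operation. Real hardware overlaps scalar operations, so the true gap is smaller than this, but the direction is the same.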
Now, this does not mean CISC is the future. We already invented a solution to this problem, which is VLIW - Very Long Instruction Word. This is what Itanium was, because researchers at HP anticipated this problem 30 years ago and teamed up with Intel to create what eventually became Itanium. In Itanium, or any VLIW architecture, you can tell the CPU to do many things at once. This means that, instead of having to build massive vector processing instructions or other complex specialized instructions, you can build your own mega-instructions out of a much simpler instruction set. This is great, because it simplifies the CPU design enormously while sidestepping the pipelining issues of RISC. The problem is that this is really fucking hard to compile, and that’s what Intel screwed up. Intel assumed that compilers in 2001 could extract the instruction-level parallelism necessary to make VLIW work, but in reality we’ve only very recently figured out how to reliably do that. 20 years ago, we weren’t even close, so nobody could compile fast code for Itanium, and now Itanium is dead, even though it was specifically designed to solve our current predicament.
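To see why the compiler's job is the hard part, here is a toy bundle packer. It is not Itanium's bundle format or any real compiler's scheduler: `Instr`, `pack_bundles`, and the two sample programs are all hypothetical, and the packer never reorders instructions, which a real scheduler would. It only shows how data dependencies decide whether a wide instruction word can actually be filled.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # register written
    op: str       # mnemonic, unused by the packer, kept for readability
    srcs: tuple   # registers read


def pack_bundles(program, width):
    """Greedily pack instructions, in program order, into bundles of <= width.

    A new bundle starts when the current one is full, or when the next
    instruction reads or overwrites a register written earlier in the same
    bundle (a read-after-write or write-after-write hazard).
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    bundles, current, written = [], [], set()
    for instr in program:
        hazard = instr.dest in written or any(s in written for s in instr.srcs)
        if current and (len(current) >= width or hazard):
            bundles.append(current)
            current, written = [], set()
        current.append(instr)
        written.add(instr.dest)
    if current:
        bundles.append(current)
    return bundles


# Four instructions with no dependencies between them.
INDEPENDENT = [
    Instr("r1", "add", ("a", "b")),
    Instr("r2", "mul", ("c", "d")),
    Instr("r3", "sub", ("e", "f")),
    Instr("r4", "add", ("g", "h")),
]

# Four instructions where each one needs the previous result.
CHAIN = [
    Instr("r1", "add", ("a", "b")),
    Instr("r2", "mul", ("r1", "c")),
    Instr("r3", "sub", ("r2", "d")),
    Instr("r4", "add", ("r3", "e")),
]

if __name__ == "__main__":
    print("independent:", len(pack_bundles(INDEPENDENT, 4)), "bundle(s)")
    print("chain:", len(pack_bundles(CHAIN, 4)), "bundle(s)")
```

With a width of 4, the independent program fits in a single bundle while the dependent chain needs four bundles of one instruction each, so three quarters of the machine sits idle. Finding enough independent work to fill those slots, in real code, is the problem the compiler is left with.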
With that said, the MILL instruction set uses VLIW along with several other innovations designed to compensate for a lot of the problems discussed here, like having deferred load instructions to account for the lag time between requesting a piece of data and actually being able to use it (which, incidentally, also makes MILL immune to Spectre because it doesn’t need to speculate). Sadly, MILL is currently still vaporware, having not materialized any actual hardware despite its promising performance gains. One reason for this might be that any VLIW architecture has a highly unique instruction set. We’re used to x86, which is so high-level it has almost nothing to do with the underlying CPU implementation. This is nice, because everyone implements the same instruction set and your programs all work on it, but it means the way instructions interact is hard to predict, much to the frustration of compiler optimizers. With VLIW, you would very likely have to recompile your program for every single unique CPU, which is a problem MILL has spent quite a bit of time on.
Every. Single. Time.