💾 Archived View for dioskouroi.xyz › thread › 29417885 captured on 2021-12-03 at 14:04:38. Gemini links have been rewritten to link to archived content
________________________________________________________________________________
Direct links to some of the latest specs:
- Scalar crypto:
https://github.com/riscv/riscv-crypto/releases
- Vectors:
https://github.com/riscv/riscv-v-spec/releases
- Bitmanip:
https://github.com/riscv/riscv-bitmanip/releases
Hypervisor seems to be covered here:
https://github.com/riscv/riscv-isa-manual/blob/master/src/hy...
I don't see why crypto can't just be a peripheral. Here's a block of memory and a key. Tell me when you're done.
There are lots of good reasons to make cryptographic operations instructions instead of a memory-mapped peripheral, but I prefer something like VIA PadLock, which implemented cipher modes instead of just implementing the round function as an instruction. Any implementation could even trap those and implement them in a peripheral. The problem with memory-mapped peripherals is that access to them has to be multiplexed and their state preserved across context switches. Specialized instructions on existing registers avoid this problem. VIA PadLock solved it by piggybacking on the existing x86 REP prefix for interruptible string instructions, and it only cached the cipher round keys in the crypto unit, reloading them from memory (or repeating the key schedule) after a context switch.
In lots of places this makes sense. E.g. lots of embedded ARM platforms have a separate AES / ECC accelerator peripheral.
The trouble comes when you need to share access to a memory-mapped peripheral among multiple threads/processes/users etc. It can be done, but it's usually easier to manage CPU registers than peripheral devices for things like crypto operations in larger systems. Plus, you have to do access control to the peripheral (so other processes don't try to steal your key); if it's all within the security boundary of a "normal" process, you get that (mostly) for free.
All of the above has caveats and exceptions, but generally the major ISAs (ARM, SPARC, x86, now RISC-V) take this approach.
Latency? Probably depends on the type of crypto.
Huh, I'd heard that the Bitmanip extension would have a conditional move but I don't see it in this version.
That, and other operations requiring three input registers -- and therefore a LOT of encoding space -- have been postponed to a possible future extension.
Full GREV and my lovely GORC have also gotten lost, though the encodings for the specific REV and ORC instructions that are included are upwardly compatible with the proposed general versions.
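For readers who haven't met them, here is a minimal Python sketch of the generalized reverse and generalized OR-combine operations as they appeared in the draft bitmanip proposals, restricted to 32 bits. The names `grev32`, `gorc32` and the helper `_swapped` are my own labels, not anything from the spec. On a 32-bit register, control value 24 gives the byte swap (REV8) and control value 7 gives the per-byte OR-combine (ORC.B), which is the sense in which the ratified instructions are special cases of the general ones.

```python
# Illustrative 32-bit model of the draft GREV/GORC operations (names are mine).
MASK32 = 0xFFFFFFFF

# (shift, mask selecting the low half of each swapped pair) per butterfly stage
_STAGES = [
    (1, 0x55555555),
    (2, 0x33333333),
    (4, 0x0F0F0F0F),
    (8, 0x00FF00FF),
    (16, 0x0000FFFF),
]


def _swapped(x, shift, low_mask):
    """Swap adjacent groups of `shift` bits."""
    high_mask = (low_mask << shift) & MASK32
    return (((x & low_mask) << shift) | ((x & high_mask) >> shift)) & MASK32


def grev32(x, control):
    """Generalized reverse: each set bit of `control` enables one swap stage."""
    x &= MASK32
    for shift, low_mask in _STAGES:
        if control & shift:
            x = _swapped(x, shift, low_mask)
    return x


def gorc32(x, control):
    """Generalized OR-combine: like grev32, but ORs the swapped value back in."""
    x &= MASK32
    for shift, low_mask in _STAGES:
        if control & shift:
            x |= _swapped(x, shift, low_mask)
    return x
```

With this model, `grev32(x, 31)` is a full bit reversal and `grev32(x, 24)` is a byte swap; other control values give the remaining combinations of swap stages.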
I don't find any hint that B got any attention at all.
I don't see the J Extension on here. Does anyone know what's the state of work of that group?
(J Extension is about dynamic languages acceleration; stuff like code caches, and maybe providing GCs some help. I guess that's new territory so it's not as straightforward compared to say, the bitmanip extension)
J extension is, as you suspected, not anywhere near ready.
The reasoning is simple. It is indeed relatively new territory. The research needs to be done, and standards should be based on solid research. This will likely take a significant amount of time.
Experimentation can be done using custom extensions, but only what's mature and proven belongs in RISC-V official extensions.
All I needed to know is that my favorite methods to optimize algorithms instead radically pessimize them on all published RISC-V profiles.
Even with more extensions ratified, that does not mean they will be available on common target hardware, thus requiring, at best, build-system gymnastics unlikely in normal software distribution. This is especially true for any not required in published profiles.
Unless you are the system architect for a massive-volume embedded application, the extensions are more like a cruel joke: "You _could have had_ this feature if we let you, but we didn't. Peasant."
Is there a full list of what was ratified?
Wikipedia only lists 6 as frozen, so where did the others come from?
https://en.wikipedia.org/wiki/RISC-V#Design
https://wiki.riscv.org/display/TECH/Recently+Ratified+Extens...
Updated versions of the Privileged and Unprivileged Spec PDFs will be posted to riscv.org/specifications soon.
For convenience:
* PMP Enhancements for memory access and execution prevention on Machine mode (Smepmp)
* RISC-V Base Cache Management Operation ISA Extensions
* RISC-V Bit-Manipulation ISA-extensions
* RISC-V Count Overflow and Mode-Based Filtering Extension
* RISC-V Cryptography Extensions Volume I: Scalar & Entropy Source Instructions
* RISC-V State Enable Extension
* RISC-V "stimecmp / vstimecmp" Extension
* RISC-V Vector Extension
* The RISC-V Instruction Set Manual Volume II: Privileged Architecture
* "Zfh" and "Zfhmin" Standard Extensions for Half-Precision Floating-Point
* "Zfinx", "Zdinx", "Zhinx", "Zhinxmin": Standard Extensions for Floating-Point in Integer Registers
I'd love to hear what people have to say about the vector instructions. I've always found that SIMD on x86 was quite clunky, and I heard RISC-V vectors are very different from that. Is that true?
Very different. RISC-V's vectors (RVV) are "variable length", so the programmer can request a length and the machine tells you what it can give you. Different machine versions can change the underlying vector size and the code Will Just Work.
This is different from "fixed-width SIMD" which has a hard-coded vector length. To make things more challenging for the programmer/compiler, I believe most x86 SIMD versions also don't provide a "mask" register, so you're stuck with using all vector elements (AVX512 added masks).
Each has its advantages and disadvantages (esp. on the design complexity vs programmer/compiler interface complexity).
RVV also provides a mechanism to reconfigure the register file, ganging logical registers together to get longer effective vector lengths.
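The "request a length, take what the machine gives you" loop described above can be sketched as a toy model in Python. This is only an illustration: `vsetvl` here is a simplified stand-in for the real instruction (it always grants `min(requested, vlmax)`, which is an assumption; real hardware has a little more freedom), and `vlmax` plays the role of whatever the hardware's vector registers can hold.

```python
# Toy model of RVV-style strip-mining; not real vector code.

def vsetvl(requested, vlmax):
    """Return how many elements the 'machine' grants for this pass (simplified)."""
    return min(requested, vlmax)


def vector_add(a, b, vlmax):
    """Add two equal-length lists in chunks of at most vlmax elements."""
    assert len(a) == len(b)
    out = []
    i = 0
    remaining = len(a)
    while remaining > 0:
        vl = vsetvl(remaining, vlmax)  # ask for everything left, accept the grant
        # One "vector add" covering vl elements.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
        remaining -= vl
    return out
```

The same loop gives the same answer for `vlmax` of 1, 4 or 64, and there is no separate tail loop: the last pass simply gets a smaller `vl`.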
Has anyone actually used these in anger? I see the potential over fixed width SIMD, but what's it like to actually program in C++?
Who is going to write all the documentation and snippets for them? RISC-V docs seem to be mostly pdf based which isn't great.
I'm not an expert but I've seen them characterised as "the return of Cray vectors", so maybe yes
Crays were programmed in assembly though.
So is risc-v…?
No, it's obviously not going to be. You can write x86 SIMD code extremely effectively from a high-level language. I _want_ to write RISC-V/V code in C++, but if it ends up as carting around fixed-width vectors then that's a loss.
I’m horribly confused. x86 SIMD has the fixed width vectors, not RISC-V or Cray.
Can you change the "shape" of the vectors? e.g. 1x16 vs 4x4 to support vectors and matrices?
You have widening operations, e.g. 16x16->32 bit multiplications, and can reduce the number of available registers to get longer vectors, but among the really interesting ones are fault-only-first loads and masked instructions that enable the vector unit to work on things like null-terminated strings. The specification includes vectorized strlen/strcmp/strcpy/strncpy implementations as examples. Most existing (packed) SIMD instruction sets aren't useful for these common functions.
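To make the fault-only-first idea concrete, here is a rough Python model of a strlen built on it. It is a sketch under assumptions, not the code from the spec: `memory` is a `bytes` object standing in for mapped memory, reading past its end counts as a fault, and `load_ff`, `vec_strlen` and `Fault` are names I made up. The point is that the load may run past the terminator without trapping, as long as the first element is readable.

```python
# Toy model of a fault-only-first load and a strlen built on it.

class Fault(Exception):
    """Raised when the first element of a load is outside 'mapped' memory."""


def load_ff(memory, addr, vl):
    """Load up to vl bytes; fault only if element 0 is unreadable, else shorten."""
    if addr >= len(memory):
        raise Fault(addr)
    # Slicing truncates at the end of memory, modelling the reduced vl.
    return memory[addr:addr + vl]


def vec_strlen(memory, start, vlmax):
    """Length of the NUL-terminated string at `start`, scanned vlmax bytes at a time."""
    addr = start
    while True:
        chunk = load_ff(memory, addr, vlmax)
        zero_at = chunk.find(0)  # models compare-to-zero plus find-first-set
        if zero_at >= 0:
            return addr + zero_at - start
        addr += len(chunk)
```

No alignment prologue or page-boundary check is needed in this model: a chunk that would cross the end of memory just comes back shorter.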
Given the number of implementations of str* routines in
https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/m...
, maybe you might want to revisit your last statement. PCMP/MOVMSK work well enough for finding the trailing NUL.
Now compare how many different versions of the functions are required for the dozens of possible x86 extensions (and combinations of them), all the prologue/epilogue code required to watch out for page boundaries and unaligned pointers, and the length of the inner loop needed to handle all the packing/unpacking, cobble together horizontal operations into the required masks, and somehow use them for flow control where needed. It's enough code to put painful pressure on the instruction cache, and it requires wide OoO superscalar CPU cores to be worth the overhead. Compare the code in the RISC-V vector spec with this strcmp
https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/m...
and tell me it's a clean and straightforward implementation using the instruction set as intended and not an ugly hack around its limitations.
I'm not going to dispute that x86's approach leads to a lot of duplication for each vector size, but your statement was that the fixed-size vector approach isn't "useful for these common functions," which implies to me that it couldn't be used at all.
The extension is agnostic with respect to the actual width of the chip's registers, and you also won't have to separately account for the "last iteration" where you have not enough elements to fill a register, or at least it will be more convenient. It also has strided load and store as well as scatter and gather.
This is all I remember, there is probably more.
Yes it is. On x86, SSE is 128-bit, AVX is 256-bit and AVX-512 is 512-bit. The RISC-V V extension handles all vector lengths uniformly: vector add is the same instruction no matter the vector length.
What about a vector of 1 element?
Yes, no problem.
Machines with any size vector registers handle code specifying vector length of 1 (or 0!) no problem.
If you really want to make a machine with vector registers that hold only one element then that will work too, except for a handful of instructions that simply don't make sense in that case (unless you use the LMUL feature): vector permute register, slide up, slide down.
CPUs intended to run standard operating systems with shrink-wrapped software are constrained in the RVA22 profile to provide vector registers of at least 128 bits and no more than 65536 bits. But if you're doing some custom embedded CPU then you can make the vector registers the same size as the integer registers (32 or 64 bits). Note that if you do that, you can still usefully do vector operations on chars and shorts, and you can also set LMUL=8 to give you effectively four vector registers of 256 or 512 bits each (which might or might not be processed serially).
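The LMUL arithmetic in that last sentence can be checked with a few lines of Python. This is just the bookkeeping, assuming the 32 architectural vector registers of the V extension; the function names are mine.

```python
# Register-grouping arithmetic for the V extension (illustrative only).
NUM_VECTOR_REGISTERS = 32  # architectural vector registers in RVV


def grouped(vlen_bits, lmul):
    """Return (number of register groups, bits per group) for a given VLEN and LMUL."""
    return NUM_VECTOR_REGISTERS // lmul, vlen_bits * lmul


def elements_per_group(vlen_bits, lmul, sew_bits):
    """How many elements of width sew_bits fit in one register group."""
    return (vlen_bits * lmul) // sew_bits
```

So with 32-bit vector registers and LMUL=8, `grouped(32, 8)` gives 4 groups of 256 bits, and with 64-bit registers it gives 4 groups of 512 bits. Even at LMUL=1, a 32-bit register still holds four chars or two shorts.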
Oh boy, give it a few more years and the RISC-V architecture is going to have as many extensions as XMPP! Yay for interoperability!
Was on a call ahead of the RISC-V Summit last night where the topic came up.
Not to name drop but here's what David Patterson had to say (he's vice chair of RISC-V BoD among other things).
"One of brilliant features of RISC-v is modularity. Everyone wants an ecosystem that is adaptable but runs standard software. Defining profiles and platforms is the next thing on their slate. Binary compatibility is not the overwhelming thing in the SoC world that it was with microprocessors. Flexibility is one of the various attractive features of RISC-V."
The idea with profiles is that you create groupings of modules aimed at a specific use case.
So, yes, there needs to be some balancing of flexibility and compatibility/interoperability and there are concerns around this. (One of the processor analysts brought this up.) But people are aware and thinking about it.
When they say RV64GC, the C is compressed while the G is short for I, M, A, F, D, Zicsr, and Zifencei.
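As a small illustration of that shorthand, this Python sketch expands a name like RV64GC into its parts. It only handles the simple "RV32/RV64 plus single letters" form, and `expand_isa` and `G_EXPANSION` are names I picked for the example.

```python
import re

# What the "G" shorthand stands for.
G_EXPANSION = ["I", "M", "A", "F", "D", "Zicsr", "Zifencei"]


def expand_isa(name):
    """Expand e.g. 'RV64GC' into (xlen, [extensions]); single-letter names only."""
    match = re.fullmatch(r"RV(32|64)([A-Z]+)", name)
    if match is None:
        raise ValueError(f"unsupported ISA string: {name!r}")
    xlen = int(match.group(1))
    extensions = []
    for letter in match.group(2):
        # "G" expands to several extensions; every other letter stands for itself.
        for ext in (G_EXPANSION if letter == "G" else [letter]):
            if ext not in extensions:
                extensions.append(ext)
    return xlen, extensions
```

For example, `expand_isa("RV64GC")` yields a 64-bit base with I, M, A, F, D, Zicsr, Zifencei and C.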
ARM does something similar. They have TONS of extensions, but then group them into 8.0, 8.1, 8.2, etc then also group them with the A, R, and M designators too.
My memory is failing me... Does the scalar cryptography extension include the bitwise manipulation (rotations, etc.), or is that a separate spec?
You may be interested in the just-ratified "RISC-V Bit-Manipulation ISA-extensions"
https://github.com/riscv/riscv-bitmanip/releases/download/1....
There is some overlap. There's the "Zbkb" (horrible name, I know) extension which contains a subset of instructions from the larger bitmanip extensions which are very useful for cryptography.
The more general bitmanip extensions contain other things useful for e.g. address arithmetic. These are somewhat orthogonal to scalar crypto.
I have some mixed feelings about most of these. As Jim Keller said, "most of the performance comes from just six instructions and RISC-V has all of those". Adding more instructions will cost area, power, design & verification time, all of which could go to making the existing code go faster.
The beauty of a modular instruction set architecture like RISCV's is that you _don't_ have to implement all of it, only the extensions that make sense for your use case.
Aside, Keller's quote is probably partly in jest. If you are in a constrained micro-controller environment, something like the Zfinx extension is probably helpful beyond the "just six instructions" for code density. If you are crypto heavy, the crypto extensions are going to be more helpful than "just six instructions". If your workload is parallelisable and regular, vectorisation helps you more than "just six instructions", and so on.
One size doesn't fit all.
I think you misunderstood what he said and I know he wasn't joking, but I didn't point out that the implied context was for Tenstorrent's usage, thus data center. He didn't mean that you just need six instructions (e.g. a Turing tarpit), he meant (and he's right) that the bulk of [integer] performance comes from a very small set of instructions, most critically loads and conditional branches.
All of the discussed extensions help _specific_ workloads, but unless your workload is, say, 100% encryption all the time, the crypto extension will only provide a trivial improvement in the _overall_ performance.
Vector is a little bit different, but it (like AVX2/512) comes at a _very_ significant cost and you better have software that can take advantage of it.
The whole point of RISC-V is to be a universal architecture used for everything. The idea is to have profiles for different verticals and applications. In these profiles you define what extensions you need.
If there is really a significant win for a certain type of server workloads, that community will make its own profile and hopefully be able to get chips that utilize that.
The problem is that there are also many mixed workloads, and having lots of general compute can work pretty well if you want to run a broad set of extensions.
RISC-V is sort of a fluid spectrum from highly specialized to highly general depending on the use case.
How does the modular instruction set work? If someone proposes an extension, is the onus on them to also provide a minimal RISCV implementation of that functionality? Or is it just accepted that some binaries won't work on all devices?
It's best to think of RISCV not as a single ISA (= instruction set architecture), but a parametric ISA. The extensions are parameters.
RISCV offers lots of official extensions to choose from, such as M, A, F, D, P, V, ....
In addition you have the 32 vs 64 bit data width parameter. Any specific ISA has to instantiate those parameters, e.g. _RISCV32MFP_ or _RISCV64MAF_.
Any implementation of e.g. _RISCV64MAF_ will have to implement in silicon exactly those assembly commands (and supporting features) that the M, A and F extensions demand, with 64 bit register width.
Like in OO-programming the class constructors take arguments that parameterise the created object.
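Taking the OO analogy literally, a rough Python sketch might look like this. The class `Isa`, the helper `make_isa`, the method `can_run` and the ordering string are all made up for illustration, and the naming follows the informal _RISCV64MAF_ style used above rather than the official RV64... strings.

```python
from dataclasses import dataclass

# Order used when printing extension letters, taken from the list M, A, F, D, P, V.
EXTENSION_ORDER = "MAFDPV"


@dataclass(frozen=True)
class Isa:
    """A concrete ISA: the parametric ISA with its parameters filled in."""
    xlen: int               # 32 or 64 bit data width
    extensions: frozenset   # chosen extension letters

    def name(self):
        letters = "".join(e for e in EXTENSION_ORDER if e in self.extensions)
        return f"RISCV{self.xlen}{letters}"

    def can_run(self, required):
        """True if code built for `required` only uses what this ISA provides."""
        return self.xlen == required.xlen and required.extensions <= self.extensions


def make_isa(xlen, letters):
    """Convenience constructor: make_isa(64, 'MAF')."""
    return Isa(xlen, frozenset(letters))
```

In this model `make_isa(64, "MAF")` can run code built for `make_isa(64, "M")`, but not the other way round, which is the "some binaries won't work on all devices" situation in miniature.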
------
Regarding an implementation, given that RISCV is an ISA, not an ISA implementation, you need to provide a functional model. The official standard is [1] but it's a bit behind the ratified extensions. For example [2] defines the (ISA-visible) registers, while [3] gives you the instruction decoding and execution clauses for the base instruction set. [4] describes part of one of the available address translation modes (for the 32 bit variant of the ISA). Note: in modern processors page-table walks are hardware accelerated, so OS and processor need to use the same format here, which is why this is part of the ISA.
[1]
https://github.com/riscv/sail-riscv/tree/master/model
[2]
https://github.com/riscv/sail-riscv/blob/master/model/riscv_...
[3]
https://github.com/riscv/sail-riscv/blob/master/model/riscv_...
[4]
https://github.com/riscv/sail-riscv/blob/master/model/riscv_...
The way it works is that there are profiles. The idea behind profiles is that different use cases define profiles with the instruction extensions they require, which ones are optional, and so on.
So the major Linux distros agree on a set of instructions and that's called a profile. Same for embedded and others eventually.
You can add your own extensions for yourself if you want. You can also make extensions and try to make them a pseudo-standard. Or you can attempt to make one into a standard extension.
To be a standard extension it has to go through a long process, and it will likely be taped out multiple times before it is ever ratified. Once it's ratified it will find its way into profiles.
So for example standard Linux distros now use RV64GC, likely the next version of the Linux profile will include more of the new instructions.
But yes, the goal is not to create a 'universal binary', but a reasonable compromise between reuse and specialization.
> Or is it just accepted that some binaries won't work on all devices?
Yes. Just like you cannot run Pentium code on a 386 because they added new extensions. Or how Scheme isn't really a programming language but more like a _family_ of very nearly compatible languages. RISCV has multiple targets, and they have very different needs, from embedded automotive to desktop. But with a common core it is easier to develop and share tooling.
I have no idea for RISC-V specifically, but x86/amd64 have a lot of optional instructions (I'm mostly aware of vector stuff like SSE and AVX, but I'm sure there is other stuff).
On the programming side, you can detect at runtime feature support and use specific code path accordingly, or decide at compile time that you require a specific CPU feature and then your binary will just not work on CPUs without the feature.
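The runtime-detection option can be sketched like this in Python. It is a simplified illustration: the feature set is passed in as a plain set of strings (how you actually query the CPU is platform specific and not shown), and `popcount_with_bitmanip` is only a stand-in for a code path that would be compiled to use a hardware population-count instruction such as the one in Zbb. All names are hypothetical.

```python
# Sketch of runtime dispatch on CPU features; feature detection itself is not shown.

def popcount_generic(x):
    """Portable fallback: clear the lowest set bit until none remain."""
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count


def popcount_with_bitmanip(x):
    """Stand-in for a path that would use a hardware popcount instruction."""
    return bin(x).count("1")


# Most specialised first; the empty requirement set always matches.
IMPLEMENTATIONS = [
    (frozenset({"zbb"}), popcount_with_bitmanip),
    (frozenset(), popcount_generic),
]


def select_popcount(cpu_features):
    """Pick the first implementation whose required features are all present."""
    for required, impl in IMPLEMENTATIONS:
        if required <= cpu_features:
            return impl
    raise RuntimeError("no usable implementation")
```

A program would call `select_popcount` once at startup and keep the returned function. The compile-time option corresponds to shipping only one of the two functions.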
> One size doesn't fit all.
True, but a standard that is too malleable isn't really a standard at all.
If you're building a chip for a server, workstation, laptop, smartphone, then you'll want to adhere to a platform spec profile.
RVA22[0] is the first such profile, and among other important things which go a long way to ease cross-vendor software compatibility, it does require RVA22U and RVA22S, which in turn require a set of extensions.
[0]:
https://github.com/riscv/riscv-platform-specs/blob/main/risc...
In probably any “open” ISA, vendors/manufacturers are likely to “fork it” and show up with their own extensions anyway. By embracing extensions as a first-class concept, it would seem RISC-V is trying to embrace variance rather than to repeat the mistakes of architectures like amd64 (which has multiple “microarchitecture levels” and only the lowest level is truly portable).
To a certain extent yes they embrace variance but to a certain extent they don't.
The idea is that what dominates is software. If you add your own extensions, literally all software in the world won't support them. You will need to provide a huge amount of stuff to fully take advantage of them.
The availability of software, both open and commercial, on top of standardized profiles should be what manufacturers target.
Early on, of course, manufacturers have provided things that are not standard yet. However, over time, does it really make sense to supply your own bit manipulation extension? As the standard grows, the vast majority of applications should not require, or be really improved by, proprietary extensions.
Of course, if somebody comes along and makes a chip with some extensions that is just vastly better than what anybody else has, that could break that paradigm and people might embrace it.
A fairly high proportion of extensions (both existing, and simply possible in future in general) are so specialised that you wrap the special instructions inside a function (often within a loop inside that function) and then put that function in a library.
You just choose whether to use that version of the library or another one that uses normal instructions.
It's no exaggeration to say that many of those extension instructions might exist in only one function in one library on your entire Linux (or Android, FreeBSD, whatever) system.
To some extent the Vector extension can be like that. For most programs they'll just pick up vectorised versions of memcpy, strlen and so forth. In other programs (generally ones you compile yourself) you might want to use the vector extension directly -- maybe with auto-vectorisation in time. LLVM can do a bit of that already.
Only a few of the extensions have instructions that can profitably weave their way into every part of your code. The Bitmanip extension is like that. You _really_ want to know whether your target processor has B or not.
These particular extensions come across as "long-tail" things that are probably worth standardizing, IMO. Not every core needs cryptographic acceleration, but the ones that do need it tend to _really_ need it for those cases. Similarly if you need hypervisor mode support, there are basically no alternatives to just having it, and it requires enough software support to the point you probably have to standardize it, if there's any hope of it working. There's also the advantage that these give a baseline for vendors and software to target instead of rolling their own, within sensibility (though they may choose not to). Some of the other drafted extensions not mentioned here are perhaps more questionable...
All three of these are complex enough to definitively increase the design/verification time for any core that implements them, though, that's for sure. (A net effect of this is that while there are tons of simple in-order cores, actual "production" RISC-V cores with features like this will remain rare...)
Your argument is general, but these are very specific areas being served.
* RISC-V Vector instructions seem like a huge win for all forms of HPC. x86 is getting vector instructions & the wins have been immense. Rather than a wide range of specific SIMD instructions, vector instructions seem like a far more general & easier to scale up & down implementation strategy. Not everyone has to implement!
* RISC-V Hypervisor specifications seem required for modern computing, where VM's are commonplace. Have to have this specification. Not everyone has to implement!
* RISC-V Scalar Cryptography specifications providing accelerated cryptography seem like another must-have in modern data centers.
Worth re-iterating what's been said already: extensions are just that: extensions. They're not required. I'm not sure what the current state is of code detecting & using the accelerated implementation when available, using soft-fallbacks otherwise. For things like cryptography, usually it's a library, openssl or someone, where the library is the reference implementation, with special paths written in for using hardware where available.