💾 Archived View for dioskouroi.xyz › thread › 29388213 captured on 2021-11-30 at 20:18:30. Gemini links have been rewritten to link to archived content.

Open source RISC-V GPGPU

Author: 1ntEgr8

Score: 205

Comments: 54

Date: 2021-11-30 02:35:44

Web Link

________________________________________________________________________________

raphlinus wrote at 2021-11-30 04:36:23:

This is a research project from Georgia Tech. There's a homepage at [0] and a paper at [1]. It is specialized to run OpenCL, but with a bit of support for the graphics pipeline, mostly a texture fetch instruction. It appears to be fairly vanilla RISC-V overall, with a small number of additional instructions to support GPU. I'm very happy to see this kind of thing, as I think there's a lot of design space to explore, and it's great that some of that is happening in academic spaces.

[0]:

https://vortex.cc.gatech.edu/

[1]:

https://vortex.cc.gatech.edu/publications/vortex_micro21_fin...

hajile wrote at 2021-11-30 16:58:19:

Intel's Larrabee/Xeon Phi shows that there's a ton of potential here.

Intel's big issue is that x86 is incredibly inefficient. Implementing the base instruction set is very difficult, and trying to speed it up at all starts drastically increasing core size. This means the ratio of overhead to SIMD is pretty high.

RISC-V excels at tiny implementations and power efficiency. The ratio of SIMD to the rest of the core should be much higher resulting in overall better efficiency.

The final design (at a high level) seems somewhat similar to AMD's RDNA, with a scalar ALU handling flow control while a very wide SIMD unit does the bulk of the calculations.

unsigner wrote at 2021-11-30 07:12:55:

We should really have another word for “chip that runs OpenCL but has no rasterizer”.

I see the title was edited to call it a “GPGPU”, or a “general-purpose GPU”, but that’s not really a thing; GPGPU was an early moniker from when people first tried to do non-graphics work on GPUs many years ago, and it was a word for techniques, never for a specific type of hardware. Plus, it feels to me that “general purpose” should be something more than a GPU, while this is strictly less.

raphlinus wrote at 2021-11-30 16:52:59:

I don't really agree. I think it's completely valid to explore a GPU architecture in which rasterization is done in software, with perhaps a bit of support in the ISA. That's what they've done here, and they do demonstrate running OpenGL ES.

The value of this approach depends on the workload. If it's mostly rasterizing large triangles with simple shaders, then a hardware rasterizer buys you a lot. However, as triangles get smaller, a pure software rasterizer can win (as demonstrated by Nanite). And as you spend more time in shaders, the relative amount of overhead from software rasterization decreases; this was shown in the cudaraster paper[1].

Overall, if we can get simpler hardware with more of a focus on compute power, I think that's a good thing, and I think it's completely fine to call that a GPU.

[1]:

https://research.nvidia.com/publication/high-performance-sof...
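For anyone wondering what "rasterization in software" means concretely, here is a minimal edge-function triangle fill in C (a textbook sketch, not code from cudaraster or Vortex):

```
#include <stdint.h>

/* Half-space (edge-function) rasterizer: fill pixels whose center lies
   inside all three triangle edges. Purely illustrative. */
static int64_t edge(int ax, int ay, int bx, int by, int px, int py) {
    return (int64_t)(bx - ax) * (py - ay) - (int64_t)(by - ay) * (px - ax);
}

void fill_triangle(uint32_t *fb, int width, int height,
                   int x0, int y0, int x1, int y1, int x2, int y2,
                   uint32_t color) {
    /* Bounding box, clamped to the framebuffer. */
    int minx = x0 < x1 ? (x0 < x2 ? x0 : x2) : (x1 < x2 ? x1 : x2);
    int maxx = x0 > x1 ? (x0 > x2 ? x0 : x2) : (x1 > x2 ? x1 : x2);
    int miny = y0 < y1 ? (y0 < y2 ? y0 : y2) : (y1 < y2 ? y1 : y2);
    int maxy = y0 > y1 ? (y0 > y2 ? y0 : y2) : (y1 > y2 ? y1 : y2);
    if (minx < 0) minx = 0;
    if (miny < 0) miny = 0;
    if (maxx >= width)  maxx = width - 1;
    if (maxy >= height) maxy = height - 1;

    for (int y = miny; y <= maxy; y++) {
        for (int x = minx; x <= maxx; x++) {
            /* All three edge functions non-negative => inside
               (for one winding order; flip the test for the other). */
            if (edge(x0, y0, x1, y1, x, y) >= 0 &&
                edge(x1, y1, x2, y2, x, y) >= 0 &&
                edge(x2, y2, x0, y0, x, y) >= 0)
                fb[y * width + x] = color;
        }
    }
}
```

The inner test is exactly the kind of uniform, data-parallel work a compute-focused core is good at, which is why software rasterization becomes attractive as shaders dominate.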

zbendefy wrote at 2021-11-30 16:58:37:

OpenCL has a category called CL_DEVICE_TYPE_ACCELERATOR for that, so something like 'Accelerator' seems to fit.
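For what it's worth, that's the device type you pass to clGetDeviceIDs when enumerating such devices; a minimal C sketch (error handling mostly elided):

```
#include <CL/cl.h>
#include <stdio.h>

/* List OpenCL devices that report themselves as accelerators
   (CL_DEVICE_TYPE_ACCELERATOR) rather than CPUs or GPUs. */
int main(void) {
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);

    for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id devices[8];
        cl_uint ndev = 0;
        /* Returns CL_DEVICE_NOT_FOUND if the platform has no accelerators. */
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ACCELERATOR,
                           8, devices, &ndev) != CL_SUCCESS)
            continue;
        for (cl_uint d = 0; d < ndev; d++) {
            char name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof name, name, NULL);
            printf("accelerator: %s\n", name);
        }
    }
    return 0;
}
```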

avianes wrote at 2021-11-30 10:49:10:

That term is "SIMT architecture."

Modern GPUs (or GPGPUs) are based on the SIMT programming model, which requires an SIMT architecture.

zozbot234 wrote at 2021-11-30 16:08:51:

"SIMT" is not an architecture, it's just a programming model that ultimately boils down to wide SIMD instructions with conditional execution. Add that to a barrel processor that can hide memory latency across a sizeable amount of hardware threads, and you've got the basics of a GPU "core".

avianes wrote at 2021-11-30 16:48:24:

SIMT is a programming model, you are right.

But in the literature the term "SIMT architecture" is used to describe architectures optimized for the SIMT programming model.

Just search for "SIMT architecture" on Google Scholar or any other search engine dedicated to academic research; you will see that it is indeed a term used for this kind of architecture.

nine_k wrote at 2021-11-30 07:19:00:

Vector processors? Follow the early Cray nomenclature.

avianes wrote at 2021-11-30 11:01:55:

The terminology "vector processor" refers to a completely different type of architecture.

Using it for SIMT architectures would be confusing.

dahart wrote at 2021-11-30 15:03:52:

What definition of vector processor are you thinking of? Wikipedia’s definition appears to agree with the parent, and even states “Modern graphics processing units […] can be considered vector processors”

https://en.wikipedia.org/wiki/Vector_processor

avianes wrote at 2021-11-30 16:43:00:

Yes, the definition matches.

But that doesn't mean that the architecture and the micro-architecture used are similar.

So... yes! We can say that these architectures are some kind of "vector processor", but that is ambiguous with regard to the programming model and the architecture used.

dahart wrote at 2021-11-30 17:06:09:

I’m interested to hear what you mean by “vector processor”. What does that imply to the lay person, and how is it different enough to be confusing when applied to GPUs? What does the term imply to you in terms of architecture?

avianes wrote at 2021-11-30 19:08:31:

The term "vector processor" generally refers to a processor with a traditional programming model, but which features a "vector" unit capable of performing operations on large vectors of fixed or variable size. It can occupy the vector unit for a significant amount of cycles.

The RISC-V Vector extension is a good example of what makes a vector processor.
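As a rough sketch of that style in C (assuming the v1.0 __riscv_-prefixed RVV intrinsics and <riscv_vector.h>; illustrative only, not code from this project):

```
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Vector add in the RVV style: each iteration asks the hardware how many
   elements it will process (vsetvl), so the same code runs on any
   implementation's vector length. */
void vec_add(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);           /* hardware-chosen length */
        vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);  /* load vl elements */
        vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
        vint32m1_t vc = __riscv_vadd_vv_i32m1(va, vb, vl);
        __riscv_vse32_v_i32m1(dst, vc, vl);            /* store vl elements */
        a += vl; b += vl; dst += vl; n -= vl;
    }
}
```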

However, and this is a source of confusion, the standard definition is abstract enough that many other architectures can also be called a "vector processor".

Regarding modern GPGPUs (with an SIMT architecture), we are dealing with a programming model named SIMT (Single-Instruction, Multiple-Thread) in which the programmer must take into account that there is one piece of code for multiple threads (a block of threads); each instruction will be executed by several "cores" simultaneously.

This has implications: the hardware has a limited number of "cores", so it must split the block of threads into sub-blocks called wraps (1 wrap = 32 threads on Nvidia machines).

When we offload compute to a GPU, we send it several blocks of wraps. All the wraps are executed progressively; the GPU scheduler's job is to pick a ready wrap, execute one instruction from it, then pick a new wrap and repeat.

This means that wraps from a block have the ability to get out of sync. With a classical vector processor this kind of situation is not possible (or not architecturally visible); it is not possible for a portion of the vector to be, say, 5 instructions ahead.

Therefore, GPUs include instructions to resynchronize the wraps of a group, while vector processors don't need this. But it also means that you expose much more unintentional dependency between data with a vector processor.
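To make that scheduling loop concrete, here is a toy C model of per-warp state and a round-robin issue loop (the structures are hypothetical and nothing like real hardware, but they show why warps can drift apart):

```
#include <stdbool.h>
#include <stdint.h>

#define NUM_WARPS 8

/* Hypothetical per-warp state: one shared program counter and one
   active-lane mask per warp. Different warps sit at different PCs,
   which is exactly the "out of sync" situation described above. */
typedef struct {
    uint32_t pc;          /* shared by all lanes of the warp */
    uint32_t active_mask; /* which lanes execute the instruction at pc */
    bool     ready;       /* false while stalled, e.g. on a memory load */
} warp_t;

/* One round of a round-robin warp scheduler: issue a single instruction
   from each ready warp. execute_one() stands in for the pipeline and is
   expected to advance the warp's pc and possibly clear its ready flag. */
void schedule_round(warp_t warps[NUM_WARPS],
                    void (*execute_one)(warp_t *w)) {
    for (int w = 0; w < NUM_WARPS; w++) {
        if (warps[w].ready)
            execute_one(&warps[w]);
    }
}
```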

dahart wrote at 2021-11-30 22:27:55:

> the standard definition is abstract enough

It seems like you’re making the case that the term “vector processor” should be interpreted as something general, and not something specific? Since the Cray vector processor predates RISC-V by ~35 years, isn’t the suggestion above to use it the way Cray did fairly reasonable? It doesn’t seem like it’s really adding much confusion to include GPUs under this already existing umbrella term...

> With a classical vector processor […] it is not possible for a portion of the vector to be 5 instructions ahead

Just curious here, the WP article talks about how one difference between “vector” processing and SIMD is that vector computers are about variable length vectors by design, where SIMD vectors are usually fixed length. How does that square up with what you’re saying about not having any divergence?

This feels like it’s comparing apples to oranges a little… a SIMT machine has different units ahead of others because they’re basically mini independent co-processors. If you have a true vector processor according to your definition, but simply put several of them together, then you would end up with one being ahead of the others. That’s all a modern GPU SIMT machine is: multiple vector processors on one chip, right? It seems like time and scale and Moore’s Law would inevitably have turned vector processors into a machine that can handle divergent and/or independent blocks of execution.

BTW, not sure if it was just auto-correct, but you mean “warp” and not “wrap”, right?

avianes wrote at 2021-12-01 00:11:10:

> BTW, not sure if it was just auto-correct, but you mean “warp” and not “wrap”, right?

Oh, sorry I totally meant "warp" not "wrap", I don't know how I introduced that typo.

> It seems like you’re making the case that the term “vector processor” should be interpreted as something general, and not something specific?

Not exactly. I am in favor of using a specific term, and in particular keeping the use of the term "vector processor" for machines similar to the Cray ones. But I admit that the term is used in a more abstract way. For instance, the Intel AVX extension stands for "Advanced Vector Extensions" while it is definitely a SIMD extension.

Computer architecture lacks accurate/strict definitions, probably because there are often many possible implementations of the same idea. So we sometimes find ourselves using words that are a bit disconnected from their original idea.

The architectures that Cray's engineers came up with don't have much to do with modern SIMT architectures. That's why I find it confusing.

> vector computers are about variable length vectors by design, where SIMD vectors are usually fixed length. How does that square up with what you’re saying about not having any divergence?

Not sure if I understood the question correctly.

But after execution of a vector or SIMD instruction, the vector or SIMD register is seen as containing the outcome of the operation; it's not possible to observe a temporary or old value in the register because part of it has not been processed yet. With an SIMT programming model and architecture, that is possible if we omit synchronization.

This is a very clear difference in observable architectural states.

> If you have a true vector processor according to your definition, but simply put several of them together, then you would end up with one being ahead of the others.

Of course you can reproduce a model similar to SIMT with a lot of vector or scalar processors by changing the programming model and the architecture significantly.

But then it seems reasonable to me to call that an SIMT programming model & architecture

> That’s all a modern GPU SIMT machine is: multiple vector processors on one chip, right?

Sort of... But splitting the compute into groups and warps is not negligible; it implies big differences in the architecture, the uarch, the design, and the programming model.

So it makes sense to give a different name when there are many significant changes.

eqvinox wrote at 2021-11-30 07:36:05:

VPU? With network processors being called NPU these days...

zmix wrote at 2021-11-30 08:07:47:

> VPU?

Already taken: _Video Processing Unit_.

techdragon wrote at 2021-11-30 10:11:30:

What’s your point? NPU was mentioned earlier, and I routinely see NPU used as an acronym for “neural processing unit” in modern embedded hardware that includes, either on-chip or on-module, hardware for accelerating neural networks in various edge-computing applications of machine learning models.

I’m pretty sure “vector processing unit” predates “video processing unit”, given that people were developing vector processing hardware for high-performance computing before even the Amiga was released, which is the earliest thing I can think of that had serious video-related hardware. I leave room for someone to have called their text-mode display driver chip a “video processing unit”, but I don’t think it would have been common, given that the nominal terminology at the time was to call the screen a “display” and the hardware a “display adaptor” or “display adapter” … at least in my experience, which I admit is limited since I didn’t live through it and merely learned about it after the fact through an interest in retro computing.

bullen wrote at 2021-11-30 06:22:08:

I think there is another project doing a RISC-V GPU:

https://www.pixilica.com/graphics

Also, the recently announced RVB-ICE should have an OpenGL ES 3+ capable Vivante GC8000UL GPU (I did not manage to find documentation for this exact variant, but all of the GC8000 family seem to support it):

https://www.aliexpress.com/item/1005003395978459.html

Disclaimer: expensive, especially if you don't know whether it's vapourware or how well the drivers and Linux support work!

akmittal wrote at 2021-11-30 03:27:05:

It's great to see RISC-V making a lot of progress.

A lot of research is coming from China because of US bans, but hopefully this will be good for the whole world.

zucker42 wrote at 2021-11-30 03:53:37:

Which U.S. bans are you talking about? Is there anywhere I can read more about this?

bee_rider wrote at 2021-11-30 04:09:29:

We occasionally ban companies that make HPC parts (Intel, NVIDIA, AMD) from selling to Chinese research centers, generally citing concerns that they could be used for weapons R&D (nuclear weapons simulation for example).

2015:

https://spectrum.ieee.org/us-blacklisting-of-chinas-supercom...

2019:

https://www.yahoo.com/now/trump-bans-more-chinese-tech-21140...

2021:

https://www.bloomberg.com/news/articles/2021-04-08/u-s-adds-...

monocasa wrote at 2021-11-30 04:20:52:

And then China made its own domestic supercomputing cluster that topped the charts when it came online.

jpgvm wrote at 2021-11-30 08:27:11:

Yeah. People forget that capital allocation works differently in China. You ban something they want, they simply make it themselves.

I think banning their access to EUV lithography via the export ban on ASML machines is going to backfire horribly on the US. China has started allocating absolutely ridiculous amounts of money to hard science and has also changed the way it funds and prioritizes projects to make them more commercially targeted. The end result is that they now have a very large number of very smart scientists with nearly infinite funding being told to solve the rest of the fab pipeline.

ASML might retain their monopoly for a few more years, but I think this move will eventually result in the Chinese building even better machines, likely more practical ones and for less money - as is the Chinese way.

phkahler wrote at 2021-11-30 15:18:04:

>> ASML might retain their monopoly for a few more years but I think this move will eventually result in Chinese building even better machines and likely more practical and for less money - as is the Chinese way.

The complexity of ASML's EUV light sources is incredible. China might just use a synchrotron for that instead. Sure, there are issues to resolve, but it seems like it has to be simpler in the end.

sitkack wrote at 2021-11-30 13:53:22:

I think the bans are strategic in that we need a capable foe. By banning the right things, it ensures that they are at parity with us.

A smart colonial power doesn't cut off access 100% but rations and controls access to resources.

I can't wait to buy a refrigerator sized fab on alibaba for 25k in five years.

hajile wrote at 2021-11-30 17:08:52:

Those cores are 28nm and I believe they were made on TSMC.

Today, the manufacturing ban has China still on 28nm, or maybe 22nm, as its most advanced node.

They had some old 14nm stuff, but last I heard, the guys that owned it greatly over-promised to the Chinese Government.

China has no modern fabrication process and no way forward toward developing one. At present, they are headed for a state of being perpetually five nodes (one decade) behind.

cyounkins wrote at 2021-11-30 04:58:12:

I hadn't heard about this:

https://en.wikipedia.org/wiki/Sunway_TaihuLight

lincpa wrote at 2021-11-30 04:47:48:

And then made its own computer architecture. Its mathematical prototype is the simple, classic, vivid elementary-school mathematics of the "water inflow/outflow of a pool" problem, widely used in social production practice. My theory rebuilds the theoretical foundation of the IT industry; it relates the computer theory system fully and perfectly to mathematics in a simple and unified way, from hardware (integrated circuits and computer architecture) to software (programming methodology, architecture, programming languages, and so on). It solves the most fundamental and core problem of the IT industry: the foundation and core of IT theory lack mathematical support.

It will surely replace the "von Neumann architecture" and become the foremost architecture in the computer field, and it is the first architecture to unify software and hardware. Because the "von Neumann architecture" lacks the support of a mathematical model, its scientific validity cannot be proven.

It has a wide range of applications, from SOC to supercomputer, from software to hardware, from stand-alone to network, from application layer to system layer, from single thread to distributed & heterogeneous parallel computing, from general programming to explainable AI, from manufacturing industry to IT industry, from energy to finance, from cash flow to medical angiography, from myth to Transformers, from the missile's "Fire-and-Forget" technology to Boeing aircraft pulse production line technology.

The Math-based Grand Unified Programming Theory: The Pure Function Pipeline Data Flow with Principle-based Warehouse/Workshop Model

https://github.com/linpengcheng/PurefunctionPipelineDataflow

grawlinson wrote at 2021-11-30 03:59:41:

It'll probably be something like this[0] and this[1]. I think there are more export restrictions than these two examples.

[0]:

https://en.wikipedia.org/wiki/Export_of_cryptography_from_th...

[1]:

https://edition.cnn.com/2020/12/18/tech/smic-us-sanctions-in...

zackmorris wrote at 2021-11-30 16:39:07:

I want the opposite of this - a multicore CPU that runs on GPU or FPGA. Vortex looks really cool, but if they jump over a level of abstraction by only offering an OpenCL interface instead of access to the underlying cores, then I'm afraid I'm not interested.

I just need a chip that can run at least 256 streams of execution, each with their own local memory (virtualized to appear contiguous). This would initially be for running something like Docker, but would eventually run a concurrent version of something like GNU Octave (Matlab), or languages like Julia that at least make an attempt to self-parallelize. If there is a way to do this with Vortex, I'm all ears.

I've gone into this at length in my previous comments. The problem is that everyone jumped on the SIMD bandwagon when what we really wanted was MIMD. SIMD limits us to a very narrow niche of problems, like neural nets and rasterization, and it prevents us from discovering the emergent behavior of large stochastic networks running things like genetic algorithms, or from using elegant/simple algorithms like ray tracing. That's not handwaving; I'm being very specific here, and I feel that this domination of the market by a handful of profit chasers like Nvidia has set computing back at least 20 years.

JonChesterfield wrote at 2021-11-30 21:48:27:

I think this is available now. The waves/wavefronts on a GPU run independently. Communication between them isn't great, independent is better.

Given chips from a couple of years ago have ~64 compute units, each running ~32 wavefronts, your 256 target looks fine. It's one block of contiguous memory, but using it as 256 separate blocks would work great.

I don't know of a ready made language targeting the GPU like that.

klelatti wrote at 2021-11-30 17:10:30:

I may be missing something here but what do you mean by a CPU that runs on a GPU?

Also how does "256 streams of execution, each with their own local memory (virtualized to appear contiguous)" differ in practice from one of the recent CPUs with lots of cores - e.g. AMD / AWS Arm?

ksec wrote at 2021-11-30 10:54:56:

Nice. Instead of trying to tackle the CPU space, RISC-V should really be doing more work in the GPGPU space with open source drivers.

Current GPUs are the biggest black box and mystery in modern computing.

sitkack wrote at 2021-11-30 14:20:25:

RISC-V (with no vectors) was the base ISA built to support the real goal of making vector processors; it was supposed to be a short side quest. It took much longer than expected, but was 1000% worth it.

RVV (RISC-V Vector Extension) is the real coup and ultimately what the base ISA is there to support.

https://youtu.be/V7fuE1yXUxk?t=104

https://www.youtube.com/watch?v=oTaOd8qr53U

GPUs might be complex beasts, but ultimately it is lots of FMAs (Fused Multiply-Adds) that do most of our calculations.

https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op...
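For reference, an FMA computes a*b + c with a single rounding; in C it's exposed as fma() from <math.h>, and the fused rounding is observable:

```
#include <math.h>
#include <stdio.h>

/* fma(a, b, c) rounds once, so it can recover the rounding error that a
   separate multiply-then-add throws away. */
int main(void) {
    double x = 1.0 + ldexp(1.0, -30);  /* 1 + 2^-30, exactly representable */
    double p = x * x;                  /* rounded product: loses the 2^-60 term */
    printf("separate: %.3g\n", x * x - p);     /* 0: error already lost   */
    printf("fused:    %.3g\n", fma(x, x, -p)); /* ~8.67e-19, i.e. 2^-60   */
    return 0;
}
```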

NotCamelCase wrote at 2021-11-30 14:35:38:

This is an amazing project considering the scope of work required on both sides of the aisle -- HW and SW.

I find the choice of RISC-V pretty interesting for this use case, as it's a fixed-size ISA and there is a significant amount of auxiliary data usually passed from drivers to HW in typical GPU settings, even for GPGPU scenarios alone. If you look at one of their papers, it shows how they pass extra texture parameters via CSRs. I think this might become a bottleneck and a limiting factor in the design for future expansions. I am currently doing similar work (>10x smaller in comparison) on a more limited feature set, so I am really curious how it'll turn out.
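As an aside, "passing parameters via CSRs" looks roughly like the sketch below from the software side; the CSR number here is a made-up placeholder in the custom CSR range, not Vortex's actual assignment:

```
#include <stdint.h>

/* Hypothetical sketch of handing auxiliary state to hardware through a
   CSR write (RISC-V only). 0x7C0 is just a placeholder in the custom
   CSR space; the real CSR map of any given design will differ. */
#define CSR_WRITE(csr, val) \
    __asm__ volatile ("csrw " #csr ", %0" : : "r"(val))

static inline void set_tex_base(uint32_t base_addr) {
    CSR_WRITE(0x7C0, base_addr); /* hypothetical "texture base address" CSR */
}
```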

zozbot234 wrote at 2021-11-30 16:11:09:

RISC-V is not "fixed size", the encoding has room for larger instructions (48-bit, 64-bit or more).

NotCamelCase wrote at 2021-11-30 19:24:39:

I guess you're referring to support for variable-length encodings? It's fixed in the sense that they only implement the RV32IMF subset here. Even then, code density may be a source of bottlenecks along the way.

pabs3 wrote at 2021-11-30 07:35:28:

OpenCL seems to be kind of dying (e.g. Blender abandoned it); I wonder what is going to replace it.

my123 wrote at 2021-11-30 07:45:46:

CUDA is what ended up replacing it, or rather, OpenCL had always failed to make a dent over the long term.

(with AMD ROCm being a CUDA API clone, without the PTX layer)

DeathArrow wrote at 2021-11-30 08:26:17:

But is there anyone using ROCm in production? Is ROCm up to par with CUDA?

my123 wrote at 2021-11-30 08:38:53:

No. It isn’t up to par. But that’s AMD’s problem.

No standard spec would solve a lack of software development investment of a hardware vendor, especially for a device as complex as a GPU.

(meanwhile, on the Intel side, oneAPI looks to be very serviceable, but has a problem for now: where is the fast hardware to run it on?)

d_tr wrote at 2021-11-30 16:14:43:

The two supported FPGA families are a blessing for this kind of project, since they have hardware floating-point units. Unfortunately they are quite expensive, like the Xilinx ones with this feature...

R0b0t1 wrote at 2021-11-30 15:27:59:

I've tried looking up the hardware they run on. Anyone have a price?

detaro wrote at 2021-11-30 15:32:13:

Expensive. The exact parts aren't clear, but from a quick look it's hundreds of dollars for a single chip and $5k+ for a devkit.

But running on an FPGA is really only the testing stage before putting it in an ASIC, if something like this wants to be competitive in any way.

R0b0t1 wrote at 2021-11-30 15:32:47:

I thought so. Hundreds for the chip isn't insane (depending on how many) but $5k for the dev kit, oof.

detaro wrote at 2021-11-30 15:36:20:

Yeah. From all I know FPGA pricing is very weird in that prices for singles are _way_ worse than if you buy a lot, even more so than for other chips.

chalcolithic wrote at 2021-11-30 08:32:33:

Wow! Just add NaN-boxing support (for JavaScript and possibly other dynamic languages) and it'll be the CPU I dreamed about.

sitkack wrote at 2021-11-30 14:23:38:

For those unfamiliar with NaN-boxing:

https://anniecherkaev.com/the-secret-life-of-nan

> One use is NaN-boxing, which is where you stick all the other non-floating point values in a language + their type information into the payload of NaNs. It’s a beautiful hack.
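A minimal C sketch of the trick (the tag layout here is invented for illustration; real engines such as SpiderMonkey and JavaScriptCore each use their own scheme):

```
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* NaN-boxing sketch: a double whose top 13 bits are all ones is a quiet
   NaN (sign + 11-bit exponent + quiet bit), so the remaining payload
   bits can hide a 48-bit pointer or a small tagged value. */
#define NANBOX_MASK  0xFFF8000000000000ull  /* quiet-NaN prefix we reserve */

static int is_plain_double(uint64_t bits) {
    /* Anything that doesn't carry our reserved NaN prefix is a number. */
    return (bits & NANBOX_MASK) != NANBOX_MASK;
}

static uint64_t box_pointer(void *p) {
    return NANBOX_MASK | (uint64_t)(uintptr_t)p;  /* assumes 48-bit pointers */
}

static void *unbox_pointer(uint64_t bits) {
    return (void *)(uintptr_t)(bits & ~NANBOX_MASK);
}

static uint64_t box_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* type-pun without undefined behavior */
    assert(is_plain_double(bits));   /* real engines canonicalize NaNs here */
    return bits;
}
```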

nynx wrote at 2021-11-30 13:06:37:

It’s a GPU.

chalcolithic wrote at 2021-11-30 14:50:34:

Yes, and I wanted a GPU-style CPU that could handle all the tasks in the system, so that no host CPU is necessary.

throwaway81523 wrote at 2021-11-30 04:34:18:

A GPGPU in an FPGA. Interesting, but 100x slower than a commodity AMD or NVidia card.

fahadkhan wrote at 2021-11-30 04:44:43:

It's a research project. It's open source. FPGAs are often used for developing hardware. If it gets good enough for someone's use case, they will fab the chips.

gumby wrote at 2021-11-30 04:41:28:

Perfect way to prototype hardware.