
WebGPU computations performance in comparison to WebGL

Author: ArtWomb

Score: 142

Comments: 97

Date: 2021-12-01 13:38:18

Web Link

________________________________________________________________________________

pjmlp wrote at 2021-12-01 14:16:56:

We could have had compute shaders in WebGL, given that they are part of GL ES, but Chrome refused to support them, and since Chrome == Web, WebGPU is now the only way to get compute shaders.

https://www.khronos.org/registry/webgl/specs/latest/2.0-comp...

https://www.khronos.org/assets/uploads/developers/presentati...

https://bugs.chromium.org/p/chromium/issues/detail?id=113199...

greggman3 wrote at 2021-12-01 14:56:46:

That's a pretty uncharitable spin. The linked bug says in the first comment

> macOS' OpenGL implementation never supported compute shaders

which is a pretty valid reason to drop it. Nothing to do with Chrome. Further, OpenGL is dead; every graphics developer knows this. All the browser vendors are working on WebGPU. There is no reason to waste time implementing outdated things. There are finite resources. No other browser vendor announced any plans to support webgl-compute. So stop with the ridiculous spin.

pjmlp wrote at 2021-12-01 15:10:00:

Not at all; after all, Apple now supports WebGL 2.0 on top of Metal, and WebGL has always run on top of DirectX on Windows.

Nothing in the standard requires an OpenGL backend, only that the semantics are preserved; GL ES, Metal and DirectX 11 all have compute shader support.

OpenGL is not dead until Khronos comes up with an API that is actually usable without an EE degree on GPU design and shader compilers.

WebGPU is years away from becoming usable; even if it happens to be released during 2022, it will be a 1.0 MVP, years behind what Metal, Vulkan, DX 12, NVN and LibGNM are capable of in 2021.

> In order to reclaim code space in Chromium's installer that is needed by WebGPU, the webgl2-compute context must be removed.

I bet the Chrome team wouldn't bother with installer code space for other Google-critical features that are part of Project Fugu.

jonahrd wrote at 2021-12-01 18:39:09:

Disclaimer: I work on the team at Google responsible for the open source translator engine that runs under WebGL (ANGLE)

WebGL on Linux requires the OpenGL backend until eventually Vulkan is supported well enough.

Apple's WebGL2-on-Metal approach involves the same translator engine I work on, and was very much a collaborative effort between Google, Apple, and external contributors. Chrome will adopt this in the future after integration work is finished.

I can confirm that browser GPU programmers are definitely spread pretty thin just ensuring compatibility and reliability across all platforms

pjmlp wrote at 2021-12-02 08:13:04:

Unfortunately, to the outside world it looks like WebGL has been put on ice because WebGPU is going to sort everything out, someday.

No wonder the focus is on making streaming work for 3D rendering with current APIs on modern hardware.

greggman3 wrote at 2021-12-01 15:34:59:

> WebGPU is years away

It would be even further away if all the devs working on it were instead spending years on WebGL-Compute. It's not like you snap your fingers and it works. It takes years of cross platform work and testing. That time is better spent moving forward.

As further proof, the only names on the compute spec are Intel and one Google name. Compare to the WebGPU spec which clearly has buy in from all the browsers.

So if Chrome had shipped it you'd have the gallery of comments bitching about Chrome going too fast, implementing too many things no other browser plans to implement.

BTW: WebGL2 in Safari does not run on Metal (yet)

om2 wrote at 2021-12-01 18:37:36:

WebGL 2 in Safari does run on Metal. WebGL 1 also runs on Metal. Apple contributors added support to ANGLE to use a Metal back end and it’s enabled by default in the latest Safari.

ncmncm wrote at 2021-12-01 15:44:29:

Raph says Metal fails to support an important synchronization primitive that is therefore also not available in WebGPU, and limits performance in his font rendering pipeline.

vlovich123 wrote at 2021-12-01 21:22:25:

If it's not available in Metal, how would it be magically available in WebGL running on top of Metal?

ncmncm wrote at 2021-12-01 23:36:20:

Evidently it is also not available in WebGL. Like many other things.

pjmlp wrote at 2021-12-01 16:53:09:

Here is a tip: it doesn't need to be the same team.

greggman3 wrote at 2021-12-01 17:28:37:

Here's a fact: there is a limited number of developers, period. You seem to have magically conjured a team out of thin air.

pjmlp wrote at 2021-12-02 08:14:59:

On the contrary, if there were interest the team would have been ramped up; plenty of candidates in the games industry, period.

jms55 wrote at 2021-12-01 15:29:45:

> WebGPU is years away from becoming usable

As a counterpoint, I've been using WebGPU (through wgpu-rs) for the past 1.5 years. It's been a pleasure to use. For instance, here's the CPU-side code for a glow post-process shader using 4 render passes:

https://github.com/JMS55/sandbox/blob/master/src/glow_post_p...


pjmlp wrote at 2021-12-01 16:54:00:

It isn't in the browser, and WGSL is only halfway specified.

dangerbird2 wrote at 2021-12-02 01:35:21:

But as far as an OpenGL replacement on native goes, it absolutely is.

pjmlp wrote at 2021-12-02 08:15:38:

Middleware engines have already solved that problem for about 20 years now; there is no need for WebGPU outside the browser.

royjacobs wrote at 2021-12-02 10:44:07:

Not everyone wants to use existing engines. Or are you saying you want to embrace vendor lock-in in that regard?

pjmlp wrote at 2021-12-02 13:53:09:

Yes, definitely; that is the approach taken by AAA game studios.

A plugin framework for 3D APIs is probably the easiest part to implement in a game engine, especially when APIs like OpenGL and Vulkan already require implementing one due to extension spaghetti.

dahart wrote at 2021-12-01 15:40:17:

> OpenGL is not dead until Khronos comes up with an API that is actually usable without an EE degree on GPU design and shader compilers.

Why would this ever happen? It seems like there is nothing in the works, nothing on the horizon, and no demand for a higher-level, less-performant API to become a new standard. Even OpenGL itself has been getting lower level and more detailed ever since version 1. People can build/use a wrapper API or game engine or something else if they want easy. It seems weird to say this right after defending Apple’s use of Metal to implement WebGL. Apple’s moves to get rid of OpenGL from the Mac ecosystem are one of the strongest forces pushing OpenGL out.

badsectoracula wrote at 2021-12-01 17:42:49:

> Apple’s moves to get rid of OpenGL from the Mac ecosystem are one of the strongest forces pushing OpenGL out.

FWIW as someone who exclusively uses OpenGL for 3D graphics, this actually makes me push Apple out :-P

dahart wrote at 2021-12-01 17:49:50:

Oh yeah, same here. I loved my Mac right up to the point that I could no longer run any of my own code on it with reasonable effort.

pjmlp wrote at 2021-12-01 16:57:23:

On platforms that support OpenGL, it is the Python 2 of Khronos APIs.

Regarding Metal, indeed those that leave OpenGL are more likely to move into middleware than forcing Vulkan upon themselves.

Hence why Khronos started ANARI, as most visualisation products and CAD/CAM people couldn't care less that Vulkan in its present state exists.

anthk wrote at 2021-12-01 19:13:05:

Now Zink (GL over Vulkan) runs faster than OpenGL itself on supported platforms.

jabl wrote at 2021-12-01 21:55:53:

That's something of a sweeping generalization. Zink has managed to beat a native OpenGL driver in some particular benchmarks.

In many other benchmarks, it loses. That being said, it still manages decent performance, which is extremely impressive for a one man project using only the Vulkan interface. It wouldn't surprise me if it eventually becomes the default OpenGL driver in the open source driver stack (for HW capable enough to support a Vulkan driver, obviously).

jabl wrote at 2021-12-01 21:47:03:

Autodesk is using Vulkan for some products (including using MoltenVK for the Mac version).

As for using middleware, GPUs are vastly more capable and complicated today than 30 years ago when OpenGL 1 appeared. In most cases it makes sense to use a higher-level interface, specialized for the particular type of application you're writing, be it ANARI, a game engine, some scenegraph library, or whatever.

bsder wrote at 2021-12-01 21:37:18:

> OpenGL is not dead until Khronos comes up with an API that is actually usable without an EE degree on GPU design and shader compilers.

That isn't going to happen because everyone is, in fact, moving away from the idea of a "graphics API" altogether and simply allowing the compute systems to calculate everything.

See: the Nanite renderer from EPIC:

https://www.youtube.com/watch?v=eviSykqSUUw

To a first and second order approximation, no one cares about graphics that aren't related to games.

BiteCode_dev wrote at 2021-12-01 14:39:17:

A great comment to refer to the next time some young coder that never lived through IE6 says it's great to have one single engine everywhere.

magicalist wrote at 2021-12-01 18:00:27:

> _A great comment to refer to the next time some young coder that never lived through IE6 says it's great to have one single engine everywhere._

OP is suggesting that Chrome should have spearheaded a spec that no one but Intel and Google ever showed any interest in. How would that not have been literally "they like a standard (or propose one), they implement it, use it on their projects like google doc, and let the competition deal with the mess"[1]?

Instead they're implementing a spec with broad involvement and support across browsers, it's just taking longer. Seems like exactly how we'd want things to go.

[1]

https://news.ycombinator.com/item?id=29405716

dmitriid wrote at 2021-12-01 18:19:44:

> they like a standard (or propose one), they implement it, use it on their projects like google doc, and let the competition deal with the mess

Yes. Exactly what they've been doing with plenty of other "standards"

magicalist wrote at 2021-12-01 19:13:37:

Which is what the OP is suggesting Google should have done in this case

KarlKemp wrote at 2021-12-01 16:30:54:

As someone who has lived through the browser wars, I have no idea what you’re trying to say? The situation today is vastly better than it was back then. It’s not hyperbole to say that porting JS from, say, IE to Mozilla took just as much time as it took to write it in the first place. Today, it’s expected (and often the truth) that something you developed in one browser works in the others.

Also, none of the specific reasons people opposed Microsoft's policies with IE, like its attempts to lock people into Windows-only APIs (ActiveX etc.), apply today.

BiteCode_dev wrote at 2021-12-01 16:41:26:

> I have no idea what you’re trying to say?

In this case, Google can pull off something like OP says because it dominates the market.

> Also, none of specific reasons people opposed Microsoft’s policies with IE, like it’s attempts to lock people into windows-only APIs (ActiveX etc) apply today.

Chrome implements plenty of Chrome-only APIs. They like a standard (or propose one), they implement it, use it on their projects like google doc, and let the competition deal with the mess.

skybrian wrote at 2021-12-01 17:13:06:

Sure, there is a long tail of web APIs like this but they tend to be very specialized. They are easily avoided and ignored by most web developers, who likely have no need for them in the first place. (Both WebGL and WebGPU are arguably in this category - you are unlikely to need them for your website.)

This is nothing like the situation was with IE6. Back then basic functionality was half-broken and it was hard to get anything done without a pile of hacks.

BiteCode_dev wrote at 2021-12-01 17:50:39:

> This is nothing like the situation was with IE6. Back then basic functionality was half-broken and it was hard to get anything done without a pile of hacks.

The causes are not 1-to-1, but the consequences are the same: monopoly leads to abuse, abuse leads to some sites not working with Firefox, consumers' needs being ignored by Google, and Google abusing its dominant position to push what it wants as standards, or just destroying APIs it doesn't like (see the latest adblock scandal).

flohofwoe wrote at 2021-12-01 15:18:24:

Compute shaders (and a lot of other useful features) are part of GLES 3.1, while WebGL2 stopped at GLES 3.0 (those compute shader experiments were just that: experiments - my guess is that the effort wasn't continued because WebGPU was already on the horizon).

edit: GLES3.2 => GLES 3.1

pjmlp wrote at 2021-12-01 15:28:59:

Intel showed it working; Chrome abandoned it because they didn't want to spend resources implementing it, and because "In order to reclaim code space in Chromium's installer that is needed by WebGPU, the webgl2-compute context must be removed."

WebGPU will still be on the horizon for the next couple of years.

modeless wrote at 2021-12-01 16:45:02:

Apple abandoned OpenGL and refused to implement newer versions that would include compute shaders. WebGL implementations were based on OpenGL at the time. Intel's prototype did not and could not work on Mac. WebGL on Metal was not even started and there was no indication that Apple would ever work on it.

Now, years later, Apple actually implemented WebGL on Metal, so today we could think about implementing WebGL compute on Mac. However WebGPU is now in origin trials. It's very unlikely that Apple would put any effort into a compute shader implementation for WebGL now. And Chrome is not going to go implement a major WebGL feature that has no prospect of ever being supported in Safari.

pjmlp wrote at 2021-12-01 16:59:10:

So for some APIs, Google does whatever they feel like, e.g. Project Fungus, Houdini and PWAs.

But for WebGL it matters what Apple does?

modeless wrote at 2021-12-01 20:30:42:

If Apple was completely blocking progress in web graphics then maybe Chrome would have tried to do something about it. But that's not the case at all. Everyone is aligned on WebGPU as the path forward. It's unfortunate that Apple delayed WebGL for years but there's nothing to do about it now.

pjmlp wrote at 2021-12-02 08:18:25:

Why doesn't Google drop PWAs given the same reasoning?

modeless wrote at 2021-12-03 04:42:52:

... that would only be an analogous situation if Apple was collaborating in a W3C working group with Google and Mozilla and Microsoft and others to make a more capable standard to replace PWAs, and was already in the process of implementing it. The situations really couldn't be more different.

moffkalast wrote at 2021-12-01 14:46:26:

Yeah well I don't see Firefox supporting it either, with their usual practice of not even supporting what Chrome bothers to do.

liminal wrote at 2021-12-01 22:59:34:

Firefox supports CSS subgrid. I wish Chrome did, since it would solve some layout problems I have.

BenoitEssiambre wrote at 2021-12-01 14:47:16:

Hey I was investigating the use of GPUs just out of curiosity since it's one of the few areas of computing I wasn't very familiar with.

The above benchmarks are about matrix multiplication. Matrix multiplication seems to be a common theme in GPU API documentation. Now for someone who rarely needs to multiply matrices, are there other good applications the GPU pipeline tends to be useful for? Are there guidelines on how complex a program you can run on GPU cores and still reap benefit? What are (vague, order of magnitude) limits on program size and size of the different memories you have access to? Can you run machine learning algorithms that are not based on matrix multiplication, gradients or derivatives? Can you do Monte Carlo simulations, decision trees, etc.?

Const-me wrote at 2021-12-01 16:50:45:

> are there other good applications the GPU pipeline tends to be useful for?

Pretty much everything compute-bound. SpaceX presentation about rocket engines:

https://www.youtube.com/watch?v=vYA0f6R5KAI

Unreal Engine 5 presentation about rendering many millions of small triangles using compute shaders:

https://www.youtube.com/watch?v=TMorJX3Nj6U

> Are there guidelines on how complex a program you can run on GPU cores and still reap benefit?

Arbitrarily complex. Many real-world GPGPU programs are split over multiple compute shaders / kernels. When implemented properly, they even run in parallel. This is critically important on Windows with DirectCompute because of the 2-second TDR timeout: when a single compute shader takes more than 2 seconds, the OS concludes the GPU has hung, resets the hardware and reloads the driver.

> What are (vague, order of magnitude) limits on program size and size of the different memories you have access to?

Technically unlimited, because you can stream data from system memory, disk, or even the internet. Practically, most modern GPUs come with 6-12 GB of VRAM; for optimal performance you'd want your data to fit there. When doing micro-optimizations, another important number is the amount of on-chip memory, which is on the order of 64 KB per core.

> Can you do Monte Carlo simulations, decision trees, etc.?

Monte Carlo is a great fit for GPUs, just don’t forget to seed your RNG with SV_DispatchThreadID or an equivalent.

Trees are tricky. GPUs share an instruction decoder and instruction pointer across 32-64 hardware threads, so a straightforward implementation of binary trees is going to be suboptimal due to divergence of these threads. It's often possible to do something else instead.
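
A rough WGSL sketch of that Monte Carlo advice in this thread's WebGPU setting, assuming global_invocation_id as the counterpart of SV_DispatchThreadID; the PCG-style hash and the pi estimator are illustrative choices, not anything from the article:

    // Sketch only: seed a per-thread RNG from the dispatch thread id in WGSL.
    const monteCarloWGSL = /* wgsl */ `
      @group(0) @binding(0) var<storage, read_write> hits : array<u32>;

      fn pcg(state : u32) -> u32 {
        let s = state * 747796405u + 2891336453u;
        let word = ((s >> ((s >> 28u) + 4u)) ^ s) * 277803737u;
        return (word >> 22u) ^ word;
      }

      fn rand01(seed : ptr<function, u32>) -> f32 {
        *seed = pcg(*seed);
        return f32(*seed) / 4294967295.0;
      }

      @compute @workgroup_size(64)
      fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
        var seed = gid.x * 9781u + 1u;   // each thread gets its own stream
        var inside = 0u;
        for (var i = 0u; i < 1024u; i = i + 1u) {
          let x = rand01(&seed);
          let y = rand01(&seed);
          if (x * x + y * y <= 1.0) { inside = inside + 1u; }
        }
        hits[gid.x] = inside;   // reduce on the CPU, or in a second pass, to estimate pi
      }
    `;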

dragontamer wrote at 2021-12-01 18:20:43:

> Trees are tricky. GPUs share an instruction decoder and instruction pointer across 32-64 hardware threads, so a straightforward implementation of binary trees is going to be suboptimal due to divergence of these threads. It's often possible to do something else instead.

The GPU-straightforward implementation is to use a SIMD-stack / SIMD-queue. I don't know its proper term, but it's "obvious" to anyone who programs GPUs.

The following is probably wrong, but hopefully correct enough to demonstrate the idea...

    // Rank of this thread among the *active* threads of the whole workgroup.
    active_lane_rank() {
        wavefront_rank = prefix_sum(execution_mask);
        // execution_mask is 1 for lanes currently executing, 0 otherwise.
        // wavefront_rank is available at the assembly level in one clock tick on both NVidia and AMD GPUs.
        __shared__ int wavefront_counts[wavefronts_in_workgroup];
        if (wavefront_rank == 0) {
            wavefront_counts[wavefront_id] = wavefront_active_count();   // lanes active in this wavefront
        }
        __syncthreads();
        // My offset = active lanes in all earlier wavefronts + my rank within my own wavefront.
        return sum(wavefront_counts[0 .. wavefront_id]) + wavefront_rank;
    }

    // workgroup_active_count() = sum(wavefront_counts[0 .. wavefronts_in_workgroup]),
    // i.e. how many lanes are participating in this push/pop.

    SIMD_push(stack, data) {
        stack[stack.ptr + active_lane_rank()] = data;   // each active lane writes its own slot
        __syncthreads();
        if (active_lane_rank() == 0) {
            stack.ptr += workgroup_active_count();      // advance past everything just pushed
        }
        __syncthreads();
    }

    SIMD_pop(stack) {
        toReturn = stack[stack.ptr - 1 - active_lane_rank()];   // each active lane takes one item off the top
        __syncthreads();
        if (active_lane_rank() == 0) {
            stack.ptr -= workgroup_active_count();      // the pointer goes down by the number of poppers
        }
        __syncthreads();
        return toReturn;
    }

------

Now that you have a stack, it's simply a matter of pushing your DFS nodes onto the SIMD-stack and popping them off to traverse.

You're somewhat BFS, because on a 1024-wide workgroup, your threads will visit the top 1024 items of the stack each step. But the stack overall behaves in a DFS manner.

The CUDA cub library (and the ROCm hipCUB / rocPRIM libraries) implement these horizontal operations (glorified prefix sums). But it's not too hard to write a workgroup prefix-sum or prefix-max yourself. (Indeed, I suggest beginners write their own prefix-sum to get a feel for how simple and efficient prefix sum / scan operations can be.)
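
As a concrete version of that exercise in the thread's WebGPU setting (rather than CUDA), here is a minimal single-workgroup inclusive scan in WGSL using the Hillis-Steele doubling scheme; combining per-workgroup totals, avoiding bank conflicts, and so on are deliberately left out:

    // Sketch only: inclusive prefix sum over one 256-thread workgroup.
    const scanWGSL = /* wgsl */ `
      @group(0) @binding(0) var<storage, read> src : array<u32>;
      @group(0) @binding(1) var<storage, read_write> dst : array<u32>;

      var<workgroup> scratch : array<u32, 256>;

      @compute @workgroup_size(256)
      fn scan(@builtin(local_invocation_index) lid : u32,
              @builtin(global_invocation_id) gid : vec3<u32>) {
        scratch[lid] = src[gid.x];          // assumes the buffer is large enough
        workgroupBarrier();

        // Hillis-Steele: log2(256) = 8 doubling steps.
        for (var offset = 1u; offset < 256u; offset = offset * 2u) {
          var v = scratch[lid];
          if (lid >= offset) { v = v + scratch[lid - offset]; }
          workgroupBarrier();               // everyone has read before anyone writes
          scratch[lid] = v;
          workgroupBarrier();               // everyone has written before the next round
        }

        dst[gid.x] = scratch[lid];
      }
    `;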

--------

The "pattern" is that you use horizontal operations and __syncthreads() for thread synchronization and communication. In effect, this SIMD-stack is performing load-balancing simultaneously with the DFS. That is to say, it "reaches breadth-wise" and pulls extra nodes to visit when more lanes are available, while it prefers depth to minimize memory usage.

Const-me wrote at 2021-12-02 01:25:08:

> The GPU-straightforward implementation is to use a SIMD-stack / SIMD-queue

I agree. I haven't personally done that exact thing, because wave intrinsics require D3D 12.2 (or new enough CUDA, like in your example), but I did similar stuff with group shared memory, which is more compatible and only requires feature level 11.0 hardware.

However, that GPU-straightforward implementation is not that straightforward for programmers with a CPU background starting to use GPUs. Even simpler things like reductions (a dot product, or matrix*vector in linear algebra) are rather tricky to implement efficiently on GPUs, and very different from even manually vectorized SIMD CPU code.

dragontamer wrote at 2021-12-01 20:37:56:

Oh, and CUDA's "cooperative groups" are a nice abstraction for this "Execution-mask" handling.

But if you read older PRAM stuff from the 1980s or 1990s, they talk about "execution masks" and manipulating them directly like this. So it's helpful to know how this older technique relates to modern APIs like cooperative groups.

krona wrote at 2021-12-01 15:14:07:

Nvidia has an entire library of books dedicated to GPGPU, which can now be (to some extent) repurposed for WebGPU in ways not possible with WebGL.

The point of the matrix multiplication example is that it's easy to implement in WebGL without having to be creative; it's a largely apples-to-apples comparison between WebGL/WebGPU/CPU.

dahart wrote at 2021-12-01 16:02:21:

Program size limits are a thing of the past. Today you don’t need to worry about program size, and (like on the CPU) you’ll probably bump into long compile times way before running out of space. Memory access is whatever RAM is on the board, plus whatever you want to stream/transfer from CPU RAM and/or disk.

Complexity doesn't matter; your GPU programs can be arbitrarily complex. What matters is that your workload is fairly uniform - GPUs are SIMT machines that will realize performance gains any time you have lots of threads (thousands, millions, billions) that all mostly do the same thing and can execute mostly independently.

Yes, you can do Monte Carlo simulations, decision trees, and ML without Neural Networks. All of these things are extremely common.

dragontamer wrote at 2021-12-01 15:57:10:

> Now for someone who rarely needs to multiply matrices, are there other good applications the GPU pipeline tends to be useful for?

The study of GPU algorithms is completely different and independent of regular CPU algorithms.

Sorting: Parallel Bitonic Sort networks. Radix Sort. Mergepath.

Hashing: Parallel Cuckoo Hash.

Graphs: Strangely similar to CPUs. Depth-first search to save space, breadth-first to generate parallelism. You'll need the parallel SIMD-stack / SIMD-queue with prefix-sums to efficiently push/pull items off the stack/queue, but it's not too difficult actually (though non-obvious to the CPU programmer).

Trees: BVH traversal (aka: Raytracing) is extremely common, showing that high-throughput / parallel tree traversal is possible on GPUs.

--------

The problem is that GPUs are designed for 10,000 "threads" (really, SIMD lanes). In fact, I can almost certainly say that any program with fewer than 1000 threads will run faster on a CPU than a GPU (unless you have a rare, purely memory-bandwidth-bound problem, like memcpy or memset, because GPU RAM is way faster than standard DDR4).

There are all sorts of techniques to discover more "threads" (or parallel streams of computation) through the usage of prefix sum. But ultimately, a "practical" GPU application must perform tens-of-thousands of calculations in parallel to beat a CPU.

Indeed: something like a parallel bitonic sort is technically less efficient. However, because it creates so many parallel threads of compute, it's a great fit as a simple GPU sorting algorithm. (GPU Mergepath sort is within a log2(processors) factor of the total work of the sequential version, and is therefore considered almost as efficient as the original sequential algorithm.)

That's the thing: you end up generating slightly more work with any parallel program than the original sequential algorithm in most cases.

Matrix multiplication is "great" because the parallel version has exactly the same number of additions and multiplications as the original sequential version, so everyone loves it as a simple parallelism example. It's even stronger: the GPU algorithm not only has the same number of computations, but also the same number of memory reads/writes. It's the most ideal algorithm to run in parallel.
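
Since the linked benchmark measures exactly this case, here is a minimal WGSL sketch of a naive matrix-multiply kernel, one output element per invocation; the dimensions, bindings and lack of tiling are illustrative simplifications, not the article's actual shader:

    // Sketch only: naive C = A * B for square N x N matrices in WGSL.
    const matmulWGSL = /* wgsl */ `
      const N : u32 = 512u;

      @group(0) @binding(0) var<storage, read> a : array<f32>;
      @group(0) @binding(1) var<storage, read> b : array<f32>;
      @group(0) @binding(2) var<storage, read_write> c : array<f32>;

      @compute @workgroup_size(8, 8)
      fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
        let row = gid.y;
        let col = gid.x;
        if (row >= N || col >= N) { return; }
        var acc = 0.0;
        for (var i = 0u; i < N; i = i + 1u) {
          acc = acc + a[row * N + i] * b[i * N + col];
        }
        c[row * N + col] = acc;   // same adds/multiplies and memory traffic as the sequential loop
      }
    `;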

Other cases, such as sorting, graph algorithms or whatever, will end up using slightly more overall compute (so there's a tradeoff), or traversing in a different order (CPUs favor depth first search. GPUs will go far more breadth-first before their tens-of-thousands of threads fill up, so you end up traversing the graph / tree in a different order and getting slightly different results)

As such, other algorithms require the programmer to make a judgement: is it worth paying a log2(processors) factor of extra work in order to use a GPU and perform the calculation in parallel? Almost certainly. But other algorithms pay O(processors) or worse, at which point a purely sequential approach (or a low thread count, like 32 threads) would be better than 10,000 threads.

GPU lanes are much slower: they access memory at far higher latencies, and GPUs are in-order processors (lacking the branch prediction, out-of-order and superscalar tricks from the CPU world). As such, GPU threads are much, much slower individually than CPU threads. You just get a ton of GPU "threads" to make up the difference.

spekcular wrote at 2021-12-01 18:00:11:

Thanks for the very detailed and helpful reply.

Do you know of any good books or references where I could learn more about these things?

The books that are usually recommended seem to be CUDA-centric and out of date. I'm interested in learning the more general concepts you talk about in your answer, so that I can effectively write e.g. Monte Carlo simulations on arbitrary GPU hardware. (I don't have an Nvidia GPU!)

dragontamer wrote at 2021-12-01 18:07:41:

> The books that are usually recommended seem to be CUDA-centric and out of date.

The CUDA ones are usually best, because at least they use a modern API. The other recommendations I got are the '80s stuff on vector computers and the CM-2.

It turns out that a huge community of high-performance programmers have experimented with all of these concepts in the 1970s, 80s, 90s, and 00s, long before GPUs. All of their concepts still work today on modern GPUs.

Looking up the right keywords, such as "CREW-PRAM algorithms" (Concurrent-Read Exclusive-Write Parallel RAM model), immediately gives you plenty of results for some things. (Ex: I just searched for 'CREW-PRAM DFS' and got:

https://core.ac.uk/download/pdf/82490222.pdf

).

The key is understanding what the "old word" for GPU was. That's PRAM, the Parallel RAM model. That's what programmers from the 1970s, 1980s, and 1990s called this GPU style of algorithm back then.

Newer articles / books talk about GPUs directly.

--------------

I'd say the fundamentals are covered in the 1970s, such as "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations" by Kogge, which leads to the development of Prefix sum (as it is called today).

Of course, CUDA code is clearer than a theoretical CREW-PRAM discussion, so perhaps it's easier if you just read NVidia's GPU Computing Gems to cover the same material. Still, I find that the 1970s writing is oftentimes better written for a beginner (back then, fewer programmers knew how parallel programming worked, but they were clearly more math-heavy than today, so I find that reading old articles like that one helps my personal thinking pattern).

---------

Ah right, the recommendation. I'd say "Data Parallel Algorithms" by Hillis / Steele (ACM Communications 1986) is an excellent introduction to general-purpose SIMD compute. It was written for CM-2, an ancient supercomputer that no longer exists, but the PRAM style applies to GPU algorithms today.

It's like 15 pages, but it really opens your eyes to the possibilities. A lot of CUDA stuff is very specific to NVidia GPUs (important specific details like bank conflicts and shared memory, which you absolutely should learn about... but such details should be studied after you've learned the way of parallel thinking / the PRAM model / etc.).

dragontamer wrote at 2021-12-01 20:33:36:

Oh, in case it isn't clear... CUDA is just the most popular GPGPU programming language right now. There's ispc, OpenCL, OpenACC, DPC++, SYCL and many others.

They are all closely related to the PRAM model, however. So studying algorithms isn't really about learning CUDA details or whatever, but about learning the generic (and cross-language) concepts of parallel programming.

------

So it really doesn't matter if you're reading CUDA, or C-star (1980s code for the old CM-2 supercomputer). They're both PRAM in concept and therefore somewhat compatible in your brain.

It helps to know GPU-quirks for highest levels of optimization (wavefront programming, uniform branches, __shared__ memory), but you can learn these details after learning the generic PRAM stuff known for the past decades.

ArtWomb wrote at 2021-12-02 14:06:25:

Hey, dragon, thanks so much for your thoughtful replies. This thread turned into a bonanza of revelations about the future of graphics APIs. And I have to concur: CPUs are getting so inexpensive and powerful that proprietary renderers (designed to run on farms as well) simply target vector extensions for parallelism.

Regarding learning GPU architectures and programming, there is usually an introductory section devoted to compute in the Graphics Gems books. But you are on your own regarding streaming, tracing, tuning, and multi-GPU. All still very much dark arts ;)

bsenftner wrote at 2021-12-01 15:29:25:

The reason matrix multiplication is the focus is that linear algebra is generic and creates a pipeline for whatever that generic, computation-requiring task happens to be. The obvious cases are things like rendering imagery and transforming point clouds, which if you think about it are completely generic operations.

alfalfasprout wrote at 2021-12-01 18:24:19:

tl;dr problems that are compute bound but trivially parallelizable tend to be good choices for GPU computation.

E.g., when you're running a series of small computations on a lot of data, or a lot of small computations on a moderate amount of data.

GPUs tend to be pretty ineffective for computations where there are a lot of data dependencies since the individual shader units are slow compared to traditional CPU cores... so most of the GPU will be idle.

Or for ultra-low-latency applications (since moving data to and from the GPU is costly even with direct memory access).

rough-sea wrote at 2021-12-01 14:46:39:

It's a little known fact that Deno has WebGPU built-in already:

https://doc.deno.land/builtin/stable#GPU

danielvaughn wrote at 2021-12-01 14:48:59:

I haven't followed Deno since I first heard of it, back when it was initially announced. If you're following it, how's it going? Do you like it?

ksec wrote at 2021-12-01 21:01:16:

Does anyone know of a high-level comparison of feature support between DirectX, Metal, WebGPU, OpenGL, etc.?

I know WebGPU is a subset, but I've always wanted to know how many features are missing compared to modern DirectX 12 and Metal (as well as possibly GNMX, Vulkan, etc.).

invalidname wrote at 2021-12-01 14:22:22:

Accessing low-level device capabilities implemented by the device driver manufacturer?

Exploitable vulnerabilities will probably come in faster than the performance...

flohofwoe wrote at 2021-12-01 15:13:25:

How do you get that from the blog post? WebGPU doesn't allow direct access to the GPU, just as WebGL doesn't; it's just a different programming model which moves more state management into the initialization phase instead of the render loop.

542458 wrote at 2021-12-01 14:37:13:

I recall seeing similar predictions of terrible security vulnerabilities with the WebGL APIs, but AFAIK we never saw real-world attacks here (other than fingerprinting). Much like JIT or video, it does increase the attack surface significantly and you should turn it off if you’re very security conscious… but it’s not apocalyptic for most users.

krona wrote at 2021-12-01 15:09:18:

At one point it was possible to read framebuffers (via the WebGL API) containing the rendering of other tabs in (if memory serves) Firefox.

est wrote at 2021-12-01 14:50:21:

> you should turn it off if you’re very security conscious

You can't easily turn it off in Chrome/Blink/Edge

Same as WebRTC.

bhouston wrote at 2021-12-01 16:21:41:

If only there were a consistent way to access WebGL, WebGL2 and WebGPU from Node.js. I was thinking of pushing Headless-GL in that direction:

https://twitter.com/BenHouston3D/status/1466054449898659842

There is a separate project from Google for just WebGPU in Node.js here but it isn't production ready:

https://twitter.com/DaKangz/status/1466063165947527177

XCSme wrote at 2021-12-01 16:24:50:

I think the main gist of WebGL and WebGPU is that they run in the browser. If you are going to run it on the server and care about performance, why not use native drivers instead of headless WebGL access? Is code-sharing between browser and server the main reason?

fulafel wrote at 2021-12-01 16:34:07:

Calling into native drivers directly is unportable, so it's not suitable for e.g. a published Node module that's supposed to be generally usable, and the APIs are very unsafe, so they'll turn your managed-language app into a segfault- and security-hole-prone basket of surprises.

kevingadd wrote at 2021-12-01 16:38:43:

Using OpenGL (or Vulkan) to talk to the GPU is one of the most portable things around. It works more places than Node does, even

fulafel wrote at 2021-12-01 16:43:39:

This is technically true, if you don't count WebGL, but the bar is very low. Nontrivial OpenGL apps are epic battles requiring you to carefully and constantly test and debug zillions of configuration combos, with implementation-specific OpenGL bugs, workarounds thereof, and dealing with myriad combos of feature and OpenGL version availability. A bit like SQL is portable for some meaning of "portable" (ANSI SQL!), only much worse.

Also, Apple keeps threatening to kill OpenGL, so any day now all the people developing on a Mac might lose the ability to develop this headless component locally. (And this was always the situation with Vulkan.)

jatone wrote at 2021-12-01 16:35:47:

good thing you're running on a server then.

fulafel wrote at 2021-12-01 16:39:58:

Because portability, robustness or security are not concerns in server side components...?

jatone wrote at 2021-12-01 18:05:40:

The reality is that OpenGL and Vulkan are portable, but the fact that you're on a server means you have control over the environment. It's not like you need to deal with every Tom, Dick, and Harry system configuration.

fulafel wrote at 2021-12-01 19:05:17:

I guess you are thinking of a situation where you would run this on an on-prem server system that you had a hand in spec'ing out? Yes, it's possible to do this, but most Node apps in use run in various cloudy environments, possibly containerized, that you have less control over, and of course on random dev setups that people have. If you put out an open-source Node module, you'll have devs trying to run it on systems with plentiful variability.

jatone wrote at 2021-12-01 22:05:39:

even in cloud environments you control the environment by choosing to use that service.

dangerbird2 wrote at 2021-12-02 01:39:14:

WebGPU also works as a simplified alternative to Vulkan or DirectX-12 on native platforms. I could see nodejs libraries using WebGPU compute shaders as an alternative to CUDA kernels that run on non-NVIDIA GPUs.

jdashg wrote at 2021-12-02 00:40:30:

You can totally do async downloads out of WebGL, and pixel shaders are no more synchronous than compute shaders are.

https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API/W...
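
The non-blocking readback pattern that page describes looks roughly like this; a sketch assuming a WebGL2 context, with error handling omitted:

    // Sketch: readPixels into a pixel-pack buffer, then poll a fence instead of
    // stalling the main thread on a synchronous readback.
    async function readPixelsAsync(gl: WebGL2RenderingContext,
                                   x: number, y: number,
                                   w: number, h: number): Promise<Uint8Array> {
      const buf = gl.createBuffer()!;
      gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buf);
      gl.bufferData(gl.PIXEL_PACK_BUFFER, w * h * 4, gl.STREAM_READ);
      gl.readPixels(x, y, w, h, gl.RGBA, gl.UNSIGNED_BYTE, 0);   // returns immediately
      gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);

      const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0)!;
      gl.flush();
      while (gl.clientWaitSync(sync, 0, 0) === gl.TIMEOUT_EXPIRED) {
        await new Promise(r => setTimeout(r, 1));   // yield to the event loop, poll again
      }
      gl.deleteSync(sync);

      const out = new Uint8Array(w * h * 4);
      gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buf);
      gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, out);
      gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
      gl.deleteBuffer(buf);
      return out;
    }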

danielvaughn wrote at 2021-12-01 14:51:35:

Naive question here. Since WebGPU offers compute, shouldn't it be possible to create a web app where you can join a mining pool directly from your browser?

gbrown wrote at 2021-12-01 16:59:59:

More than possible, it's common as an attack: cryptojacking.

greggman3 wrote at 2021-12-01 17:37:56:

You can already mine coins from your browser today

https://coinwebmining.com

danielvaughn wrote at 2021-12-01 18:19:39:

Yeah true. I guess what I'm wondering is whether explicit support for GPU compute shaders makes it more of a viable product. Supposedly WebGPU is based on the new Vulkan-based API instead of OpenGL, which should allow for much greater performance. I'm not really knowledgeable enough to know for sure, though.

proto-n wrote at 2021-12-01 14:56:11:

Years back I had the idea of a Patreon-like site, where people could support a person/project by keeping a tab open, mining. Nothing like it materialized AFAIK; I guess none of the usual PoW algos really shine in browser tabs.

dljsjr wrote at 2021-12-01 15:19:12:

Most modern browsers throttle or suspend JS execution in tabs that don't have focus anyway.

danielvaughn wrote at 2021-12-01 15:23:03:

Ah this is a good point. So it would literally need to be visible, focused, and the computer likely shouldn't enter sleep mode during the mining process. That's a lot of hoops.

hutzlibu wrote at 2021-12-01 18:04:46:

Visible is enough. It does not have to be focused. It can mostly also be covered by other (browser) windows, but if it is minimized or in another tab, it gets throttled.

That hit me a few times, but ads and malware ruined that for everyone. (and there is no simple permission dialog possible, where you ask the user to allow it)

pjmlp wrote at 2021-12-01 15:13:23:

It is already possible in WebGL using regular shaders, with textures as buffers.

danielvaughn wrote at 2021-12-01 15:14:44:

You can, but my understanding is that they are significantly less performant than using proper compute shaders. Maybe I'm wrong though?

flohofwoe wrote at 2021-12-01 15:23:29:

The more computation happens per 'item/pixel', the less the WebGL-induced overhead should matter. WebGL adds overhead getting the data into and out of the shaders, but the raw pixel- vs. compute-shader performance shouldn't differ much.
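
For reference, the "textures as buffers" pattern looks roughly like this; a sketch assuming an already-compiled program whose vertex shader expands gl_VertexID into a fullscreen triangle, whose sampler uses texture unit 0, and the EXT_color_buffer_float extension for float render targets:

    // Sketch: "compute via fragment shader" in WebGL2. Inputs live in a float
    // texture, the fragment shader is the kernel, one invocation per output pixel.
    function runFragmentCompute(gl: WebGL2RenderingContext, program: WebGLProgram,
                                input: Float32Array, width: number, height: number): Float32Array {
      const inputTex = gl.createTexture()!;                       // upload input data as a texture
      gl.bindTexture(gl.TEXTURE_2D, inputTex);
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, width, height, 0, gl.RGBA, gl.FLOAT, input);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);

      const outputTex = gl.createTexture()!;                      // render target instead of the canvas
      gl.bindTexture(gl.TEXTURE_2D, outputTex);
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, width, height, 0, gl.RGBA, gl.FLOAT, null);
      const fbo = gl.createFramebuffer()!;
      gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
      gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, outputTex, 0);

      gl.viewport(0, 0, width, height);                           // one fullscreen triangle = the dispatch
      gl.useProgram(program);
      gl.bindTexture(gl.TEXTURE_2D, inputTex);
      gl.drawArrays(gl.TRIANGLES, 0, 3);

      const out = new Float32Array(width * height * 4);           // synchronous readback; the costly step
      gl.readPixels(0, 0, width, height, gl.RGBA, gl.FLOAT, out);
      return out;
    }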

kevingadd wrote at 2021-12-01 16:39:50:

I would expect the real problems are the lack of integer operations and random memory access

GhettoComputers wrote at 2021-12-01 18:29:24:

Do you enable or disable this in your browser? I turned off webGL and webGPU to make it less likely to have processes hogging my device.

hutzlibu wrote at 2021-12-01 18:26:57:

"We do not do expensive and synchronous getPixelsData"

Sounds good (along with the performance graph).

But I would also just like an asynchronous getPixelData for the canvas now.

Does anyone know if there is a chance of this coming anytime soon? From my user perspective, it does not seem too complicated to implement, but the potential benefits seem great.

Because WebGL is working quite stably now and WebGPU isn't yet.

soylentgraham wrote at 2021-12-01 14:32:26:

"Then the result of computation we have as a set of pixels on a <canvas> element and we have to read it synchronously with getPixelsData then color codes to be converted back to your data. Looks like an inefficient mess, right?"

Well, that's not the fastest path for reading back the render target, so I can see why not doing it would be more efficient.

ttoinou wrote at 2021-12-01 23:34:10:

Can we transfer the result of the WebGPU compute to a WebGL context to display on a canvas afterwards? So that intermediate computations are fast, but we can still use the result to display something in WebGL like before.

tominated wrote at 2021-12-01 23:51:54:

If you have WebGPU compute support available, you'd probably wanna go straight to using WebGPU instead of WebGL for rendering too. But I don't see why it wouldn't be possible to use the result in WebGL afterwards - there's probably going to be a bit of data conversion involved though

modeless wrote at 2021-12-02 01:55:46:

WebGPU can display things on canvases by itself. You don't need WebGL. It is significantly more flexible than WebGL because a single WebGPU device can present to multiple canvases.
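
A sketch of what that looks like with the current API (types from @webgpu/types assumed; the exact shape of configure() has shifted between spec drafts):

    // Sketch: presenting WebGPU output straight to a canvas, no WebGL involved.
    async function initWebGPUCanvas(canvas: HTMLCanvasElement) {
      const adapter = await navigator.gpu.requestAdapter();
      const device = await adapter!.requestDevice();
      const context = canvas.getContext('webgpu') as GPUCanvasContext;
      context.configure({ device, format: navigator.gpu.getPreferredCanvasFormat() });
      // Each frame, render into context.getCurrentTexture() with a render pass;
      // the same device can be configured against any number of canvases.
      return { device, context };
    }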

brrrrrm wrote at 2021-12-01 14:20:17:

Is the benchmark code open source?