I bought one of the new M1 MacBook Pro machines when they came out. I'm really happy with it: everything works great, they got rid of the Touch Bar (which I hated), and they put MagSafe, HDMI, and the SD card slot back.
But, the chip is the special bit. I went for the M1 Max with 64GB RAM, 10 CPU cores and 32 GPU cores.
Since I paid for those extra GPU cores, I thought I should know how to use them...
I wanted to be able to carry out file operations, network operations, and data processing or file handling in Go, and still be able to run specific data processing tasks on the GPU.
Here's what I ended up with.
Apple's example code shows you how to add 2 arrays of numbers together.
In this version, the numbers are generated and printed using Go.
```
kernel void process(device const Params* p,
                    device const float* input,
                    device float* output,
                    uint3 gridSize[[threads_per_grid]],
                    uint3 gid[[thread_position_in_grid]]) {
  // Only process once per row of data.
  if(gid.x != 0) {
    return;
  }
  // Since we know we're in the first column...
  // we can process the whole row.
  int input_index = idx(gid.x, gid.y, gid.z, p->w_in, p->h_in, p->d_in);
  float a = input[input_index];
  float b = input[input_index+1];
  int output_index = idx(0, gid.y, 0, p->w_out, p->h_out, p->d_out);
  output[output_index] = a + b;
}
```
I've made the Go side as simple as I could. Compile the shader, create the input matrix and populate it, create the output matrix, run `gpu.Run(input, output)` and collect the results.
```
//go:embed add.metal
var source string

func main() {
	gpu.Compile(source)

	input := gpu.NewMatrix[float32](2, 10, 1)
	for y := 0; y < input.H; y++ {
		for x := 0; x < input.W; x++ {
			input.Set(x, y, 0, float32(y))
		}
	}

	output := gpu.NewMatrix[float32](1, input.H, 1)

	gpu.Run(input, output)

	for y := 0; y < output.H; y++ {
		fmt.Printf("Summed: %v\n", output.Get(0, y, 0))
	}
}
```
Here, the Metal shader is expecting an input image consisting of a 2D matrix of `uint8` values, with each group of 4 values representing a single pixel's red, green, blue and alpha values.
The shader code calculates the average of red, green and blue, and uses that as the value for all 3, to create a grayscale image.
```
kernel void process(device const Params* p,
                    device uint8_t* input,
                    device uint8_t* output,
                    uint3 gridSize[[threads_per_grid]],
                    uint3 gid[[thread_position_in_grid]]) {
  // Only process once per pixel of data (4 uint8_t values).
  if(gid.x % 4 != 0) {
    return;
  }
  int input_index = idx(gid.x, gid.y, gid.z, p->w_in, p->h_in, p->d_in);
  uint8_t r = input[input_index+0];
  uint8_t g = input[input_index+1];
  uint8_t b = input[input_index+2];
  uint8_t a = input[input_index+3];
  uint8_t avg = uint8_t((int(r) + int(g) + int(b)) / 3);
  output[input_index+0] = avg;
  output[input_index+1] = avg;
  output[input_index+2] = avg;
  output[input_index+3] = 255;
}
```
The Go code is similar to the first example, except that it has to load from the input JPEG into the matrix instead of generating numbers.
```
f, err := os.Open("puppy-g7b38fec9b_1920.jpg")
if err != nil {
	log.Fatalf("failed to read puppy JPEG: %v", err)
}
defer f.Close()
jpg, err := jpeg.Decode(f)
if err != nil {
	log.Fatalf("failed to decode JPEG: %v", err)
}

bounds := jpg.Bounds()
stride := 4
input := gpu.NewMatrix[uint8](bounds.Dx()*stride, bounds.Dy(), 1)
for y := 0; y < bounds.Dy(); y++ {
	for x := 0; x < bounds.Dx(); x++ {
		r, g, b, a := jpg.At(x, y).RGBA()
		input.Set((x*stride)+0, y, 0, uint8(r/257))
		input.Set((x*stride)+1, y, 0, uint8(g/257))
		input.Set((x*stride)+2, y, 0, uint8(b/257))
		input.Set((x*stride)+3, y, 0, uint8(a/257))
	}
}
```
Then, configure the output to be the same size as the input.
```
output := gpu.NewMatrix[uint8](bounds.Dx()*stride, bounds.Dy(), 1)
```
And run!
```
gpu.Run(input, output)
```
As before, the output matrix is populated with the processed image data, which can then be written out as a JPEG.
```
fo, err := os.Create("gray-puppy.jpg")
if err != nil {
	log.Fatalf("failed to create grayscale puppy: %v", err)
}
defer fo.Close()
img := image.NewRGBA(jpg.Bounds())
img.Pix = output.Data
err = jpeg.Encode(fo, img, &jpeg.Options{
	Quality: 100,
})
if err != nil {
	log.Fatalf("failed to encode grayscale puppy: %v", err)
}
```
And out comes the gray puppy.
Here the Metal shader code is more complex [0]. Fractal generation is a good fit for GPU processing, because each pixel is calculated independently of those around it.
Running this example renders the familiar shape to PNG.
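The shader source isn't repeated here, but the core of it is the classic escape-time loop. As a rough CPU-side sketch in Go (my own illustration, not the shader itself), each pixel's value depends only on its own coordinates, so no thread needs to read its neighbours' results:

```
// escapeTime returns how many iterations the point (cx, cy) takes to
// escape the Mandelbrot set, up to maxIter. Because each pixel is
// independent, every GPU thread can run this loop in isolation.
func escapeTime(cx, cy float64, maxIter int) int {
	var x, y float64
	for i := 0; i < maxIter; i++ {
		if x*x+y*y > 4 {
			return i
		}
		x, y = x*x-y*y+cx, 2*x*y+cy
	}
	return maxIter
}
```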
The Metal APIs are reasonably accessible as a way to run more data processing in parallel than is possible on the CPU of an M1 Mac. However, the gains are offset by the time spent transferring data to and from the GPU.
Writing shader code is fairly straightforward, but it's a very dependency-free environment, which makes using things like cryptographic functions a big job.
The CGO capability of Go made it fairly easy to read and write data to and from the GPU, although I'm sure there's a cost to this.
The code is available at [1].
If you're interested in the details of how I got to the simple code above, read on.
I was hoping there'd be something I could pick up and use, because I thought lots of people would like to run bits and pieces of code on the GPU, but the closest thing I found was [2], a wrapper around Apple's Metal graphics APIs.
When the author posted this on Hacker News, they got a surprisingly hostile reaction, with the top comment saying to use Vulkan instead [3].
Vulkan [4] is supposed to provide a set of cross-platform APIs so that you can "write code once" and, hopefully, it will run anywhere.
Well, I tried, but I was put off pretty quickly. The tutorials I saw were complex, and a "Hello World" project I found [5] was huge. Even the first commit was massive [6].
I'd read that Vulkan's APIs were hard to understand, verbose, and complex, and that seemed to match what I was seeing, so I gave up on that and went back to `dmitri.shuralyov.com/gpu/mtl`.
Unfortunately, it became clear that it doesn't support compute operations [7].
I tried to implement the missing functionality in the library, but my knowledge of how it all worked wasn't at the level where I could contribute effectively, so I headed to Apple's native APIs to learn from the ground up.
Apple's documentation is pretty good, and ships with clear sample code [8].
I started playing around to get the examples working and was up and running quickly enough, since Apple has a fairly straightforward example [9] of adding two arrays together.
Aside from the hassle of learning more about Objective-C syntax and its terminology, it seemed OK.
Go code can't be compiled to run on the GPU (although there is an interesting-looking project that's aiming to achieve that for Rust [10]), so the code running on the GPU is written in Metal Shading Language [11], which is based on C++14.
These compute functions are called shaders because they were originally designed to be run against individual pixels in a graphics frame, to apply changes in brightness, e.g. a fade. The key benefit over CPUs is the amount of parallelism they provide, with my GPU capable of executing 32 operations in parallel, vs only 10 in parallel on the CPU.
Shader code itself can be fairly simple. The `add.metal` below simply adds two numbers together.
To do this, it receives some parameters: a `Params` struct describing the dimensions of the input and output matrices, the input and output arrays themselves, and the grid size and thread position within the grid, which Metal provides.
The code is then instructed to run on each element within the input array.
This particular shader code is aware that it only needs to add the two numbers in column A and column B together. Column A is `x == 0`, so if `x != 0` we must be in column B, and no processing is carried out.
I could have taken some shortcuts with this code, but I wanted to establish a consistent pattern: describe the input and output dimensions in the `Params` struct, and use the `idx` helper to map (x, y, z) coordinates into the flat input and output arrays.
```
#include <metal_stdlib>

using namespace metal;

typedef struct Params {
  int w_in, h_in, d_in;
  int w_out, h_out, d_out;
} Params;

int idx(int x, int y, int z, int w, int h, int d) {
  int i = z * w * h;
  i += y * w;
  i += x;
  return i;
}

kernel void process(device const Params* p,
                    device const float* input,
                    device float* output,
                    uint3 gridSize[[threads_per_grid]],
                    uint3 gid[[thread_position_in_grid]]) {
  // Only process once per row of data.
  if(gid.x != 0) {
    return;
  }
  // Since we know we're in the first column...
  // we can process the whole row.
  int input_index = idx(gid.x, gid.y, gid.z, p->w_in, p->h_in, p->d_in);
  float a = input[input_index];
  float b = input[input_index+1];
  int output_index = idx(0, gid.y, 0, p->w_out, p->h_out, p->d_out);
  output[output_index] = a + b;
}
```
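For comparison, this is roughly what the kernel computes, expressed as plain CPU-side Go (a sketch of my own for illustration, not part of the library):

```
// addRows mirrors the shader: for each row, the value in column A (x == 0)
// and column B (x == 1) of the input are summed into the single-column output.
func addRows(input, output *gpu.Matrix[float32]) {
	for y := 0; y < input.H; y++ {
		output.Set(0, y, 0, input.Get(0, y, 0)+input.Get(1, y, 0))
	}
}
```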
With Metal shader code ready, I needed to do the rest of the work.
This is where CGO comes in [10].
Using special CGO comments, I could get `go build` to compile the `mtl.h` header and `mtl.m` Objective-C file along with my Go code.
```
//go:build darwin
// +build darwin

package gpu

/*
#cgo LDFLAGS: -framework Metal -framework CoreGraphics -framework Foundation
#include <stdlib.h>
#include <stdbool.h>
#include "mtl.h"
*/
import "C"
```
In the `mtl.h` file I've got a function:
```
void compile(char* source);
```
To call this from Go, all I needed to write was `C.compile`, which looks like this in practice:
```
// Compile the shader. Only needs to be done once.
func Compile(shaderCode string) {
	src := C.CString(shaderCode)
	defer C.free(unsafe.Pointer(src))
	C.compile(src)
}
```
In the `mtl.h` file the `Params` type is defined and used in the `run` function.
```
typedef struct Params {
  int w_in, h_in, d_in;
  int w_out, h_out, d_out;
} Params;

void* run(Params *params);
```
In the Go code, there's a matching implementation of the C `Params` struct. The field order and data sizes must match the C implementation exactly, so the C `int` becomes a Go `int32` rather than a Go `int` to ensure that they're the same size, but the field names can be different.
```
type params struct {
	// Size of input matrix.
	WIn, HIn, DIn int32
	// Size of output matrix.
	WOut, HOut, DOut int32
}
```
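Because the layouts have to line up byte for byte, a quick sanity check (my own addition, not in the original code) can catch mismatches at startup; the C struct is six 4-byte ints, so the Go struct should be 24 bytes:

```
// Six int32 fields at 4 bytes each should match the six C ints exactly.
func init() {
	if unsafe.Sizeof(params{}) != 6*4 {
		panic("params struct layout does not match the C Params struct")
	}
}
```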
That mapping allows Go code to call the C code with custom types:
```
p := params{
	WIn:  int32(input.W),
	HIn:  int32(input.H),
	DIn:  int32(input.D),
	WOut: int32(output.W),
	HOut: int32(output.H),
	DOut: int32(output.D),
}
cp := (*C.Params)(unsafe.Pointer(&p))

// Run.
ptr := C.run(cp)
```
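The `run` function returns a raw pointer to the GPU's output buffer. One way the results could be copied back into the output matrix (a sketch, assuming the buffer holds `output.Size()` contiguous values of the element type; the library's actual mechanism may differ) is to view the C memory as a Go slice:

```
// View the C memory as a slice of the output's length (Go 1.17+),
// then copy it into the Go-managed output matrix.
output.Populate()
results := unsafe.Slice((*float32)(ptr), output.Size())
copy(output.Data, results)
```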
I was able to keep the C implementation simple by using the idea that a matrix can be flattened, much like a bitmap is just a stream of pixel values with a set width.
This allows the `mtl.m` C code to support all the input shapes of data I thought I'd be working with, e.g. a 1D, 2D, or 3D array of a single type, without having to maintain different versions for different inputs.
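Concretely, the element at (x, y, z) of a w × h × d matrix lives at offset `z*w*h + y*w + x` in the flat slice, which is exactly what the `idx` helper in the shader and the `Index` method below compute:

```
// index flattens (x, y, z) into a single offset: whole w*h planes first,
// then whole rows, then the column.
func index(x, y, z, w, h int) int {
	return z*w*h + y*w + x
}

// For a matrix 4 wide, 3 high, and 2 deep:
// index(1, 2, 1, 4, 3) == 1*4*3 + 2*4 + 1 == 21
```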
The next problem was supporting different types for the input and output matrices. For example, I thought it likely that I would have `int` and `float` types.
In the shader C code, I used `void*` and left it to the code to blow up if the types were incorrect.
In Go, I used generics to constrain the types to those which both the GPU and Go can support (I'm sure it could support more, but this is all I've tested with so far).
```
type GPUType interface {
	uint8 | uint32 | int32 | float32
}
```
Then I created a convenience API around the flat underlying array to allow reading and writing by xyz coordinates within the matrix.
```
func NewMatrix[T GPUType](w, h, d int) *Matrix[T] {
	// Store matrix like a display buffer, i.e.
	// get a whole y row, and index it with x.
	// so, d, y, x
	m := &Matrix[T]{
		W:    w,
		H:    h,
		D:    d,
		init: &sync.Once{},
	}
	return m
}

type Matrix[T GPUType] struct {
	W, H, D int
	Data    []T
	init    *sync.Once
}

func (m *Matrix[T]) Populate() {
	m.init.Do(func() {
		if m.Data == nil {
			m.Data = make([]T, m.W*m.H*m.D)
		}
	})
}

func (m Matrix[T]) Index(x, y, z int) (i int) {
	i += z * m.W * m.H
	i += y * m.W
	i += x
	return i
}

func (m *Matrix[T]) Set(x, y, z int, v T) {
	m.Populate()
	m.Data[m.Index(x, y, z)] = v
}

func (m Matrix[T]) Get(x, y, z int) T {
	m.Populate()
	return m.Data[m.Index(x, y, z)]
}

func (m Matrix[T]) Size() int {
	return m.W * m.H * m.D
}
```
Will I start writing more stuff for the GPU? I don't know. Maybe. It's not very useful for the work I currently do, because the Metal APIs don't run in AWS or GCP.