Table of Contents

Introduction

In a previous post I shared a Game of Life I made to try out Ebitengine and WASM with Go. I decided to dive further by implementing a shader to make the game calculations. I’ll go through some iterations I went through.

CPU Version

This is the CPU-bound version I implemented the last time:

// NextGrid calculates the next grid state given a grid and its width (X) and height (Y)
func NextGrid(X, Y int, grid []bool) []bool {
    if len(grid) != X*Y {
        panic("grid doesn't match dimensions")
    }
    next := make([]bool, X*Y)
    for p := 0; p < X*Y; p++ {
        neighs := CalcNeighs(p, X, Y, grid)
        next[p] = grid[p]
        if neighs < 2 {
            next[p] = false
        } else if neighs > 3 {
            next[p] = false
        } else if neighs == 3 {
            next[p] = true
        }
    }
    return next
}

var deltas = [3]int{-1, 0, 1}

func CalcNeighs(p, X, Y int, grid []bool) int {
    neighs := 0
    for _, yd := range deltas {
        if p < Y && yd == -1 {
            continue // skip top
        }
        if p >= X*Y-X && yd == 1 {
            continue // skip bottom
        }
        for _, xd := range deltas {
            if p%X == 0 && xd == -1 {
                continue // skip left
            }
            if p%X == X-1 && xd == 1 {
                continue // skip right
            }
            neighIx := p + xd + yd*X
            if neighIx < 0 || neighIx >= X*Y || neighIx == p {
                continue
            }
            if grid[neighIx] {
                neighs += 1
            }
        }
    }
    return neighs
}

Not a difficult problem, it’s not the best solution but it does the job. I decided to use []bool instead of [][]bool as the grid type, which probably made this code harder to follow for all the index to coordinate conversions.

Idea

I’ve made some games a few years ago with different tools, and I wanted to revisit some of the concepts so I decided to add the following features to the game:

A bigger grid
The ability to zoom in and out
The ability to move around the grid

I coded some features like a camera position that offsets the grid at draw time, a scaling factor to zoom in and out, and even some on-screen detection to avoid drawing stuff the camera isn’t looking at.

With the CPU processing of the grid I couldn’t scale the grid very much, so I decided to play with shaders for the first time.

Introduction to Shaders

Shaders for Ebitengine are written in a language called Kage, while researching I found this is controversial, but I don’t really care right now.

Its syntax is go with less features, some custom functions, and Fragment replaces main as the entry point. If you care enough, you can read about it here and here.

Without going into much detail, a Shader is a standalone program that gets compiled and passed to the GPU for execution. Most of the time it’s used to work with images, but it can be abused for other tasks.

For a shader that calculates the next grid position in a Game of Life I need to:

Encode the current grid in an image
Send the image to a shader
Decode the next grid from the shader output (which is an image)

Shader V1

For the first version I decided to create an image of the same amount of pixels as cells in the grid and encode if they are live or not in the Alpha channel of each pixel.

Code

The shader code looks like this in Kage:

//go:build ignore

//kage:unit pixels

package gol

func isAlive(pos vec2) float {
    color := imageSrc0At(pos)
    if color.a > 0 {  // check alpha channel
        return 1
    }
    return 0
}

func Fragment(_ vec4, srcPos vec2, _ vec4) vec4 {
    alive := 0.0
    pos := srcPos.xy

    deltas := [...]float{-1, 0, 1}

    for x := 0; x < len(deltas); x += 1 {
        for y := 0; y < len(deltas); y += 1 {
            if deltas[x] == 0 && deltas[y] == 0 {
                continue // skip self
            }
            alive += isAlive(pos + vec2(deltas[x], deltas[y]))
        }
    }

    status := imageSrc0At(pos).a // check alpha channel
    if alive < 2 {
        status = 0.0
    } else if alive > 3 {
        status = 0.0
    } else if alive == 3 {
        status = 1.0
    }

    return vec4(0, 0, 0, status)
}

I won’t dive deep into it, all you need to know is that the Fragment function recieves a single pixel, checks the neighboring pixels alpha channels to count the live ones, and returns the new status of this pixel encoded in the alpha channel of the output. imageSrc0At is a Kage built-in function that returns the color of a pixel in the given position.

The above shader is used in Go in the following way:

//go:embed shader.kage
var shaderProgram []byte
var shaderInstance *ebiten.Shader

func init() {
    // load the shader
    shader, err := ebiten.NewShader(shaderProgram)
    if err != nil {
        log.Fatal(err)
    }
    shaderInstance = shader
}

func NextGridShader(X, Y int, grid []bool) []bool {
    outImage := ebiten.NewImage(X, Y)
    gridImage := ebiten.NewImage(X, Y)

    // encode current grid
    for y := range Y {
        for x := range X {
            if grid[y*Y+x] {
                gridImage.Set(x, y, color.RGBA{A: 255})
            }
        }
    }

    // calculate with shader
    opts := &ebiten.DrawRectShaderOptions{}
    opts.Images[0] = gridImage
    outImage.DrawRectShader(X, Y, shaderInstance, opts)

    // decode shader output
    data := make([]byte, 4*Y*X)
    outImage.ReadPixels(data)
    nextGrid := make([]bool, X*Y)
    gix := 0
    for i := 0; i < 4*X*Y; i += 4 {
        alphaIx := i + 3 // value is encoded in alpha channel
        nextGrid[gix] = data[alphaIx] > 0
        gix++
    }

    return nextGrid
}

First we load the shader in an init function to ensure it only happens once.

Then the NextGridShader function calculates the next state of a grid with X columns and Y rows. The current grid is encoded in the alpha channel of gridImage, the image is passed to the shader, and the output that the shader writes to outImage is decoded by reading the alpha channel of each pixel.

Performance

Here are some stats from a micro-benchmark:

goos: windows
goarch: amd64
pkg: github.com/manuelpepe/gol/gol
cpu: AMD Ryzen 5 3600X 6-Core Processor
                                           │  bench.txt   │
                                           │    sec/op    │
NextGridShader/window-(H:10-W:10)-12         436.2µ ± 71%
NextGridShader/window-(H:50-W:50)-12         38.22m ±  6%
NextGridShader/window-(H:100-W:100)-12       36.33m ±  6%
NextGridShader/window-(H:1000-W:1000)-12     42.75m ±  6%
NextGridShader/window-(H:10000-W:10000)-12    1.051 ± 23%

NextGrid/window-(H:10-W:10)-12               3.242µ ±  0%
NextGrid/window-(H:50-W:50)-12               90.41µ ±  1%
NextGrid/window-(H:100-W:100)-12             374.2µ ±  2%
NextGrid/window-(H:1000-W:1000)-12           41.63m ±  0%
NextGrid/window-(H:10000-W:10000)-12          4.432 ±  1%

geomean                                      7.483m

For small grids, the shader is slower as many things have to happen to run a shader. Still, a speed-up can be seen on bigger sizes such as 1000x1000 and even more at 10000x10000. This doesn’t tell the whole story as there are more B/op and allocs/op in the shader version, but empirically it works well with my setup when running the game.

Shader V2

For the next version, I wanted to avoid wasting 3 bytes per pixel.

I wanted to go from something like this:

Where 3 bytes per pixel are unused, to something like this:

This would make better use of available memory, at the expense of complicating the shader code, as some checks will be in the same pixels and others in the adjacent ones. Here’s a visualization:

Now, checks in the X coordinate will imply both a channel switch, along with a possible pixel switch. Checks in the Y coordinate remain the same, it’s always ±W (width).

This also adds a limitation, in that the amount of columns must be divisible by 4 so we can always use complete pixels.

Code

The code for this shader is more complex than the last one:

package gol

// returns the 3 channels that must be checked when iterating in the X coordinate.
// RGBA = 0123
// R will check A at the left, G at the right and R in the middle
func sidesForChannel(ch int) [3]int {
    if ch == 0 {
        return [3]int{3, 0, 1}
    } else if ch == 1 {
        return [3]int{0, 1, 2}
    } else if ch == 2 {
        return [3]int{1, 2, 3}
    } else if ch == 3 {
        return [3]int{2, 3, 0}
    }
    return [3]int{5, 5, 5} // this should cause a panic if it happens
}

// returns wether a cell is alive
func isAlive(pos vec2, ch int) int {
    c := imageSrc0At(pos)
    if c[ch] > 0 {
        return 1
    }
    return 0
}

// xDir is the offset being applied to X, one of {-1, 0, 1}
// ch will be the RGBA channel being counted for, one of {0,1,2,3}
// returns the offset in the X coordinate to get the correct pixel.
// it’s used to move to either the left or right pixel when checking at the edges.
func getPacketOffsetForX(xDir int, ch int) float {
    if xDir == 1 && ch == 3 {
        return 1.0
    }
    if xDir == -1 && ch == 0 {
        return -1.0
    }
    return 0.0
}

// returns the count of alive neighbors
func countNeighs(pos vec2, chCur int) int {
    channels := sidesForChannel(chCur)
    deltas := [...]int{-1, 0, 1}
    alive := 0
    for y := 0; y < 3; y++ {
        yOffset := float(deltas[y])
        for x := 0; x < 3; x++ {
            if x == 1 && y == 1 {
                continue // skip self
            }
            xOffset := getPacketOffsetForX(deltas[x], chCur)
            offset := vec2(xOffset, yOffset)
            alive += isAlive(pos.xy+offset, channels[x])
        }
    }
    return alive
}

// returns a set of 4 horizontal cell positions that are either alive (255) or not (0)
func Fragment(_ vec4, pos vec2, _ vec4) vec4 {
    out := vec4(0, 0, 0, 0)
    for ch := 0; ch < 4; ch++ {  // for each RGBA
        alive := countNeighs(pos, ch)
        status := imageSrc0At(pos)[ch]
        if alive < 2 {
            status = 0.0
        } else if alive > 3 {
            status = 0.0
        } else if alive == 3 {
            status = 255
        }
        out[ch] = status
    }
    return out
}

And the code to run it:

// skipping embed and init() for simplicity

// buffer to avoid reallocating each iteration
var shaderOutputBuffer []byte
var shaderOutputBufferX, shaderOutputBufferY int

// in this version grid is modified
func NextGridShaderV2(X, Y int, grid []bool) []bool {
    if X%4 != 0 {
        panic("grid length must be divisible by 4 to use shader v2")
    }

    // create buffer if empty or outdated
    if shaderOutputBuffer == nil || X != shaderOutputBufferX || Y != shaderOutputBufferY {
        shaderOutputBuffer = make([]byte, X*Y)
        shaderOutputBufferX = X
        shaderOutputBufferY = Y
    }

    outImage := ebiten.NewImage(X/4, Y)
    gridImage := ebiten.NewImage(X/4, Y)

    // encode current grid
    for y := range Y {
        xPacketIx := 0
        xPacketColor := color.RGBA{}
        for x := range X {
            if grid[y*Y+x] {
                switch x % 4 {
                case 0:
                    xPacketColor.R = 255
                case 1:
                    xPacketColor.G = 255
                case 2:
                    xPacketColor.B = 255
                case 3:
                    xPacketColor.A = 255
                }
            }
            if x%4 == 3 {
                gridImage.Set(xPacketIx, y, xPacketColor)
                xPacketIx += 1
                xPacketColor = color.RGBA{}
            }
        }
    }

    // calculate with shader
    opts := &ebiten.DrawRectShaderOptions{}
    opts.Images[0] = gridImage
    outImage.DrawRectShader(X/4, Y, shaderV2Instance, opts)

    // decode shader output
    outImage.ReadPixels(shaderOutputBuffer)
    for ix, b := range shaderOutputBuffer {
        grid[ix] = b > 0
    }

    return grid
}

Performance

The results from this V2 where… surprisingly underwhelming. The performance was way worse than V1. So much worse in fact, that the 10000x10000 simulation timed out making it difficult to get an accurate benchmark. I reduced it to 3000x3000 for the following results:

goos: windows
goarch: amd64
pkg: github.com/manuelpepe/gol/gol
cpu: AMD Ryzen 5 3600X 6-Core Processor
                                           │ bench/v2-bench-2.txt │
                                           │        sec/op        │
NextGridShaderV2/window-(H:12-W:12)-12               6.226m ± 44%
NextGridShaderV2/window-(H:100-W:100)-12             12.39m ± 11%
NextGridShaderV2/window-(H:1000-W:1000)-12           57.82m ±  1%
NextGridShaderV2/window-(H:3000-W:3000)-12           408.8m ±  8%

NextGridShader/window-(H:12-W:12)-12                 7.116m ±  3%
NextGridShader/window-(H:100-W:100)-12               7.167m ±  2%
NextGridShader/window-(H:1000-W:1000)-12             11.65m ± 10%
NextGridShader/window-(H:3000-W:3000)-12             76.41m ± 34%

NextGrid/window-(H:12-W:12)-12                       4.812µ ±  1%
NextGrid/window-(H:100-W:100)-12                     381.3µ ±  3%
NextGrid/window-(H:1000-W:1000)-12                   41.83m ±  2%
NextGrid/window-(H:3000-W:3000)-12                   385.5m ±  1%

geomean                                              10.78m

Again, we are disregarding B/op and allocs/op.

Shader V2-fixed

I wasn’t sure what was happening. At first, I hypothesized that using a single GPU core to process 4 cells negated the concurrency benefits of using the GPU. In hindsight, this was a bad hypothesis.

I ran pprof over 10 seconds of execution for both shaders and noticed a difference:

The NextGridShaderV2 function is clearly visible in the second image, this is strange as the processing on the GPU doesn’t happen below this function, but rather a separate thread. Zooming into this function reveals the culprit:

Flamegraph ShadersV2 Zoom on NextGridShader

A lot of time is being spent in ebiten.(*Image).Set, with this information the fix was simple to find:

if x%4 == 3 {
-   gridImage.Set(xPacketIx, y, xPacketColor)
+   if xPacketColor != (color.RGBA{}) {
+       gridImage.Set(xPacketIx, y, xPacketColor)
+       xPacketColor = color.RGBA{}
+   }
    xPacketIx += 1
-   xPacketColor = color.RGBA{}
}

In this way, unnecessary calls to ebiten.(*Image).Set are avoided.

Performance

After the fix, the V2 version was indeed faster than V1

                                             │ bench/v2-bench-4.txt │
                                             │        sec/op        │
NextGridShaderV2/window-(H:12-W:12)-12                 581.7µ ± 40%
NextGridShaderV2/window-(H:100-W:100)-12               576.5µ ± 64%
NextGridShaderV2/window-(H:1000-W:1000)-12             3.245m ±  6%
NextGridShaderV2/window-(H:10000-W:10000)-12           423.9m ± 17%

NextGridShader/window-(H:12-W:12)-12                   2.312m ± 71%
NextGridShader/window-(H:100-W:100)-12                 959.9µ ± 43%
NextGridShader/window-(H:1000-W:1000)-12               9.435m ± 26%
NextGridShader/window-(H:10000-W:10000)-12              1.051 ± 17%

NextGrid/window-(H:12-W:12)-12                         4.832µ ±  4%
NextGrid/window-(H:100-W:100)-12                       378.8µ ±  2%
NextGrid/window-(H:1000-W:1000)-12                     42.24m ±  1%
NextGrid/window-(H:10000-W:10000)-12                    4.488 ±  1%

geomean                                                6.246m

Last Improvement

The last improvement I made was not necessarily about the shader, but how I handled the images being sent to it. As the docs specified, reusing images with .Clear() is more efficient than recreating them each frame.

I’ll leave the code to do this as an exercise to the reader (or go look at the repository linked at the end)

This change did increase the performance a fair bit, this is the last benchmark after all of this:

goos: windows
goarch: amd64
pkg: github.com/manuelpepe/gol/gol
cpu: AMD Ryzen 5 3600X 6-Core Processor
                                             │ bench/v2-bench-7.txt │
                                             │        sec/op        │
NextGridShaderV2/window-(H:12-W:12)-12                 249.9µ ± 12%
NextGridShaderV2/window-(H:100-W:100)-12               269.1µ ±  8%
NextGridShaderV2/window-(H:1000-W:1000)-12             2.971m ± 10%
NextGridShaderV2/window-(H:10000-W:10000)-12           216.2m ±  8%

NextGridShader/window-(H:12-W:12)-12                   191.8µ ±  5%
NextGridShader/window-(H:100-W:100)-12                 240.3µ ± 11%
NextGridShader/window-(H:1000-W:1000)-12               4.171m ± 14%
NextGridShader/window-(H:10000-W:10000)-12             513.9m ±  8%

NextGrid/window-(H:12-W:12)-12                         4.907µ ±  2%
NextGrid/window-(H:100-W:100)-12                       382.2µ ±  1%
NextGrid/window-(H:1000-W:1000)-12                     42.70m ±  1%
NextGrid/window-(H:10000-W:10000)-12                    4.543 ±  0%

geomean                                                3.280m

You can look at more bench data in the repository.

Conclusion

For me, this was a fun way to refresh some old concepts and introduce myself to shaders. Even if it’s not in the most widely used shader language, nor any complicated math was involved, I believe that learning the basic concepts and gaining the experience of optimizing something using the GPU is worth it by itself.

Nonetheless, it’s notable that Kage shaders are very difficult to debug. The tooling to debug them is pretty much zero as far as I know. I don’t know if this is the case with other shader languages.

You can look at all the code here: https://github.com/manuelpepe/gol