Game of Life Shader
Table of Contents
Introduction
In a previous post I shared a Game of Life I made to try out Ebitengine and WASM with Go. I decided to dive further by implementing a shader to make the game calculations. I’ll go through some iterations I went through.
CPU Version
This is the CPU-bound version I implemented the last time:
// NextGrid calculates the next grid state given a grid and its width (X) and height (Y)
func NextGrid(X, Y int, grid []bool) []bool {
if len(grid) != X*Y {
panic("grid doesn't match dimensions")
}
next := make([]bool, X*Y)
for p := 0; p < X*Y; p++ {
neighs := CalcNeighs(p, X, Y, grid)
next[p] = grid[p]
if neighs < 2 {
next[p] = false
} else if neighs > 3 {
next[p] = false
} else if neighs == 3 {
next[p] = true
}
}
return next
}
var deltas = [3]int{-1, 0, 1}
func CalcNeighs(p, X, Y int, grid []bool) int {
neighs := 0
for _, yd := range deltas {
if p < Y && yd == -1 {
continue // skip top
}
if p >= X*Y-X && yd == 1 {
continue // skip bottom
}
for _, xd := range deltas {
if p%X == 0 && xd == -1 {
continue // skip left
}
if p%X == X-1 && xd == 1 {
continue // skip right
}
neighIx := p + xd + yd*X
if neighIx < 0 || neighIx >= X*Y || neighIx == p {
continue
}
if grid[neighIx] {
neighs += 1
}
}
}
return neighs
}
Not a difficult problem, it’s not the best solution but it does the job. I decided to use []bool
instead of [][]bool
as the grid type, which probably made this code harder to follow for all the index to coordinate conversions.
Idea
I’ve made some games a few years ago with different tools, and I wanted to revisit some of the concepts so I decided to add the following features to the game:
- A bigger grid
- The ability to zoom in and out
- The ability to move around the grid
I coded some features like a camera position that offsets the grid at draw time, a scaling factor to zoom in and out, and even some on-screen detection to avoid drawing stuff the camera isn’t looking at.
With the CPU processing of the grid I couldn’t scale the grid very much, so I decided to play with shaders for the first time.
Introduction to Shaders
Shaders for Ebitengine are written in a language called Kage, while researching I found this is controversial, but I don’t really care right now.
Its syntax is go with less features, some custom functions, and Fragment
replaces main
as the entry point. If you care enough, you can read about it here and here.
Without going into much detail, a Shader is a standalone program that gets compiled and passed to the GPU for execution. Most of the time it’s used to work with images, but it can be abused for other tasks.
For a shader that calculates the next grid position in a Game of Life I need to:
- Encode the current grid in an image
- Send the image to a shader
- Decode the next grid from the shader output (which is an image)
Shader V1
For the first version I decided to create an image of the same amount of pixels as cells in the grid and encode if they are live or not in the Alpha channel of each pixel.
Code
The shader code looks like this in Kage:
//go:build ignore
//kage:unit pixels
package gol
func isAlive(pos vec2) float {
color := imageSrc0At(pos)
if color.a > 0 { // check alpha channel
return 1
}
return 0
}
func Fragment(_ vec4, srcPos vec2, _ vec4) vec4 {
alive := 0.0
pos := srcPos.xy
deltas := [...]float{-1, 0, 1}
for x := 0; x < len(deltas); x += 1 {
for y := 0; y < len(deltas); y += 1 {
if deltas[x] == 0 && deltas[y] == 0 {
continue // skip self
}
alive += isAlive(pos + vec2(deltas[x], deltas[y]))
}
}
status := imageSrc0At(pos).a // check alpha channel
if alive < 2 {
status = 0.0
} else if alive > 3 {
status = 0.0
} else if alive == 3 {
status = 1.0
}
return vec4(0, 0, 0, status)
}
I won’t dive deep into it, all you need to know is that the Fragment function recieves a single pixel, checks the neighboring pixels alpha channels to count the live ones, and returns the new status of this pixel encoded in the alpha channel of the output. imageSrc0At
is a Kage built-in function that returns the color of a pixel in the given position.
The above shader is used in Go in the following way:
//go:embed shader.kage
var shaderProgram []byte
var shaderInstance *ebiten.Shader
func init() {
// load the shader
shader, err := ebiten.NewShader(shaderProgram)
if err != nil {
log.Fatal(err)
}
shaderInstance = shader
}
func NextGridShader(X, Y int, grid []bool) []bool {
outImage := ebiten.NewImage(X, Y)
gridImage := ebiten.NewImage(X, Y)
// encode current grid
for y := range Y {
for x := range X {
if grid[y*Y+x] {
gridImage.Set(x, y, color.RGBA{A: 255})
}
}
}
// calculate with shader
opts := &ebiten.DrawRectShaderOptions{}
opts.Images[0] = gridImage
outImage.DrawRectShader(X, Y, shaderInstance, opts)
// decode shader output
data := make([]byte, 4*Y*X)
outImage.ReadPixels(data)
nextGrid := make([]bool, X*Y)
gix := 0
for i := 0; i < 4*X*Y; i += 4 {
alphaIx := i + 3 // value is encoded in alpha channel
nextGrid[gix] = data[alphaIx] > 0
gix++
}
return nextGrid
}
First we load the shader in an init
function to ensure it only happens once.
Then the NextGridShader
function calculates the next state of a grid
with X
columns and Y
rows.
The current grid
is encoded in the alpha channel of gridImage
, the image is passed to the shader, and the output
that the shader writes to outImage
is decoded by reading the alpha channel of each pixel.
Performance
Here are some stats from a micro-benchmark:
goos: windows
goarch: amd64
pkg: github.com/manuelpepe/gol/gol
cpu: AMD Ryzen 5 3600X 6-Core Processor
│ bench.txt │
│ sec/op │
NextGridShader/window-(H:10-W:10)-12 436.2µ ± 71%
NextGridShader/window-(H:50-W:50)-12 38.22m ± 6%
NextGridShader/window-(H:100-W:100)-12 36.33m ± 6%
NextGridShader/window-(H:1000-W:1000)-12 42.75m ± 6%
NextGridShader/window-(H:10000-W:10000)-12 1.051 ± 23%
NextGrid/window-(H:10-W:10)-12 3.242µ ± 0%
NextGrid/window-(H:50-W:50)-12 90.41µ ± 1%
NextGrid/window-(H:100-W:100)-12 374.2µ ± 2%
NextGrid/window-(H:1000-W:1000)-12 41.63m ± 0%
NextGrid/window-(H:10000-W:10000)-12 4.432 ± 1%
geomean 7.483m
For small grids, the shader is slower as many things have to happen to run a shader. Still, a speed-up can be seen on bigger sizes such as 1000x1000 and even more at 10000x10000. This doesn’t tell the whole story as there are more B/op and allocs/op in the shader version, but empirically it works well with my setup when running the game.
Shader V2
For the next version, I wanted to avoid wasting 3 bytes per pixel.
I wanted to go from something like this:
Where 3 bytes per pixel are unused, to something like this:
This would make better use of available memory, at the expense of complicating the shader code, as some checks will be in the same pixels and others in the adjacent ones. Here’s a visualization:
Now, checks in the X coordinate will imply both a channel switch, along with a possible pixel switch. Checks in the Y coordinate remain the same, it’s always ±W (width).
This also adds a limitation, in that the amount of columns must be divisible by 4 so we can always use complete pixels.
Code
The code for this shader is more complex than the last one:
package gol
// returns the 3 channels that must be checked when iterating in the X coordinate.
// RGBA = 0123
// R will check A at the left, G at the right and R in the middle
func sidesForChannel(ch int) [3]int {
if ch == 0 {
return [3]int{3, 0, 1}
} else if ch == 1 {
return [3]int{0, 1, 2}
} else if ch == 2 {
return [3]int{1, 2, 3}
} else if ch == 3 {
return [3]int{2, 3, 0}
}
return [3]int{5, 5, 5} // this should cause a panic if it happens
}
// returns wether a cell is alive
func isAlive(pos vec2, ch int) int {
c := imageSrc0At(pos)
if c[ch] > 0 {
return 1
}
return 0
}
// xDir is the offset being applied to X, one of {-1, 0, 1}
// ch will be the RGBA channel being counted for, one of {0,1,2,3}
// returns the offset in the X coordinate to get the correct pixel.
// it’s used to move to either the left or right pixel when checking at the edges.
func getPacketOffsetForX(xDir int, ch int) float {
if xDir == 1 && ch == 3 {
return 1.0
}
if xDir == -1 && ch == 0 {
return -1.0
}
return 0.0
}
// returns the count of alive neighbors
func countNeighs(pos vec2, chCur int) int {
channels := sidesForChannel(chCur)
deltas := [...]int{-1, 0, 1}
alive := 0
for y := 0; y < 3; y++ {
yOffset := float(deltas[y])
for x := 0; x < 3; x++ {
if x == 1 && y == 1 {
continue // skip self
}
xOffset := getPacketOffsetForX(deltas[x], chCur)
offset := vec2(xOffset, yOffset)
alive += isAlive(pos.xy+offset, channels[x])
}
}
return alive
}
// returns a set of 4 horizontal cell positions that are either alive (255) or not (0)
func Fragment(_ vec4, pos vec2, _ vec4) vec4 {
out := vec4(0, 0, 0, 0)
for ch := 0; ch < 4; ch++ { // for each RGBA
alive := countNeighs(pos, ch)
status := imageSrc0At(pos)[ch]
if alive < 2 {
status = 0.0
} else if alive > 3 {
status = 0.0
} else if alive == 3 {
status = 255
}
out[ch] = status
}
return out
}
And the code to run it:
// skipping embed and init() for simplicity
// buffer to avoid reallocating each iteration
var shaderOutputBuffer []byte
var shaderOutputBufferX, shaderOutputBufferY int
// in this version grid is modified
func NextGridShaderV2(X, Y int, grid []bool) []bool {
if X%4 != 0 {
panic("grid length must be divisible by 4 to use shader v2")
}
// create buffer if empty or outdated
if shaderOutputBuffer == nil || X != shaderOutputBufferX || Y != shaderOutputBufferY {
shaderOutputBuffer = make([]byte, X*Y)
shaderOutputBufferX = X
shaderOutputBufferY = Y
}
outImage := ebiten.NewImage(X/4, Y)
gridImage := ebiten.NewImage(X/4, Y)
// encode current grid
for y := range Y {
xPacketIx := 0
xPacketColor := color.RGBA{}
for x := range X {
if grid[y*Y+x] {
switch x % 4 {
case 0:
xPacketColor.R = 255
case 1:
xPacketColor.G = 255
case 2:
xPacketColor.B = 255
case 3:
xPacketColor.A = 255
}
}
if x%4 == 3 {
gridImage.Set(xPacketIx, y, xPacketColor)
xPacketIx += 1
xPacketColor = color.RGBA{}
}
}
}
// calculate with shader
opts := &ebiten.DrawRectShaderOptions{}
opts.Images[0] = gridImage
outImage.DrawRectShader(X/4, Y, shaderV2Instance, opts)
// decode shader output
outImage.ReadPixels(shaderOutputBuffer)
for ix, b := range shaderOutputBuffer {
grid[ix] = b > 0
}
return grid
}
Performance
The results from this V2 where… surprisingly underwhelming. The performance was way worse than V1. So much worse in fact, that the 10000x10000 simulation timed out making it difficult to get an accurate benchmark. I reduced it to 3000x3000 for the following results:
goos: windows
goarch: amd64
pkg: github.com/manuelpepe/gol/gol
cpu: AMD Ryzen 5 3600X 6-Core Processor
│ bench/v2-bench-2.txt │
│ sec/op │
NextGridShaderV2/window-(H:12-W:12)-12 6.226m ± 44%
NextGridShaderV2/window-(H:100-W:100)-12 12.39m ± 11%
NextGridShaderV2/window-(H:1000-W:1000)-12 57.82m ± 1%
NextGridShaderV2/window-(H:3000-W:3000)-12 408.8m ± 8%
NextGridShader/window-(H:12-W:12)-12 7.116m ± 3%
NextGridShader/window-(H:100-W:100)-12 7.167m ± 2%
NextGridShader/window-(H:1000-W:1000)-12 11.65m ± 10%
NextGridShader/window-(H:3000-W:3000)-12 76.41m ± 34%
NextGrid/window-(H:12-W:12)-12 4.812µ ± 1%
NextGrid/window-(H:100-W:100)-12 381.3µ ± 3%
NextGrid/window-(H:1000-W:1000)-12 41.83m ± 2%
NextGrid/window-(H:3000-W:3000)-12 385.5m ± 1%
geomean 10.78m
Again, we are disregarding B/op and allocs/op.
Shader V2-fixed
I wasn’t sure what was happening. At first, I hypothesized that using a single GPU core to process 4 cells negated the concurrency benefits of using the GPU. In hindsight, this was a bad hypothesis.
I ran pprof over 10 seconds of execution for both shaders and noticed a difference:
The NextGridShaderV2
function is clearly visible in the second image, this is strange as the processing on the GPU doesn’t happen below this function, but rather a separate thread. Zooming into this function reveals the culprit:
A lot of time is being spent in ebiten.(*Image).Set
, with this information the fix was simple to find:
if x%4 == 3 {
- gridImage.Set(xPacketIx, y, xPacketColor)
+ if xPacketColor != (color.RGBA{}) {
+ gridImage.Set(xPacketIx, y, xPacketColor)
+ xPacketColor = color.RGBA{}
+ }
xPacketIx += 1
- xPacketColor = color.RGBA{}
}
In this way, unnecessary calls to ebiten.(*Image).Set
are avoided.
Performance
After the fix, the V2 version was indeed faster than V1
│ bench/v2-bench-4.txt │
│ sec/op │
NextGridShaderV2/window-(H:12-W:12)-12 581.7µ ± 40%
NextGridShaderV2/window-(H:100-W:100)-12 576.5µ ± 64%
NextGridShaderV2/window-(H:1000-W:1000)-12 3.245m ± 6%
NextGridShaderV2/window-(H:10000-W:10000)-12 423.9m ± 17%
NextGridShader/window-(H:12-W:12)-12 2.312m ± 71%
NextGridShader/window-(H:100-W:100)-12 959.9µ ± 43%
NextGridShader/window-(H:1000-W:1000)-12 9.435m ± 26%
NextGridShader/window-(H:10000-W:10000)-12 1.051 ± 17%
NextGrid/window-(H:12-W:12)-12 4.832µ ± 4%
NextGrid/window-(H:100-W:100)-12 378.8µ ± 2%
NextGrid/window-(H:1000-W:1000)-12 42.24m ± 1%
NextGrid/window-(H:10000-W:10000)-12 4.488 ± 1%
geomean 6.246m
Last Improvement
The last improvement I made was not necessarily about the shader, but how I handled the images being sent to it.
As the docs specified, reusing images with .Clear()
is more efficient than recreating them each frame.
I’ll leave the code to do this as an exercise to the reader (or go look at the repository linked at the end)
This change did increase the performance a fair bit, this is the last benchmark after all of this:
goos: windows
goarch: amd64
pkg: github.com/manuelpepe/gol/gol
cpu: AMD Ryzen 5 3600X 6-Core Processor
│ bench/v2-bench-7.txt │
│ sec/op │
NextGridShaderV2/window-(H:12-W:12)-12 249.9µ ± 12%
NextGridShaderV2/window-(H:100-W:100)-12 269.1µ ± 8%
NextGridShaderV2/window-(H:1000-W:1000)-12 2.971m ± 10%
NextGridShaderV2/window-(H:10000-W:10000)-12 216.2m ± 8%
NextGridShader/window-(H:12-W:12)-12 191.8µ ± 5%
NextGridShader/window-(H:100-W:100)-12 240.3µ ± 11%
NextGridShader/window-(H:1000-W:1000)-12 4.171m ± 14%
NextGridShader/window-(H:10000-W:10000)-12 513.9m ± 8%
NextGrid/window-(H:12-W:12)-12 4.907µ ± 2%
NextGrid/window-(H:100-W:100)-12 382.2µ ± 1%
NextGrid/window-(H:1000-W:1000)-12 42.70m ± 1%
NextGrid/window-(H:10000-W:10000)-12 4.543 ± 0%
geomean 3.280m
You can look at more bench data in the repository.
Conclusion
For me, this was a fun way to refresh some old concepts and introduce myself to shaders. Even if it’s not in the most widely used shader language, nor any complicated math was involved, I believe that learning the basic concepts and gaining the experience of optimizing something using the GPU is worth it by itself.
Nonetheless, it’s notable that Kage shaders are very difficult to debug. The tooling to debug them is pretty much zero as far as I know. I don’t know if this is the case with other shader languages.
You can look at all the code here: https://github.com/manuelpepe/gol