Table of Contents

Introduction

Recently I’ve been implementing a card game some friends showed me in Go as a side-project. In this journey I ended up implementing Bots of various levels of difficulty and a simulation package to make them play against each other and validate that they indeed play better (or at least win more games) than their previous levels. The performance of the simulation was improved from roughly 2000 games in 30s to 10000 games in 3s.

The game is called Tincho and if you wish to play it’s available (with low resources) at tincho.manuelpepe.com

In this entry I’ll showcase some iterations I went through improving the performance of the simulation routine. I won’t show the full implementation to avoid explaining a lot of useless context, but we’ll go through the main ideas. At the end I’ll add the results of various simulations performed on the existing bots.

One game

I started simply with a Result type and a Play function that can match just two bot strategies against each other:

type Result struct {
    Winner      int  // index of the winning strategy
    TotalRounds int
    TotalTurns  int
}

func Play(ctx context.Context, strat, strat2 bots.Strategy) (Result, error) {
    // setup to start game
    return Result{ /* ... */ }, nil
}

This allowed me to code a simple test like:

func TestEasyVsMedium(t *testing.T) {
    winsForMedium := 0
    for i := 0; i < 100; i++ {
        res, err := Play(context.Background(), bots.NewEasyStrategy(), bots.NewMediumStrategy())
        assert.NoError(t, err)
        winsForMedium += res.Winner
    }
    // medium should win 80% of the time
    assert.GreaterOrEqual(t, winsForMedium, 80)
}

Multiple games

I didn’t want to write the loop each time, also I need some kind of summary of the data to avoid scrolling through hundreds of games.

So I added the Compete function to play a bunch of games, returning all their results:

// `func() bots.Strategy` is used to build a clean strategy each game instead of 
// reusing the same struct as they most likely hold game state.
func Compete(ctx context.Context, iters int, strat, strat2 func() bots.Strategy) ([]Result, error) {
    results := make([]Result, iters)

    for i := 0; i < iters; i++ {
        result, err := Play(ctx, strat(), strat2())
        if err != nil {
            return nil, fmt.Errorf("error on round %d: %w", i, err)
        }
        results[i] = result
    }

    return results, nil
}

And a separate function that summarizes this list of results into a Summary struct:

type Summary struct {
    Strat1Summary StratSummary
    Strat2Summary StratSummary

    TotalRounds int
    TotalTurns  int
}

type StratSummary struct {
    Wins   int
    Rounds MinMaxMean
    Turns  MinMaxMean
}

type MinMaxMean struct {
    Min  int
    Max  int
    Mean int
    Sum  int
}

func Summarize(results []Result) Summary {
    // a big loop aggregating the results
    return Summary{ /* ... */ }
}

Which allows me to write tests like:

func TestEasyVsMediumCompeteSummary(t *testing.T) {
    easy := func() bots.Strategy {
        return bots.NewEasyStrategy()
    }
    medium := func() bots.Strategy {
        return bots.NewMediumStrategy()
    }

    res, err := Compete(context.Background(), 1000, easy, medium)
    assert.NoError(t, err)

    summary := Summarize(res)
    fmt.Printf("Summary: %+v\n", summary)
}

At this point, running 2000 iterations (games) takes about 30 seconds. It’s not that bad, but it got better.

Concurrency

You may notice that the games are being run sequentially. For the next iteration, I introduced concurrency with goroutines:

func Compete(ctx context.Context, iters int, strat, strat2 func() bots.Strategy) ([]Result, error) {
    ctx, cancelPendingGames := context.WithCancel(ctx)

    outs := make(chan Result)
    errs := make(chan error)

    // start every game in a different goroutine
    for i := 0; i < iters; i++ {
        go func() {
            result, err := Play(ctx, strat(), strat2())
            if err != nil {
                select {
                case <-ctx.Done():
                    return
                case errs <- fmt.Errorf("error on round: %w", err):
                    return
                }
            }

            select {
            case <-ctx.Done():
                return
            case outs <- result:
                return
            }
        }()
    }

    // collect all results
    results := make([]Result, iters)
    for i := 0; i < iters; i++ {
        select {
        case result := <-outs:
            results[i] = result
        case err := <-errs:
            cancelPendingGames()
            return nil, err
        }
    }

	cancelPendingGames()
    return results, nil
}

The test code doesn’t change, but we can now run 2000 games in 10 seconds from 30 seconds previously.

More concurrency

The next thing I went for was the Summary processing, it happens at the end of the routine but may very well be calculated as we gather the results. This also has the side-effect of reducing the memory footprint of our program, as we don’t need to keep the results for N games in memory at the same time.

So I removed the Summarize() function and modified the way results are gathered in Compete():

func Compete(ctx context.Context, iters int, strat, strat2 func() bots.Strategy) (Summary, error) {
    ctx, cancelPendingGames := context.WithCancel(ctx)

    outs := make(chan Result)
    errs := make(chan error)

    // start every game in a different goroutine
    for i := 0; i < iters; i++ {
        // this code is the same as above
    }

    // collect all results in a goroutine and summarize them
    var finalResChan = make(chan Summary)
    var finalErrChan = make(chan error)
    go func() {
        summary := Summary{}
        for i := 0; i < iters; i++ {
            select {
            case result := <-outs:
                // update summary
            case err := <-errs:
                finalErrChan <- err
                return 
            }
        }
        finalResChan <- summary
    }()

    // wait for the final result
    select {
    case res := <-finalResChan:
        cancelPendingGames()
        return res, nil
    case err := <-finalErrChan:
        cancelPendingGames()
        return OldSummary{}, err
    }
}

This change doesn’t increase the time performance that much, still at rougly 2000 games per 10 seconds, but it should decrease the memory consumption (I really haven’t measured this yet)

Limiting concurrency

Now, goroutines are cheap right? Yes, but they are not free. At this point, all goroutines are started at the same time which at around 10k iterations in my computer, results in an OOM (Out Of Memory) kill of the process.

To fix this, I introduced a limit to the amount of workers that can be spawned on Compete():

func Compete(ctx context.Context, iters int, strat, strat2 func() bots.Strategy) (Summary, error) {
    ctx, cancelPendingGames := context.WithCancel(ctx)

    outs := make(chan Result)
    errs := make(chan error)

    // start worker goroutines
    pending := make(chan struct{})
    workerCount := min(iters, 10000)
    for i := 0; i < workerCount; i++ {
        go func() {
            for {
                select {
                case <-ctx.Done():
                    return
                case <-pending:
                    result, err := Play(ctx, strat(), strat2())
                    if err != nil {
                        select {
                        case <-ctx.Done():
                            return
                        case errs <- fmt.Errorf("error on round: %w", err):
                        }
                    }

                    select {
                    case <-ctx.Done():
                        return
                    case outs <- result:
                    }

                }
            }
        }()
    }

    // start queueing goroutine
    go func() {
        for i := 0; i < rounds; i++ {
            select {
            case <-ctx.Done():
                return
            case pending <- struct{}{}:
            }
        }
    }()

    // the remaining code is the same as above for summarizing and returning the results
}

With the last change, we limited the amount of worker goroutines to an arbitrary max of 10000 goroutines, and made them long lived by looping and consuming from a pending channel that is being filled by a separate queueing goroutine.

This last change allows us to run simulations of any number of games without fear of being OOM killed.

Analysis with pprof

Now the modifications of the Compete procedure stop. The last two modifications did not yield significant improvements specifically on performance. The next step was to dig through the application logic for inefficiencies outside of the simulation routine. This falls out of scope of this entry, but I’ll mention a few more things.

The first step to optimize the application was to run go tool pprof while running a simulation of a fair number of games. In this case, a lot of time was being occupied by json.Unmarshall given how I initially structured some messages with json.RawMessage. I replaced that with an implementation that uses generics for an extra performance boost. I go into a bit more detail in this other entry if you’re interested. This in fact yielded a big performance boost.

Testing for leaks and races

When working with goroutines, remember to always run go test with the -race parameter to check for data races. They are easy to introduce and could cause subtile bugs.

I also introduced uber-go/goleak to my tests to check for goroutine leaks.

Running the simulation outside of `go test`

Going back to the first iteration, I started with one match between two bots and wrote a test with a loop in it. From that point on I moved the loop to the library implementation, but kept running the simulations inside test functions.

When using go test the execution of any program will be a lot slower than it would be with go run, as go has to keep track of a lot more things, mostly to be able to work with debuggers.

If you’r running simulations from test functions, moving them to their own package that can be compiled and run without the test suite on will drastically improve performance.

Results

Finally, here are the results of simulating multiple matches agains the three bot difficulties available at tincho.manuelpepe.com:

# E = Easy
# M = Medium
# H = Hard

$ go run cmd/sim/main.go --all
=== RUN: EvE
0: {Wins:5036 Rounds:{Min:2 Max:8 Mean:3 Sum:19291} Turns:{Min:1 Max:392 Mean:73 Sum:369592}}
1: {Wins:4964 Rounds:{Min:2 Max:8 Mean:3 Sum:18792} Turns:{Min:1 Max:279 Mean:71 Sum:356501}}
Total Games: 10000
Total Rounds: {Min:2 Max:8 Mean:3 Sum:38083}
Total Turns: {Min:1 Max:392 Mean:72 Sum:726093}
--- OK (3.622077403s)

=== RUN: EvM
0: {Wins:731 Rounds:{Min:2 Max:7 Mean:4 Sum:2979} Turns:{Min:1 Max:127 Mean:36 Sum:26643}}
1: {Wins:9269 Rounds:{Min:2 Max:8 Mean:3 Sum:36684} Turns:{Min:0 Max:174 Mean:51 Sum:480269}}
Total Games: 10000
Total Rounds: {Min:2 Max:8 Mean:3 Sum:39663}
Total Turns: {Min:0 Max:174 Mean:50 Sum:506912}
--- OK (2.637042901s)

=== RUN: EvH
0: {Wins:6 Rounds:{Min:4 Max:7 Mean:4 Sum:29} Turns:{Min:10 Max:51 Mean:28 Sum:171}}
1: {Wins:9994 Rounds:{Min:2 Max:9 Mean:4 Sum:40156} Turns:{Min:3 Max:125 Mean:52 Sum:520547}}
Total Games: 10000
Total Rounds: {Min:2 Max:9 Mean:4 Sum:40185}
Total Turns: {Min:3 Max:125 Mean:52 Sum:520718}
--- OK (2.317444913s)

=== RUN: MvM
0: {Wins:5170 Rounds:{Min:2 Max:10 Mean:4 Sum:25649} Turns:{Min:1 Max:206 Mean:58 Sum:303541}}
1: {Wins:4830 Rounds:{Min:2 Max:10 Mean:5 Sum:24202} Turns:{Min:0 Max:235 Mean:60 Sum:290082}}
Total Games: 10000
Total Rounds: {Min:2 Max:10 Mean:4 Sum:49851}
Total Turns: {Min:0 Max:235 Mean:59 Sum:593623}
--- OK (2.937989917s)

=== RUN: MvH
0: {Wins:127 Rounds:{Min:3 Max:12 Mean:6 Sum:877} Turns:{Min:5 Max:157 Mean:64 Sum:8243}}
1: {Wins:9873 Rounds:{Min:2 Max:14 Mean:5 Sum:53097} Turns:{Min:1 Max:191 Mean:70 Sum:698592}}
Total Games: 10000
Total Rounds: {Min:2 Max:14 Mean:5 Sum:53974}
Total Turns: {Min:1 Max:191 Mean:70 Sum:706835}
--- OK (3.042727078s)

=== RUN: EvMvH
0: {Wins:165 Rounds:{Min:2 Max:6 Mean:3 Sum:582} Turns:{Min:1 Max:95 Mean:29 Sum:4948}}
1: {Wins:1373 Rounds:{Min:2 Max:6 Mean:3 Sum:5063} Turns:{Min:1 Max:138 Mean:46 Sum:64351}}
2: {Wins:8462 Rounds:{Min:2 Max:7 Mean:3 Sum:32076} Turns:{Min:1 Max:152 Mean:59 Sum:507158}}
Total Games: 10000
Total Rounds: {Min:2 Max:7 Mean:3 Sum:37721}
Total Turns: {Min:1 Max:152 Mean:57 Sum:576457}
--- OK (3.137018959s)

As you can see, after all optimizations we got it down from 30s for 2k to 3s for 10k.

You might also notice that there is no HvH simulation. Currently, when I match two Hard bots against each other, there is a good chance that they get into a loop of winning one -10 round each with low score added to the other’s bot hand, eventually timing out the simulation as the game never ends.

As a last point, I went back and made the strategy inputs variadic to allow for simulations with more than two bots, that is the EvMvH simulation.

This was a fun journey and I may come back to this at some point.

Thank you for reading, hope you found something interesting here, and let me know if you have any comments.